The 6th Workshop on Open-Source Arabic Corpora and Processing Tools (Hybrid)
with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation
Lingotto Conference Centre, Torino, Italy.
20-25 May 2024.
Co-located with LREC-COLING 2024
In the computational linguistics (CL), natural language processing (NLP), and information retrieval (IR) communities, Arabic is considered relatively resource-poor compared to English. This situation was thought to be the reason for the limited number of language-resource-based studies in Arabic. However, the past few years have witnessed the emergence of new, considerably large, and free Classical Arabic and Modern Standard Arabic (MSA) corpora, as well as dialectal corpora and, to a lesser extent, Arabic processing tools.
This workshop follows in the footsteps of previous editions of OSACT in providing a forum for researchers to share and discuss their ongoing work. The workshop is timely given the continued rise in research projects focusing on Arabic language resources. The sixth edition aims to encourage researchers and practitioners of Arabic language technologies, spanning CL, NLP, and IR, to share and discuss their latest research efforts, corpora, and tools. The workshop will also give special attention to Large Language Models (LLMs) and Generative AI, a particularly active research area at present. In addition to the general topics of CL, NLP, and IR, the workshop will place special emphasis on two shared tasks: Arabic LLMs Hallucination and Dialect to MSA Machine Translation.
OSACT6 invites high-quality submissions written in English. Two forms of papers will be considered: long papers of up to eight (8) pages of content and short papers of up to four (4) pages of content, excluding any number of additional pages for references, ethical considerations, conflict-of-interest, as well as data
and code availability statements. Upon acceptance, final versions of long papers will be given one additional page – up to nine (9) pages of content plus
unlimited pages for acknowledgments and references – so that reviewers’ comments can be taken into account. Final versions
of short papers may have up to five (5) pages, plus unlimited pages for acknowledgments and references. For both long and
short papers, all figures and tables that are part of the main text must fit within these page limits. Appendices or supplementary material are allowed ONLY in the final, camera-ready version, not
during submission, as papers should be reviewed without the need to refer to any supplementary materials. Linguistic examples, if any, should be presented in the original language but also glossed into English to allow
accessibility for a broader audience. Note that decisions on paper type are made orthogonally to the eventual, final form of presentation (i.e., oral versus
poster).
Submission due: March 1, 2024
Notification of acceptance: March 25, 2024
Camera-ready papers due: March 30, 2024
Workshop date: May 25, 2024
The language of the workshop is English, and submissions should follow the LREC-COLING 2024 paper submission instructions (https://lrec-coling-2024.org/authors-kit/). All papers will be peer reviewed, possibly by three independent referees. Papers must be submitted electronically in PDF format via the START system.
When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research.
Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and replicability of experiments (including evaluation ones).
Submission Link: https://softconf.com/lrec-coling2024/osact2024/
1. Arabic Speech Recognition of zero-resourced Languages: A case of Shehri (Jibbali) Language
Norah Alrashoudi, Omar Said Alshahri and Hend Al-Khalifa
2. AraTar: A Corpus to Support the Fine-grained Detection of Hate Speech Targets in the Arabic Language
Seham Alghamdi, Youcef Benkhedda, Basma Alharbi and Riza Batista-Navarro
3. A Novel Approach for Root Selection in Dependency Parsing
Sharefah Al-Ghamdi, Hend Al-Khalifa and Abdulmalik AlSalman
4. Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
Saied Alshahrani, Hesham Mohammed, Ali Elfilali, Mariama Njie and Jeanna Matthews
5. CLEANANERCorp: Identifying and Correcting Incorrect Labels in the ANERcorp Dataset
Mashael AlDuwais, Hend Al-Khalifa and Abdulmalik AlSalman
6. Munazarat 1.0: A Corpus of Arabic Competitive Debates
Mohammad Khader, AbdulGabbar Al-Sharafi, Mohamad Hamza Al-Sioufy, Wajdi Zaghouani and Ali Al-Zawqari
7. TafsirExtractor: Text Preprocessing Pipeline preparing Classical Arabic Literature for Machine Learning Applications
Carl Kruse and Sajawel Ahmed
8. The Multilingual Corpus of World's Constitutions (MCWC)
Mo El-Haj and Saad Ezzini
9. AraMed: Arabic Medical Question Answering using Pretrained Transformer Language Models
Ashwag Alasmari, Sarah Alhumoud and Waad Alshammari
10. Advancing the Arabic WordNet: Elevating Content Quality
Abed Alhakim Freihat, Hadi Khalilia, Gábor Bella and Fausto Giunchiglia
1. Sirius_Translators at OSACT6 2024 Shared Task: Fine-tuning AraT5 Models for Translating Arabic Dialectal Text to Modern Standard Arabic
Salwa Alahmari
2. OSACT 2024 Task 2: Arabic Dialect to MSA Translation
Hanin Atwany, Nour Rabih, Ibrahim Mohammed, Abdul Waheed and Bhiksha Raj
3. ASOS at OSACT6 Shared Task: Investigation of Data Augmentation in Arabic Dialect-MSA Translation
Omer Nacar, Abdullah Alharbi, Serry Sibaee, Samar Ahmed, Lahouari Ghouti and Anis Koubaa
4. LLM-based MT Data Creation: Dialectal to MSA Translation Shared Task
AhmedElmogtaba Abdelaziz, Ashraf Elneima and Kareem Darwish
5. AraT5-MSAizer: Translating Dialectal Arabic to MSA
Murhaf Fares
6. ASOS at Arabic LLMs Hallucinations 2024: Can LLMs detect their Hallucinations :)
Serry Sibaee, Abdullah I. Alharbi, Samar Ahmed, Omar Nacar, Lahouari Ghouti and Anis Koubaa
Muhammad Abdul-Mageed, University of British Columbia, Canada
Title:
Towards Arab-Centric Large Language Models
Abstract:
The landscape of large language models (LLMs) is rapidly evolving, yet it continues to face significant challenges in computational efficiency and energy usage. Arabic-centric LLMs are particularly fraught with issues such as inadequate evaluations, cultural insensitivity, insufficient representation of the wide array of Arabic dialects, an absence of multimodal capabilities, designs that are too generic for specialized domains, and a disconnection from other low-resource languages. These problems are compounded by a general lack of detailed knowledge about the Arabic capabilities of existing LLMs. In this talk, we address these challenges by developing a host of models capable of understanding and generating content in a broad spectrum of Arabic languages and dialects. In particular, we present a suite of generative models tailored for text, speech, and image generation, designed to support and enhance the representation of Arabic in several domains. Our approach leverages cutting-edge machine learning methods and large-scale, diverse datasets to ensure our models achieve both high accuracy and cultural relevance. By focusing on critical areas such as archival work, cultural heritage and preservation, financial services, healthcare delivery, and education, our work aims to bridge the linguistic digital divide and foster equitable AI benefits. We discuss the methods we employ, the challenges we encounter, the solutions we propose, and the broader implications of our efforts.
Muhammad Abdul-Mageed is a Canada Research Chair in Natural Language Processing and Machine Learning, and Associate Professor with appointments in the School of Information, and the Departments of Linguistics and Computer Science at The University of British Columbia. He is also a Visiting Associate Professor at MBZUAI. His research is in deep learning and natural language processing, focusing on large language models in multilingual contexts, with a goal to innovate more equitable, efficient, and ‘social’ machines for improved human health, more engaging learning, safer social networking, and reduced information overload. Applications of his work span a wide range of areas across speech, language, and vision. He is director of the UBC Deep Learning & NLP Group, co-director of the SSHRC-funded I Trust Artificial Intelligence, and co-lead of the Ensuring Full Literacy Partnership. He is a founding member of the UBC Center for Artificial Intelligence Decision making and Action and a member of the Institute for Computing, Information, and Cognitive Systems. His work has been supported by Google, AMD, Amazon, Natural Sciences and Engineering Research Council of Canada, Social Sciences and Humanities Research Council of Canada, Canada Foundation for Innovation, and Digital Research Alliance of Canada.
Saturday 25 May 2024