Workshop Description


In the computational linguistics (CL), natural language processing (NLP), and information retrieval (IR) communities, Arabic is considered relatively resource-poor compared to English. This scarcity was long seen as the reason for the limited number of language-resource-based studies in Arabic. In recent years, however, considerably large and free corpora of Classical Arabic, Modern Standard Arabic (MSA), and dialectal Arabic have emerged, along with, to a lesser extent, Arabic processing tools.

This workshop follows in the footsteps of previous editions of OSACT in providing a forum for researchers to share and discuss their ongoing work. It is timely given the continued rise in research projects focusing on Arabic language resources. This sixth edition encourages researchers and practitioners of Arabic language technologies, including CL, NLP, and IR, to share and discuss their latest research efforts, corpora, and tools. The workshop also gives special attention to Large Language Models (LLMs) and generative AI, currently among the most active research areas. In addition to the general topics of CL, NLP, and IR, the workshop places special emphasis on two shared tasks: Arabic LLMs Hallucination and Dialect to MSA Machine Translation.

Shared Tasks


Task 1: Arabic LLMs Hallucination (contact Hamdy Mubarak)


The Arabic LLM Hallucination Shared Task challenges participants to address and mitigate issues related to hallucinated text generated by Arabic language models. It aims to enhance the reliability and trustworthiness of these models by fostering innovation in detection methods and strategies for reducing hallucination. This collaborative effort is vital for improving Arabic language model applications in information retrieval and text generation.

For this task, we share a dataset of 10,000 Arabic factual claims generated by ChatGPT and GPT-4 and judged manually for factuality and linguistic correctness by a wide range of annotators. Where applicable, annotators provided verification links.

We challenge participants to predict whether a claim is correct and, if it is incorrect, to rewrite the claim so that its factual errors are fixed.
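To make the detection subtask concrete, the sketch below shows a trivial majority-class baseline and an accuracy computation over a handful of toy claims. The record fields (`claim`, `label`) and the label values are illustrative assumptions, not the official data format of the shared task.

```python
# Toy examples only; the actual shared-task data format may differ.
sample = [
    {"claim": "The Nile flows through Egypt.", "label": "correct"},
    {"claim": "The Nile flows through Spain.", "label": "incorrect"},
    {"claim": "Arabic is a Semitic language.", "label": "correct"},
]

def majority_baseline(records):
    """Predict the most frequent gold label for every claim."""
    counts = {}
    for r in records:
        counts[r["label"]] = counts.get(r["label"], 0) + 1
    majority = max(counts, key=counts.get)
    return [majority for _ in records]

def accuracy(gold, pred):
    """Fraction of claims whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

preds = majority_baseline(sample)
gold = [r["label"] for r in sample]
print(accuracy(gold, preds))
```

Any real system would of course replace the majority baseline with a learned detector; the point here is only the expected input/output shape of the detection subtask.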

Link: https://sites.google.com/view/arabic-llms-hallucination

Task 2: Dialect to MSA Machine Translation (contact Kareem Darwish)


The Dialect to MSA (Modern Standard Arabic) Machine Translation Shared Task offers an opportunity for researchers and practitioners to tackle the intricate challenge of translating various Arabic dialects into Modern Standard Arabic. With the rich linguistic diversity across Arabic-speaking regions, this task aims to advance machine translation capabilities and bridge the gap between colloquial spoken Arabic and the formal written language. Participants will work on developing and refining translation models that can accurately and fluently convert dialectal Arabic text into MSA, making it a crucial initiative for improving communication and comprehension in the Arabic-speaking world.

The shared task covers five dialects: Gulf, Egyptian, Levantine, Iraqi, and Maghrebi. For each dialect, a set of 500 sentences written in both MSA and the dialect will be provided for fine-tuning, and testing will be done on a set of 500 blind sentences. Participants are free to use any resources at their disposal to train and fine-tune their systems.
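As a minimal illustration of how the provided parallel sentences might be organized for fine-tuning, the sketch below serializes dialect-to-MSA pairs as JSONL, the format commonly fed to seq2seq training scripts. The field names (`dialect`, `source`, `target`) are assumptions for this sketch, not the official release format.

```python
import json

# Hypothetical parallel pairs; the real task provides 500 MSA/dialect
# sentence pairs per dialect for fine-tuning.
pairs = [
    {"dialect": "Egyptian", "source": "عايز أروح البيت", "target": "أريد أن أذهب إلى البيت"},
    {"dialect": "Gulf", "source": "وش تبي", "target": "ماذا تريد"},
]

def to_jsonl(records):
    """Serialize records one JSON object per line, keeping Arabic readable
    (ensure_ascii=False avoids \\u escapes)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

lines = to_jsonl(pairs)
print(lines.count("\n") + 1)  # one line per sentence pair
```

From such a file, each `source` sentence would be the model input and the `target` MSA sentence the reference output during fine-tuning and evaluation.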

Link: https://codalab.lisn.upsaclay.fr/competitions/17118

Workshop Topics

Language Resources:

  • Pre-trained Arabic language models and their applications.
  • Surveying and evaluating the design of available Arabic corpora and their associated processing tools.
  • Making new annotated corpora available for NLP and IR applications such as named entity recognition, machine translation, sentiment analysis, text classification, and language learning.
  • Evaluating the use of crowdsourcing platforms for Arabic data annotation.
  • Open source Arabic processing toolkits.

Tools and Technologies:

  • Language education, e.g., L1 and L2.
  • Language modeling and pre-trained models.
  • Tokenization, normalization, word segmentation, morphological analysis, part-of-speech tagging, etc.
  • Sentiment analysis, dialect identification, and text classification.
  • Dialect translation.
  • Fake news detection.
  • Web and social media search and analytics.

Issues in the design, construction, and use of Arabic LRs (text, speech, sign, gesture, image, in single or multimodal/multimedia data):

  • Guidelines, standards, best practices and models for LRs interoperability.
  • Methodologies and tools for LRs construction and annotation.
  • Methodologies and tools for extraction and acquisition of knowledge.
  • Ontologies, terminology and knowledge representation.
  • LRs and Semantic Web (including Linked Data, Knowledge Graphs, etc.).

Paper Types and Formats


OSACT6 invites high-quality submissions written in English. Two forms of papers will be considered:

  1. Regular long papers – up to eight (8) pages maximum*, presenting substantial, original, completed, and unpublished work.
  2. Short papers – up to four (4) pages*, describing a small focused contribution, negative results, system demonstrations, etc.

* Excluding any number of additional pages for references, ethical considerations, conflict-of-interest, and data and code availability statements.

Upon acceptance, final versions of long papers will be given one additional page – up to nine (9) pages of content plus unlimited pages for acknowledgments and references – so that reviewers’ comments can be taken into account. Final versions of short papers may have up to five (5) pages, plus unlimited pages for acknowledgments and references. For both long and short papers, all figures and tables that are part of the main text must fit within these page limits.

Furthermore, appendices or supplementary material will be allowed ONLY in the final, camera-ready version, not during submission, as papers should be reviewable without the need to refer to any supplementary materials.

Linguistic examples, if any, should be presented in the original language but also glossed into English to allow accessibility for a broader audience.

Note that the paper type is a decision orthogonal to the eventual, final form of presentation (i.e., oral versus poster).

Important Dates


Submission due: March 1, 2024
Notification of acceptance: March 25, 2024
Camera-ready papers due: March 30, 2024
Workshop date: May 25, 2024

Submission Guidelines

The language of the workshop is English, and submissions must follow the LREC-COLING 2024 paper submission instructions (https://lrec-coling-2024.org/authors-kit/). All papers will be peer reviewed, typically by three independent referees. Papers must be submitted electronically in PDF format through the START system.

When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research.

Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and replicability of experiments (including evaluation ones).
Submission Link: https://softconf.com/lrec-coling2024/osact2024/

Accepted Papers

Main Workshop Papers:

1. Arabic Speech Recognition of zero-resourced Languages: A case of Shehri (Jibbali) Language
Norah Alrashoudi, Omar Said Alshahri and Hend Al-Khalifa

2. AraTar: A Corpus to Support the Fine-grained Detection of Hate Speech Targets in the Arabic Language
Seham Alghamdi, Youcef Benkhedda, Basma Alharbi and Riza Batista-Navarro

3. A Novel Approach for Root Selection in Dependency Parsing
Sharefah Al-Ghamdi, Hend Al-Khalifa and Abdulmalik AlSalman

4. Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
Saied Alshahrani, Hesham Mohammed, Ali Elfilali, Mariama Njie and Jeanna Matthews

5. CLEANANERCorp: Identifying and Correcting Incorrect Labels in the ANERcorp Dataset
Mashael AlDuwais, Hend Al-Khalifa and Abdulmalik AlSalman

6. Munazarat 1.0: A Corpus of Arabic Competitive Debates
Mohammad Khader, AbdulGabbar Al-Sharafi, Mohamad Hamza Al-Sioufy, Wajdi Zaghouani and Ali Al-Zawqari

7. TafsirExtractor: Text Preprocessing Pipeline preparing Classical Arabic Literature for Machine Learning Applications
Carl Kruse and Sajawel Ahmed

8. The Multilingual Corpus of World's Constitutions (MCWC)
Mo El-Haj and Saad Ezzini

9. AraMed: Arabic Medical Question Answering using Pretrained Transformer Language Models
Ashwag Alasmari, Sarah Alhumoud and Waad Alshammari

10. Advancing the Arabic WordNet: Elevating Content Quality
Abed Alhakim Freihat, Hadi Khalilia, Gábor Bella and Fausto Giunchiglia

Shared Task Papers:

1. Sirius_Translators at OSACT6 2024 Shared Task: Fin-tuning Ara-T5 Models for Translating Arabic Dialectal Text to Modern Standard Arabic
Salwa Alahmari

2. OSACT 2024 Task 2: Arabic Dialect to MSA Translation
Hanin Atwany, Nour Rabih, Ibrahim Mohammed, Abdul Waheed and Bhiksha Raj

3. ASOS at OSACT6 Shared Task: Investigation of Data Augmentation in Arabic Dialect-MSA Translation
Omer Nacar, Abdullah Alharbi, Serry Sibaee, Samar Ahmed, Lahouari Ghouti and Anis Koubaa

4. LLM-based MT Data Creation: Dialectal to MSA Translation Shared Task
AhmedElmogtaba Abdelaziz, Ashraf Elneima and Kareem Darwish

5. AraT5-MSAizer: Translating Dialectal Arabic to MSA
Murhaf Fares

6. ASOS at Arabic LLMs Hallucinations 2024: Can LLMs detect their Hallucinations :)
Serry Sibaee, Abdullah I. Alharbi, Samar Ahmed, Omar Nacar, Lahouari Ghouti and Anis Koubaa

Keynote Speaker


Muhammad Abdul-Mageed, University of British Columbia, Canada

Title:
Towards Arab-Centric Large Language Models

Abstract:

The landscape of large language models (LLMs) is rapidly evolving, yet it continues to face significant challenges in computational efficiency and energy usage. Arabic-centric LLMs are particularly fraught with issues such as inadequate evaluations, cultural insensitivity, insufficient representation of the wide array of Arabic dialects, an absence of multimodal capabilities, designs that are too generic for specialized domains, and a disconnection from other low-resource languages. These problems are compounded by a general lack of detailed knowledge about the Arabic capabilities of existing LLMs. In this talk, we address these challenges by developing a host of models capable of understanding and generating content in a broad spectrum of Arabic languages and dialects. In particular, we present a suite of generative models tailored for text, speech, and image generation, designed to support and enhance the representation of Arabic in several domains. Our approach leverages cutting-edge machine learning methods and large-scale, diverse datasets to ensure our models achieve both high accuracy and cultural relevance. By focusing on critical areas such as archival work, cultural heritage and preservation, financial services, healthcare delivery, and education, our work aims to bridge the linguistic digital divide and foster equitable AI benefits. We discuss the methods we employ, the challenges we encounter, the solutions we propose, and the broader implications of our efforts.


Bio:

Muhammad Abdul-Mageed is a Canada Research Chair in Natural Language Processing and Machine Learning, and Associate Professor with appointments in the School of Information, and the Departments of Linguistics and Computer Science at The University of British Columbia. He is also a Visiting Associate Professor at MBZUAI. His research is in deep learning and natural language processing, focusing on large language models in multilingual contexts, with a goal to innovate more equitable, efficient, and ‘social’ machines for improved human health, more engaging learning, safer social networking, and reduced information overload. Applications of his work span a wide range of areas across speech, language, and vision. He is director of the UBC Deep Learning & NLP Group, co-director of the SSHRC-funded I Trust Artificial Intelligence, and co-lead of the Ensuring Full Literacy Partnership. He is a founding member of the UBC Center for Artificial Intelligence Decision making and Action and a member of the Institute for Computing, Information, and Cognitive Systems. His work has been supported by Google, AMD, Amazon, Natural Sciences and Engineering Research Council of Canada, Social Sciences and Humanities Research Council of Canada, Canada Foundation for Innovation, and Digital Research Alliance of Canada.

Committees

Organizing Committee

  • Hend Al-Khalifa, King Saud University, KSA
  • Hamdy Mubarak, Qatar Computing Research Institute, Qatar
  • Kareem Darwish, aiXplain Inc., US
  • Tamer Elsayed, Qatar University, Qatar
  • Mona Ali, Northeastern University, Canada

Programme Committee

  • Ganesh Jawahar, University of British Columbia, Canada
  • Go Inoue, Mohamed bin Zayed University of Artificial Intelligence, UAE
  • Bassam Haddad, University of Petra, Jordan
  • Hamada Nayel, Banha University, Egypt
  • Ibrahim Abu Farha, The University of Sheffield, UK
  • Imed Zitouni, Google, USA
  • Almoataz B. Al-Said, Cairo University, Egypt
  • Mourad Abbas, Assistant Secretary-General of Al-Tnall Al-Arabi in Algeria
  • Nada Ghneim, Arab International University, Syria
  • Omar Trigui, University of Sousse, Tunisia
  • Salima Harrat, École Normale Supérieure de Bouzaréah (ENSB), Algeria
  • Salima Mdhaffar, Avignon University (LIA), France
  • Kamel Smaili, University of Lorraine, France
  • Violetta Cavalli-Sforza, Al Akhawayn University, Morocco
  • Wassim El-Hajj, American University of Beirut, Lebanon
  • Wissam Antoun, ALMAnaCH - INRIA Paris, France
  • Nada Almarwani, Taibah University, KSA
  • Samah Aloufi, Taibah University, KSA
  • Imene Bensalem, Constantine 2 University, Algeria
  • Abdelkader El Mahdaouy, Mohammed VI Polytechnic University, Morocco
  • Amr Keleg, University of Edinburgh, UK
  • Wajdi Zaghouani, Hamad Bin Khalifa University, Qatar
  • Amr El-Gendy, Arab Academy, Egypt
  • Maha Alamri, AlBaha University, KSA
  • Saied Alshahrani, Clarkson University, USA
  • Lubna Alhenaki, Majmaah University, KSA
  • Fatimah Alqahtani, Jazan University, KSA
  • Eman Albilali, King Saud University, KSA
  • Ahmed Abdelali, SDAIA, KSA
  • Mohamed Al-Badrashiny, aiXplain Inc., US
  • Firoj Alam, QCRI, Qatar
  • Norah Alzahrani, SDAIA, KSA
  • Nadir Durrani, QCRI, Qatar
  • Ashraf Elneima, aiXplain Inc., US
  • Nizar Habash, NYU-AD, UAE
  • Walid Magdy, University of Edinburgh, UK
  • Zaid Alyafeai, KFUPM, KSA
  • Injy Hamed, NYU-AD, UAE
  • Fouzi Harrag, Ferhat Abbas University, Algeria

Workshop Program

Saturday 25 May 2024

                        

Session 1: Main Workshop

9:00 - 9:10    Workshop Opening
9:10 - 9:50    Keynote Talk: Towards Arab-Centric Large Language Models
Muhammad Abdul-Mageed
9:50 - 10:10   

AraTar: A Corpus to Support the Fine-grained Detection of Hate Speech Targets in the Arabic Language
Seham Alghamdi1, Youcef Benkhedda2, Basma Alharbi3, Riza Batista-Navarro4
1Department of Computer Science, University of Manchester, 2University of Manchester, 3University of Jeddah, 4Department of Computer Science, The University of Manchester

10:10 - 10:30   

CLEANANERCorp: Identifying and Correcting Incorrect Labels in the ANERcorp Dataset
Mashael AlDuwais1, Hend Al-Khalifa1, Abdulmalik AlSalman2
1King Saud University, 2King Saud Univ.

                        

Session 2: Main Workshop (Cont.)

11:00 - 11:20   

Munazarat 1.0: A Corpus of Arabic Competitive Debates
Mohammad M Khader1, AbdulGabbar Al-Sharafi2, Mohamad Hamza Al-Sioufy3, Wajdi Zaghouani4, Ali Al-Zawqari5
1QatarDebate Center, 2Sultan Qaboos University, 3Georgetown University - Qatar, 4Hamad Bin Khalifa University, 5ELEC department, Vrije Universiteit Brussel

11:20 - 11:40   

Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
Saied Alshahrani1, Hesham Haroon Mohammed2, Ali Elfilali3, Mariama Njie4, Jeanna Matthews1
1Clarkson University, 2smsmAI, 3Student, 4M&T Bank

11:40 - 12:00   

A Novel Approach for Root Selection in Dependency Parsing
Sharefah Ahmed Al-Ghamdi1, Hend Al-Khalifa1, Abdulmalik AlSalman2
1King Saud University, 2King Saud Univ.

12:00 - 12:20   

AraMed: Arabic Medical Question Answering using Pretrained Transformer Language Models
Ashwag Alasmari1, Sarah Alhumoud2, Waad Alshammari3
1National Institutes of Health, 2Al Imam Mohammad Ibn Saud Islamic University, 3King Salman Academy for Arabic Language

12:20 - 12:40   

The Multilingual Corpus of World’s Constitutions (MCWC)
Mo El-Haj and Saad Ezzini
Lancaster University

12:40 - 13:00   

TafsirExtractor: Text Preprocessing Pipeline preparing Classical Arabic Literature for Machine Learning Applications
Carl Kruse1 and Sajawel Ahmed12
1Goethe-University Frankfurt am Main, 2University of California, Davis

                        

Session 3: Main Workshop (Cont.)

14:00 - 14:20   

Advancing the Arabic WordNet: Elevating Content Quality
Abed Alhakim Freihat1, Hadi Mahmoud Khalilia1, Gábor Bella2, Fausto Giunchiglia1
1The University of Trento, 2Lab-STICC CNRS UMR 628, IMT Atlantique

14:20 - 14:40   

Arabic Speech Recognition of zero-resourced Languages: A case of Shehri (Jibbali) Language
Norah A. Alrashoudi1, Omar Said Alshahri2, Hend Al-Khalifa1
1King Saud University, 2Islamic Sciences Institute, Diwan of the Royal Court.

                        

Session 4: Shared Tasks

14:40 - 14:55   

OSACT6 Dialect to MSA Translation Shared Task Overview
Ashraf Hatim Elneima1, AhmedElmogtaba Abdelmoniem Ali Abdelaziz1, Kareem Darwish2
1aiXplain, 2aiXplain Inc.

14:55 - 15:10   

OSACT 2024 Task 2: Arabic Dialect to MSA Translation
Hanin Atwany1, Nour Rabih1, Ibrahim Mohammed1, Abdul Waheed2, Bhiksha Raj3
1MBZUAI, 2Mohammad Bin Zayed University of Artificial Intelligence, 3Carnegie Mellon University

15:10 - 15:25   

ASOS at OSACT6 Shared Task: Investigation of Data Augmentation in Arabic Dialect-MSA Translation
Omer Nacar1, Abdullah Alharbi2, Serry Sibaee1, Samar Ahmed3, Lahouari Ghouti1, Anis Koubaa1
1Robotics and Internet-of-Things Lab, Prince Sultan University, Riyadh 12435, Saudi Arabia, 2King Abdulaziz University, 3Imam Mohammad Ibn Saud Islamic University

15:25 - 15:50   

LLM-based MT Data Creation: Dialectal to MSA Translation Shared Task
AhmedElmogtaba Abdelmoniem Ali Abdelaziz1, Ashraf Hatim Elneima1, Kareem Darwish2
1aiXplain, 2aiXplain Inc.

15:50 - 16:00   

Sirius_Translators at OSACT6 2024 Shared Task: Fin-tuning Ara-T5 Models for Translating Arabic Dialectal Text to Modern Standard Arabic
Salwa Saad Alahmari
University of Leeds

                        

Session 5: Shared Tasks (Cont.)

16:30 - 16:45   

AraT5-MSAizer: Translating Dialectal Arabic to MSA
Murhaf Fares
Independent

16:45 - 17:00   

ASOS at Arabic LLMs Hallucinations 2024: Can LLMs detect their Hallucinations :)
Serry Taiseer Sibaee1, Abdullah I. Alharbi2, Samar Ahmed3, Omar Nacar1, Lahouari Ghouti4, Anis Koubaa1
1Robotics and Internet-of-Things Lab, Prince Sultan University, Riyadh 12435, Saudi Arabia, 2King Abdulaziz University, 3Imam Mohammad Ibn Saud Islamic University, 4Prince Sultan University, Riyadh 12435, Saudi Arabia

17:00 - 17:05    Workshop Closing