Workshop Description

In the computational linguistics (CL), natural language processing (NLP), and information retrieval (IR) communities, Arabic is considered to be relatively resource-poor compared to English. This situation was thought to be the reason for the limited number of language-resource-based studies in Arabic. However, the past few years have witnessed the emergence of new, considerably large, and free classical and Modern Standard Arabic (MSA) corpora as well as dialectal corpora and, to a lesser extent, Arabic processing tools.

This workshop follows in the footsteps of previous editions of OSACT to provide a forum for researchers to share and discuss their ongoing work. This workshop is timely given the continued rise in research projects focusing on Arabic language resources. This sixth edition aims to encourage researchers and practitioners of Arabic language technologies, including CL, NLP, and IR, to share and discuss their latest research efforts, corpora, and tools. The workshop will also give special attention to Large Language Models (LLMs) and Generative AI, which are currently active research topics. In addition to the general topics of CL, NLP, and IR, the workshop will place special emphasis on two shared tasks, namely: Arabic LLMs Hallucination and Dialect to MSA Machine Translation.

Shared Tasks

Task 1: Arabic LLMs Hallucination (contact Hamdy Mubarak)

The Arabic LLM Hallucination Shared Task challenges participants to address and mitigate issues related to hallucinated text generated by Arabic language models. It aims to enhance the reliability and trustworthiness of these models by fostering innovation in detection methods and strategies for reducing hallucination. This collaborative effort is vital for improving Arabic language model applications in information retrieval and text generation.

For this task, we share a dataset of 10,000 Arabic factual claims generated by ChatGPT and GPT4 and judged manually for factuality and linguistic correctness by a wide range of annotators. Where applicable, annotators provided verification links.

We challenge participants to predict whether a claim is correct and, if it is incorrect, to rewrite the claim so that its factual errors are fixed.

Link: https://sites.google.com/view/arabic-llms-hallucination
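
The prediction part of Task 1 can be illustrated with a minimal sketch. The record layout (`ClaimPrediction`), the placeholder claim strings, and the use of plain accuracy as the score are assumptions made here for illustration; the official data format and evaluation measure are defined on the task page linked above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClaimPrediction:
    """One system output for a claim (hypothetical record layout)."""
    claim: str                     # the Arabic claim text
    is_correct: bool               # predicted label: True if judged factually correct
    rewrite: Optional[str] = None  # proposed corrected claim, only for incorrect claims


def accuracy(gold: List[bool], predicted: List[ClaimPrediction]) -> float:
    """Fraction of claims whose predicted label matches the manual judgment."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    hits = sum(g == p.is_correct for g, p in zip(gold, predicted))
    return hits / len(gold)


# Toy example with placeholder strings instead of real claims.
gold_labels = [True, False, False]
predictions = [
    ClaimPrediction("<claim 1>", True),
    ClaimPrediction("<claim 2>", False, rewrite="<corrected claim 2>"),
    ClaimPrediction("<claim 3>", True),  # wrong: the gold label is False
]
score = accuracy(gold_labels, predictions)  # 2 of 3 labels match
```

Scoring the rewrites themselves is left out of this sketch, since judging a corrected claim requires either human review or a reference correction.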

Task 2: Dialect to MSA Machine Translation (contact Kareem Darwish)

The Dialect to MSA (Modern Standard Arabic) Machine Translation Shared Task offers an opportunity for researchers and practitioners to tackle the intricate challenge of translating various Arabic dialects into Modern Standard Arabic. With the rich linguistic diversity across Arabic-speaking regions, this task aims to advance machine translation capabilities and bridge the gap between colloquial spoken Arabic and the formal written language. Participants will work on developing and refining translation models that can accurately and fluently convert dialectal Arabic text into MSA, making it a crucial initiative for improving communication and comprehension in the Arabic-speaking world.

The shared task will cover multiple dialects, namely: Gulf, Egyptian, Levantine, Iraqi, and Maghrebi. For each dialect, a set of 500 sentences written in both MSA and dialect will be provided for finetuning, and the testing will be done on a set of 500 blind sentences. The participants are free to use whatever resources are at their disposal to train and finetune their systems.

Link: https://codalab.lisn.upsaclay.fr/competitions/17118
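
As a rough sketch of the finetuning data described for Task 2, the parallel sentences can be held as dialect-tagged pairs and grouped per dialect. The `ParallelPair` type, its field names, and the placeholder sentences are hypothetical; the actual file format is the one distributed through the competition page above.

```python
from dataclasses import dataclass
from typing import Dict, List

# The five dialects and the per-dialect finetuning size stated in the task description.
DIALECTS = ["Gulf", "Egyptian", "Levantine", "Iraqi", "Maghrebi"]
PAIRS_PER_DIALECT = 500


@dataclass(frozen=True)
class ParallelPair:
    """One dialect sentence with its MSA counterpart (hypothetical layout)."""
    dialect: str
    source: str  # sentence in the dialect
    target: str  # the same sentence in MSA


def group_by_dialect(pairs: List[ParallelPair]) -> Dict[str, List[ParallelPair]]:
    """Bucket pairs by dialect, rejecting labels outside the five task dialects."""
    grouped: Dict[str, List[ParallelPair]] = {d: [] for d in DIALECTS}
    for pair in pairs:
        if pair.dialect not in grouped:
            raise ValueError(f"unknown dialect: {pair.dialect}")
        grouped[pair.dialect].append(pair)
    return grouped


# Total finetuning pairs across all dialects: 5 * 500 = 2500.
total_finetuning_pairs = len(DIALECTS) * PAIRS_PER_DIALECT

# Toy example with placeholder strings instead of real sentences.
sample = [
    ParallelPair("Gulf", "<Gulf sentence>", "<MSA sentence>"),
    ParallelPair("Egyptian", "<Egyptian sentence>", "<MSA sentence>"),
]
grouped = group_by_dialect(sample)
```

Grouping by dialect makes it easy to check that each dialect has the expected number of pairs before finetuning, or to train one system per dialect.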

Workshop Topics

Language Resources:

Pre-trained Arabic language models and their applications.
Surveying and evaluating the design of available Arabic corpora and their associated processing tools.
Making available new annotated corpora for NLP and IR applications such as named entity recognition, machine translation, sentiment analysis, text classification, and language learning.
Evaluating the use of crowdsourcing platforms for Arabic data annotation.
Open source Arabic processing toolkits.

Tools and Technologies:

Language education, e.g., L1 and L2.
Language modeling and pre-trained models.
Tokenization, normalization, word segmentation, morphological analysis, part-of-speech tagging, etc.
Sentiment analysis, dialect identification, and text classification.
Dialect translation.
Fake news detection.
Web and social media search and analytics.
Issues in the design, construction, and use of Arabic LRs: text, speech, sign, gesture, image, in single or multimodal/multimedia data.
Guidelines, standards, best practices, and models for LRs interoperability.
Methodologies and tools for LRs construction and annotation.
Methodologies and tools for extraction and acquisition of knowledge.
Ontologies, terminology, and knowledge representation.
LRs and Semantic Web (including Linked Data, Knowledge Graphs, etc.).

Paper Types and Formats

OSACT6 invites high-quality submissions written in English. Submissions of two forms of papers will be considered:

Regular long papers – up to eight (8) pages*, presenting substantial, original, completed, and unpublished work.
Short papers – up to four (4) pages*, describing a small focused contribution, negative results, system demonstrations, etc.

* Excluding any number of additional pages for references, ethical considerations, conflict-of-interest statements, and data and code availability statements.

Upon acceptance, final versions of long papers will be given one additional page – up to nine (9) pages of content plus unlimited pages for acknowledgments and references – so that reviewers’ comments can be taken into account. Final versions of short papers may have up to five (5) pages, plus unlimited pages for acknowledgments and references. For both long and short papers, all figures and tables that are part of the main text must fit within these page limits.

Furthermore, appendices or supplementary material will also be allowed ONLY in the final, camera-ready version, but not during submission, as papers should be reviewed without the need to refer to any supplementary materials.

Linguistic examples, if any, should be presented in the original language but also glossed into English to allow accessibility for a broader audience.

Note that the paper type is decided independently of the final form of presentation (i.e., oral versus poster).

Important Dates

Submission due: March 1, 2024
Notification of acceptance: March 25, 2024
Camera-ready papers due: March 30, 2024
Workshop date: May 25, 2024

Submission Guidelines

The language of the workshop is English, and submissions should follow the LREC 2024 paper submission instructions (https://lrec-coling-2024.org/authors-kit/). All papers will be peer reviewed, possibly by three independent referees. Papers must be submitted electronically in PDF format to the START system.

When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e., also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research.

Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments (including evaluation ones).

Submission Link: https://softconf.com/lrec-coling2024/osact2024/

Accepted Papers

Main Workshop Papers:

1. Arabic Speech Recognition of zero-resourced Languages: A case of Shehri (Jibbali) Language
Norah Alrashoudi, Omar Said Alshahri and Hend Al-Khalifa

2. AraTar: A Corpus to Support the Fine-grained Detection of Hate Speech Targets in the Arabic Language
Seham Alghamdi, Youcef Benkhedda, Basma Alharbi and Riza Batista-Navarro

3. A Novel Approach for Root Selection in Dependency Parsing
Sharefah Al-Ghamdi, Hend Al-Khalifa and Abdulmalik AlSalman

4. Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
Saied Alshahrani, Hesham Mohammed, Ali Elfilali, Mariama Njie and Jeanna Matthews

5. CLEANANERCorp: Identifying and Correcting Incorrect Labels in the ANERcorp Dataset
Mashael AlDuwais, Hend Al-Khalifa and Abdulmalik AlSalman

6. Munazarat 1.0: A Corpus of Arabic Competitive Debates
Mohammad Khader, AbdulGabbar Al-Sharafi, Mohamad Hamza Al-Sioufy, Wajdi Zaghouani and Ali Al-Zawqari

7. TafsirExtractor: Text Preprocessing Pipeline preparing Classical Arabic Literature for Machine Learning Applications
Carl Kruse and Sajawel Ahmed

8. The Multilingual Corpus of World's Constitutions (MCWC)
Mo El-Haj and Saad Ezzini

9. AraMed: Arabic Medical Question Answering using Pretrained Transformer Language Models
Ashwag Alasmari, Sarah Alhumoud and Waad Alshammari

10. Advancing the Arabic WordNet: Elevating Content Quality
Abed Alhakim Freihat, Hadi Khalilia, Gábor Bella and Fausto Giunchiglia

Shared Task Papers:

1. Sirius_Translators at OSACT6 2024 Shared Task: Fin-tuning Ara-T5 Models for Translating Arabic Dialectal Text to Modern Standard Arabic
Salwa Alahmari

2. OSACT 2024 Task 2: Arabic Dialect to MSA Translation
Hanin Atwany, Nour Rabih, Ibrahim Mohammed, Abdul Waheed and Bhiksha Raj

3. ASOS at OSACT6 Shared Task: Investigation of Data Augmentation in Arabic Dialect-MSA Translation
Omer Nacar, Abdullah Alharbi, Serry Sibaee, Samar Ahmed, Lahouari Ghouti and Anis Koubaa

4. LLM-based MT Data Creation: Dialectal to MSA Translation Shared Task
AhmedElmogtaba Abdelaziz, Ashraf Elneima and Kareem Darwish

5. AraT5-MSAizer: Translating Dialectal Arabic to MSA
Murhaf Fares

6. ASOS at Arabic LLMs Hallucinations 2024: Can LLMs detect their Hallucinations :)
Serry Sibaee, Abdullah I. Alharbi, Samar Ahmed, Omar Nacar, Lahouri Ghouti and Anis Koubaa

Keynote Speaker

Muhammad Abdul-Mageed, University of British Columbia, Canada

Title:
Towards Arab-Centric Large Language Models

Abstract:

The landscape of large language models (LLMs) is rapidly evolving, yet it continues to face significant challenges in computational efficiency and energy usage. Arabic-centric LLMs are particularly fraught with issues such as inadequate evaluations, cultural insensitivity, insufficient representation of the wide array of Arabic dialects, an absence of multimodal capabilities, designs that are too generic for specialized domains, and a disconnection from other low-resource languages. These problems are compounded by a general lack of detailed knowledge about the Arabic capabilities of existing LLMs. In this talk, we address these challenges by developing a host of models capable of understanding and generating content in a broad spectrum of Arabic languages and dialects. In particular, we present a suite of generative models tailored for text, speech, and image generation, designed to support and enhance the representation of Arabic in several domains. Our approach leverages cutting-edge machine learning methods and large-scale, diverse datasets to ensure our models achieve both high accuracy and cultural relevance. By focusing on critical areas such as archival work, cultural heritage and preservation, financial services, healthcare delivery, and education, our work aims to bridge the linguistic digital divide and foster equitable AI benefits. We discuss the methods we employ, the challenges we encounter, the solutions we propose, and the broader implications of our efforts.

Bio:

Muhammad Abdul-Mageed is a Canada Research Chair in Natural Language Processing and Machine Learning, and Associate Professor with appointments in the School of Information, and the Departments of Linguistics and Computer Science at The University of British Columbia. He is also a Visiting Associate Professor at MBZUAI. His research is in deep learning and natural language processing, focusing on large language models in multilingual contexts, with a goal to innovate more equitable, efficient, and ‘social’ machines for improved human health, more engaging learning, safer social networking, and reduced information overload. Applications of his work span a wide range of areas across speech, language, and vision. He is director of the UBC Deep Learning & NLP Group, co-director of the SSHRC-funded I Trust Artificial Intelligence, and co-lead of the Ensuring Full Literacy Partnership. He is a founding member of the UBC Center for Artificial Intelligence Decision making and Action and a member of the Institute for Computing, Information, and Cognitive Systems. His work has been supported by Google, AMD, Amazon, Natural Sciences and Engineering Research Council of Canada, Social Sciences and Humanities Research Council of Canada, Canada Foundation for Innovation, and Digital Research Alliance of Canada.

Workshop Program

Saturday 25 May 2024

Session 1: Main Workshop

9:00 - 9:10

Workshop Opening

9:10 - 9:50

Keynote Talk: Towards Arab-Centric Large Language Models
Muhammad Abdul-Mageed

9:50 - 10:10

AraTar: A Corpus to Support the Fine-grained Detection of Hate Speech Targets in the Arabic Language
Seham Alghamdi¹, Youcef Benkhedda², Basma Alharbi³, Riza Batista-Navarro⁴
¹Department of Computer Science, University of Manchester, ²University of Manchester, ³University of Jeddah, ⁴Department of Computer Science, The University of Manchester

10:10 - 10:30

CLEANANERCorp: Identifying and Correcting Incorrect Labels in the ANERcorp Dataset
Mashael AlDuwais¹, Hend Al-Khalifa¹, Abdulmalik AlSalman²
¹King Saud University, ²King Saud Univ.

Session 2: Main Workshop (Cont.)

11:00 - 11:20

Munazarat 1.0: A Corpus of Arabic Competitive Debates
Mohammad M Khader¹, AbdulGabbar Al-Sharafi², Mohamad Hamza Al-Sioufy³, Wajdi Zaghouani⁴, Ali Al-Zawqari⁵
¹QatarDebate Center, ²Sultan Qaboos University, ³Georgetown University - Qatar, ⁴Hamad Bin Khalifa University, ⁵ELEC department, Vrije Universiteit Brussel

11:20 - 11:40

Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
Saied Alshahrani¹, Hesham Haroon Mohammed², Ali Elfilali³, Mariama Njie⁴, Jeanna Matthews¹
¹Clarkson University, ²smsmAI, ³Student, ⁴M&T Bank

11:40 - 12:00

A Novel Approach for Root Selection in the Dependency Parsing
Sharefah Ahmed Al-Ghamdi¹, Hend Al-Khalifa¹, Abdulmalik AlSalman²
¹King Saud University, ²King Saud Univ.

12:00 - 12:20

AraMed: Arabic Medical Question Answering using Pretrained Transformer Language Models
Ashwag Alasmari¹, Sarah Alhumoud², Waad Alshammari³
¹National Institutes of Health, ²Al Imam Mohammad ibn Saud Islamic University, ³King Salman Academy for Arabic Language

12:20 - 12:40

The Multilingual Corpus of World’s Constitutions (MCWC)
Mo El-Haj and Saad Ezzini
Lancaster University

12:40 - 13:00

TafsirExtractor: Text Preprocessing Pipeline preparing Classical Arabic Literature for Machine Learning Applications
Carl Kruse¹ and Sajawel Ahmed¹,²
¹Goethe-University Frankfurt am Main, ²University of California, Davis

Session 3: Main Workshop (Cont.)

14:00 - 14:20

Advancing the Arabic WordNet: Elevating Content Quality
Abed Alhakim Freihat¹, Hadi Mahmoud Khalilia¹, Gábor Bella², Fausto Giunchiglia¹
¹The University of Trento, ²Lab-STICC CNRS UMR 628, IMT Atlantique

14:20 - 14:40

Arabic Speech Recognition of zero-resourced Languages: A case of Shehri (Jibbali) Language
Norah A. Alrashoudi¹, Omar Said Alshahri², Hend Al-Khalifa¹
¹King Saud University, ²Islamic Sciences Institute, Diwan of the Royal Court

Session 4: Shared Tasks

14:40 - 14:55

OSACT6 Dialect to MSA Translation Shared Task Overview
Ashraf Hatim Elneima¹, AhmedElmogtaba Abdelmoniem Ali Abdelaziz¹, Kareem Darwish²
¹aiXplain, ²aiXplain Inc.

14:55 - 15:10

OSACT 2024 Task 2: Arabic Dialect to MSA Translation
Hanin Atwany¹, Nour Rabih¹, Ibrahim Mohammed¹, Abdul Waheed², Bhiksha Raj³
¹MBZUAI, ²Mohammad Bin Zayed University of Artificial Intelligence, ³Carnegie Mellon University

15:10 - 15:25

ASOS at OSACT6 Shared Task: Investigation of Data Augmentation in Arabic Dialect-MSA Translation
Omer Nacar¹, Abdullah Alharbi², Serry Sibaee¹, Samar Ahmed³, Lahouari Ghouti¹, Anis Koubaa¹
¹Robotics and Internet-of-Things Lab, Prince Sultan University, Riyadh 12435, Saudi Arabia, ²King Abdulaziz University, ³Imam Mohammad Ibn Saud Islamic University

15:25 - 15:50

LLM-based MT Data Creation: Dialectal to MSA Translation Shared Task
AhmedElmogtaba Abdelmoniem Ali Abdelaziz¹, Ashraf Hatim Elneima¹, Kareem Darwish²
¹aiXplain, ²aiXplain Inc.

15:50 - 16:00

Sirius_Translators at OSACT6 2024 Shared Task: Fin-tuning Ara-T5 Models for Translating Arabic Dialectal Text to Modern Standard Arabic
Salwa Saad Alahmari
University of Leeds

Session 5: Shared Tasks (Cont.)

16:30 - 16:45

AraT5-MSAizer: Translating Dialectal Arabic to MSA
Murhaf Fares
Independent

16:45 - 17:00

ASOS at Arabic LLMs Hallucinations 2024: Can LLMs detect their Hallucinations :)
Serry Taiseer Sibaee¹, Abdullah I. Alharbi², Samar Ahmed³, Omar Nacar¹, Lahouri Ghouti⁴, Anis Koubaa¹
¹Robotics and Internet-of-Things Lab, Prince Sultan University, Riyadh 12435, Saudi Arabia, ²King Abdulaziz University, ³Imam Mohammad Ibn Saud Islamic University, ⁴Prince Sultan University, Riyadh 12435, Saudi Arabia

17:00 - 17:05

Workshop Closing
