Ahmad Hussein | AI Systems Architect

Abstract

During the COVID-19 pandemic, the spread of misinformation posed a significant public health threat. This work addresses the "infodemic" in Arabic-speaking communities with an automated system that classifies Arabic tweets as reliable or unreliable.

Background

The NLP4IF 2021 shared task focused on fighting the COVID-19 infodemic across multiple languages. Our team tackled the Arabic track by fine-tuning the pretrained AraBERT model.

Methodology

Data Preparation

We worked with the official dataset containing:

2,000+ labeled Arabic tweets

Binary labels: reliable vs. unreliable

Mixed content: claims, questions, and statements

Model Architecture

Our approach used a fine-tuned AraBERT-large model with:

Custom classification head

Gradient accumulation for memory efficiency

Label smoothing for robustness
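
To make the label smoothing step concrete, here is a minimal pure-Python sketch of a smoothed cross-entropy loss for the two-class case. It is an illustration, not the project's training code: the smoothing factor `epsilon = 0.1` and all function names are assumptions, since the write-up does not state them. It follows the common convention in which the true class keeps `1 - epsilon` of the probability mass and `epsilon` is spread uniformly over all classes.

```python
import math
from typing import List, Sequence


def smooth_labels(target: int, num_classes: int = 2, epsilon: float = 0.1) -> List[float]:
    """Turn a hard label into a soft target distribution.

    epsilon is an assumed value; the source does not report it.
    """
    off_value = epsilon / num_classes
    return [
        (1.0 - epsilon) + off_value if k == target else off_value
        for k in range(num_classes)
    ]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_cross_entropy(logits: Sequence[float], target: int, epsilon: float = 0.1) -> float:
    """Cross-entropy between the smoothed target and the predicted distribution."""
    probs = softmax(logits)
    soft_target = smooth_labels(target, len(logits), epsilon)
    return -sum(q * math.log(p) for q, p in zip(soft_target, probs))


if __name__ == "__main__":
    # A very confident, correct prediction is still penalised slightly,
    # which discourages the classifier from producing extreme logits.
    print(smoothed_cross_entropy([5.0, -5.0], target=0, epsilon=0.0))
    print(smoothed_cross_entropy([5.0, -5.0], target=0, epsilon=0.1))
```

With `epsilon = 0.0` the function reduces to ordinary cross-entropy, so the two printed values show how much extra loss smoothing adds to an over-confident prediction.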

Training Strategy

Learning rate: 2e-5 with linear warmup

Batch size: 16 per step (effective batch of 64 via 4 gradient-accumulation steps)

Early stopping based on validation F1
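
The three training settings above can be sketched in framework-free Python. This is a hedged illustration under stated assumptions, not the original training loop: the write-up specifies only "linear warmup", so the linear decay after warmup, the warmup length, the early-stopping `patience`, and every name below are hypothetical. Only the peak learning rate (2e-5) and the batch sizes (16 and 64) come from the text.

```python
from dataclasses import dataclass

MICRO_BATCH = 16                               # per-step batch size (from the text)
EFFECTIVE_BATCH = 64                           # effective batch size (from the text)
ACCUM_STEPS = EFFECTIVE_BATCH // MICRO_BATCH   # 4 micro-batches per optimizer update


def learning_rate(step: int, total_steps: int, warmup_steps: int,
                  peak_lr: float = 2e-5) -> float:
    """Linear warmup to peak_lr, then linear decay to zero.

    The decay phase is an assumption; the source mentions warmup only.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


def should_update(micro_step: int) -> bool:
    """True on the micro-batch (1-indexed) that completes an accumulation window."""
    return micro_step % ACCUM_STEPS == 0


@dataclass
class EarlyStopping:
    """Stop when validation F1 has not improved for `patience` epochs."""
    patience: int = 3                # assumed value, not given in the source
    best_f1: float = float("-inf")
    bad_epochs: int = 0

    def step(self, val_f1: float) -> bool:
        """Record one epoch's validation F1; return True when training should stop."""
        if val_f1 > self.best_f1:
            self.best_f1 = val_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopping(patience=2)
    for epoch, f1 in enumerate([0.80, 0.85, 0.84, 0.85], start=1):
        if stopper.step(f1):
            print(f"stopping after epoch {epoch}, best F1 {stopper.best_f1}")
            break
```

In a real loop, gradients from each micro-batch of 16 would be accumulated and the optimizer stepped only when `should_update` returns true, so each update sees 64 examples while only 16 are held in memory at once.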

Results

| Model | F1-Score | Accuracy |
|-------|----------|----------|
| Baseline (TF-IDF + SVM) | 0.72 | 0.74 |
| mBERT | 0.81 | 0.83 |
| AraBERT (Ours) | 0.87 | 0.88 |
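
For reference, the two reported metrics can be computed from predictions as in the sketch below. The write-up does not say whether its F1 is the positive-class, macro, or weighted variant, so this example assumes the positive-class F1 with label `1` as the positive class; the function name and the toy labels are illustrative and are not the project's data.

```python
from typing import Sequence, Tuple


def f1_and_accuracy(y_true: Sequence[int], y_pred: Sequence[int],
                    positive: int = 1) -> Tuple[float, float]:
    """Return (F1 for the positive class, overall accuracy)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return f1, accuracy


if __name__ == "__main__":
    # Toy labels for illustration only.
    gold = [1, 1, 1, 0, 0, 0, 0, 1]
    pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(f1_and_accuracy(gold, pred))
```

F1 and accuracy can diverge when the classes are imbalanced, which is one reason to report both.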

Impact

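This work contributed to the broader effort of combating COVID-19 misinformation. It also demonstrated the value of language-specific pretrained models for Arabic NLP: AraBERT reached an F1 of 0.87, compared with 0.81 for multilingual mBERT and 0.72 for the TF-IDF + SVM baseline.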