AI Papers Reader

Personalized digests of latest AI research

View on GitHub

AfriHate: A Multilingual Hate Speech Dataset for African Languages

A groundbreaking new multilingual dataset, AfriHate, is set to revolutionize the study and detection of hate speech and abusive language in African contexts. Developed by a large international team of researchers, the dataset comprises over 50,000 annotated instances across 15 African languages, addressing a critical gap in existing resources. This detailed analysis moves beyond simple keyword spotting, capturing the nuances of hate speech as it manifests within specific socio-cultural contexts across the continent.

The paper detailing AfriHate highlights the inherent challenges in creating such a resource. Traditional methods, relying on keywords and hashtags, often fail to capture the subtleties of hate speech, which is heavily context-dependent and can vary significantly between languages and cultures. For example, a word considered innocuous in one region might be highly offensive in another. The researchers overcame this by employing several strategies:

  • Crowdsourced keywords: The team worked with native speakers and social media influencers to gather comprehensive lists of hate speech keywords in each language. This strategy ensured that culturally relevant terms were included.
  • Manual data collection: For some languages, manual collection was necessary to supplement the keyword-based approach, particularly focusing on tweets from accounts known for posting hate speech.
  • Existing datasets: Where available, the researchers incorporated and re-annotated data from existing datasets, ensuring consistency in the annotation scheme.

Each instance in AfriHate is carefully annotated by native speakers, ensuring accuracy and cultural sensitivity. This is a significant improvement over previous datasets where annotation might have been done by non-native speakers lacking the necessary cultural context. For example, a tweet in Hausa containing seemingly innocuous words could be laden with hate speech given its local context and cultural background, and only a native speaker would be able to reliably identify this. The annotations further specify targets of hate speech, classifying instances based on attributes like ethnicity, politics, gender, religion, disability, or other categories.

The paper details the challenges of building such a dataset, such as the difficulties in identifying tweets in target languages among multilingual tweets, and the class imbalance, which is common in hate speech datasets. They addressed these issues by implementing a pre-annotation step for imbalanced classes, allowing the identification of hate and abusive content.

The researchers evaluated their dataset using various machine learning techniques. They compared the performance of monolingual (training and testing on the same language) and multilingual models (training across multiple languages and testing on each individually). They found that multilingual models generally outperformed monolingual models, particularly for low-resource languages, showcasing the advantages of leveraging data from multiple languages. The study also compared fine-tuning pre-trained language models and prompt engineering techniques and achieved a considerable success with prompting large language models, indicating promise in a wider accessibility and ease of use of the dataset.

AfriHate’s open-access nature makes it an invaluable resource for researchers, developers, and organizations working to combat hate speech in Africa. The dataset, annotation guidelines, and baseline results are publicly available, providing a foundation for future research and the development of effective hate speech detection tools tailored to African languages and cultures. The researchers acknowledge that their dataset, though substantial, cannot capture the full spectrum of hate speech expressions and highlight important ethical considerations, stressing human-in-the-loop approaches and avoiding automated content removal based on this dataset alone. AfriHate’s release marks a vital step towards creating a more inclusive and safer online environment across the diverse linguistic landscape of Africa.