New Dataset Aims to Make AI Toxicity Detection More Equitable
Researchers have developed a new dataset, ModelCitizens, designed to improve the accuracy and fairness of AI systems that detect toxic language online. The dataset addresses a critical flaw in existing systems: their tendency to overlook the nuances of how different communities perceive harmful content.
Current AI models for detecting toxic language are typically trained on data where annotations are aggregated across all annotators. This approach can erase important context, such as when a group reclaims a slur or uses language in a way that is not intended to be harmful within their community. The paper argues that this “one-size-fits-all” approach can inadvertently marginalize the voices of historically vulnerable communities.
To tackle this, ModelCitizens includes over 6,800 social media posts with more than 40,000 toxicity annotations from individuals belonging to eight different identity groups: Asian, Black, Jewish, Latino, LGBTQ+, Mexican, Muslim, and Women. Crucially, the dataset collects annotations from both “ingroup” members (those who identify with the targeted group) and “outgroup” members (those who do not). This allows researchers to highlight significant differences in how toxicity is perceived.
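Concretely, a record in such a dataset might pair a post with per-annotator labels and each annotator's identity groups. The sketch below is illustrative only; the field names are hypothetical and not the dataset's actual schema.

```python
# Illustrative sketch of a ModelCitizens-style record (field names are
# hypothetical, not the dataset's actual schema).
example_record = {
    "post": "Example social media post targeting a specific identity group.",
    "target_group": "LGBTQ+",
    "annotations": [
        # Each annotator gives a binary toxicity label; their own identity
        # groups determine whether they count as ingroup or outgroup.
        {"label": 0, "annotator_groups": ["LGBTQ+"], "ingroup": True},
        {"label": 1, "annotator_groups": ["Women"], "ingroup": False},
        {"label": 1, "annotator_groups": ["Asian"], "ingroup": False},
    ],
}

# A naive majority vote over all annotators calls this post toxic (2 vs 1),
# even though the ingroup annotator considered it benign.
labels = [a["label"] for a in example_record["annotations"]]
majority_label = int(sum(labels) > len(labels) / 2)
ingroup_labels = [a["label"] for a in example_record["annotations"] if a["ingroup"]]
print(majority_label, ingroup_labels)  # -> 1 [0]
```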
For example, the study notes that language ingroup annotators consider benign within their community may be labeled as toxic by outgroup annotators. For content targeting Black individuals, LGBTQ+ individuals, and women in particular, outgroup annotators sometimes rated posts as more toxic than ingroup annotators did. These discrepancies, the researchers argue, are crucial signals for building more sensitive AI.
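To surface such discrepancies at scale, one could compare mean toxicity labels between ingroup and outgroup annotators for each post. This is a generic analysis sketch over records shaped like the hypothetical one above, not the authors' code.

```python
from statistics import mean

def ingroup_outgroup_means(record):
    """Return (ingroup mean, outgroup mean) toxicity for one annotated post.

    Assumes the hypothetical record layout sketched above; a value is None
    if no annotator of that kind labeled the post.
    """
    ingroup = [a["label"] for a in record["annotations"] if a["ingroup"]]
    outgroup = [a["label"] for a in record["annotations"] if not a["ingroup"]]
    return (mean(ingroup) if ingroup else None,
            mean(outgroup) if outgroup else None)

def disagreement_rate(records, threshold=0.5):
    """Fraction of posts where the two annotator pools land on opposite
    sides of the toxicity threshold."""
    flips, counted = 0, 0
    for r in records:
        ig, og = ingroup_outgroup_means(r)
        if ig is None or og is None:
            continue
        counted += 1
        if (ig >= threshold) != (og >= threshold):
            flips += 1
    return flips / counted if counted else 0.0
```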
The dataset also incorporates conversational context, since real-world online interactions rarely happen in isolation. The researchers used large language models (LLMs) to generate realistic conversational scenarios around the sampled posts. This contextual information further challenges existing toxicity detection models: state-of-the-art tools, including OpenAI’s Moderation API and GPT-4o-mini, performed poorly on ModelCitizens, and their accuracy dropped even further on context-augmented posts.
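A minimal evaluation harness along these lines might score a moderation model on the same posts with and without their generated conversational context. The snippet below assumes the official OpenAI Python SDK's moderation endpoint and a list of (post, context, gold label) triples; it is a sketch of the idea, not the paper's evaluation code.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_toxic(text: str) -> int:
    """Binary toxicity prediction from the OpenAI moderation endpoint."""
    resp = client.moderations.create(model="omni-moderation-latest", input=text)
    return int(resp.results[0].flagged)

def accuracy(examples, with_context: bool) -> float:
    """Accuracy over (post, context, gold_label) triples, optionally
    prepending the LLM-generated conversational context to each post."""
    correct = 0
    for post, context, gold in examples:
        text = f"{context}\n{post}" if with_context else post
        correct += predict_toxic(text) == gold
    return correct / len(examples)

# examples = [...]  # (post, generated_context, gold_label) triples
# print(accuracy(examples, with_context=False), accuracy(examples, with_context=True))
```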
To address these limitations, the team also introduced two fine-tuned models, LLaMACITIZEN-8B and GEMMACITIZEN-12B, built on the LLaMA and Gemma architectures and trained on the ModelCitizens dataset. They outperformed baseline models by an average of 5.5%, with particularly large gains on context-augmented data.
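One plausible way to fine-tune an open model on such data is standard supervised fine-tuning over instruction-style prompt/label pairs. The formatting sketch below is an assumption about how that could look, not the authors' training recipe; the prompt template, label choice, and file name are all hypothetical.

```python
import json

PROMPT_TEMPLATE = (
    "Conversation context:\n{context}\n\n"
    "Post:\n{post}\n\n"
    "Question: Is this post toxic toward the {group} community? Answer yes or no."
)

def to_sft_example(record):
    """Turn one annotated post into a prompt/completion pair, using the
    ingroup majority label as the target (one choice among several the
    paper could have made)."""
    ingroup = [a["label"] for a in record["annotations"] if a["ingroup"]]
    label = "yes" if sum(ingroup) > len(ingroup) / 2 else "no"
    prompt = PROMPT_TEMPLATE.format(
        context=record.get("context", ""),
        post=record["post"],
        group=record["target_group"],
    )
    return {"prompt": prompt, "completion": label}

def write_sft_file(records, path="modelcitizens_sft.jsonl"):
    """Write JSONL that a standard SFT pipeline (e.g. TRL's SFTTrainer or a
    LoRA fine-tune with PEFT) could consume."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(to_sft_example(r)) + "\n")
```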
The research underscores the importance of community-informed annotation and modeling for creating more inclusive and equitable online safety tools. By prioritizing the diverse perspectives of targeted communities, the ModelCitizens dataset and the models developed from it represent a significant step towards AI that better understands and respects the complexities of online communication.