Large Language Models Struggle to Capture Human Nuance in Annotation

Large Language Models (LLMs) are increasingly being used to automate data annotation tasks in Natural Language Processing (NLP). While these models can reduce human effort, current evaluations often focus on their ability to predict the most common “ground truth” label. A new study reveals that LLMs are not as adept at capturing the more subtle, yet crucial, human disagreements that often arise in annotation.

Human annotators, when labeling text, don’t always agree. This disagreement isn’t necessarily due to errors; it can signal important nuances like subjectivity, ambiguity in language, or diverse perspectives stemming from different backgrounds. For instance, identifying hate speech can be challenging, with annotators sometimes disagreeing on whether a particular post crosses the line due to varying interpretations of “offensive language” or cultural context. Even seemingly objective tasks like part-of-speech tagging can lead to disagreement when words have multiple meanings, such as the word “duck” in the sentence “I saw her duck.”
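
One common way to make such disagreement usable is to keep the full distribution of annotator votes as a "soft" label rather than collapsing it to a majority vote. Below is a minimal sketch of that representation; the votes are hypothetical, not data from the paper.

```python
from collections import Counter

def soft_label(annotations):
    """Convert raw annotator votes into a probability distribution over labels."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical votes for "I saw her duck": three annotators tag "duck" as a verb
# (to crouch), two as a noun (the bird).
votes = ["verb", "verb", "verb", "noun", "noun"]
print(soft_label(votes))  # {'verb': 0.6, 'noun': 0.4}
```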

The research, posted as a preprint on arXiv, investigated whether LLMs can actually replicate these human disagreements. The study found that LLMs often struggle to predict these variations in human judgment, and that certain training choices, examined below, make the problem worse.

The researchers evaluated LLMs across various tasks, including hate speech detection and emotion classification, using two key metrics: variance correlation (how well the LLM’s uncertainty aligns with human uncertainty) and distributional alignment (how closely the LLM’s predicted distribution of labels matches the human distribution).
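
A plausible implementation of these two metrics is sketched below; the paper's exact formulations may differ (for example, it may use a different per-item uncertainty measure or a different divergence), so treat this as an illustration of the idea rather than the authors' code.

```python
import numpy as np
from scipy.stats import pearsonr, entropy

def uncertainty(dist):
    """Entropy of one item's label distribution, used here as the uncertainty score
    (an assumption; the paper's 'variance' measure may be defined differently)."""
    return entropy(np.asarray(dist, dtype=float))

def variance_correlation(human_dists, model_dists):
    """Pearson correlation between per-item human and model uncertainty."""
    h = [uncertainty(d) for d in human_dists]
    m = [uncertainty(d) for d in model_dists]
    return pearsonr(h, m)[0]

def distributional_alignment(human_dists, model_dists):
    """Mean total variation distance between human and model label distributions
    (lower = better aligned); the paper may use another divergence, e.g. KL."""
    gaps = [0.5 * np.abs(np.asarray(h, float) - np.asarray(m, float)).sum()
            for h, m in zip(human_dists, model_dists)]
    return float(np.mean(gaps))

# Hypothetical per-item label distributions over ["not hate", "hate"].
human = [[0.6, 0.4], [0.9, 0.1], [0.5, 0.5]]
model = [[0.8, 0.2], [0.95, 0.05], [0.7, 0.3]]
print(variance_correlation(human, model))
print(distributional_alignment(human, model))
```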

A key finding was that while LLMs trained using “reinforcement learning with verifiable rewards” (RLVR) tend to perform well on tasks with a single correct answer, this approach actually harms their ability to predict human disagreements. This is likely because RLVR pushes the models towards a deterministic output, overlooking the inherent variability in human perception.
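
A toy example shows why this matters: for an item where annotators split 60/40, a near-deterministic model and a well-calibrated one pick the same majority label, so majority-label accuracy cannot tell them apart, yet their distributional alignment differs sharply. The numbers here are illustrative, not results from the paper.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two label distributions (0 = identical)."""
    return 0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

# Hypothetical item: annotators split 60/40 between "not hate" and "hate".
human = [0.6, 0.4]
near_deterministic = [0.99, 0.01]  # collapses onto the majority label
calibrated_spread = [0.65, 0.35]   # mirrors the human split

# Both predictions agree with the majority label, yet their alignment differs:
print(total_variation(human, near_deterministic))  # 0.39, far from the human split
print(total_variation(human, calibrated_spread))   # 0.05, close to the human split
```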

Conversely, LLMs utilizing RLHF and employing “chain-of-thought” (CoT) reasoning – where the model explicitly outlines its thought process – showed improvement in predicting disagreements. This suggests that encouraging LLMs to “think step-by-step” can help them better understand and represent diverse human opinions.
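
In practice, such a setup might look like a prompt that asks the model to reason about plausible readings and then output a label distribution; the wording and output format below are illustrative assumptions, not the paper's actual prompts.

```python
def build_cot_prompt(text, labels):
    """Build a chain-of-thought prompt that asks for a distribution over labels.
    The phrasing is a hypothetical illustration, not the paper's prompt."""
    return (
        "Different annotators may reasonably disagree about the text below.\n"
        f"Text: {text}\n"
        f"Possible labels: {', '.join(labels)}\n"
        "Think step by step about how different readers might interpret the text, "
        "then estimate what fraction of annotators would choose each label.\n"
        "End your answer with one line per label in the form 'label: probability'."
    )

print(build_cot_prompt("I saw her duck.", ["verb", "noun"]))
```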

The study also explored other factors like the size of the LLM and the use of few-shot steering (providing examples within the prompt). While larger LLMs generally showed some improvement in capturing disagreement, this effect was less pronounced than when using RLHF with CoT. The impact of few-shot steering was found to be task-dependent, sometimes helping and sometimes hindering disagreement prediction.
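
Few-shot steering of this kind amounts to prepending a handful of worked examples, here shown together with their human label splits; both the example texts and the splits below are invented for illustration.

```python
def build_few_shot_prefix(examples):
    """Format (text, human label split) pairs as in-context examples.
    Both the texts and the splits are hypothetical."""
    blocks = []
    for text, split in examples:
        shown = ", ".join(f"{label}: {p:.0%}" for label, p in split.items())
        blocks.append(f"Text: {text}\nAnnotator split: {shown}")
    return "\n\n".join(blocks)

examples = [
    ("That referee was blind.", {"offensive": 0.25, "not offensive": 0.75}),
    ("Go back to where you came from.", {"offensive": 0.85, "not offensive": 0.15}),
]
print(build_few_shot_prefix(examples))
```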

The paper concludes that current evaluations of LLM annotators, which often overlook human disagreement, might be incomplete. The findings highlight the need for future work to focus on developing LLMs that can not only produce accurate labels but also understand and represent the rich spectrum of human opinions and disagreements that are often present in annotated data. This is crucial for building more reliable and nuanced AI systems, especially in subjective domains.