LongEmotion: A New Benchmark for Measuring Emotional Intelligence in Large Language Models

Researchers have developed “LongEmotion,” a benchmark designed to assess the emotional intelligence (EI) of large language models (LLMs) in long, complex interactions. Existing EI benchmarks often fall short because they focus on short, simple exchanges, failing to capture the nuances of real-world dialogue. LongEmotion aims to bridge this gap by introducing a diverse set of tasks that simulate extended, multi-turn interactions, requiring LLMs to maintain emotional coherence and demonstrate a deeper understanding of human emotions.

The LongEmotion benchmark comprises six key tasks (a schematic data-layout sketch follows the list):

  • Emotion Classification: Identifying the emotional category within a lengthy text, even when surrounded by irrelevant information.
  • Emotion Detection: Pinpointing a single unique emotion from a set of otherwise similar emotional expressions.
  • Emotion QA (Question Answering): Answering questions based on psychological literature, testing the model’s knowledge and application of emotional concepts.
  • Emotion Conversation: Simulating a multi-turn dialogue where the LLM acts as a psychological counselor, providing empathetic support.
  • Emotion Summary: Summarizing crucial aspects of a psychological pathology report, such as causes, symptoms, and treatment.
  • Emotion Expression: Generating a long-form emotional self-narrative in response to a given emotional context and psychometric assessment.
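
To make the task mix concrete, the sketch below shows one way an item in such a benchmark could be represented: a task type, a long input context, and a reference output. The class, field names, and task identifiers are illustrative assumptions for this digest, not the paper’s actual data format or loading code.

```python
from dataclasses import dataclass, field
from enum import Enum


class LongEmotionTask(Enum):
    """The six task types summarized above (string identifiers are illustrative)."""
    CLASSIFICATION = "emotion_classification"
    DETECTION = "emotion_detection"
    QA = "emotion_qa"
    CONVERSATION = "emotion_conversation"
    SUMMARY = "emotion_summary"
    EXPRESSION = "emotion_expression"


@dataclass
class LongEmotionExample:
    """One hypothetical benchmark item: a long context plus the expected output."""
    task: LongEmotionTask
    context: str                     # long input text, dialogue history, or pathology report
    question: str | None = None      # used by QA-style tasks
    reference: str | None = None     # gold label, answer, or reference summary
    prior_turns: list[str] = field(default_factory=list)  # earlier turns for the conversation task
```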

To tackle the challenges of long-context EI, the researchers introduce two frameworks: Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (COEM). Unlike traditional RAG, which relies on external knowledge bases, LongEmotion’s RAG method uses the conversation history itself as a dynamic source for retrieval. COEM goes further, dividing the context into manageable chunks and using collaborating AI agents to enrich and re-rank the retrieved information before the final emotional response is generated. This multi-agent approach allows for a more nuanced understanding and generation of emotions.
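
To make the history-as-knowledge-base idea concrete, here is a minimal sketch of retrieval over the conversation itself: past turns are chunked, scored against the current message, and the best chunks are folded into the prompt. The chunking scheme, the bag-of-words relevance score, and all function names are assumptions for illustration; the paper’s actual retriever and the COEM agents’ enrichment and re-ranking steps are not shown.

```python
from collections import Counter


def chunk_history(turns: list[str], chunk_size: int = 4) -> list[str]:
    """Split the running dialogue into fixed-size chunks of consecutive turns."""
    return [" ".join(turns[i:i + chunk_size]) for i in range(0, len(turns), chunk_size)]


def lexical_score(query: str, chunk: str) -> float:
    """Crude relevance score: bag-of-words overlap between query and chunk."""
    q, c = Counter(query.lower().split()), Counter(chunk.lower().split())
    return sum((q & c).values()) / (sum(q.values()) or 1)


def retrieve_from_history(turns: list[str], query: str, top_k: int = 3) -> list[str]:
    """Retrieve the chunks of the conversation itself that look most relevant
    to the current user message, instead of querying an external corpus."""
    chunks = chunk_history(turns)
    ranked = sorted(chunks, key=lambda ch: lexical_score(query, ch), reverse=True)
    return ranked[:top_k]


def build_prompt(turns: list[str], user_message: str) -> str:
    """Assemble a counselor-style prompt grounded in retrieved emotional context."""
    context_block = "\n".join(f"- {chunk}" for chunk in retrieve_from_history(turns, user_message))
    return (
        "Relevant earlier moments in this conversation:\n"
        f"{context_block}\n\n"
        f"User: {user_message}\n"
        "Respond empathetically, staying consistent with the emotions above."
    )


if __name__ == "__main__":
    history = [
        "I lost my job last month.",
        "I feel like a failure.",
        "My sister keeps telling me to just cheer up.",
        "Talking about it here helps a little.",
    ]
    print(build_prompt(history, "I'm dreading telling my parents about the job."))
```

A COEM-style pipeline would insert an extra stage between retrieval and generation, where auxiliary agents enrich and re-rank the retrieved chunks before the counselor model drafts its reply.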

Experiments on the LongEmotion benchmark showed that both RAG and COEM significantly improved the EI-related performance of various LLMs across most tasks. The researchers argue that such gains are crucial for practical, real-world EI applications, such as providing mental health support or serving as a more empathetic conversational partner. The study also includes a comparative analysis of different GPT model versions, revealing their varying strengths and weaknesses in handling long-context emotional interactions.