Large Language Models Need a New Evaluation Framework
đź“„ Full Paper
đź’¬ Ask
Large language models (LLMs) are changing the world, but evaluating their capabilities remains a challenge. Current approaches often rely on single-item assessments: ask a model one question per test objective and judge its response. This method fails to capture a model’s full understanding and can be gamed by models that have simply memorized answers from their training data.
A new paper, “StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation,” proposes a novel evaluation framework that goes beyond single-item assessments. The researchers, from the Chinese Academy of Sciences and ByteDance, call their method StructEval.
StructEval introduces a two-pronged approach:
- Deepen the Assessment: For each test objective, StructEval generates multiple instances aligned with different cognitive levels (e.g., remembering, understanding, applying) from Bloom’s Taxonomy. This ensures that a model demonstrates a thorough understanding of a topic, not just the ability to memorize specific answers.
- Broaden the Assessment: StructEval focuses on identifying the critical concepts related to a test objective and constructing instances based on a knowledge graph. This ensures that the model is not merely memorizing surface-level information but possesses a deeper understanding of the underlying concepts. (A rough sketch of how both prongs might fit together follows this list.)
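To make the two prongs concrete, here is a minimal Python sketch of how a structured test objective might be represented and expanded. The names (`TestInstance`, `StructuredObjective`, `expand_objective`, the `generate` callback) are hypothetical illustrations chosen for this post, not the paper’s actual code or API.

```python
from dataclasses import dataclass, field

# Bloom's Taxonomy levels used to "deepen" an objective.
BLOOM_LEVELS = [
    "remembering", "understanding", "applying",
    "analyzing", "evaluating", "creating",
]

@dataclass
class TestInstance:
    question: str
    answer: str
    bloom_level: str  # which cognitive level this item probes (deepen)
    concept: str      # which related concept it covers (broaden)

@dataclass
class StructuredObjective:
    objective: str                 # e.g. "the capital of France"
    instances: list = field(default_factory=list)

def expand_objective(objective: str, related_concepts: list, generate) -> StructuredObjective:
    """Deepen (one instance per Bloom level) and broaden (one instance
    per related concept). `generate` stands in for an LLM-backed
    question generator that returns a TestInstance; the paper's real
    pipeline will differ."""
    structured = StructuredObjective(objective)
    for level in BLOOM_LEVELS:        # deepen: vary the cognitive level
        structured.instances.append(generate(objective, level=level, concept=objective))
    for concept in related_concepts:  # broaden: cover neighboring concepts
        structured.instances.append(generate(objective, level="understanding", concept=concept))
    return structured
```

Grouping many instances under one objective is what lets the framework judge understanding of a topic rather than recall of a single answer.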
Think of it like this: instead of simply asking a student, “What is the capital of France?” StructEval would ask a series of questions:
- Remembering: “What is the capital of France?”
- Understanding: “Why is Paris the capital of France?”
- Applying: “If you were planning a trip to France, which city would you fly into?”
- Analyzing: “How has Paris’s role as the capital of France evolved over time?”
- Evaluating: “Compare Paris to other European capitals. What makes it unique?”
- Creating: “Imagine a new city that could replace Paris as the capital of France. Describe what it would be like.”
This multi-level, multi-concept approach is designed to be more robust and reliable. It helps to determine whether a model genuinely understands a concept or is simply relying on superficial knowledge.
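One simple way to operationalize that check, sketched below under my own assumptions rather than as the paper’s exact scoring rule, is to credit a model for an objective only when it answers correctly across the whole set of levels and concepts, so that one memorized answer is not enough.

```python
def objective_passed(results: dict, threshold: float = 1.0) -> bool:
    """`results` maps each generated instance (by Bloom level or concept)
    to whether the model answered it correctly. The objective counts as
    passed only if the fraction of correct answers meets `threshold`."""
    return sum(results.values()) / len(results) >= threshold

# A model that only memorized the seed question fails the objective.
results = {
    "remembering": True,     # "What is the capital of France?"
    "understanding": False,  # "Why is Paris the capital of France?"
    "applying": False,
    "analyzing": False,
    "evaluating": False,
    "creating": False,
}
print(objective_passed(results))  # False
```

With `threshold=1.0` every instance must be answered correctly; a looser threshold would trade strictness for tolerance of noisy generated questions.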
StructEval was tested on several widely used LLM benchmarks, including MMLU, ARC, and OpenBookQA. The results demonstrated that:
- StructEval effectively resists data contamination: Models that had memorized benchmark answers from their training data gained little advantage when tested on StructEval.
- StructEval provides more consistent and accurate assessment: The ranking of models remained stable even when the test data was contaminated.
- StructEval can automatically construct large-scale benchmarks: The framework can generate a wide range of test instances, making it more scalable and efficient than traditional benchmark creation methods.
The authors conclude that StructEval is a promising framework for assessing the capabilities of LLMs. It offers a more comprehensive, robust, and consistent way to evaluate LLMs, contributing to the development of trustworthy and reliable AI systems.
The research team behind StructEval is working to further develop the framework and expand its applications. They are also exploring the use of StructEval for evaluating other types of AI models, such as those used for code generation and image recognition. As LLMs continue to evolve and become increasingly powerful, new evaluation tools like StructEval will be crucial for ensuring that these technologies are developed and used responsibly.