EVALTREE: New Method Profiles AI Language Model Weaknesses with Hierarchical Trees
A new paper introduces EVALTREE, a novel method for automatically identifying and profiling the weaknesses of large language models (LLMs). This approach aims to provide actionable guidance for improving model performance by pinpointing specific capabilities where the model struggles.
The core idea behind EVALTREE is to construct a hierarchical “capability tree” for a given LLM benchmark dataset. Each node in this tree represents a specific capability, described in natural language, and is associated with a subset of benchmark instances that test this capability. The tree structure allows for varying levels of granularity, from broad categories to highly specific sub-capabilities.
For example, consider the MATH benchmark, which tests mathematical problem-solving skills. EVALTREE might create a tree where the root node represents “Mathematical Reasoning.” This node could have children like “Algebra” and “Geometry.” Further down the tree, “Geometry” might branch into “Analyzing Geometric Relationships Using Trigonometric Principles,” a more specific capability. The paper shows that GPT-4 mini achieves 75.1% accuracy on Calculating combinations and arrangements of elements but only 49.1% on Analyzing geometric relationships using trigonometric principles.
The key steps of EVALTREE are:
- Capability Annotation: An LLM is prompted to identify the specific capability required for each instance in the benchmark. For example, for a math problem involving trigonometry, the LLM might respond with “Applying trigonometric identities.”
- Capability Embedding: The annotated capabilities are then converted into vector representations (embeddings) using a sentence embedding model.
- Recursive Clustering-Based Construction: The instances are recursively clustered based on the similarity of their capability embeddings, building the hierarchical tree structure. The K-means clustering algorithm is used at each node to partition the instances into subsets corresponding to different sub-capabilities.
- Capability Description: Finally, each node is assigned a natural language description summarizing the capabilities of its children, ensuring interpretability.
Once the capability tree is built, EVALTREE evaluates the LLM’s performance at each node. To generate a weakness profile, EVALTREE extracts tree nodes with statistically low performance and takes their capability descriptions.
The authors demonstrate that EVALTREE outperforms existing methods for weakness profiling on the MATH and WildChat benchmarks. Specifically, EVALTREE identifies weaknesses more precisely and comprehensively. Moreover, training data collection guided by EVALTREE-identified weaknesses leads to improved LLM performance compared to other data collection strategies.
One key advantage of EVALTREE is its ability to uncover flaws in human-based evaluation practices. In experiments with Chatbot Arena, a platform that uses human voters to rank LLMs, EVALTREE revealed instances where models that provided toxic responses were preferred over models that refused to answer, highlighting potential biases in the evaluation process.
The authors release their code and an interactive web interface, allowing practitioners to explore the capability trees built by EVALTREE. This promises to be a valuable tool for researchers and developers seeking to understand and improve LLM performance.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.