OmniGIRL: New Benchmark Exposes Limitations of LLMs in Resolving Diverse GitHub Issues
Large language models (LLMs) are increasingly being used to automate the resolution of issues reported on GitHub. However, current benchmarks used to evaluate these models have several limitations: they typically focus on a single programming language, cover a narrow range of domains, and rely solely on textual information, ignoring potentially important multimodal information such as images.
To address these limitations, researchers at Sun Yat-sen University and Huawei Cloud Computing Technologies have introduced OmniGIRL, a new benchmark designed to evaluate the ability of LLMs to resolve GitHub issues in a more comprehensive and realistic way. The paper describing OmniGIRL is available on arXiv: https://arxiv.org/abs/2505.04606.
OmniGIRL is multilingual, multimodal, and multi-domain, encompassing 959 task instances collected from 15 repositories across four programming languages (Python, JavaScript, TypeScript, and Java) and eight different domains. The benchmark includes issues with textual descriptions, as well as images and links to online code execution platforms, reflecting the diverse types of information developers use to describe problems.
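To make the composition concrete, a single task instance can be pictured as bundling the repository snapshot, the issue content (text, images, links), and a reference fix. The sketch below is illustrative only; the field names are our guesses based on the paper's description, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaskInstance:
    """Hypothetical shape of one OmniGIRL task instance (field names are illustrative)."""
    repo: str                       # e.g. "tailwindlabs/tailwindcss"
    base_commit: str                # repository snapshot the issue was filed against
    language: str                   # "Python" | "JavaScript" | "TypeScript" | "Java"
    domain: str                     # one of the benchmark's eight domains
    issue_text: str                 # the issue's textual description
    image_urls: list[str] = field(default_factory=list)      # screenshots, if any
    external_links: list[str] = field(default_factory=list)  # e.g. online reproductions
    gold_patch: str = ""            # reference fix used for evaluation
```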
Concrete Examples of OmniGIRL’s Diversity:
- Multilingual: An issue in a Python repository might involve a bug in a data science library, while an issue in a Java repository could relate to performance problems in a large-scale system.
- Multimodal: An issue description might include a screenshot of an error message, or a link to a website where the bug can be reproduced. For instance, a user might share a link to Tailwind Play, an online playground for reproducing and debugging Tailwind CSS code (a sketch of passing such multimodal input to a model follows this list).
- Multi-domain: The benchmark includes issues from areas such as code quality (e.g., code formatting), web development (e.g., UI components), network tools and developer utilities, and statistical modelling.
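As a minimal sketch of the multimodal case, the snippet below feeds an issue's text plus a screenshot URL to a vision-capable model through the OpenAI Python SDK. The prompt wording and helper function are ours, not the paper's evaluation harness.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_issue(issue_text: str, screenshot_url: str) -> str:
    """Ask a multimodal model to interpret an issue that includes a screenshot."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Summarize the bug shown below.\n\nIssue:\n{issue_text}"},
                # The image is passed alongside the text, not transcribed into it.
                {"type": "image_url", "image_url": {"url": screenshot_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```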
The researchers evaluated several state-of-the-art LLMs on OmniGIRL, including GPT-4o and Claude 3.5 Sonnet, using an approach called Agentless-X, which follows a hierarchical process of code-location identification followed by patch generation. The results revealed that even the best-performing models struggled: GPT-4o, the strongest model overall, resolved only 8.6% of the issues, and Claude 3.5 Sonnet, the best performer on issues requiring image understanding, resolved only 10.5% of those.
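An Agentless-style pipeline is a fixed sequence of LLM calls rather than an autonomous agent loop. The following is our schematic of hierarchical localization followed by repair, under the assumption that `llm` is any chat-completion callable; the prompts and helpers are placeholders, not the paper's implementation.

```python
def resolve_issue(issue: str, repo_files: dict[str, str], llm) -> str:
    """Schematic Agentless-style pipeline: localize hierarchically, then patch."""
    # Step 1: narrow down from the whole repository to suspicious files.
    file_listing = "\n".join(repo_files)
    candidate_files = llm(
        f"Issue:\n{issue}\n\nRepository files:\n{file_listing}\n"
        "List the files most likely to need changes, one per line."
    ).splitlines()

    # Step 2: within those files, narrow down to the functions/lines to edit.
    locations = []
    for path in candidate_files:
        if path not in repo_files:
            continue  # models may hallucinate paths; skip anything unknown
        locations.append((path, llm(
            f"Issue:\n{issue}\n\nFile {path}:\n{repo_files[path]}\n"
            "Identify the functions or lines that need editing."
        )))

    # Step 3: generate a patch for the localized edit sites.
    context = "\n\n".join(f"{path}:\n{loc}" for path, loc in locations)
    return llm(
        f"Issue:\n{issue}\n\nEdit locations:\n{context}\n"
        "Produce a unified diff that fixes the issue."
    )
```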
Key Findings:
- Current LLMs demonstrate limited overall performance across multiple programming languages.
- LLMs have limited capabilities in resolving issues that require visual information. For instance, they can struggle with understanding images of unexpected output.
- LLMs often fail to produce outputs in the required format, and they tend to modify only a single file even when the fix requires changes across multiple files (see the validation sketch after this list).
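To illustrate the last finding, an evaluation harness can reject malformed patches and flag single-file fixes by inspecting the diff headers. This is a hedged sketch of that kind of check, not OmniGIRL's actual validator.

```python
import re

def patched_files(diff_text: str) -> list[str]:
    """Return the paths modified by a unified diff (headers like '+++ b/path')."""
    return re.findall(r"^\+\+\+ b/(\S+)", diff_text, flags=re.MULTILINE)

def check_patch(diff_text: str, expected_files: set[str]) -> None:
    files = patched_files(diff_text)
    if not files:
        print("Rejected: output is not a parseable unified diff.")
    elif not expected_files.issubset(files):
        missing = sorted(expected_files - set(files))
        print(f"Partial fix: touches {files}, but the gold patch also changes {missing}.")
    else:
        print("Patch format OK and covers all required files.")
```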
The creation of OmniGIRL provides a valuable resource for researchers seeking to improve the ability of LLMs to automate GitHub issue resolution. The benchmark’s diversity and complexity will help drive future research into more effective and robust methods for tackling this important task.
The dataset and related resources are publicly available at: https://github.com/DeepSoftwareAnalytics/OmniGIRL.