HackerRank-ASTRA: A New Benchmark for Evaluating Large Language Models in Multi-File Software Projects
Large language models (LLMs) are rapidly changing software development, enabling tasks like code generation and bug fixing. However, evaluating their real-world effectiveness remains challenging. Existing benchmarks often focus on isolated coding problems or specific libraries, neglecting the complexities of real-world, multi-file software projects. A new benchmark, HackerRank-ASTRA, tackles this gap by evaluating LLMs on 65 diverse, multi-file, project-based coding problems drawn from HackerRank’s proprietary library. A research paper published on arXiv details the benchmark and its initial findings.
Real-World Scenarios, Rigorous Evaluation
HackerRank-ASTRA’s strength lies in its focus on realistic project-based scenarios. Problems aren’t isolated functions; they span multiple files and mimic tasks like new feature development within widely used web frameworks such as React.js, Angular.js, Node.js, and Django. Each problem ships with roughly 12 source code files on average, plus a problem statement of around 718 characters, requiring models to reason about context and cross-file dependencies.
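To make that setup concrete, here is a minimal sketch of how such a multi-file problem might be assembled into a single prompt for a model. The directory layout, file extensions, and the `build_prompt` helper are illustrative assumptions, not part of the benchmark’s published harness.

```python
from pathlib import Path

def build_prompt(project_dir: str, problem_statement: str) -> str:
    """Concatenate a problem statement with every source file in a project.

    Hypothetical helper: ASTRA-style problems ship roughly a dozen source
    files plus a short statement, so the model must reason across files
    rather than complete a single isolated function.
    """
    sections = [f"## Problem\n{problem_statement}"]
    for path in sorted(Path(project_dir).rglob("*")):
        if path.is_file() and path.suffix in {".js", ".jsx", ".ts", ".py", ".html", ".css"}:
            sections.append(f"## File: {path.relative_to(project_dir)}\n{path.read_text()}")
    return "\n\n".join(sections)

# Usage (hypothetical React problem directory):
# prompt = build_prompt("problems/react-product-filter",
#                       "Add a price filter to the product list page.")
```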
The evaluation goes beyond simple accuracy measures. The benchmark runs each model 32 times per problem, producing a distribution of results that captures not just correctness but also consistency. The metrics, sketched in code after the list, include:
- Mean Score: The average percentage of test cases passed across the 32 runs for each problem, aggregated across all problems.
- Mean Pass@1: The percentage of runs where the model achieved a perfect score (all test cases passed) on a given problem, aggregated across all problems.
- Consistency (Median Standard Deviation): Measures the variability in model performance across the 32 runs for each problem, providing insights into its reliability. A lower median standard deviation reflects higher consistency.
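As a rough illustration of these three metrics, the sketch below aggregates them from a matrix of per-run test-case pass rates. The `pass_rates` structure and function name are assumptions for illustration; the paper’s actual evaluation harness is not published in this form.

```python
import statistics

def astra_metrics(pass_rates):
    """Aggregate benchmark metrics from per-problem, per-run pass rates.

    pass_rates: dict mapping problem_id -> list of 32 floats in [0, 1],
    each float being the fraction of test cases passed in one run.
    """
    mean_scores = []  # per-problem average pass rate
    pass_at_1 = []    # per-problem fraction of runs with a perfect score
    std_devs = []     # per-problem standard deviation across runs

    for runs in pass_rates.values():
        mean_scores.append(statistics.mean(runs))
        pass_at_1.append(sum(r == 1.0 for r in runs) / len(runs))
        std_devs.append(statistics.stdev(runs))

    return {
        "mean_score": statistics.mean(mean_scores),   # averaged over problems
        "mean_pass@1": statistics.mean(pass_at_1),    # averaged over problems
        "consistency": statistics.median(std_devs),   # lower = more consistent
    }

# Example with two hypothetical problems and 4 runs each (the benchmark uses 32):
example = {
    "p1": [1.0, 1.0, 0.8, 1.0],
    "p2": [0.5, 0.6, 0.5, 0.7],
}
print(astra_metrics(example))
```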
Initial Findings and Model Performance
The initial evaluation of HackerRank-ASTRA (version 1) included five state-of-the-art LLMs: o1, o1-preview, Claude-3.5-Sonnet-1022, GPT-4o-0513, and Gemini-1.5-pro. The top three models (o1, o1-preview, and Claude-3.5-Sonnet-1022) achieved comparable average scores (around 75%), showing no statistically significant difference in overall performance. However, Claude-3.5-Sonnet-1022 stood out with remarkably high consistency (median standard deviation of 0.0497), demonstrating better reliability for real-world application.
Skill-Based Analysis and Formatting Impacts
The benchmark categorized problems into 10 primary skill domains (e.g., React Frontend Development, Node.js Backend Development) and 34 sub-skill categories (e.g., API Integration, Data Filtering). Analysis revealed that the best-performing model varied depending on the specific skills tested. For example, while o1, o1-preview, and Claude-3.5-Sonnet-1022 performed similarly on common skills, Claude-3.5-Sonnet-1022 excelled on problems involving less frequently occurring skills.
Interestingly, the study also highlights a significant impact of output format: asking models to return their multi-file solutions in XML consistently outperformed JSON, suggesting potential biases in LLM training data or inherent strengths of XML for structuring multi-file code output.
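The paper’s exact response schemas aren’t reproduced here, but the difference can be illustrated with a hypothetical two-file solution wrapped in each format. The key and tag names below are assumptions; the point is only that JSON forces code into escaped strings, while an XML-style wrapper keeps file bodies closer to verbatim, which may be easier for models to emit reliably.

```python
import json
from xml.sax.saxutils import escape

# Hypothetical two-file solution a model might return for a React task.
files = {
    "src/App.jsx": 'export default function App() {\n  return <h1>"Hello"</h1>;\n}\n',
    "src/index.css": "h1 { color: navy; }\n",
}

# JSON wrapping: code must survive as escaped strings (\n, \"), which long
# multi-file outputs frequently get wrong.
json_output = json.dumps(
    {"files": [{"path": p, "content": c} for p, c in files.items()]}, indent=2
)

# XML-style wrapping: each file body sits in its own element, with only a few
# characters (&, <, >) needing escaping.
xml_output = "<solution>\n" + "".join(
    f'  <file path="{p}">{escape(c)}</file>\n' for p, c in files.items()
) + "</solution>"

print(json_output)
print(xml_output)
```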
Limitations and Future Directions
The current version of HackerRank-ASTRA focuses primarily on front-end development. Future iterations plan to incorporate back-end tasks and a broader range of technologies. The researchers also aim to add agent-based approaches that let LLMs iteratively refine their solutions, to evaluate more models, and to open the benchmark to community contributions.
Conclusion
HackerRank-ASTRA provides a much-needed, rigorous benchmark for evaluating LLMs in realistic multi-file software projects. The findings highlight the importance of assessing both correctness and consistency, and underscore the need for further research in improving LLMs’ reliability and handling of real-world complexities. The open-sourcing of the benchmark’s questions promises wider collaboration and faster progress in this crucial area of AI development.