Large Language Models Tackle Exception Handling in Code Generation

🔊

💬 Ask

Large language models (LLMs) are revolutionizing software development, generating code from natural language descriptions. However, a critical aspect often overlooked is exception handling—the mechanisms that prevent software crashes when unexpected errors occur. A new paper, “Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework,” addresses this gap by proposing a novel multi-agent framework that significantly improves the robustness of LLM-generated code.

The paper highlights three major issues in current LLMs’ ability to handle exceptions: insensitivity in detecting fragile code (code prone to errors), inaccurate capture of exception blocks, and distorted handling solutions. Imagine an LLM generating code to read a file: it might fail silently if the file doesn’t exist, instead of gracefully handling the error. This is an example of poor exception handling.

To address these issues, the researchers developed Seeker, a framework that leverages a multi-agent system inspired by expert developer strategies. Seeker is not a standalone LLM, but a framework designed to work with existing LLMs to improve their exception-handling capabilities. It decomposes the task into five distinct agent roles:

Scanner: Breaks down the code into smaller, manageable units.
Detector: Identifies potentially fragile code segments.
Predator: Pinpoints specific exceptions that might be thrown within these segments.
Ranker: Prioritizes the exceptions based on their likelihood and impact, selecting the most critical ones to address first.
Handler: Generates the appropriate try-catch blocks and error-handling logic.

Crucially, Seeker uses a Common Exception Enumeration (CEE), a curated database of common exception types and their associated handling strategies. The CEE provides a standardized and interpretable knowledge base for the agents, enabling better understanding of exception types and appropriate responses. This is particularly beneficial for less frequent or custom exceptions which are otherwise difficult for LLMs to handle. Further, Seeker incorporates a novel Deep-RAG (Deep Retrieval-Augmented Generation) algorithm. This algorithm helps improve retrieval in complex inheritance trees of exceptions, a common challenge for LLMs.

The researchers evaluated Seeker’s performance on a dataset of 750 real-world Java code snippets prone to exception-handling issues. They compared it against several baselines, including LLMs prompted directly for exception handling, traditional retrieval augmented generation (RAG) methods, and state-of-the-art techniques like KPC and Nexgen. Seeker significantly outperformed all baselines across multiple metrics including:

Automated Code Review Score (ACRS): A measure of code quality, showing Seeker generated considerably cleaner and more robust code.
Coverage (COV) and Coverage Pass (COV-P): These metrics demonstrate Seeker’s accuracy and effectiveness in identifying and handling problematic code segments.
Accuracy (ACC): Showing Seeker’s ability to correctly identify the types of exceptions.

Moreover, Seeker demonstrated robustness over time and across various code complexities, maintaining high performance levels even with more complex projects. Ablation studies also confirmed the importance of each agent within the framework. Finally, the study shows that using more advanced LLMs as the base model further enhances Seeker’s performance, highlighting the importance of the underlying LLM’s capabilities.

In conclusion, Seeker provides a promising approach to address the limitations of LLMs in exception handling. Its multi-agent framework, leveraging a structured knowledge base (CEE) and improved retrieval techniques (Deep-RAG), significantly improves the robustness and reliability of LLM-generated code. The results suggest a path towards more reliable and maintainable software generated by LLMs.

AI Papers Reader

Personalized digests of latest AI research

Large Language Models Tackle Exception Handling in Code Generation

Chat about this paper