Llama 3 Gets a Software Engineering Upgrade Through Reinforcement Learning

Meta AI researchers have unveiled a new approach called SWE-RL that significantly enhances the ability of Large Language Models (LLMs) to tackle real-world software engineering tasks. The technique, detailed in a new paper, uses reinforcement learning (RL) to train LLMs on a massive corpus of open-source software evolution data.

The core idea behind SWE-RL is to enable LLMs to learn from the entire lifecycle of software projects, including code snapshots, code changes, issues, and pull requests on platforms like GitHub. This is a departure from previous RL methods that primarily focused on competitive coding or math problems, which often rely on synthetic datasets or execution feedback.

“Instead of directly giving the model a ‘right’ or ‘wrong’ based on whether the code executes correctly, we’re giving it a more nuanced signal based on how similar its proposed code change is to a real-world fix submitted by a human developer,” explains Dr. Sida Wang, a researcher at Meta AI. “This allows the model to learn from a much broader range of scenarios and develop a better understanding of the reasoning process involved in software development.”

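To make that signal concrete, here is a minimal sketch in Python. It uses the standard-library difflib to score how close a predicted patch is to the developer’s fix; the function name and the sample patches are invented for illustration, and the paper’s actual reward computation may differ in its details.

```python
import difflib

def patch_similarity_reward(predicted_patch: str, oracle_patch: str) -> float:
    """Score a model-generated patch against the human-written ("oracle") fix.

    Returns a value in [0, 1]: 1.0 means the two patches are identical,
    values near 0 mean they share almost nothing.
    """
    return difflib.SequenceMatcher(None, predicted_patch, oracle_patch).ratio()

# A pass/fail execution signal would treat both comparisons below the same way;
# a similarity reward gives partial credit for the near-miss.
predicted = "-    total = count + label\n+    total = count + int(label)\n"
oracle    = "-    total = count + label\n+    total = count + int(label or 0)\n"
print(patch_similarity_reward(predicted, oracle))   # close to 1.0: a near-miss
print(patch_similarity_reward("pass", oracle))      # much lower: unrelated edit
```
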
Here’s how SWE-RL works in practice:

  1. Data Collection: The researchers curate a large dataset from GitHub, containing information about software issues, code contexts (the relevant code snippets), and “oracle patches” (the actual code fixes submitted by developers).
  2. Policy LLM: A policy LLM (in this case, Llama 3) is given the issue description and code context, and its task is to generate a code change that resolves the issue. Imagine the model being shown a bug report from a Python library, such as “TypeError: unsupported operand type(s) for +: 'int' and 'str'”, along with the relevant code. It analyzes the code and proposes a fix in a SEARCH/REPLACE format that spells out which lines to replace and what to replace them with (see the sketch after this list).
  3. Reward Calculation: The LLM’s generated code change (the “predicted patch”) is compared to the actual change submitted by a developer (the “oracle patch”), and a similarity score is calculated, much as in the sketch above. If the format of the LLM’s response is incorrect, a strong negative reward (-1) is applied instead.
  4. Reinforcement Learning: The reward is used to update the weights of the LLM, encouraging it to generate code changes that are more similar to real-world fixes. This process is repeated over thousands of iterations (a simplified update step is sketched below).
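
To make the SEARCH/REPLACE idea from step 2 concrete, here is a minimal sketch of how such an edit could be parsed and applied. The delimiters and helper function below are illustrative assumptions rather than the paper’s exact format; the principle is that the model names the code to find and the code to put in its place, and a malformed response is rejected outright, which is where the -1 penalty from step 3 would apply.

```python
import re

# Hypothetical response format; these markers are an assumption for the sketch,
# not necessarily the exact delimiters used by SWE-RL.
EDIT_PATTERN = re.compile(
    r"<<<<<<< SEARCH\n(?P<search>.*?)\n=======\n(?P<replace>.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)

def apply_search_replace(source: str, response: str) -> str | None:
    """Apply every SEARCH/REPLACE block in the model's response to `source`.

    Returns the edited file, or None when the response is malformed or a
    SEARCH block does not appear in the file (the case that earns -1).
    """
    edits = list(EDIT_PATTERN.finditer(response))
    if not edits:
        return None
    for edit in edits:
        search, replace = edit.group("search"), edit.group("replace")
        if search not in source:
            return None
        source = source.replace(search, replace, 1)
    return source

buggy = "def total(count, label):\n    return count + label\n"
response = (
    "<<<<<<< SEARCH\n"
    "    return count + label\n"
    "=======\n"
    "    return count + int(label)\n"
    ">>>>>>> REPLACE"
)
print(apply_search_replace(buggy, response))
```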

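Step 4 can be pictured with an equally simplified update rule. The sketch below applies a group-normalized policy-gradient step to a batch of sampled patches; the one-parameter stand-in policy, the hyperparameters, and the exact normalization are assumptions for illustration, not Llama 3 or the paper’s training code.

```python
import torch

def rl_update(logprobs: torch.Tensor, rewards: torch.Tensor,
              optimizer: torch.optim.Optimizer) -> None:
    """One simplified policy-gradient step over a group of sampled patches.

    logprobs: log-probability the policy assigned to each sampled patch,
              shape (group_size,), with gradients attached.
    rewards:  similarity-based reward for each patch (or -1 for bad format).
    """
    # Normalize rewards within the group so "better than the other samples"
    # becomes the learning signal (an assumed baseline, not the paper's code).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    loss = -(advantages * logprobs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Toy usage: pretend a one-parameter "policy" produced four candidate patches.
theta = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.1)
logprobs = theta * torch.tensor([1.0, 2.0, 3.0, 4.0])  # stand-in log-probs
rewards = torch.tensor([0.8, -1.0, 0.95, 0.4])         # similarity scores / penalty
rl_update(logprobs, rewards, optimizer)
```
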
The resulting model, named Llama3-SWE-RL-70B, achieved a 41.0% solve rate on SWE-bench Verified, a human-verified benchmark of real-world GitHub issues. This puts it in the same range as leading proprietary LLMs, even though it was trained solely on open-source data.

One particularly exciting finding is that Llama3-SWE-RL-70B demonstrates improved general reasoning skills beyond software engineering. It shows better performance on tasks like function coding, library use, mathematics, and general language understanding, suggesting that the RL process on software evolution data leads to more generalizable reasoning abilities.

“It’s like learning to fix cars and suddenly understanding how all machines work,” notes Dr. Wang. “By training the LLM to solve specific software engineering problems, we’re inadvertently improving its ability to reason about other types of problems as well.”

This work opens up new avenues for improving the capabilities of LLMs and has the potential to revolutionize the field of software engineering.