SWEET-RL: New Algorithm Helps LLMs Excel at Collaborative Tasks
Large language models (LLMs) are rapidly becoming capable decision-making agents, but mastering tasks that require multiple turns of interaction with humans has remained a challenge. A new research paper proposes a novel reinforcement learning (RL) algorithm, SWEET-RL, that significantly improves LLMs’ ability to collaborate with humans on complex reasoning tasks.
The key innovation lies in a more effective way to train LLMs with reinforcement learning. Traditional RL methods struggle with credit assignment, meaning it is hard to determine which actions in a long sequence led to a good or bad outcome. SWEET-RL addresses this by exploiting information that is available during training but not during real-world interactions: the paper introduces a "critic" model that sees this privileged information (such as the hidden task specification) and uses it to provide turn-by-turn rewards, leading to better performance.
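As a rough illustration of the idea (not the paper's exact training objective), a critic that can read the hidden task specification can score each agent turn, and the change in that score can serve as a turn-level reward signal. The names and the score-difference heuristic below are assumptions for the sketch only.

```python
# Minimal sketch of critic-guided turn-level credit assignment.
# The critic interface (score_turn), the Turn structure, and the
# score-difference heuristic are illustrative, not the paper's method.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    agent_message: str   # what the policy LLM said or produced this turn
    user_reply: str      # the (simulated) human's response

def turn_advantages(
    history: List[Turn],
    hidden_spec: str,                                   # privileged info, training time only
    score_turn: Callable[[str, str, str], float],       # critic: (context, hidden_spec, action) -> score
) -> List[float]:
    """Score each agent turn with a critic that also sees the hidden task spec,
    and treat the change in score as a rough per-turn reward."""
    advantages = []
    context = ""
    prev_score = 0.0
    for turn in history:
        score = score_turn(context, hidden_spec, turn.agent_message)
        advantages.append(score - prev_score)           # how much this turn helped, per the critic
        prev_score = score
        context += f"\nAgent: {turn.agent_message}\nUser: {turn.user_reply}"
    return advantages
```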
To evaluate SWEET-RL, the researchers introduced a new benchmark called ColBench, which comprises a variety of collaborative tasks, including backend programming and frontend website design. In these tasks, an LLM agent interacts with a “human simulator” (another LLM, but one that has access to ground truth information for the task) to create a final product that meets specific requirements.
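To make the setup concrete, here is a minimal sketch of what one such collaboration episode might look like. The function names, the "FINAL:" submission convention, and the evaluation hook are assumptions for illustration, not ColBench's actual interface.

```python
# Hypothetical sketch of a ColBench-style collaboration episode.
# agent_llm, simulator_llm, and evaluate_artifact are placeholders for whatever
# models and task checker a benchmark like this would plug in.
def run_episode(task_description, ground_truth, agent_llm, simulator_llm,
                evaluate_artifact, max_turns=10):
    transcript = [{"role": "user", "content": task_description}]
    artifact = None
    for _ in range(max_turns):
        # The agent either asks a clarifying question or proposes a solution.
        agent_msg = agent_llm(transcript)
        transcript.append({"role": "assistant", "content": agent_msg})
        if agent_msg.strip().startswith("FINAL:"):       # assumed submission convention
            artifact = agent_msg.strip().removeprefix("FINAL:").strip()
            break
        # The simulator answers using ground truth the agent never sees.
        reply = simulator_llm(transcript, ground_truth)
        transcript.append({"role": "user", "content": reply})
    # The task reward is only computed at the end, from the final artifact.
    reward = evaluate_artifact(artifact, ground_truth) if artifact else 0.0
    return transcript, reward
```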
Concrete Examples:
- Backend Programming: The LLM might be tasked with writing a Python function to process events in a user’s life and summarize them. The LLM has to ask the simulated human clarifying questions to understand the details, constraints, and edge cases. SWEET-RL helps the LLM determine which questions are most useful and which code changes lead to outputs that pass the task’s unit tests (a simple scoring harness of this kind is sketched after this list).
- Frontend Design: The LLM could be asked to design a website for a shop with a green palette. The LLM would generate HTML snippets, render them as web pages, and receive feedback from the simulator about how well the design matches the desired look and feel. SWEET-RL enables the LLM to make better design choices based on the simulator feedback, for example, choosing the right color palette or text formatting.
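For the backend programming setting, the final reward ultimately comes from running unit tests against the submitted code. The following is a hedged sketch of such a scoring harness; the target function name summarize_events, the toy tests, and the unsandboxed exec-based approach are purely illustrative.

```python
# Rough sketch of scoring generated backend code with unit tests.
# A real harness would sandbox execution; this version does not.
def run_unit_tests(generated_code: str, test_cases) -> float:
    """Execute the candidate code and return the fraction of tests it passes."""
    namespace = {}
    try:
        exec(generated_code, namespace)          # load the candidate function
    except Exception:
        return 0.0
    fn = namespace.get("summarize_events")       # hypothetical target function name
    if fn is None:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(test_cases)

# Toy usage (purely illustrative):
tests = [((["born 1990", "graduated 2012"],), "2 events"),
         (([],), "0 events")]
score = run_unit_tests(
    "def summarize_events(events):\n    return f'{len(events)} events'", tests)
```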
The results show that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms. Impressively, a Llama-3.1-8B model trained with SWEET-RL was able to match or even exceed the performance of GPT-4o on these realistic collaborative tasks.
“We find that the use of asymmetric information during training and appropriate learning objectives result in a superior multi-turn agent,” the researchers state in their paper.
SWEET-RL’s success demonstrates the power of carefully designed RL algorithms that leverage LLMs’ reasoning capabilities, and it highlights the importance of effective credit assignment in multi-turn interaction tasks. This could lead to more capable and collaborative AI assistants for activities such as coding and content creation. The code and data for ColBench and SWEET-RL are publicly available to facilitate further research in this area.