
KODCODE: A New Synthetic Dataset for Training Large Language Models to Code

A team of researchers from Microsoft GenAI, the University of Washington, and the University of Texas at Austin have introduced KODCODE, a novel synthetic dataset designed to improve the training of Large Language Models (LLMs) for coding tasks. The paper detailing KODCODE, available on arXiv, addresses the critical challenge of acquiring high-quality, verifiable training data in coding, a problem that has hampered the development of truly robust coding LLMs.

Existing code-focused datasets often fall short in one of two key areas: breadth of coverage or verifiable correctness. Many datasets focus on specific coding problems, lacking the diversity needed to train models across various levels of difficulty and coding domains. Others, while diverse, lack rigorous verification of solution correctness, hindering the reliability of training. KODCODE tackles both issues head-on.

KODCODE is built using a three-step pipeline:

1. Coding Question Synthesis: This stage generates a wide range of coding questions using twelve distinct methods rather than a single one, drawing on algorithmic problem sets (LeetCode, Codeforces), the technical documentation of popular Python libraries, and several LLMs prompted to produce new questions. This ensures breadth in difficulty, topic, and question type. For example, one method starts with a simple stub such as “Write a Python function that…” and has an LLM complete the query; another uses existing coding assessment datasets as seed corpora, prompting an LLM to generate similar yet distinct questions.

2. Solution & Test Generation: For each generated question, the pipeline automatically generates solutions and corresponding unit tests. A key innovation is the use of a self-verification mechanism. The generated solution and unit tests are executed. If the tests fail, additional attempts are made to generate a correct solution and tests. This process dramatically increases the reliability of the dataset by ensuring each question-solution-test triplet is verified, unlike many existing synthetic datasets.

3. Post-training Data Synthesis: To further improve the quality and diversity of the data, a post-processing step rewrites questions into different formats and generates chain-of-thought (CoT) responses with a reasoning model, producing additional data suitable for fine-tuning LLMs with techniques such as reinforcement learning.
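To make the pipeline concrete, here is a minimal sketch of how steps 1 and 2 might fit together, assuming an LLM client and a pytest-based execution sandbox. The helper names (`call_llm`, `generate_question`, `generate_solution_and_tests`, `passes_tests`) and the retry budget are illustrative placeholders, not KODCODE’s actual code.

```python
import subprocess
import tempfile
from pathlib import Path

MAX_ATTEMPTS = 3  # illustrative retry budget for the self-verification loop


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM generates questions, solutions, and tests."""
    raise NotImplementedError("plug in your model client here")


def generate_question(seed: str) -> str:
    # Step 1: turn a seed (library doc snippet, contest problem, simple stub)
    # into a self-contained coding question.
    return call_llm(f"Write a self-contained Python coding question inspired by:\n{seed}")


def generate_solution_and_tests(question: str) -> tuple[str, str]:
    # Step 2a: ask the model for a candidate solution and a pytest-style test file.
    solution = call_llm(f"Solve this question with a Python function:\n{question}")
    tests = call_llm(f"Write pytest unit tests for this question:\n{question}")
    return solution, tests


def passes_tests(solution: str, tests: str) -> bool:
    # Step 2b: execute the solution and tests in an isolated directory;
    # pytest's exit code tells us whether the triplet is verified.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(solution)
        Path(tmp, "test_solution.py").write_text("from solution import *\n" + tests)
        result = subprocess.run(["pytest", "-q", tmp], capture_output=True)
        return result.returncode == 0


def make_verified_triplet(seed: str) -> dict | None:
    question = generate_question(seed)
    for _ in range(MAX_ATTEMPTS):
        solution, tests = generate_solution_and_tests(question)
        if passes_tests(solution, tests):
            return {"question": question, "solution": solution, "tests": tests}
    return None  # discard questions that never verify
```

The real pipeline uses richer prompting strategies and a more careful sandbox, but the structure described in the paper is the same: a triplet enters the dataset only after its tests actually pass, otherwise it is regenerated or dropped.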

The resulting KODCODE dataset contains 447,000 verified question-solution-test triplets. Fine-tuning experiments on standard coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, LiveCodeBench) show that models trained on KODCODE achieve state-of-the-art performance, surpassing models such as Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.

The researchers also conducted comprehensive analyses of the dataset’s quality, including contamination checks against existing benchmarks to ensure that generated questions do not duplicate benchmark problems. The results demonstrate KODCODE’s effectiveness and broad coverage.
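The summary does not reproduce the authors’ decontamination procedure, but a common way to run such a check is n-gram overlap against benchmark prompts. The sketch below is an illustrative assumption along those lines; the normalization, n-gram size, and function names are hypothetical rather than the paper’s exact method.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't hide overlap."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    # Word-level n-grams of the normalized text.
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(question: str, benchmark_prompts: list[str], n: int = 10) -> bool:
    """Flag a synthetic question if it shares any n-gram with a benchmark prompt."""
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(prompt, n) for prompt in benchmark_prompts)


# Usage: keep only synthetic questions that do not overlap benchmark prompts.
# clean = [q for q in synthetic_questions if not is_contaminated(q, benchmark_prompts)]
```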

In summary, KODCODE offers a significant advance in LLM training for coding. Its systematic data-generation pipeline, rigorous verification, and broad coverage make it a valuable resource for researchers working to develop more powerful and reliable coding assistants. The dataset is open-source, further promoting community collaboration in advancing AI for code generation.