AI Papers Reader

Personalized digests of latest AI research

View on GitHub

New Benchmark Reveals Challenges in Automated C-to-Safe-Rust Code Translation

A new benchmark, called CRUST-Bench, has been introduced to comprehensively evaluate the performance of automated systems in translating C code into safe Rust code. Developed by researchers from the University of Texas at Austin and New York University, CRUST-Bench aims to address the limitations of existing benchmarks, which often focus on isolated functions or lack rigorous correctness validation. The study was published on ArXiv on April 21, 2025.

The Need for Safe Rust Translation

Modernizing legacy C codebases is crucial for enhancing software security and enabling interoperability with Rust ecosystems. Rust, known for its memory safety guarantees, eliminates entire classes of memory bugs without relying on garbage collection. However, fully automated C-to-safe-Rust transpilation remains an open challenge.

Introducing CRUST-Bench

CRUST-Bench comprises 100 C repositories, each paired with manually crafted Rust interfaces and test cases. These interfaces provide explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns. The accompanying test cases validate the functional correctness of the transpiled code. Unlike previous benchmarks that focus on isolated functions, CRUST-Bench considers entire repositories, capturing the complexities of translating projects with dependencies across multiple files.

Concrete Examples

Consider a function in C that uses raw pointers to traverse a binary tree. This function might be responsible for finding a value based on a given key. When translating this to safe Rust, CRUST-Bench provides a Rust interface that uses safe, structured types like Vec<u8> for keys. This forces the translation system to generate memory-safe code that avoids raw pointers and adheres to Rust’s borrowing and ownership rules. Furthermore, test cases are designed to ensure that the translated Rust code correctly implements the original C function’s behavior.

Findings and Challenges

The researchers evaluated state-of-the-art large language models (LLMs) on CRUST-Bench. The results revealed that safe and idiomatic Rust generation remains a challenging problem. The best-performing model, OpenAI’s “o1”, could only solve 15 out of the 100 tasks in a single-shot setting. Common errors included type mismatches, borrowing violations, and unimplemented functions. These errors highlight the difficulty in reasoning about lifetimes, mutability, and type compatibility in Rust.

The study also explored iterative self-repair strategies and agent-based systems. Iterative repair, where the model refines its output based on compiler and test feedback, showed some improvements. However, these approaches often introduced new compilation errors or failed to address deeper semantic issues related to Rust’s ownership model.

Implications

CRUST-Bench provides a rigorous framework for evaluating C-to-safe-Rust transpilation systems. The benchmark’s comprehensive nature and focus on safety and correctness make it a valuable tool for driving innovation in this field. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety.