"CASS" Aims to Democratize GPU Computing by Bridging CUDA and AMD

A new dataset and model suite called CASS aims to break down the barriers between Nvidia and AMD GPU ecosystems, potentially saving companies considerable resources when migrating code. CASS, short for CUDA-AMD Assembly and Source Mapping, is the first large-scale resource for cross-architecture GPU code translation.

Nvidia’s CUDA platform has dominated GPU programming for years, but its tight integration with proprietary hardware has created significant vendor lock-in. As AMD GPUs become more competitive, there is a growing need to run legacy CUDA code on AMD hardware, which today typically requires costly rewrites. CASS directly addresses this challenge.

CASS consists of 70,000 verified code pairs mapping CUDA code to its AMD HIP counterpart at both the source and assembly levels (Nvidia SASS to AMD RDNA3 assembly). This dual-level approach is crucial because it enables translation not just of high-level code, but also of low-level optimizations that are specific to Nvidia hardware.

Concrete Examples

  • Source Code Translation: CASS enables automatic conversion of CUDA source, such as a matrix-multiplication kernel, into equivalent HIP code. Using CASS-trained models, a developer can generate the corresponding HIP code directly, meaning legacy CUDA codebases can potentially be ported to AMD without significant manual effort (see the sketch after this list).

  • Assembly-Level Translation: Suppose an expert CUDA developer has carefully hand-tuned matrix operations for Nvidia hardware. CASS can translate the resulting SASS into RDNA3 assembly, carrying those low-level optimizations over to the AMD environment instead of losing them in a source-level rewrite.
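
To make the source-level pairing concrete, here is a minimal sketch. The vec_add kernel and the rename table below are illustrative inventions, not samples from the dataset; the rename step simply mimics, in miniature, the kind of API mapping that tools like hipify perform, since device code itself is largely identical between CUDA and HIP.

```python
# Minimal sketch of a CUDA -> HIP source pair of the kind CASS collects.
# The kernel and the rename table are illustrative, not taken from the dataset.

CUDA_SRC = """
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void launch(const float* a, const float* b, float* c, int n) {
    float *da, *db, *dc;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    vec_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dc);
}
"""

# Host-side runtime calls are renamed; __global__, threadIdx, and the
# <<<...>>> launch syntax are already valid HIP and stay as-is.
CUDA_TO_HIP = {
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaMemcpy": "hipMemcpy",
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
}

def to_hip(cuda_src: str) -> str:
    """Toy source-to-source step: prepend the HIP header and rename API calls."""
    hip_src = "#include <hip/hip_runtime.h>\n" + cuda_src
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        hip_src = hip_src.replace(cuda_name, hip_name)
    return hip_src

if __name__ == "__main__":
    print(to_hip(CUDA_SRC))
```

Real CUDA-to-HIP translation has a far larger surface (streams, libraries, intrinsics, texture APIs), which is exactly where learned models trained on CASS are meant to go beyond rule-based renaming.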

To validate CASS, the researchers fine-tuned a family of domain-specific language models on the dataset. Impressively, these models achieved 95% accuracy in source translation and 37.5% in assembly translation, significantly outperforming both commercial models such as GPT-4o and Claude and AMD's rule-based Hipify tool. Furthermore, 85% of the translated code maintained native performance levels.

For robust testing, the team introduces CASS-Bench, a curated benchmark spanning 16 GPU domains with execution-verified results. All data, models, and evaluation tools have been released as open source to encourage further research and development.

CASS was built with a multi-stage pipeline that scrapes CUDA code from public repositories, synthesizes additional samples with language models, translates the code using Hipify, and compiles it for both Nvidia and AMD architectures. This process ensures that the resulting dataset contains semantically aligned code across vendors.
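
For readers who want a feel for the compile-and-verify stage, below is a rough sketch in Python, assuming the standard vendor tools (hipify-perl, nvcc, cuobjdump, hipcc) are installed. The file names, GPU targets, and flags are placeholders chosen for illustration; this is not the authors' actual pipeline code.

```python
import subprocess
from pathlib import Path

def build_pair(cuda_file: str, workdir: str = "out") -> dict:
    """Sketch of one pipeline step: translate a CUDA file to HIP, then compile
    both sides and dump device assembly. GPU targets and flags are illustrative."""
    out = Path(workdir)
    out.mkdir(exist_ok=True)

    # 1. Source-to-source translation with hipify-perl (prints HIP code to stdout).
    hip_src = subprocess.run(["hipify-perl", cuda_file],
                             check=True, capture_output=True, text=True).stdout
    hip_file = out / (Path(cuda_file).stem + ".hip.cpp")
    hip_file.write_text(hip_src)

    # 2. Compile the CUDA side to a cubin and dump Nvidia SASS with cuobjdump.
    cubin = out / "kernel.cubin"
    subprocess.run(["nvcc", "-arch=sm_90", "-cubin", cuda_file, "-o", str(cubin)],
                   check=True)
    sass = subprocess.run(["cuobjdump", "-sass", str(cubin)],
                          check=True, capture_output=True, text=True).stdout

    # 3. Compile the HIP side for an RDNA3 target and keep the AMD device assembly.
    amd_asm = out / "kernel.s"
    subprocess.run(["hipcc", "-S", "--cuda-device-only", "--offload-arch=gfx1100",
                    str(hip_file), "-o", str(amd_asm)], check=True)

    # Only pairs that make it through both toolchains would enter the dataset.
    return {"hip_source": hip_src, "sass": sass, "rdna3_asm": amd_asm.read_text()}
```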

This project opens new opportunities for GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. By providing a solid foundation for cross-architecture GPU code translation, CASS can contribute to democratizing GPU computing and breaking down vendor lock-in.