Who Wrote That Code? How a Smart Hybrid Search Catches AI "Plagiarism" at Scale

As AI code assistants like GitHub Copilot and CodeLlama become ubiquitous, developers face a thorny question: where did this code actually come from? Large language models (LLMs) often memorize and “recite” chunks of their training data. Sometimes they copy it word for word. Other times they perform “domain adaptation,” keeping the exact logic but renaming variables, for example swapping user_age for customer_years. For companies trying to avoid copyright lawsuits and comply with open-source licenses, finding the original author in a database of billions of code files is like looking for a needle in a digital haystack.
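To see why renaming alone does not hide copied logic, here is a minimal sketch that replaces every identifier with a placeholder before comparing two snippets. It is an illustration only, not the method from the paper: the function name normalize_identifiers is hypothetical, and it relies on Python's standard tokenize module, so it only handles Python source.

```python
import io
import keyword
import tokenize

# Token types that carry layout or comments rather than logic.
SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def normalize_identifiers(source: str) -> list[str]:
    """Return the token stream with every non-keyword name replaced by 'ID'."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")  # user_age and customer_years both become ID
        else:
            tokens.append(tok.string)
    return tokens


original = "if user_age >= 18:\n    discount = user_age * 2\n"
renamed = "if customer_years >= 18:\n    discount = customer_years * 2\n"

print(normalize_identifiers(original))
# ['if', 'ID', '>=', '18', ':', 'ID', '=', 'ID', '*', '2']
print(normalize_identifiers(original) == normalize_identifiers(renamed))
# True
```

After normalization the two snippets are token-for-token identical, which is the kind of structural sameness a source-tracking tool has to pick up.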

The traditional answer is a “fingerprinting” algorithm such as Winnowing, the engine behind the famous plagiarism detector MOSS. Winnowing slices code into small, hashed pieces and looks for exact matches between them. It is very robust, but its cost grows linearly with the size of the corpus. With a training set of billions of files, scanning them one by one for every fragment of generated code is computationally impractical.
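The slice, hash, and select idea behind Winnowing can be sketched in a few lines. This is a simplified illustration rather than the MOSS implementation: the k-gram length (5), the window size (4), the use of CRC32 as the hash, and the Jaccard score in similarity are all assumptions chosen for readability.

```python
import zlib


def kgram_hashes(text: str, k: int = 5) -> list[int]:
    """Hash every k-character substring of the whitespace-stripped, lowercased text."""
    normalized = "".join(text.split()).lower()
    return [
        zlib.crc32(normalized[i:i + k].encode("utf-8"))
        for i in range(len(normalized) - k + 1)
    ]


def winnow(text: str, k: int = 5, window: int = 4) -> set[int]:
    """Keep the minimum hash from each sliding window of consecutive k-gram hashes."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    window = min(window, len(hashes))  # short inputs get a single window
    return {
        min(hashes[start:start + window])
        for start in range(len(hashes) - window + 1)
    }


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two fingerprint sets (an assumed scoring rule)."""
    fa, fb = winnow(a), winnow(b)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


snippet = "def add(a, b):\n    return a + b\n"
reformatted = "def add(a,b): return a+b"
print(similarity(snippet, reformatted))  # 1.0, since whitespace is normalized away
```

Because one hash is kept from every window, any two texts that share a run of at least window + k - 1 normalized characters are guaranteed to share a fingerprint. The linear cost the article describes comes from having to repeat this comparison against every file in the corpus.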

To break this bottleneck, researchers from Italy and France have developed HybridSourceTracker (HST). It is a two-stage pipeline that combines the speed of modern AI vector searches with the precision of classic fingerprinting.

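To understand how HST works, imagine searching for a specific recipe in a massive national archive of billions of documents. A sensible archivist would not read every page. They would first use the catalog to pull a shelf of likely cookbooks, then compare those few books page by page. HST follows the same two steps.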

In the first stage, HST uses SourceTracker, a custom-tailored 300-million-parameter neural network. SourceTracker translates the semantic “vibe” and structure of the generated code fragment into a mathematical vector. By querying a vector database, it narrows the search space from billions of files to just the 100 most similar “candidate” files. The lookup takes less than a second because the vector index is searched in logarithmic time rather than by scanning every file.
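The retrieval stage can be sketched as "embed, then take the nearest neighbors." The code below is a toy under loud assumptions: toy_embed is a hashed bag of character trigrams standing in for the SourceTracker network, DIM is an arbitrary vector size, and top_k scans every file by brute force, whereas a real vector database would use an index to avoid that scan.

```python
import math
import zlib
from collections import Counter

DIM = 256  # hypothetical embedding size for this toy example


def toy_embed(code: str, dim: int = DIM) -> list[float]:
    """Stand-in for a neural encoder: hashed character trigram counts, L2-normalized."""
    text = " ".join(code.split())  # collapse whitespace so layout does not matter
    counts = Counter(
        zlib.crc32(text[i:i + 3].encode("utf-8")) % dim
        for i in range(len(text) - 2)
    )
    vec = [float(counts.get(i, 0)) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def top_k(query: str, corpus: dict[str, str], k: int = 100) -> list[tuple[str, float]]:
    """Return the k corpus entries with the highest cosine similarity to the query."""
    q = toy_embed(query)
    scored = [
        (name, sum(a * b for a, b in zip(q, toy_embed(source))))
        for name, source in corpus.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


corpus = {
    "sum.py": "def total(xs):\n    return sum(xs)",
    "greet.py": "print('hello world')",
    "loop.py": "for i in range(10):\n    print(i)",
}
candidates = top_k("def total(xs): return sum(xs)", corpus, k=2)
print([name for name, _ in candidates])
```

In a production setting the corpus vectors would be computed once and stored, so only the query needs to be embedded at search time.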

In the second stage, HST runs the classic Winnowing algorithm, but only on those 100 candidates. This fine-grained, fingerprint-level comparison re-ranks the candidates so that the most likely source rises to the top. Because Winnowing only looks at a fixed pool of 100 files, this step takes a fraction of a millisecond.
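The re-ranking step amounts to scoring each candidate by fingerprint overlap and sorting. In this sketch, token_kgram_fingerprints is a deliberately simplified stand-in for Winnowing (it hashes every run of three whitespace-separated tokens and keeps them all), and the containment score is an assumption, chosen because a short generated fragment is being compared against whole files.

```python
import zlib
from typing import Callable


def token_kgram_fingerprints(code: str, k: int = 3) -> set[int]:
    """Simplified stand-in for Winnowing: hash every run of k whitespace-separated tokens."""
    tokens = code.split()
    return {
        zlib.crc32(" ".join(tokens[i:i + k]).encode("utf-8"))
        for i in range(len(tokens) - k + 1)
    }


def rerank(
    query: str,
    candidates: dict[str, str],
    fingerprint: Callable[[str], set[int]] = token_kgram_fingerprints,
) -> list[tuple[str, float]]:
    """Sort candidates by the share of the query's fingerprints they contain."""
    query_prints = fingerprint(query)
    scored = []
    for name, source in candidates.items():
        shared = query_prints & fingerprint(source)
        score = len(shared) / len(query_prints) if query_prints else 0.0
        scored.append((name, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


shortlist = {
    "loop.py": "for i in range(10):\n    print(i)",
    "invoice.py": (
        "def bill(price, quantity, tax):\n"
        "    total = price * quantity + tax\n"
        "    return total"
    ),
}
ranking = rerank("total = price * quantity + tax", shortlist)
print(ranking[0])  # ('invoice.py', 1.0)
```

The cost of this loop depends only on the size of the shortlist, not on the size of the full corpus, which is why the expensive comparison stays cheap once the first stage has done its filtering.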

The researchers tested their hybrid system on a massive 10-million-snippet database. For code fragments of 60 tokens or more (the standard threshold for plagiarism checks), HST actually outperformed pure Winnowing by up to 5.4% while maintaining lightning-fast query times. Even when the system missed the exact “ground truth” source, evaluation using an LLM-based judge revealed that the retrieved files were still highly similar, offering programmers a clear lineage of inspiration.

As open-source code remains the lifeblood of software engineering, tools like HST are vital. They bridge the gap between AI generation and ethical authorship, ensuring that developers can confidently build the future while respecting the past.
