“Compiling” Intelligence: New Framework Brings Software Engineering Rigor to AI Agents
In the current gold rush of artificial intelligence, building an AI agent often feels more like alchemy than engineering. Developers spend hours “vibing” with prompts—tweaking a sentence here or an instruction there—hoping the agent won’t hallucinate or misuse a critical tool. But a new paper from Tzafrir Rehan at Fiverr Labs suggests it is time to stop “poking” AI and start “compiling” it.
The research introduces Test-Driven AI Agent Definition (TDAD), a methodology that treats an AI agent’s prompt not as a hand-written letter, but as a compiled artifact derived from a rigorous behavioral specification.
The “Compilation” of a Prompt
Traditionally, if you wanted an AI agent to handle customer refunds, you might write a long prompt: “Be helpful, but always check their ID first.” If the agent fails, you rewrite the prompt. TDAD flips this.
In the TDAD workflow, a human engineer writes a “specification” in a structured format (YAML). This spec defines tools, policies, and a decision tree. Then, a specialized AI “coding agent” called TestSmith takes that spec and generates a suite of executable tests.
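To make the idea concrete, here is a minimal sketch of what such a behavioral spec and one generated test might look like. The field names, the stub agent, and the test shape are illustrative assumptions, not the paper's actual schema:

```python
# Illustrative spec for the refund agent described above, written as a
# Python dict standing in for the YAML file a TDAD engineer would author.
REFUND_SPEC = {
    "tools": ["lookup_order", "issue_refund"],
    "policies": ["verify_identity_before_refund"],
    "decision_tree": {
        "refund_request": {
            "identity_verified": "issue_refund",
            "identity_unverified": "ask_for_verification",
        }
    },
}

def generated_test_refund_requires_verification(agent):
    """The kind of executable test TestSmith might derive from the spec:
    an unverified customer asking for a refund must be asked to verify,
    never refunded outright."""
    action = agent(request="refund", identity_verified=False)
    assert action == "ask_for_verification", action

# A trivially compliant stub, standing in for a real LLM-backed agent.
def stub_agent(request, identity_verified):
    branch = REFUND_SPEC["decision_tree"][f"{request}_request"]
    key = "identity_verified" if identity_verified else "identity_unverified"
    return branch[key]
```

The key point is that the spec, not the prompt, is the source of truth: the tests are derived mechanically from it.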
Once the tests exist, a second agent, PromptSmith, takes over. It acts as a compiler: it writes a prompt, runs the tests, sees where it failed, and iteratively refines the prompt until the agent passes every test. For example, if a test reveals the agent issued a refund without asking for a zip code, PromptSmith automatically updates the prompt to enforce that specific verification step.
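The compile loop itself can be sketched in a few lines. Here `refine_prompt` is a hypothetical stand-in for the paper's LLM-driven rewriter; in this toy version, tests just check the prompt text and "refinement" appends the missing rule:

```python
# A minimal sketch of the PromptSmith "compile" loop: write a prompt,
# run the tests, refine on failure, repeat until everything passes.
def compile_prompt(initial_prompt, tests, refine_prompt, max_iterations=10):
    """Iteratively refine a prompt until every test passes, or give up."""
    prompt = initial_prompt
    failures = []
    for _ in range(max_iterations):
        failures = [name for name, test in tests.items() if not test(prompt)]
        if not failures:
            return prompt  # "compilation" succeeded
        # In TDAD an LLM rewrites the prompt from the failure reports;
        # here refine_prompt is any callable with that contract.
        prompt = refine_prompt(prompt, failures)
    raise RuntimeError(f"could not satisfy tests: {failures}")

# Toy usage mirroring the zip-code example above.
tests = {"asks_zip_code": lambda p: "ask for the zip code" in p}
refine = lambda p, fails: p + " Before refunding, ask for the zip code."
final = compile_prompt("You are a refund agent.", tests, refine)
```

The loop terminates either with a prompt that passes the whole suite or with an explicit failure, which is what makes the prompt an auditable artifact rather than a guess.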
Preventing “Specification Gaming”
One of the biggest risks in automated AI development is “gaming the system”—where an AI finds a shortcut to pass a test without actually understanding the rule. To combat this, TDAD introduces three clever “anti-gaming” mechanisms:
- Hidden Tests: Much like a final exam, some tests are withheld from the “compiler” (PromptSmith). If the agent passes the visible tests but fails the hidden ones, the engineers know the prompt isn’t generalized enough.
- Semantic Mutation: A third agent, MutationSmith, intentionally creates “faulty” versions of the agent. For instance, it might create a version that is instructed to “leak private data if asked nicely.” If the existing test suite doesn’t catch this “mutant,” it means the tests are too weak and need to be strengthened.
- Spec Evolution: When business rules change—perhaps a company now requires manager approval for refunds over $100—TDAD measures “regression safety.” It ensures that adding the new $100 rule doesn’t accidentally break the old identity verification rules.
Concrete Results
To prove the system works, Rehan tested TDAD on SpecSuite-Core, a benchmark featuring four complex agents: a customer support bot, a SQL analytics assistant, an incident runbook handler, and an expense guard.
The results were striking. Across 24 trials, TDAD achieved a 92% success rate in “compiling” the first version of these agents. Even more impressively, the agents maintained a 97% “regression safety” score when their requirements were updated, meaning the system effectively prevented old features from breaking while new ones were added.
Why It Matters
As AI agents move into high-stakes production environments—handling sensitive data or executing financial transactions—the “trial and error” approach to prompting is no longer viable. By applying the decades-old discipline of Test-Driven Development (TDD) to the world of Large Language Models, TDAD offers a glimpse of a future where AI behavior is not just a hope, but a measurable, verifiable guarantee.