July 9, 2025

How to Perform Eval-Driven Development While Shipping Fast

Marcel Tan

Why Eval-Driven Development

What differentiates the hundreds of AI-powered receptionists in the market? The dozens of AI coding agents making the rounds on X? Or the wave of AI writing tools?

It’s simple: their evals.

For AI-native companies, your evals are your product. That leads to a controversial opinion: you should not outsource your evals to an external framework.

If you want to understand the underlying qualities of any AI-native product, the best place to start is looking at how the company evaluates its system. As an AI engineer, you can’t improve what you can’t measure. So without reliable evals, you’re basically flying blind.

At the same time, the rate of improvement we're seeing with LLMs means that heavy, rigid evals can’t keep up with the speed at which modern AI teams have to iterate. Your evals have to be lightweight if you want to evaluate a constantly evolving system at scale.

We realized this early while building Tusk, our AI unit test generation agent. However, when we started building our evals, we ran into three challenges:

1. We’re a rapidly evolving startup

  • Our AI workflow, codebase, and features are changing by the day. Rigid evals written today could be out of date tomorrow.
  • Lightweight, flexible evals aren’t just a nice-to-have. They are essential to knowing that we're improving the product without regressions.

2. Our AI agent is complex

  • Tusk is not a single prompt → response product. It’s a system with dozens of interconnected tools, workflow steps, and logical branches.
  • While end-to-end evals are useful sanity checks, they don’t help us iterate on specific components of our AI test generation workflow.

3. We deal with unstructured I/O

  • Each part of the system has its own inputs and outputs that vary in structure.
  • And back to point 1, the inputs and outputs are constantly changing.

If you’re building an agentic AI product today, I'm certain that you're hitting at least one of these walls. Traditional eval tools are too rigid for fast-moving teams. So we sought to solve this internally.

One of the biggest unlocks came from how we wrote our code.

Better With Functional Programming

One reason our evals worked so well was that we made our system as functional as possible. This meant actively avoiding object-oriented programming.

Whenever possible, each part of the AI system was built as a function: input → output. No global state, dependency injection, or complex class hierarchies.

Why does this matter for evals?

  • It’s easy to isolate a component and run it on a dataset.
  • You don’t need to spin up the whole app or fake an entire service—just call the function.

When your code is written this way, writing evals is just like writing unit tests. If your system is a web of classes and injected dependencies, it becomes painful to extract clean inputs/outputs for evals. Eval-driven development (EDD), in this sense, is analogous to test-driven development (TDD) in traditional programming.
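
Here's a rough sketch of what that looks like in practice. The names are hypothetical (this isn't our actual code); the point is the shape: the component is a pure function, and the eval is just that function called over a dataset, like a parameterized unit test.

```python
# A minimal sketch, assuming a hypothetical pure component. Not Tusk's actual code.

def incorporate_test(existing_file: str, new_test: str) -> str:
    """Given the current contents of a test file and one new test case,
    return the updated file contents. No global state, no injected services."""
    # ...the real implementation (LLM call, merge logic, etc.) goes here...
    return existing_file.rstrip() + "\n\n" + new_test + "\n"


def eval_incorporation(dataset: list[dict]) -> float:
    """An eval is just the same function run over a dataset of cases."""
    passed = sum(
        1
        for case in dataset
        if case["new_test"] in incorporate_test(case["existing_file"], case["new_test"])
    )
    return passed / len(dataset)
```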

Example: Test File Incorporation

Let’s look at one particular part of our Tusk workflow as an example of how we performed EDD: test file incorporation.

For context, this refers to how Tusk consolidates individual tests into a single test file. Tusk doesn’t just blindly generate unit tests. Instead, our AI tool takes your existing test files and intelligently incorporates the new test cases, making sure the formatting, imports, and structure stay intact.

The incorporation step is an important part of our value prop—developers want their test files clean, readable, and executable.

To get this right, we needed to try multiple approaches and compare them on reliability and latency. Without a flexible eval setup, it was almost impossible to know if one approach was better than another.

[Diagram: Tusk's test file incorporation workflow]

The Lightweight Eval

The breakthrough came when we realized we didn’t need some massive evaluation infrastructure. We just needed:

  1. A clear contract: What goes in, what should come out?
  2. A way to evaluate the outputs
  3. A way to view the results quickly
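
For the first item, the contract can be as small as a typed record of inputs and expected output properties. This is an illustrative sketch rather than our production schema:

```python
from dataclasses import dataclass


@dataclass
class IncorporationCase:
    # Inputs to the component under evaluation
    existing_file: str        # current contents of the test file
    new_tests: list[str]      # generated test cases to merge in
    # Expected properties of the output (checked by the eval, not an exact match)
    must_contain: list[str]   # snippets that should appear in the merged file
```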

So we wrote simple scripts that would:

  • Take the part of the workflow we want to evaluate (say, test incorporation)
  • Create a dataset of inputs and outputs
  • Run the component across this dataset
  • Use Cursor or Claude to quickly generate a visual HTML report showing:
    • The input
    • The agent’s output
    • Eval results

We dropped the report into a folder, opened it in our browser, and boom—we had a clear visual snapshot of how that component was performing across the dataset. We could now easily compare the results to see if the approach in question incorporated more tests and at a lower latency than the original output.
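
Concretely, the whole script can be a few dozen lines. In the sketch below, the imported component, dataset path, and pass/fail check are placeholders for whatever step you're evaluating; the only hard requirement is that the component is a plain function you can call in a loop.

```python
import html
import json
import time

# Hypothetical import: the component under evaluation, written as a pure function.
from my_agent.incorporation import incorporate_tests


def run_eval(dataset_path: str, report_path: str) -> None:
    cases = json.loads(open(dataset_path).read())
    rows = []
    for case in cases:
        start = time.perf_counter()
        output = incorporate_tests(case["existing_file"], case["new_tests"])
        latency = time.perf_counter() - start
        ok = all(snippet in output for snippet in case["must_contain"])
        rows.append(
            "<tr><td><pre>{}</pre></td><td><pre>{}</pre></td><td>{}</td><td>{:.2f}s</td></tr>".format(
                html.escape("\n\n".join(case["new_tests"])),
                html.escape(output),
                "PASS" if ok else "FAIL",
                latency,
            )
        )
    table = (
        "<table border='1'>"
        "<tr><th>Input tests</th><th>Incorporated file</th><th>Eval</th><th>Latency</th></tr>"
        + "".join(rows)
        + "</table>"
    )
    with open(report_path, "w") as f:
        f.write(table)


if __name__ == "__main__":
    run_eval("datasets/incorporation.json", "reports/incorporation.html")
```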

[Screenshot: HTML report for evaluating test file incorporation]

Why This Works

These HTML evals aren’t fancy, but they work for three main reasons.

  1. Fast to set up. With AI-powered code generation, creating the eval scripts takes under 5 minutes. You’ll spend most of your time curating the dataset.
  2. Easy to modify. When your data format changes (which can happen often), update the dataset with the new inputs/outputs and regenerate the HTML report with the help of AI-powered IDEs.
  3. Cost-efficient. Just scripts and files, with no additional SaaS fees from an LLM eval provider.

Once we had this pattern down, we started using it everywhere when building our LLM-powered product. Here are some scenarios where it’s been very helpful:

  • Trying three different strategies for formatting incorporated test cases to see which one incorporated the most tests at the lowest latency (see the sketch after this list)
  • Determining if a new model helps or hurts a step in our test generation pipeline based on a predetermined success measure
  • Debugging edge cases in a workflow step to see if we can replicate them consistently for specific types of inputs
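
For the strategy-comparison scenario, the same harness just loops over the candidate implementations and reports incorporation counts and latency. A sketch, with hypothetical strategy functions standing in for the real ones:

```python
import statistics
import time

# Hypothetical candidate implementations that all satisfy the same contract.
from my_agent.incorporation import strategy_a, strategy_b, strategy_c


def compare_strategies(cases: list[dict]) -> None:
    for name, strategy in [("A", strategy_a), ("B", strategy_b), ("C", strategy_c)]:
        latencies, incorporated = [], 0
        for case in cases:
            start = time.perf_counter()
            output = strategy(case["existing_file"], case["new_tests"])
            latencies.append(time.perf_counter() - start)
            incorporated += all(s in output for s in case["must_contain"])
        print(
            f"strategy {name}: {incorporated}/{len(cases)} cases incorporated, "
            f"median latency {statistics.median(latencies):.2f}s"
        )
```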

Instead of relying on intuition or scattered examples, we had a source of truth for quick decision-making. It's become our default way to answer, "Is this change actually better?"

Limitations

It's worth calling out some downsides of using this approach, namely:

  • Still a semi-manual process (though AI helps with script generation)
  • No historical tracking in one dashboard (though you could run these evals on merges to main or periodically and store the results in a database; see the sketch after this list)
  • If you didn’t already start out with functional programming, refactoring your code can be painful
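
If you do want basic historical tracking, one option is to append each run's summary to a small SQLite table from CI. This is a hedged sketch of that idea, not part of the setup described above:

```python
import sqlite3
import time


def record_run(component: str, pass_rate: float, median_latency: float,
               db_path: str = "eval_history.db") -> None:
    """Append one eval run's summary so trends can be queried over time."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eval_runs "
        "(ts REAL, component TEXT, pass_rate REAL, median_latency REAL)"
    )
    conn.execute(
        "INSERT INTO eval_runs VALUES (?, ?, ?, ?)",
        (time.time(), component, pass_rate, median_latency),
    )
    conn.commit()
    conn.close()
```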

For us, the tradeoffs were well worth it since it meant less overhead setting up infra and tooling to do eval-driven development. Your mileage may vary.

Final Takeaway

Your evals are your product. With vibe coding becoming more prevalent, there will easily be hundreds of random AI copycats of your product. Good evals are a moat for AI companies.

If there's one thing to take away from this post, it's that LLM evals don’t have to be heavyweight. They just need to be useful and treated as first-class citizens in your codebase. This way your team can maintain high product quality without slowing down your development cycle.

If you’re stuck on how to evaluate your agent, try starting with these lightweight evals. You'll be surprised at how far you get without excessive overhead.