July 9, 2025
What differentiates the hundreds of AI-powered receptionists in the market? The dozens of AI coding agents making the rounds on X? Or the wave of AI writing tools?
It’s simple: their evals.
For AI-native companies, your evals are your product. Which leads to a controversial opinion: you should not outsource your evals to an external framework.
If you want to understand the underlying qualities of any AI-native product, the best place to start is looking at how the company evaluates its system. As an AI engineer, you can’t improve what you can’t measure. So without reliable evals, you’re basically flying blind.
At the same time, the rate of improvement we're seeing with LLMs means that heavy, rigid evals can’t keep up with the speed at which modern AI teams have to iterate. Your evals have to be lightweight if you want to do this at scale across a constantly evolving system.
We realized this early while building Tusk, our AI unit test generation agent. However, when we started building our evals, we ran into three challenges:
If you’re building an agentic AI product today, I'm certain you're hitting at least one of these walls. Traditional eval tools are too rigid for fast-moving teams. So we set out to solve this internally.
One of the biggest unlocks came from how we wrote our code.
One reason our evals worked so well is that we made our system as functional as possible. This meant actively avoiding object-oriented programming.
Whenever possible, each part of the AI system was built as a function: input → output. No global state, dependency injection, or complex class hierarchies.
Why does this matter for evals?
When your code is written this way, writing evals is just like writing unit tests. If your system is a web of classes and injected dependencies, it becomes painful to extract clean inputs/outputs for evals. Eval-driven development (EDD), in this sense, is analogous to test-driven development (TDD) in traditional programming.
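Here's a minimal sketch of what that pattern can look like. The function names, prompt, and the llm_call parameter are hypothetical, purely to illustrate the shape, not our actual code:

```python
# Hypothetical sketch: an LLM-backed step written as a plain function.
# Names, prompt, and the llm_call parameter are illustrative assumptions.

def plan_test_cases(source_code: str, llm_call) -> list[str]:
    """Everything the step needs comes in as arguments; everything it
    produces goes out in the return value. No globals, no injected state."""
    prompt = f"List the unit test cases you would write for:\n{source_code}"
    raw = llm_call(prompt)
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

# The eval reads like a unit test: known input in, assertions on the output.
def eval_plan_test_cases(llm_call) -> bool:
    source = "def parse_price(value: str) -> float: ..."
    cases = plan_test_cases(source, llm_call)
    return len(cases) >= 3 and any("invalid" in c.lower() for c in cases)
```

Because the step is just a function of its inputs, the eval can construct an input, call the step, and assert on the output, exactly like a unit test.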
Let’s look at one particular part of our Tusk workflow as an example of how we performed EDD: test file incorporation.
For context, this refers to how Tusk consolidates individual tests into a single test file. Tusk doesn’t just blindly generate unit tests. Instead, our AI tool takes your existing test files and intelligently incorporates the new test cases, making sure the formatting, imports, and structure stay intact.
The incorporation step is an important part of our value prop—developers want their test files clean, readable, and executable.
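To make this concrete, here's roughly what an incorporation step can look like when written functionally. The field names, prompt, and the rough counting heuristic below are assumptions for the sake of the example, not Tusk's actual implementation:

```python
# Illustrative shape of an incorporation step as a pure function.
# Field and function names are assumptions for the sake of the example.
from dataclasses import dataclass

@dataclass(frozen=True)
class IncorporationResult:
    merged_file: str         # full contents of the updated test file
    tests_incorporated: int  # how many new cases made it in

def incorporate_tests(existing_test_file: str, new_test_cases: list[str],
                      llm_call) -> IncorporationResult:
    """Takes the current test file and the generated cases, returns the
    merged file. No hidden state, so an eval can call it directly."""
    prompt = (
        "Merge these new test cases into the existing test file, keeping "
        f"imports and formatting intact.\n\nFILE:\n{existing_test_file}\n\n"
        "NEW CASES:\n" + "\n\n".join(new_test_cases)
    )
    merged = llm_call(prompt)
    # Rough heuristic: count a case as incorporated if its signature survived.
    count = sum(1 for case in new_test_cases if case.split("(")[0] in merged)
    return IncorporationResult(merged_file=merged, tests_incorporated=count)
```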
To get test file incorporation right, we needed to try multiple approaches and compare them on reliability and latency. Without a flexible eval setup, it was almost impossible to know if one approach was better than another.
The breakthrough came when we realized we didn’t need some massive evaluation infrastructure. We just needed:
So we wrote simple scripts that would:
We dropped the report into a folder, opened it in our browser, and boom—we had a clear visual snapshot of how that component was performing across the dataset. We could now easily compare the results and see whether the new approach incorporated more tests, and at a lower latency, than the original.
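For the curious, here's a sketch of what one of those scripts can look like, assuming the incorporate_tests function from the earlier example and a small JSON dataset of examples. The dataset path, field names, and report layout are all assumptions, not our actual tooling:

```python
# Rough sketch of a lightweight eval script that writes an HTML report.
# Assumes the incorporate_tests sketch above; dataset path and field
# names are illustrative.
import json, time, html, pathlib

def run_eval(dataset_path: str, llm_call,
             report_path: str = "reports/incorporation.html"):
    rows = []
    for example in json.loads(pathlib.Path(dataset_path).read_text()):
        start = time.perf_counter()
        result = incorporate_tests(example["existing_file"],
                                   example["new_cases"], llm_call)
        latency = time.perf_counter() - start
        rows.append(
            f"<tr><td>{html.escape(example['name'])}</td>"
            f"<td>{result.tests_incorporated}/{len(example['new_cases'])}</td>"
            f"<td>{latency:.1f}s</td>"
            f"<td><pre>{html.escape(result.merged_file)}</pre></td></tr>"
        )
    report = (
        "<table border='1'><tr><th>Example</th><th>Incorporated</th>"
        "<th>Latency</th><th>Merged file</th></tr>" + "".join(rows) + "</table>"
    )
    out = pathlib.Path(report_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(report)
```

Run it once for the current approach and once for the candidate, open the two reports side by side, and the "better or not" question usually answers itself.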
These HTML evals aren’t fancy, but they work for three main reasons.
Once we had this pattern down, we started using it everywhere when building our LLM-powered product. Here are some scenarios where it’s been very helpful:
Instead of relying on intuition or scattered examples, we had a source of truth for quick decision-making. It's become our default way to answer, "Is this change actually better?"
It's worth calling out some downsides of using this approach, namely:
For us, the tradeoffs were well worth it since it meant less overhead setting up infra and tooling to do eval-driven development. Your mileage may vary.
Your evals are your product. With vibe coding becoming more prevalent, there will easily be hundreds of random AI copycats of your product. Good evals are a moat for AI companies.
If there's one thing to take away from this post, it's that LLM evals don’t have to be heavyweight. They just need to be useful and treated as first-class citizens in your codebase. This way your team can maintain high product quality without slowing down your development cycle.
If you’re stuck on how to evaluate your agent, try starting with these lightweight evals. You'll be surprised at how far you get without excessive overhead.