March 9, 2026

I was a senior engineer at Aspire before starting Tusk, which I often joke makes me a boomer in San Francisco. It's not without basis since I've had to force myself to unlearn habits that made sense before coding agents (but are now harmful).
The most consequential one to unlearn is treating code as a precious resource.
The grumpy senior engineer in me will harp on about how duplicated work is wasteful. But agent hours are basically free at this point, so I'd be remiss not to let agents build out 2-3 competing implementations at the same time and review them later.
That said, there is a ceiling on how much parallel work you can review as an individual engineer before your brain gets fried. From experience, I've landed on 1 primary coding task (usually full-stack), 1 auxiliary task that's scoped well, and 3 exploratory plans as "max parallelism." Past that, you get diminishing returns.
Is there a way to expand the scope of parallel work beyond these 5 tasks? Today, there are still code-quality nits that require a human's taste to tweak. But if we extrapolate current foundation model capabilities 18 months out, taste won't be a blocker for long.
So the actual bottleneck is code verification in a more deterministic sense, i.e., ensuring functional, regression-free code. I like to break this problem down into three buckets: deep context on how the system currently behaves, agents verifying their own work as they build, and automated tests that catch regressions.
Poor final output usually traces back to the agent not understanding how the system behaves right now. Before proposing changes, agents should intimately understand the existing system and create a great plan.
Any engineer worth their salt inherently knows this. Before coding, you should do a technical spike, create an engineering design, and write specs. But doing all this assumes a level of understanding of business logic that's not in the codebase (the best PMs and EMs provide this readily).
One pet peeve of mine: when AI influencers talk about context, they usually mean shallow context, like telling the agent what an app does or what architecture it uses. This is static context. Your database schema tells the agent that a field is a string, but it doesn't tell it what those values look like in prod.
Most agent setups stop at static context. However, the agents that produce good output on the first pass are the ones that also have dynamic context, which helps answer questions that can’t be answered from looking at the code alone. Things like "What does the real API response in this scenario look like?" or "What are some edge cases that happened in prod in the last 14 days?"
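As a sketch of what dynamic context can look like: a small helper that turns a sample of recorded prod responses into a snippet the agent can read alongside the static schema. The record shape, endpoint, and field values here are invented for illustration:

```typescript
// Hypothetical shape of a recorded prod API response (illustrative only).
interface RecordedResponse {
  endpoint: string;
  status: number;
  body: Record<string, unknown>;
}

// Summarize a sample of real responses into a markdown snippet that can be
// dropped into the agent's context next to the schema.
function buildDynamicContext(samples: RecordedResponse[]): string {
  const lines = ["## Recent prod responses (sampled)"];
  for (const s of samples) {
    lines.push(`- ${s.endpoint} -> ${s.status}: ${JSON.stringify(s.body)}`);
  }
  return lines.join("\n");
}

// The schema only says `plan` is a string; the sample shows what that
// string actually looks like in prod.
const context = buildDynamicContext([
  { endpoint: "GET /api/user/42", status: 200, body: { plan: "legacy_pro_2019" } },
]);
```

The point isn't this particular format; it's that the agent sees real values, not just types.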
Providing helpful context to the agent looks like this:
**agents.md files.** This is pretty self-explanatory and has become more popular with computer use and long-running background agents on the web.
I've had good success with the traditional method of breaking tasks into frontend and backend, then having the agent confirm both are working independently before I come in, verify the summary, and connect the two.
On the frontend, give the coding agent access to Chrome (easy with the Claude for Chrome extension). Tell it to mock the backend or point it at a shared dev backend, and provide instructions to test the full flow. One neat trick here is explicitly asking the agent to mutate presentation logic, like forcing error components to render temporarily. Some edge cases are hard to simulate with mock data, so it's easier to just simulate them in the browser.
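The "force the error state" trick can be as simple as swapping in a failing fetch while the agent drives the flow, so the error components render without breaking a real backend. A minimal sketch, with an invented error payload:

```typescript
// A drop-in replacement for fetch's response that always fails, used to
// force the UI's error components to render during a manual check.
const failingFetch = (): Response =>
  new Response(JSON.stringify({ error: "simulated outage" }), {
    status: 500,
    headers: { "Content-Type": "application/json" },
  });

// The agent swaps it in before exercising the UI, then restores the real
// fetch afterwards:
//   const realFetch = globalThis.fetch;
//   globalThis.fetch = async () => failingFetch();
//   ...drive the flow, confirm the error branch renders...
//   globalThis.fetch = realFetch;
const forced = failingFetch();
```

Any equivalent stub at your data-access layer works just as well; the point is that the error branch actually renders on screen.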
On the backend, let the agent write small scripts that call backend logic and test against real dependencies. Leaning into functional programming helps a lot here. On an adjacent note, our founding engineer, Sohan, wrote a blog post on how we use this to do lightweight evals for our LLM workflows.
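As a sketch of what those throwaway verification scripts look like when the backend logic is factored into pure functions (the pricing function and check values here are hypothetical):

```typescript
// Hypothetical pure backend function: easy for an agent to call directly
// from a one-off script, no HTTP server or framework bootstrapping needed.
function applyDiscount(totalCents: number, discountPct: number): number {
  if (discountPct < 0 || discountPct > 100) {
    throw new Error(`invalid discount: ${discountPct}`);
  }
  return Math.round(totalCents * (1 - discountPct / 100));
}

// Throwaway script the agent can run against realistic inputs (e.g. rows
// pulled from a dev database) before wiring the logic into the API.
const checks = [
  { total: 10_000, pct: 15, expect: 8_500 },
  { total: 999, pct: 0, expect: 999 },
];
const results = checks.map((c) => applyDiscount(c.total, c.pct) === c.expect);
```

Because the function is pure, the same script works against mock rows or real dependency output with no setup cost.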
Everyone's talking about "guardrails" for coding agents. Dig into what people mean by guardrails and it's usually sandboxing, permissions, linters, and static analysis. All of that constrains the agent, but it rarely guides the agent toward the correct, regression-free output.
We need a better harness for automated testing. Tests are the executable definition of how your system should behave; without them, agents are still flying blind.
Most growth-stage companies deal with massive coverage debt (<30% project coverage), so low-hanging fruit is to backfill unit tests before starting work on new features. Writing and maintaining unit tests used to be like eating your veggies, but it's easy now with AI coding tools like Cursor, Tusk, and Codex. Similarly, you should be using tools to generate API and Playwright/Cypress tests so you know that critical flows stay regression-free going forward.
But tests can still miss behavioral regressions: cases where you get a clean 200 with valid JSON, but the values inside are subtly wrong. Observability tools won't alert on this, and your CI stays green because it's difficult to anticipate this exact scenario when writing tests. The only way to catch it is to compare against what the system was actually doing before the change.
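That comparison can be mechanically simple. A sketch, with invented payloads: record the pre-change response as a baseline, then diff it against the post-change response field by field:

```typescript
// Compare a recorded baseline payload against the current one and report
// fields whose values drifted -- the regressions a 200-with-valid-JSON hides.
function behavioralDiff(
  baseline: Record<string, unknown>,
  current: Record<string, unknown>,
): string[] {
  const keys = new Set([...Object.keys(baseline), ...Object.keys(current)]);
  const drifted: string[] = [];
  for (const key of keys) {
    if (JSON.stringify(baseline[key]) !== JSON.stringify(current[key])) {
      drifted.push(key);
    }
  }
  return drifted;
}

// Both responses are 200 with valid JSON; only the diff reveals the bug
// (here, an invented units mix-up in `total`).
const before = { id: 42, total: 8_500, currency: "USD" };
const after = { id: 42, total: 850_000, currency: "USD" };
```

Real traffic replay needs to handle noise (timestamps, request IDs) on top of this, but the core idea is just this diff against recorded behavior.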
---
Our team has been big on code verification tooling like Cubic and Bugbot. And as our usage of these tools increases, so does our bandwidth for parallel agent work.
We also use our own Tusk Drift SDK to record live app traffic to provide our coding agents an understanding of how our product behaves in the real world (helps with Bucket 1), and then use that same traffic to generate realistic unit and API tests (helps with Bucket 3). Because we're testing against real traffic, we catch those silent behavioral regressions that traditional tests miss.
Teams that solve the code verification bottleneck will have a major advantage in this new economy where code is cheap but novel ideas are precious. There is a massive opportunity to unlock more innovation for the world by simply providing smart engineering teams the ability to fearlessly expand the scope of their work.