January 17, 2026

The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material.
- Michelangelo
Here’s a quick thought experiment. Let’s say you were given a SQL query:

```sql
SELECT * FROM orders WHERE id = 999
```
...and you needed to find a matching query from a list of queries below:
```sql
SELECT * FROM users WHERE id = 999
SELECT * FROM teams WHERE id = 456
SELECT * FROM orders WHERE id = 789
```

Most engineers, particularly those coming from an ML background, see mock matching and come up with the same approach: collect all possible candidates, score them all against the original query by cosine similarity, and pick the query with the highest score.
Similarity scoring might give you these results:
users query: 0.92 similarity
teams query: 0.89 similarity
orders query: 0.94 similarity

By this logic, you would pick the orders query with 0.94 similarity.
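For illustration, here is a minimal sketch of that naive approach. The bag-of-words cosine similarity below is a stand-in for whatever embedding model you might use, and none of these helpers are Tusk Drift's actual code:

```typescript
// Naive approach: score every candidate against the outbound query
// and pick the highest cosine similarity, ignoring all other context.

function vectorize(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

function cosineSimilarity(a: string, b: string): number {
  const va = vectorize(a);
  const vb = vectorize(b);
  let dot = 0;
  for (const [token, count] of va) dot += count * (vb.get(token) ?? 0);
  const norm = (v: Map<string, number>) =>
    Math.sqrt([...v.values()].reduce((sum, n) => sum + n * n, 0));
  return dot / (norm(va) * norm(vb) || 1);
}

function pickBySimilarity(query: string, candidates: string[]): string {
  return candidates
    .map((candidate) => ({ candidate, score: cosineSimilarity(query, candidate) }))
    .sort((a, b) => b.score - a.score)[0].candidate;
}

// Chooses whichever candidate scores highest on raw text similarity.
// It knows nothing about whether a mock is already used up, or whether
// it was recorded for a GET when the current request is a POST.
pickBySimilarity("SELECT * FROM orders WHERE id = 999", [
  "SELECT * FROM users WHERE id = 999",
  "SELECT * FROM teams WHERE id = 456",
  "SELECT * FROM orders WHERE id = 789",
]);
```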
But to throw a spanner in the works, what if the orders query is “used up” and unavailable to be matched? Or what if it's for a POST when you need a GET? Similarity scoring doesn't know about any of this context.
Over the course of building Tusk Drift, our CLI tool that records and replays live traffic for testing, the core technical challenge we faced was this: when your service makes an outbound call during replay, which recorded mock should we return to serve as test data?
You might approach this thinking, "Use Levenshtein or embedding search." We thought that too initially, but it turned out to be sub-optimal. You solve this problem by layering constraints that you progressively relax, not by similarity scoring.
Tusk Drift’s core value proposition is that it removes the need to manually mock API responses or set up complicated dependencies for integration tests. Therefore, a lot was riding on Tusk’s ability to select the correct recorded trace to replay against the developer’s changes when its tests run.
Tusk has to do this while encountering traces containing noisy inputs, repeated calls, and multiple protocols. At the same time, it needs to make sure matching happens fast enough to run inside every test (there can be thousands of tests in the suite).
After dogfooding Tusk Drift, we found that pure similarity scoring would sometimes fail in obvious ways because it ignores semantic constraints (GET vs. POST), lifecycle constraints (pre-app-start vs. runtime), and temporal constraints ("used" vs. "unused").
We realized we needed to instead work backwards from exclusions. In other words, could we look at all the ways obviously-wrong matches could happen and create guardrails around them?
Some examples of guardrails to enforce (sketched in code after this list):
- Never match a request for /users against a mock recorded for /teams.
- Never match a GET against a mock recorded for a POST.
- Never reuse a mock that has already been consumed earlier in the replay.

While working backwards to a solution isn’t always scalable, we are fortunate that the universe of ways a match can be obviously wrong is limited. So with a handful of constraints in place, you very quickly converge on a precise answer.
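Sketched out, these guardrails are just cheap predicate checks that run before any scoring. The field names below are hypothetical, not Tusk Drift's actual mock schema:

```typescript
// Hypothetical shape of a recorded mock; the real Tusk Drift schema may differ.
interface RecordedMock {
  method: string;                          // e.g. "GET" or "POST"
  path: string;                            // e.g. "/users/999"
  lifecycle: "pre-app-start" | "runtime";
  used: boolean;                           // already consumed earlier in this replay?
  body?: string;
}

interface OutboundRequest {
  method: string;
  path: string;
  lifecycle: "pre-app-start" | "runtime";
  body?: string;
}

// Exclude obviously-wrong candidates before anything gets scored.
function passesGuardrails(req: OutboundRequest, mock: RecordedMock): boolean {
  if (mock.used) return false;                        // temporal: already consumed
  if (mock.method !== req.method) return false;       // semantic: GET vs. POST
  if (mock.lifecycle !== req.lifecycle) return false; // lifecycle: pre-app-start vs. runtime
  if (firstPathSegment(mock.path) !== firstPathSegment(req.path)) return false; // /users vs. /teams
  return true;
}

function firstPathSegment(path: string): string {
  return path.split("/").filter(Boolean)[0] ?? "";
}
```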
This raises the question of the relative importance of these constraints. Some constraints are strict: if a match satisfies them, that is a clear signal the match is good. Others are looser and are meant to serve as a catch-all.
To get to the full priority cascade (P1-P14) above, we further split our matching algorithm along two orthogonal dimensions. Along each dimension, the cascade of constraints runs from the most granular to the least granular.
The first axis is scope (where to search), where Tusk goes from Trace → Suite, while the second axis is criteria (what to match on), where it goes from Exact value → Reduced value → Schema. These compose independently. You can have "exact value at suite scope" or "schema match at trace scope."
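To make the composition concrete, here is an illustrative sketch of the two axes as types. The level numbers and the ordering of the composed levels are our guesses for illustration, not Tusk Drift's actual P1-P14 definitions:

```typescript
// The two orthogonal dimensions of matching.
type MatchScope = "trace" | "suite";                              // where to search
type MatchCriteria = "exact-value" | "reduced-value" | "schema";  // what to match on

interface CascadeLevel {
  priority: number;      // lower = stricter, tried first
  scope: MatchScope;
  criteria: MatchCriteria;
}

// Compose the two axes independently, ordered from most to least granular.
// (Illustrative ordering only; the real cascade has more levels than this.)
function buildCascade(): CascadeLevel[] {
  const criteria: MatchCriteria[] = ["exact-value", "reduced-value", "schema"];
  const scopes: MatchScope[] = ["trace", "suite"];
  const levels: CascadeLevel[] = [];
  let priority = 1;
  for (const c of criteria) {
    for (const scope of scopes) {
      levels.push({ priority: priority++, scope, criteria: c });
    }
  }
  return levels;
}

// Yields e.g. { priority: 1, scope: "trace", criteria: "exact-value" } down to
//             { priority: 6, scope: "suite", criteria: "schema" }.
const cascade = buildCascade();
```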
Getting this right was important not just for running idempotent tests, but also for determining if Tusk Drift had uncovered an API regression or simply an intended schema change.
The match level (based on the above priority cascade) between the current and recorded request provides valuable signal as to whether a deviation is a regression. Consider these three scenarios:
The constraint metadata in each case tells us why a given mock was chosen, which lets Tusk make a better judgement about how much the resulting deviation matters.
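As a rough sketch of the idea (the types and the heuristic below are ours, not Tusk's), each replay result can carry the cascade level that produced its mock, and that metadata feeds the regression-versus-intended-change judgement:

```typescript
interface MatchMetadata {
  priority: number;                                   // which cascade level matched (1 = strictest)
  scope: "trace" | "suite";
  criteria: "exact-value" | "reduced-value" | "schema";
}

interface ReplayDeviation {
  hasResponseDiff: boolean;
  match: MatchMetadata;
}

// Illustrative heuristic: a diff behind a strict, exact-value match is a much
// stronger regression signal than one behind a loose, schema-only fallback.
function classifyDeviation(
  d: ReplayDeviation
): "likely-regression" | "possibly-intended-change" | "no-deviation" {
  if (!d.hasResponseDiff) return "no-deviation";
  if (d.match.criteria === "exact-value" && d.match.scope === "trace") {
    return "likely-regression";
  }
  return "possibly-intended-change";
}
```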
We’ve been throwing a lot of shade at similarity scoring, but we haven’t entirely forgone it. Where we find it most helpful is in playing the role of a tiebreaker.
At the end of the day, Tusk still computes Levenshtein distance to make an intelligent choice when multiple recorded mocks satisfy the same constraint.
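A minimal version of that tiebreak might look like the sketch below (a textbook Levenshtein implementation; the helper names are ours, not from Tusk's codebase):

```typescript
// Standard dynamic-programming Levenshtein distance.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return dp[a.length][b.length];
}

// When several mocks pass the same constraint level, prefer the textually
// closest one. Assumes candidates is non-empty.
function tiebreak<T>(requestBody: string, candidates: T[], bodyOf: (c: T) => string): T {
  return candidates.reduce((best, c) =>
    levenshtein(requestBody, bodyOf(c)) < levenshtein(requestBody, bodyOf(best)) ? c : best
  );
}
```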
With that said, the reality is that in production, 89.6% of Tusk Drift requests never reach similarity scoring. They hit P1-P6 (exact/reduced input value hash matching) in the priority cascade before the best mock is chosen for replay testing.
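Those exact and reduced hash matches can be plain map lookups. Here is a rough sketch of what that could look like; the "reduced" normalization rules below (stripping timestamps and UUIDs) are purely our illustration of the idea, not Tusk Drift's actual reduction logic:

```typescript
import { createHash } from "node:crypto";

function hash(value: string): string {
  return createHash("sha256").update(value).digest("hex");
}

// Exact hash: the raw request input, hashed as-is.
function exactHash(input: string): string {
  return hash(input);
}

// Reduced hash: strip volatile noise (here, timestamps and UUIDs) before hashing,
// so the same call with a different request ID still lands in the same bucket.
function reducedHash(input: string): string {
  const reduced = input
    .replace(/\d{4}-\d{2}-\d{2}T[\d:.]+Z?/g, "<timestamp>")
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "<uuid>");
  return hash(reduced);
}

// Index recorded mock IDs by both hashes once per replay...
const byExactHash = new Map<string, string[]>();
const byReducedHash = new Map<string, string[]>();

function addToIndex(index: Map<string, string[]>, key: string, mockId: string): void {
  const bucket = index.get(key) ?? [];
  bucket.push(mockId);
  index.set(key, bucket);
}

function indexMock(mockId: string, recordedInput: string): void {
  addToIndex(byExactHash, exactHash(recordedInput), mockId);
  addToIndex(byReducedHash, reducedHash(recordedInput), mockId);
}

// ...then each outbound request is an O(1) lookup: exact hash first, reduced as fallback.
function lookupMocks(requestInput: string): string[] {
  return (
    byExactHash.get(exactHash(requestInput)) ??
    byReducedHash.get(reducedHash(requestInput)) ??
    []
  );
}
```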
Similarity scoring doesn't understand semantics. Constraints, on the other hand, encode this semantic knowledge. They're the boring part of your matching engine: you might not get excited about lifecycle filters or HTTP shape validation, but they've earned their place.
If you're building any kind of complex matching system, whether it's mocks, search, or recommendations, start by figuring out which constraints prevent catastrophically wrong matches.
Build those constraints in first. Make them fail fast if possible. And layer them in a priority cascade so you can progressively relax them to find the best match. Similarity can then serve as the tiebreaker when multiple candidates pass the same constraints.
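Put together, the whole pattern fits in a loop like this generic sketch (our illustration of the approach described above, not Tusk Drift's implementation):

```typescript
interface Level<Req, Cand> {
  name: string;
  // Hard constraint: does this candidate satisfy this level for this request?
  accepts: (req: Req, candidate: Cand) => boolean;
}

// Walk the cascade from strictest to loosest; similarity only ever breaks ties
// among candidates that already passed the same constraint level.
function selectMock<Req, Cand>(
  req: Req,
  candidates: Cand[],
  cascade: Level<Req, Cand>[],
  similarity: (req: Req, candidate: Cand) => number
): Cand | undefined {
  for (const level of cascade) {
    const passing = candidates.filter((c) => level.accepts(req, c));
    if (passing.length === 1) return passing[0];
    if (passing.length > 1) {
      return passing.reduce((best, c) =>
        similarity(req, c) > similarity(req, best) ? c : best
      );
    }
    // Nothing passed this level: relax to the next, looser level.
  }
  return undefined; // no acceptable mock; better to fail loudly than match wrongly
}
```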
In our case, 9 out of 10 requests never reach similarity scoring. They hit exact or reduced input-value hash matches in O(1). The algorithm degrades gracefully for the remaining 10%, but only after constraints have filtered the search space from thousands of candidates down to a handful.
Chisel away at the dataset, my fellow devs. The match is already in there.