The bench earns its keep

The agent bench was built to answer one question: which model is best at driving our tools? It runs scenarios with deterministic verifiers (write a util, refactor a symbol, debug a failing test) and reports pass rates, tool counts, and wall times across providers. The first findings were the obvious ones: gpt-4.1-mini was 300× cheaper per passing run than Claude Opus, and qwen3.5 had a "code-review-by-default" prior that ate our add-temperature scenario every time. Useful, but mostly a leaderboard.

What we didn't expect was that the bench would start fixing the agent.

The first thing it surfaced was an Edit-tool failure mode hiding inside the leaderboard. gpt-4.1-mini was passing scenarios but burning ten or fifteen tool calls per run on what should have been four-call jobs. The transcripts showed it: every multi-line old_string the model passed to Edit had some risk of a single corrupted character — an em-dash mistokenized into a vertical tab, a smart quote pasted as a different smart quote — and Edit would reject the whole thing with "old_string not found". The model would re-Read, retry, get a different mistokenization, retry again. Strong models thrashed; weak models gave up. The bug had been there since the agent shipped, but nothing had told us how often it cost us until the bench counted the wasted tool calls.

The fix was a two-stage fallback inside Edit, sitting behind the exact match that remains stage one. Stage two does a line-aware match after stripping trailing whitespace and normalizing CRLF/LF, with leading indent kept exact because Python indentation is semantic. Stage three does a per-line similarity check using difflib.SequenceMatcher with a 0.85 threshold, which is forgiving enough to catch transcoding errors but tight enough that it won't accept a different line that just shares some words. We also had to write our own line-splitter, because Python's splitlines() happily splits on vertical tabs and form feeds, which is exactly what trips when a corrupted control character lands in the middle of old_string. Re-running the bench showed the wins: edit_not_found errors dropped from ten-plus per matrix to zero, and delete-with-callers runs went from 14 tool calls down to 11.
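
For readers who want the shape of that fallback, here is a minimal sketch rather than the agent's actual code. The names split_lines and find_block are hypothetical, and so is the rule that the fuzzy stage only accepts a match when exactly one window in the file qualifies; the post does not say how ambiguity is handled.

```python
import difflib


def split_lines(text: str) -> list[str]:
    """Split on LF/CRLF only. str.splitlines() would also split on
    vertical tab, form feed and other separators, which breaks up a
    line that contains a corrupted control character."""
    return text.replace("\r\n", "\n").split("\n")


def find_block(file_text: str, old_string: str, threshold: float = 0.85):
    """Return (start, end, stage) as 0-based line indices, or None.

    Assumes the exact-match stage has already failed.
    """
    # Trailing whitespace is stripped; leading indent is left untouched.
    file_lines = [line.rstrip() for line in split_lines(file_text)]
    old_lines = [line.rstrip() for line in split_lines(old_string)]
    n = len(old_lines)
    if n == 0 or n > len(file_lines):
        return None
    windows = range(len(file_lines) - n + 1)

    # Line-aware match after whitespace and newline normalization.
    for i in windows:
        if file_lines[i:i + n] == old_lines:
            return (i, i + n, "normalized")

    # Per-line similarity: every line in the window must clear the bar.
    candidates = [
        i for i in windows
        if all(
            difflib.SequenceMatcher(None, have, want).ratio() >= threshold
            for have, want in zip(file_lines[i:i + n], old_lines)
        )
    ]
    # Hypothetical safety rule: refuse to guess between several windows.
    if len(candidates) == 1:
        return (candidates[0], candidates[0] + n, "fuzzy")
    return None
```

Requiring every line to clear the threshold, rather than scoring the block as a whole, is what keeps one badly wrong line from being averaged away by several perfect ones.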

Then the bench surfaced something subtler. find-missing-tests was a deterministic 0/3 failure on gpt-4.1-mini — same pattern every time, two tool calls and out. The transcript was almost too short to read: the model called Glob with an absolute path, got back "Non-relative patterns are unsupported", called Glob again with another absolute path, got the same error, gave up. We checked the bench's system prompt and found the contradiction: the prompt explicitly tells the model to pass absolute paths to be unambiguous about location, but pathlib.Path.glob() rejects absolute patterns by design. A model that followed instructions correctly hit a hard error and lost. The fix was small — detect absolute patterns and route them through stdlib glob.glob instead — and the bench result was emphatic: 0/3 became 5/5.
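
The routing can be sketched in a few lines. This is an illustration under assumptions, not the tool's real implementation: glob_paths and its root parameter are hypothetical names, and enabling recursive=True for the absolute branch is our guess at keeping "**" behaviour consistent between the two paths.

```python
import glob
import os
from pathlib import Path


def glob_paths(pattern: str, root: str = ".") -> list[str]:
    """Resolve a glob pattern, accepting both relative and absolute forms."""
    if os.path.isabs(pattern):
        # Path.glob() refuses absolute patterns, so hand these to the
        # stdlib glob module, which accepts them.
        return sorted(glob.glob(pattern, recursive=True))
    # Relative patterns keep the pathlib route, anchored at root.
    return sorted(str(p) for p in Path(root).glob(pattern))
```

With this in place, a model that follows the system prompt and passes an absolute pattern gets results instead of a hard error, and relative patterns behave as before.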

The third thing the bench told us was that delete-with-callers thrashes regardless of how good the Edit tool gets. Deleting a function via Edit means constructing a multi-line old_string matching the entire def block, which is brittle in proportion to the function's length. We added an AST-driven primitive — DeleteFunction(path, name) — that takes the name and uses Python's ast module to find the def's boundary including any decorators, then splices it out. Eight tools instead of seven. The first run showed a regression we hadn't predicted: gpt-4.1-mini would call DeleteFunction to remove the def, then Grep for the function name in the caller file, see the now-broken references, and declare the task done without cleaning them up. The new tool was so satisfying to use that the model treated using it as completing the task. We baked a one-line reminder into the success message — callers in OTHER files are now broken; you must Grep and Edit each call site before declaring this task complete — and the regression went away. The lesson there is one we keep relearning: tool descriptions are read once at registration, but tool results are read in-context every time. The right place to put the warning is where the warning becomes relevant.
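
A rough sketch of the idea, with the reminder riding along in the result. It differs from the real tool in ways worth flagging: it takes source text rather than a path so the example stays self-contained, it only looks at top-level functions, and the function name delete_function and the exact message layout are hypothetical. Only the reminder wording comes from the tool described above.

```python
import ast

REMINDER = (
    "callers in OTHER files are now broken; you must Grep and Edit "
    "each call site before declaring this task complete"
)


def delete_function(source: str, name: str) -> tuple[str, str]:
    """Remove a top-level function (and its decorators) from source.

    Returns (new_source, result_message).
    """
    tree = ast.parse(source)
    matches = [
        node for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == name
    ]
    if len(matches) != 1:
        raise ValueError(f"expected one top-level def {name!r}, found {len(matches)}")
    node = matches[0]

    # node.lineno is the `def` line; decorators sit above it, so take
    # the earliest line among the def and its decorators (1-based).
    start = min([node.lineno] + [d.lineno for d in node.decorator_list])
    end = node.end_lineno

    # Split on "\n" only so line numbers agree with the parser even if
    # the file contains form feeds or other odd separators.
    lines = source.split("\n")
    new_source = "\n".join(lines[:start - 1] + lines[end:])

    # The warning lives in the tool result, where the model reads it
    # at the moment it becomes relevant.
    message = f"Deleted {name} (lines {start}-{end}). {REMINDER}"
    return new_source, message
```

Because the boundary comes from the parser rather than from a model-supplied old_string, the length of the function no longer affects how likely the deletion is to succeed.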

What we did not fix is the qwen3.5 ceiling. Several scenarios show the same pattern — the model ignores explicit negative constraints in the task ("do not call any of the app's actual methods, only enumerate them") and starts calling them anyway, racking up bad_args errors until it hits the iteration cap. One run died with "XML syntax error on line 9: element <function> closed by </parameter>" — qwen3.5 emitted malformed function-call markup that ollama's wire parser rejected before the agent ever saw it. Neither of those is a tool problem. The path forward there is either a one-shot example in the system prompt for non-cloud providers, or accepting that some scenarios are above this model's capability ceiling — both honest answers.

The bigger thing, looking back, is that we now have a feedback loop the system didn't have before. Tool changes used to land on a "looks fine in manual testing" basis. Now they land with a before-and-after pass rate, an error-category histogram, and a stable run-group ID we can A/B against. Three of the fixes in this session would not have been visible without the bench, and the fourth — the DeleteFunction reminder — would not have been correctly designed without watching the model's behaviour change between runs. The bench was supposed to grade providers. It turned out to grade the tools too. Worth the build.
