Harness vs harness: why agent scaffolds differ so much at coding

If you've watched two coding agents work the same task with the same underlying model and gotten dramatically different results, you've already seen the thing this post is about. The model is one variable; the harness — the scaffolding around the model that decides what it sees, when, and how it's allowed to act — is another, and it's the one I've been spending most of my time on.

What I mean by "harness"

I'm using harness loosely to cover everything that isn't the model weights:

  • The tool grammar: what tools exist, how they're described, how outputs are echoed back.
  • The control flow: single loop vs planner/executor vs critic-augmented, and where each one draws the boundary between passes.
  • The memory: what gets remembered between steps, how summarization is triggered, and what's silently dropped.
  • The environment surface: sandbox, filesystem view, allowed network egress, latency budget.

Any one of these can change a benchmark score by more than swapping models.

The experiment

I ran the same model across three harness variants on a fixed task set:

  1. Single-loop ReAct — one model, one tool grammar, no planning.
  2. Planner / executor split — a planning pass that produces a structured plan, then an executor that follows it tool-call by tool-call.
  3. Planner / executor / critic — same as #2 with a per-step critic that can interrupt and re-plan.

Same model. Same tools. Same task list. Same temperature.
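
To make the difference between variants 2 and 3 concrete, here is a minimal sketch of the control flow, not the code I actually ran. The names `run_harness`, `planner`, `executor`, `critic`, and `max_replans` are all hypothetical, and the model calls are stood in for by plain callables. With `critic=None` the loop behaves like the planner/executor split; with a critic, a rejected step triggers a re-plan from the history so far.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces, assumed for illustration only.
Planner = Callable[[str, List[str]], List[str]]  # (task, history) -> ordered steps
Executor = Callable[[str], str]                  # step -> observation
Critic = Callable[[str, str], bool]              # (step, observation) -> True if acceptable


def run_harness(
    task: str,
    planner: Planner,
    executor: Executor,
    critic: Optional[Critic] = None,
    max_replans: int = 2,
) -> List[str]:
    """Execute a plan step by step; optionally let a critic force a re-plan."""
    history: List[str] = []
    replans = 0
    plan = planner(task, history)
    i = 0
    while i < len(plan):
        step = plan[i]
        observation = executor(step)
        history.append(f"{step} -> {observation}")
        rejected = critic is not None and not critic(step, observation)
        if rejected and replans < max_replans:
            # Critic interrupt: throw away the rest of the plan and re-plan
            # with everything observed so far.
            replans += 1
            plan = planner(task, history)
            i = 0
            continue
        i += 1
    return history
```

The `max_replans` cap is a design choice of the sketch: without some bound, a critic that never approves would loop forever.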

The spread between the worst and best harness was wider than the spread between this model and the next size up. That's not a new result — it's been quietly true in the literature for a year — but living it changes how you think about where to invest engineering effort.

What actually moved the score

Three things, in roughly descending order of impact:

1. Tool error echoes

When a tool fails, what the model sees next matters more than almost anything else. Truncated tracebacks lose context; verbose tracebacks blow the context window. The harness that won wasn't the smartest — it was the one whose tool errors read like a senior engineer's terminal output.
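
One way to get between "truncated" and "verbose" is to keep the head and tail of the error output and mark the gap explicitly. The sketch below is an illustration under my own assumptions, not the winning harness's actual formatter: `echo_tool_error` and its `head` and `tail` parameters are hypothetical names, and the terminal-style framing (`$ command`, exit code footer) is just one plausible layout.

```python
def echo_tool_error(
    command: str,
    exit_code: int,
    stderr: str,
    head: int = 5,
    tail: int = 15,
) -> str:
    """Format a failed tool call the way a terminal session would show it.

    Keeps the first `head` and last `tail` lines of stderr and replaces the
    middle with an explicit marker, so the model knows something was cut.
    """
    lines = stderr.rstrip("\n").splitlines()
    if len(lines) > head + tail:
        omitted = len(lines) - head - tail
        lines = (
            lines[:head]
            + [f"... [{omitted} lines omitted] ..."]
            + lines[len(lines) - tail:]
        )
    body = "\n".join(lines)
    return f"$ {command}\n{body}\n[exit code {exit_code}]"


if __name__ == "__main__":
    noisy = "\n".join(f"frame {i}" for i in range(100))
    print(echo_tool_error("pytest -x tests/", 1, noisy, head=2, tail=3))
```

The tail is deliberately larger than the head by default, on the assumption that the most useful part of a traceback (the exception type and message) sits at the bottom.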

2. Memory eviction policy

Naive truncation ("keep the last N turns") loses the bug repro the model was about to fix. Summary-based memory loses the exact line numbers it needs. Hybrid approaches — keep verbatim tool outputs from the last K steps, summarize everything before — outperformed both.
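
The hybrid policy can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `Step` record and a caller-supplied `summarize` function (in a real harness that would probably be another model call); `build_context` and `keep_last` are names I made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    action: str  # the tool call the model issued
    output: str  # the verbatim tool output


def build_context(
    steps: List[Step],
    keep_last: int,
    summarize: Callable[[List[Step]], str],
) -> List[str]:
    """Keep the last `keep_last` steps verbatim; summarize everything before."""
    split = max(len(steps) - keep_last, 0)
    old, recent = steps[:split], steps[split:]
    context: List[str] = []
    if old:
        context.append("[summary of earlier steps] " + summarize(old))
    for step in recent:
        # Recent outputs are passed through untouched so exact line numbers
        # and repro details survive.
        context.append(f"$ {step.action}\n{step.output}")
    return context


def list_actions(steps: List[Step]) -> str:
    """Placeholder summarizer: just lists the actions that were taken."""
    return "; ".join(step.action for step in steps)
```

For example, with five steps and `keep_last=2`, the context holds one summary entry covering the first three steps followed by the last two steps verbatim.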

3. Where the plan lives

A plan stored as text in the conversation gets re-encoded on every turn and quietly mutates. A plan stored as structured state, edited via tool calls, doesn't. The structured version sometimes feels heavier, but it stops the "drift" failure mode where the model forgets what it was originally doing.
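
As a rough sketch of what "structured state, edited via tool calls" can look like: the plan lives in an object outside the conversation, the model changes it only through a small set of operations, and the harness renders it fresh into each prompt. The class and method names below (`Plan`, `PlanStep`, `add_step`, `set_status`, `render`) and the three status values are hypothetical, chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List

VALID_STATUSES = {"pending", "done", "dropped"}


@dataclass
class PlanStep:
    description: str
    status: str = "pending"


@dataclass
class Plan:
    goal: str  # set once, never rewritten by the model
    steps: List[PlanStep] = field(default_factory=list)

    def add_step(self, description: str) -> int:
        """Tool call: append a step and return its index."""
        self.steps.append(PlanStep(description))
        return len(self.steps) - 1

    def set_status(self, index: int, status: str) -> None:
        """Tool call: mark a step; unknown statuses are rejected."""
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        self.steps[index].status = status

    def render(self) -> str:
        """Render the plan for the prompt from state, not from prior text."""
        marks = {"pending": " ", "done": "x", "dropped": "-"}
        lines = [f"GOAL: {self.goal}"]
        for i, step in enumerate(self.steps):
            lines.append(f"[{marks[step.status]}] {i}. {step.description}")
        return "\n".join(lines)
```

Because `render` always starts from the stored state, the goal line and step wording cannot mutate from turn to turn; the only way the plan changes is through an explicit, logged call.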

What didn't matter as much as I expected

  • Number of critic passes. One critic is good. More critics mostly add latency.
  • Temperature. Below ~0.4, the score curve is nearly flat for these tasks.
  • System prompt verbosity. Past a certain point of clarity, more words hurt.

Where I'm going next

The next deep dive is on speculative decoding — same theme, different layer of the stack: how to keep the latency-per-step budget low enough that the harness can afford to do more interesting things between calls. Linked from the timeline once it's up.