Where eval variance actually lives.
Why we replicate the judge three times per case, what we found when we measured it, and how that reshapes what a meaningful prompt-edit delta looks like.
We run our eval judge three times per case in parallel and report the median. The reason we started doing this was the textbook one — language models are non-deterministic, so replicate the noisy component to defang its noise. The reason we kept doing it is more interesting: in our pipeline, the judge turned out not to be the noisy component at all. Replication now functions less as a defence and more as a running sanity check.
What replication measured.
On a fixed agent output, with a fixed judge prompt and fixed ground truth, the spread across three parallel judge calls is effectively zero on the large majority of cases. Three identical scores, three for three. On the small minority of cases where the three judges produce slightly different scores, the median is the same as any single draw. Our judge, with our prompt and our model selection, is functionally deterministic.
That is a claim about our setup, not a universal one. A chattier judge prompt, a higher-temperature judge model, or a judge that has to make finer-grained distinctions than ours does could easily produce a real spread. Replication is how a team finds out which world they are in; we ran it expecting to discover the second world and discovered we were in the first.
Where the variance actually lives.
Our agent is not a single model call. It is a chain of typed sub-workflows: pull some features out of the inputs, reason on those features, choose what to do next, run a second call against a prompt the first call helped shape, and so on for several stages. Each call samples its own output. Each sample steers the prompt for the next call. By the time the chain finishes, two runs of the same case with identical inputs can produce visibly different intermediate states.
Composite variance is our name for that. It is the sum of the small probabilistic decisions a chain makes along its way, compounded by the fact that an early decision shapes the context the later decisions see. On our pipeline composite variance is larger than judge variance by enough margin that the comparison stops being interesting.
What this does to a single prompt edit.
A single run after a prompt change reports the sum of two things: the effect of your change, and one draw from the composite distribution. A plus-three the next morning is consistent with a prompt that genuinely improved the pipeline. It is also consistent with the composite happening to roll high on this particular run. Without a second run, the two cannot be separated.
That implication moved the bar for us. A delta from one run, no matter how cleanly it lands, is now treated as a single sample of a noisy random variable, not as evidence that a change worked.
What we do now.
The judge still runs three times per case. The median is still the reported number. We keep that step because we want a continuous check that the judge variance stays where we measured it; if the three-call spread starts widening, the judge has drifted, and finding that out continuously is much cheaper than finding it out after a mysterious regression.
The agent also gets run more than once per case on any change someone wants to call a win or a loss. Not on every commit — the bill would be unmanageable — but on prompt edits, model swaps, and sub-workflow changes that are intended to move the headline. We maintain a rough per-case sense of how wide the composite distribution is, and a delta that lands inside that band does not count as a result.
Most prompt edits, scored honestly, land inside the band. The ones that don't are the ones we ship, and we ship far fewer than we used to.
When the other shape applies.
This is not an argument that the judge does not matter. If your replicated judge calls produce a real spread, that is itself useful information — it points the next chunk of methodology work at the judge prompt, not at the agent. Both worlds exist, and the only way to know which one you are in is to measure. The mistake to avoid is assuming, in either direction, without checking.
Once a team has confirmed that the judge variance is small, the centre of gravity of the eval conversation should move upstream. Most prompt-edit arguments we used to have were about the judge prompt. They were the wrong arguments. The interesting questions, once the judge is known stable, are about the agent's intermediate states: which sub-workflows are sensitive to sampling, where in the chain do small differences amplify, what stages need their own structural fixes rather than prompt tweaks.
The cost.
Replicating the agent is not free, and the budget conversation is real. Tripling the cost of a benchmark run was something we resisted for a while. What changed our minds was the realisation that the compute cost of repeated runs is smaller than the team-hours cost of arguing for two weeks about whether a particular noisy delta was meaningful. The math goes the right way on the scale we care about.
Replication also produces a side benefit we did not budget for: a per-case composite-distribution width is itself a useful artefact. It tells us which cases are stable and which are fragile, and the fragile ones tend to be the cases where a sub-workflow is doing something genuinely unreliable. Targeting those is a clearer signal than chasing a moving headline.
Close.
The score is one component of an eval. The harness, the replication strategy, the noise model, and the rule for declaring a delta meaningful are the rest. In our setup, the judge is the most stable element in the apparatus, and the agent is the loud one. Most of the methodology work since landing on that has been on the agent side of that line, not the judge side.