evals.
2 posts tagged evals.
- Where eval variance actually lives.
Why we replicate the judge three times per case, what we found when we measured it, and how that reshapes what a meaningful prompt-edit delta looks like.
Read →
- Building a strict benchmark for AI in patent law.
A guest post from the tu-po team on grounding an AI legal agent in EPO Board of Appeal decisions — and what 'correct' even means when the experts disagree.
Read →