Building a strict benchmark for AI in patent law.

A guest post from the tu-po team on grounding an AI legal agent in EPO Board of Appeal decisions — and what 'correct' even means when the experts disagree.

By tu-po Team, Patent automation engineering, 14 May 2026

Tags: customer-stories, evals, legal-ai

We build AI for patent workflows on top of the Podesta platform. Evaluating that agent honestly is the hard part of building it. Patent reasoning is multi-dimensional: it spans the outcome reached, the legal basis cited in support, the route from inputs to conclusion, and the handling of cases on which experts themselves disagree. An eval that collapses those onto a single right-or-wrong axis would miss the failure modes our customers most need us to catch. We built ours to score across those dimensions, and we built the agent and the eval runner that scores it on the same platform: the agent is an Akribes workflow, and so is the judge. This post is about that benchmark: why we anchored it on the European Patent Office's Boards of Appeal, what multi-axis scoring looks like in practice, and what we got out of running the eval pipeline on the same primitives the agent uses.

Why patent law is a brutal domain for AI eval.

Most public benchmarks for legal AI are quiz-shaped. A question, a multiple-choice answer, sometimes a short free-text rationale. They travel well, they compare easily across systems, and they bear almost no resemblance to the artefact our customers actually produce.

A real piece of EPO work is not an answer. It is a structured document — a set of claim amendments with their basis in the application as filed, a novelty argument over a specific prior-art disclosure, an inventive-step chain under the problem-and-solution approach, a reply to a communication that does all three at once and remains admissible under the Rules of Procedure. Correctness in that world is a conclusion, plus the reasoning that justifies it, plus the legal basis cited in support, plus the procedural posture that makes the whole thing arguable in the first place.

Worse, experts disagree. The European Patent Convention is a half-century-old treaty interpreted through tens of thousands of written decisions. Two competent attorneys will read the same set of claims and the same prior art and reach different conclusions on whether an amendment is allowable under Article 123(2) EPC. Sometimes two Boards of Appeal reach different conclusions. A benchmark that does not acknowledge that texture is not measuring patent reasoning; it is measuring something easier and pretending.

Why the EPO Boards of Appeal are useful.

The Boards of Appeal are the final instance for most decisions taken by the EPO's examining and opposition divisions. Their decisions are written, public, and reasoned in detail. The Board states the facts, states the legal question, walks through the applicable case law, and renders a conclusion with the precise legal basis cited. For an AI eval, that is rare: the ground truth has already been written, by people whose job it is to write it.

We are not the first to notice that BoA decisions look like training-or-eval material. We think we are unusual in treating them as eval material specifically, and in declining to use them for training. The whole point of using a Board decision as ground truth is that the system has not seen the answer. Our cases are partitioned accordingly: the agent has access to the application documents and the prior art the Board considered, never to the decision itself.

How we structured the bench.

Each bench entry is a single, narrow legal question grounded in a single BoA decision. A typical entry asks something like: given this set of claims as amended and the application as originally filed, is the amendment allowable under Article 123(2) EPC? Or: given this claim and this prior-art disclosure, is the claim novel under Article 54 EPC?

The shape of a case on disk is deliberately boring. A metadata.json locating the decision in the EPO's case-law system; the inputs the agent sees (claims under review, application as filed); and a ground_truth.json we derive from the Board's reasoning. The ground truth is not a copy of the decision. It is a structured distillation: the conclusion the Board reached, the features or amendments the Board considered determinative, the articles cited, and the line of case law relied on. The decision text sits next to it for traceability, never as input to the agent.
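As an illustration only, a case directory like the one described above could be loaded as follows. This is a minimal Python sketch, not the Akribes code the bench actually runs on. The file names metadata.json and ground_truth.json come from the description above; the inputs/ directory name, the .txt extension, and every field name inside GroundTruth are assumptions made for the example.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class GroundTruth:
    # Field names are hypothetical; they mirror the four parts of the
    # distillation described in the text.
    conclusion: str
    determinative_features: list[str]
    articles_cited: list[str]
    case_law_relied_on: list[str]


@dataclass(frozen=True)
class BenchCase:
    metadata: dict
    agent_inputs: dict[str, str]  # file name -> text the agent may see
    ground_truth: GroundTruth     # handed to the judge, never to the agent


def load_case(case_dir: Path) -> BenchCase:
    """Load one bench case from disk, keeping agent inputs and ground truth apart."""
    metadata = json.loads((case_dir / "metadata.json").read_text(encoding="utf-8"))
    raw_truth = json.loads((case_dir / "ground_truth.json").read_text(encoding="utf-8"))

    # Only files under the (assumed) inputs/ directory are exposed to the agent.
    # The decision text stored next to the case is deliberately not read here.
    inputs_dir = case_dir / "inputs"
    agent_inputs = {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(inputs_dir.glob("*.txt"))
    }
    return BenchCase(metadata, agent_inputs, GroundTruth(**raw_truth))
```

Keeping agent_inputs and ground_truth in separate fields makes the partition explicit: a harness can pass case.agent_inputs to the agent and case.ground_truth to the judge without either seeing the other's material.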

The agent's output is also structured. Our pipeline is built in Akribes (Podesta's typed workflow language), and every stage returns a typed value: a list of claim features here, a problem-solution analysis there, a final conclusion with citations. That matters for the judge: we are not comparing free-text essays, we are comparing fields.

The scoring rubric.

A case score is not a single number. The judge produces a structured breakdown across several axes — the outcome the agent reached, the legal basis cited in support, the route from inputs to conclusion, and how the agent handled the cases that even the Board flagged as borderline. The headline composite is built from those, and we publish the per-axis decomposition alongside the headline whenever we publish a score at all. A pipeline can move the headline up by getting better at citing the right authorities without changing its outcome accuracy, and that has to read as a different signal than the same headline move from better outcome accuracy.

The rubric is "strict" in the sense that the legal-basis and reasoning-route axes act as severe multipliers on the outcome axis rather than as independent additions. A right answer with a fabricated citation lands much lower than a right answer with a real citation, even though the answer is identical. We have seen enough models reach the correct outcome via case law that does not exist, or via an article that does not say what they think it says. Letting that count as a win would teach us to ship a system that bluffs convincingly, which is the failure mode our customers most need us to catch.
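To make the multiplier idea concrete, here is a minimal Python sketch of a multiplicative composite. It is an illustration under assumptions, not the rubric itself: the 0-to-1 scale, the plain product, and the names AxisScores and strict_composite are all hypothetical, and the calibration axis is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisScores:
    # Each axis is assumed to be scored on a 0.0 to 1.0 scale.
    outcome: float
    legal_basis: float
    reasoning_route: float


def strict_composite(scores: AxisScores) -> float:
    """Multiply the axes so a weak legal basis or route drags down a correct outcome."""
    for name, value in vars(scores).items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return scores.outcome * scores.legal_basis * scores.reasoning_route


# Same outcome, different citations: the composite separates the two.
fabricated = AxisScores(outcome=1.0, legal_basis=0.1, reasoning_route=0.9)
grounded = AxisScores(outcome=1.0, legal_basis=0.9, reasoning_route=0.9)
assert strict_composite(fabricated) < strict_composite(grounded)
```

Under an additive scheme the fabricated-citation case would still collect full marks for the outcome. Under a product it cannot score higher than its weakest axis, which is the behaviour the paragraph above describes.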

Calibration matters too. A confident wrong outcome on a case the Board called clearly is scored differently from a confident wrong outcome on a case the Board itself flagged as unusual on its facts. We want the agent to register the difference between "the answer is hard and the system should hedge" and "the answer is clear and the system should commit"; the rubric rewards the agent for telling the two apart and penalises it when it picks the wrong one.

The headline target is a high bar on the composite under that rubric: a large majority of cases in which the agent reached the right conclusion, by reasoning a Board would accept, citing the authorities the Board itself relied on. Even a system that clears that bar is not one anyone should let near a real file unsupervised, and we say so. It is the bar we ratchet towards.

The judge is also an LLM. We know.

We will not pretend otherwise. The judge that scores agent outputs against ground truth is itself a model. Three things keep that honest. First, the judge prompt is versioned alongside the cases, so a score is reproducible against a specific judge version and case set. Second, we run replicated judge calls and take the median, with outliers logged to disk. Third, we have characterised, separately, where the variance in our composite score actually comes from. For our current pipeline, the judge is not the loud term. The composite is. We wrote that up in "The judge isn't the variance. The composite is."
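The replicate-and-take-the-median step can be sketched in a few lines of Python. This is an illustration only: the replica count, the outlier tolerance, the JSON-lines log format, and the function name replicated_judge_score are assumptions, and the judge callable stands in for whatever actually invokes the model.

```python
import json
import statistics
from pathlib import Path
from typing import Callable, Optional


def replicated_judge_score(
    judge: Callable[[], float],
    replicas: int = 5,                 # assumed replica count
    outlier_tolerance: float = 0.15,   # assumed distance from the median
    log_path: Optional[Path] = None,
) -> float:
    """Call the judge several times, return the median, and log outlying replicas."""
    scores = [judge() for _ in range(replicas)]
    median = statistics.median(scores)

    # A replica counts as an outlier when it sits further from the median
    # than the tolerance allows.
    outliers = [s for s in scores if abs(s - median) > outlier_tolerance]
    if outliers and log_path is not None:
        record = {"median": median, "scores": scores, "outliers": outliers}
        with log_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
    return median
```

The median keeps a single stray judge call from moving the case score, while the log preserves that call for later inspection rather than silently discarding it.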

The hard cases.

Some cases in the bench are genuinely hard, and we put them there on purpose. There are Board decisions where the Board itself notes that the outcome is unusual on its facts. There are decisions where the applicable case law has shifted across the Boards over the years. There are decisions where two Boards have ruled differently on what looks, at first reading, like the same legal question, and the divergence has not yet been resolved by the Enlarged Board.

A bench that excludes those is easier to score well on, and a system tuned to score well on it will quietly learn to bluff through edge cases at deployment. We want the opposite. The hard cases are where we want to see the agent hesitate, qualify, cite the relevant divergence, or flag the question as one a human attorney should resolve. A confident wrong answer on a hard case costs us more than a hedged answer that turned out right.

Akribes for the agent — and for the runner.

We build the agent itself in Akribes because legal reasoning, looked at without the storytelling, is a typed pipeline. Extract the relevant claim features from the application as filed. Identify the disclosed features in the cited prior art. Reason on novelty over each independent claim. If the claim is novel, reason on inventive step under problem-and-solution. Assemble the response in the procedural posture the case is actually in.

Each of those steps is a sub-workflow with a typed output. When the case law on a particular interpretive question shifts — and it does, about as often as you would expect for a treaty interpreted by a standing tribunal — we change one stage. The type system catches the downstream stages that no longer fit before the next eval run, not in production three weeks later. That is the difference between a refactor that lands cleanly and a refactor that quietly degrades the bench score in a way nobody notices for a month.
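The staged shape described above can be illustrated in Python with typed values passed between stages. This is a sketch under assumptions, not Akribes and not the production pipeline: the type and function names are hypothetical, and the stage bodies are toy stand-ins (list comparisons) for what would really be model calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClaimFeatures:
    features: list[str]


@dataclass(frozen=True)
class PriorArtFeatures:
    disclosed: list[str]


@dataclass(frozen=True)
class NoveltyFinding:
    novel: bool
    distinguishing_features: list[str]


@dataclass(frozen=True)
class InventiveStepFinding:
    objective_problem: str


def assess_novelty(claim: ClaimFeatures, prior_art: PriorArtFeatures) -> NoveltyFinding:
    # Toy rule: the claim is novel if at least one feature is not disclosed.
    missing = [f for f in claim.features if f not in prior_art.disclosed]
    return NoveltyFinding(novel=bool(missing), distinguishing_features=missing)


def assess_inventive_step(novelty: NoveltyFinding) -> Optional[InventiveStepFinding]:
    # Inventive step is only reached when the claim is novel.
    if not novelty.novel:
        return None
    problem = "how to provide: " + ", ".join(novelty.distinguishing_features)
    return InventiveStepFinding(objective_problem=problem)


claim = ClaimFeatures(["housing", "sensor", "heated seal"])
art = PriorArtFeatures(["housing", "sensor"])
finding = assess_novelty(claim, art)
step = assess_inventive_step(finding)
```

If a stage's output type changes, for example NoveltyFinding gaining or losing a field, a static checker such as mypy flags every downstream function that still expects the old shape. That is the Python analogue of the early failure described above.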

The eval runner that scores the agent is also an Akribes workflow. The agent runs as one script, the judge — given the agent's structured output and the ground-truth distillation — runs as another, and a small harness composes them across cases. We did not plan it that way; we backed into it because the alternative was maintaining a second copy of the model-provider abstractions, the streaming protocol, the token budgeting, and the cost accounting we already had load-bearing on the agent side. By the time the harness was real, the side benefit was that the judge prompt sits in a script a patent attorney can read. A judge buried in Python with bespoke plumbing is opaque to anyone who does not also write Python; a judge as an Akribes script reads as the sequence of checks it actually is, and our domain experts can suggest edits to it in the same way they suggest edits to the agent.
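The composition described above (agent, then judge, repeated across cases) is small enough to sketch. The Python below is an illustration only: in the real setup both the agent and the judge are Akribes scripts, and the names run_bench, Agent, and Judge, along with the tuple layout of a case, are assumptions for the example.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical shapes: an agent maps its inputs to a structured output, and a
# judge maps (agent output, ground truth) to a score.
Agent = Callable[[dict], dict]
Judge = Callable[[dict, dict], float]
Case = Tuple[str, dict, dict]  # (case id, agent inputs, ground truth)


def run_bench(cases: Iterable[Case], agent: Agent, judge: Judge) -> dict[str, float]:
    """Run the agent on each case, then score its output against the ground truth."""
    results: dict[str, float] = {}
    for case_id, agent_inputs, ground_truth in cases:
        # The agent receives only its inputs; the ground truth goes to the judge.
        output = agent(agent_inputs)
        results[case_id] = judge(output, ground_truth)
    return results
```

Because the harness only needs two callables, the same loop works whether the agent and judge are local stubs in a test or remote workflow invocations.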

The newer version of this loop will run on Podesta Studio's bench panel, which adds visual case-by-case views, per-stage score breakdowns, and the same live event stream a Studio run already uses. The current version — the one this post describes — runs on the same primitives a platform customer has access to. We did not lean on internal tooling nobody else has.

What this is and is not.

The eval we have built is not a solution to patent AI; it is a scoring system we are willing to be measured against. The headline number will move when the pipeline improves. It will also move when we expand the case set, when the Enlarged Board issues a precedent-shifting decision, or when the judge prompt itself is revised to catch a failure mode it was previously letting through. We publish those deltas with their causes attached, because attribution is what separates a strict eval from a slogan — and because, in a domain where a confidently wrong brief is a patent application sunk on appeal, attribution is the part of the methodology that most directly translates to whether a customer can rely on the system at all.
