Podesta — Blog

What's worth writing when code is cheap.

2026-05-14T00:00:00.000Z

The naive view is that since AI lowers the cost of writing code, we should write more of it. Why deliberate over a feature when you can prototype five versions by lunch? The conclusion seems to follow, but I think it gets the economics backwards.

Writing code has never really been the bottleneck for most software. Reading it, debugging it, deploying it safely, integrating it with other systems, keeping it working as the world around it changes: these dominate the lifecycle cost, often by an order of magnitude. AI compresses the writing step dramatically and the others much less. So if anything, the balance between what code costs to write and what it costs to live with has gotten worse, not better: ownership is now an even larger share of the total. Code that was barely worth writing before is now trivially easy to write and just as expensive to own. The temptation to generate it has gone up; the wisdom of generating it has gone down.

This makes scope discipline more important, not less. Every feature, endpoint, config option, and abstraction is a small ongoing tax — a thing that can break, that someone has to understand later, that constrains future changes, that has to be kept secure. When writing was expensive, the cost of writing acted as a natural filter; a lot of bad ideas died because nobody could be bothered to implement them. That filter is gone. Whatever replaces it has to be deliberate: a clearer sense of what the system is for, a higher bar for what gets in, more willingness to delete.

Small scope also matters because verification hasn't gotten meaningfully cheaper. AI can produce a 2,000-line module in a minute, but a human still has to convince themselves it's correct, and that takes about as long as it always did (longer, sometimes, because the code doesn't carry the author's intent the way human-written code does). A small, sharply scoped system you can hold in your head is verifiable. A sprawling one generated in an afternoon is just a liability with good test coverage.

There's a more subtle thing happening with abstractions, too. Good abstractions used to emerge partly from the pain of repetition — you wrote something three times, noticed the shape, and factored it out. When code is cheap, the pain signal weakens, and you get either premature abstractions (AI-suggested patterns imposed too early) or the opposite, sprawling parallel implementations that never get unified because each one was cheap to produce. Both are worse than what a thoughtful human under cost pressure would have built.

Two regimes

"Code being cheap" isn't a uniform shift, though. It's a wedge that splits the world of software into two regimes with very different economics.

In the disposable regime, code's purpose is to run once or a few times and then vanish. Data migrations, scratch analyses, scrapers for a specific moment, glue between two systems for a weekend, internal dashboards for a question nobody will ask again next quarter. Maintainability is genuinely a non-issue because there's nothing to maintain — the code's lifespan is shorter than the half-life of the assumptions it encodes. Here, AI is close to pure upside. The traditional advice ("write it well, you'll thank yourself later") was always partly wrong for this regime; it just wasn't worth distinguishing, because writing throwaway code carefully cost almost the same as writing it carelessly. Now the cost gap is huge, and the right move is to lean fully into disposability — accept the mess, get the answer, delete the artifact.

In the durable regime, everything above applies and then some. And the wedge gets sharper because the same productivity boost that lets you spin up a one-off script in five minutes also tempts you to spin up a "small service" in an afternoon that will then live for eight years. The danger isn't writing more code per se; it's misclassifying durable code as disposable at the moment of creation, when the cost of being wrong is invisible. A lot of bad long-lived systems started life as someone's quick prototype that worked well enough to ship.

The underrated capability

The most underrated effect, I think, is on abstractions inside existing codebases. Previously, improving an abstraction across a large codebase was a project: you'd weigh whether the win justified the churn, and often the answer was no, so cruft accumulated. Now you can do it in an afternoon, which means the equilibrium shifts. Abstractions that were "good enough given migration cost" no longer are. Codebases that don't actively improve their internal vocabulary will feel increasingly creaky against ones that do. Refactoring stops being a periodic cleanup event and becomes closer to a continuous practice — you fix the shape of the code as soon as you notice the shape is wrong, because the cost of fixing it has collapsed.

Internal tools, the interesting middle

Internal tools sit somewhere interesting on the disposable/durable spectrum. They're often durable in calendar time but disposable in commitment — they're allowed to be ugly, narrow, and brittle in ways customer-facing code isn't. AI is unusually well-suited here, because the quality bar is "solves the problem for the five people who use it" rather than "withstands adversarial input from millions." A lot of organizations underbuilt internal tooling for decades because the labor cost couldn't be justified against a small user base; that math has now flipped entirely, and I'd expect the internal-tools surface area inside good engineering orgs to expand significantly over the next few years.

What not to write at all

There's a deeper question upstream of all of this, though. The regime distinction is useful, but the binding constraint isn't really writing or even refactoring — it's attention. Every piece of code you own takes a permanent slice of finite human bandwidth: someone has to read the PR that touches it next year, debug it at 3am when a dependency changes, decide what to do when a CVE drops against a transitive dep, hold its shape in mind when designing the thing next to it. AI helps with all of those locally — patches, fixes, summaries — but the deciding still routes through a human, and the human's bandwidth is fixed. Token spend lands on the same bill: an agent traversing 200kloc to do a small thing costs more than one traversing 20kloc, forever.

So the question that probably matters most isn't "which regime is this code in" or even "should I write this well or quickly," but "should this code exist in my life at all, or am I better off taking someone else's slightly-worse version and never thinking about the problem again." AI makes the surface appeal of building obvious — we could just build our own now — but the deeper analysis usually argues the other way. The interesting failure mode of cheap code isn't bad code. It's a quietly expanding surface area of things you're now responsible for, none of which individually seemed like a bad idea at the time.

The line of "obviously don't build this yourself" has been moving down the stack for a long time, and AI is moving it faster. Things teams comfortably built in-house a few years ago — auth, search, feature flags, job queues, admin panels, observability glue, large chunks of internal platform engineering — are now better consumed than built for most. Not because building them is hard. Because owning them is. The gap between what we could build and what we should own is probably the most important thing to keep an eye on right now.

The synthesis

The old default was to treat all code as if it might become durable, because you can't afford to rewrite. The new default has two layers. Upstream: don't write the code at all if a workable version of it exists in the world and you can live with workable. Downstream, for whatever survives that filter: decide which regime it's in, and commit to that regime's discipline. Disposable code gets fully disposable treatment: fast, ugly, gone. Durable code gets more care than before, not less, because the surrounding ocean of cheap code makes the durable parts harder to keep clean.

It's true that AI also lowers the cost of converting between regimes — promoting a prototype, breaking up a monolith, rewriting once you know what it should have been. Within the regime layer, misclassification is more recoverable than it used to be. Conversion costs don't fall by the same factor that writing costs do — the human decisions that separate a prototype from a production system (error handling, security model, ownership, runbooks) resist the same compression typing has undergone — but they fall enough that "commit at creation" is weaker than it might first sound at this layer.

The upstream layer is a different story. Once you've built a thing and woven it into your stack, the decision to keep owning it is mostly already made: migration costs are concrete, ongoing ownership costs are diffuse, sunk cost and status quo bias both pull in the wrong direction. Migrating off your own auth onto a vendor, or off your own job queue onto a hosted one, is the kind of project that gets perpetually deferred. Build-vs-buy misclassifications are close to one-way doors. That's the layer where "commit at creation" actually does most of its work — and AI doesn't just fail to help you back out, it actively pushes you in. Tasks that used to be obviously too much to take on now look feasible, because the writing demo is fast and concrete while the ownership cost is slow and invisible. Build-vs-buy is exactly where AI makes its most seductive case, and exactly where that case is least to be trusted. The writing got cheap. The owning didn't. The decision has to be made on the owning.

The skill, maybe, is twofold: knowing what not to take on at all, and being honest about which regime everything that survives belongs to.

Self-hosting our developer toolchain.

2026-05-14T00:00:00.000Z

Our developer toolchain runs on open source, self-hosted in our own Kubernetes cluster. Forgejo for git, issues, and PRs; Forgejo Actions for CI; Forgejo's container registry for the images we ship; and CSS of our own on top so the UI reads as a continuation of the rest of the product. All four pieces share a namespace with the product itself, all synced from a single GitOps repository. This post walks through the pieces — and through the things that went wrong getting there.

Why self-host.

Three reasons.

Open source. Forgejo is an open-source fork of Gitea that we can extend, theme, and pair with our own runner image without asking anyone's permission. That is not abstract — it is the reason a custom theme repository and a custom runner Containerfile sit in our GitOps tree at all, and the reason a Renovate PR can bump every pinned tool inside the CI image on its own schedule.

Cost. Our entire CI workload runs on the same dedicated Hetzner servers as the rest of the product, scheduled into capacity that would otherwise sit idle. The hosted-GitHub equivalent for a team our size on a Rust-heavy CI profile — Team seats, Actions minutes at the runner sizes our compiles actually need, and Packages storage for the images we push — comes out to several hundred euros a month before usage spikes. Self-hosting puts us at low double-digit euros per server per month.

Customization. Two surfaces. The git UI is themed to match the rest of the product — Forgejo lets us drop CSS into a known location and pick a default theme via config, so a PR review page reads as a continuation of the product rather than a vendor screen inside it. The runner image is ours — pre-warmed with the toolchain baked in — which means CI startup time is bounded by image pull rather than by setup-bun hitting api.github.com's rate limit at the wrong hour. Neither surface is available on hosted CI at any price.

What's in the stack.

Git, issues, PRs. Forgejo runs as a Deployment in the cluster, backed by a CloudNativePG Postgres cluster declared next to it in the same chart. Upgrade story is the cluster's: bump the image tag, ArgoCD rolls it.

CI. Forgejo Actions, with two runner StatefulSets — one per dedicated server — pinned by nodeSelector so per-pod CPU limits add up to exactly the host's thread count. No oversubscription. KEDA scales replicas inside that envelope.

Container registry. The one that ships with Forgejo. The runner pushes its own image there on every change; the apps in the cluster pull from there. One TLS cert, one auth surface, one set of credentials.

A pre-warmed runner image. Our own Containerfile — rustup, sccache, uv, bun, Playwright with the browsers baked in, on a catthehacker/ubuntu:act-24.04 base. Every pinned tool has a # renovate: marker so Renovate keeps it current. Workflows opt in with a custom runs-on label and setup-bun / setup-uv / dtolnay/rust-toolchain become no-ops — no GitHub API calls, no downloads, no rate-limit issues during "Set up environment".

Shared compile cache. sccache, configured against an in-cluster MinIO bucket. The runner config sets the endpoint and bucket as default env; Rust workflows opt in by exporting RUSTC_WRAPPER=sccache and pick up cache from every other runner that has touched the same crate graph.

A custom git UI. A separate repository holds three CSS files — light, dark, auto. The Forgejo Deployment mounts them via a Kustomize ConfigMap that pulls the theme repository in as a git submodule, with a strategic-merge patch flipping the default theme via Forgejo's FORGEJO__ui__DEFAULT_THEME env var. Bumping the submodule pointer is the deploy.

Each runner pod also carries a docker-in-docker sidecar on a raw-block ext4 zvol, plus a separate raw-block volume for the act_runner cache server. We tried virtiofs first; bolt's mmap(MAP_SHARED|PROT_WRITE) returned ENODEV and the cache server quietly died. Raw block, ext4 inside the VM, mmap works.

The theme.

Forgejo ships with a terracotta primary (#c2410c). We retuned --color-primary-* to ink-500 (#1A2540) and remapped the entire --zinc-* ramp to a warm stone equivalent, so every downstream semantic token — buttons, cards, nav background, repo file rows — inherits the warmer surface without per-selector overrides. The terracotta is gone on purpose; the brand calls for stone plus ink plus a single oxblood mark, not a third heat color.

The semantic signal palette (red / green / yellow / blue) was retuned to AA-on-stone equivalents. Diff hunks are washed — closer to a redlined research paper than a SaaS pull-request UI. Headings get Source Serif 4, body Inter, code JetBrains Mono.

Three CSS files, one strategic-merge patch, one submodule. The ConfigMap is under 100 KB and Forgejo rolls in about ten seconds on strategy: Recreate.

The unglamorous bits.

Three things bit us hard enough to be worth naming.

The Forgejo registry truncates large blob uploads — anything in the hundreds-of-MB range hits a 499 or 504 from the ingress and never finishes. We hit it pushing the runner image (Playwright browsers ship as one fat layer per browser). The fix isn't on the image side, it's the ingress proxy timeouts and the registry's own upload limits; that ticket is still open. Until it lands, big images get built and pushed from a workstation with a non-home storage root.

Kata sandboxes occasionally emit a SandboxChanged event from cloud-hypervisor on builds north of ten minutes, which takes the running build down with it. The Rust toolchain layer is the long pole, so we keep RUN steps small (every Dockerfile layer is a checkpoint), and CI builds that hit this get retried from the latest cached layer.

Rootless podman build of the catthehacker base fails to commit with "history lists N non-empty layers, but we have M layers on disk". The base has ~18 layers with empty-layer history markers that podman's overlay driver mishandles under home-directory storage. CI builds work because the Kata VM has a clean overlay store. Local builds, when needed, fall back to --storage-driver=vfs, which is slow and disk-hungry but fine.

One more, cheap but it cost us an afternoon: the slim :act-24.04 catthehacker variant doesn't ship node in PATH, and Playwright's CLI is a #!/usr/bin/env node shebang. playwright install failed with "env: 'node': No such file or directory". The fix: apt install nodejs in the same RUN, and move on.

A nod to depth.

CI jobs run inside microVMs: every workflow gets a fresh Kata-backed sandbox with its own kernel, its own docker daemon, and a graceful drain window so SIGTERM doesn't kill jobs mid-run. The mechanics of that (and why the per-pod CPU limits add up to exactly the host's thread count, no oversubscription) are their own piece: Kata-VM runners for self-hosted CI.

Where we drew the seams.

Self-hosting isn't "build your own everything." We pull Forgejo from upstream, the runner from code.forgejo.org, the CI base from catthehacker. We pay for OS images, kernels, and the bits of the registry we haven't yet fixed. What we own is the parts where ownership changes the answer: the runner image that decides what "set up environment" costs, the sccache backend that decides whether a Rust rebuild is two minutes or twenty, the CSS that decides what our git looks like, and the kustomization that wires all of it into one ArgoCD-managed Application. The bill came out smaller than the equivalent on hosted GitHub for our usage profile, the feedback loop into the cluster is shorter, and the UI looks like the rest of the product. Together, those gains have paid back the ongoing maintenance work it costs to keep that surface intact.

Our internal prediction market.

2026-05-14T00:00:00.000Z

We run an internal prediction market. Anyone in the company can open a market on a falsifiable claim about our work, anyone can take either side, the currency is play money, and the leaderboard is visible to everyone inside the company. The point is to invert the usual workplace incentive on disagreement. The default move in a meeting is to stay agreeable and not commit to anything checkable — saying nothing falsifiable feels like the safe play, because a falsifiable claim is one you might later be visibly wrong about. A market flips the trade. Hedging contributes nothing to your leaderboard rank; taking a position contributes signal whether you turn out right or wrong on any individual question. Being occasionally wrong is the cost of contributing at all, not the failure mode.

The mechanic.

A market is a question with a resolution rule and a date. "We will ship feature A by month-end." "Our next bench run moves the headline metric by at least N." "Provider P releases its next model before date D." You write the question, define how it resolves, and seed the price by taking the first position. Others trade against you, prices move, and the question eventually resolves against a fact in the world. Balances update, the leaderboard updates, and the market page archives the full history.

Trades are attributed by default. The order book on every market is visible inside the company — current price, full trade history, who is on each side, and the rule that will resolve it.
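To make the mechanic concrete, here is a minimal sketch in Python. It assumes a simple binary-share model: a YES share bought at price p pays 1 unit of play money if the market resolves YES, and a NO share pays 1 if it resolves NO. The post does not describe the real pricing or matching mechanism, and the actual service is written in Rust, so every name here (Market, Trade, buy, settle) is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Trade:
    trader: str    # trades are attributed by default
    side: str      # "YES" or "NO"
    shares: float  # number of shares bought
    price: float   # price paid per share, between 0 and 1


@dataclass
class Market:
    question: str
    resolution_rule: str
    resolves_on: date
    trades: list[Trade] = field(default_factory=list)

    def buy(self, trader: str, side: str, shares: float, price: float) -> None:
        if side not in ("YES", "NO"):
            raise ValueError("side must be 'YES' or 'NO'")
        if not 0.0 < price < 1.0:
            raise ValueError("price must be strictly between 0 and 1")
        self.trades.append(Trade(trader, side, shares, price))

    def settle(self, outcome: str) -> dict[str, float]:
        """Return each trader's net change in play money for the given outcome."""
        deltas: dict[str, float] = {}
        for t in self.trades:
            payout = t.shares if t.side == outcome else 0.0
            cost = t.shares * t.price
            deltas[t.trader] = deltas.get(t.trader, 0.0) + payout - cost
        return deltas


market = Market(
    question="We will ship feature A by month-end.",
    resolution_rule="Resolves YES if the release commit is deployed by the date.",
    resolves_on=date(2026, 5, 31),
)
market.buy("alice", "YES", shares=10, price=0.7)  # the opener seeds the price
market.buy("bob", "NO", shares=10, price=0.3)     # a counterparty takes the other side
deltas = market.settle("YES")
```

In this toy model the two positions offset each other exactly: whatever one side gains on resolution, the other side loses.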

Why the order book is public.

The most consequential decision in the design was making trades attributed and the order book company-visible, rather than running a private market where each trader only sees their own positions. The private version is easier to sell internally — nobody has to put their name on a No — but it loses what makes a market useful in a company setting. A private trade is a guess; a public trade is a forecast on the record, the same shape as a forecast in a meeting except that the resolution rule and the date were agreed on up front.

The visibility is not there to discipline people for being wrong. Most of our active traders have been wrong on plenty of markets — that is how a real prediction market is supposed to feel, and the leaderboard reflects long-run calibration rather than any individual miss. The reason visibility matters is the inverse: without a public record, taking a contrarian view in a meeting and saying nothing at all are indistinguishable a quarter later. With a public record, the position you took is in your history regardless of how it resolved, and the fact that you committed to it at all is the signal the market is built to surface.

What it gets used for.

Three shapes of question show up most.

Project deadlines. "Feature A by month-end?" "Customer Z signs by end of next quarter?" These markets do two useful things at once. The first is genuinely predictive: the price five days out is a better estimator of "are we actually going to ship" than what anyone said at the last standup, because the traders are people who have been watching the work and have skin in the game. The second is upstream of the prediction — translating a meeting claim into a yes/no with a specific date forces the speaker to commit to what they actually meant, and the version that survives that translation is often meaningfully more conservative than the version that was said out loud.

Quantitative metrics. "Headline bench score above X on next iteration?" These work well when the metric is well-defined and the resolution date is close. Markets on far-out metric movements thin out fast.

External events that move us. "Provider P releases its next model before date D?" "The next round of pricing changes in the API tier we use lands before the end of the quarter?" These are usually the most heavily traded markets, because the information edge is genuinely distributed across the company and the resolution is unambiguous.

A fair fraction of opened markets attract zero counterparties. That outcome is also informative — it usually means the question is too inside-baseball for anyone outside one team to have an edge on it. Useful to know before the next planning meeting takes a position on the same question with no skin in the game.

What works and what doesn't.

Markets fail in a few predictable ways.

Subjective resolutions are the most common failure. If a market's resolution depends on one person's reading of an ambiguous situation, that person becomes the most consequential trader without taking a position. Good markets resolve on something pointable: a shipped commit, a signed contract, a value on a dashboard, a public announcement. We have a small set of resolution adapters — short Akribes workflows that read from a system of record — for the common cases.

Calibration takes practice. In their first month, most people are overconfident on markets adjacent to their own work and underconfident on everyone else's. That sorts itself out on the leaderboard, which is part of why the leaderboard exists in the first place.
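One common way to put a number on calibration is the Brier score: the mean squared difference between the probability a trader stated and what actually happened. The post does not say how the leaderboard is scored, so the sketch below is an illustration with made-up forecasts, not the service's method.

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared error between stated probability and outcome (0 is perfect)."""
    if not forecasts:
        raise ValueError("need at least one resolved forecast")
    total = sum((p - (1.0 if happened else 0.0)) ** 2 for p, happened in forecasts)
    return total / len(forecasts)


# Hypothetical data: (probability assigned to YES, whether it resolved YES).
# Both traders face questions that resolve YES half the time.
overconfident = [(0.95, True), (0.95, False), (0.95, True), (0.95, False)]
calibrated = [(0.5, True), (0.5, False), (0.5, True), (0.5, False)]

overconfident_score = brier_score(overconfident)
calibrated_score = brier_score(calibrated)
```

Lower is better. On these inputs the overconfident trader scores worse than the one whose stated probabilities match how often the events happened.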

The build.

A modest Rust service. The crates are auth, core, db, market, metrics, social, trading, and web. Authentication via OIDC in production with a mock-OIDC service for local development. Resolution adapters plug in as Akribes workflows, so a market can resolve automatically against a system of record — a deployed commit, a value from a dashboard, a public announcement — instead of depending on a human to remember to press the button. We will open-source it once a few of the rough edges are filed off; other companies that want to invert the same workplace incentive should not have to build this themselves.

What surprised us.

Three things we did not predict going in.

The leaderboard rewards passing. The traders at the top are not the ones who take a position on every market; they are the ones who skip the markets where they have no edge. That is not how the room reads in a meeting, where the people who talk most are usually the ones whose opinions are remembered.

Markets are still valuable when nobody trades them. The act of writing a market — picking the question, the resolution rule, the date — is where a lot of the value sits. A vague meeting claim either survives that translation as a real question or it doesn't, and both outcomes are more information than the meeting on its own produced.

Betting against a senior person turned out to be uneventful. We were braced for friction the first time someone publicly took the No side on a director-level pitch. Nothing happened. The market resolved, the leaderboard updated, the director kept being a director. After a few more like that, taking the No side on a director's pitch stopped being something the company noticed at all. Each public, uneventful resolution against a senior person's market is a small demonstration of what disagreement looks like here, and the cumulative effect of those demonstrations is the actual point. We were trying to build a place where criticising someone senior is part of the work — not an act that requires unusual bravery — and the only way to build that is to repeatedly, publicly, make the act unremarkable until it stops being brave and starts being a normal part of company culture.

MicroVM isolation for our CI runners.

2026-05-14T00:00:00.000Z

Our CI runs against a self-hosted Forgejo Actions setup in the same Kubernetes cluster as the product. Every job runs inside a microVM (Kata Containers on Kubernetes), not a regular container. Here is why we made that bet, what it cost, and what broke along the way.

Why microVM, not container.

A container shares the host kernel. A CI job runs arbitrary code from a PR — anyone with write access can land a workflow change, anyone with a fork can open the PR that runs it. Even with rootless containers and a tight set of Linux capabilities, the threat model assumes someone's build script never finds a kernel exploit. That assumption has been wrong, in public, often enough that we did not want to make it.

A microVM changes the shape of the worst case. The build script runs against a guest kernel inside a hypervisor (cloud-hypervisor in our case); a kernel escape now buys you root on a tiny throwaway VM, not on the node that schedules every other job in the cluster. We are not a security boutique. We just did not want a compromised dependency PR to be a cluster-level event when the alternative — ~6 GiB of guest RAM and a few seconds of startup per job — was bearable.

The other consideration is operational. CI is the workload most likely to do something strange to its kernel: mount, mkfs.ext4, raw netlink, weird capability sets in dind. A microVM contains every one of those experiments to a sandbox that will be destroyed when the job exits.

The shape of the deployment.

Kata is wired in as a Kubernetes RuntimeClass. The runner pods set runtimeClassName: kata in their PodSpec; Kubernetes schedules them through the Kata shim instead of the default container runtime. From the cluster's point of view they are still pods — kubectl logs, events, lifecycle hooks all work — but the container inside the pod runs inside a cloud-hypervisor VM, with the OCI image's rootfs presented to a guest kernel.

Two runner StatefulSets handle the actual work, one per dedicated host, both pinned by nodeSelector so the per-pod CPU limits add up to exactly the host's thread count. Each pod contains a Forgejo Actions runner sidecar plus a docker-in-docker sidecar; the runner uses the dind socket to spin up the per-step containers a workflow asks for. A workflow step is a container, inside a dind daemon, inside a Kata VM, inside a pod, on a node. It sounds layered because it is. The boundary that matters for isolation is the VM.
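The "add up to exactly the host's thread count" rule is simple arithmetic that is easy to check mechanically. A minimal sketch in Python follows. The host names, thread counts and per-pod limits are hypothetical, since the post gives none of these numbers.

```python
def check_cpu_budget(host_threads: int, pod_cpu_limits: list[int]) -> int:
    """Return unused threads; raise if the limits exceed the host's thread count."""
    total = sum(pod_cpu_limits)
    if total > host_threads:
        raise ValueError(
            f"oversubscribed: {total} CPUs of limits on {host_threads} threads"
        )
    return host_threads - total


# Hypothetical fleet: two dedicated hosts, two runner pods pinned to each.
hosts = {
    "ci-host-a": (16, [8, 8]),
    "ci-host-b": (32, [16, 16]),
}

spare = {
    name: check_cpu_budget(threads, limits)
    for name, (threads, limits) in hosts.items()
}
```

A spare count of zero on every host means the limits fill the machine exactly, with nothing oversubscribed and nothing stranded.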

Workflows opt in to the pre-warmed image with a label exposed in the runner config:

labels:
  - "microvm-runner:docker://registry.internal/forgejo-microvm:latest"

A workflow that wants the baked-in toolchain says runs-on: microvm-runner. Jobs that ask for plain ubuntu-24.04 still work — they get a stock catthehacker image without the pre-warming.

Why Kata and not the alternatives. Firecracker directly would have worked, but it does not speak the OCI image format the way Kata does. gVisor would have given us syscall-level isolation in userspace, but the performance cost on a Rust compile is much worse than the hypervisor cost we did pay, and it would not have helped against a kernel escape — the gVisor sentry is the kernel surface. Kata won because an existing Containerfile is also a microVM image.

What it cost.

This is the honest section. Kata jobs are slower to start than runc containers, and the gap is real. A Kata pod boots a guest kernel, handshakes with the guest agent, sets up the rootfs inside the VM, and only then starts your container. Even with cloud-hypervisor (the fastest of the supported Kata hypervisors), that is several extra seconds of cold start on top of the usual image pull. For a sub-minute test job, that is a meaningful fraction of total wall time.

For our Rust builds it stops mattering, because those builds spend most of their wall time in rustc and the linker, with sccache pulling cached object files from in-cluster object storage. The absolute startup overhead is the same in both worlds, but it is now amortized over a much longer compile. Same shape for Playwright shards, where the dominant cost is browser launch and navigation.

We did not benchmark this rigorously enough to publish a number. The qualitative read: jobs measured in seconds feel slower; jobs measured in minutes do not. Anyone considering this for short, cache-hit-heavy CI should know the overhead does not amortize.
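The amortization argument is just a ratio. The sketch below uses a hypothetical 5-second cold-start overhead (the post says only "several extra seconds" and publishes no benchmark) to show how the same fixed cost shrinks as a share of wall time when the job gets longer. The job durations are also made up.

```python
def startup_share(overhead_s: float, job_s: float) -> float:
    """Fraction of total wall time spent on fixed startup overhead."""
    if overhead_s < 0 or job_s <= 0:
        raise ValueError("overhead must be >= 0 and job time must be > 0")
    return overhead_s / (overhead_s + job_s)


ASSUMED_OVERHEAD_S = 5.0  # hypothetical placeholder, not a measured figure

short_test = startup_share(ASSUMED_OVERHEAD_S, 30.0)   # a 30-second test job
rust_build = startup_share(ASSUMED_OVERHEAD_S, 600.0)  # a 10-minute compile

print(f"short test: {short_test:.1%} of wall time is startup")
print(f"rust build: {rust_build:.1%} of wall time is startup")
```

Under these assumptions the short job spends roughly a seventh of its wall time booting the VM, while the long compile spends under one percent.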

The other cost is memory. Kata adds per-VM overhead — guest kernel, agent, virtiofsd, cloud-hypervisor's working set — that we budgeted at 6 GiB on top of the workload's own request. With two VMs per node at CI capacity, that is ~12 GiB per node permanently committed to "Kata exists." We have the headroom; smaller fleets might not.
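For anyone sizing a smaller fleet, the same budget can be written down as a quick check. In this Python sketch the 6 GiB overhead and the two VMs per node come from the text above. The node size and the per-job workload request are hypothetical.

```python
KATA_OVERHEAD_GIB = 6  # per-VM budget from the text
VMS_PER_NODE = 2       # two VMs per node at CI capacity


def node_headroom_gib(node_ram_gib: int, workload_request_gib: int) -> int:
    """GiB left on a node after Kata overhead plus each workload's own request."""
    committed = VMS_PER_NODE * (KATA_OVERHEAD_GIB + workload_request_gib)
    return node_ram_gib - committed


fixed_overhead_gib = VMS_PER_NODE * KATA_OVERHEAD_GIB  # the ~12 GiB in the text

# Hypothetical node: 64 GiB of RAM, each CI job requesting 16 GiB.
headroom = node_headroom_gib(node_ram_gib=64, workload_request_gib=16)
```

A negative result would mean the node cannot hold both VMs at that request size once the Kata overhead is counted.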

The image: why we baked everything in.

Pre-warming the runner image is a self-hosting concession, not a developer-ergonomics one. The actions everyone reaches for — setup-bun, setup-uv, dtolnay/rust-toolchain — all hit api.github.com/repos/*/git/refs/tags unauthenticated to resolve which release to pull. Self-hosted Forgejo jobs are not authenticated against GitHub, so during busy hours every one of those actions returns 403 and the job fails before the build starts. The official setup-actions are not infrastructure we control; depending on them inside a self-hosted runner is depending on someone else's rate limit budget.

So the image bakes in rustup, sccache, uv, bun, node, the MinIO client, and a full Playwright browser set, all with Renovate-tracked pins:

# renovate: datasource=github-releases depName=rust-lang/rust
ARG RUST_VERSION=1.94.0
# renovate: datasource=github-releases depName=mozilla/sccache
ARG SCCACHE_VERSION=0.8.2
# renovate: datasource=npm depName=playwright
ARG PLAYWRIGHT_VERSION=1.59.1

Renovate's custom-manager picks up the # renovate: markers and opens a PR whenever any of those pins goes stale. Bumping a tool is reviewing a Renovate PR; building the new image is the existing weekly CI cron.
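Conceptually, the custom manager pairs each # renovate: comment with the ARG line beneath it. The Python sketch below mimics that pairing with a regular expression so the convention is easy to see. It illustrates the idea only: it is not Renovate's implementation or our Renovate configuration, and the pattern is an assumption that covers just the marker shape shown above.

```python
import re

CONTAINERFILE = """\
# renovate: datasource=github-releases depName=rust-lang/rust
ARG RUST_VERSION=1.94.0
# renovate: datasource=github-releases depName=mozilla/sccache
ARG SCCACHE_VERSION=0.8.2
# renovate: datasource=npm depName=playwright
ARG PLAYWRIGHT_VERSION=1.59.1
"""

# A marker comment on one line, followed directly by the ARG it describes.
PIN = re.compile(
    r"^# renovate: datasource=(?P<datasource>\S+) depName=(?P<dep>\S+)\n"
    r"ARG (?P<arg>\w+)=(?P<version>\S+)$",
    re.MULTILINE,
)

pins = [m.groupdict() for m in PIN.finditer(CONTAINERFILE)]

for pin in pins:
    print(f"{pin['dep']} ({pin['datasource']}): {pin['arg']} = {pin['version']}")
```

Keeping the marker and the pin on adjacent lines is what makes the convention cheap to parse: an ARG without a marker above it is simply not tracked.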

Other pre-baked details that earn their keep: RUSTUP_HOME=/opt/rust and CARGO_HOME=/opt/cargo chmod'd world-writable so any UID can use them; PLAYWRIGHT_BROWSERS_PATH=/opt/pw-browsers so playwright install is a no-op; SCCACHE_IDLE_TIMEOUT=0 so the sccache daemon survives the entire job.

What broke.

Three war stories from the last six months. There are more in the runner-image README; these are the most informative.

Long-running builds and SandboxChanged. Cloud-hypervisor on the Kata runtime would occasionally emit a SandboxChanged event mid-build on jobs north of ten minutes — Kubernetes sees the pod's sandbox effectively replaced and the running container disappears. The Rust toolchain layer was always the long pole. The fix on the image side is to keep RUN steps small, because each one is a checkpoint. The fix on the diagnosis side is to capture dmesg, the containerd-shim-kata-v2 processes, and recent containerd logs from the affected node before they roll over, which is what our kata-capture.sh exists for. Usual suspects: hypervisor OOM, a kernel trace in vhost-net or virtio-blk, or the shim itself crashing.

Rootless podman + the catthehacker base. Building the image locally with rootless podman against the home-directory overlay store reproducibly fails on the final COMMIT step with "history lists N non-empty layers, but we have M layers on disk". The catthehacker base ships with ~18 layers, some carrying empty-layer history markers that podman's overlay driver mishandles under home-directory storage. --squash alone does not fix it. CI builds work because the Kata VM that runs the build has a clean overlay store. Local builds either run on a workstation with a non-home storage root, or fall back to --storage-driver=vfs, which is slow and disk-hungry but produces a valid image.

The Forgejo registry truncating large blob uploads. This one is unresolved. Pushing the runner image to our own Forgejo container registry fails on the Playwright browsers layer (roughly a GB of compressed binaries — chromium, firefox, webkit and their OS deps, rendered as a single OCI blob). The failure mode is either 499 Client Closed Request on the client side or 504 Gateway Timeout from the ingress, and it reproduces from both CI and a workstation push over the same public ingress — which rules out the image and rules in the proxy in front of the registry. The fix lives in the ingress chart: bump proxy-read-timeout and proxy-send-timeout past the time it takes to upload the biggest single blob, raise the body-size cap, and check the registry's own upload limits. There is no reasonable image-side workaround; Playwright lays each browser down as one directory tree and the layer compresses as a single blob. Until the ingress timeouts move, the workaround is to push from somewhere with enough patience to retry.

What sccache buys us.

The image installs sccache but does not set RUSTC_WRAPPER globally — the runner-level config exposes the S3 endpoint and bucket name as default env vars, and Rust workflows opt in by exporting RUSTC_WRAPPER=sccache themselves. The opt-in matters: when the object store is unreachable or the credentials drift, an opted-in job fails loudly with "Server startup failed" rather than every Rust job in the cluster falling over silently. The backend is an in-cluster object store, so the round-trip is one Linux bridge away.

Combined with SCCACHE_IDLE_TIMEOUT=0, this is where the wall-time win for repeat Rust jobs comes from. Cold compiles still pay the full cost, but everything downstream of a touched crate pulls cached objects rather than recompiling. Pair that with the microVM startup overhead being amortized over a multi-minute compile, and the performance picture stops looking bad — the boundary cost only shows up on the short jobs.

The companion post.

For the broader self-hosting story see Self-hosting our developer toolchain. This post is the deep dive on the runtime; that one is the elevator pitch for the whole stack.

Closing.

The microVM-per-job choice is one of those decisions made for security reasons that ends up justifying itself with performance numbers six months later. The CI threat model — arbitrary code from a PR running on infrastructure that also runs the product — is the load-bearing reason; the performance overhead has stayed bearable because the long jobs that benefit most from microVM isolation are also the ones that amortize the per-job startup cost over a multi-minute compile. The remaining operational pain is downstream of the registry and the ingress, not the runtime, and it is being closed out as we get to it. We would make the same call again.

Where eval variance actually lives.

2026-05-14T00:00:00.000Z

We run our eval judge three times per case in parallel and report the median. The reason we started doing this was the textbook one — language models are non-deterministic, so replicate the noisy component to defang its noise. The reason we kept doing it is more interesting: in our pipeline, the judge turned out not to be the noisy component at all. Replication now functions less as a defence and more as a running sanity check.

What replication measured.

On a fixed agent output, with a fixed judge prompt and fixed ground truth, the spread across three parallel judge calls is effectively zero on the large majority of cases. Three identical scores, three for three. On the small minority of cases where the three judges produce slightly different scores, the differences are small enough that the median tells the same story as any single draw. Our judge, with our prompt and our model selection, is functionally deterministic.
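The replication step can be sketched roughly as follows. This is a minimal illustration, not our harness: the judge callable, the JudgeResult type, and the 10% drift threshold are all hypothetical names and numbers chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median
from typing import Callable, NamedTuple


class JudgeResult(NamedTuple):
    scores: tuple[float, ...]
    median: float
    spread: float  # max - min across the replicated judge calls


def replicated_judge(
    judge: Callable[[str], float], agent_output: str, replicas: int = 3
) -> JudgeResult:
    """Call a (hypothetical) judge `replicas` times in parallel on one fixed output."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        scores = tuple(pool.map(judge, [agent_output] * replicas))
    return JudgeResult(scores, median(scores), max(scores) - min(scores))


def judge_has_drifted(results: list[JudgeResult], max_nonzero_share: float = 0.1) -> bool:
    """Flag drift when too many cases show any disagreement between replicas.

    The 10% default is an assumption for illustration, not a measured value.
    """
    nonzero = sum(1 for r in results if r.spread > 0)
    return nonzero / len(results) > max_nonzero_share
```

The median is the reported number; the spread is the part that tells a team which world they are in.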

That is a claim about our setup, not a universal one. A chattier judge prompt, a higher-temperature judge model, or a judge that has to make finer-grained distinctions than ours does could easily produce a real spread. Replication is how a team finds out which world they are in; we ran it expecting to discover the second world and discovered we were in the first.

Where the variance actually lives.

Our agent is not a single model call. It is a chain of typed sub-workflows: pull some features out of the inputs, reason on those features, choose what to do next, run a second call against a prompt the first call helped shape, and so on for several stages. Each call samples its own output. Each sample steers the prompt for the next call. By the time the chain finishes, two runs of the same case with identical inputs can produce visibly different intermediate states.

Composite variance is our name for that. It is the accumulated effect of the small probabilistic decisions a chain makes along its way, compounded by the fact that an early decision shapes the context the later decisions see. On our pipeline composite variance exceeds judge variance by a wide enough margin that the comparison stops being interesting.

What this does to a single prompt edit.

A single run after a prompt change reports the sum of two things: the effect of your change, and one draw from the composite distribution. A plus-three the next morning is consistent with a prompt that genuinely improved the pipeline. It is also consistent with the composite happening to roll high on this particular run. Without a second run, the two cannot be separated.

That implication moved the bar for us. A delta from one run, no matter how cleanly it lands, is now treated as a single sample of a noisy random variable, not as evidence that a change worked.

What we do now.

The judge still runs three times per case. The median is still the reported number. We keep that step because we want a continuous check that the judge variance stays where we measured it; if the three-call spread starts widening, the judge has drifted, and finding that out continuously is much cheaper than finding it out after a mysterious regression.

The agent also gets run more than once per case on any change someone wants to call a win or a loss. Not on every commit — the bill would be unmanageable — but on prompt edits, model swaps, and sub-workflow changes that are intended to move the headline. We maintain a rough per-case sense of how wide the composite distribution is, and a delta that lands inside that band does not count as a result.
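One way to make the band rule concrete is sketched below. It treats an observed delta as the effect of the change plus noise from the composite distribution, and estimates that noise from replicated runs on both sides. The two-standard-error band and the function name are assumptions for illustration; this post does not prescribe how the band is computed.

```python
from math import sqrt
from statistics import mean, stdev


def delta_counts(
    baseline: list[float], candidate: list[float], k: float = 2.0
) -> tuple[float, float, bool]:
    """Return (delta, band, counts) for replicated headline scores.

    Each list holds the headline score from repeated runs of the same
    pipeline version, and needs at least two runs so a spread exists.
    """
    delta = mean(candidate) - mean(baseline)
    # Standard error of the difference between two independent means.
    se = sqrt(stdev(baseline) ** 2 / len(baseline) + stdev(candidate) ** 2 / len(candidate))
    band = k * se  # k = 2 is an illustrative choice, not a tuned value
    return delta, band, abs(delta) > band
```

With three runs per side that each wobble by about three points, a plus-three delta lands well inside the band and does not count as a result.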

Most prompt edits, scored honestly, land inside the band. The ones that don't are the ones we ship, and we ship far fewer than we used to.

When the other shape applies.

This is not an argument that the judge does not matter. If your replicated judge calls produce a real spread, that is itself useful information — it points the next chunk of methodology work at the judge prompt, not at the agent. Both worlds exist, and the only way to know which one you are in is to measure. The mistake to avoid is assuming, in either direction, without checking.

Once a team has confirmed that the judge variance is small, the centre of gravity of the eval conversation should move upstream. Most prompt-edit arguments we used to have were about the judge prompt. They were the wrong arguments. The interesting questions, once the judge is known to be stable, are about the agent's intermediate states: which sub-workflows are sensitive to sampling, where in the chain small differences amplify, and which stages need their own structural fixes rather than prompt tweaks.

The cost.

Replicating the agent is not free, and the budget conversation is real. Tripling the cost of a benchmark run was something we resisted for a while. What changed our minds was the realisation that the compute cost of repeated runs is smaller than the team-hours cost of arguing for two weeks about whether a particular noisy delta was meaningful. The math goes the right way on the scale we care about.

Replication also produces a side benefit we did not budget for: a per-case composite-distribution width is itself a useful artefact. It tells us which cases are stable and which are fragile, and the fragile ones tend to be the cases where a sub-workflow is doing something genuinely unreliable. Targeting those is a clearer signal than chasing a moving headline.
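A small sketch of how that artefact could be used: rank cases by how far their replicated scores spread and look at the widest first. The input shape (case id mapped to a list of scores) and the use of max minus min as the width are assumptions for illustration.

```python
def fragile_cases(
    runs_by_case: dict[str, list[float]], top: int = 3
) -> list[tuple[str, float]]:
    """Return the `top` cases with the widest score range across repeated runs."""
    widths = {case: max(scores) - min(scores) for case, scores in runs_by_case.items()}
    # Widest first: these are the cases most likely hiding an unreliable sub-workflow.
    return sorted(widths.items(), key=lambda item: item[1], reverse=True)[:top]
```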

Close.

The score is one component of an eval. The harness, the replication strategy, the noise model, and the rule for declaring a delta meaningful are the rest. In our setup, the judge is the most stable element in the apparatus, and the agent is the loud one. Most of the methodology work since landing on that has been on the agent side of that line, not the judge side.

Adding email as the front door to our patent agent.

2026-05-14T00:00:00.000Z

We build an AI agent for European patent work, and the interface we built for our users to actually talk to it is email. They send a message to a dedicated address, attach the documents their patent process already produced — an examination report from the EPO, a draft set of claims, the prior art the examiner cited — and the agent reads the message, runs its workflow, and replies in the same thread. This post is about how we wired that channel using the Podesta SDKs, and why email turned out to be a much better answer than the dashboard we briefly considered building instead.

Why email.

Patent attorneys live in email. Every document they reason against already arrives there: from the EPO, from their clients, from their own colleagues drafting in parallel. A separate dashboard would have asked them to download an attachment from one inbox, log into a new tool, upload the attachment there, and read the response in a different window. Four operations added to a workflow that did not need any of them. Forwarding the email they were going to forward anyway adds zero.

The second reason became clear once we shipped: email already has the conversational primitives we needed — threads to maintain context, replies to preserve intent, attachments as the obvious surface for documents. The work, in the end, was writing a parser for a format the user already knew, not designing a UI.

What the channel does end to end.

A message arrives at our inbound address. A small Rust service receives it, parses the MIME envelope, and extracts attachments. PDFs go through Podesta's document ingestion — the same path Studio uses for documents uploaded into the editor — and come back as structured markdown the workflow can reason against. The message body and metadata are normalised into the inputs the Akribes workflow expects.

The workflow itself is the same composite Akribes script that powers every other piece of our agent's reasoning: classification of what the user is asking for, the legal analysis appropriate to that classification, and the assembly of a structured response. None of that changes when the input arrives over email instead of over some hypothetical API.

The result comes back as a structured object — typed response sections, citations, any uncertainties the workflow flagged. The service renders that into a readable HTML reply, sends it from the address the user originally wrote to, and uses the standard mail-threading headers to keep the reply in the conversation the user started.
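The reply step can be sketched as below. Our service is written in Rust; this Python version only illustrates the header handling. In-Reply-To and References are the standard threading headers. The function name, the plain-text fallback, and the choice of the first To address as the sender are assumptions for the example.

```python
from email.message import EmailMessage
from email.utils import make_msgid, parseaddr


def build_reply(inbound: EmailMessage, html_body: str) -> EmailMessage:
    """Build a threaded HTML reply to `inbound` (illustrative, not the production code)."""
    reply = EmailMessage()
    # Reply from the address the user wrote to. Assumption: the first To address.
    _, wrote_to = parseaddr(str(inbound["To"]))
    reply["From"] = wrote_to
    reply["To"] = str(inbound.get("Reply-To", inbound["From"]))

    subject = str(inbound.get("Subject", ""))
    reply["Subject"] = subject if subject.lower().startswith("re:") else f"Re: {subject}"

    # Threading: point at the parent and extend its reference chain.
    parent_id = str(inbound["Message-ID"])
    reply["Message-ID"] = make_msgid(domain=wrote_to.split("@")[-1])
    reply["In-Reply-To"] = parent_id
    prior = str(inbound.get("References", ""))
    reply["References"] = f"{prior} {parent_id}".strip()

    # Plain-text fallback first, then the rendered HTML as the preferred alternative.
    reply.set_content("This reply is best viewed in an HTML-capable mail client.")
    reply.add_alternative(html_body, subtype="html")
    return reply
```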

What the Podesta SDKs handled for us.

Almost everything that is not email-specific.

The agent itself. The Akribes script that handles the reasoning is the same script every other front-door of our product calls. We never maintained a second copy of any prompt, any sub-workflow, or any provider configuration.
Document ingestion. Patent attachments are PDFs, often hundreds of pages, often with figures that matter for the analysis. Podesta's document conversion is a VLM-based PDF-to-markdown pipeline we plugged into with a single SDK call; we never assembled our own OCR, layout analysis, or structure recognition. The same call handles .docx and .html for the messages that arrive in those shapes.
Streaming execution. Workflow events stream through the SDK in real time. The email channel does not render a progress bar — there is no progress bar in an email — but we hook the event stream for structured logging, partial-result handling, and the long-running-job monitoring that would otherwise mean polling.
Auth and audit. Every email-driven workflow run carries the inbound sender as the user identity in our metrics, via a short-lived scoped token the SDK mints per request. Per-customer cost accounting, per-customer rate limits, and a clean offboarding path all fall out of that — none of which we built the underlying infrastructure for.

What we did write.

The service is the mail-specific glue: MIME parsing edge cases, address normalisation, the threading-header dance, attachment deduplication, sender allow-listing per customer, and the rendering layer that turns a structured workflow output into a readable HTML email. None of that is AI work. The agent itself, the document conversion, the workflow engine, the streaming protocol, the auth model, the cost accounting, the eval harness that scores changes to the underlying agent — those were all on the platform side of the line.

The detail that mattered most to adoption.

The single channel-side decision that did the most for how users took to the system was replying from the address the user wrote to rather than from a single canonical agent address. Patent firms typically write to whichever address lives in the email signature of the colleague who last forwarded them something, or to an alias that maps to a particular case. The expectation is that the response comes back from the same place; threading depends on it, and so does intent. A reply from a single canonical address reads as a system. A reply from the address the user actually wrote to reads as a continuation of the thread they started. The Akribes workflow does not care which it is; the channel does. The difference in how users responded to the version that got this right was disproportionate to the size of the code change.

What it took to make production-grade.

The interesting work, after the integration itself, was the boring kind:

Idempotency on retries. SMTP retries silently; the same message can land twice. We dedupe on the Message-ID header within a short time window.
Quotas per customer. A token attached to a customer means the rate limit is per customer; deciding what limits to set was its own conversation.
Failure paths that say something useful. When the workflow runs out of budget, or a document fails to convert, or the message body is empty, the user gets a reply that explains what to fix — not silence and not a stack trace.
A replay log. Every inbound message and every outbound reply is archived in a form that lets us replay it through a newer version of the workflow when the agent improves. That replay surface turned out to be the most valuable artefact the channel produced, because it gives us a real corpus of customer cases to regression-test against without ever touching a customer's mailbox.
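The idempotency item above can be sketched as a small time-windowed set keyed on Message-ID. This is an in-memory illustration only: the class name, the one-hour default window, and the injectable clock are assumptions, and a real deployment would need a store shared across service instances.

```python
import time
from typing import Callable


class MessageDeduper:
    """Remember Message-IDs for a limited window and reject repeats inside it."""

    def __init__(
        self,
        window_seconds: float = 3600.0,  # illustrative default, not our setting
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window_seconds
        self._clock = clock
        self._seen: dict[str, float] = {}

    def first_time(self, message_id: str) -> bool:
        """Return True if this Message-ID has not been seen within the window."""
        now = self._clock()
        # Forget entries that have aged out of the window.
        self._seen = {mid: t for mid, t in self._seen.items() if now - t < self._window}
        key = message_id.strip()
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

A duplicate delivery inside the window is dropped before the workflow runs; the same Message-ID arriving after the window is treated as new.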

Close.

We picked Podesta so that our engineering effort could go to legal reasoning rather than to workflow orchestration and document conversion. The email channel is a concrete instance of that bet paying off: a usable front door for our users, shipped with most of our engineering still pointed at the patent-law work we are actually trying to be good at.

n8n, LangGraph, Akribes: an honest comparison.

2026-05-14T00:00:00.000Z

Most "X vs Y" posts in this space pick a winner in the first paragraph and back-fill the rest. Having shipped production work on all three of n8n, LangGraph, and our own Akribes, our honest take is that there is no universal winner: there are three different shapes of problem, and the tools sort cleanly onto them once you stop pretending they're competitors. This post is an attempt to be specific about which is which.

Three shapes of problem.

The shape of the problem decides the tool. Three shapes show up most often:

(1) Glue between SaaS tools. "When a Stripe charge succeeds, post a message in Slack, write a row to Notion, and email the customer." A long tail of integrations, each individually trivial. LLMs do show up as one node in the chain (n8n has perfectly good LangChain integration), but they are not the centre of gravity; the work is plumbing.
(2) A custom agent a single developer fully understands and maintains. "Inside our Python service, scripted by one of our engineers, build a multi-step agent for a job we already know how we want done: pull a row, call a tool, branch on the result, iterate until done. We are the only consumers, we wrote it, we maintain it the same way we maintain the rest of the service."
(3) A domain-expert AI workflow you ship as part of your product. "We are productising something that requires real subject-matter expertise: legal analysis, claims adjudication, regulatory triage, underwriting. The value of the workflow IS the embedded domain knowledge. The domain experts who supply that knowledge need to be able to read and edit the workflow. The quality of the output has to be measurable on a real case set, not vibes-checked. And it ships behind our own product, not as a generic chatbot."

Most of what people argue about online is people in problem (1) telling people in problem (3) to just use n8n, and vice versa. Both sides are wrong.

n8n, fairly.

n8n is the right answer for problem (1), and it isn't close.

The connector library is enormous — hundreds of integrations covering the obvious SaaS surface (Slack, Notion, Postgres, Stripe, Salesforce, HubSpot, every major queue and storage system) and a long tail beyond that. If your job is "wire X to Y on event Z," n8n already has X, Y, and Z as nodes. You'll spend your time on the business rule, not on HTTP plumbing.

The visual editor is also a genuine asset, and not in a hand-wavy way. Non-engineers can read it, sometimes modify it, and definitely audit it. For an ops or growth team that owns the workflow but doesn't own the platform, that matters more than any language-design argument.

n8n is also fair-code and self-hostable, which is a real differentiator once you start moving customer data around.

Where n8n is awkward is once the problem stops looking like "glue" and starts looking like software. Workflows are JSON graphs, so version control sees opaque blobs, code review is mostly reading screenshots, and refactoring a node used in fifteen workflows is a manual exercise. The Code node exists as an escape hatch, and its existence is itself a signal: when the visual surface stops being expressive enough, you drop into JavaScript with no type system bridging the two sides. n8n's LangChain integration is mature, but multi-step LLM orchestration with branching, retries, and sub-workflow composition starts to feel like fighting the medium. That's fine — it just isn't the medium's job.

LangGraph, fairly.

LangGraph is the right answer for problem (2), and it's also not close.

If you are a developer building an agent for something you fully understand — your own internal automation, a tool you'll maintain end-to-end, a workflow whose audience is yourself and your team — and your stack is already Python, LangGraph is what you reach for. The library gives you state, branching, checkpoint/resume, and a way to stream events as the agent runs, all expressed as code. You define the nodes, you define the edges, you debug it the same way you debug the rest of your service. LangSmith gives you tracing for free if you opt in. For developer-controlled agents in a developer-controlled codebase, that surface is the right shape.

A minimal LangGraph node graph is roughly this:

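from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import InMemorySaver

class AgentState(TypedDict):
    text: str

def analyze(state: AgentState) -> dict:
    return {"text": state["text"].strip()}  # placeholder; a real node calls a model or tool

def chatbot(state: AgentState) -> dict:
    return {"text": f"reply to: {state['text']}"}  # placeholder; returns a partial state update

graph = StateGraph(AgentState)
graph.add_node("analyze", analyze)
graph.add_node("chatbot", chatbot)
graph.add_edge(START, "analyze")
graph.add_edge("analyze", "chatbot")
graph.add_edge("chatbot", END)

app = graph.compile(checkpointer=InMemorySaver())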

That's a real ergonomic win for a real class of problem. The checkpointer is genuinely useful — you can resume a long-running agent across process restarts and inspect intermediate states. The streaming model (stream_mode="updates" / "values") is mature.

Where LangGraph gets awkward is the moment the workflow has to be readable, evaluable, or shippable as a product feature in its own right. The workflow IS Python. Refactoring it is refactoring Python. Asking a non-engineer subject-matter expert to read or edit it is asking them to read Python. Versioning the workflow separately from the surrounding service is whatever you build yourself. The eval story is whatever you stitch together with LangSmith plus your own harness. None of this is a flaw — LangGraph is a library, not a platform — but the gap shows up the moment you have to defend the quality of a workflow with a number rather than a vibe, or hand the workflow to the domain experts who actually know what good output looks like.

Where Akribes lives.

Akribes is what we built for problem (3): AI workflows shipped as part of a product, where the value lives in the embedded domain expertise and "did this change make the system better" has to be a question you can answer with a number rather than a feeling.

The first thing that has to be true is that the workflow is readable by the domain experts who supply the expertise. Akribes' syntax is small on purpose: a script declares its inputs, declares a few agents, declares a few tasks, and ends with a workflow body that wires them together. A patent attorney, a regulatory lead, or an underwriter can read an Akribes script; the type system catches what they would miss. That is not a nicety. When the value of the workflow IS the encoded domain knowledge, you cannot afford to translate it into Python and lose the only people who would notice if it were subtly wrong.

Concretely, the same "summarize and pass through another model" task looks roughly like this:

use summarize

input topic: str

agent Analyst:
  model: gpt_4o_mini
  system: "You are a market analyst who turns topics into structured notes."

task expand(t: str) -> str:
  agent: Analyst
  prompt: "Expand this topic into 3 bullet points of analyst notes: {t}"

workflow
  raw = expand(topic)
  short = summarize(text=raw)
  return short

That use summarize line is the load-bearing part. summarize is a separate script, owned independently, published independently, versioned independently. The analyzer resolves its ScriptSignature (inputs and workflow -> T return type) at parse time. If summarize changes its workflow return from str to a record, every script that uses it fails at compile time rather than at the next eval run.
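For readers who think in Python, the idea behind that contract check can be modelled with a toy. This is not the Akribes analyzer or its API: the ScriptSignature fields, the UseSite type, and check_use are hypothetical stand-ins that only show why a changed return type surfaces at every call site before anything runs.

```python
from dataclasses import dataclass


@dataclass
class ScriptSignature:
    inputs: dict[str, str]  # parameter name -> type name
    returns: str            # type name of the workflow return


@dataclass
class UseSite:
    caller: str
    args: dict[str, str]    # argument name -> type name passed by the caller
    expects: str            # type the caller assigns the result to


def check_use(sig: ScriptSignature, site: UseSite) -> list[str]:
    """Return the contract violations for one call site (toy model, no execution)."""
    errors = []
    for name, wanted in sig.inputs.items():
        got = site.args.get(name)
        if got != wanted:
            errors.append(f"{site.caller}: argument '{name}' is {got}, expected {wanted}")
    if site.expects != sig.returns:
        errors.append(f"{site.caller}: result used as {site.expects}, script returns {sig.returns}")
    return errors
```

Checking every use site against the published signature is what moves the failure from the next eval run to publish time.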

The same "summarize, then have a follow-up model shape JSON" task that needs six-ish nodes plus a Code node in n8n, and around thirty lines of StateGraph plus a TypedDict in LangGraph, comes to roughly fifteen lines in Akribes: two task blocks and a workflow. The savings show up once "summarize" is one of forty shared sub-scripts across five teams.

The second thing that has to be true is that you can tell, with a number, whether a change made the workflow better. Akribes' eval harness is part of the platform — cases live as fixtures, the judge is written as another Akribes script (multi-axis rubrics are normal: outcome correctness, cited authorities, reasoning route, calibration on hard cases), and the score moves visibly when you change the workflow. The same score that drives our internal go/no-go on a prompt change is the one a customer can be pointed at when they ask whether the system is getting better. This is the part most "ship an AI feature" projects do not have, and it is the part that separates "we have an AI feature" from "we have an AI feature whose quality we can defend."

The other things that fall out of treating workflows as first-class artifacts:

Modularity, properly. Scripts import each other with use, share record types, and version independently. The same shared-sub-script pattern that makes a single team productive scales across teams when an analyzer enforces the contract.
A real LSP. akribes-lsp is a proper language server — go to definition across script boundaries, hover types on used workflows, find references on a task. Not Python's LSP being helpful about a Python file that happens to contain a graph.
MCP in both directions. A workflow consumes external MCP servers as typed tools (databases, third-party SaaS, anything that speaks the protocol). A workflow can also be exported as an MCP tool, which means an Akribes script you wrote for your own pipeline becomes a tool another team's agent — or another company's, if you publish it — can call without touching your code.
An SDK for the client side. Studio is the editor; the SDKs (TypeScript, Python, Rust) are how a workflow gets called from a real application. The same workflow that runs from an internal cron, a Studio test panel, and a customer's frontend speaks the same event protocol on all three.
A streaming event model that is the protocol, not an afterthought. The engine emits WorkflowStart, NodeStart, TaskStart, TaskPrompt, AgentOutput (token-streaming), TaskEnd, Suspended, Resumed, WorkflowEnd, Error. Every SDK consumes the same stream; Studio renders it; your service can subscribe to it.
Checkpoints as a language feature. A task that fails structured output validation can route directly to a checkpoint block, which suspends the workflow, emits Suspended, and waits for a typed resume payload. The shape of the resume value is checked by the analyzer before publish.
Input resolvers. A workflow declares input sources: SourceSet by fetch_sources(...), and the caller never threads sources through — the server resolves it from another script's output at execution time. Composite chains keep their public surface to two or three scalars even when the underlying graph pulls from a dozen upstream documents.
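To make the event model concrete, here is a sketch of a consumer that folds a stream of those events into a simple view of a run. The event names come from the list above; the dict shape of each event (a "type" key, a "text" key on AgentOutput, a "message" key on Error) is an assumption for illustration, not the SDK's wire format.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass
class RunView:
    status: str = "pending"
    tokens: list[str] = field(default_factory=list)
    error: Optional[str] = None


def fold_events(events: Iterable[dict]) -> RunView:
    """Reduce an event stream to a status, the streamed output, and any error."""
    view = RunView()
    for event in events:
        kind = event["type"]
        if kind in ("WorkflowStart", "Resumed"):
            view.status = "running"
        elif kind == "AgentOutput":
            view.tokens.append(event["text"])  # hypothetical payload key
        elif kind == "Suspended":
            view.status = "suspended"  # waiting on a typed resume payload
        elif kind == "WorkflowEnd":
            view.status = "done"
        elif kind == "Error":
            view.status = "failed"
            view.error = event.get("message")  # hypothetical payload key
        # NodeStart, TaskStart, TaskPrompt, TaskEnd carry detail this view ignores.
    return view
```

A logging hook, a progress bar, and a monitor for long-running jobs can all be written as different folds over the same stream.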

A more recent addition: a visual benchmark builder in Studio. The premise is that the people with the test cases — domain experts — are not always the people with the patience to write a workflow language. Upload a set of example input/output pairs, Studio drafts a workflow and a starter judge, scores the workflow against the examples, and suggests refinements that would move the score on those specific cases. Domain experts get faster iteration; engineers get a starting point that already passes the cases the expert cares about most.

The shorthand contrast is with the typical "ship an AI feature" approach in 2026 — a chat box on a website, a few prompts behind it, whatever ad-hoc evaluation the engineer can fit in around the chat-box work. Four things tend to be missing from that approach. Iteration is slow, because there is no shared eval to compare versions against. Quality is hard to defend, because there is no score that captures whether the model got the underlying job right. The interaction model is exhausted by the chat turn — no checkpoints, no structured intermediate state, no human-in-the-loop for the cases that need it. And the result is, in practice, hard for the customer to distinguish from running the same query against a generic LLM chat themselves. The properties above — eval-as-platform, typed structured outputs, language-level checkpoints, typed cross-script composition — exist because each of them addresses one of those failure modes head-on.

Where Akribes is rough.

Akribes is younger. The integration zoo is smaller than n8n's — we lean on MCP for the connector surface rather than building a Slack node, a Notion node, a Stripe node, and so on. That's a deliberate bet on the MCP ecosystem reaching critical mass, but if you need "trigger when a new row appears in Airtable" today, we are not the right call yet.

The visual surface is the Studio editor — the Akribes script next to a live event stream, an inline debugger, an eval panel. It is text-forward by design. We lead with a small DSL, not a node-graph canvas, and after a year of customer work this has been less of a barrier for domain experts than we braced for. Reading a small, declarative DSL aimed at the workflow they already know is a different ask than reading a Python state machine, and after the first hour or two of orientation the language has mostly stopped being the friction.

That choice isn't accidental. We like code. The things that make code worth keeping (tests you can run against a change, iteration at typing speed, a workflow you can share by sending one file, versions you can branch and merge) matter more once an LLM is involved, not less. There are reasons engineers reach for code instead of a visual canvas for almost everything else they build, and those reasons didn't evaporate when generative models showed up. We would rather approach AI from the deep tooling expertise the industry already has than discard it because something else looks shinier.

A node-based editor is on our roadmap. It isn't a hard build, just one we haven't prioritised over the rest of the workflow surface yet. What we're not willing to do, in the meantime, is point production AI work toward n8n because a team wants a visual canvas. n8n's editor is genuinely good for SaaS plumbing; once the workflow has branching logic, typed sub-scripts, evals, and a customer-facing SLA on the output, the JSON-graph format plus the Code-node escape hatch is not what you reach for.

The smaller community shows in edge cases. We fix what we hit; if you need a specific provider or a specific behaviour we have not yet had a reason to write, the answer is more likely "send a PR" than "there is a maintained plugin." That is the cost of a younger tool and we own it.

The division of labour we want.

Most teams shipping production AI products have two kinds of work in front of them. One is the LLM-infrastructure layer: picking providers, abstracting their differences, building eval harnesses, writing a usable editor for the people who tune prompts, handling streaming, caching, tokens, costs. The other is the work that actually distinguishes the product — the customer-facing application, and the workflows that encode the domain knowledge.

The pitch behind Podesta is that the first layer should be shared infrastructure your team does not build. We do it. The second layer is where your engineers and your domain experts should be spending their time — your engineers shipping the application your end users pay for, your domain experts tuning the workflows that encode what their domain requires. n8n and LangGraph draw that boundary differently. n8n absorbs almost all of the integration glue and gives the workflow surface to non-engineers; LangGraph delegates almost all of it to your Python and assumes your engineers own the workflow as well. Podesta lands closer to n8n in spirit — the platform absorbs the LLM-infra layer, the workflow surface is one a domain expert can reach — but with a typed-artifact story and an SDK that lets engineers compose on top.

Which one, then.

A short decision tree, with all the usual caveats about how a decision tree is a lie:

Are you wiring Slack to Notion on a Stripe webhook? Use n8n.
Is one of your developers writing a custom agent for a job they fully understand, inside a Python service they own end-to-end? Use LangGraph.
Are you productising work that requires real subject-matter expertise — and you need the domain experts in the iteration loop, an eval that tells you whether changes improved the result, and the workflow shipped behind your own product? That's our target. Try Akribes.
If you're in two of these at once: pick the shape that is most painful today and treat the other as the thing on the other side of a queue.

What "workflow" means to each.

The useful frame for comparing these tools is not which is best but what each of them treats a workflow as. For n8n, a workflow is a node graph — a document that draws itself. For LangGraph, a workflow is a Python program — a function over a typed state. For us, a workflow is a versioned, type-checked artifact with its own language, its own analyzer, and its own publish lifecycle.

Those three bets address different parts of the same broad problem and they will likely keep coexisting. The practical question for any team is not which framework wins; it is which of the three shapes their workflow is actually closest to today, and that question is usually easier to answer than the framework debate suggests.

Building a strict benchmark for AI in patent law.

2026-05-14T00:00:00.000Z

We build an AI agent for patent workflows on top of the Podesta platform. Evaluating that agent honestly is the hard part of building it. Patent reasoning is multi-dimensional: the outcome reached, the legal basis cited in support, the route from inputs to conclusion, and how the system handles the cases experts themselves disagree on. An eval that collapses those onto a single right-or-wrong axis would miss the failure modes our customers most need us to catch. We built ours to score across those dimensions, and we built both the agent and the eval runner that scores it on the same platform: the agent is an Akribes workflow, and so is the judge. This post is about that benchmark: why we anchored it on the European Patent Office's Boards of Appeal, what multi-axis scoring looks like in practice, and what we got out of running the eval pipeline on the same primitives the agent uses.

Why patent law is a brutal domain for AI eval.

Most public benchmarks for legal AI are quiz-shaped. A question, a multiple-choice answer, sometimes a short free-text rationale. They travel well, they compare easily across systems, and they bear almost no resemblance to the artefact our customers actually produce.

A real piece of EPO work is not an answer. It is a structured document — a set of claim amendments with their basis in the application as filed, a novelty argument over a specific prior-art disclosure, an inventive-step chain under the problem-and-solution approach, a reply to a communication that does all three at once and remains admissible under the Rules of Procedure. Correctness in that world is a conclusion, plus the reasoning that justifies it, plus the legal basis cited in support, plus the procedural posture that makes the whole thing arguable in the first place.

Worse, experts disagree. The European Patent Convention is a half-century-old treaty interpreted through tens of thousands of written decisions. Two competent attorneys will read the same set of claims and the same prior art and reach different conclusions on whether an amendment is allowable under Article 123(2) EPC. Sometimes two Boards of Appeal reach different conclusions. A benchmark that does not acknowledge that texture is not measuring patent reasoning; it is measuring something easier and pretending.

Why the EPO Boards of Appeal are useful.

The Boards of Appeal are the final instance for most decisions taken by the EPO's examining and opposition divisions. Their decisions are written, public, and reasoned in detail. The Board states the facts, states the legal question, walks through the applicable case law, and renders a conclusion with the precise legal basis cited. For an AI eval, that is rare: the ground truth has already been written, by people whose job it is to write it.

We are not the first to notice that BoA decisions look like training-or-eval material. We think we are unusual in treating them as eval material specifically, and in declining to use them for training. The whole point of using a Board decision as ground truth is that the system has not seen the answer. Our cases are partitioned accordingly: the agent has access to the application documents and the prior art the Board considered, never to the decision itself.

How we structured the bench.

Each bench entry is a single, narrow legal question grounded in a single BoA decision. A typical entry asks something like: given this set of claims as amended and the application as originally filed, is the amendment allowable under Article 123(2) EPC? Or: given this claim and this prior-art disclosure, is the claim novel under Article 54 EPC?

The shape of a case on disk is deliberately boring. A metadata.json locating the decision in the EPO's case-law system; the inputs the agent sees (claims under review, application as filed); and a ground_truth.json we derive from the Board's reasoning. The ground truth is not a copy of the decision. It is a structured distillation: the conclusion the Board reached, the features or amendments the Board considered determinative, the articles cited, and the line of case law relied on. The decision text sits next to it for traceability, never as input to the agent.
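A minimal sketch of that separation, in Python rather than Akribes. Only metadata.json and ground_truth.json are named above; the input file names (claims.txt, application_as_filed.txt), the JSON keys, and the class and function names are assumptions made for illustration. The point it shows is that the agent-side loader never opens the ground truth or the decision text.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AgentInputs:
    """Everything the agent is allowed to see for one case."""
    case_id: str
    claims: str
    application_as_filed: str


@dataclass(frozen=True)
class GroundTruth:
    """Structured distillation of the Board's reasoning (judge-side only)."""
    conclusion: str
    determinative_features: list[str]
    articles_cited: list[str]
    case_law: list[str]


def load_agent_inputs(case_dir: Path) -> AgentInputs:
    # Reads only the input files. ground_truth.json and the decision text
    # are deliberately not touched here.
    meta = json.loads((case_dir / "metadata.json").read_text(encoding="utf-8"))
    return AgentInputs(
        case_id=meta["case_id"],  # hypothetical key
        claims=(case_dir / "claims.txt").read_text(encoding="utf-8"),
        application_as_filed=(case_dir / "application_as_filed.txt").read_text(
            encoding="utf-8"
        ),
    )


def load_ground_truth(case_dir: Path) -> GroundTruth:
    # Called by the judge side only; all keys are hypothetical.
    raw = json.loads((case_dir / "ground_truth.json").read_text(encoding="utf-8"))
    return GroundTruth(
        conclusion=raw["conclusion"],
        determinative_features=list(raw["determinative_features"]),
        articles_cited=list(raw["articles_cited"]),
        case_law=list(raw["case_law"]),
    )
```

Keeping the two loaders separate makes the partition something a reviewer can check by reading one function, rather than a convention people have to remember.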

The agent's output is also structured. Our pipeline is built in Akribes (Podesta's typed workflow language), and every stage returns a typed value: a list of claim features here, a problem-solution analysis there, a final conclusion with citations. That matters for the judge: we are not comparing free-text essays, we are comparing fields.

The scoring rubric.

A case score is not a single number. The judge produces a structured breakdown across several axes — the outcome the agent reached, the legal basis cited in support, the route from inputs to conclusion, and how the agent handled the cases that even the Board flagged as borderline. The headline composite is built from those, and we publish the per-axis decomposition alongside the headline whenever we publish a score at all. A pipeline can move the headline up by getting better at citing the right authorities without changing its outcome accuracy, and that has to read as a different signal than the same headline move from better outcome accuracy.

The rubric is "strict" in the sense that the legal-basis and reasoning-route axes act as severe multipliers on the outcome axis rather than as independent additions. "Right answer, fabricated citation" lands much lower than "right answer, real citation", even though the answer is identical. We have seen plenty of models reach the correct outcome via case law that does not exist, or via an article that does not say what they think it says. Letting that count as a win would teach us to ship a system that bluffs convincingly, which is the failure mode our customers most need us to catch.
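To make the difference between multipliers and additions concrete, here is a minimal sketch. The axis names follow the text, but the 0-to-1 scale, the plain product, and the function names are assumptions for illustration, not the production rubric or its weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisScores:
    outcome: float          # 1.0 if the conclusion matches the Board's, else 0.0
    legal_basis: float      # 0.0..1.0, how well the cited authorities hold up
    reasoning_route: float  # 0.0..1.0, whether the route is one a Board would accept


def _check(scores: AxisScores) -> None:
    for value in (scores.outcome, scores.legal_basis, scores.reasoning_route):
        if not 0.0 <= value <= 1.0:
            raise ValueError("axis scores must lie in [0, 1]")


def strict_composite(scores: AxisScores) -> float:
    """Multiplicative: a weak legal basis or route drags the whole score down."""
    _check(scores)
    return scores.outcome * scores.legal_basis * scores.reasoning_route


def additive_composite(scores: AxisScores) -> float:
    """Additive baseline, shown only for contrast: axes contribute independently."""
    _check(scores)
    return (scores.outcome + scores.legal_basis + scores.reasoning_route) / 3
```

With made-up numbers, a right outcome on a fabricated citation (outcome=1.0, legal_basis=0.1, reasoning_route=0.8) scores about 0.08 under the multiplicative rule and about 0.63 under the additive one. The same outcome with a sound citation (legal_basis=0.9) scores about 0.72 under the multiplicative rule.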

Calibration matters too. A confident wrong outcome on a case the Board called clearly is scored differently from a confident wrong outcome on a case the Board itself flagged as unusual on its facts. We want the agent to register the difference between "the answer is hard and the system should hedge" and "the answer is clear and the system should commit"; the rubric rewards matching the posture to the case and penalises the cases where the agent picks the wrong one of the two.
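One way to picture the posture check is a lookup from case difficulty to expected posture. This is a sketch under loud assumptions: the two-level difficulty label, the two postures, and the 0.5 mismatch factor are all hypothetical, and the real rubric is more graded than a binary match.

```python
from enum import Enum


class Difficulty(Enum):
    CLEAR = "clear"            # the Board called the case clearly
    BORDERLINE = "borderline"  # the Board flagged it as unusual on its facts


class Posture(Enum):
    COMMIT = "commit"
    HEDGE = "hedge"


# Which posture the case calls for (hypothetical mapping).
EXPECTED_POSTURE = {
    Difficulty.CLEAR: Posture.COMMIT,
    Difficulty.BORDERLINE: Posture.HEDGE,
}

MISMATCH_FACTOR = 0.5  # hypothetical penalty, not the real value


def calibration_factor(difficulty: Difficulty, posture: Posture) -> float:
    """Return 1.0 when posture fits the case, a penalty factor otherwise."""
    if EXPECTED_POSTURE[difficulty] is posture:
        return 1.0
    return MISMATCH_FACTOR
```

A factor like this could be applied to a case score alongside the other axes, so that hedging on a clear case and committing on a borderline one both cost something.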

The headline target is a high bar on the composite under that rubric: a large majority of cases in which the agent reached the right conclusion, by reasoning a Board would accept, citing the authorities the Board itself relied on. Even a system that clears that bar is not one anyone should let near a real file unsupervised, and we say so. It is the bar we ratchet towards.

The judge is also an LLM. We know.

We will not pretend otherwise. The judge that scores agent outputs against ground truth is itself a model. Three things keep that honest. The judge prompt is versioned alongside the cases, so a score is reproducible against a specific judge version and case set. We run replicated judge calls and take the median, with outliers logged to disk. And we have characterised, separately, where the variance in our composite score actually comes from. For our current pipeline, the judge is not the loud term. The composite is. We wrote that up in "The judge isn't the variance. The composite is."
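The replicate-and-take-the-median step can be sketched in a few lines of Python. The function name, the replica count, the outlier threshold, and the log format are assumptions for illustration; `judge` stands in for a single scoring call to the model.

```python
import json
import statistics
from pathlib import Path
from typing import Callable, Optional


def replicated_score(
    judge: Callable[[], float],
    judge_version: str,
    replicas: int = 5,
    outlier_threshold: float = 0.2,
    log_path: Optional[Path] = None,
) -> float:
    """Call the judge several times and return the median score.

    Scores further than `outlier_threshold` from the median are appended
    to `log_path` as one JSON line, tagged with the judge version.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    scores = [judge() for _ in range(replicas)]
    median = statistics.median(scores)
    outliers = [s for s in scores if abs(s - median) > outlier_threshold]
    if outliers and log_path is not None:
        record = {
            "judge_version": judge_version,
            "scores": scores,
            "median": median,
            "outliers": outliers,
        }
        with log_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
    return median
```

The median is used rather than the mean so that a single wild judge call does not move the case score; the log keeps that call visible instead of silently discarding it.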

The hard cases.

Some cases in the bench are genuinely hard, and we put them there on purpose. There are Board decisions where the Board itself notes that the outcome is unusual on its facts. There are decisions where the applicable case law has shifted across the Boards over the years. There are decisions where two Boards have ruled differently on what looks, at first reading, like the same legal question, and the divergence has not yet been resolved by the Enlarged Board.

A bench that excludes those is easier to score well on, and a system tuned to score well on it will quietly learn to bluff through edge cases at deployment. We want the opposite. The hard cases are where we want to see the agent hesitate, qualify, cite the relevant divergence, or flag the question as one a human attorney should resolve. A confident wrong answer on a hard case costs us more than a hedged answer that turned out right.

Akribes for the agent — and for the runner.

We build the agent itself in Akribes because legal reasoning, looked at without the storytelling, is a typed pipeline. Extract the relevant claim features from the application as filed. Identify the disclosed features in the cited prior art. Reason on novelty over each independent claim. If the claim is novel, reason on inventive step under problem-and-solution. Assemble the response in the procedural posture the case is actually in.

Each of those steps is a sub-workflow with a typed output. When the case law on a particular interpretive question shifts — and it does, about as often as you would expect for a treaty interpreted by a standing tribunal — we change one stage. The type system catches the downstream stages that no longer fit before the next eval run, not in production three weeks later. That is the difference between a refactor that lands cleanly and a refactor that quietly degrades the bench score in a way nobody notices for a month.
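For readers who have not seen Akribes, the shape of that pipeline can be approximated in Python. This is an analogy and not Akribes syntax: every type and function name here is hypothetical, the feature extraction and inventive-step stages are passed in as callables standing in for model-backed stages, and the set-difference novelty check is a toy stand-in for real reasoning.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class NoveltyFinding:
    novel: bool
    distinguishing_features: tuple[str, ...]


@dataclass(frozen=True)
class InventiveStepFinding:
    inventive: bool
    objective_problem: str


@dataclass(frozen=True)
class Conclusion:
    novelty: NoveltyFinding
    inventive_step: Optional[InventiveStepFinding]  # None when the claim is not novel


def assess_novelty(
    claim_features: frozenset[str], disclosed: frozenset[str]
) -> NoveltyFinding:
    # Toy rule: the claim is novel if at least one feature is not disclosed.
    missing = tuple(sorted(claim_features - disclosed))
    return NoveltyFinding(novel=bool(missing), distinguishing_features=missing)


def run_pipeline(
    claim_text: str,
    prior_art_text: str,
    extract_features: Callable[[str], frozenset[str]],
    assess_inventive_step: Callable[[NoveltyFinding], InventiveStepFinding],
) -> Conclusion:
    claim_features = extract_features(claim_text)
    disclosed = extract_features(prior_art_text)
    novelty = assess_novelty(claim_features, disclosed)
    # Inventive step is only reached when the claim is novel.
    inventive = assess_inventive_step(novelty) if novelty.novel else None
    return Conclusion(novelty=novelty, inventive_step=inventive)
```

In this analogy, changing the fields of NoveltyFinding would make a static checker such as mypy flag every stage that still expects the old shape, which is roughly the guarantee the paragraph above describes.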

The eval runner that scores the agent is also an Akribes workflow. The agent runs as one script, the judge — given the agent's structured output and the ground-truth distillation — runs as another, and a small harness composes them across cases. We did not plan it that way; we backed into it because the alternative was maintaining a second copy of the model-provider abstractions, the streaming protocol, the token budgeting, and the cost accounting we already had load-bearing on the agent side. By the time the harness was real, the side benefit was that the judge prompt sits in a script a patent attorney can read. A judge buried in Python with bespoke plumbing is opaque to anyone who does not also write Python; a judge as an Akribes script reads as the sequence of checks it actually is, and our domain experts can suggest edits to it in the same way they suggest edits to the agent.
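The composition itself is small. A sketch of the harness in Python, with hypothetical names throughout; in the real runner the agent and the judge are Akribes scripts, which are modelled here as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, TypeVar

InputsT = TypeVar("InputsT")   # what the agent sees
OutputT = TypeVar("OutputT")   # the agent's structured output
TruthT = TypeVar("TruthT")     # the ground-truth distillation


@dataclass(frozen=True)
class BenchCase(Generic[InputsT, TruthT]):
    case_id: str
    inputs: InputsT
    ground_truth: TruthT


def run_bench(
    cases: Iterable[BenchCase[InputsT, TruthT]],
    agent: Callable[[InputsT], OutputT],
    judge: Callable[[OutputT, TruthT], float],
) -> dict[str, float]:
    """Run the agent on each case, then score its output with the judge."""
    scores: dict[str, float] = {}
    for case in cases:
        output = agent(case.inputs)  # the agent never receives ground_truth
        scores[case.case_id] = judge(output, case.ground_truth)
    return scores
```

Because the agent only ever receives `case.inputs`, the partition between what the agent sees and what the judge sees is enforced by the harness signature rather than by discipline.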

The newer version of this loop will run on Podesta Studio's bench panel, which adds visual case-by-case views, per-stage score breakdowns, and the same live event stream a Studio run already uses. The current version — the one this post describes — runs on the same primitives a platform customer has access to. We did not lean on internal tooling nobody else has.

What this is and is not.

The eval we have built is not a solution to patent AI; it is a scoring system we are willing to be measured against. The headline number will move when the pipeline improves. It will also move when we expand the case set, when the Enlarged Board issues a precedent-shifting decision, or when the judge prompt itself is revised to catch a failure mode it was previously letting through. We publish those deltas with their causes attached, because attribution is what separates a strict eval from a slogan — and because, in a domain where a confidently wrong brief is a patent application sunk on appeal, attribution is the part of the methodology that most directly translates to whether a customer can rely on the system at all.