MicroVM isolation for our CI runners.
Every CI job runs inside a Kata microVM. Here's why, what it cost, and what broke.
Our CI runs against a self-hosted Forgejo Actions setup in the same Kubernetes cluster as the product. Every job runs inside a microVM (Kata Containers on Kubernetes), not a regular container. Here is why we made that bet, what it cost, and what broke along the way.
Why microVM, not container.
A container shares the host kernel. A CI job runs arbitrary code from a PR: anyone with write access can land a workflow change, and anyone with a fork can open the PR that runs it. Even with rootless containers and a tight set of Linux capabilities, the container threat model still assumes that no build script ever finds a kernel exploit. That assumption has been wrong, in public, often enough that we did not want to make it.
A microVM changes the shape of the worst case. The build script runs against a guest kernel inside a hypervisor (cloud-hypervisor in our case); a kernel escape now buys you root on a tiny throwaway VM, not on the node that schedules every other job in the cluster. We are not a security boutique. We just did not want a compromised dependency PR to be a cluster-level event when the alternative — ~6 GiB of guest RAM and a few seconds of startup per job — was bearable.
The other consideration is operational. CI is the workload most likely to do something strange to its kernel: mount, mkfs.ext4, raw netlink, weird capability sets in docker-in-docker (dind). A microVM confines every one of those experiments to a sandbox that is destroyed when the job exits.
The shape of the deployment.
Kata is wired in as a Kubernetes RuntimeClass. The runner pods set runtimeClassName: kata in their PodSpec; Kubernetes schedules them through the Kata shim instead of the default container runtime. From the cluster's point of view they are still pods — kubectl logs, events, lifecycle hooks all work — but the container inside the pod runs inside a cloud-hypervisor VM, with the OCI image's rootfs presented to a guest kernel.
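As a minimal sketch of that wiring, assuming Kata is registered with the container runtime under a handler named kata-clh (the handler name, pod name, and image are illustrative assumptions rather than the real manifests), the RuntimeClass and a pod that opts into it could look like this:

```yaml
# RuntimeClass: maps the name pods refer to onto a container runtime handler.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata              # the name used in runtimeClassName below
handler: kata-clh         # assumption: handler registered for Kata + cloud-hypervisor
---
# Any pod that sets runtimeClassName: kata is started through the Kata shim.
apiVersion: v1
kind: Pod
metadata:
  name: kata-smoke-test   # hypothetical name
spec:
  runtimeClassName: kata
  restartPolicy: Never
  containers:
    - name: shell
      image: docker.io/library/busybox:stable   # placeholder image
      command: ["uname", "-r"]   # reports the guest kernel version, not the node's
```

Comparing that pod's log output with uname -r on the node is a quick way to confirm the container really landed in a guest kernel.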
Two runner StatefulSets handle the actual work, one per dedicated host, both pinned by nodeSelector so the per-pod CPU limits add up to exactly the host's thread count. Each pod contains a Forgejo Actions runner container plus a docker-in-docker sidecar; the runner uses the dind socket to spin up the per-step containers a workflow asks for. A workflow step is a container, inside a dind daemon, inside a Kata VM, inside a pod, on a node. It sounds layered because it is. The boundary that matters for isolation is the VM.
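A rough sketch of one such StatefulSet follows. Every name, the runner image, and the CPU split are hypothetical; it assumes a 16-thread host, two pods per host, and a shared emptyDir as the way the runner reaches the dind socket.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ci-runner-host-a            # hypothetical; one StatefulSet per host
spec:
  serviceName: ci-runner-host-a     # required field; assumes a matching headless Service
  replicas: 2                       # assumption: two Kata VMs on this host
  selector:
    matchLabels: {app: ci-runner-host-a}
  template:
    metadata:
      labels: {app: ci-runner-host-a}
    spec:
      runtimeClassName: kata
      nodeSelector:
        kubernetes.io/hostname: host-a   # pin to the dedicated host (hypothetical name)
      volumes:
        - name: docker-run
          emptyDir: {}                   # shared directory that holds docker.sock
      containers:
        - name: runner
          image: registry.internal/forgejo-runner:example   # placeholder image
          env:
            - name: DOCKER_HOST
              value: unix:///var/run/docker.sock
          volumeMounts:
            - {name: docker-run, mountPath: /var/run}
          resources:
            limits: {cpu: "1"}
        - name: dind
          image: docker.io/library/docker:dind
          securityContext:
            privileged: true             # dind needs it; the VM is the isolation boundary
          volumeMounts:
            - {name: docker-run, mountPath: /var/run}
          resources:
            limits: {cpu: "7"}           # 2 pods x (1 + 7) = 16, the assumed thread count
```

Running the dind container privileged is only tolerable because the privilege is scoped to the guest kernel of a throwaway VM.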
Workflows opt in to the pre-warmed image with a label exposed in the runner config:
labels:
- "microvm-runner:docker://registry.internal/forgejo-microvm:latest"
A workflow that wants the baked-in toolchain says runs-on: microvm-runner. Jobs that ask for plain ubuntu-24.04 still work — they get a stock catthehacker image without the pre-warming.
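For illustration, a workflow using both labels might look like the sketch below. The file path, job names, and commands are made up for the example; only the two runs-on values come from the setup described above.

```yaml
# .forgejo/workflows/ci.yml (hypothetical file)
on: [pull_request]

jobs:
  build:
    runs-on: microvm-runner     # pre-warmed image with the baked-in toolchain
    steps:
      - uses: actions/checkout@v4
      - run: cargo build --locked

  lint:
    runs-on: ubuntu-24.04       # stock image, no pre-warming
    steps:
      - uses: actions/checkout@v4
      - run: echo "runs in the stock image"
```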
Why Kata and not the alternatives. Firecracker directly would have worked, but it does not speak the OCI image format the way Kata does. gVisor would have given us syscall-level isolation in userspace, but the performance cost on a Rust compile is much worse than the hypervisor cost we did pay, and it would not have helped against a kernel escape — the gVisor sentry is the kernel surface. Kata won because an existing Containerfile is also a microVM image.
What it cost.
This is the honest section. Kata jobs are slower to start than runc containers, and the gap is real. A Kata pod boots a guest kernel, hand-shakes with the guest agent, sets up the rootfs inside the VM, and only then starts your container. Even with cloud-hypervisor — the fastest of the supported Kata hypervisors — that is several extra seconds of cold start on top of the usual image pull. For a sub-minute test job, that is a meaningful fraction of total wall time.
For our Rust builds it stops mattering, because those builds spend most of their wall time in rustc and the linker, with sccache pulling cached object files from in-cluster object storage. The absolute startup overhead is the same as for a short job, but it is now amortized over a much longer compile. The same holds for Playwright shards, where the dominant cost is browser launch and navigation.
We did not benchmark this rigorously enough to publish a number. The qualitative read: jobs measured in seconds feel slower; jobs measured in minutes do not. Anyone considering this for short, cache-hit-heavy CI should know the overhead does not amortize.
The other cost is memory. Kata adds per-VM overhead — guest kernel, agent, virtiofsd, cloud-hypervisor's working set — that we budgeted at 6 GiB on top of the workload's own request. With two VMs per node at CI capacity, that is ~12 GiB per node permanently committed to "Kata exists." We have the headroom; smaller fleets might not.
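One way to make the scheduler account for that budget is the RuntimeClass overhead field. The sketch below shows the mechanism, not a claim about how our cluster declares it; the handler name is an assumption.

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata-clh          # assumption: handler name for Kata + cloud-hypervisor
overhead:
  podFixed:
    memory: "6Gi"          # per-VM budget: guest kernel, agent, virtiofsd, hypervisor
# Scheduler view for one pod: 6Gi overhead + the sum of its container requests.
# Two such pods per node: 2 x 6Gi = 12Gi reserved before any workload memory.
```

With the overhead declared this way, the cost is charged to every pod that uses the RuntimeClass rather than being tracked by hand in each workload's requests.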
The image: why we baked everything in.
Pre-warming the runner image is a self-hosting concession, not a developer-ergonomics one. The actions everyone reaches for (setup-bun, setup-uv, dtolnay/rust-toolchain) all hit api.github.com/repos/*/git/refs/tags unauthenticated to resolve which release to pull. Self-hosted Forgejo jobs are not authenticated against GitHub, and GitHub rate-limits unauthenticated API requests per source IP (60 per hour at the time of writing), so during busy hours those lookups come back 403 and the job fails before the build starts. The official setup-actions are not infrastructure we control; depending on them inside a self-hosted runner is depending on someone else's rate limit budget.
So the image bakes in rustup, sccache, uv, bun, node, the MinIO client, and a full Playwright browser set, all with Renovate-tracked pins:
# renovate: datasource=github-releases depName=rust-lang/rust
ARG RUST_VERSION=1.94.0
# renovate: datasource=github-releases depName=mozilla/sccache
ARG SCCACHE_VERSION=0.8.2
# renovate: datasource=npm depName=playwright
ARG PLAYWRIGHT_VERSION=1.59.1
A Renovate custom manager picks up the # renovate: markers and opens a PR whenever any of those pins goes stale. Bumping a tool means reviewing a Renovate PR; the existing weekly CI cron builds the new image.
Other pre-baked details that earn their keep: RUSTUP_HOME=/opt/rust and CARGO_HOME=/opt/cargo chmod'd world-writable so any UID can use them; PLAYWRIGHT_BROWSERS_PATH=/opt/pw-browsers so playwright install is a no-op; SCCACHE_IDLE_TIMEOUT=0 so the sccache daemon survives the entire job.
What broke.
Three war stories from the last six months. There are more in the runner-image README; these are the most informative.
Long-running builds and SandboxChanged. On jobs north of ten minutes, pods on the Kata runtime with cloud-hypervisor would occasionally get a SandboxChanged event mid-build: the kubelet sees the pod's sandbox as effectively replaced, and the running container disappears. In the runner-image build, the Rust toolchain layer was always the long pole. The fix on the image side is to keep RUN steps small, because each one is a checkpoint. The fix on the diagnosis side is to capture dmesg, the containerd-shim-kata-v2 processes, and recent containerd logs from the affected node before they roll over, which is what our kata-capture.sh exists for. Usual suspects: hypervisor OOM, a kernel trace in vhost-net or virtio-blk, or the shim itself crashing.
Rootless podman + the catthehacker base. Building the image locally with rootless podman against the home-directory overlay store reproducibly fails on the final COMMIT step with the error "history lists N non-empty layers, but we have M layers on disk". The catthehacker base ships with ~18 layers, some carrying empty-layer history markers that podman's overlay driver mishandles under home-directory storage. --squash alone does not fix it. CI builds work because the Kata VM that runs the build has a clean overlay store. Local builds either run on a workstation with a non-home storage root or fall back to --storage-driver=vfs, which is slow and disk-hungry but produces a valid image.
The Forgejo registry truncating large blob uploads. This one is unresolved. Pushing the runner image to our own Forgejo container registry fails on the Playwright browsers layer (roughly a GB of compressed binaries — chromium, firefox, webkit and their OS deps, rendered as a single OCI blob). The failure mode is either 499 Client Closed Request on the client side or 504 Gateway Timeout from the ingress, and it reproduces from both CI and a workstation push over the same public ingress — which rules out the image and rules in the proxy in front of the registry. The fix lives in the ingress chart: bump proxy-read-timeout and proxy-send-timeout past the time it takes to upload the biggest single blob, raise the body-size cap, and check the registry's own upload limits. There is no reasonable image-side workaround; Playwright lays each browser down as one directory tree and the layer compresses as a single blob. Until the ingress timeouts move, the workaround is to push from somewhere with enough patience to retry.
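If the ingress in front of the registry is ingress-nginx (an assumption; the option names above match its annotations), the change would look roughly like this untested sketch. The resource name, host, backend Service, port, and timeout values are all hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: forgejo-registry                  # hypothetical
  annotations:
    # Seconds nginx waits between successive reads from / writes to the upstream.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "900"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "900"
    # "0" disables the request body size check, so a large blob is not rejected.
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
spec:
  ingressClassName: nginx
  rules:
    - host: registry.example.internal     # hypothetical
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: forgejo-http        # hypothetical Service name
                port:
                  number: 3000            # hypothetical port
```

The timeouts are illustrative values; the right numbers depend on how long the largest blob takes to upload over the slowest link that pushes.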
What sccache buys us.
The image installs sccache but does not set RUSTC_WRAPPER globally — the runner-level config exposes the S3 endpoint and bucket name as default env vars, and Rust workflows opt in by exporting RUSTC_WRAPPER=sccache themselves. The opt-in matters: when the object store is unreachable or the credentials drift, an opted-in job fails loudly with "Server startup failed" rather than every Rust job in the cluster falling over silently. The backend is an in-cluster object store, so the round-trip is one Linux bridge away.
Combined with SCCACHE_IDLE_TIMEOUT=0, this is where the wall-time win for repeat Rust jobs comes from. Cold compiles still pay the full cost, and a touched crate and the crates that depend on it still recompile, but every crate whose inputs are unchanged pulls cached objects rather than recompiling. Pair that with the microVM startup overhead being amortized over a multi-minute compile, and the performance picture stops looking bad: the boundary cost only shows up on the short jobs.
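The opt-in described above can be sketched as a job-level env entry. The job name and commands are illustrative, and the sketch assumes the S3 endpoint and bucket are already provided by the runner's default environment.

```yaml
on: [pull_request]

jobs:
  test:
    runs-on: microvm-runner
    env:
      RUSTC_WRAPPER: sccache      # opt in; cargo now invokes rustc through sccache
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --locked
      - run: sccache --show-stats # cache hits and misses for this job
```

Printing the stats at the end of the job makes a cold cache or a misconfigured backend visible in the log instead of only in the wall time.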
The companion post.
For the broader self-hosting story see Self-hosting our developer toolchain. This post is the deep dive on the runtime; that one is the elevator pitch for the whole stack.
Closing.
The microVM-per-job choice is one of those decisions made for security reasons that still holds up on cost six months later. The CI threat model (arbitrary code from a PR running on infrastructure that also runs the product) is the load-bearing reason; the performance overhead has stayed bearable because the long jobs that benefit most from microVM isolation are also the ones that amortize the per-job startup cost over a multi-minute compile. The remaining operational pain is downstream of the registry and the ingress, not the runtime, and it is being closed out as we get to it. We would make the same call again.