agentmoat

Move Kubernetes workloads from runc to gVisor (runsc), safely and reversibly.

agentmoat moves Kubernetes workloads from the default runc runtime to gVisor (runsc), the user-space kernel that defends against the kernel-exploit step of a container-escape chain. It scans a cluster, classifies every workload by gVisor compatibility, generates a deterministic migration plan with a stable hash, applies it via RuntimeClass, and rolls it back on demand. On EKS specifically there is no managed switch: Bottlerocket does not ship runsc, managed node groups have no gVisor toggle, and AWS does not officially support gVisor. agentmoat fills that gap.

The threat model and CVE backdrop live in docs/threat-model.md.

Quickstart

From a fresh clone, against a local kind cluster with real gVisor preinstalled:

git clone https://github.com/0hardik1/agentmoat
cd agentmoat
make kind-up          # builds the gVisor-enabled kind node image, then creates the cluster
make e2e              # runs scan -> plan -> apply -> rollback end-to-end and asserts the results

make e2e exercises the full pipeline against a real runsc runtime and probes patched pods for gVisor markers in dmesg. It is the fastest way to see the tool work.

Against your own cluster (kubectl context already set):

make build                                            # produces ./bin/agentmoat
./bin/agentmoat scan                                  # human-readable table
./bin/agentmoat scan --output json > scan.json        # versioned schema
./bin/agentmoat plan --scan scan.json --output json > plan.json
./bin/agentmoat apply --plan plan.json                # dry-run by default
./bin/agentmoat apply --plan plan.json --dry-run=false # actually mutate
./bin/agentmoat rollback --plan plan.json --dry-run=false

scan exits 0 when nothing is incompatible, 2 when at least one workload is. apply and rollback are idempotent: re-running an applied plan reports every step as already-applied and exits 0.

Install

Build from source with Go 1.26+:

go install github.com/0hardik1/agentmoat/cmd/agentmoat@latest
go install github.com/0hardik1/agentmoat/cmd/agentmoat-mcp@latest

Once a release is tagged, prebuilt binaries (linux/darwin, amd64/arm64) and a checksums file will be attached to each GitHub release. A Homebrew formula and a krew plugin (kubectl agentmoat) are planned to follow.

What it does

   ┌──────────┐      ┌──────────┐      ┌──────────┐      ┌────────────┐
   │   scan   │ ───> │   plan   │ ───> │  apply   │ ───> │  rollback  │
   │ScanReport│      │ Migration│      │  Apply   │      │  Rollback  │
   │   (RO)   │      │   Plan   │      │  Result  │      │   Result   │
   └──────────┘      └──────────┘      └──────────┘      └────────────┘
       cluster      pure function      strategic-merge      reverse, also
       client-go    over a Scan        patch, default        default dry-run
                    Report             dry-run
  • scan is strictly read-only. It enumerates every Pod, Deployment, StatefulSet, DaemonSet, Job, and CronJob across the selected namespaces, classifies each one against the built-in rules, and emits a versioned ScanReport.
  • plan is a pure function from a ScanReport to a MigrationPlan. Same scan in, same plan and same SHA-256 planHash out. The plan orders steps by risk (stateless first, no host-network first, etc.) and excludes any workload classified incompatible.
  • apply is the only stage that mutates the cluster. It patches each pod template with runtimeClassName and the matching runtime=gvisor:NoSchedule toleration, stamps the namespace with agentmoat.io/plan-hash, emits a Kubernetes Event for each step it actually mutates, and appends an audit line for every step (dry-run included) to ~/.agentmoat/audit.jsonl. It defaults to --dry-run=true. Re-running an applied plan is detected through the namespace annotation: every step is reported as already-applied and the command exits 0.
  • rollback walks the same plan in reverse, removes runtimeClassName, and clears the namespace annotation. It deliberately leaves the toleration in place (a toleration without a matching taint is harmless, and removing a specific toleration by JSON Patch index is fragile).
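The "same scan in, same plan and same planHash out" property can be illustrated with a small Go sketch. This is not agentmoat's implementation: the Step type, the tie-break order, and the choice to hash the JSON encoding of the sorted steps are assumptions made for illustration. Only the riskScore field name and the use of SHA-256 come from this README.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// Step is a hypothetical, trimmed-down plan step (not the real PlanStep type).
type Step struct {
	Kind      string `json:"kind"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
	RiskScore int    `json:"riskScore"`
}

// orderSteps returns a sorted copy: lowest risk first, then namespace, name,
// and kind as tie-breakers, so the result never depends on input order.
func orderSteps(in []Step) []Step {
	out := append([]Step(nil), in...)
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.RiskScore != b.RiskScore {
			return a.RiskScore < b.RiskScore
		}
		if a.Namespace != b.Namespace {
			return a.Namespace < b.Namespace
		}
		if a.Name != b.Name {
			return a.Name < b.Name
		}
		return a.Kind < b.Kind
	})
	return out
}

// planHash hashes the JSON encoding of the ordered steps with SHA-256.
// Struct fields are encoded in declaration order, so the bytes are stable.
func planHash(in []Step) string {
	b, err := json.Marshal(orderSteps(in))
	if err != nil {
		panic(err) // not expected for this plain struct
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	a := []Step{
		{Kind: "StatefulSet", Namespace: "default", Name: "db", RiskScore: 40},
		{Kind: "Deployment", Namespace: "default", Name: "web", RiskScore: 10},
	}
	b := []Step{a[1], a[0]} // same steps, different input order
	fmt.Println(planHash(a) == planHash(b))
	fmt.Println(orderSteps(a)[0].Name)
}
```

Sorting before hashing is what makes the hash independent of the order in which the API server happens to return workloads.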

The manual alternative is a sequence of kubectl get to enumerate, hand classification against the gVisor docs, per-controller kubectl patch for every Deployment/StatefulSet/DaemonSet, and a separate audit trail you build yourself. agentmoat does the classification, orders the mutations, makes them idempotent, records them, and undoes them.
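As a rough illustration of how a namespace annotation can make re-runs safe, the sketch below compares a stored hash with the hash of the plan being applied. The helper names alreadyApplied and withPlanHash are hypothetical, and the logic is an assumption about the general technique rather than agentmoat's code. Only the annotation key agentmoat.io/plan-hash comes from this README.

```go
package main

import "fmt"

// planHashAnnotation is the annotation key named in this README.
const planHashAnnotation = "agentmoat.io/plan-hash"

// alreadyApplied reports whether the namespace annotations already record
// this plan hash (hypothetical helper).
func alreadyApplied(annotations map[string]string, planHash string) bool {
	return planHash != "" && annotations[planHashAnnotation] == planHash
}

// withPlanHash returns a copy of the annotations with the plan hash set.
// An empty planHash removes the annotation, as a rollback would.
func withPlanHash(annotations map[string]string, planHash string) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	if planHash == "" {
		delete(out, planHashAnnotation)
	} else {
		out[planHashAnnotation] = planHash
	}
	return out
}

func main() {
	const hash = "9f4b1c2d" // shortened for the example
	ns := map[string]string{"team": "edge"}

	fmt.Println(alreadyApplied(ns, hash)) // before apply
	ns = withPlanHash(ns, hash)
	fmt.Println(alreadyApplied(ns, hash)) // after apply
	ns = withPlanHash(ns, "")
	fmt.Println(alreadyApplied(ns, hash)) // after rollback
}
```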

Compatibility checks

14 stable rule IDs across three severities. Rule IDs are a public surface: they appear in --output json, in --rules overrides, and in the docs/compatibility-checklist.md table. The rule implementations live in pkg/classifier/builtin_rules.go.

| Rule ID | Severity | What it inspects |
| --- | --- | --- |
| raw-socket | error | CAP_NET_RAW, or agentmoat.io/needs-raw-socket=true |
| host-network | error | pod.spec.hostNetwork=true |
| host-pid | error | pod.spec.hostPID=true |
| host-ipc | error | pod.spec.hostIPC=true |
| privileged | error | Any container with securityContext.privileged=true |
| ebpf | error | Image hint (cilium, tetragon, falco) or CAP_BPF |
| kvm-nested | error | hostPath mount of /dev/kvm |
| host-path-mount | warn | Any hostPath volume |
| gpu-passthrough | warn | nvidia.com/gpu resource request or limit |
| fuse-mount | warn | CSI driver name containing fuse, or AGENTMOAT_USES_FUSE=true |
| io-uring | warn | Annotation agentmoat.io/uses-iouring=true |
| perf-events | warn | CAP_PERFMON or CAP_SYS_ADMIN |
| network-throughput | info | Image hint: nginx, envoy, haproxy, traefik (expect 20-40% overhead) |
| syscall-heavy | info | Image hint: redis, memcached (expect higher latency) |

Any error rule firing makes the workload incompatible and excludes it from the plan. warn makes it review; the planner only includes review workloads when --include-review is passed. info is purely advisory and never blocks.
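The severity rules in the paragraph above reduce to a small fold over the rules that fired. The sketch below is one way to express it, not the classifier in pkg/classifier. The function names are hypothetical, and "compatible" is an assumed label for the case where no error or warn rule fires.

```go
package main

import "fmt"

// verdict folds the severities of all fired rules into one class:
// any "error" wins, otherwise any "warn" gives "review".
// "compatible" is an assumed name for the remaining case.
func verdict(severities []string) string {
	result := "compatible"
	for _, s := range severities {
		switch s {
		case "error":
			return "incompatible"
		case "warn":
			result = "review"
		}
		// "info" is advisory and never changes the verdict.
	}
	return result
}

// inPlan reports whether a workload with this verdict enters the plan.
// includeReview mirrors the --include-review flag.
func inPlan(v string, includeReview bool) bool {
	switch v {
	case "compatible":
		return true
	case "review":
		return includeReview
	default:
		return false
	}
}

func main() {
	v := verdict([]string{"info", "warn"})
	fmt.Println(v, inPlan(v, false), inPlan(v, true))
}
```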

Override severities (or add rules) without recompiling via --rules <file.yaml>. See docs/compatibility-checklist.md.

Sample output

Shapes come straight from internal/schema/types.go; elided fields are marked with "...".

agentmoat scan --output json (one WorkloadResult from .spec.workloads):

{
  "kind": "Deployment",
  "namespace": "edge",
  "name": "frontdoor",
  "compatibility": "review",
  "reasons": [
    {
      "ruleId": "network-throughput",
      "severity": "info",
      "description": "Workload appears network-throughput-bound (image hint: nginx, envoy, haproxy, traefik); expect ~20-40% throughput overhead under gVisor's sandbox network stack.",
      "remediationUrl": "https://gvisor.dev/docs/architecture_guide/performance/"
    }
  ],
  "recommendation": "Benchmark under gVisor before opting in; consider host-network alternatives if throughput-critical.",
  "overhead": "Network throughput: 20-40%"
}
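For consumers that prefer Go over jq, a WorkloadResult like the one above can be decoded with encoding/json. The structs below mirror only the fields visible in the sample; the authoritative definitions are in internal/schema/types.go and may differ. The helper names parseWorkload and ruleIDs are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Reason mirrors one entry of reasons[] as shown in the sample.
type Reason struct {
	RuleID         string `json:"ruleId"`
	Severity       string `json:"severity"`
	Description    string `json:"description"`
	RemediationURL string `json:"remediationUrl"`
}

// WorkloadResult mirrors the sample above (assumed shape, fields may be missing).
type WorkloadResult struct {
	Kind           string   `json:"kind"`
	Namespace      string   `json:"namespace"`
	Name           string   `json:"name"`
	Compatibility  string   `json:"compatibility"`
	Reasons        []Reason `json:"reasons"`
	Recommendation string   `json:"recommendation"`
	Overhead       string   `json:"overhead"`
}

// parseWorkload decodes a single WorkloadResult from JSON.
func parseWorkload(data []byte) (WorkloadResult, error) {
	var w WorkloadResult
	err := json.Unmarshal(data, &w)
	return w, err
}

// ruleIDs returns the IDs of the fired rules that have the given severity.
func ruleIDs(w WorkloadResult, severity string) []string {
	var ids []string
	for _, r := range w.Reasons {
		if r.Severity == severity {
			ids = append(ids, r.RuleID)
		}
	}
	return ids
}

// sample is a shortened copy of the scan output shown above.
const sample = `{
  "kind": "Deployment",
  "namespace": "edge",
  "name": "frontdoor",
  "compatibility": "review",
  "reasons": [
    {"ruleId": "network-throughput", "severity": "info",
     "remediationUrl": "https://gvisor.dev/docs/architecture_guide/performance/"}
  ],
  "overhead": "Network throughput: 20-40%"
}`

func main() {
	w, err := parseWorkload([]byte(sample))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%s/%s %s: %s %v\n", w.Namespace, w.Name, w.Kind, w.Compatibility, ruleIDs(w, "info"))
}
```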

agentmoat plan --output json (one PlanStep and the top-level planHash):

{
  "apiVersion": "agentmoat.io/v1alpha1",
  "kind": "MigrationPlan",
  "metadata": {
    "generatedAt": "2026-05-22T17:14:03Z",
    "planHash": "9f4b1c2d8a3e6f5b7c1e0d2a4f6b8e3c5d7a9f1b2e4d6a8c0f1b3e5d7a9c2e4f"
  },
  "spec": {
    "summary": {"total": 3, "included": 2, "excluded": 1},
    "options": {"runtimeClassName": "gvisor"},
    "steps": [
      {
        "order": 1,
        "target": {"kind": "Deployment", "namespace": "default", "name": "web"},
        "action": "set-runtime-class",
        "runtimeClassName": "gvisor",
        "addToleration": true,
        "waitFor": "Ready",
        "riskScore": 10,
        "notes": "Stateless Deployment fronted by a Service; safe to roll first."
      }
    ],
    "excluded": [...]
  }
}

agentmoat apply --output json (one StepResult and the apply envelope):

{
  "apiVersion": "agentmoat.io/v1alpha1",
  "kind": "ApplyResult",
  "metadata": {
    "planHash": "9f4b1c2d8a3e6f5b7c1e0d2a4f6b8e3c5d7a9f1b2e4d6a8c0f1b3e5d7a9c2e4f",
    "dryRun": false
  },
  "spec": {
    "summary": {"total": 2, "applied": 2, "alreadyApplied": 0, "skipped": 0, "failed": 0},
    "steps": [
      {
        "order": 1,
        "target": {"kind": "Deployment", "namespace": "default", "name": "web"},
        "status": "applied",
        "patch": "{\"spec\":{\"template\":{\"spec\":{\"runtimeClassName\":\"gvisor\",\"tolerations\":[{\"key\":\"runtime\",\"operator\":\"Equal\",\"value\":\"gvisor\",\"effect\":\"NoSchedule\"}]}}}}"
      }
    ]
  }
}
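The patch string in the StepResult above is an ordinary JSON document, so it can be reproduced with plain structs. The sketch below is a reconstruction under assumptions, not the applier's code: the type names and buildPatch are hypothetical. Only the field names and values are taken from the sample.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toleration lists fields in the same order as the sample patch.
type toleration struct {
	Key      string `json:"key"`
	Operator string `json:"operator"`
	Value    string `json:"value"`
	Effect   string `json:"effect"`
}

type podSpecPatch struct {
	RuntimeClassName string       `json:"runtimeClassName"`
	Tolerations      []toleration `json:"tolerations"`
}

type templatePatch struct {
	Spec podSpecPatch `json:"spec"`
}

type workloadSpecPatch struct {
	Template templatePatch `json:"template"`
}

// workloadPatch targets spec.template.spec, the pod template of a
// Deployment, StatefulSet, DaemonSet, or Job.
type workloadPatch struct {
	Spec workloadSpecPatch `json:"spec"`
}

// buildPatch renders the patch body for the given RuntimeClass name.
func buildPatch(runtimeClass string) (string, error) {
	p := workloadPatch{Spec: workloadSpecPatch{Template: templatePatch{Spec: podSpecPatch{
		RuntimeClassName: runtimeClass,
		Tolerations: []toleration{{
			Key: "runtime", Operator: "Equal", Value: "gvisor", Effect: "NoSchedule",
		}},
	}}}}
	b, err := json.Marshal(p)
	return string(b), err
}

func main() {
	patch, err := buildPatch("gvisor")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(patch)
}
```

Note that this shape fits controllers whose pod template sits at spec.template. In the Kubernetes API a CronJob nests it one level deeper, under spec.jobTemplate.spec.template, and a bare Pod carries the fields directly under spec.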

Operational guarantees

  • Read-only by default. scan and plan never mutate. They are safe to run against production from a CI job or a read-only kubeconfig.

  • Dry-run by default. apply and rollback default to --dry-run=true: the patches are computed and surfaced in the StepResult.patch field, but nothing is sent to the API server. Mutating requires explicit --dry-run=false.

  • Idempotent. Every mutating apply (--dry-run=false) writes the plan hash to the affected namespace as agentmoat.io/plan-hash. Re-running the same plan against the same cluster reports every step as already-applied and exits 0.

  • Auditable. Every step appends one JSON line to ~/.agentmoat/audit.jsonl (disable with --no-audit), including dry-run steps, which carry dryRun: true. Each step that actually mutates also emits one Kubernetes Event on the patched object (disable with --no-events).

  • Deterministic exit codes. CI scripts can branch on them:

    | Code | Meaning |
    | --- | --- |
    | 0 | Success. Cluster matches requested state. |
    | 1 | Generic error (kubeconfig, network, malformed plan, etc). |
    | 2 | scan / explain namespace / explain workload: at least one incompatible workload. |
    | 3 | apply or rollback: partial outcome; idempotent re-run is safe. |
    | 4 | verify: runtimeClassName mismatch and/or in-pod probe did not find gVisor. |

    Full table in docs/exit-codes.md.
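A CI wrapper written in Go could branch on these codes roughly as follows. This is a minimal sketch: meaning and exitCode are hypothetical helpers, the labels are paraphrases of the table above, and the example assumes an agentmoat binary on PATH.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// meaning maps the exit codes documented above to a short label.
func meaning(code int) string {
	switch code {
	case 0:
		return "success"
	case 1:
		return "generic error"
	case 2:
		return "incompatible workload found"
	case 3:
		return "partial apply or rollback"
	case 4:
		return "verify mismatch"
	default:
		return "unknown"
	}
}

// exitCode runs a command and returns its exit code. The error is non-nil
// only when the process could not be started or did not exit normally.
func exitCode(name string, args ...string) (int, error) {
	err := exec.Command(name, args...).Run()
	if err == nil {
		return 0, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	return -1, err
}

func main() {
	code, err := exitCode("agentmoat", "scan", "--output", "json")
	if err != nil {
		fmt.Println("could not run agentmoat:", err)
		return
	}
	fmt.Printf("exit %d: %s\n", code, meaning(code))
}
```

Separating "the process ran and returned a code" from "the process could not be started" keeps a missing binary from being mistaken for exit code 1.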

EKS via Packer

packer/eks-gvisor-al2023.pkr.hcl builds an EKS-optimized AL2023 AMI with runsc and the containerd v2 shim preinstalled, the containerd drop-in pre-staged, and the systrap platform pinned (KVM is unavailable on EKS instances).

cd packer
packer init .
packer validate .
packer build .

Wire the resulting AMI into a self-managed node group or a Karpenter EC2NodeClass, label the nodes runtime=gvisor, taint them runtime=gvisor:NoSchedule, and apply deploy/runtimeclass.yaml. The end-to-end recipe (CFN/Terraform snippets, IAM, Karpenter wiring) is tracked in docs/eks-deployment.md.

Local kind

make kind-up builds a custom kind worker image (kind/Dockerfile.gvisor-node) that ships /usr/local/bin/runsc and the containerd v2 shim. The cluster topology (kind/cluster.yaml) is a stock control plane plus one worker labelled runtime=gvisor; the RuntimeClass in test/e2e/manifests/runtimeclass.yaml uses handler: gvisor so pods carrying runtimeClassName: gvisor really execute under runsc. Honors CLUSTER_NAME and KEEP_CLUSTER=1 for iteration.

make kind-up          # idempotent; rebuilds the image only when missing
make e2e              # full scan -> plan -> apply -> rollback against the cluster
KEEP_CLUSTER=1 make e2e
make kind-down

Cheatsheet

Commands

| Command | Purpose |
| --- | --- |
| agentmoat scan | Enumerate workloads and classify gVisor compatibility. RO. |
| agentmoat plan | Produce a deterministic MigrationPlan from a scan. RO. |
| agentmoat apply | Patch workloads per the plan. Default dry-run. Idempotent. |
| agentmoat rollback | Reverse a previously applied plan. Default dry-run. |
| agentmoat verify | Confirm live pods match the plan's runtimeClassName. RO. |
| agentmoat explain | Embedded docs viewer; explain namespace / explain workload for deep scans. |
| agentmoat version | Print binary version and git SHA. |

Global flags

| Flag | Default | Purpose |
| --- | --- | --- |
| --output / -o | table | table, json, or yaml. json/yaml follow agentmoat.io/v1alpha1. |
| --kubeconfig | $KUBECONFIG | Path to kubeconfig. |
| --context | current-context | kubeconfig context to use. |
| --namespace / -n | (all) | Repeatable. Default: scan every namespace. |
| --all-namespaces / -A | true if -n unset | Explicit all-namespaces flag. |
| --selector / -l | (none) | Kubernetes label selector applied to every list call. |
| --include-system | false | Include kube-system and other kube-* namespaces. |
| --rules | (none) | Path to YAML overriding rule severities or adding rules. |
| --explain | false | Inline educational notes in supported output formats. |

Per-command flags

| Flag | Command | Default | Purpose |
| --- | --- | --- | --- |
| --scan | plan | (inline) | Read a stored ScanReport from disk instead of scanning. |
| --include-review | plan | false | Also include review-class workloads in the plan. |
| --runtime-class | plan | gvisor | RuntimeClass name to patch onto migrated workloads. |
| --plan | apply/rollback | required | Path to a MigrationPlan JSON/YAML. |
| --dry-run | apply/rollback | true | Compute patches but do not mutate the cluster. |
| --no-events | apply/rollback | false | Do not emit Kubernetes Events per mutation. |
| --no-audit | apply/rollback | false | Do not append to ~/.agentmoat/audit.jsonl. |
| --in-pod-probe | verify | false | Exec into a running pod and check dmesg/cmdline for gVisor markers. |
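The idea behind an in-pod probe can be sketched in a few lines: capture dmesg output from inside the container and look for a gVisor marker. The exact markers agentmoat checks are not listed in this README, so the substring match below is an assumption, and looksLikeGVisor is a hypothetical helper. The sample line reflects the "Starting gVisor..." banner that runsc sandboxes typically print in dmesg.

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeGVisor reports whether captured dmesg output mentions gVisor.
// A case-insensitive substring match is an assumed, deliberately loose check.
func looksLikeGVisor(dmesg string) bool {
	return strings.Contains(strings.ToLower(dmesg), "gvisor")
}

func main() {
	sandboxed := "[    0.000000] Starting gVisor...\n[    0.354495] Daemonizing children..."
	host := "[    0.000000] Linux version 6.1.0 (builder@host) #1 SMP"

	fmt.Println(looksLikeGVisor(sandboxed))
	fmt.Println(looksLikeGVisor(host))
}
```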

Output formats

| Format | Use case |
| --- | --- |
| table | Default human-readable view; per-workload row with verdict and top reason. |
| json | Versioned, stable schema (agentmoat.io/v1alpha1). Pipe into jq or store as evidence. |
| yaml | Byte-identical to JSON after canonicalisation; convenient for review/diff. |

FAQ

Why not just kubectl patch everything? You can. agentmoat is the same operation written down: it classifies (so you do not silently patch incompatible workloads), it orders by risk (stateless and no-host-network first), it is idempotent (the namespace annotation makes re-runs safe), it keeps an audit trail, and it has a one-command rollback.

Is it safe to run against production? scan and plan are strictly read-only. apply and rollback default to --dry-run=true and surface the exact strategic-merge patch in StepResult.patch before any mutation. A read-only kubeconfig is sufficient to run scan and plan. Ready-to-bind RBAC for both modes ships under deploy/: clusterrole-readonly.yaml (scan / plan / verify) and clusterrole-apply.yaml (apply / rollback).

What about workloads that need raw sockets, eBPF, or GPU passthrough? Raw-socket and eBPF workloads trip error rules, so the classifier marks them incompatible and the planner excludes them. GPU passthrough is a warn rule, so those workloads are classified review and enter the plan only when --include-review is passed. In every case the reasons[] field on each WorkloadResult names the rule that fired and links to the relevant gVisor doc, so the recommendation is concrete: either leave the workload on runc (and put it on a non-gVisor node pool) or adopt the gVisor option that supports it (--net-raw, nvproxy).

Does agentmoat install anything in-cluster? No CRDs, no webhooks, no controllers. It patches pod templates and reads/writes one namespace annotation (agentmoat.io/plan-hash). The only in-cluster prerequisite is a RuntimeClass named gvisor (or whatever --runtime-class was passed) that points at a real runsc-shipping node.

Why a custom Packer AMI on EKS? Bottlerocket does not ship runsc, managed node groups have no gVisor switch, and AWS does not officially support gVisor. The path of least resistance is to bring your own AL2023 node image with runsc baked in. packer/eks-gvisor-al2023.pkr.hcl is that image.

Why systrap, not KVM? On EKS the instance kernels do not expose /dev/kvm to user-space; on macOS-hosted kind, KVM is unavailable too. The systrap platform works in both environments. Pinned in kind/runsc.toml and packer/files/runsc.toml.

Status and roadmap

Phase 0 (foundation), Phase 1 (read-only scan + classifier), and Phase 2 (planner + applier + rollback, with idempotency and audit) are committed. agentmoat scan, plan, apply, and rollback are wired and exercised end-to-end against a real gVisor kind cluster.

Roadmap:

  • Phase 3: agentmoat verify (pod runtimeClassName check; optional --in-pod-probe for in-container confirmation) and agentmoat explain (embedded docs viewer). Both shipped.
  • Phase 5+: EKS end-to-end recipe (CloudFormation/Terraform/Karpenter), additional Packer variants.

Where to go next

  • Architecture: the Go library at the core; CLI as a thin shell.
  • gVisor 101: Sentry, Gofer, platforms, and where the overhead lives.
  • RuntimeClass 101: one-page intro to the RuntimeClass API.
  • Threat model: what gVisor stops that runc does not, with CVE references.
  • Compatibility checklist: the full rule catalog and --rules override schema.
  • Exit codes: the deterministic exit codes by command.
  • EKS deployment: the Packer + EKS recipe (stub today; tracked for Phase 5).
  • Kind quickstart: bring up a local cluster with gVisor preinstalled (doc is a stub today; make kind-up is the working path).

License

Apache License 2.0. See LICENSE.
