One agent. The whole reliability loop.
Capnode installs a single Go agent into your cluster and runs the entire operational surface from it — right-sizing workloads, cutting idle spend, detecting 25+ failure modes, healing incidents in milliseconds, and scanning your security posture. Aria, the conversational layer, explains every move in plain English.
Workloads sized to real demand, continuously.
Capnode watches actual CPU and memory behavior across every namespace and reconciles requests, limits, and replica counts to match. No more padding requests "just in case," no more OOMKills from limits set too low. The agent does the math so your team doesn't guess.
- Requests & limits tuned to observed peaks, not napkin estimates.
- Replica guidance that catches both over-provisioning and HPA thrash.
- Every namespace covered — including system workloads like CoreDNS and kube-proxy.
- Recommendations you can trust — backed by the same data Aria cites in chat.
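To make "tuned to observed peaks" concrete, here is a minimal sketch of one way a request and limit could be derived from usage samples. It is not Capnode's actual algorithm: the `percentile` and `recommendMemory` helpers, the 90th-percentile request, the 25% headroom, and the sample data are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (1 <= p <= 100) of
// samples. It sorts a copy, so the caller's slice keeps its order.
func percentile(samples []int, p int) int {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]int(nil), samples...)
	sort.Ints(sorted)
	rank := (p*len(sorted) + 99) / 100 // ceil(p/100 * n) in integer math
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// Recommendation is a suggested memory request and limit, in MiB.
type Recommendation struct {
	RequestMiB int
	LimitMiB   int
}

// recommendMemory sets the request to the requestPct-th percentile of
// observed usage and the limit to the observed peak plus headroomPct percent.
// Both policies are assumptions made for this sketch.
func recommendMemory(usageMiB []int, requestPct, headroomPct int) Recommendation {
	if len(usageMiB) == 0 {
		return Recommendation{}
	}
	peak := percentile(usageMiB, 100)
	return Recommendation{
		RequestMiB: percentile(usageMiB, requestPct),
		LimitMiB:   (peak*(100+headroomPct) + 99) / 100, // round up
	}
}

func main() {
	// Made-up usage samples in MiB, including one spike under load.
	usage := []int{180, 210, 195, 240, 400, 230, 220, 205, 260, 250}
	rec := recommendMemory(usage, 90, 25)
	fmt.Printf("request=%dMi limit=%dMi\n", rec.RequestMiB, rec.LimitMiB)
	// prints "request=260Mi limit=500Mi"
}
```

Sizing the request to a high percentile and the limit to the peak plus headroom keeps the scheduler's view close to typical demand while leaving room for the spike that would otherwise trigger an OOMKill.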
Idle dev resources dissolve. Velocity doesn't.
Non-production environments burn money sitting idle overnight and on weekends. Capnode dissolves the resources nobody is using — pods, load balancers, even nodes — and brings them back automatically on your next push. The savings are real; the developer experience is unchanged.
- Scale to zero at rest — idle dev and staging workloads stop costing you money.
- Reclaim orphaned spend — load balancers and nodes that exist only to serve nothing.
- Restore on first push — environments wake up in seconds when work resumes.
- Production stays untouched — cost actions are scoped to the environments you choose.
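The scoping rule above can be pictured as a small decision function. This is a sketch under stated assumptions, not the agent's real logic: the `Workload` and `Policy` types, the two-hour idle threshold, and the environment names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Workload is a hypothetical, simplified view of a deployment.
type Workload struct {
	Name         string
	Environment  string // for example "dev", "staging", "prod"
	Replicas     int
	LastActivity time.Time // last push or request seen for this workload
}

// Policy names the environments that opted in to cost actions and how long
// a workload must be quiet before it counts as idle.
type Policy struct {
	Environments map[string]bool
	IdleAfter    time.Duration
}

// shouldScaleToZero reports whether w is in scope, still running, and idle
// for at least IdleAfter at time now.
func (p Policy) shouldScaleToZero(w Workload, now time.Time) bool {
	if !p.Environments[w.Environment] {
		return false // out of scope, for example production
	}
	if w.Replicas == 0 {
		return false // already at rest
	}
	return now.Sub(w.LastActivity) >= p.IdleAfter
}

// sample returns made-up data: a policy, three workloads, and a fixed "now".
func sample() (Policy, []Workload, time.Time) {
	now := time.Date(2024, time.January, 6, 3, 0, 0, 0, time.UTC)
	policy := Policy{
		Environments: map[string]bool{"dev": true, "staging": true},
		IdleAfter:    2 * time.Hour,
	}
	workloads := []Workload{
		{Name: "web-dev", Environment: "dev", Replicas: 2, LastActivity: now.Add(-9 * time.Hour)},
		{Name: "api-staging", Environment: "staging", Replicas: 1, LastActivity: now.Add(-20 * time.Minute)},
		{Name: "web-prod", Environment: "prod", Replicas: 6, LastActivity: now.Add(-9 * time.Hour)},
	}
	return policy, workloads, now
}

func main() {
	policy, workloads, now := sample()
	for _, w := range workloads {
		fmt.Printf("%s: scale to zero = %t\n", w.Name, policy.shouldScaleToZero(w, now))
	}
}
```

In this sketch only `web-dev` qualifies: `api-staging` was active too recently, and `web-prod` is outside the opted-in environments no matter how long it has been quiet.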
25+ failure modes, caught before alerts fire.
Capnode reads live cluster state and events directly from the agent, so it recognizes failure patterns the moment they emerge — no Prometheus, no scrape configs, no PromQL to maintain. From crash loops to cert expiry to node pressure, the signal is native.
- No Prometheus required — detection works on a bare cluster out of the box.
- Workload, node, and network failure classes covered, not just pod restarts.
- Deduplicated & grouped — one root cause, not a wall of duplicate alerts.
- Ahead of your users — patterns are flagged before they cascade into an outage.
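As a rough illustration of "deduplicated and grouped", the sketch below folds repeated signals into one incident per workload and failure reason. The `Signal` and `Incident` types and the grouping key are assumptions for this example; grouping by namespace, workload, and reason is a much simpler stand-in for real root-cause correlation.

```go
package main

import "fmt"

// Signal is a hypothetical raw observation from the cluster, one per pod.
type Signal struct {
	Namespace string
	Workload  string
	Pod       string
	Reason    string // for example "OOMKilled" or "ImagePullBackOff"
}

// Incident groups every signal that shares a namespace, workload, and reason.
type Incident struct {
	Namespace string
	Workload  string
	Reason    string
	Pods      []string
}

// groupKey identifies one incident. Structs of strings are comparable, so
// the type can be used directly as a map key.
type groupKey struct {
	Namespace, Workload, Reason string
}

// groupSignals collapses signals into incidents, keeping first-seen order.
func groupSignals(signals []Signal) []Incident {
	index := make(map[groupKey]int)
	var incidents []Incident
	for _, s := range signals {
		key := groupKey{s.Namespace, s.Workload, s.Reason}
		i, seen := index[key]
		if !seen {
			i = len(incidents)
			index[key] = i
			incidents = append(incidents, Incident{
				Namespace: s.Namespace,
				Workload:  s.Workload,
				Reason:    s.Reason,
			})
		}
		incidents[i].Pods = append(incidents[i].Pods, s.Pod)
	}
	return incidents
}

func main() {
	// Made-up signals: three pods of one workload failing the same way,
	// plus one unrelated failure.
	signals := []Signal{
		{"shop", "checkout-api", "checkout-api-1", "OOMKilled"},
		{"shop", "checkout-api", "checkout-api-2", "OOMKilled"},
		{"shop", "cart", "cart-1", "ImagePullBackOff"},
		{"shop", "checkout-api", "checkout-api-3", "OOMKilled"},
	}
	for _, inc := range groupSignals(signals) {
		fmt.Printf("%s/%s %s: %d pod(s)\n", inc.Namespace, inc.Workload, inc.Reason, len(inc.Pods))
	}
}
```

Four signals become two incidents here, which is the difference between one actionable item and a wall of near-identical alerts.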
Heals in milliseconds. Asks before anything risky.
A memory-first deterministic engine resolves recurring failures like OOMKills and crash loops almost instantly — it recognizes the pattern and applies the known-good fix. Safety is tiered: SAFE_AUTO actions run on their own, while ALWAYS_APPROVAL actions wait for a human click. True human-in-the-loop, by design.
- Deterministic, not a coin flip — known patterns get known fixes, every time.
- Two safety tiers — safe actions auto-run, risky actions require approval.
- Learns each resolution — the next occurrence is matched and healed faster.
- Never touches its own namespace — Capnode is structurally barred from healing itself.
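The bullets above describe three rules: known patterns map to known fixes, safety is tiered, and the agent's own namespace is off limits. A minimal sketch of how those rules could fit together follows. Only the tier names SAFE_AUTO and ALWAYS_APPROVAL come from this page; the `Engine` type, the fix names, the tier assigned to each fix, and the `capnode-system` namespace are hypothetical.

```go
package main

import "fmt"

// Tier is the safety class of a fix. The zero value is the cautious one,
// so an unclassified fix never runs unattended.
type Tier int

const (
	AlwaysApproval Tier = iota // ALWAYS_APPROVAL: wait for a human
	SafeAuto                   // SAFE_AUTO: may run on its own
)

// Fix is a hypothetical remediation record.
type Fix struct {
	Name string
	Tier Tier
}

// Decision is what the gate tells the caller to do.
type Decision string

const (
	Run           Decision = "run"
	AwaitApproval Decision = "await-approval"
	Refuse        Decision = "refuse"
	NoKnownFix    Decision = "no-known-fix"
)

// Engine remembers which fix resolved each failure pattern.
type Engine struct {
	OwnNamespace string
	memory       map[string]Fix
}

// NewEngine returns an engine that refuses to act on ownNamespace.
func NewEngine(ownNamespace string) *Engine {
	return &Engine{OwnNamespace: ownNamespace, memory: make(map[string]Fix)}
}

// Learn records the fix that resolved a pattern, for later lookups.
func (e *Engine) Learn(pattern string, fix Fix) {
	e.memory[pattern] = fix
}

// Decide checks the namespace first, then memory, then the safety tier.
func (e *Engine) Decide(pattern, namespace string, approved bool) (Fix, Decision) {
	if namespace == e.OwnNamespace {
		return Fix{}, Refuse // checked before anything else, approval or not
	}
	fix, known := e.memory[pattern]
	if !known {
		return Fix{}, NoKnownFix
	}
	if fix.Tier == SafeAuto || approved {
		return fix, Run
	}
	return fix, AwaitApproval
}

func main() {
	e := NewEngine("capnode-system") // hypothetical namespace name
	e.Learn("OOMKilled", Fix{Name: "raise-memory-limit", Tier: SafeAuto})
	e.Learn("NodeNotReady", Fix{Name: "cordon-and-drain"}) // zero-value tier

	_, d := e.Decide("OOMKilled", "shop", false)
	fmt.Println("OOMKilled in shop:", d)
	_, d = e.Decide("NodeNotReady", "shop", false)
	fmt.Println("NodeNotReady in shop:", d)
	_, d = e.Decide("OOMKilled", "capnode-system", true)
	fmt.Println("OOMKilled in own namespace:", d)
}
```

Two design choices matter in this sketch. The namespace check runs before the lookup, so no stored fix or approval can override it, and the cautious tier is the zero value, so forgetting to classify a fix fails safe.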
Least-privilege by construction. Blast radius contained.
The Capnode agent runs with RBAC-scoped, least-privilege permissions: it can only do what you grant it. It continuously scans your cluster's security posture for risk, and by design the mutation gate refuses any attempt by the agent to change its own namespace. Safety here is structural, not a setting you can forget to flip.
- Scoped RBAC — the agent operates inside permissions you explicitly grant.
- Posture scanning — surfaces risky configurations across your workloads.
- Self-mutation refused — Capnode's own namespace is hard-blocked at the gate.
- Contained blast radius — risky actions are gated behind explicit human approval.
- RBAC scope — verbs limited to the actions you authorize.
- Mutation gate — risky changes require human approval before they run.
- Self-protection — the agent's own namespace is refused, no exceptions.
Ask Aria, in plain English.
Aria reads your live cluster, answers in language your whole team understands, and shows its work — every answer is grounded in real evidence and verified before you see it.
checkout-api in prod-eu-1 is in CrashLoopBackOff because it is being OOMKilled. Its memory limit is 256Mi, but the container peaks at roughly 410Mi under load, so the kernel terminates it and Kubernetes restarts it, over and over.
The fix is to raise the memory limit to 512Mi. That is a safe, reversible change. I can apply it now, or you can review it first.
Detect. Diagnose. Remediate. Learn.
Every capability on this page feeds one continuous loop — most of it completing before a human is ever paged.
Detect
The Go agent streams live state and events; Capnode flags an anomaly the instant it appears.
Diagnose
It correlates events, logs, and history to pinpoint root cause — and Aria explains it in plain English.
Remediate
Safe fixes run in milliseconds; risky ones request approval — true human-in-the-loop.
Learn
Each resolution is remembered, so the next occurrence of the same failure heals even faster.
The loop never stops — and with every incident, Capnode gets faster at healing your specific cluster.
Give your cluster an SRE that never sleeps.
Deploy the Capnode agent in minutes. Watch it detect, diagnose, and heal — then let it learn your environment.