Capabilities

One agent. The whole reliability loop.

Capnode installs a single Go agent into your cluster and runs the entire operational surface from it — right-sizing workloads, cutting idle spend, detecting 25+ failure modes, healing incidents in milliseconds, and scanning your security posture. Aria, the conversational layer, explains every move in plain English.

AI Workload Management

Workloads sized to real demand, continuously.

Capnode watches actual CPU and memory behavior across every namespace and reconciles requests, limits, and replica counts to match. No more padding requests "just in case," no more OOMKills from limits set too low. The agent does the math so your team doesn't guess.

  • Requests & limits tuned to observed peaks, not napkin estimates.
  • Replica guidance that catches both over-provisioning and HPA thrash.
  • Every namespace covered — including system workloads like CoreDNS and kube-proxy.
  • Recommendations you can trust — backed by the same data Aria cites in chat.
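
As a rough illustration of sizing from observed usage, the sketch below takes memory samples in MiB, sets the request at a chosen percentile, and sets the limit at the observed peak plus headroom. The function names, the 90th-percentile choice, and the 20% headroom are assumptions made for this example; the page does not describe Capnode's actual algorithm.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Recommendation is a suggested memory request and limit, in MiB.
type Recommendation struct {
	RequestMiB int
	LimitMiB   int
}

// percentile returns the nearest-rank p-th percentile (1..100) of samples.
// It sorts a copy so the caller's slice is left untouched.
func percentile(samples []int, p int) int {
	sorted := append([]int(nil), samples...)
	sort.Ints(sorted)
	rank := (p*len(sorted) + 99) / 100 // ceil(p/100 * n) in integer math
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// Recommend sets the request at the given percentile of observed usage and
// the limit at the observed peak plus headroomPct percent, rounded up.
// Both parameters are hypothetical knobs chosen for this sketch.
func Recommend(samplesMiB []int, requestPct, headroomPct int) (Recommendation, error) {
	if len(samplesMiB) == 0 {
		return Recommendation{}, errors.New("no usage samples")
	}
	if requestPct < 1 || requestPct > 100 || headroomPct < 0 {
		return Recommendation{}, errors.New("invalid percentile or headroom")
	}
	peak := percentile(samplesMiB, 100)
	return Recommendation{
		RequestMiB: percentile(samplesMiB, requestPct),
		LimitMiB:   (peak*(100+headroomPct) + 99) / 100,
	}, nil
}

func main() {
	samples := []int{180, 200, 210, 220, 240, 250, 260, 300, 390, 410}
	rec, err := Recommend(samples, 90, 20)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("request=%dMi limit=%dMi\n", rec.RequestMiB, rec.LimitMiB)
}
```

Integer arithmetic is used throughout so the rounding is predictable; a real implementation would likely also weigh sample age and CPU throttling.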

Cost Optimization

Idle dev resources dissolve. Velocity doesn't.

Non-production environments burn money sitting idle overnight and on weekends. Capnode dissolves the resources nobody is using — pods, load balancers, even nodes — and brings them back automatically on your next push. The savings are real; the developer experience is unchanged.

  • Scale to zero at rest — idle dev and staging workloads stop costing you money.
  • Reclaim orphaned spend — load balancers and nodes that exist only to serve nothing.
  • Restore on first push — environments wake up in seconds when work resumes.
  • Production stays untouched — cost actions are scoped to the environments you choose.

Incident Detection

25+ failure modes, caught before alerts fire.

Capnode reads live cluster state and events directly from the agent, so it recognizes failure patterns the moment they emerge — no Prometheus, no scrape configs, no PromQL to maintain. From crash loops to cert expiry to node pressure, the signal is native.

  • No Prometheus required — detection works on a bare cluster out of the box.
  • Workload, node, and network failure classes covered, not just pod restarts.
  • Deduplicated & grouped — one root cause, not a wall of duplicate alerts.
  • Ahead of your users — patterns are flagged before they cascade into an outage.

Detected failure modes (25+ native):

CrashLoopBackOff, OOMKilled, ImagePullBackOff, PVC Pending, HPA Thrash, DNS Outage, Cert Expiry, NodeNotReady, MemoryPressure, DiskPressure, Ingress 503, CronJob Missed, ConfigMap Drift, NetworkPartition, Readiness Failing, Image GC, Liveness Failing, Pending Unschedulable, Node Pressure, Init Container Failing, … and more
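
To make the detection and grouping idea concrete, here is a small sketch that classifies a few of these modes from container state and collapses per-pod findings into one entry per workload. `CrashLoopBackOff`, `ImagePullBackOff`, `ErrImagePull`, and `OOMKilled` are reason strings Kubernetes itself reports; the `ContainerStatus` and `Finding` types are simplified stand-ins invented for this example, not Capnode's data model.

```go
package main

import "fmt"

// ContainerStatus is a simplified stand-in for the per-container state
// Kubernetes exposes (waiting reason, last termination reason).
type ContainerStatus struct {
	Namespace      string
	Workload       string
	Pod            string
	WaitingReason  string
	LastTermReason string
}

// Finding identifies one failure mode on one workload.
type Finding struct {
	Namespace string
	Workload  string
	Mode      string
}

// Classify maps a status to a failure mode, or "" when nothing matches.
// OOMKilled is checked first so a crash loop caused by memory exhaustion
// is reported under its root cause.
func Classify(s ContainerStatus) string {
	switch {
	case s.LastTermReason == "OOMKilled":
		return "OOMKilled"
	case s.WaitingReason == "CrashLoopBackOff":
		return "CrashLoopBackOff"
	case s.WaitingReason == "ImagePullBackOff" || s.WaitingReason == "ErrImagePull":
		return "ImagePullBackOff"
	}
	return ""
}

// Group collapses per-pod findings into one entry per workload and mode,
// keeping the affected pod names as supporting detail.
func Group(statuses []ContainerStatus) map[Finding][]string {
	groups := make(map[Finding][]string)
	for _, s := range statuses {
		mode := Classify(s)
		if mode == "" {
			continue
		}
		key := Finding{Namespace: s.Namespace, Workload: s.Workload, Mode: mode}
		groups[key] = append(groups[key], s.Pod)
	}
	return groups
}

func main() {
	statuses := []ContainerStatus{
		{Namespace: "prod-eu-1", Workload: "checkout-api", Pod: "checkout-api-a", WaitingReason: "CrashLoopBackOff", LastTermReason: "OOMKilled"},
		{Namespace: "prod-eu-1", Workload: "checkout-api", Pod: "checkout-api-b", WaitingReason: "CrashLoopBackOff", LastTermReason: "OOMKilled"},
		{Namespace: "prod-eu-1", Workload: "cart", Pod: "cart-a"},
	}
	for finding, pods := range Group(statuses) {
		fmt.Printf("%s/%s: %s (%d pods)\n", finding.Namespace, finding.Workload, finding.Mode, len(pods))
	}
}
```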

Autonomous Remediation

Heals in milliseconds. Asks before anything risky.

A memory-first deterministic engine resolves recurring failures like OOMKills and crash loops almost instantly — it recognizes the pattern and applies the known-good fix. Safety is tiered: SAFE_AUTO actions run on their own, while ALWAYS_APPROVAL actions wait for a human click. True human-in-the-loop, by design.

  • Deterministic, not a coin flip — known patterns get known fixes, every time.
  • Two safety tiers — safe actions auto-run, risky actions require approval.
  • Learns each resolution — the next occurrence is matched and healed faster.
  • Never touches its own namespace — Capnode is structurally barred from healing itself.
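
The two tiers could be modeled as a lookup from failure pattern to known fix, followed by a tier check. Only the tier names `SAFE_AUTO` and `ALWAYS_APPROVAL` come from this page; the fix table, the tier assigned to each fix, and the `Decide` function are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Tier is the safety tier attached to a fix.
type Tier string

const (
	SafeAuto       Tier = "SAFE_AUTO"
	AlwaysApproval Tier = "ALWAYS_APPROVAL"
)

// Fix is a known remediation and the tier it runs under.
type Fix struct {
	Action string
	Tier   Tier
}

// ErrUnknownPattern is returned when no known fix matches.
var ErrUnknownPattern = errors.New("no known fix for pattern")

// knownFixes is a hypothetical table; the tier assignments are illustrative.
var knownFixes = map[string]Fix{
	"OOMKilled":        {Action: "raise memory limit", Tier: SafeAuto},
	"CrashLoopBackOff": {Action: "roll back to previous revision", Tier: AlwaysApproval},
}

// Decide returns the fix for a pattern and whether it may run right now.
// The same pattern always yields the same fix; only the approval flag
// changes whether an ALWAYS_APPROVAL action proceeds.
func Decide(pattern string, humanApproved bool) (Fix, bool, error) {
	fix, ok := knownFixes[pattern]
	if !ok {
		return Fix{}, false, ErrUnknownPattern
	}
	switch fix.Tier {
	case SafeAuto:
		return fix, true, nil
	default:
		// ALWAYS_APPROVAL, and any unrecognized tier, waits for a human.
		return fix, humanApproved, nil
	}
}

func main() {
	for _, pattern := range []string{"OOMKilled", "CrashLoopBackOff", "DNS Outage"} {
		fix, run, err := Decide(pattern, false)
		if err != nil {
			fmt.Printf("%s: %v\n", pattern, err)
			continue
		}
		fmt.Printf("%s: %q tier=%s run=%v\n", pattern, fix.Action, fix.Tier, run)
	}
}
```

Treating any unrecognized tier as approval-required keeps the sketch fail-closed.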

AI Security

Least-privilege by construction. Blast radius contained.

The Capnode agent runs with RBAC-scoped, least-privilege permissions, so it can only do what you grant it. It continuously scans your cluster's security posture for risk, and by design the mutation gate refuses any attempt to change Capnode's own namespace. Safety here is structural, not a setting you can forget to flip.

  • Scoped RBAC — the agent operates inside permissions you explicitly grant.
  • Posture scanning — surfaces risky configurations across your workloads.
  • Self-mutation refused — Capnode's own namespace is hard-blocked at the gate.
  • Contained blast radius — risky actions are gated behind explicit human approval.
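
A mutation gate of this kind might look like the sketch below, where the self-namespace check runs first and cannot be overridden by a grant. The `Gate` type, its fields, and the namespace name `capnode-system` are hypothetical; the page does not say where the agent is installed or how the gate is implemented.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// ErrSelfMutation is returned for any change aimed at the agent's own namespace.
	ErrSelfMutation = errors.New("mutation refused: target is the agent's own namespace")
	// ErrNotGranted is returned for namespaces the operator has not granted.
	ErrNotGranted = errors.New("mutation refused: namespace not granted")
)

// Gate decides whether a mutation may proceed.
type Gate struct {
	OwnNamespace string
	Granted      map[string]bool
}

// Check returns nil only when the target namespace is granted and is not
// the agent's own. The self check comes first, so listing the agent's
// namespace in Granted has no effect.
func (g Gate) Check(namespace string) error {
	if namespace == g.OwnNamespace {
		return ErrSelfMutation
	}
	if !g.Granted[namespace] {
		return ErrNotGranted
	}
	return nil
}

func main() {
	gate := Gate{
		OwnNamespace: "capnode-system", // hypothetical name
		Granted:      map[string]bool{"staging": true, "capnode-system": true},
	}
	for _, ns := range []string{"staging", "prod", "capnode-system"} {
		fmt.Printf("%s: %v\n", ns, gate.Check(ns))
	}
}
```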

Conversational AI

Ask Aria, in plain English.

Aria reads your live cluster, answers in language your whole team understands, and shows its work — every answer is grounded in real evidence and verified before you see it.

Why is the checkout-api pod crashing?

checkout-api in prod-eu-1 is in CrashLoopBackOff because it is being OOMKilled. Its memory limit is 256Mi, but the container peaks at roughly 410Mi under load, so the kernel kills the container and Kubernetes restarts it, which repeats the loop.

The fix is to raise the memory limit to 512Mi. That is a safe, reversible change — I can apply it now, or you can review it first.

Event: OOMKilled · Restarts: 6 · Limit 256Mi < peak 410Mi
Answer verified against live cluster evidence
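
The arithmetic in this exchange can be checked with a few lines of code. The sketch below tests that the cited evidence supports an out-of-memory diagnosis and then rounds the observed peak up to the next power of two, which happens to give 512Mi for a 410Mi peak. That rounding rule is an assumption for this example; the page does not state how Aria chooses the new limit, and the `Evidence` type is hypothetical.

```go
package main

import "fmt"

// Evidence mirrors the facts cited under the answer.
type Evidence struct {
	Event    string
	Restarts int
	LimitMiB int
	PeakMiB  int
}

// SupportsOOMDiagnosis reports whether the evidence is consistent with a
// container being killed for exceeding its memory limit.
func SupportsOOMDiagnosis(e Evidence) bool {
	return e.Event == "OOMKilled" && e.Restarts > 0 && e.LimitMiB < e.PeakMiB
}

// NextPowerOfTwoMiB returns the smallest power of two that is >= peakMiB.
func NextPowerOfTwoMiB(peakMiB int) int {
	limit := 1
	for limit < peakMiB {
		limit *= 2
	}
	return limit
}

func main() {
	e := Evidence{Event: "OOMKilled", Restarts: 6, LimitMiB: 256, PeakMiB: 410}
	if SupportsOOMDiagnosis(e) {
		fmt.Printf("suggested limit: %dMi\n", NextPowerOfTwoMiB(e.PeakMiB))
	}
}
```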

The closed loop

Detect. Diagnose. Remediate. Learn.

Every capability on this page feeds one continuous loop — most of it completing before a human is ever paged.

Detect

The Go agent streams live state and events; Capnode flags an anomaly the instant it appears.

Diagnose

It correlates events, logs, and history to pinpoint root cause — and Aria explains it in plain English.

Remediate

Safe fixes run in milliseconds; risky ones request approval — true human-in-the-loop.

Learn

Each resolution is remembered, so the next occurrence of the same failure heals even faster.

The loop never stops — and with every incident, Capnode gets faster at healing your specific cluster.
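
One way to picture the Learn step is a memory of past resolutions that is consulted before any slower diagnosis runs. The sketch below is illustrative only: the `Loop` type, the failure signature string, and the `Diagnoser` callback are hypothetical names, and a plain in-process map stands in for whatever store Capnode actually uses.

```go
package main

import "fmt"

// Diagnoser is the slow path: correlate events, logs, and history to
// propose a fix for a failure signature. ok is false when none is found.
type Diagnoser func(signature string) (fix string, ok bool)

// Loop remembers each resolution so repeat failures skip diagnosis.
type Loop struct {
	memory   map[string]string
	diagnose Diagnoser
}

// NewLoop returns a Loop with empty memory.
func NewLoop(d Diagnoser) *Loop {
	return &Loop{memory: make(map[string]string), diagnose: d}
}

// Handle returns the fix for a signature, whether it came from memory,
// and whether any fix was found at all.
func (l *Loop) Handle(signature string) (string, bool, bool) {
	if remembered, hit := l.memory[signature]; hit {
		return remembered, true, true // memory-first fast path
	}
	fix, ok := l.diagnose(signature)
	if !ok {
		return "", false, false
	}
	l.memory[signature] = fix // Learn: remember the resolution
	return fix, false, true
}

func main() {
	loop := NewLoop(func(signature string) (string, bool) {
		if signature == "checkout-api/OOMKilled" {
			return "raise memory limit to 512Mi", true
		}
		return "", false
	})
	for i := 1; i <= 2; i++ {
		fix, fromMemory, ok := loop.Handle("checkout-api/OOMKilled")
		fmt.Printf("occurrence %d: fix=%q fromMemory=%v ok=%v\n", i, fix, fromMemory, ok)
	}
}
```

The first occurrence pays for diagnosis; later ones are answered from memory.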

Components: Go agent · Server · React UI & Aria chat

Give your cluster an SRE that never sleeps.

Deploy the Capnode agent in minutes. Watch it detect, diagnose, and heal — then let it learn your environment.