About Capnode

We're building the SRE that never sleeps.

Capnode is an autonomous AI SRE for Kubernetes. It watches every cluster you run, catches failures before the alert fires, fixes the safe ones on its own, and asks a human before it touches anything risky — so your on-call rotation gets its nights back.


Our mission

Kubernetes shouldn't cost you your sleep.

The same pattern repeats in every platform team. A pod starts crash-looping at 3am. PagerDuty fires. Someone half-awake opens a laptop, runs kubectl describe, scrolls events, recognizes the OOMKill they've seen a dozen times before, bumps the memory limit, and goes back to bed — until it happens again next week on a different service. The runbook is in someone's head. The fix is mechanical. The toll is human.

And while everyone's busy firefighting, the cluster quietly bleeds money: idle dev namespaces left running over the weekend, load balancers nobody decommissioned, oversized requests that reserve capacity no workload ever uses. The waste is invisible right up until the cloud bill arrives.

Capnode's answer is a closed loop that runs continuously, on every cluster, without a human in the seat: Detect → Diagnose → Remediate → Learn.

Detect. The agent natively recognizes 25+ failure modes — CrashLoopBackOff, OOMKilled, ImagePullBackOff, pending PVCs, HPA thrash, DNS outages, cert expiry, node pressure, configmap drift — before they ever surface as an alert. No Prometheus required.
Diagnose. Capnode correlates events, spec, and history into a verified, evidence-backed root cause — not a guess. Ask Aria, the conversational layer, "why is this pod crashing?" in plain English and get an answer you can trust, with the receipts.
Remediate. A memory-first deterministic engine heals known failures like OOMKills and CrashLoops in milliseconds. Safe, reversible actions run automatically; risky ones wait for a human click.
Learn. Every incident and resolution feeds the memory. The next time the same shape appears — on this cluster or another — the fix is already known.
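The four stages above can be sketched in a few lines of Python. This is an illustration only, not Capnode's implementation: the ClosedLoop class, the simplified pod dictionary, the three-item failure set, and the fix names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Illustrative subset of the failure modes named above (assumption).
KNOWN_FAILURES = {"CrashLoopBackOff", "OOMKilled", "ImagePullBackOff"}


@dataclass
class Incident:
    workload: str
    failure: str
    evidence: List[str]


@dataclass
class ClosedLoop:
    # Hypothetical memory: failure name -> fix that resolved it before.
    memory: Dict[str, str] = field(default_factory=dict)

    def detect(self, pod: dict) -> Optional[str]:
        """Return a recognized failure name, or None for a healthy pod."""
        reason = pod.get("reason")
        return reason if reason in KNOWN_FAILURES else None

    def diagnose(self, pod: dict, failure: str) -> Incident:
        """Attach the evidence (here, just events) to the failure."""
        return Incident(pod["name"], failure, list(pod.get("events", [])))

    def remediate(self, incident: Incident) -> str:
        """Use a remembered fix if one exists; otherwise defer to a person."""
        return self.memory.get(incident.failure, "ask-a-human")

    def learn(self, incident: Incident, fix: str) -> None:
        """Record the fix so the same failure is known next time."""
        self.memory[incident.failure] = fix

    def run_once(self, pod: dict) -> Optional[str]:
        failure = self.detect(pod)
        if failure is None:
            return None
        return self.remediate(self.diagnose(pod, failure))


loop = ClosedLoop()
pod = {"name": "api-7f9c", "reason": "OOMKilled",
       "events": ["container exceeded memory limit"]}

first = loop.run_once(pod)   # nothing in memory yet
loop.learn(loop.diagnose(pod, "OOMKilled"), "raise-memory-limit")
second = loop.run_once(pod)  # same failure, now known
```

In this sketch the first pass defers to a human and the second reuses the recorded fix, which is the behavior the Learn stage describes.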

That loop doesn't just keep clusters healthy. It also dissolves idle non-prod resources at rest and restores them on the first push, right-sizes workloads to real demand, and continuously scans cluster posture — so reliability, cost, and security all improve from the same closed loop.

How Capnode thinks

Memory-first, human-in-the-loop, evidence-backed.

Three principles shape every decision the loop makes — and every decision it deliberately leaves to you.

Memory before models

Capnode tries what it already knows works before it reaches for inference. A failure shape it has resolved before is healed deterministically, in milliseconds — fast, repeatable, and explainable. The AI is for the unknown, not the routine.
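One way to picture "memory before models" is a lookup keyed on the shape of a failure, with inference called only on a miss. The sketch below assumes a shape is the tuple of reason, exit code, and container name. That choice, the FixMemory class, and the stand-in inference function are hypothetical, not a description of Capnode's engine.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def shape_key(reason: str, exit_code: int, container: str) -> str:
    """Stable fingerprint for a failure shape (assumed fields)."""
    raw = f"{reason}|{exit_code}|{container}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


class FixMemory:
    def __init__(self, infer: Callable[[str, int, str], str]) -> None:
        self._known: Dict[str, str] = {}
        self._infer = infer  # fallback for shapes never seen before

    def remember(self, reason: str, exit_code: int, container: str, fix: str) -> None:
        self._known[shape_key(reason, exit_code, container)] = fix

    def resolve(self, reason: str, exit_code: int, container: str) -> Tuple[str, str]:
        """Return (fix, source); source is 'memory' or 'inference'."""
        key = shape_key(reason, exit_code, container)
        if key in self._known:
            return self._known[key], "memory"
        return self._infer(reason, exit_code, container), "inference"


calls: List[str] = []


def slow_inference(reason: str, exit_code: int, container: str) -> str:
    # Stand-in for a model call; records that it was invoked.
    calls.append(reason)
    return "needs-investigation"


memory = FixMemory(slow_inference)
# Exit code 137 is what a container reports when killed by SIGKILL, as in an OOM kill.
memory.remember("OOMKilled", 137, "api", "raise-memory-limit")

known = memory.resolve("OOMKilled", 137, "api")
unknown = memory.resolve("Error", 1, "worker")
```

The known shape is answered from the dictionary without touching the inference function, and only the unfamiliar shape reaches it.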

Humans own the risk

Actions are tiered. Safe, reversible fixes run automatically. Risky ones, meaning anything with real blast radius, pause for explicit human approval. The agent is RBAC-scoped, least-privilege, and never mutates its own namespace. Autonomy where it's safe; a click where it isn't.
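The tiering can be read as a small decision function. The sketch below is a guess at the shape of such a gate, not Capnode's actual policy: the Risk enum, the Action fields, the capnode-system namespace name, and the returned strings are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Risk(Enum):
    SAFE = "safe"
    RISKY = "risky"


@dataclass(frozen=True)
class Action:
    name: str
    namespace: str
    reversible: bool
    risk: Risk


# Hypothetical name for the namespace the agent itself runs in.
AGENT_NAMESPACE = "capnode-system"


def decide(action: Action, approved_by: Optional[str] = None) -> str:
    """Return 'run', 'await-approval', or 'refuse' for a proposed action."""
    if action.namespace == AGENT_NAMESPACE:
        return "refuse"  # the agent never mutates its own namespace
    if action.risk is Risk.SAFE and action.reversible:
        return "run"  # safe and reversible: no human needed
    # Everything else needs an explicit approver before it runs.
    return "run" if approved_by else "await-approval"


restart = Action("restart-pod", "shop", reversible=True, risk=Risk.SAFE)
drain = Action("drain-node", "shop", reversible=False, risk=Risk.RISKY)
self_edit = Action("patch-agent", AGENT_NAMESPACE, reversible=True, risk=Risk.SAFE)
```

Here decide(restart) runs immediately, decide(drain) waits until an approver is supplied, and decide(self_edit) is refused regardless of approval.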

Verified, not vibes

Every diagnosis is grounded in real cluster evidence — events, spec, and history that Aria can show you. No hallucinated root causes, no answers you can't check. If Capnode can't be confident, it says so and hands the decision back to you.

The company

Built by AIKAY Technologies.

Capnode is a product of AIKAY Technologies Pvt Ltd, based in India. We build reliability tooling for the platform and SRE teams who keep modern infrastructure running — the people who get paged when something breaks and are expected to have it fixed before anyone notices.

We started Capnode because we'd lived the problem: the manual runbooks, the repetitive remediations, the quiet cloud waste, the fatigue of an on-call rotation that never truly rests. We believe most of that toil is automatable — and that the parts which aren't should stay firmly in human hands. So we built a product that does the mechanical work autonomously and gives operators sharper, evidence-backed control over everything else.

Our north star is simple: a cluster operator should be able to trust Capnode the way they'd trust a seasoned teammate — one who never sleeps, never forgets a past incident, and always asks before doing anything they'd want to be asked about.

We're early, and we're building in the open with the teams who run Kubernetes at scale. If that sounds like you, we'd love to hear how you operate — reach us any time at support@capnode.io.

What we value

The principles we won't trade away.

These aren't posters on a wall — they're the constraints we hold the product to on every release.

Trust by design

Least-privilege from the first install. The agent is RBAC-scoped and can't touch its own namespace, and its blast radius is contained by architecture, not by a promise. We scan posture rather than assume it.

Contained by design

Human-in-the-loop

Autonomy is earned, not assumed. Safe fixes run on their own; anything risky waits for a person. When the stakes are real, we will never ship a default that mutates production without a human in the decision.

Approval on risk

Honest engineering

No hallucinated root causes, no numbers you can't verify, no fix we can't explain. If Capnode isn't confident, it tells you and hands the decision back. We'd rather under-claim and be trusted than over-promise and be caught.

Evidence-backed

Built for scale

One cluster or a thousand, ten pods or fifty thousand — Capnode is designed for the real fleet. Aggregated alerts, paginated views, and a stateless server that scales horizontally, because reliability tooling has to survive its own success.

Fleet-scale

Build with us.

Deploy the Capnode agent into your cluster and watch the loop close — or come help us build the SRE that never sleeps. We're hiring the people who'll define how Kubernetes runs itself.
