How Cerebral cut network downtime and RCA time
Cerebral cut downtime and accelerated RCA (root cause analysis) by compressing “time-to-context” from an 8-15 minute manual triage loop into an automated, roughly five-second evidence assembly and reasoning pass. Instead of waking an on-call engineer to SSH in, inspect Kubernetes state, check configs, and cross-check Grafana, Mesh continuously pulls the full snapshot: workload health, sequencer logs, Prometheus probes, and the Alertmanager firing set.
Mesh bundles that context into an auditable evidence pack and produces a conservative, chain-aware decision that never takes destructive action on chain-critical services. It shortens investigation cycles while preventing “wrong but fast” actions that could halt the chain.
Mesh has compressed our RCA-to-decision time by roughly 90%. For chain infrastructure, getting the right context fast matters more than getting the right action fast - a wrong restart can corrupt data in ways we can’t recover from. Mesh maintains a continuous picture of the system and recognizes when issues across different components are part of the same incident, not three separate ones.
Madhav Goyal | Lead Engineer, Camp Network
Who Camp is
Camp Network is a high-performance sovereign Layer 1 designed for content provenance and IP-aware infrastructure. The chain consistently sustains throughput above 10,000 transactions per second and operates on an EVM architecture with data availability on Celestia.
Camp closed a $30M Series A in 2025 and runs blockchain infrastructure where chain liveness must never depend on a process restart, and observability cannot be outsourced to a third-party cloud.
The full cluster snapshot - every workload, scrape target, and configuration - is captured as Chapter 0 and InfraGraph before any decisions are taken.
The DevOps load Mesh was deployed to address
A Layer 1 in a low-traffic period produces a metric pattern indistinguishable from a real outage. Block production goes flat. SequencerBlockProductionStopped fires. Within seconds, two downstream ReadNodeFeedStale alerts fire on the read nodes. Three alerts on a perfectly healthy chain.
For an engineer paged at 2 AM, every one of those is indistinguishable from a real outage until they verify. The manual verification path:
| Step | Time |
|---|---|
| SSH into the bastion | ~1 min |
| Survey pod state | ~30 s |
| Inspect the sequencer workload | ~2 min (reading) |
| Tail recent logs | ~30 s |
| Cross-check Grafana dashboards | ~3-5 min |
| Reason about idle vs. real stoppage | ~1-3 min |
| Document the dismissal | ~1 min |
| Total time-to-context | 8-15 minutes per alert |
Multiply by three alerts per quiet period, multiple quiet periods per week, and a small on-call rotation, and the engineer-hours add up. Crucially, the cost of being wrong is higher than the cost of being slow: any AI assistant that proposed restarting the sequencer to clear the alert would halt the chain.
This is the constraint Mesh was built for.
What Mesh does
Mesh observes, reasons, and produces audit-grade decisions. It does not execute against chain-critical services. Three watchers run autonomously on a single host, observation-only, never opening an inbound port:
| Watcher | Cadence | Role |
|---|---|---|
| Kubernetes health watcher | Every 60 s | Catches any workload becoming unhealthy |
| Chain heartbeat | Every 30 min | Pulls a comprehensive snapshot regardless of whether anything is firing |
| Daily digest writer | Every 60 min | Aggregates the day's runs into a markdown brief for engineers |
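As a rough illustration of how the three cadences in the table interleave on one host, here is a minimal sketch. The watcher names and the `due_watchers` helper are hypothetical; only the intervals come from the table, and the real scheduler may work quite differently.

```python
# Cadences from the table above, in seconds. The dictionary keys are
# illustrative names, not identifiers taken from Mesh itself.
WATCHER_PERIODS = {
    "kubernetes_health_watcher": 60,      # every 60 s
    "chain_heartbeat": 30 * 60,           # every 30 min
    "daily_digest_writer": 60 * 60,       # every 60 min
}


def due_watchers(elapsed_seconds: int) -> list[str]:
    """Return the watchers whose period divides the elapsed time evenly."""
    return [
        name
        for name, period in WATCHER_PERIODS.items()
        if elapsed_seconds % period == 0
    ]


if __name__ == "__main__":
    for t in (60, 1800, 3600):
        print(t, due_watchers(t))
```

Under this sketch the health watcher runs alone on most ticks, and all three watchers coincide once an hour.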
Every run bundles workload state, sequencer logs, Prometheus probes, and currently firing alerts into an evidence pack, then writes the resulting decision to a Merkle-rooted vault. The chain’s full topology is captured as InfraGraph. Outcomes accumulate in ReasoningBank, which learns the difference between “block production is flat because traffic is low” and “block production is flat because something is broken.”
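To make "Merkle-rooted" concrete, the sketch below hashes each item of an evidence pack and folds the hashes into a single root. It is an assumption-laden illustration, not Mesh's vault format: the item layout, the placeholder values, and the `leaf_hash` and `merkle_root` helpers are all hypothetical.

```python
import hashlib
import json


def leaf_hash(item: dict) -> bytes:
    """Hash one evidence item from a canonical (sorted-key) JSON encoding."""
    encoded = json.dumps(item, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # The 0x00 prefix keeps leaf hashes distinct from interior-node hashes.
    return hashlib.sha256(b"\x00" + encoded).digest()


def merkle_root(leaves: list[bytes]) -> str:
    """Fold leaf hashes pairwise into one root; an unpaired node is promoted."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = leaves
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest())
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()


# Hypothetical evidence pack with placeholder values, one item per source.
evidence_pack = [
    {"kind": "workload_state", "data": {"sequencer": "Running"}},
    {"kind": "sequencer_logs", "data": ["placeholder log line"]},
    {"kind": "prometheus_probes", "data": {"block_production_rate": 0}},
    {"kind": "firing_alerts", "data": ["SequencerBlockProductionStopped"]},
]
root = merkle_root([leaf_hash(item) for item in evidence_pack])
```

Because the root depends on every leaf, changing any single evidence item after the fact yields a different root, which is the property that makes a stored run checkable later.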
What context assembly looks like in 5 seconds
For the SequencerBlockProductionStopped alert that fired on 2026-05-21, one Mesh run (run_20260521T152147_4b933495) reconstructed the entire engineer triage workflow autonomously:
| Time | Step |
|---|---|
| T+0.0 s | Mesh's heartbeat watcher tick begins |
| T+0.5 s | Kubernetes state of all chain-critical workloads pulled in |
| T+1.0 s | Live block-production, DA-finalization, feed-relay, and read-node metrics retrieved from Prometheus |
| T+1.5 s | Last few log lines from the sequencer container retrieved |
| T+1.7 s | Currently firing alerts read from Alertmanager |
| T+1.8 s | All evidence bundled and trigger pipeline begins |
| T+3.5 s | Deterministic hypothesis engine: no signature template match -> defers to LLM reasoning lane |
| T+3.9 s | Claude Sonnet given the full context with chain-aware system prompt |
| T+4.7 s | LLM returns: decision=escalate, action=open_incident, confidence=0.4 |
| T+4.9 s | Safety case score: blocked from execution (4 reasons cited) |
| T+5.0 s | Run lands in the vault, awaiting human review; Merkle root recorded for cryptographic audit |
Total time-to-context: 5 seconds, versus 8-15 minutes (480-900 seconds) manually. That is roughly 100x to 180x compression of the triage step, and every step is independently verifiable from the vault.
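The hand-off in this run, where the deterministic hypothesis engine finds no signature match and defers to the LLM reasoning lane, can be sketched as a two-lane lookup. Everything here is a simplified assumption: the `Decision` shape, the `triage` function, the stubbed model call, and the "dismiss" signature added at the end are hypothetical. Only the escalate / open_incident / 0.4 values come from the run described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class Decision:
    decision: str
    action: str
    confidence: float
    lane: str  # which lane produced it: "deterministic" or "llm"


def triage(
    firing_alerts: Iterable[str],
    signatures: Dict[FrozenSet[str], Decision],
    llm_lane: Callable[[FrozenSet[str]], Decision],
) -> Decision:
    """Try known signature templates first; fall back to the LLM lane."""
    key = frozenset(firing_alerts)
    known = signatures.get(key)
    if known is not None:
        return known
    return llm_lane(key)


def stub_llm_lane(firing_alerts: FrozenSet[str]) -> Decision:
    # Stand-in for the real model call; returns the values reported for this run.
    return Decision("escalate", "open_incident", 0.4, "llm")


signatures: Dict[FrozenSet[str], Decision] = {}
alerts = ["SequencerBlockProductionStopped"]

first = triage(alerts, signatures, stub_llm_lane)   # no template yet -> LLM lane

# Hypothetical follow-up: once an operator outcome is recorded, the pattern
# becomes a signature and later runs resolve without calling the LLM.
signatures[frozenset(alerts)] = Decision("dismiss", "no_action", 0.9, "deterministic")
second = triage(alerts, signatures, stub_llm_lane)
```

Keeping the deterministic lane first means a known pattern never depends on model output, while unknown patterns still get a reasoned, low-confidence proposal for human review.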
What Mesh concluded - in its own words
The LLM’s reasoning, pulled byte-for-byte from the vault entry:
Blockchain core service indicates blockchain core infrastructure. Automated restarts, rollbacks, and scaling are prohibited for chain services as they risk halting block production or losing sync state. Requires human operator assessment of chain health status and root cause.
The model:
- Identified that the service was chain core - correct domain recognition.
- Refused to propose any destructive action - correct safety posture.
- Cited the candidate cause and asked for human review - correct outcome.
- Logged confidence at 0.4 - honest about its own uncertainty.
An engineer reading this digest in the morning dismisses the alert in roughly 30 seconds. Mesh has already done the SSH-and-kubectl work.
The killer anecdote: the AI that almost halted the chain
Early on, in a simulation environment reproducing the same faults, the LLM proposer was given the same heartbeat signal. Its first response: restart_deployment on the sequencer. Restarting the sequencer mid-block-production halts the chain and can corrupt the data directory, causing fatal downtime.
Four independent safety walls blocked execution before any actuation path was reached:
- Schema validation rejected the proposal - the LLM had emitted an action string outside the bounded enum.
- Approval gate parked the run regardless of the proposal.
- Actuators were structurally disabled at the environment level - no execution path existed.
- Namespace allowlist specifically excluded the chain-core namespace - even if execution had been enabled, the actuator would have refused.
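The four walls above can be sketched as independent checks that each contribute a blocking reason. This is an illustrative sketch, not Mesh's code: the `AllowedAction` members other than open_incident, the `safety_case` function, its parameters, and the literal namespace name "chain-core" are assumptions.

```python
from enum import Enum
from typing import FrozenSet, List


class AllowedAction(str, Enum):
    # open_incident appears in the run above; no_action is a hypothetical member.
    OPEN_INCIDENT = "open_incident"
    NO_ACTION = "no_action"


def safety_case(
    proposal: dict,
    *,
    approved: bool = False,
    actuators_enabled: bool = False,
    allowed_namespaces: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return every reason the proposal is blocked; an empty list means allowed.

    Each wall is evaluated on its own rather than short-circuiting, so a
    single failure in one layer cannot unblock execution.
    """
    reasons = []
    if proposal.get("action") not in {a.value for a in AllowedAction}:
        reasons.append("schema: action outside the bounded enum")
    if not approved:
        reasons.append("approval gate: run parked for human review")
    if not actuators_enabled:
        reasons.append("actuators: disabled at the environment level")
    if proposal.get("namespace") not in allowed_namespaces:
        reasons.append("allowlist: namespace not permitted")
    return reasons


# The proposal from the anecdote; the namespace string is an assumed name.
bad_proposal = {"action": "restart_deployment", "namespace": "chain-core"}
blocked = safety_case(bad_proposal)
```

Even if approval and actuators were both switched on in this sketch, the same proposal would still be stopped by the schema check and the namespace allowlist, which is the point of layering the controls.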
In the next iteration, after a 15-line chain-aware paragraph was added to the LLM’s system prompt, the same signal produced the correct open_incident recommendation cited above. The change was one commit; the audit trail is byte-for-byte replayable.
How each run makes the next one smarter
- Every run improves the next one: Mesh does not just triage a single alert; it turns each run into reusable operational memory.
- InfraGraph learns dependencies: Mesh builds a live topology of services, workloads, nodes, and alert relationships so the next incident already has blast-radius context.
- ReasoningBank learns outcomes: Each decision is stored with the eventual operator result, helping Mesh recognize patterns like lazy-idle false positives, correlated DA/feed failures, and repeated unnecessary escalations.
- Triage gets faster over time: As signatures accumulate, deterministic matching can resolve known patterns before the LLM is needed.
- False positives get sharper: Repeated examples teach Mesh the difference between “block production is flat because traffic is low” and “block production is flat because something is broken.”
- Correlation improves RCA: InfraGraph helps answer whether an alert is isolated, downstream of another issue, or part of a broader network pattern.
- Safety remains structural: Camp gets faster RCA without risking an AI-initiated sequencer restart, because execution is blocked by layered safety controls.
- Auditability compounds: Every read, decision, LLM proposal, and safety block is Merkle-rooted and replayable for future review.
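To illustrate how a dependency graph can fold several alerts into one incident, here is a minimal sketch. The topology, component names, and the `upstream` and `correlate` helpers are hypothetical stand-ins for InfraGraph, and the sketch assumes the dependency graph is acyclic.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical topology: each component maps to what it depends on directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "read-node-0": ["feed-relay"],
    "read-node-1": ["feed-relay"],
    "feed-relay": ["sequencer"],
    "sequencer": [],
}


def upstream(component: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Collect every transitive dependency of a component."""
    seen: Set[str] = set()
    stack = list(depends_on.get(component, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def correlate(
    alerts: List[Tuple[str, str]], depends_on: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Group (alert, component) pairs under the alerting root they sit below."""
    alerting = {component for _, component in alerts}
    # A root is an alerting component with no alerting component upstream of it.
    roots = {c for c in alerting if not (upstream(c, depends_on) & alerting)}
    incidents: Dict[str, List[str]] = {root: [] for root in sorted(roots)}
    for name, component in alerts:
        if component in roots:
            incidents[component].append(name)
        else:
            for root in sorted(upstream(component, depends_on) & roots):
                incidents[root].append(name)
    return incidents


alerts = [
    ("SequencerBlockProductionStopped", "sequencer"),
    ("ReadNodeFeedStale", "read-node-0"),
    ("ReadNodeFeedStale", "read-node-1"),
]
incidents = correlate(alerts, DEPENDS_ON)
```

With this assumed topology, the three alerts from the quiet-period scenario collapse into a single incident rooted at the sequencer rather than three separate pages.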
Observability bugs Mesh surfaced as a side effect
While doing its own context assembly, Mesh’s autonomous probing surfaced three real non-chain monitoring gaps in the cluster that had gone unnoticed:
- A pair of phantom Prometheus scrape targets pointing at a read-node replica that does not exist. The secret was prepared; the pod was not deployed.
- Node-level kubelet metrics were not reaching Prometheus due to a scrape-format mismatch.
- Two pod scrape annotations pointed at endpoints returning 404.
None of these affected chain function. All would have been caught instantly if someone had been actively reading the monitoring config end-to-end, which is precisely the kind of background audit a bounded-autonomy SRE does on every tick.
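A background audit of this kind can be approximated by comparing configured scrape targets against what is actually deployed and responding. The sketch below assumes the target list, the set of deployed pods, and a status lookup are already available as plain Python values; the `audit_scrape_targets` function, pod names, and URLs are hypothetical, and how Mesh gathers these inputs is not described in the source.

```python
from typing import Callable, Iterable, List, Tuple


def audit_scrape_targets(
    targets: Iterable[dict],
    deployed_pods: Iterable[str],
    http_status: Callable[[str], int],
) -> List[Tuple[str, str]]:
    """Flag targets whose pod is missing or whose endpoint is not returning 200."""
    deployed = set(deployed_pods)
    findings = []
    for target in targets:
        if target["pod"] not in deployed:
            # Configured in monitoring, but nothing is running behind it.
            findings.append((target["url"], "phantom target: pod not deployed"))
            continue
        status = http_status(target["url"])
        if status != 200:
            findings.append((target["url"], f"endpoint returned {status}"))
    return findings


# Hypothetical inputs with placeholder names and URLs.
targets = [
    {"pod": "read-node-0", "url": "http://read-node-0:9090/metrics"},
    {"pod": "read-node-2", "url": "http://read-node-2:9090/metrics"},
    {"pod": "feed-relay", "url": "http://feed-relay:8080/stats"},
]
deployed_pods = ["read-node-0", "feed-relay"]
statuses = {
    "http://read-node-0:9090/metrics": 200,
    "http://feed-relay:8080/stats": 404,
}
findings = audit_scrape_targets(targets, deployed_pods, lambda url: statuses[url])
```

Passing the status lookup in as a function keeps the check testable offline; a live audit would swap in a real HTTP probe.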