Where Monitoring Systems Die
When dashboards are all green but users are already down, you’re facing a control system failure, not an observability problem.
The Incident
02:17. A major customer reports massive request timeouts in the support channel.
02:19. On-call gets paged.
02:21. Grafana opens. All green. CPU normal. Memory normal. Error rate near zero. The system looks completely healthy.
02:25. More customers report timeouts. Business leaders join the call: How many customers are affected? What's the impact scope? Do we need to roll back? How long to recover?
No one can answer. Monitoring systems exist, but they provide nothing usable for decisions.
How it unraveled
First: monitoring reality diverged from user reality. At 02:30, the infra dashboard showed CPU < 40%, memory normal, network normal, pod count normal, no error spike. Meanwhile, users experienced request timeouts, frozen pages, API latency > 20 seconds.
Second: we couldn’t answer critical questions. Management asked five questions: How big is the impact? Specific region? Single customer? Need rollback? Need public announcement? We could answer none of them.
Third: trust eroded. By 02:40, engineers were SSHing into machines, manually checking logs, printing internal state, adding temporary debug. They stopped looking at dashboards. Once engineers bypass the monitoring system, the system has lost control.
Fourth: monitoring itself nearly failed. At 02:50, Prometheus queries started slowing. High-cardinality queries OOM'd. Some dashboards took > 40 seconds to load. If Prometheus crashed, we'd be completely blind.
Root cause (found at 03:12 through manual investigation): a core dependency service had exhausted its connection pool, causing cascading timeouts. But infra metrics didn’t trigger any alert. No dependency latency SLO, no per-client SLO, no request saturation metrics. Monitoring wasn’t wrong. It just didn’t monitor actual risks.
I walked away from that incident realizing we had observability but no control capability. Monitoring existed. Decision capability did not.
If a monitoring system is going to die, where will it die?
Why Most Monitoring Systems Provide No Value
At 02:25, the dashboard showed CPU, memory, network: everything infrastructure engineers care about. But it couldn’t tell us how many users were affected or whether we should rollback.
Most monitoring systems measure the wrong things. They measure infrastructure. They don’t measure what matters: organizational action capability.
Monitoring value = ability to change behavior or decisions. If it cannot trigger action, it’s noise.
CPU 90% doesn’t mean risk
CPU 90% but no user impact? Probably fine. CPU 90% during traffic peak? Expected. CPU 90% during low traffic? Suspicious. Raw metrics tell you state. Risk requires interpretation.
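The interpretation step is mechanical enough to sketch. A minimal example, with an illustrative 80% cutoff and a made-up `TrafficLevel` enum (the thresholds are assumptions, not recommendations):

```python
from enum import Enum

class TrafficLevel(Enum):
    LOW = "low"
    PEAK = "peak"

def interpret_cpu(cpu_pct: float, traffic: TrafficLevel, users_impacted: bool) -> str:
    """Map a raw CPU reading to a risk label using context, not the number alone."""
    if users_impacted:
        return "critical"  # user impact dominates any infrastructure metric
    if cpu_pct < 80:
        return "ok"
    # High CPU: expected at peak, suspicious during low traffic
    return "expected" if traffic is TrafficLevel.PEAK else "suspicious"

print(interpret_cpu(90, TrafficLevel.PEAK, users_impacted=False))  # expected
print(interpret_cpu(90, TrafficLevel.LOW, users_impacted=False))   # suspicious
print(interpret_cpu(35, TrafficLevel.LOW, users_impacted=True))    # critical
```

The point is the shape, not the thresholds: risk is a function of signal plus context, and that function has to exist somewhere other than in a responder's head at 02:21.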
In our incident, CPU was normal. Users were down. The metrics weren’t wrong. They measured the wrong thing.
Signal-to-noise was inverted
At 02:25 we had too much noise (infrastructure metrics) and not enough signal (user impact). The solution isn’t more data. It’s less noise: filter, group, raise confidence thresholds, remove low-value metrics.
Decision latency was infinite
The main proxy for action capability is decision latency: time from signal to action decision. Target for critical incidents: < 5 minutes.
- Signal detection: < 1 minute
- Interpretation: < 2 minutes
- Decision: < 2 minutes
We were stuck at interpretation. We couldn’t translate signals into user impact.
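Those targets can be checked mechanically against stage timestamps. A sketch using the budget values above; the incident timeline is reconstructed from the times in this post, and the date is a placeholder:

```python
from datetime import datetime, timedelta

# Targets from the text; stage order is start -> detection -> interpretation -> decision.
BUDGET = {
    "detection": timedelta(minutes=1),
    "interpretation": timedelta(minutes=2),
    "decision": timedelta(minutes=2),
}

def breached_stages(timestamps: dict) -> list:
    """Return the stages whose measured duration exceeded their budget."""
    order = ["start", "detection", "interpretation", "decision"]
    return [
        stage
        for prev, stage in zip(order, order[1:])
        if timestamps[stage] - timestamps[prev] > BUDGET[stage]
    ]

# Our night, roughly: report at 02:17, page at 02:19, root cause at 03:12.
t = lambda h, m: datetime(2024, 1, 1, h, m)  # placeholder date
incident = {"start": t(2, 17), "detection": t(2, 19),
            "interpretation": t(3, 12), "decision": t(3, 15)}
print(breached_stages(incident))  # ['detection', 'interpretation', 'decision']
```

Fifty-three minutes in interpretation against a two-minute budget is the whole story of the incident in one number.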
The real question isn’t “what metrics should we collect?” It’s “what decisions do we need to make, and what information supports them?”
Monitoring Systems Are Safety Systems
An airplane cockpit doesn’t show raw altitude and fuel numbers. It shows: “distance to danger threshold,” “fuel sufficient to reach destination,” “approaching stall speed.” Raw metrics are less valuable than processed insights. Monitoring isn’t about showing altitude. It’s about ensuring you don’t crash.
At 02:25, we had raw metrics. We had no safety signals.
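A cockpit-style signal is usually one small derivation away from a raw metric. A sketch with illustrative numbers: instead of graphing free connections in a pool, alert on time-to-exhaustion at the current burn rate:

```python
def seconds_to_exhaustion(remaining: float, burn_per_second: float) -> float:
    """'Fuel sufficient to reach destination'-style signal:
    time left before a resource runs out at the current burn rate."""
    return float("inf") if burn_per_second <= 0 else remaining / burn_per_second

# e.g. 40 free connections, acquisitions outpacing releases by 2/s:
print(seconds_to_exhaustion(40, 2.0))  # 20.0 seconds to an exhausted pool
```

The raw gauge ("40 connections free") looks healthy. The derived signal ("20 seconds to empty") is a safety signal.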
The control loop
Monitoring is a feedback control system: Reality > Signal > Interpretation > Risk Framework > Decision > Action > System Change > New Reality.
If any link breaks, monitoring becomes observation noise.
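The loop can be sketched as a pipeline where any stage that produces nothing actionable breaks the chain. A minimal illustration; the stage functions are stand-ins, not real integrations:

```python
def run_loop(stages, state):
    """Run Signal > Interpretation > Decision > Action.
    Returns ('closed', result) or ('broken', failing_stage)."""
    for name, fn in stages:
        state = fn(state)
        if state is None:  # this link produced nothing actionable
            return ("broken", name)
    return ("closed", state)

# Our incident in miniature: signals existed, interpretation did not.
stages = [
    ("signal", lambda reality: {"cpu": 0.35, "timeouts": True}),
    ("interpretation", lambda sig: None),  # no SLO, no error budget: dead end
    ("decision", lambda risk: "rollback"),
    ("action", lambda decision: f"executed {decision}"),
]
print(run_loop(stages, {}))  # ('broken', 'interpretation')
```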
In our incident, the loop broke at every point. We had infrastructure signals instead of user signals. We couldn’t interpret what signals meant. We had no SLO, no error budget, no risk framework. We couldn’t decide what to do. No action triggered. The control loop never closed.
Classical control theory failures
Oscillation: alert storms where alerts trigger actions that cause more alerts. We’d seen it before, just not that night.
Noise: false positives that get muted, until real alerts are also muted. That night was the opposite: infrastructure metrics normal, users down. Noise masking signal.
Delay: we had fresh data, but it was the wrong data. CPU metrics were current. User impact was absent.
Monitoring That Survives Production Failures
At 02:50, Prometheus queries OOM’d. Dashboards took 40+ seconds to load. One Prometheus crash away from complete blindness.
Monitoring systems must be more stable than what they monitor. We especially need monitoring when production has problems. If monitoring is unstable, we’re blind when we need vision most.
Survivability levels
Level 0: Coupled with application. Monitoring runs on same infrastructure. App crashes, monitoring crashes. Fine for dev environments. Useless during incidents.
Level 1: Independent compute/storage. Monitoring has its own VMs and storage. App crashes, monitoring survives. But it still shares the failure domain: the AZ or region fails, monitoring fails with it. At 02:50, we realized we needed at least this.
Level 2: Independent failure domain. Different AZs, different network, different storage. Survives AZ failures but not regional ones.
Level 3: Cross-region with buffering. Multi-region, offline analysis. Survives regional failures and network partitions. Required when incident costs are measured in millions.
| Level | When App Crashes | When AZ Fails | When Region Fails | Applicable Scenarios |
|---|---|---|---|---|
| Level 0 | Monitoring crashes | Monitoring crashes | Monitoring crashes | Dev environments |
| Level 1 | Monitoring survives | Monitoring fails | Monitoring fails | Small production |
| Level 2 | Monitoring survives | Monitoring survives | Monitoring fails | Critical production |
| Level 3 | Monitoring survives | Monitoring survives | Monitoring survives | Critical business |
Cardinality: the hidden survival threat
Labels multiply. service × region × instance × endpoint × status = millions of time series. Queries that worked in steady state fail during incidents. Memory exhausts. Costs explode.
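The multiplication is worth doing explicitly: worst-case series count is the product of per-label cardinalities. The counts below are illustrative assumptions:

```python
import math

# Worst-case series count per metric name = product of per-label cardinalities.
labels = {"service": 50, "region": 4, "instance": 200, "endpoint": 30, "status": 5}
print(math.prod(labels.values()))  # 6000000 -> six million series from ONE metric
```

Adding a single ten-valued label to that metric multiplies the result by ten. That is why cardinality problems surface suddenly, usually during an incident.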
The solution isn’t a better database. It’s label governance: limit label dimensions, use recording rules, aggregate early.
Meta-monitoring
You need to monitor the monitoring system, but the meta-monitoring must be simpler than main monitoring. Otherwise you get recursion. In our incident, we didn’t know Prometheus was struggling until queries started failing.
Observability Storage Is a Data Platform Problem
Most people think monitoring = Grafana dashboards. Grafana is just visualization. The core is the database. When queries fail, dashboards fail.
Observability storage is a high-frequency time-series data warehouse: write-heavy, read-heavy, high cardinality, with compression, retention, and cost pressure. This isn’t an OLTP problem.
Cardinality explosion causes memory exhaustion (Prometheus OOM), slow queries (scanning millions of series), and storage cost blowup. Solution: label governance, recording rules for aggregation, never put high-cardinality data (user IDs, request IDs) in labels.
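A label-governance gate can be sketched as a pre-write check. The blocklist and the per-label cap below are illustrative assumptions:

```python
UNBOUNDED_KEYS = {"user_id", "request_id", "session_id", "trace_id"}
MAX_VALUES_PER_LABEL = 1000

_seen: dict = {}  # label key -> set of observed values

def allow_labels(labels: dict) -> bool:
    """Reject label sets that would explode cardinality."""
    for key, value in labels.items():
        if key in UNBOUNDED_KEYS:
            return False  # per-request identifiers never belong in labels
        values = _seen.setdefault(key, set())
        values.add(value)
        if len(values) > MAX_VALUES_PER_LABEL:
            return False  # this label dimension has grown unbounded
    return True

print(allow_labels({"service": "checkout", "status": "500"}))   # True
print(allow_labels({"service": "checkout", "user_id": "8731"})) # False
```

Real systems enforce this in relabeling rules or at the SDK layer; the point is that the policy is code, not a wiki page.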
Storage tiering matters during incidents. Hot incidents need hot data with fast queries. We were querying cold data during a hot incident, with no prioritization.
Query performance degrades at the worst time. Dashboards run complex aggregations on-demand during incidents. Without precomputation or caching, those queries slow, then fail, then OOM. Exactly when you need speed most, you get the heaviest queries.
From CPU to Revenue: Monitoring the Right Layer
CPU normal. Customers down. We couldn’t see user impact because monitoring was at the wrong layer.
Monitoring has layers: business risk/value, model/logic, pipeline, service, infrastructure. Closer to business = closer to truth. Closer to infrastructure = closer to noise.
We were stuck at the infrastructure layer. CPU normal, memory normal, everything looks fine. But users were down. The truth was at the business layer.
The cognitive stack
Raw Signal > Processed Signal > Risk Interpretation > Decision Interface > Action.
In our incident:
- Raw signal (CPU, memory): existed
- Processed signal (aggregation): existed
- Risk interpretation (SLO, error budget): missing
- Decision interface (dashboards answering questions): missing
- Action (alerts triggering behavior): missing
The stack broke at risk interpretation. Without SLOs or error budgets, we couldn’t interpret what signals meant for users, couldn’t answer management’s questions, and couldn’t trigger any action.
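The missing interpretation layer is small. A sketch of an error-budget calculation, assuming an illustrative 99.9% SLO:

```python
def error_budget_remaining(total: int, failed: int, slo: float = 0.999) -> float:
    """Fraction of the window's error budget left; negative means exhausted."""
    allowed = total - round(total * slo)  # failures the SLO permits this window
    return (allowed - failed) / allowed if allowed else 0.0

# 1M requests, 5,000 failures, against the 1,000 failures a 99.9% SLO allows:
print(error_budget_remaining(1_000_000, 5_000))  # -4.0: budget blown five times over
```

"Error budget at -400%" answers "how bad is it?" in a way "CPU at 38%" never will.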
Monitor value streams, not infrastructure
Value stream: User request > Service > Dependency > Response > User experience. In our incident, the value stream broke (connection pool exhausted) and we couldn’t see it because we only monitored infrastructure.
Per-customer SLOs would have shown “Customer X down, others normal” or “all customers affected.” That’s business-layer monitoring.
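A per-customer availability sketch, over hypothetical request outcomes:

```python
from collections import defaultdict

def per_customer_availability(requests):
    """requests: iterable of (customer_id, ok) -> {customer_id: availability}."""
    total, ok = defaultdict(int), defaultdict(int)
    for customer, success in requests:
        total[customer] += 1
        ok[customer] += success
    return {c: ok[c] / total[c] for c in total}

def breached(availability, slo=0.999):
    """Customers currently out of SLO."""
    return sorted(c for c, a in availability.items() if a < slo)

# Hypothetical window: one customer failing hard, another fine.
reqs = [("acme", False)] * 40 + [("acme", True)] * 60 + [("globex", True)] * 100
print(breached(per_customer_availability(reqs)))  # ['acme']
```

That single sorted list is the answer to "how many customers are affected?" that nobody could give at 02:25.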
Dashboards That Support Decisions
If we had a decision-supporting dashboard at 02:25, it would show user impact, risk level, recommended actions. Not CPU, memory, error rate.
Dashboards must answer four questions:
- Is the system safe? (Risk assessment)
- Is the situation getting worse? (Trend analysis)
- Do we need action? (Action trigger)
- What action? (Decision support)
At 02:25, our dashboards answered none of these.
Tactical monitoring (real-time, action-bound, decision-focused) is the core. Exploratory monitoring (historical, analytical, insight-focused) is secondary. During an incident you need tactical. We had exploratory.
Design dashboards for incident conditions, not steady state. During incidents you need fast decisions, clear actions, risk assessment. Steady-state value is secondary.
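A decision interface can be sketched as one function over a few processed signals, answering the four questions directly. Every threshold below is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class Signals:
    users_impacted_pct: float  # share of users seeing errors or timeouts
    burn_rate: float           # error-budget burn vs. sustainable (1.0 = on track)
    trend_slope: float         # positive = getting worse

def decision_view(s: Signals) -> dict:
    """Answer the four dashboard questions from processed signals."""
    safe = s.users_impacted_pct < 0.001 and s.burn_rate < 1.0  # is the system safe?
    worsening = s.trend_slope > 0                              # is it getting worse?
    need_action = not safe                                     # do we need action?
    action = ("rollback" if need_action and worsening          # what action?
              else "investigate" if need_action else "none")
    return {"safe": safe, "worsening": worsening,
            "need_action": need_action, "action": action}

# Signals resembling our incident: many users down, budget burning, worsening.
print(decision_view(Signals(users_impacted_pct=0.30, burn_rate=8.0, trend_slope=1.0)))
```

The logic is deliberately crude; the point is that the four answers are computed before the incident, not improvised during it.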
Why Engineers Stop Looking at Dashboards
At 02:40, engineers started SSHing into machines. They bypassed dashboards entirely because dashboards didn’t help.
Trust collapse
Monitoring is a trust system. Trust collapse symptoms: too many false positives, a single false negative (missed real incident), data delay, dashboards inconsistent with experience, missing data during incidents.
We had multiple: dashboards green while users were down, no user impact data, missed the real problem. Once trust collapses, engineers bypass monitoring. It becomes useless.
Trust is engineered, not declared
- Trust budget: allocate error budgets for false positives, false negatives, delay
- Trust transparency: show data freshness, confidence scores, error rates
- Trust recovery: acknowledge false positives, fix root causes, communicate improvements
- Trust monitoring: track mute rate, bypass rate, engineer investigation patterns
We had none of this. Trust collapsed silently.
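Of the four practices above, trust monitoring is the easiest to start with: two ratios computed from data most teams already have. The numbers below are hypothetical:

```python
def trust_metrics(alerts_fired: int, alerts_muted: int,
                  incidents: int, incidents_bypassed: int) -> dict:
    """Mute rate and bypass rate: proxies for whether engineers still trust monitoring.
    'Bypassed' = incidents debugged via SSH/logs instead of dashboards."""
    return {
        "mute_rate": alerts_muted / alerts_fired if alerts_fired else 0.0,
        "bypass_rate": incidents_bypassed / incidents if incidents else 0.0,
    }

# Hypothetical quarter: 200 alerts fired, 150 muted; 10 incidents, 9 debugged over SSH.
print(trust_metrics(200, 150, 10, 9))  # {'mute_rate': 0.75, 'bypass_rate': 0.9}
```

A bypass rate of 0.9 is a trust collapse you can see coming, instead of discovering it at 02:40.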
Ownership
No owner = doesn’t exist. Dashboards existed, but nobody owned them. They decayed. Nobody maintained them. Nobody trusted them.
Engineers bypass monitoring when it doesn’t help, when it’s unreliable, when it’s slow, or when only senior engineers can interpret it. In our incident, all four applied.
Risk-Driven Monitoring
If we’d designed monitoring around risk, we would have monitored connection pool saturation. Instead, we monitored CPU, memory, error rate, because those are easy.
Start from failure modes. What can go wrong? Connection pool exhaustion, dependency timeouts, request saturation. What signals indicate these risks? Pool utilization, dependency latency, queue depth. Prioritize by impact and probability. Validate: can we detect this? Do alerts trigger correct actions?
We skipped all of this. We monitored what was easy, not what mattered. If we’d monitored connection pool saturation, we would have found the problem before users noticed.
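The prioritization step is a one-liner once the failure modes are written down. The scores below are illustrative:

```python
# (failure mode, impact 1-5, probability 1-5, signal that would detect it)
failure_modes = [
    ("connection pool exhaustion", 5, 4, "pool utilization"),
    ("dependency timeout",         5, 3, "dependency p99 latency"),
    ("request saturation",         4, 3, "queue depth"),
    ("log volume disk full",       2, 2, "disk usage"),
]

# Rank by impact x probability; monitor from the top of the list down.
ranked = sorted(failure_modes, key=lambda m: m[1] * m[2], reverse=True)
for name, impact, prob, signal in ranked:
    print(f"{impact * prob:>2}  {name:<26} -> monitor {signal}")
```

In our case, the top-ranked failure mode is exactly the one that took us down, and its detecting signal is exactly what we weren't collecting.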
Six ways monitoring dies
- Exists but untrusted: engineers bypass it (our incident: yes)
- Alert noise collapse: alert storms, everything muted (our incident: no)
- System fragility: monitoring crashes with production (our incident: nearly)
- Can’t answer decision questions: raw metrics, no risk framework (our incident: yes)
- Dies with production: shared infrastructure, no independence (our incident: close)
- Monitoring diverges from reality: green dashboards, users down (our incident: yes)
| Death Mode | Symptoms | Root Cause | Prevention |
|---|---|---|---|
| Untrusted | Engineers SSH instead | Trust collapse, no ownership | Trust engineering, clear ownership |
| Alert noise | Alert storms, alerts muted | No filtering, no grouping | Alert governance, confidence thresholds |
| Fragile system | Prometheus OOM, slow queries | Coupled with app, shared failure domain | Independent infra, cardinality governance |
| No decision support | Metrics exist, can’t decide | Only raw metrics, no risk framework | Risk-driven design, SLO framework |
| Dies with production | Monitoring crashes too | Level 0 survivability | Independent failure domains |
| Diverges from reality | Green but users down | Wrong SLO, happy path bias | Real user metrics, correct SLO |
Where Will It Die?
Every failure in our incident traces back to one root cause: monitoring was not designed for organizational decision capability under uncertainty.
Trust collapse: engineers stop relying on it. Survivability failure: dies with production. Risk blindness: monitoring wrong things. Decision incapability: can’t answer questions. Control loop broken: no action, no feedback. Cognitive stack incomplete: missing interpretation and decision layers.
The sole purpose of monitoring systems is to maintain organizational action capability under uncertainty. Not data visibility. Not debugging. Not metric collection.
Build decision capability. Not dashboards.