Where Monitoring Systems Die

When dashboards are all green but users are already down, you’re facing a control system failure, not an observability problem.

The Incident

02:17. A major customer reports massive request timeouts in the support channel.

02:19. The on-call engineer gets paged.

02:21. Grafana opens. All green. CPU normal. Memory normal. Error rate near zero. The system looks completely healthy.

02:25. More customers report timeouts. Business leaders join the call: How many customers are affected? What’s the impact scope? Do we need to roll back? How long to recover?

No one can answer. Monitoring systems exist, but they provide nothing usable for decisions.

How it unraveled

First: monitoring reality diverged from user reality. At 02:30, the infra dashboard showed CPU < 40%, memory normal, network normal, pod count normal, no error spike. Meanwhile, users experienced request timeouts, frozen pages, API latency > 20 seconds.

Second: we couldn’t answer critical questions. Management asked five questions: How big is the impact? Specific region? Single customer? Need rollback? Need public announcement? We could answer none of them.

Third: trust eroded. By 02:40, engineers were SSHing into machines, manually checking logs, printing internal state, adding temporary debug. They stopped looking at dashboards. Once engineers bypass the monitoring system, the system has lost control.

Fourth: monitoring itself nearly failed. At 02:50, Prometheus queries started slowing. High cardinality queries OOM’d. Some dashboards took > 40 seconds to load. If Prometheus crashed, we’d be completely blind.

Root cause (found at 03:12 through manual investigation): a core dependency service had exhausted its connection pool, causing cascading timeouts. But infra metrics didn’t trigger any alert. No dependency latency SLO, no per-client SLO, no request saturation metrics. Monitoring wasn’t wrong. It just didn’t monitor actual risks.

I walked away from that incident realizing we had observability but no control capability. Monitoring existed. Decision capability did not.

If a monitoring system is going to die, where will it die?


Why Most Monitoring Systems Provide No Value

At 02:25, the dashboard showed CPU, memory, network: everything infrastructure engineers care about. But it couldn’t tell us how many users were affected or whether we should rollback.

Most monitoring systems measure the wrong things. They measure infrastructure. They don’t measure what matters: organizational action capability.

Monitoring value = ability to change behavior or decisions. If it cannot trigger action, it’s noise.

CPU 90% doesn’t mean risk

CPU 90% but no user impact? Probably fine. CPU 90% during traffic peak? Expected. CPU 90% during low traffic? Suspicious. Raw metrics tell you state. Risk requires interpretation.
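The interpretation rule above can be sketched as a small function. This is a minimal illustration, not a real alerting rule: the `Signal` fields, the 80% CPU cutoff, and the 1% error-rate threshold are all hypothetical values chosen to show the shape of the logic.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    cpu_pct: float        # raw CPU utilization (0-100)
    traffic_ratio: float  # current traffic / typical traffic for this hour
    user_error_rate: float

def interpret_cpu(sig: Signal) -> str:
    """Turn a raw CPU reading into a risk verdict using context."""
    if sig.user_error_rate > 0.01:
        # User impact trumps any infrastructure reading
        return "risk: user impact regardless of CPU"
    if sig.cpu_pct < 80:
        return "ok"
    # High CPU: expected at traffic peak, suspicious at low traffic
    return "expected" if sig.traffic_ratio > 0.8 else "suspicious"
```

The point is that the verdict depends on two context inputs (traffic and user impact), not on the raw metric alone.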

In our incident, CPU was normal. Users were down. The metrics weren’t wrong. They measured the wrong thing.

Signal-to-noise was inverted

At 02:25 we had too much noise (infrastructure metrics) and not enough signal (user impact). The solution isn’t more data. It’s less noise: filter, group, raise confidence thresholds, remove low-value metrics.

Decision latency was infinite

The main proxy for action capability is decision latency: time from signal to action decision. Target for critical incidents: < 5 minutes.

  • Signal detection: < 1 minute
  • Interpretation: < 2 minutes
  • Decision: < 2 minutes

We were stuck at interpretation. We couldn’t translate signals into user impact.
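The stage budgets above can be checked mechanically. A minimal sketch, with the budgets from the list (detection < 1 min, interpretation < 2 min, decision < 2 min, total < 5 min) hard-coded as defaults:

```python
def decision_latency_ok(detect_s: float, interpret_s: float, decide_s: float,
                        budget=(60, 120, 120)) -> dict:
    """Check each stage of the signal-to-action path against its budget."""
    stages = ("detect", "interpret", "decide")
    actual = (detect_s, interpret_s, decide_s)
    report = {s: a <= b for s, a, b in zip(stages, actual, budget)}
    report["total_ok"] = sum(actual) <= 300  # < 5 minutes end to end
    return report
```

In our incident the interpretation stage alone blew the total budget, which a check like this would make explicit in the retro.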

The real question isn’t “what metrics should we collect?” It’s “what decisions do we need to make, and what information supports them?”


Monitoring Systems Are Safety Systems

An airplane cockpit doesn’t show raw altitude and fuel numbers. It shows: “distance to danger threshold,” “fuel sufficient to reach destination,” “approaching stall speed.” Raw metrics are less valuable than processed insights. Monitoring isn’t about showing altitude. It’s about ensuring you don’t crash.

At 02:25, we had raw metrics. We had no safety signals.
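A cockpit-style signal reports distance to the danger threshold rather than the raw value. A minimal sketch, with the 10% headroom warning band as an illustrative choice:

```python
def safety_signal(value: float, danger: float, name: str) -> str:
    """Report distance to a danger threshold instead of the raw value,
    the way a cockpit shows 'approaching stall speed'."""
    headroom = (danger - value) / danger
    if headroom <= 0:
        return f"{name}: THRESHOLD BREACHED"
    if headroom < 0.1:
        return f"{name}: approaching threshold ({headroom:.0%} headroom)"
    return f"{name}: safe ({headroom:.0%} headroom)"
```

`safety_signal(95, 100, "pool utilization")` yields an "approaching threshold" warning while the raw number 95 says nothing by itself.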

The control loop

Monitoring is a feedback control system: Reality > Signal > Interpretation > Risk Framework > Decision > Action > System Change > New Reality.

If any link breaks, monitoring becomes observation noise.

```mermaid
graph LR
    A[User Reality] --> B[Signal]
    B --> C[Interpretation]
    C --> D[Risk Framework]
    D --> E[Decision]
    E --> F[Action]
    F --> G[System Change]
    G --> A
    style B fill:#ffcccc
    style C fill:#ffcccc
    style D fill:#ffcccc
    style E fill:#ffcccc
    B -.Break Point 1: Infrastructure signals,<br/>not user signals.-> B
    C -.Break Point 2: Cannot interpret<br/>signal meaning.-> C
    D -.Break Point 3: No SLO/error budget,<br/>no risk assessment.-> D
    E -.Break Point 4: Cannot decide<br/>what to do.-> E
```

In our incident, the loop broke at every point. We had infrastructure signals instead of user signals. We couldn’t interpret what signals meant. We had no SLO, no error budget, no risk framework. We couldn’t decide what to do. No action triggered. The control loop never closed.

Classical control theory failures

Oscillation: alert storms where alerts trigger actions that cause more alerts. We’d seen it before, just not that night.

Noise: false positives that get muted, until real alerts are also muted. That night was the opposite: infrastructure metrics normal, users down. Noise masking signal.

Delay: we had fresh data, but it was the wrong data. CPU metrics were current. User impact was absent.


Monitoring That Survives Production Failures

At 02:50, Prometheus queries OOM’d. Dashboards took 40+ seconds to load. One Prometheus crash away from complete blindness.

Monitoring systems must be more stable than what they monitor. We especially need monitoring when production has problems. If monitoring is unstable, we’re blind when we need vision most.

Survivability levels

Level 0: Coupled with application. Monitoring runs on same infrastructure. App crashes, monitoring crashes. Fine for dev environments. Useless during incidents.

Level 1: Independent compute/storage. Monitoring has its own VMs and storage. App crashes, monitoring survives. Region fails, monitoring also fails. At 02:50, we realized we needed at least this.

Level 2: Independent failure domain. Different AZs, different network, different storage. Survives AZ failures but not regional ones.

Level 3: Cross-region with buffering. Multi-region, offline analysis. Survives regional failures and network partitions. Required when incident costs are measured in millions.

| Level | When App Crashes | When AZ Fails | When Region Fails | Applicable Scenarios |
|---|---|---|---|---|
| Level 0 | Monitoring crashes | Monitoring crashes | Monitoring crashes | Dev environments |
| Level 1 | Monitoring survives | Monitoring fails | Monitoring fails | Small production |
| Level 2 | Monitoring survives | Monitoring survives | Monitoring fails | Critical production |
| Level 3 | Monitoring survives | Monitoring survives | Monitoring survives | Critical business |

Cardinality: the hidden survival threat

Labels multiply. service x region x instance x endpoint x status = millions of time series. Queries that worked in steady state fail during incidents. Memory exhausts. Costs explode.

The solution isn’t a better database. It’s label governance: limit label dimensions, use recording rules, aggregate early.
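The multiplication is worth making concrete. Series count per metric is roughly the product of label cardinalities, so dropping two high-cardinality labels (here via a hypothetical recording rule that pre-aggregates away `instance` and `endpoint`) cuts storage by orders of magnitude. The counts below are illustrative:

```python
from math import prod

def series_count(label_cardinalities: dict) -> int:
    """Estimate time-series count for one metric: the product of
    label cardinalities. Labels multiply."""
    return prod(label_cardinalities.values())

# Full label set: service x region x instance x endpoint x status
before = series_count({"service": 50, "region": 5, "instance": 200,
                       "endpoint": 40, "status": 5})   # 10 million series
# After aggregating away instance and endpoint early
after = series_count({"service": 50, "region": 5, "status": 5})
```

A back-of-the-envelope check like this, run before a new label ships, is the cheapest form of cardinality governance.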

Meta-monitoring

You need to monitor the monitoring system, but the meta-monitoring must be simpler than the main system; otherwise you get infinite recursion. In our incident, we didn’t know Prometheus was struggling until queries started failing.


Observability Storage Is a Data Platform Problem

Most people think monitoring = Grafana dashboards. Grafana is just visualization. The core is the database. When queries fail, dashboards fail.

Observability storage is a high-frequency time-series data warehouse: write-heavy, read-heavy, high cardinality, with compression, retention, and cost pressure. This isn’t an OLTP problem.

Cardinality explosion causes memory exhaustion (Prometheus OOM), slow queries (scanning millions of series), and storage cost blowup. Solution: label governance, recording rules for aggregation, never put high-cardinality data (user IDs, request IDs) in labels.

Storage tiering matters during incidents. Hot incidents need hot data with fast queries. We were querying cold data during a hot incident, with no prioritization.

Query performance degrades at the worst time. Dashboards run complex aggregations on demand during incidents. Without precomputation or caching, those queries slow down, then fail, then OOM. Exactly when you need speed, you get complexity.


From CPU to Revenue: Monitoring the Right Layer

CPU normal. Customers down. We couldn’t see user impact because monitoring was at the wrong layer.

Monitoring has layers: business risk/value, model/logic, pipeline, service, infrastructure. Closer to business = closer to truth. Closer to infrastructure = closer to noise.

We were stuck at the infrastructure layer. CPU normal, memory normal, everything looks fine. But users were down. The truth was at the business layer.

```mermaid
graph TD
    A[Business Risk/Value Layer<br/>User impact, revenue, SLO] -->|Closer to truth| B[Model/Logic Layer<br/>Business logic, data flow]
    B --> C[Pipeline Layer<br/>Request flow, dependency chain]
    C --> D[Service Layer<br/>Service health, latency]
    D -->|Closer to noise| E[Infrastructure Layer<br/>CPU, memory, network]
    F[02:25 Incident] -.Monitoring location.-> E
    F -.Truth location.-> A
    style A fill:#90EE90
    style E fill:#FFB6C1
    style F fill:#FFD700
```

The cognitive stack

Raw Signal > Processed Signal > Risk Interpretation > Decision Interface > Action.

In our incident:

  • Raw signal (CPU, memory): existed
  • Processed signal (aggregation): existed
  • Risk interpretation (SLO, error budget): missing
  • Decision interface (dashboards answering questions): missing
  • Action (alerts triggering behavior): missing

The stack broke at risk interpretation. Without SLOs or error budgets, we couldn’t interpret what signals meant for users, couldn’t answer management’s questions, and couldn’t trigger any action.
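Risk interpretation is mostly arithmetic once an SLO exists. A minimal error-budget sketch; the 0.5 "watch" threshold is an illustrative choice, not a standard:

```python
def error_budget(slo: float, good: int, total: int) -> dict:
    """Translate raw request counts into risk: how much of the
    error budget (1 - SLO) has been burned."""
    allowed_bad = (1 - slo) * total
    actual_bad = total - good
    burned = actual_bad / allowed_bad if allowed_bad else float("inf")
    verdict = "act now" if burned >= 1 else ("watch" if burned >= 0.5 else "safe")
    return {"burn": burned, "verdict": verdict}
```

With a 99.9% SLO and 10,000 failures in a million requests, the budget is burned 10x over: that single number answers "is the system safe?" in a way CPU percentage never can.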

Monitor value streams, not infrastructure

Value stream: User request > Service > Dependency > Response > User experience. In our incident, the value stream broke (connection pool exhausted) and we couldn’t see it because we only monitored infrastructure.

Per-customer SLOs would have shown “Customer X down, others normal” or “all customers affected.” That’s business-layer monitoring.
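Grouping request outcomes by customer is the core of that capability. A minimal sketch, assuming request outcomes are available as `(customer_id, ok)` pairs:

```python
from collections import defaultdict

def per_customer_impact(requests) -> dict:
    """Answer 'who is affected?': failure rate per customer.
    requests: iterable of (customer_id, ok: bool) pairs."""
    totals: dict = defaultdict(int)
    fails: dict = defaultdict(int)
    for customer, ok in requests:
        totals[customer] += 1
        if not ok:
            fails[customer] += 1
    return {c: fails[c] / totals[c] for c in totals}
```

The output directly distinguishes "Customer X down, others normal" from "all customers affected", which is exactly the question management asked at 02:25.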


Dashboards That Support Decisions

If we had a decision-supporting dashboard at 02:25, it would show user impact, risk level, recommended actions. Not CPU, memory, error rate.

Dashboards must answer four questions:

  1. Is the system safe? (Risk assessment)
  2. Is the situation getting worse? (Trend analysis)
  3. Do we need action? (Action trigger)
  4. What action? (Decision support)

At 02:25, our dashboards answered none of these.
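The four questions can be answered from a couple of derived signals rather than dozens of raw metrics. A minimal sketch, where `error_budget_burn` and its trend stand in for whatever risk signals the system actually computes, and the thresholds and recommended actions are illustrative:

```python
def decision_view(error_budget_burn: float, trend: float) -> dict:
    """Answer the four dashboard questions from two derived signals."""
    safe = error_budget_burn < 0.5
    worsening = trend > 0
    act = (not safe) or (worsening and error_budget_burn >= 0.25)
    return {
        "safe": safe,                              # 1. Is the system safe?
        "worsening": worsening,                    # 2. Getting worse?
        "action_needed": act,                      # 3. Do we need action?
        "action": ("rollback" if act and worsening # 4. What action?
                   else "investigate" if act
                   else "none"),
    }
```

The design point: the dashboard's top row should be this dict, with the raw metrics one click below it.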

Tactical monitoring (real-time, action-bound, decision-focused) is the core. Exploratory monitoring (historical, analytical, insight-focused) is secondary. During an incident you need tactical. We had exploratory.

Design dashboards for incident conditions, not steady state. During incidents you need fast decisions, clear actions, risk assessment. Steady-state value is secondary.


Why Engineers Stop Looking at Dashboards

At 02:40, engineers started SSHing into machines. They bypassed dashboards entirely because dashboards didn’t help.

Trust collapse

Monitoring is a trust system. Trust collapse symptoms: too many false positives, a single false negative (missed real incident), data delay, dashboards inconsistent with experience, missing data during incidents.

We had multiple: dashboards green while users were down, no user impact data, missed the real problem. Once trust collapses, engineers bypass monitoring. It becomes useless.

Trust is engineered, not declared

  • Trust budget: allocate error budgets for false positives, false negatives, delay
  • Trust transparency: show data freshness, confidence scores, error rates
  • Trust recovery: acknowledge false positives, fix root causes, communicate improvements
  • Trust monitoring: track mute rate, bypass rate, engineer investigation patterns

We had none of this. Trust collapsed silently.
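Trust monitoring in particular is easy to start: three ratios computed from counts the team already has. A minimal sketch, with hypothetical field names:

```python
def trust_metrics(alerts_fired: int, alerts_muted: int,
                  incidents: int, incidents_caught: int,
                  investigations: int, via_dashboard: int) -> dict:
    """Quantify trust in monitoring: mute rate, false-negative rate,
    and bypass rate (investigations done outside the dashboards)."""
    return {
        "mute_rate": alerts_muted / alerts_fired,
        "miss_rate": 1 - incidents_caught / incidents,
        "bypass_rate": 1 - via_dashboard / investigations,
    }
```

A rising bypass rate is the earliest observable symptom of the trust collapse described above: engineers vote with their SSH sessions long before they say anything.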

Ownership

No owner = doesn’t exist. Dashboards existed, but nobody owned them. They decayed. Nobody maintained them. Nobody trusted them.

Engineers bypass monitoring when it doesn’t help, when it’s unreliable, when it’s slow, or when only senior engineers can interpret it. In our incident, all four applied.


Risk-Driven Monitoring

If we’d designed monitoring around risk, we would have monitored connection pool saturation. Instead, we monitored CPU, memory, error rate, because those are easy.

Start from failure modes. What can go wrong? Connection pool exhaustion, dependency timeouts, request saturation. What signals indicate these risks? Pool utilization, dependency latency, queue depth. Prioritize by impact and probability. Validate: can we detect this? Do alerts trigger correct actions?

We skipped all of this. We monitored what was easy, not what mattered. If we’d monitored connection pool saturation, we would have found the problem before users noticed.
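The failure-mode exercise above can be kept as a small, living artifact. A minimal sketch that ranks failure modes by expected impact (impact x probability); the example modes and scores are illustrative:

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    signal: str        # what metric would indicate this risk
    impact: float      # 0..1 business impact if it happens
    probability: float # 0..1 estimated likelihood

def prioritize(modes: list) -> list:
    """Rank failure modes by expected impact; monitor the top ones
    first, instead of whatever is easiest to measure."""
    return sorted(modes, key=lambda m: m.impact * m.probability,
                  reverse=True)

modes = [
    FailureMode("cpu spike", "cpu_pct", impact=0.2, probability=0.5),
    FailureMode("pool exhaustion", "pool_utilization",
                impact=0.9, probability=0.3),
]
```

Run against these numbers, connection pool exhaustion outranks the CPU spike, which is exactly the inversion of what we actually monitored.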

Six ways monitoring dies

  1. Exists but untrusted: engineers bypass it (our incident: yes)
  2. Alert noise collapse: alert storms, everything muted (our incident: no)
  3. System fragility: monitoring crashes with production (our incident: nearly)
  4. Can’t answer decision questions: raw metrics, no risk framework (our incident: yes)
  5. Dies with production: shared infrastructure, no independence (our incident: close)
  6. Monitoring diverges from reality: green dashboards, users down (our incident: yes)

| Death Mode | Symptoms | Root Cause | Prevention |
|---|---|---|---|
| Untrusted | Engineers SSH instead | Trust collapse, no ownership | Trust engineering, clear ownership |
| Alert noise | Alert storms, alerts muted | No filtering, no grouping | Alert governance, confidence thresholds |
| Fragile system | Prometheus OOM, slow queries | Coupled with app, shared failure domain | Independent infra, cardinality governance |
| No decision support | Metrics exist, can't decide | Only raw metrics, no risk framework | Risk-driven design, SLO framework |
| Dies with production | Monitoring crashes too | Level 0 survivability | Independent failure domains |
| Diverges from reality | Green but users down | Wrong SLO, happy path bias | Real user metrics, correct SLO |

Where Will It Die?

Every failure in our incident traces back to one root cause: monitoring was not designed for organizational decision capability under uncertainty.

Trust collapse: engineers stop relying on it. Survivability failure: dies with production. Risk blindness: monitoring wrong things. Decision incapability: can’t answer questions. Control loop broken: no action, no feedback. Cognitive stack incomplete: missing interpretation and decision layers.

The sole purpose of monitoring systems is to maintain organizational action capability under uncertainty. Not data visibility. Not debugging. Not metric collection.

Build decision capability. Not dashboards.