When every dashboard is green but users are already experiencing an outage, you’re facing a control-system failure, not an observability problem.
The Incident
02:17. A major customer reports widespread request timeouts in the support channel.
02:19. On-call gets paged.
02:21. Grafana opens. All green. CPU normal. Memory normal. Error rate near zero. The system looks completely healthy.
02:25. More customers report timeouts. Business leaders join the call: How many customers are affected? What’s the scope of the impact? Do we need to roll back? How long until recovery?
No one can answer. The monitoring systems exist, but they provide nothing the team can use to make these decisions.