sre
Reliability among uncertainty.
Systems fail in ways we do not expect. Yet we still predict. Practices evolve faster than documentation. Yet we still write. We think about what's next. Yet we respond to right now.
And while so much other research most certainly arrives with word-stuffed pages, as if more words mean more learning, we chose the uncertain opposite. That is, the strength of this report comes from its quiet simplicity, its restraint, and its lack of distraction. Each insight was written not to impress, but to simply present.
After eight years of tracing reliability’s arc, the view feels complete enough to pause and look back before seeing how far the boundaries have widened. Reliability is no longer only about sustaining uptime (was it ever?). It has moved from reliability to resilience, from uptime to experience, from toil to intelligence, from tools to strategy, and from systems to people.
There are still no certainties, but there is progress. And that remains enough reason to keep building.
there will never be another outage again // featuring Alexis Gay: https://www.instagram.com/yayalexisgay // more krazam scenes: https://www.patreon.com/...
(this is also posted on O’Reilly’s Radar blog. Much thanks to Daniel Schauenberg, Morgan Evans, and Steven Shorrock for feedback on this) Before I begin this post, let me say that this is intended to be a critique of the Five Whys method, not a criticism of the people who are in favor of using…
Read how Google is using System Theoretic Process Analysis (STPA) to analyze pure software systems and discover risks.