incident management
Show a public status page for configured Nagios hosts and services - pgmac-net/nagios-public-status-page
(This is also posted on O’Reilly’s Radar blog. Many thanks to Daniel Schauenberg, Morgan Evans, and Steven Shorrock for feedback on this.) Before I begin this post, let me say that this is intended to be a critique of the Five Whys method, not a criticism of the people who are in favor of using…
Cloudflare suffered a service outage on November 18, 2025. The outage was triggered by a bug in the generation logic for a Bot Management feature file, which affected many Cloudflare services.
Well, looks like we have a dumpster fire on DynamoDB in us-east-1 again.
View the overall status and health of AWS services using the AWS Health Dashboard.
AWS introduces a new service to streamline security event response, providing automated triage, coordinated communication, and expert guidance to recover from cybersecurity threats.
Read how Google is using System Theoretic Process Analysis (STPA) to analyze pure software systems and discover risks.