Engineering posts about Alerting
Curated summaries and key learnings for engineers working with Alerting.
Using observability data to prevent incidents
The article argues for using observability data to move from reactive incident response to proactive reliability intelligence. It outlines how engineering teams can leverage...
Monitoring reliably at scale
The article outlines the challenges of keeping observability reliable in systems that depend heavily on shared infrastructure, such as Kubernetes and service meshes. It highlights the...
Trust But Canary: Configuration Safety at Scale
This Meta Tech Podcast episode, featuring Pascal Hartig, covers the strategies Meta's Configurations team uses to roll out configuration changes safely at scale. The...
How we catch and mitigate performance regressions at scale in Jira Cloud
The article discusses the complexities of detecting and mitigating performance regressions in Jira Cloud, a multi-tenant product. It highlights the challenges posed by diverse tenant configurations...
It Wasn’t a Culture Problem: Upleveling Alert Development at Airbnb
The article outlines how Airbnb transformed its Observability as Code (OaC) alert review process, cutting development cycles from weeks to minutes. By implementing a system...