Engineering posts about Metrics
Curated summaries and key learnings for engineers working with Metrics.
Using observability data to prevent incidents
The article argues for using observability data to move from reactive incident response to proactive reliability intelligence. It outlines how engineering teams can leverage...
Observability for any agent, anywhere: Production-ready tracing with OpenTelemetry & Unity Catalog on Databricks
The article discusses the challenges of traditional observability tools in managing the massive volumes of trace data generated by AI agents. It presents a solution through Databricks' integration...
Monitoring reliably at scale
The article outlines the challenges of maintaining reliable observability in systems that depend heavily on shared infrastructure, such as Kubernetes and service meshes. It highlights the...
Building a fault-tolerant metrics storage system at Airbnb
The article details Airbnb's development of a high-throughput metrics storage system capable of ingesting 50 million samples per second and managing 2.5 petabytes of data. It outlines the challenges...
Building a high-volume metrics pipeline with OpenTelemetry and vmagent
This article outlines a comprehensive approach to migrating a high-volume metrics pipeline from StatsD to OpenTelemetry and Prometheus. It discusses the challenges faced during the migration, such as...
From Custom to Open: Scalable Network Probing and HTTP/3 Readiness with Prometheus
The article outlines Slack's transition to HTTP/3 and the challenges caused by the lack of client-side observability in existing monitoring tools. It highlights the development of QUIC support...
From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership
The article outlines Airbnb's transition from a vendor-managed observability platform to a custom in-house solution built on open-source technology, specifically Prometheus. It details the challenges...
It Wasn’t a Culture Problem: Upleveling Alert Development at Airbnb
The article describes how Airbnb transformed its Observability as Code (OaC) alert review process, cutting development cycles from weeks to minutes. By implementing a system...