Incident Summary: 2023-09-07
Summary
The article outlines a significant outage experienced by Square on September 7, 2023, detailing the timeline of events, the root cause related to DNS server failures, and the subsequent recovery efforts. It highlights the interplay between firewall policy changes and DNS service upgrades that led to service disruptions across multiple data centers. The article emphasizes the importance of robust infrastructure and the need for improved monitoring and isolation strategies to prevent future incidents.
Key Learnings
1. Understanding the critical role of DNS in service availability and the impact of firewall policies on DNS performance.
2. The necessity of having a resilient infrastructure that can handle unexpected loads and failures.
3. The importance of effective incident response strategies, including timely communication and recovery processes.
4. Identifying opportunities for infrastructure improvements based on incident analysis to enhance future service reliability.
Who Should Read This
Senior Site Reliability Engineers analyzing incident response strategies and infrastructure resilience.
Test Your Knowledge
What specific changes to the firewall policies contributed to the DNS server failures during the incident?
How could the incident response team have improved their initial diagnosis of the DNS issues?
What architectural changes are being proposed to isolate DNS infrastructure from other services?
In what ways can monitoring be enhanced to detect similar issues before they lead to outages?
What lessons can be learned about the interdependencies between different services in a microservices architecture?