What Datadog’s Outage Taught Us About Hidden Dependencies
A subtle DNS change triggered cascading failures, but the recovery paved the way for stronger, more fault-tolerant systems
🚀 TL;DR
On September 24, 2020, Datadog’s US region suffered a multi-hour outage due to a failure in its service discovery system, a core component that lets services find their dependencies. A routine change to a latency-measuring cluster triggered a thundering herd of DNS requests, overloading the system and breaking service discovery across Datadog’s infrastructure.
📌 The Impact: What Went Down
🔻 Web tier & API – 9+ hours of degraded access
🔻 Logs & monitoring – Outages up to 12 hours
🔻 Alerts & APM – Extended failures, up to 15 hours
🔻 Infrastructure monitoring – Fully recovered after 15+ hours
Despite the disruption, incoming data was still being processed, but users couldn't access it.
🔍 What Happened?
This outage was not due to a security breach or infrastructure failure. Instead, it was caused by a subtle configuration mistake made a month earlier:
🚨 August: A change moved service discovery queries from a static file (resilient, but slow to update) to a dynamic DNS resolver (fast, but failure-prone); the two approaches are sketched after this list.
🚨 September 24: A routine restart of a small, low-priority cluster caused a surge of DNS queries that overwhelmed the service discovery system.
🚨 Service discovery failed → Services couldn’t find dependencies → Outage spread rapidly.
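To make the change concrete, here's a minimal Python sketch of the two lookup styles. The file path, naming scheme, and function names are illustrative assumptions, not Datadog's actual configuration.

```python
import socket

STATIC_ENDPOINTS_FILE = "/etc/service-endpoints.txt"  # hypothetical path


def resolve_static(service_name: str) -> list:
    """Old approach: read endpoints from a static file on the host.
    Slow to update, but it keeps working even if DNS is unhealthy."""
    endpoints = []
    with open(STATIC_ENDPOINTS_FILE) as f:
        for line in f:
            name, _, addr = line.strip().partition(" ")
            if name == service_name:
                endpoints.append(addr)
    return endpoints


def resolve_dynamic(service_name: str) -> list:
    """New approach: query the service discovery DNS resolver on every lookup.
    Always fresh, but every caller now depends on the resolver being healthy."""
    _, _, addrs = socket.gethostbyname_ex(f"{service_name}.service.internal")
    return addrs
```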
🛑 Root Cause: A Tiny Change With Big Consequences
1️⃣ Overloaded Service Discovery
Datadog relies on a distributed service discovery cluster so services can locate their dependencies.
The DNS resolver wasn't caching NXDOMAIN responses (i.e., lookups for services that don't exist), so every miss hit the resolver again (see the sketch below).
A routine restart flooded the system with unnecessary DNS lookups, amplifying the failure.
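Negative caching is the standard remedy here: remember for a short TTL that a name doesn't exist, so repeated misses don't hammer the resolver. A minimal sketch, with an assumed in-process cache and TTL (in practice the fix would typically live in the resolver itself):

```python
import socket
import time
from typing import Optional

NEGATIVE_TTL_SECONDS = 30   # assumed TTL for "this name doesn't exist"
_negative_cache = {}        # name -> expiry timestamp


def resolve_with_negative_cache(name: str) -> Optional[list]:
    now = time.monotonic()
    expiry = _negative_cache.get(name)
    if expiry is not None and now < expiry:
        return None  # cached NXDOMAIN: skip the lookup entirely

    try:
        _, _, addrs = socket.gethostbyname_ex(name)
        return addrs
    except socket.gaierror:
        # The name didn't resolve (e.g. NXDOMAIN). Cache the negative result
        # so thousands of callers don't keep re-asking for a missing service.
        _negative_cache[name] = now + NEGATIVE_TTL_SECONDS
        return None
```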
2️⃣ Cascading Failures
As more services failed, they kept retrying, creating a feedback loop that brought down the web tier, alerts, and monitoring tools (see the backoff sketch after this list).
Many components couldn't start because they relied on dynamic configuration, which was unavailable.
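One common way to keep retries from turning into a synchronized storm is exponential backoff with jitter. A minimal sketch; resolve() stands in for any service discovery call, and the delays are assumptions, not Datadog's actual retry policy:

```python
import random
import time


def resolve_with_backoff(resolve, name: str, max_attempts: int = 5):
    """Retry a lookup with exponential backoff and full jitter."""
    delay = 0.1  # initial backoff in seconds (illustrative)
    for attempt in range(1, max_attempts + 1):
        try:
            return resolve(name)
        except Exception:
            if attempt == max_attempts:
                raise
            # Each client sleeps a different random amount, so a fleet of
            # restarting services doesn't re-query the resolver in lockstep.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, 5.0)  # cap the backoff (illustrative)
```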
3️⃣ Lack of Fallback Mechanisms
Had fallback service discovery mechanisms (like static files) remained in place, some services could have continued operating.
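That fallback pattern is straightforward to layer on top of dynamic resolution: prefer DNS, but keep a last-known-good snapshot on disk for when the resolver is overloaded. A minimal sketch; the snapshot path and naming scheme are illustrative assumptions:

```python
import json
import socket

SNAPSHOT_FILE = "/var/cache/service-endpoints.json"  # hypothetical path


def resolve_with_fallback(service_name: str) -> list:
    host = f"{service_name}.service.internal"  # hypothetical naming scheme
    try:
        _, _, addrs = socket.gethostbyname_ex(host)
    except socket.gaierror:
        # Resolver unavailable or overloaded: fall back to the last-known-good
        # snapshot instead of failing outright.
        try:
            with open(SNAPSHOT_FILE) as f:
                return json.load(f).get(service_name, [])
        except FileNotFoundError:
            return []

    # DNS is healthy: refresh the snapshot for future fallbacks.
    try:
        with open(SNAPSHOT_FILE) as f:
            snapshot = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        snapshot = {}
    snapshot[service_name] = addrs
    with open(SNAPSHOT_FILE, "w") as f:
        json.dump(snapshot, f)
    return addrs
```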
🤔 Lessons Learned