When a Foreign Key Crashed Heroku

How a Simple Schema Bug Took Down Heroku, and How the Immediate Fix Made It Worse

Jun 01, 2025

TL;DR

On June 8, 2023, Heroku suffered a multi-hour outage that broke every customer-facing surface (the Dashboard, CLI, and API) while customer apps continued running.

The cause? A small oversight in database schema design.

A foreign key field used a data type too small to reference the primary key it was linked to. This mismatch worked for a long time until the primary key exceeded the limit. Then everything broke. And fixing it triggered a second outage.

It’s a case study in how one small misalignment in your database schema can take down a platform.
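A misalignment like this can be looked for before it causes trouble. The sketch below is one possible way to flag foreign key columns that are narrower than the keys they reference. The `ForeignKey` record, the `too_narrow` helper, and the table and column names are all hypothetical, chosen for illustration; in practice the type information would come from the database catalog rather than a hand-written list.

```python
from dataclasses import dataclass

# Storage width in bytes of PostgreSQL's integer types.
WIDTH = {"smallint": 2, "integer": 4, "bigint": 8}


@dataclass(frozen=True)
class ForeignKey:
    child: str        # "table.column" that holds the reference
    child_type: str
    parent: str       # "table.column" being referenced
    parent_type: str


def too_narrow(fks):
    """Return foreign keys whose column is narrower than the key it references."""
    return [fk for fk in fks if WIDTH[fk.child_type] < WIDTH[fk.parent_type]]


# Hypothetical schema description, not Heroku's real tables.
schema = [
    ForeignKey("orders.customer_id", "bigint", "customers.id", "bigint"),
    ForeignKey("tokens.grant_id", "integer", "grants.id", "bigint"),
]

risky = too_narrow(schema)
for fk in risky:
    print(f"{fk.child} ({fk.child_type}) -> {fk.parent} ({fk.parent_type})")
```

Only the second entry is reported, because its INTEGER column cannot hold every value its BIGINT parent key can produce.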

🚨 Incident Summary

Duration: 15:06 – 18:57 UTC, June 8, 2023
Impact:

1h45m: No new deployments or logins
2h3m: Full API outage (CLI, Dashboard, slugs, webhooks — all unusable)
Running customer applications were not affected

🔍 What Happened?

At 15:05 UTC, writes to a PostgreSQL foreign key column began failing because of a data type mismatch: the column was defined with a smaller type (e.g., INTEGER) than the primary key it referenced (e.g., BIGINT). A 32-bit INTEGER column tops out at 2,147,483,647, while a BIGINT key can keep growing far past that. Once the primary key values crossed the threshold, new references could no longer be stored, so creating new authorizations (and by extension, deploying) failed.
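To make the failure mode concrete, here is a minimal simulation, assuming the INTEGER and BIGINT types the text gives as examples. It does not talk to a database: `store_in_int4`, `create_authorization`, and `IntegerOutOfRange` are hypothetical stand-ins for an insert into an INTEGER column and the out-of-range error PostgreSQL would raise.

```python
# Upper bounds of PostgreSQL's 4-byte INTEGER and 8-byte BIGINT types.
INT4_MAX = 2**31 - 1   # 2,147,483,647
INT8_MAX = 2**63 - 1


class IntegerOutOfRange(Exception):
    """Stands in for PostgreSQL's "integer out of range" error."""


def store_in_int4(value: int) -> int:
    """Mimic writing a value into an INTEGER column."""
    if not -(2**31) <= value <= INT4_MAX:
        raise IntegerOutOfRange(f"{value} does not fit in INTEGER")
    return value


def create_authorization(parent_id: int) -> dict:
    """Hypothetical child-row insert: parent_id comes from a BIGINT key,
    but the child's foreign key column is only INTEGER."""
    return {"parent_id": store_in_int4(parent_id)}


# Works for as long as the parent keys stay small...
ok = create_authorization(INT4_MAX)

# ...then the very next key fails, although BIGINT holds it easily.
try:
    create_authorization(INT4_MAX + 1)
    failed = False
except IntegerOutOfRange:
    failed = True

print(ok, failed)
```

Nothing about the schema changes between the last successful insert and the first failing one, which is why a mismatch like this can sit unnoticed for years.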

The on-call team responded quickly with a migration to update the foreign key to the correct type, but this schema change had a hidden cost: it wiped internal Postgres statistics, which the query planner uses to make execution decisions.

Without those stats, query performance collapsed. The API, already in recovery mode, fell over completely. Engineers had to switch the system into read-only mode to stabilize it while they rebuilt the stats.
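The two steps described above, widening the column and then rebuilding statistics, map onto two standard PostgreSQL statements. The sketch below only builds the statements in order; it is not Heroku's actual migration, and `widen_fk_plan`, `child_table`, and `parent_id` are hypothetical names used for illustration.

```python
def widen_fk_plan(table: str, column: str) -> list[str]:
    """Return, in order, the SQL to widen a foreign key column to BIGINT
    and then refresh the planner statistics for its table.

    The names are interpolated directly, so they must be trusted
    identifiers and never user input.
    """
    return [
        # Changing the column type discards the statistics kept for it.
        f"ALTER TABLE {table} ALTER COLUMN {column} TYPE bigint;",
        # ANALYZE recollects the statistics the query planner relies on.
        f"ANALYZE {table};",
    ]


# Hypothetical table and column names.
plan = widen_fk_plan("child_table", "parent_id")
for statement in plan:
    print(statement)
```

Treating the `ANALYZE` as part of the migration, rather than as a follow-up, is one way to avoid leaving the planner without statistics between the two steps.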

By 18:24 UTC, queries returned to normal. At 18:57 UTC, the API was back in full read/write mode and the incident was closed.

🧠 Root Causes
