The 2.1 Billion Problem: How a Single Integer Broke Heroku's API
Inside the 4-Hour Heroku Outage: The Critical Lesson on Integer Overflow, Schema Drift, and the Hidden Danger of Database Statistics
TL;DR
Heroku’s API went dark for just under 4 hours because a foreign key column was a 32-bit integer while the primary key it referenced was 64-bit. When the counter passed 2,147,483,647 (about 2.1 billion), new authorizations started failing. The engineers ran a migration to widen the column, which worked but cleared Postgres’s planner statistics for it, and slow queries then took down the rest of the API. Running apps stayed up; everything else died.
What Went Down
Somewhere in Heroku’s database, a primary key was happily incrementing as a bigint (64-bit). A foreign key pointing to it was using a regular int (32-bit).
This went unnoticed for years until the primary key exceeded 2.1 billion and the foreign key couldn’t keep up. Integer overflow. Auth system down. Customers locked out.
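To make the ceiling concrete, here is a minimal Python sketch of the 32-bit limit and of a check that would flag this kind of type drift ahead of time. The table and column names (`authorizations`, `account_id`, `accounts`, `tokens`) are hypothetical stand-ins, not Heroku's real schema, and the schema list is hand-written rather than read from a live database. Note that Postgres rejects an out-of-range value with an "integer out of range" error rather than silently wrapping it.

```python
from typing import NamedTuple

INT32_MAX = 2**31 - 1  # 2,147,483,647: upper bound of a Postgres `integer`
INT64_MAX = 2**63 - 1  # upper bound of a Postgres `bigint`


def fits_int32(value: int) -> bool:
    """Return True if value can be stored in a signed 32-bit column."""
    return -2**31 <= value <= INT32_MAX


class ForeignKey(NamedTuple):
    # All names here are hypothetical; they only illustrate the shape of the data.
    table: str
    column: str
    column_type: str
    ref_table: str
    ref_column: str
    ref_type: str


def find_type_drift(foreign_keys):
    """Return the foreign keys whose type differs from the key they reference."""
    return [fk for fk in foreign_keys if fk.column_type != fk.ref_type]


schema = [
    ForeignKey("authorizations", "account_id", "integer", "accounts", "id", "bigint"),
    ForeignKey("tokens", "account_id", "bigint", "accounts", "id", "bigint"),
]

print(fits_int32(INT32_MAX))      # last id the 32-bit column can hold
print(fits_int32(INT32_MAX + 1))  # first id that no longer fits
for fk in find_type_drift(schema):
    print(f"{fk.table}.{fk.column} is {fk.column_type}, "
          f"but {fk.ref_table}.{fk.ref_column} is {fk.ref_type}")
```

In a real setup the `schema` list would presumably be built from the database catalog; that part is left out here because it depends on the environment.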
On-call engineers wrote a migration to upsize the foreign key to match. The migration ran successfully and new authorizations started working again. Crisis averted.
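The exact migration Heroku ran is not public in this write-up, so the following is only a sketch of what widening a column generally looks like in Postgres. The helper `widen_column_sql` and the names `authorizations` and `account_id` are assumptions made for illustration. Changing `integer` to `bigint` rewrites the table under an exclusive lock, which is worth keeping in mind on a large table.

```python
import re

# Conservative identifier check so the sketch never interpolates arbitrary text.
_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def widen_column_sql(table: str, column: str) -> str:
    """Build the ALTER TABLE statement that widens a column to bigint."""
    for name in (table, column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"ALTER TABLE {table} ALTER COLUMN {column} TYPE bigint;"


# Hypothetical table and column names, not Heroku's actual schema.
print(widen_column_sql("authorizations", "account_id"))
```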
Except it wasn’t. Altering that column wiped the statistics Postgres keeps for it, which is the data the query optimizer uses to plan efficient queries. Without those stats, the planner fell back on guesses, and queries that normally took milliseconds started taking seconds. The partial outage became a complete API failure.
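One way to guard against the missing-statistics trap is to make sure every type-changing `ALTER` in a migration is followed by an `ANALYZE` on the same table, since `ANALYZE` is the Postgres command that rebuilds planner statistics. The sketch below is a hypothetical migration lint, not something Heroku described using; the function name `with_analyze` and the sample statement are assumptions, and the regular expression only covers the simple `ALTER TABLE ... ALTER COLUMN ... TYPE` form.

```python
import re

# Matches "ALTER TABLE <table> ALTER COLUMN <col> [SET DATA] TYPE ..."
ALTER_TYPE = re.compile(
    r"ALTER\s+TABLE\s+(\w+)\s+ALTER\s+COLUMN\s+\w+\s+(?:SET\s+DATA\s+)?TYPE\b",
    re.IGNORECASE,
)


def with_analyze(statements):
    """Return the statements with ANALYZE added after each column type change."""
    plan = []
    for statement in statements:
        plan.append(statement)
        match = ALTER_TYPE.search(statement)
        if match:
            # Refresh planner statistics for the table that was just altered.
            plan.append(f"ANALYZE {match.group(1)};")
    return plan


# Hypothetical migration; the table and column names are illustrative only.
migration = ["ALTER TABLE authorizations ALTER COLUMN account_id TYPE bigint;"]
for statement in with_analyze(migration):
    print(statement)
```

The design choice here is to treat `ANALYZE` as part of the migration itself rather than a separate follow-up step, so it cannot be forgotten under incident pressure.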
They put the API in read-only mode, fixed the statistics, monitored everything, and gradually brought the system back up. Total time down: just under 4 hours.


