Byte-Sized Design

The 2.1 Billion Problem: How a Single Integer Broke Heroku's API

Inside the 4-Hour Heroku Outage: The Critical Lesson on Integer Overflow, Schema Drift, and the Hidden Danger of Database Statistics

Byte-Sized Design
Dec 23, 2025

TL;DR

Heroku’s API went dark for 4 hours because a foreign key used int32 while its primary key was int64. When the counter hit 2.1 billion, everything broke. The engineers ran a migration to fix it, which worked but cleared Postgres’s query statistics and made everything worse. Running apps stayed up; everything else died.


What Went Down

Somewhere in Heroku’s database, a primary key was happily incrementing as a bigint. A foreign key pointing to it was using a regular int.
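This kind of type mismatch can be caught by comparing each foreign key's type against the key it references. Here is a minimal sketch of that check, assuming the column metadata has already been pulled out of the database catalog. The `ForeignKey` record and the table and column names are hypothetical and do not come from Heroku's schema.

```python
from dataclasses import dataclass

# Bit widths of PostgreSQL's integer types.
INT_WIDTH_BITS = {"smallint": 16, "integer": 32, "bigint": 64}


@dataclass(frozen=True)
class ForeignKey:
    """Hypothetical record describing one foreign key and the key it references."""
    table: str
    column: str
    column_type: str
    ref_table: str
    ref_column: str
    ref_type: str


def narrower_foreign_keys(fks):
    """Return the foreign keys whose type is narrower than the referenced key's type."""
    return [
        fk for fk in fks
        if INT_WIDTH_BITS[fk.column_type] < INT_WIDTH_BITS[fk.ref_type]
    ]


# Illustrative schema only: these names are made up for the example.
schema = [
    ForeignKey("authorizations", "token_id", "integer", "tokens", "id", "bigint"),
    ForeignKey("tokens", "user_id", "bigint", "users", "id", "bigint"),
]

for fk in narrower_foreign_keys(schema):
    print(f"{fk.table}.{fk.column} ({fk.column_type}) is narrower than "
          f"{fk.ref_table}.{fk.ref_column} ({fk.ref_type})")
```

Run against real catalog data in CI, a check like this would flag the drift long before the counter gets anywhere near the limit.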

This went unnoticed for years until the primary key exceeded 2,147,483,647 (2^31 − 1, the largest value a signed 32-bit integer can hold) and the foreign key couldn’t keep up. Integer overflow. Auth system down. Customers locked out.
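To make the limit concrete, here is a small sketch that uses Python's `struct` module as a stand-in for a fixed-width database column. The helper names are illustrative; a database would reject the out-of-range insert with its own error rather than a `struct.error`.

```python
import struct

INT32_MAX = 2**31 - 1  # 2,147,483,647: the largest signed 32-bit value


def fits_in_int32(value: int) -> bool:
    """Return True if value can be stored in a signed 32-bit integer."""
    try:
        struct.pack("<i", value)  # "i" is a 4-byte signed integer
    except struct.error:
        return False
    return True


def fits_in_int64(value: int) -> bool:
    """Return True if value can be stored in a signed 64-bit integer."""
    try:
        struct.pack("<q", value)  # "q" is an 8-byte signed integer
    except struct.error:
        return False
    return True


next_id = INT32_MAX + 1  # the first ID that the narrower column cannot hold

print(fits_in_int32(INT32_MAX))  # True
print(fits_in_int32(next_id))    # False
print(fits_in_int64(next_id))    # True
```

The primary key keeps accepting new IDs because it has 64 bits to work with. Only writes that copy those IDs into the 32-bit column fail.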

On-call engineers wrote a migration to upsize the foreign key to match. The migration ran successfully and new authorizations started working again. Crisis averted.

Except it wasn’t. Altering that column wiped Postgres’s internal statistics for it, the data the query optimizer uses to plan efficient queries. Without those stats, queries that normally took milliseconds started taking seconds. The partial outage became a complete API failure.
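One way to guard against this is to treat the type change and the statistics refresh as a single unit of work. The sketch below generates both statements together. It assumes plain PostgreSQL `ALTER TABLE ... ALTER COLUMN ... TYPE` and `ANALYZE`; the helper and the table and column names are hypothetical, and the source does not say which commands Heroku's engineers actually ran.

```python
def widen_column_statements(table: str, column: str) -> list[str]:
    """Build the SQL to widen an integer column and then refresh planner statistics.

    Assumes trusted, simple identifiers; real code should quote them properly.
    """
    return [
        # Changing integer to bigint rewrites the table and discards the
        # column's existing planner statistics.
        f"ALTER TABLE {table} ALTER COLUMN {column} TYPE bigint;",
        # Recollect statistics right away so the planner is not left guessing.
        f"ANALYZE {table};",
    ]


# Hypothetical table and column names, for illustration only.
for statement in widen_column_statements("authorizations", "token_id"):
    print(statement)
```

Keeping the `ANALYZE` next to the `ALTER` in the same runbook step makes it harder to ship the first without the second.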

They put the API in read-only mode, fixed the statistics, monitored everything, and gradually brought the system back up. Total time down: just under 4 hours.

Senior Engineer Takeaways
