How GitLab Lost 300GB of Production Data and What We Can Learn
The Hard Lesson on Backups, Disaster Recovery, and Human Error
On January 31, 2017, GitLab.com went down hard. For hours, engineers scrambled to recover, only to realize their backup system had been broken for weeks. By the time the dust settled, they had lost roughly 5,000 projects, 700 user accounts, and six hours' worth of issues and comments.
The root cause? A single mistaken command that wiped their primary database.
In a post-mortem, GitLab detailed how they lost critical production data, the mistakes that led to the disaster, and the systemic failures that made recovery nearly impossible.
Let’s break it down:
🚀 TL;DR
GitLab.com suffered a major outage after an engineer accidentally deleted the primary database instead of a secondary replica. The disaster was compounded by broken backups, slow recovery tools, and a failure to catch issues before they escalated.
Here’s what we’ll cover:
1️⃣ 🛠️ What Happened? (The moment that wiped GitLab’s data)
2️⃣ 🛑 Root Causes (The silent failures that made recovery impossible)
3️⃣ 🤔 Lessons Learned (How GitLab changed its approach)
4️⃣ 🏗️ How to Protect Your Own Systems (Paid)
💡 More Insights & Recovery Strategies (Paid)
🛠️ What Happened?
It started as a routine maintenance task.
To test new database scaling techniques, a GitLab engineer needed to resynchronize a secondary PostgreSQL database. The process required wiping its data directory before restoring a fresh copy from the primary database.
🚨 The mistake: Instead of running the command on the secondary server, the engineer executed it on the primary database.
💾 300GB of production data vanished in seconds.
Panic set in. The team turned to their backups—only to find that their backup system had been failing silently for weeks.
With no working backups and a painfully slow recovery process, GitLab had no choice but to rebuild from a 6-hour-old snapshot, permanently losing all data created after that point.
🛑 Root Causes
1️⃣ No Protection Against Accidental Deletion
There was no safeguard to prevent engineers from running destructive commands on production systems.
The primary and secondary databases looked identical in the terminal, making it easy to confuse them.
2️⃣ Backups Had Been Failing Silently for Weeks
GitLab’s automated pg_dump backups never actually ran due to a misconfiguration (a PostgreSQL version mismatch caused pg_dump to fail silently on every run).
Backup failure alerts were sent via email, but those emails were silently rejected due to DMARC settings.
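To catch this kind of silent failure, treat backup verification as its own monitored job. Below is a minimal sketch (not GitLab’s actual tooling; the backup path, thresholds, and webhook endpoint are made-up placeholders) that checks whether a recent, non-trivial backup artifact exists and pages a webhook instead of relying on email:

```python
# Minimal backup-freshness check: alert if the newest backup is missing,
# stale, or suspiciously small. All paths and endpoints are hypothetical.
import os
import time
import glob
import urllib.request

BACKUP_DIR = "/var/backups/postgres"       # hypothetical backup location
MAX_AGE_SECONDS = 6 * 60 * 60              # expect a backup at least every 6 hours
MIN_SIZE_BYTES = 100 * 1024 * 1024         # a near-empty dump is as bad as no dump
ALERT_WEBHOOK = "https://alerts.example.com/hooks/backups"  # hypothetical endpoint

def latest_backup():
    candidates = glob.glob(os.path.join(BACKUP_DIR, "*.dump"))
    return max(candidates, key=os.path.getmtime) if candidates else None

def alert(message: str):
    # Push to a paging/chat webhook rather than email, which can fail silently.
    req = urllib.request.Request(
        ALERT_WEBHOOK,
        data=message.encode(),
        headers={"Content-Type": "text/plain"},
    )
    urllib.request.urlopen(req, timeout=10)

newest = latest_backup()
if newest is None:
    alert("No PostgreSQL backups found at all.")
elif time.time() - os.path.getmtime(newest) > MAX_AGE_SECONDS:
    alert(f"Latest backup {newest} is stale.")
elif os.path.getsize(newest) < MIN_SIZE_BYTES:
    alert(f"Latest backup {newest} is suspiciously small.")
```

Run it from a scheduler that itself alerts when the job fails to run, so the check can't disappear quietly the way the backups did.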
3️⃣ The Recovery Process Was Too Slow
Azure disk snapshots were available but took over 18 hours to restore due to slow disk speeds.
There was no clear, battle-tested disaster recovery process to follow.
4️⃣ Replication Lag Left No Secondary Backup
The secondary database was out of sync due to a spike in database load earlier that day.
When replication stopped, old transaction logs were purged, meaning the secondary couldn't be used as a failover.
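Replication lag is cheap to monitor and alert on. Here is a minimal sketch, assuming psycopg2, query access to the primary, and placeholder connection details and thresholds; it reads pg_stat_replication and complains when a standby falls too far behind or disappears entirely:

```python
# Minimal replication-lag monitor. Connection string and threshold are placeholders.
import psycopg2

LAG_THRESHOLD_BYTES = 256 * 1024 * 1024  # alert once a standby is >256MB behind

conn = psycopg2.connect("host=db-primary.internal dbname=postgres user=monitor")
with conn.cursor() as cur:
    # pg_wal_lsn_diff / pg_current_wal_lsn are PostgreSQL 10+ names;
    # older versions use pg_xlog_location_diff / pg_current_xlog_location.
    cur.execute("""
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM pg_stat_replication
    """)
    rows = cur.fetchall()

if not rows:
    print("ALERT: no standbys connected -- there is no failover target.")
for name, lag_bytes in rows:
    if lag_bytes is not None and int(lag_bytes) > LAG_THRESHOLD_BYTES:
        mb_behind = int(lag_bytes) / (1024 * 1024)
        print(f"ALERT: standby {name} is {mb_behind:.0f}MB behind the primary.")
```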
In short: GitLab lost production data because of a mix of human error, broken backups, and an untested recovery process.
🤔 Lessons Learned
GitLab’s failure highlights five critical lessons for any engineering team:
1️⃣ Your Backups Are Useless Until You Test Them
If you don’t regularly restore from backups, they might not work when you need them.
GitLab assumed their backups worked. They didn't.
2️⃣ Critical Systems Need Guardrails
Running destructive commands on production databases should require extra confirmation steps.
Engineers should see clear visual indicators showing whether they’re working on production or staging.
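One lightweight way to build such a guardrail is to make destructive tooling ask the target database what role it is playing before doing anything irreversible. The sketch below is illustrative rather than GitLab’s actual fix; the host names and the psycopg2 dependency are assumptions:

```python
# Guardrail sketch: refuse to wipe a node that reports itself as the primary,
# and require the operator to retype the host name before proceeding.
import sys
import psycopg2

def is_replica(host: str) -> bool:
    conn = psycopg2.connect(host=host, dbname="postgres", user="ops")
    with conn.cursor() as cur:
        # pg_is_in_recovery() is true on a standby, false on the primary.
        cur.execute("SELECT pg_is_in_recovery()")
        return cur.fetchone()[0]

def wipe_data_directory(host: str):
    if not is_replica(host):
        sys.exit(f"REFUSING to wipe {host}: it reports itself as the PRIMARY.")
    answer = input(f"About to delete the data directory on {host}. "
                   f"Type the host name to confirm: ")
    if answer != host:
        sys.exit("Confirmation did not match; aborting.")
    # ...only now perform the destructive step (removing the data directory)...
    print(f"Proceeding with wipe on {host}")

if __name__ == "__main__":
    wipe_data_directory(sys.argv[1])
```

Pair the check with unmistakable visual cues (for example, a red shell prompt on production hosts) so the safety net isn’t a single line of code.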
3️⃣ Recovery Speed Matters More Than Backup Frequency
Even when backups exist, a restore that takes 18+ hours will still hurt your business.
Solution: Invest in point-in-time recovery (PITR) instead of relying solely on daily snapshots.
4️⃣ Test Disaster Recovery Like You Test Your Code
Have a documented, rehearsed plan for recovering from catastrophic failure.
Run fire drills where engineers simulate total database loss and restore from backups.
Build automated chaos tests into your pipelines when and where possible. Read more here.
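A restore drill can be automated much like a test suite. The sketch below is one hypothetical version: it restores the latest dump into a throwaway database and runs basic sanity checks; the dump path, database names, and the projects table are placeholders:

```python
# Fire-drill sketch: restore the newest dump into a scratch database and
# verify the result looks plausible. All names and paths are hypothetical.
import subprocess
import psycopg2

LATEST_DUMP = "/var/backups/postgres/latest.dump"   # hypothetical artifact
SCRATCH_DB = "restore_drill"

# Recreate a scratch database and restore into it with pg_restore.
subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
subprocess.run(["createdb", SCRATCH_DB], check=True)
subprocess.run(["pg_restore", "--no-owner", "-d", SCRATCH_DB, LATEST_DUMP], check=True)

# Sanity checks: the restored data should not be empty or implausibly old.
conn = psycopg2.connect(dbname=SCRATCH_DB)
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM projects")      # hypothetical table
    assert cur.fetchone()[0] > 0, "Restored database has no projects."
    cur.execute("SELECT max(created_at) FROM projects")
    print("Newest restored record:", cur.fetchone()[0])
print("Restore drill passed.")
```

Time the drill end to end: the duration is your real recovery time, and it belongs on a dashboard, not in a wiki page nobody has tested.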
5️⃣ Redundancy Must Be Truly Redundant
A secondary database that isn’t truly in sync won’t save you in a crisis.
GitLab fixed this by adding multiple secondaries across different regions.
🏗️ How to Protect Your Own Systems
Many companies have similar risks lurking in their infrastructure. Here’s what you can do to avoid disaster: