How Slack Rebuilt a Critical System Without Stopping the World

How a Redis Bottleneck Froze Slack and What Engineers Can Learn from the Fix

Byte-Sized Design
Apr 02, 2025

Sponsor - The first newsletter for product engineers

Product for Engineers is PostHog’s newsletter dedicated to helping engineers improve their product skills. It features curated advice on building great products, lessons (and mistakes) from building PostHog, and research into the practices of top startups.


🚀 TL;DR

Slack processes over 1.4 billion jobs per day using a job queue system that powers everything from message posting to billing. But one day, Redis hit its memory cap, and the queue froze. This brought message delivery and Slack itself to a halt.

What followed was a re-architecture of a critical system, done without disrupting the business or rewriting everything from scratch. (Learn more about modernizing legacy systems without disturbing production in our previous edition.)

This edition breaks down what went wrong, how the queue system worked, and how you can design fault-tolerant job processing at scale.


📖 What Will We Dive Into Today?

  • Why Slack Relied So Heavily on a Redis Queue

  • How the Old System Worked

  • How It Broke: Redis and the Bottleneck Effect

  • What Slack Learned

  • How Slack Fixed Their Problems

  • How You Can Stress-Test Your Own Job Queue Design (Paid)

  • Slack’s Zero-Downtime Migration Strategy (Paid)

  • Designing for Failure, Not Just Scale (Paid)


📬 Why Slack Uses a Job Queue

If Slack were a kitchen, the web server would be the waiter and the job queue the chef. When you post a message, the waiter writes down the order (job), hands it to the kitchen, and goes back to serving others.

The kitchen handles the heavy lifting:

  • Posting messages

  • Sending push notifications

  • Unfurling URLs

  • Triggering calendar events

  • Running billing calculations

These tasks are too slow and unpredictable to handle inline. That’s why Slack pushes them to a queue. At peak, this system handles 33,000 jobs per second, and over a full day it processes more than 1.4 billion jobs.

But speed is only part of the story. The real challenge is ensuring the system doesn’t collapse under pressure.


🧱 How the Old System Worked

Slack’s original job queue architecture was built around Redis. It was reliable and fast, and it scaled until it didn’t. Here's how jobs flowed through it:

A web request would create a job ID based on the job’s type and arguments. Slack then hashed that ID and picked a Redis host. Redis stored the job in a pending queue after checking for duplicates.
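Here’s a minimal sketch of that enqueue path in Python with redis-py. The host names, key names, and the dedup set are illustrative assumptions, not Slack’s actual code:

```python
import hashlib
import json
import redis

# Hypothetical Redis shards for the job queue.
REDIS_HOSTS = [
    redis.Redis(host="queue-redis-1"),
    redis.Redis(host="queue-redis-2"),
    redis.Redis(host="queue-redis-3"),
]

def enqueue(job_type: str, args: dict) -> bool:
    # The job ID is derived from the job's type and arguments, so identical
    # work hashes to the same ID (and lands on the same Redis host).
    job_id = hashlib.sha256(
        f"{job_type}:{json.dumps(args, sort_keys=True)}".encode()
    ).hexdigest()
    shard = REDIS_HOSTS[int(job_id, 16) % len(REDIS_HOSTS)]

    # Deduplicate: only enqueue if this job ID isn't already pending.
    if not shard.sadd("pending:ids", job_id):
        return False  # duplicate, already queued

    shard.lpush("pending:jobs", json.dumps({"id": job_id, "type": job_type, "args": args}))
    return True
```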

Meanwhile, a fleet of worker machines continuously polled Redis. When they found a job, they moved it to an "in-flight" list and ran it asynchronously. If the job completed successfully, it was removed. If it failed, it went into a retry queue. Eventually, if it failed too many times, it was sent to a manually inspected "permanent failure" list.
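A sketch of that worker loop, under the same assumed key names: the job moves atomically from pending to in-flight, is removed on success, goes to a retry queue on failure, and is parked in a permanent-failure list once it has failed too many times (the retry cap here is made up):

```python
import json
import redis

r = redis.Redis(host="queue-redis-1")
MAX_ATTEMPTS = 5  # illustrative retry cap

def run_job(job: dict) -> None:
    ...  # dispatch to the handler for job["type"]

def worker_loop() -> None:
    while True:
        # Atomically move the next job onto the in-flight list, so a crash
        # between dequeue and execution can't lose it.
        raw = r.brpoplpush("pending:jobs", "inflight:jobs", timeout=5)
        if raw is None:
            continue
        job = json.loads(raw)
        try:
            run_job(job)
            r.lrem("inflight:jobs", 1, raw)   # success: drop the in-flight copy
            r.srem("pending:ids", job["id"])  # allow this job ID to be enqueued again
        except Exception:
            r.lrem("inflight:jobs", 1, raw)
            job["attempts"] = job.get("attempts", 0) + 1
            # Too many failures -> the manually inspected permanent-failure list.
            target = "retry:jobs" if job["attempts"] < MAX_ATTEMPTS else "failed:jobs"
            r.lpush(target, json.dumps(job))
```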

It was clean, logical, and it worked for years. Until a single outage showed its limits.


⚠️ What Broke: Redis as the Bottleneck

The failure came from a subtle direction: resource contention in Slack’s database layer. Jobs slowed down. Workers couldn’t keep up.

As job execution times grew, the number of jobs waiting in Redis spiked. Redis filled up completely. And then came the kicker:

Even dequeuing a job required Redis to have free memory, because a dequeue copies the job onto the in-flight list. So Slack couldn’t enqueue or dequeue jobs. The entire system jammed.

This was a wake-up call. Redis hadn’t failed. Slack had outgrown it.


🧠 What Slack Learned (And You Should Too)

The team took a hard look at the architecture. Scaling by adding more Redis nodes or worker machines wasn’t going to cut it anymore. Several key issues emerged:

  • Redis had no buffer. If enqueues outpaced dequeues, it choked.

  • The system relied on a bipartite mesh of connections where every worker talked to every Redis host.

  • There was no global backpressure. Each service pushed jobs into the queue as fast as it wanted.

This architecture assumed Redis would always be fast and memory would always be available. That assumption no longer held at scale.


🔧 The Fix: Resilience Without Rewrites

Slack’s response wasn’t to burn it all down and start over. Instead, they rebuilt the system underneath the old one.

They introduced graceful degradation paths, so that not every failure would cascade. If a job type slowed down, others could continue to flow. They segregated workers by job type, so high-latency tasks wouldn't block lightweight jobs.

They also introduced backpressure awareness. If a queue was backed up, producers could slow down rather than flooding Redis.
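A sketch of those two ideas together, with assumed names and thresholds rather than Slack’s real ones: jobs are segregated into per-type queues so slow types don’t block fast ones, and producers check queue depth before enqueuing so a backed-up queue sheds or defers load instead of flooding Redis.

```python
import redis

r = redis.Redis(host="queue-redis-1")
MAX_DEPTH = 50_000  # illustrative backpressure threshold per queue

def enqueue_with_backpressure(job_type: str, payload: str) -> bool:
    # One queue per job type, e.g. pending:push_notification, pending:billing.
    queue = f"pending:{job_type}"
    if r.llen(queue) >= MAX_DEPTH:
        # Queue is backed up: tell the caller to slow down, defer, or drop,
        # rather than letting Redis absorb unbounded growth.
        return False
    r.lpush(queue, payload)
    return True
```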

Retries were rethought. Instead of slamming Redis with immediate retries, Slack staggered them with exponential backoff and smarter retry logic.
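The general shape of staggered retries looks something like the sketch below: exponential backoff with jitter, so failed jobs spread out instead of hammering Redis in lockstep. The base delay and cap are assumptions, not Slack’s actual policy.

```python
import random

BASE_DELAY_S = 2    # first retry after roughly 2 seconds
MAX_DELAY_S = 300   # cap so no retry waits more than 5 minutes

def retry_delay(attempt: int) -> float:
    """Delay before retry number `attempt` (1-based), with full jitter."""
    capped = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** (attempt - 1)))
    return random.uniform(0, capped)

# attempt 1 -> up to 2s, attempt 2 -> up to 4s, attempt 3 -> up to 8s, ...
```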

Finally, Slack built a memory-aware queuing system that monitored Redis pressure and adjusted behavior before failure struck.
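One way to picture a memory-aware guard, as a sketch with an assumed high-watermark: producers consult Redis memory usage and back off before the instance hits its cap, instead of discovering the limit at enqueue time.

```python
import redis

r = redis.Redis(host="queue-redis-1")
MEMORY_HIGH_WATERMARK = 0.85  # stop enqueuing above 85% of maxmemory (illustrative)

def redis_has_headroom() -> bool:
    info = r.info(section="memory")
    max_memory = info.get("maxmemory", 0)
    if not max_memory:
        return True  # no maxmemory configured; nothing to compare against
    return info["used_memory"] / max_memory < MEMORY_HIGH_WATERMARK

def safe_enqueue(queue: str, payload: str) -> bool:
    if not redis_has_headroom():
        return False  # apply backpressure before Redis jams
    r.lpush(queue, payload)
    return True
```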


🔍 How to Stress-Test Your Job Queue System

Want to know if your queue can survive what Slack’s couldn’t? Here’s how to find out.
