GitHub’s Elasticsearch Problem Was Seven Years in the Making. Here’s How They Finally Fixed It
Why the right fix wasn't available until now, and what they did in the meantime.
TL;DR
GitHub Enterprise Server runs search on Elasticsearch. It also runs High Availability with a primary/replica model. For years, those two things could not coexist cleanly. Elasticsearch could move a primary shard onto the replica node, which is supposed to be read-only. If you then took that replica down for maintenance, the system deadlocked: the replica waited for Elasticsearch to recover before it could start, and Elasticsearch couldn’t recover until the replica rejoined.
GitHub engineers knew this was broken. They spent years trying to patch around it. A real fix only became possible once Elasticsearch shipped Cross Cluster Replication.
The fix is live in GHES 3.19.1. The lesson underneath it is older than GitHub.
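The deadlock in the TL;DR is a circular wait: each side is blocked on the other. Here is a minimal sketch of that shape as a wait-for graph with a cycle check. The node names are hypothetical labels chosen for illustration, not identifiers from GHES or Elasticsearch, and this is a toy model of the dependency, not how either system detects it.

```python
def find_cycle(waits_for):
    """Return the nodes forming a wait-for cycle, or None if there is none.

    `waits_for` maps each component to the single component it is blocked on.
    """
    for start in waits_for:
        path = [start]
        seen = {start}
        node = waits_for.get(start)
        while node is not None:
            if node in seen:
                # Walked back onto a node already in the chain: circular wait.
                return path[path.index(node):] + [node]
            path.append(node)
            seen.add(node)
            node = waits_for.get(node)
    return None


# Hypothetical labels for the two sides of the deadlock described above.
waits_for = {
    "replica_node_startup": "elasticsearch_recovered",
    "elasticsearch_recovered": "replica_node_startup",
}

cycle = find_cycle(waits_for)
print(" -> ".join(cycle))
```

Neither side can make progress on its own, so the only way out is to remove one of the two edges. That is what a structural fix has to do.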
The Original Sin Was a Reasonable Decision
Let’s be precise about what went wrong here, because it’s easy to read this story as “Elasticsearch bad” when the real issue is more interesting.


