🔍 Uncovering the One-in-a-Million Kubernetes Gotcha
Migrating a mission-critical search system to Kubernetes, and the unexpected performance issue that almost derailed the whole project
Intro
At Pinterest, search is the backbone of the user experience, powering everything from home feed recommendations to direct search queries. To handle that scale, Pinterest built an in-house search system called Manas, which serves dozens of search indices across thousands of hosts.
As Pinterest grew, the custom cluster management behind Manas became increasingly complex and error-prone, so the team decided to migrate Manas to PinCompute, Pinterest's in-house Kubernetes platform. After a year of development, they were ready to put the new system through its paces.
TLDR
During performance testing, the team discovered a small but persistent percentage of search requests were timing out, taking 100x longer than expected. This posed a serious risk, as even a few timeouts per minute could degrade recommendations across the entire Pinterest ecosystem.
The investigation that followed uncovered an elusive interaction between Manas's memory-intensive search process and a seemingly innocuous monitoring agent, cAdvisor. The root cause? cAdvisor's "working set size" metric, which every 30 seconds walked the process's entire page table and cleared its page-reference bits, causing severe contention with Manas's memory-mapped indices.
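To make the mechanism concrete, here is a minimal Go sketch of how a monitoring agent can estimate how much of a process's memory was recently referenced (one way to approximate a working set) through the Linux /proc interface, clearing the page-reference bits after each reading. Treat it as an illustration of the general technique under stated assumptions, not cAdvisor's actual code path: the PID choice and loop are hypothetical, and only the 30-second interval comes from the story above.

```go
// Hypothetical sketch: estimate a process's recently referenced memory via
// the Linux /proc interface. NOT cAdvisor's actual code; the PID choice and
// loop structure are assumptions for illustration only.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// referencedKB sums the "Referenced:" fields in /proc/<pid>/smaps, i.e. memory
// touched since the reference bits were last cleared.
func referencedKB(pid int) (uint64, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/smaps", pid))
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var total uint64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "Referenced:" {
			kb, _ := strconv.ParseUint(fields[1], 10, 64)
			total += kb
		}
	}
	return total, sc.Err()
}

// clearRefs writes "1" to /proc/<pid>/clear_refs, asking the kernel to walk
// every page-table entry of the process and clear its accessed/referenced
// bits -- the expensive step when the process has huge memory-mapped files.
func clearRefs(pid int) error {
	return os.WriteFile(fmt.Sprintf("/proc/%d/clear_refs", pid), []byte("1"), 0o200)
}

func main() {
	pid := os.Getpid() // illustrative: sample this process; a real agent would target others
	for {
		kb, err := referencedKB(pid)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		fmt.Printf("referenced: %d kB\n", kb)
		if err := clearRefs(pid); err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		time.Sleep(30 * time.Second) // the 30-second collection interval from the story
	}
}
```

The write to clear_refs is the painful part for a workload like Manas: the kernel walks the target's page tables while holding memory-management locks, which is exactly where a large, actively used memory-mapped index starts to contend.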
Why this matters
Migrating a mission-critical service like Manas to a new infrastructure platform is a huge undertaking. Ensuring performance and reliability is paramount: even a small regression could impact millions of users.
This story highlights the challenges of running memory-intensive workloads on Kubernetes and the importance of thorough testing and debugging. It also showcases the value of a systematic, multi-pronged approach to problem-solving, combining clearbox profiling, blackbox isolation, and good old-fashioned process of elimination.
Narrowing the Problem Space
To tackle the challenge systematically, the team began by simplifying their testing environment. They re-provisioned the test cluster onto larger EC2 instances, removed cgroup limits, and even ran Manas directly on the Kubernetes nodes.
However, none of these changes had any effect. The latency spikes persisted, indicating the issue was specific to the Kubernetes provisioning layered on top of the AMI they were using.
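As an aside, the "removed cgroup limits" step is the kind of thing worth verifying from inside the container rather than assuming from the pod spec. The snippet below is a hypothetical check, not from the article, that reads the kernel's cgroup files (v2 and v1 paths) to report whether a memory limit is actually in force.

```go
// Hypothetical helper: confirm from inside a container whether a cgroup memory
// limit is actually set. "max" (cgroup v2) or a huge value (cgroup v1) means
// effectively unlimited. Nothing here is from the article's own tooling.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// cgroup v2 exposes the limit as memory.max; cgroup v1 as memory.limit_in_bytes.
	candidates := []string{
		"/sys/fs/cgroup/memory.max",
		"/sys/fs/cgroup/memory/memory.limit_in_bytes",
	}
	for _, path := range candidates {
		raw, err := os.ReadFile(path)
		if err != nil {
			continue // file absent on this cgroup version
		}
		limit := strings.TrimSpace(string(raw))
		if limit == "max" {
			fmt.Printf("%s: no memory limit set\n", path)
		} else {
			fmt.Printf("%s: limit = %s bytes\n", path, limit)
		}
		return
	}
	fmt.Println("no cgroup memory limit file found")
}
```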
The team then took a two-pronged approach, combining clearbox profiling with blackbox isolation.