DoorDash's Game-Changing Strategy: 70% Hit Ratio in Cache Optimization!
Decentralization and Empowering Efficiency
TL;DR
Problem: DoorDash's feature store runs on large, memory-intensive Redis clusters that are expensive to maintain yet must serve online prediction requests at low latency, prompting the team to explore an in-process caching layer to improve performance and scalability, especially for repeated requests.
Solution: Add an in-process caching layer inside the prediction microservice so repeated feature requests are served locally, reducing dependency on the Redis clusters and improving system speed and scalability.
💰 Help Wanted!
This newsletter has grown to 23,501+ AMAZING READERS. It's now at a scale that a single person can't maintain on their own.
If you’re interested in being a byte-sized design writer, apply here!
Flow
DoorDash's Sibyl Prediction Service (SPS) generates predictions from ML features. If a feature is not supplied by the upstream service along with the request, SPS fetches it from the feature store.
The feature store, mainly housed in Redis, incurs substantial compute costs due to high request volumes for ML features.
DoorDash explores caching and alternative storage solutions to address scalability and cost efficiency concerns.
Implementation of caching is expected to boost prediction performance, reliability, and scalability.
Beyond the immediate performance gains, caching promises to reduce compute costs and enhance platform efficiency over time.
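The lookup order described above can be sketched as a small read-through function. This is a minimal illustration, not DoorDash's implementation: the feature names and the dict standing in for the Redis-backed feature store are assumptions.

```python
# Plain dict standing in for the Redis-backed feature store;
# the feature names are made up for illustration.
FEATURE_STORE = {"store_avg_prep_time": 18.5, "consumer_order_count": 42}

def resolve_features(requested, provided):
    """Return a feature dict, preferring values supplied with the request
    and falling back to the feature store for anything missing."""
    resolved = {}
    for name in requested:
        if name in provided:           # feature arrived with the request
            resolved[name] = provided[name]
        else:                          # fall back to the feature store
            resolved[name] = FEATURE_STORE.get(name)
    return resolved
```

Every fallback branch here is a billable round trip to Redis in the real system, which is why the sections below focus on short-circuiting it with a local cache.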
Deep Dive
Problem Identification: DoorDash's feature store, predominantly in Redis, poses scalability and cost challenges due to memory-intensive servers for low-latency predictions.
Alternative Exploration: To address these challenges, the team explores cost-effective storage alternatives to Redis.
Caching Introduction: Hypothesizing repetitive requests, caching is introduced to improve prediction platform performance and cost efficiency.
Expected Benefits: Caching is anticipated to improve latency, reliability, and scalability, making predictions both faster and more cost-effective to serve.
Experiment
In-Memory, Pod-Local Cache: DoorDash implemented an in-memory, pod-local cache for their SPS microservice to enhance retrieval efficiency.
LRU Cache Eviction: They adopted an LRU (Least Recently Used) eviction scheme to manage cache size and ensure efficient memory usage.
Thread Safety: To handle the high concurrency of requests, DoorDash implemented a thread-safe cache using read-write locks.
Observability with Prometheus and Grafana: They enhanced their cache system's observability by logging metrics like latency, hit rate, cache size, and memory usage.
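The pieces above, an LRU eviction scheme, thread safety, and hit-rate counters for observability, can be combined in a short sketch. This is an assumption-laden illustration rather than DoorDash's code: Python's standard library has no read-write lock, so a single mutex stands in for the read-write locking the article describes, and the hit/miss counters are what a Prometheus exporter would scrape.

```python
import threading
from collections import OrderedDict

class LRUCache:
    """Pod-local LRU cache sketch: OrderedDict tracks recency,
    a mutex provides thread safety, counters feed the hit-rate metric."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)  # mark as most recently used
                self.hits += 1
                return self._data[key]
            self.misses += 1
            return None

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self.capacity:
                self._data.popitem(last=False)  # evict least recently used
```

A per-pod cache like this trades some memory in each microservice instance for avoided network round trips, which is exactly the trade the experiment set out to measure.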
Production-level
Preventing Slow Starts: To avoid slow cache starts when microservices restart, DoorDash will warm up caches by simulating real traffic to new processes.
Speeding Up Data Processing: DoorDash speeds up data processing by caching decoded information directly in memory, minimizing repetitive unpacking from Redis, especially for large items.
Boosting Cache Efficiency: DoorDash will enhance cache effectiveness by segregating caches based on different application components, ensuring faster data retrieval by keeping related data together.
Keeping Data Fresh: DoorDash updates cache in real-time using Kafka queues for predictions with the latest data.
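The "keeping data fresh" step can be sketched as a consumer that drains update events and writes them over stale cache entries. To keep the example self-contained, an in-memory `queue.Queue` stands in for the Kafka topic; the topic shape and feature names are assumptions, not DoorDash's.

```python
import queue

# In-memory queue standing in for the Kafka topic of feature updates.
update_topic = queue.Queue()

def apply_pending_updates(cache):
    """Drain queued (feature_name, new_value) updates into the cache so
    subsequent predictions read the latest values. Returns the count applied."""
    applied = 0
    while True:
        try:
            feature_name, new_value = update_topic.get_nowait()
        except queue.Empty:
            break
        cache[feature_name] = new_value  # overwrite the stale entry
        applied += 1
    return applied
```

Pushing updates through a queue like this lets the cache stay warm and current without waiting for entries to expire or be evicted.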
Impact
Improved Performance: Caching and feature store optimizations likely improved ML feature retrieval, enhancing system performance and reducing latency for prediction requests.
Enhanced Cost Efficiency: By achieving a high cache hit rate, the system likely reduced the number of requests to the feature store, thereby lowering compute costs associated with data retrieval and processing (Redis is highly expensive).
Increased Scalability: Optimized feature store and caching ensure effective handling of increasing prediction requests without performance compromise or excessive costs.
Official Article!
Read the official article here!
(Get access to official articles, resources, and join the private system design community!)