Lyft Tests In Production 🚗💨
The best way to know if it works in production is to put it in production.
TL;DR
Lyft load tests in production. Replicating production scale in a staging environment was too expensive, so Lyft tests in production instead.
It works! They simulate rides in their production environment, driven by a config.
So what’s wrong with testing in Staging?
Lyft needs to load test, especially because it gets huge traffic during real-world events like the Super Bowl, New Year's Eve parties, and graduation season.
Most companies simulate traffic in a staging environment and check that staging handles it correctly. Lyft doesn't do that.
Here are the problems with testing in staging for Lyft.
1. It’s Expensive
Staging environments aren't expected to handle the same load as production. Lyft wanted realistic results, and that means scaling staging to the same capacity as production.
That's expensive, so Lyft decided it made more sense to test directly in production.
2. Replaying Traffic Is Risky
Nobody wants to get double-charged, ever. Replaying production traffic in staging could replay customer charges (and driver payouts), so Lyft chose not to replay real traffic.
3. Accurate Results
Staging environments are full of test data and other hacked-together configurations. Load-test failures in staging can be false negatives or false positives. The most accurate results come from prod.
Ok, they’re in Prod. How’d they do it?
Lyft tests its services by simulating traffic.
The flow looks like this:
1. A test bot makes a ton of calls to the SimulatedRides API.
2. The SimulatedRides API handles requests using the SimulationTable.
3. Responses show up in the SimulatedRides web UI.
4. A resource management service cleans up resources like drivers, riders, and scooters when the test is done.
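The flow above can be sketched as a minimal test driver. Everything here is a hypothetical stand-in; the function names, the SimulationTable shape, and the cleanup step are assumptions for illustration, not Lyft's actual SimulatedRides API:

```python
# Hypothetical sketch of the load-test flow: a test bot reads entries
# from a SimulationTable, fires requests at the SimulatedRides service,
# and cleans up simulated resources afterward. All names are assumed.

SIMULATION_TABLE = [
    {"region": "chicago", "client_type": "rider"},
    {"region": "chicago", "client_type": "driver"},
]

def call_simulated_rides_api(request):
    # Stand-in for a real HTTP call to the SimulatedRides service.
    return {"request": request, "status": "ok"}

def cleanup_resources():
    # Stand-in for releasing simulated riders, drivers, and scooters.
    print("cleaned up simulated resources")

def run_load_test(requests_per_client=3):
    responses = []
    for entry in SIMULATION_TABLE:            # requests come from the SimulationTable
        for _ in range(requests_per_client):  # the test bot makes many calls
            responses.append(call_simulated_rides_api(entry))
    cleanup_resources()                       # resource management cleans up at the end
    return responses                          # responses would feed the web UI
```

The key idea is that the bot, not real users, generates the traffic, and every simulated resource gets torn down after the run.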
Does It Run the Same Test Every Time?
A config for the SimulatedRides service defines the random actions a simulated user can take, which creates a decision tree. For example, there might be a 50% chance the simulated rider closes the app after opening it and checking prices.
The config also defines things like how many riders and drivers are available, and the odds they'll select a specific product.
{
  "name": "chicago",
  "client_configurations": {
    "region": "chicago",
    "rider_close_app_after_price_check_percent": 1,
    "rider_cancel_after_accepting_ride_percent": 10,
    "driver_cancel_after_accepting_ride_percent": 5
  },
  "client_composition": [
    {
      "client_type": "rider",
      "number": 50,
      "behaviors": {
        "shared_ride": 25,
        "standard_ride": 65,
        "luxury_ride": 5,
        "luxury_ride_suv": 5
      }
    },
    {
      "client_type": "driver",
      "number": 50,
      "behaviors": {
        "standard_ride": 100
      }
    }
  ]
}
Just change the configs and you’ll have a new load test.
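One plausible way a config like this drives the decision tree is a weighted random choice over the behavior percentages. This is an assumption for illustration, not Lyft's actual implementation, and `sample_behavior` is a hypothetical helper:

```python
import random

# Sketch: pick one behavior per simulated client, treating the config's
# percentages as weights for a weighted random choice. This interprets
# the behavior numbers as odds; Lyft's real logic may differ.

config = {
    "client_type": "rider",
    "number": 50,
    "behaviors": {
        "shared_ride": 25,
        "standard_ride": 65,
        "luxury_ride": 5,
        "luxury_ride_suv": 5,
    },
}

def sample_behavior(behaviors, rng=random):
    names = list(behaviors)
    weights = list(behaviors.values())
    return rng.choices(names, weights=weights, k=1)[0]

# Spin up `number` clients, each sampling its own behavior.
clients = [sample_behavior(config["behaviors"]) for _ in range(config["number"])]
```

Bumping `shared_ride` from 25 to 60 in the config would shift the simulated traffic mix without touching any code, which is what makes "change the config, get a new load test" work.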
But What Are the Drawbacks?
The thing about testing in production is that it's pretty dangerous. Lyft knows this, so here's how they make it safer.
There's always a human watching and managing the load tests. It isn't automated to the point where load tests run unattended in production with only alerts as a safety net.
It's an internal tool, which means nobody outside Lyft has experience using anything like it, and everyone who joins Lyft has to onboard onto the tool with zero background knowledge.
Quiz questions, answers, and Official Article!
(Links to official article and sources are available to paid subscribers. They help maintain and support this newsletter!)
5 questions the official article answers:
What are clients in the context of SimulatedRides, and what role do they play?
What are behaviors in SimulatedRides, and how do they influence the actions of clients?
How do actions contribute to the functionality of SimulatedRides, and what can engineers configure with them?