TL;DR
DoorDash has grown into an enormous food delivery app built on many microservices, all of which need to be tested. One way DoorDash tests those services is fault injection: intentionally breaking their own services in local and staging environments.
They built their own tool called Filibuster and use it to find errors they didn’t even know they needed to handle.
How Big Is DoorDash?
DoorDash broke their monolithic service down into microservices. That means there are a lot of moving parts, and any one of them can break at any time.
DoorDash needs a way to ensure that every service can gracefully handle errors when any of the others goes down.
This includes things like Payment Services, Customer Location Services, Available Restaurant Services and many more.
What Did They Come Up With?
The thing about software engineering everywhere is that preventing errors only takes you so far.
The unavoidable next step is to handle errors when they happen. To check that their services actually do this, they decided to use a fault injection tool: a tool that intentionally tries to break their own services.
Except Netflix did it first with Chaos Monkey and DoorDash isn’t trying to be Netflix.
What’s Netflix Doing That DoorDash Can’t Do?
Netflix intentionally breaks their services… in prod as well (or at least they used to). DoorDash isn’t willing to risk prod to test their services, so they went with their own tool that runs outside of production.
It’s Filibuster, except not by the Government.
Even with all the great minds at DoorDash, they brought in a PhD researcher to come up with a fault injection tool called “Filibuster”.
Not In Prod?
Nope. It’s all in local and staging. Sure, they don’t get the authentic results they would get from testing in prod, but testing in prod is a Netflix thing, not a general thing.
What Makes it Special?
Filibuster can test all the permutations of failed dependencies. If the app depends on 5 services, ALL permutations of breaking them are tested and reported.
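To make the idea concrete, here is a minimal sketch of exhaustive fault injection in plain Python. It is not Filibuster's API or how the tool works internally; the service names, `call_service`, `load_home_page` and `run_all_scenarios` are all made up for illustration. Strictly speaking, it enumerates every non-empty subset of dependencies to fail together, so 3 dependencies give 7 scenarios and 5 would give 31.

```python
from itertools import combinations

# Hypothetical dependency names, loosely based on the services mentioned earlier.
DEPENDENCIES = ["payments", "customer_location", "available_restaurants"]


class ServiceUnavailable(Exception):
    """Raised by the fake client when a dependency is marked as failed."""


def call_service(name, failed):
    """Stand-in for a real network call; fails if `name` is in `failed`."""
    if name in failed:
        raise ServiceUnavailable(name)
    return f"{name}: ok"


def load_home_page(failed):
    """Toy code under test: restaurants are required, the rest is optional."""
    page = {"restaurants": call_service("available_restaurants", failed)}
    for optional in ("payments", "customer_location"):
        try:
            page[optional] = call_service(optional, failed)
        except ServiceUnavailable:
            page[optional] = None  # degrade gracefully
    return page


def failure_scenarios(deps):
    """Yield every non-empty subset of dependencies to fail together."""
    for size in range(1, len(deps) + 1):
        for subset in combinations(deps, size):
            yield frozenset(subset)


def run_all_scenarios():
    """Inject each failure scenario and record whether the app coped."""
    report = {}
    for failed in failure_scenarios(DEPENDENCIES):
        try:
            load_home_page(failed)
            report[failed] = "handled"
        except ServiceUnavailable as exc:
            report[failed] = f"unhandled failure of {exc}"
    return report


if __name__ == "__main__":
    for failed, outcome in run_all_scenarios().items():
        print(sorted(failed), "->", outcome)
```

In this toy setup, every scenario that takes down `available_restaurants` is reported as unhandled, which is exactly the kind of gap this style of testing is meant to surface.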
But I don’t want all the permutations
And you don’t need to have them. There are ways to configure the tool so that some services are tested in local environments and others in staging environments.
What Does this Mean for Developers?
It means there’s automated testing to cover the potential errors a developer can run into when working with so many microservices. It only tells the developer which scenarios exist and need to be handled. Fault injection testing won’t tell the developer how to actually handle these service errors.
That decision is up to the coder.
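As a rough illustration of what that decision can look like, the sketch below shows two common options for a single failed dependency: fall back to a default value, or let the error surface to the caller. The `get_eta_minutes` helper and the `ServiceUnavailable` exception are hypothetical names, not something taken from DoorDash's code.

```python
class ServiceUnavailable(Exception):
    """Hypothetical error raised when a downstream service is down."""


def get_eta_minutes(fetch_eta, default=None):
    """Return the ETA from `fetch_eta`, or handle the outage.

    If `default` is given, fall back to it; otherwise re-raise so the
    caller has to deal with the failure explicitly.
    """
    try:
        return fetch_eta()
    except ServiceUnavailable:
        if default is None:
            raise  # option 1: surface the error
        return default  # option 2: degrade to a safe default


def broken_eta_service():
    """Simulates the ETA service being down."""
    raise ServiceUnavailable("eta")


if __name__ == "__main__":
    print(get_eta_minutes(lambda: 20, default=45))
    print(get_eta_minutes(broken_eta_service, default=45))
```

Which option is right depends on the feature: a rough default might be acceptable for an ETA, while a failed payment probably should not be papered over.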
Sources and Official Article!
(Links to official article and sources are available to paid subscribers. They help maintain and support this newsletter!)