Netflix is Chill with Breaking Their Services

Do it for the tests

May 15, 2023

TL;DR

Netflix has a LOT of services. Too many to count and too many to manage. Asking the client to query each of them and piece together things like recommendations, favorite lists, and top viewed selections is way too much.

Netflix’s solution? Add a middle-tier server called MAP between the frontend and backend that puts the API responses together and makes them cleaner for the client to read.

Now everything depends on MAP, so Netflix does a ton of latency and intentional failure testing to make sure this middle layer keeps working smoothly.

WTF is MAP?

Because there are way too many services involved in recommending you an awesome show, Netflix came up with MAP, the Merchandising Application Platform, to simplify calling them.

API responses go from server → MAP → client. Sometimes the response can come straight from a cache if it’s already there.
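
To make the server → MAP → client flow concrete, here is a minimal sketch of a middle tier that calls several backends, stitches the results into one response, and serves repeat requests from a cache. Every name here (MiddleTier, the fetch functions, the show titles) is made up for illustration and is not Netflix's actual code.

```python
# Hypothetical backend services. Real ones would be network calls.
def fetch_recommendations(user_id):
    return ["Show A", "Show B"]

def fetch_favorites(user_id):
    return ["Show C"]

def fetch_top_viewed(user_id):
    return ["Show D"]

class MiddleTier:
    """Toy stand-in for a MAP-style layer: one call in, one merged response out."""

    def __init__(self, services):
        self._services = services   # row name -> fetch function
        self._cache = {}            # user_id -> merged response
        self.backend_calls = 0      # counted so the cache effect is visible

    def home_page(self, user_id):
        # Serve from the cache if this page was already built.
        if user_id in self._cache:
            return self._cache[user_id]
        page = {}
        for name, fetch in self._services.items():
            self.backend_calls += 1
            page[name] = fetch(user_id)
        self._cache[user_id] = page
        return page

middle_tier = MiddleTier({
    "recommendations": fetch_recommendations,
    "favorites": fetch_favorites,
    "top_viewed": fetch_top_viewed,
})
first = middle_tier.home_page("user-1")    # hits all three backends
second = middle_tier.home_page("user-1")   # served from the cache
```

The client makes one request and gets one merged payload, and the second request never touches the backends.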

Can’t the client just read directly from the server?

Sure, it can, but putting all the API response pieces together would make the client really, really complicated.

Sometimes there’s duplicated logic that can live in a single place, and that’s what the middle tier is for.

As an example, sometimes credits are shown after 95% of the show. Other times it’s after 90% of the show. This branched logic doesn’t need to sit in the client and can just be queried from a single endpoint on MAP.
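
Here is a rough sketch of what keeping that branch in one place could look like. The 95% and 90% figures come from the example above, but the rule for which content gets which threshold (episode vs. movie) and the function name are assumptions made up for illustration.

```python
# Assumption: which content type gets which threshold is invented here.
# Only the 95% and 90% values come from the example in the post.
CREDITS_THRESHOLD_PERCENT = {
    "episode": 95,
    "movie": 90,
}

def credits_start_seconds(content_type, runtime_seconds):
    """Single middle-tier answer to 'when do the credits start?'."""
    percent = CREDITS_THRESHOLD_PERCENT[content_type]
    # Integer math avoids floating point rounding surprises.
    return runtime_seconds * percent // 100

episode_credits = credits_start_seconds("episode", 3000)
movie_credits = credits_start_seconds("movie", 6000)
```

Every client asks the same endpoint and none of them need to know the branching rule exists.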

Requirements for Testing

The MAP layer needs to be tested against both its dependencies and its dependents.

Here’s a summary of the requirements:

Don’t let the user freak out if something breaks. Keep the UI presentable.
Don’t let your dependents down. Don’t let your dependencies down. Keep MAP failures contained to MAP.
Don’t overload the dependencies. If MAP sends a lot of requests, make sure the load stays manageable for the server.
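
The requirements above can be sketched as one small wrapper: return a fallback when a dependency fails, so the UI still has something to show, and refuse to send more than a fixed number of concurrent calls, so the dependency is not overloaded. BoundedCaller and its parameters are hypothetical names, not something from Netflix's stack.

```python
import threading

class BoundedCaller:
    """Wraps calls to one dependency with a fallback and a concurrency cap."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn, fallback):
        # Shed load instead of queueing when the dependency is already busy.
        if not self._slots.acquire(blocking=False):
            return fallback
        try:
            return fn()
        except Exception:
            # Keep the failure contained: the caller gets a presentable default.
            return fallback
        finally:
            self._slots.release()

def broken_dependency():
    raise RuntimeError("recommendations service is down")

caller = BoundedCaller(max_in_flight=1)
rows_when_broken = caller.call(broken_dependency, fallback=[])
rows_when_healthy = caller.call(lambda: ["Show A"], fallback=[])
```

An empty row is a boring fallback, but it keeps the page rendering instead of throwing an error at the user.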

How they broke MAP

The overall strategy is to just break MAP’s connection to everything else and see what happens. How to break it is a different story. Trying to break prod without actually breaking prod is a challenge, but here’s how they did it:

Netflix used two things to test breaking their system:

Latency Monkey

This tool slows down requests and responses throughout the entire system. It can make the client think the server is down when it’s actually not.
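
A minimal sketch of the idea, not the real tool: wrap a healthy handler so it responds late, then call it with a client-side timeout. The helper names and the delay and timeout values are assumptions chosen to keep the example fast.

```python
import concurrent.futures
import time

def with_injected_latency(handler, delay_seconds):
    """Return a version of handler that answers correctly, just late."""
    def slowed(*args, **kwargs):
        time.sleep(delay_seconds)
        return handler(*args, **kwargs)
    return slowed

def call_with_timeout(handler, timeout_seconds):
    """Client side: give up if the handler does not answer in time."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(handler)
    try:
        return ("ok", future.result(timeout=timeout_seconds))
    except concurrent.futures.TimeoutError:
        return ("timed_out", None)
    finally:
        # Do not block on the slow call; let it finish in the background.
        pool.shutdown(wait=False)

def healthy_service():
    return ["Show A", "Show B"]

fast_result = call_with_timeout(healthy_service, timeout_seconds=0.5)
slow_result = call_with_timeout(
    with_injected_latency(healthy_service, delay_seconds=0.3),
    timeout_seconds=0.05,
)
```

The service never actually failed in the second call. The client just stopped waiting, which is exactly the "looks down but isn't" situation described above.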

Failure Injection Testing (FIT)

FIT is a testing strategy that tags requests as failed so the other services treat them differently. A request goes from

client → failure tagging service → real service → reaction to the “failure” request
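
The flow above can be sketched roughly like this. The request shape, the injected_failures key, and all function names are hypothetical; the real FIT internals are not described in this post.

```python
class InjectedFailure(Exception):
    """Raised instead of calling a dependency that the request marks as failed."""

def tag_failure(request, dependency_name):
    # Failure tagging service: copy the request and mark one dependency as failed.
    tagged = dict(request)
    tagged["injected_failures"] = set(request.get("injected_failures", set()))
    tagged["injected_failures"].add(dependency_name)
    return tagged

def call_dependency(request, name, real_call):
    # Real service: check the tag before making the actual call.
    if name in request.get("injected_failures", set()):
        raise InjectedFailure(name)
    return real_call()

def home_rows(request):
    # React to the "failure": drop the broken row, keep the rest of the page.
    rows = {}
    for name, real_call in (
        ("recommendations", lambda: ["Show A"]),
        ("new_releases", lambda: ["Show B"]),
    ):
        try:
            rows[name] = call_dependency(request, name, real_call)
        except InjectedFailure:
            rows[name] = []
    return rows

normal_page = home_rows({"user_id": "user-1"})
degraded_page = home_rows(tag_failure({"user_id": "user-1"}, "recommendations"))
```

Only the tagged request sees the failure, so the blast radius stays limited to whatever traffic was chosen for the test.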

After using both strategies, Netflix found a ton of bugs and fixed one bad enough to hide new releases and suggestions.

