When the Netflix API launched three years ago, it was to “let 1,000 flowers bloom”. Today, that API still exists with almost 23,000 flowers.

At that time, it was exclusively a public API.

Some of the apps developed by the 1,000 flowers.

Then streaming started taking off for Netflix, first with computer-based streaming… At that time, it was still experimental and did not draw from the API.

But over time, as we added more devices, they started drawing their metadata from the API. Today, almost all of our devices are powered by the API.

As a result, today’s consumption is almost entirely from private APIs that service the devices. The Netflix devices account for 99.7% of the API traffic while the public API represents only about 0.3%.

The Netflix API represents the iceberg model for APIs. That is, public APIs represent a relatively small percentage of the value for the company, but they are typically the most visible part of the API program. They equate to the small part of the iceberg that is above water, in open sight. Conversely, the private APIs that drive web sites, mobile phones, device implementations, etc. account for the vast majority of the value for many companies, although people outside of the company often are not aware of them. These APIs equate to the large, hard to see mass of ice underwater. In the API space, most companies get attracted to the tip of the iceberg because that is what they are aware of. As a result, many companies seek to pursue a public API program. Over time, however, after more inspection into the value propositions of APIs, it becomes clear to many that the greatest value is in the private APIs.

As a result, the current emphasis for the Netflix API is on the majority case… supporting the Netflix

There are basically two types of interactions between Netflix customers and our streaming application… Discovery and Streaming.

Discovery is basically any event with a title other than streaming it. That includes browsing titles, looking for something to watch, etc.

It also includes actions such as rating the title, adding it to your instant queue, etc.

Once the customer has identified a title to watch through the Discovery experience, the user can then play that title. Once the Play button is selected, the customer is sent to a different internal service that focuses on handling the streaming. That streaming service also interacts with our CDNs to actually deliver the streaming bits to the device for playback.

The API powers the Discovery experience. The rest of these slides will only focus on Discovery, not Streaming.

As Discovery events grow, so does traffic to the Netflix API. Discovery continues to grow for a variety of reasons, including more devices, more customers, richer UI experiences, etc.

As API traffic grows, so do the infrastructural needs. The more requests, the more servers we need, the more time spent supporting those servers, the higher the costs associated with this support, etc.

And our international expansion will only add complexity and more scaling issues.

The traditional model is to have systems administrators go into server rooms like this one to build out new servers, etc.

Rather than relying on data centers, we have moved everything to the cloud! This enables rapid scaling with relative ease. Adding new servers, in new locations, takes minutes. And this is critical when the service needs to grow from 1B requests a month to 1B requests a day in a year.
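
To put that growth in perspective, here is a rough back-of-the-envelope calculation. It assumes a 30-day month and traffic spread evenly over the period (real traffic is peakier), so the figures are averages only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_rps(requests: int, days: float) -> float:
    """Average requests per second over a period of `days` days."""
    return requests / (days * SECONDS_PER_DAY)

# Assumption: a 30-day month and perfectly even traffic.
monthly = average_rps(1_000_000_000, days=30)  # 1B requests a month
daily = average_rps(1_000_000_000, days=1)     # 1B requests a day

print(f"1B/month ~ {monthly:,.0f} requests/second")
print(f"1B/day   ~ {daily:,.0f} requests/second")
print(f"growth factor: {daily / monthly:.0f}x")
```

Under those assumptions the average load goes from a few hundred requests per second to more than eleven thousand, a thirty-fold increase.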

Instead of going into server rooms, we go into a web page like this one. Within minutes, we can spin up new servers to support growing demands.

Through autoscaling in the cloud, we can also dynamically grow our server farm in concert with the traffic that we receive.

So, instead of buying new servers based on projected spikes in traffic and having systems administrators add them to the farm, the cloud can dynamically and automatically add and remove servers based on need.
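
The idea of sizing the farm from observed traffic can be sketched as follows. This is an illustration only, not the AWS autoscaling API: the function name, the per-server capacity, and the minimum and maximum bounds are all hypothetical.

```python
import math

def desired_servers(current_rps: float, rps_per_server: float,
                    min_servers: int = 2, max_servers: int = 500) -> int:
    """Return how many servers the farm should run for the current load.

    All parameters are illustrative assumptions: `rps_per_server` is the
    load one server can comfortably handle, and the min/max bounds keep
    the farm from shrinking to nothing or growing without limit.
    """
    needed = math.ceil(current_rps / rps_per_server)
    return max(min_servers, min(max_servers, needed))

# Traffic rises during the evening peak and falls overnight.
for rps in (50, 4_000, 11_574):
    print(rps, "req/s ->", desired_servers(rps, rps_per_server=100), "servers")
```

A scaling policy would run a calculation like this on a schedule and add or remove servers to match the result.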

And as we continue to expand internationally, we can easily scale up in new regions, closer to the customer base that we are trying to serve, as long as Amazon has a location near there.

As a general practice, Netflix focuses on getting code into production as quickly as possible to expose features to new audiences.

That said, we do spend a lot of time testing. We have just adopted some new techniques to help us learn more about what the new code will look like in production.

Prior to these new changes, our flow looked something like this…

That flow has changed with the addition of new techniques, such as canary deployments and what we call red/black deployments.

The canary deployments are comparable to canaries in coal mines. We have many servers in production running the current codebase. We will then introduce a single (or perhaps a few) new servers into production running new code. Monitoring the canary servers will show what the new code will look like in production.

If the canary shows errors, we pull it/them down, re-evaluate the new code, debug it, etc. We will then repeat the process until the analysis of the canary servers looks good.
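
One simple way to decide whether a canary "looks good" is to compare its error rate with that of the servers running the current code. The sketch below assumes that single criterion and a hypothetical tolerance. A real canary analysis would likely look at many more signals (latency, CPU, and so on).

```python
from dataclasses import dataclass

@dataclass
class ServerStats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

def canary_looks_good(baseline: ServerStats, canary: ServerStats,
                      tolerance: float = 0.005) -> bool:
    """Accept the canary if its error rate is within `tolerance` of baseline.

    `tolerance` is a hypothetical value chosen for illustration.
    """
    return canary.error_rate <= baseline.error_rate + tolerance

baseline = ServerStats(requests=100_000, errors=100)  # current codebase
healthy_canary = ServerStats(requests=1_000, errors=3)
broken_canary = ServerStats(requests=1_000, errors=50)

print(canary_looks_good(baseline, healthy_canary))
print(canary_looks_good(baseline, broken_canary))
```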

If the new code looks good in the canary, we can then use a technique that we call Red/Black Deployments to launch the code. Start with Red, where production code is running. Fire up a new set of servers (Black) equal to the count in Red with the new code.

Then switch the pointer to have external requests draw from the Black servers.

If a problem is encountered from the Black servers, it is easy to roll back quickly by switching the pointer back to Red. We will then re-evaluate the new code, debug it, etc.

Once we have debugged the code, we will put another canary up to evaluate the new changes in production.

If the new code looks good in the canary, we can then bring up another set of servers with the new code.

Then we will switch production traffic to the new code.

Then switch the pointer to have external requests draw from the Black servers. If everything still looks good, we disable the Red servers and the Black servers running the new code become the new Red servers.
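
The Red/Black sequence (launch Black, switch the pointer, roll back if needed, then retire Red) can be modeled in a few lines. The class and method names here are hypothetical, and the "pointer" is just a string. In production it would be something like a load balancer setting.

```python
class RedBlackRouter:
    """Toy model of a Red/Black deployment; names are illustrative only."""

    def __init__(self, red_servers):
        self.clusters = {"red": list(red_servers)}
        self.live = "red"  # the pointer that external requests follow

    def launch_black(self, new_servers):
        # Black should match Red's server count before taking traffic.
        if len(new_servers) != len(self.clusters["red"]):
            raise ValueError("Black must have as many servers as Red")
        self.clusters["black"] = list(new_servers)

    def switch_to(self, name):
        if name not in self.clusters:
            raise KeyError(name)
        self.live = name

    def promote_black(self):
        # Disable the old Red servers; Black becomes the new Red.
        self.clusters = {"red": self.clusters.pop("black")}
        self.live = "red"

    def live_servers(self):
        return self.clusters[self.live]

router = RedBlackRouter(["old-1", "old-2"])
router.launch_black(["new-1", "new-2"])
router.switch_to("black")   # production traffic now hits the new code
router.switch_to("red")     # problem found: roll back by moving the pointer
router.switch_to("black")   # fixed and re-verified: switch again
router.promote_black()      # retire the old servers
print(router.live, router.live_servers())
```

Because rollback is only a pointer change, the old servers stay untouched until the new code has proven itself.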

So, the development and testing flow now looks more like this…

At Netflix, we have a range of engineering teams who focus on specific problem sets. Some teams focus on creating rich presentation layers on various devices. Others focus on metadata and algorithms. For the streaming application to work, the metadata from the services needs to make it to the devices. That is where the API comes in. The API essentially acts as a broker, moving the metadata from inside the Netflix system to the devices.

Given the position of the API within the overall system, the API depends on a large number of underlying systems (only some of which are represented here). Moreover, a large number of devices depend on the API (only some of which are represented here). Sometimes, one of these underlying systems experiences an outage.

In the past, such an outage could result in an outage in the API.

And if that outage cascades to the API, it is likely to have some kind of substantive impact on the devices. The challenge for the API team is to be resilient against dependency outages, to ultimately insulate Netflix customers from low level system problems.

To achieve this, we implemented a series of circuit breakers for each library that we depend on. Each circuit breaker controls the interaction between the API and that dependency. This image is a view of the dependency monitor that allows us to view the health and activity of each dependency. This dashboard is designed to give a real-time view of what is happening with these dependencies (over the last two minutes). We have other dashboards that provide insight into longer-term trends, day-over-day views, etc.

This is a view of a single circuit.

This circle represents the call volume and health of the dependency over the last 10 seconds. This circle is meant to be a visual indicator for health. The circle is green for healthy, yellow for borderline, and red for unhealthy. Moreover, the size of the circle represents the call volumes, where bigger circles mean more traffic.

The blue line represents the traffic trends over the last two minutes for this dependency.

The green number shows the number of successful calls to this dependency over the last two minutes.

The yellow number shows the number of latent calls into the dependency. These calls ultimately return successful responses, but slower than expected.

The blue number shows the number of calls that were handled by the short-circuited fallback mechanisms. That is, if the circuit gets tripped, the blue number will start to go up.

The orange number shows the number of calls that have timed out, resulting in fallback responses.

The purple number shows the number of calls that fail due to queuing issues, resulting in fallback responses.

The red number shows the number of exceptions, resulting in fallback responses.

The error rate is calculated from the total number of error and fallback responses divided by the total number of calls handled.
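
As a worked example of that calculation, the sketch below derives an error rate from the dashboard counters. It assumes that the short-circuited, timed-out, queue-rejected, and exception counts are disjoint and that latent calls are already included in the success count. The slides do not spell this out, so treat both as assumptions.

```python
from dataclasses import dataclass

@dataclass
class CircuitCounts:
    success: int          # green number (assumed to include latent calls)
    short_circuited: int  # blue number
    timed_out: int        # orange number
    rejected: int         # purple number (queuing issues)
    exceptions: int       # red number

    def error_rate(self) -> float:
        """Error and fallback responses divided by total calls handled."""
        failed = (self.short_circuited + self.timed_out
                  + self.rejected + self.exceptions)
        total = self.success + failed
        return failed / total if total else 0.0

counts = CircuitCounts(success=9_500, short_circuited=200,
                       timed_out=150, rejected=50, exceptions=100)
print(f"error rate: {counts.error_rate():.1%}")
```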

If the error rate exceeds a certain number, the circuit to the fallback scenario is automatically opened. When it returns below that threshold, the circuit is closed again.
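
A minimal sketch of that open/close behavior follows. It is not the actual Netflix implementation: the class, the 50% threshold, and the idea of feeding the breaker an externally measured error rate are all assumptions made to keep the example short.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

class CircuitBreaker:
    """Toy circuit breaker driven by an externally measured error rate."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold  # hypothetical value
        self.is_open = False

    def update(self, error_rate: float) -> None:
        # Open above the threshold, close again once back below it.
        self.is_open = error_rate > self.threshold

    def call(self, dependency: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open:
            return fallback()   # short-circuit: do not touch the dependency
        try:
            return dependency()
        except Exception:
            return fallback()   # individual failures also get the fallback

breaker = CircuitBreaker(threshold=0.5)
live = lambda: "personalized ratings"
cached = lambda: "average ratings"

print(breaker.call(live, cached))  # circuit closed: real response
breaker.update(error_rate=0.8)
print(breaker.call(live, cached))  # circuit open: fallback response
breaker.update(error_rate=0.1)
print(breaker.call(live, cached))  # circuit closed again
```

The caller always gets an answer; only its quality changes while the dependency is unhealthy.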

The dashboard also shows host and cluster information for the dependency.

As well as information about our SLAs.

So, going back to the engineering diagram…

If that same service fails today…

We simply disconnect from that service.

And replace it with an appropriate fallback.

Keeping our customers happy, even if the experience may be slightly degraded. It is important to note that different dependency libraries have different fallback scenarios. And some are more resilient than others. But the overall sentiment here is accurate at a high level.

As discussed earlier, the API was originally built for the 1,000 flowers. Accordingly, today’s API design is very much grounded in the same principles for that same audience.

But the audience of the API today is dramatically different.

With the emphasis of the API program being on the large mass underwater – the private API.

As a result, the current API is no longer the right tool for the job. We need a new API, designed for the present and the future. The following slides talk more about the redesign of the Netflix API to better meet the needs of the key audiences.

We already talked about the tremendous growth in API requests…

Metrics like 30B requests per month sound great, don’t they? The reality is that this number is concerning…

Or this… Ad impressions are not part of the game. As a result, the increase in requests doesn’t translate into more revenue. In fact, it translates into more expenses. That is, to handle more requests requires more servers, more systems-admins, a potentially different application architecture, etc.

We are challenging ourselves to redesign the API to see if those same 30B requests could have been 5 billion or perhaps even less. Through more targeted API designs based on what we have learned through our metrics, we will be able to reduce our API traffic as Netflix’s overall traffic grows.

Given the same growth charts in the API, it would be great to imagine the traffic patterns being the blue bars instead of the red ones (assuming the customer usage and user experiences remain the same).

Similarly, with lower traffic levels for the same user experience, server and administration complexity and costs go down as well.

To state the goal another way, John Musser maintains a list of the API Billionaires. Netflix has a pretty lofty position in that club.

We aspire to no longer be in that exclusive company. That is one of the things the redesign strives for.

Today, the devices call back to the API in a mostly synchronous way to retrieve granular metadata needed to start up the client UI. That model requires a large number of network transactions which are the most expensive part of the overall interaction.

We want to break the interaction model into two types of interactions. Custom calls and generic calls.

For highly complex or critical interfaces, we want the device to make a single call to the API to a custom endpoint. Behind that endpoint will be a script that the UI teams maintain. That script is the traffic cop for gathering and formatting the metadata needed for that UI. The script will call to backend services to get metadata, but in this model, much of this will be concurrent and progressively rendered to the devices.
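
The shape of such a script can be sketched as below. The real scripts are described as Groovy running in the API's JVM, so this Python version is only an illustration. `fetch_lists` and `fetch_titles` are hypothetical stand-ins for backend services, and progressive rendering is not shown.

```python
import asyncio

async def fetch_lists(user_id):
    # Hypothetical backend call: which rows to show this user.
    await asyncio.sleep(0.01)
    return ["Top Picks", "New Releases"]

async def fetch_titles(list_name):
    # Hypothetical backend call: the titles in one row.
    await asyncio.sleep(0.01)
    return [f"{list_name} title {i}" for i in range(1, 4)]

async def home_screen(user_id):
    """One endpoint: gather everything the start-up UI needs."""
    lists = await fetch_lists(user_id)
    # Fetch every row concurrently rather than one after another.
    rows = await asyncio.gather(*(fetch_titles(name) for name in lists))
    return {"user": user_id, "rows": dict(zip(lists, rows))}

payload = asyncio.run(home_screen("user-1"))
print(payload)
```

The device makes one network call and the fan-out to backend services happens server-side, where it is cheap.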

One way to think of it is to imagine a full grid of movies/TV shows as part of a Netflix UI. The red box represents the viewable area of the grid when the screen loads. In today’s REST-ful resource model, the device needs to make calls for the individual lists, then populate the individual titles with distinct calls for each title. Granted, in today’s model, there are some asynchronous calls and some of them are also performed in bulk. But this still demonstrates the chatty nature of this REST-ful API design.

Alternatively, we believe that the custom script could easily return the desired viewables for each list in one payload much more efficiently.
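
A rough count shows the difference between the two models. The grid size below is made up, and the granular count is a worst case that ignores the bulk and asynchronous calls mentioned earlier.

```python
def granular_calls(rows: int, titles_per_row: int) -> int:
    """Worst case for the resource model: one call per list,
    plus one call per visible title (no batching assumed)."""
    return rows + rows * titles_per_row

def custom_calls() -> int:
    """The custom endpoint returns the whole viewable area at once."""
    return 1

# Hypothetical viewable area: 5 rows of 6 titles each.
print("granular:", granular_calls(rows=5, titles_per_row=6))
print("custom:  ", custom_calls())
```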

Moreover, this model can return any payload, filling out any portions of the grid structure, in that single response. Now, we are not likely going to want to populate a grid in this way, but it is possible given the highly customizable nature of this model.

But once we populate the complex start-up screens through the custom scripting tier, the interactions become much more predictable and device-agnostic. If you want to extend the movies in a given row, you don’t need a custom script. That is why we are exposing the Generic API as well.

To populate the grid with more titles for more rows, it is a simple call to get more titles.
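
A generic "get more titles" call might look something like this sketch. The function name, its parameters, and the in-memory catalog are hypothetical. They only illustrate a predictable, device-agnostic request for the next slice of a row.

```python
from typing import List

# Hypothetical stand-in for the catalog behind the Generic API.
CATALOG = {"new-releases": [f"title-{n}" for n in range(1, 101)]}

def get_titles(list_id: str, start: int, count: int) -> List[str]:
    """Return `count` titles from a list, beginning at index `start`."""
    return CATALOG[list_id][start:start + count]

# The start-up screen already showed the first 6 titles; fetch the next 3.
print(get_titles("new-releases", start=6, count=3))
```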

The call pattern looks like this for the Generic API. Notice, there is no need for some of the session-start requests when using the Generic API.

For this model, the technology stack is pretty simple. The client apps have their own languages designed for that particular device. The overall API server codebase is Java. And the custom scripts will be written in Groovy and compiled into the same JVM as the backend API code. This should help with overall performance and library sharing for more complex scripting operations.

To publish new scripts to the system, UI engineers will publish the script to Perforce for code management. Then it will be pushed up to a Cassandra cluster in AWS, which acts as a script repository and management system. Every 30 seconds or so, a job will scan the Cassandra cluster looking for new scripts. For all new scripts, they will be pushed to the full farm of API servers to be compiled into the JVM.
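
The scan-and-push step can be sketched as follows. A plain dictionary stands in for the Cassandra cluster, Python's built-in `compile` stands in for compiling Groovy into the JVM, and all class and function names are hypothetical. A scheduler (not shown) would call `sync_once` every 30 seconds or so.

```python
class ApiServer:
    def __init__(self, name):
        self.name = name
        self.compiled = {}  # script id -> compiled code object

    def load(self, script_id, source):
        # Stand-in for compiling the script into the running server.
        self.compiled[script_id] = compile(source, script_id, "exec")

def sync_once(repository, servers, already_pushed):
    """Push scripts not seen before to every server; return their ids."""
    new_ids = [sid for sid in repository if sid not in already_pushed]
    for sid in new_ids:
        for server in servers:
            server.load(sid, repository[sid])
        already_pushed.add(sid)
    return new_ids

repository = {"ipad-home-v1": "result = 1 + 1"}  # stand-in for Cassandra
servers = [ApiServer("api-1"), ApiServer("api-2")]
pushed = set()

print(sync_once(repository, servers, pushed))  # first scan finds the script
print(sync_once(repository, servers, pushed))  # nothing new on the next scan
```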

From the device perspective, there could be many scripts for a given endpoint, only one of which is active at a given time. In this case, the iPad is running off of script #2.

New scripts can be dynamically added at any time by any team (most often by the UI engineers). The new script (script 7) will arrive in an inactive state. At that time, the script can be tested from a production server before running the device off of it.

When the script looks good and is ready to go live, a configuration change is made and script 7 becomes active immediately across the full server farm.
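
The inactive-then-active lifecycle can be modeled with a small registry. The class and its methods are hypothetical, and the "configuration change" is reduced to a single assignment.

```python
class EndpointScripts:
    """Many scripts per endpoint, exactly one active at a time."""

    def __init__(self):
        self.scripts = {}
        self.active_id = None

    def add(self, script_id, handler):
        # New scripts arrive inactive; adding one never changes live traffic.
        self.scripts[script_id] = handler

    def test(self, script_id, request):
        # Exercise any script directly, active or not.
        return self.scripts[script_id](request)

    def activate(self, script_id):
        if script_id not in self.scripts:
            raise KeyError(script_id)
        self.active_id = script_id  # the configuration change

    def handle(self, request):
        # Device traffic always goes to the active script.
        return self.scripts[self.active_id](request)

ipad = EndpointScripts()
ipad.add(2, lambda req: f"script 2 handled {req}")
ipad.activate(2)

ipad.add(7, lambda req: f"script 7 handled {req}")
print(ipad.handle("home"))   # still served by script 2
print(ipad.test(7, "home"))  # script 7 tested in production, not yet live
ipad.activate(7)
print(ipad.handle("home"))   # now served by script 7
```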

All of these changes in our redesign effort are designed to help the apps and the UI engineers run faster.

{"catalog_title":{"id":"http://api.netflix.com/catalog/titles/movies/60034967","title":{"title_short":"Rosencrantz and Guildenstern Are Dead","regular":"Rosencrantz and Guildenstern Are Dead"},"maturity_level":60,"release_year":"1990","average_rating":3.7,"box_art":{"284pix_w":"http://cdn-7.nflximg.com/en_US/boxshots/ghd/60034967.jpg","110pix_w":"http://cdn-7.nflximg.com/en_US/boxshots/large/60034967.jpg","38pix_w":"http://cdn-7.nflximg.com/en_US/boxshots/tiny/60034967.jpg","64pix_w":"http://cdn-7.nflximg.com/en_US/boxshots/small/60034967.jpg","150pix_w":"http://cdn-7.nflximg.com/en_US/boxshots/150/60034967.jpg","88pix_w":"http://cdn-7.nflximg.com/en_US/boxshots/88/60034967.jpg","124pix_w":"http://cdn-7.nflximg.com/en_US/boxshots/124/60034967.jpg"},"language":"en","web_page":"http://www.netflix.com/Movie/Rosencrantz_and_Guildenstern_Are_Dead/60034967","tiny_url":"http://movi.es/ApUP9"},"meta":{"expand":["@directors","@bonus_materials","@cast","@awards","@short_synopsis","@synopsis","@box_art","@screen_formats","@"links":{"id":"http://api.netflix.com/catalog/titles/movies/60034967","languages_and_audio":"http://api.netflix.com/catalog/titles/movies/60034967/languages_and_audio","title":"http://api.netflix.com/catalog/titles/movies/60034967/title","screen_formats":"http://api.netflix.com/catalog/titles/movies/60034967/screen_formats","cast":"http://api.netflix.com/catalog/titles/movies/60034967/cast","awards":"http://api.netflix.com/catalog/titles/movies/60034967/awards","short_synopsis":"http://api.netflix.com/catalog/titles/movies/60034967/short_synopsis","box_art":"http://api.netflix.com/catalog/titles/movies/60034967/box_art","synopsis":"http://api.netflix.com/catalog/titles/movies/60034967/synopsis","directors":"http://api.netflix.com/catalog/titles/movies/60034967/directors","similars":"http://api.netflix.com/catalog/titles/movies/60034967/similars","format_availability":"http://api.netflix.com/catalog/titles/movies/60034967/format_availability"}}}

Improve Efficiency of API Requests: Could it have been 5 billion requests per month? Or less? (Assuming everything else remained the same)