A Distributed Simulation Service – Netflix TechBlog – Medium

Hundreds of models of smart TVs, game consoles, mobile devices, and other video streaming devices get shipped with a Netflix app pre-installed. Before shipping these devices, manufacturers need to have the app certified on their device firmware. Certification involves running a series of tests that validate the behavior of the Netflix app under different positive & negative scenarios, and this process is repeated each time a new model of a device is released to the market.

Netflix provides its device partners with a scalable and automatable cloud-based testing platform to accomplish this. An integral part of this platform is Simone, a service that simulates the different conditions required for testing. Simone enables configuration, deployment, and execution of simulations within arbitrary domains throughout the Netflix environment.

Why Simone?

Testing and certifying Netflix apps on devices which talk to services in a cloud-based, distributed environment like Netflix can be hard and error-prone. Without Simone, a tester would need to coordinate a request sent by the Netflix app to the individual service instance where it might land, a process which is tedious and difficult to automate, especially at scale. Additionally, devices at the certification stage are immutable, and we cannot change their request behavior. So we need to simulate various conditions in the Netflix services in order to test the device. For example, we need to simulate the condition where a user has exhausted the maximum number of simultaneous screens allowed based on a subscription plan.

Simone gives service owners a generic mechanism to run “simulations” that:

– are domain-specific behaviors within their system,

– alter business logic, and

– are triggered at request time.

Simone also allows testers to certify devices against services deployed in a production environment. The implication of running in production is that there is a potential to adversely impact the customer experience. Simone is designed to minimize the blast radius of simulations and not introduce latency to normal production customer requests. The Architecture section will describe this further.

How does Simone work?

First, we will go over the main concepts of Simone. Then we will see how these concepts come together to provide a simulation workflow.

Concepts

Template: The simulation that a service owner exposes is encapsulated in a schema, which is called a Template. A template defines the override behavior and provides information on what arguments it accepts, if any, and under what conditions the override is triggered. Templates are domain specific; they are created and maintained by the service owners. Below is a snippet from a template used to force an error when retrieving a DRM license:
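As a rough illustration of the pieces a template carries (this is not Simone's actual schema; the class name, field names, and values below are all assumptions), a minimal Python sketch might look like this:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Template:
    """Hypothetical description of a simulation that a service owner exposes."""

    name: str
    description: str
    # Argument name -> whether the argument is required.
    arguments: dict[str, bool] = field(default_factory=dict)
    # Names of the request attributes a Variant is allowed to trigger on.
    triggers: tuple[str, ...] = ()

    def validate(self, arguments: dict[str, str], trigger: str) -> list[str]:
        """Return a list of problems; an empty list means the input is acceptable."""
        problems = []
        for name, required in self.arguments.items():
            if required and name not in arguments:
                problems.append(f"missing required argument: {name}")
        for name in arguments:
            if name not in self.arguments:
                problems.append(f"unknown argument: {name}")
        if trigger not in self.triggers:
            problems.append(f"unsupported trigger: {trigger}")
        return problems


# Illustrative template for forcing a DRM license error (all values are made up).
license_error_template = Template(
    name="force-license-error",
    description="Fail a DRM license request with the given error code",
    arguments={"errorCode": True},
    triggers=("ESN",),
)
```

Validating a Variant's arguments against the template at creation time is one way a server could reject malformed simulations before they reach any domain service.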

Variant: A Variant, an immutable instance of a template, is at the core of a simulation. When testers want to create a simulation, they create a Variant of a template, which defines the overridden behavior. The service then uses this Variant to provide a simulated response. Below is a sample Variant that tells the service to fail the license request for a playback. This simulates the “concurrent stream limit reached” scenario, in which a Netflix service plan allows only a limited number of concurrent playbacks.

The service which handles the request changes the response based on the arguments specified in the Variant. Each Variant has a set expiration strategy which indicates when a Variant expires. An expiration strategy is needed to control the number of requests a Variant can affect and to clean up unused Variants. Currently, only the execution count is supported, which means “evict this Variant after it has been executed the specified number of times”.
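The execution-count expiration described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Variant` fields and the `ExecutionCountExpiration` class are hypothetical names, and the remaining count is kept outside the Variant so that the Variant itself stays immutable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical immutable Variant; field names are illustrative only."""

    variant_id: str
    template_name: str
    error_code: str
    trigger: str
    trigger_value: str
    max_executions: int


class ExecutionCountExpiration:
    """Tracks remaining executions separately from the immutable Variant."""

    def __init__(self, variant: Variant) -> None:
        self.remaining = variant.max_executions

    def record_execution(self) -> bool:
        """Count one execution; return True when the Variant should be evicted."""
        if self.remaining <= 0:
            raise RuntimeError("variant already expired")
        self.remaining -= 1
        return self.remaining == 0


variant = Variant(
    variant_id="v1",
    template_name="force-license-error",
    error_code="CONCURRENT_STREAM_QUOTA_EXCEEDED",
    trigger="ESN",
    trigger_value="NFXXX-XXX-00000001",
    max_executions=2,
)
```

With `max_executions=2`, the first call to `record_execution()` returns `False` and the second returns `True`, at which point the Variant would be evicted.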

Trigger: Notice the trigger and trigger arguments specified in the Variant definition above. A Trigger specifies the conditions under which this Variant should be applied. In this case, when a DRM license request originates from a Netflix device which has the ESN “NFXXX-XXX-00000001”, the Variant will be applied. An ESN is a device’s electronic serial number, a unique identifier for the device that has the Netflix app installed.

Triggers are defined in such a way that a Variant has a very narrow scope, such as a device ESN or a customer account number. This prevents an incorrectly defined variant from inadvertently affecting normal production requests. Additionally, the trigger implementation adds minimal computation overhead during evaluation for each request and we are continuously looking for ways to reduce it.
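To make the "narrow scope" idea concrete, here is a small Python sketch of trigger validation and matching. The allowed field names and the wildcard check are assumptions chosen for illustration, not Simone's actual rules.

```python
# Hypothetical set of request attributes narrow enough to trigger on.
NARROW_TRIGGER_FIELDS = {"esn", "customerId"}


def validate_trigger(name: str, value: str) -> None:
    """Reject triggers that could match more than one device or account."""
    if name not in NARROW_TRIGGER_FIELDS:
        raise ValueError(f"trigger field not allowed: {name}")
    if not value or "*" in value:
        raise ValueError("trigger value must be one exact identifier")


def trigger_matches(name: str, value: str, request: dict) -> bool:
    """Exact-match comparison: one dict lookup and one string comparison."""
    return request.get(name) == value
```

Restricting triggers to exact matches on a single identifier keeps per-request evaluation cheap, and it means a badly defined Variant can affect at most one device or account.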

Architecture

Below is an architecture diagram of Simone. It is useful to understand the workflow of a Simone simulation.

Figure 1: Architecture diagram

At a high level, Simone consists of three main components, shown as highlighted blocks in the architecture diagram above:

– Simone server

– Simone client

– Simone Web UI

Simone server is a Java service that provides Create, Read & Delete operations for Variants and Templates. Testers create Variants on the server either via REST APIs or through the Simone Web UI. Simone server stores the Variant and Template data in Cassandra, which is replicated across multiple AWS regions so that testers don’t need to create Variants in each region. The server uses Apache Kafka to make Variants available to all instances of the domain service. The Kafka topic data is also replicated across the same AWS regions, using Apache MirrorMaker.

Simone client is the interface through which domain services interact with Simone server to perform the operations mentioned above. Simone client subscribes to a Kafka topic for Variant create & delete events and maintains them in an in-memory cache.

Simone Web UI provides the ability to create, view, and delete variants on Simone server. It also provides insights into the lifecycle of a variant and the underlying simulations.

Simulation Workflow

Figure 2: Workflow diagram

As shown in the workflow diagram above, when a Variant is created on Simone server, the server publishes a CREATE event with the Variant data to a dedicated Kafka topic. Simone client instances running within the context of domain services subscribe to this topic. When a Simone client gets the CREATE event about a Variant, it stores the Variant data in a local in-memory cache of created Variants. This way, when a production request hits any of these servers, Simone client does not need to make an external request to check whether that particular request has any overrides configured. This avoids adding significant latency to the request path.
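The client-side cache can be sketched as follows. This is a simplified Python illustration, not the real Simone client: events are plain dictionaries rather than Kafka records, the event and field names are assumptions, and the sketch keeps at most one Variant per trigger value.

```python
import threading


class VariantCache:
    """In-memory view of active Variants, fed by CREATE and DELETE events."""

    # Hypothetical request fields that a Variant may trigger on.
    TRIGGER_FIELDS = ("esn", "customerId")

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._by_trigger = {}  # (field, value) -> variant dict

    def on_event(self, event: dict) -> None:
        """Apply one event consumed from the Variant topic."""
        variant = event["variant"]
        key = (variant["trigger"], variant["triggerValue"])
        with self._lock:
            if event["type"] == "CREATE":
                self._by_trigger[key] = variant
            elif event["type"] == "DELETE":
                self._by_trigger.pop(key, None)

    def lookup(self, request: dict):
        """Return the matching Variant, or None, without any network call."""
        with self._lock:
            for field in self.TRIGGER_FIELDS:
                variant = self._by_trigger.get((field, request.get(field)))
                if variant is not None:
                    return variant
        return None
```

Because `lookup` is a handful of dictionary reads, a request that matches no Variant pays only a small, constant cost, which is the property the workflow relies on for normal production traffic.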

If the request matches the trigger parameters of a Variant, Simone client takes over and executes the template action for that request, which means running the simulation defined in that template. For example, “if a request comes in for this customer account number, send a different, overridden response instead of the regular response”. While executing the simulation, Simone client sends two important messages to Simone server: a synchronous CONSUME request and an APPLY event, both of which are published to Elasticsearch for later querying.

The CONSUME request indicates to the server that the client is ready to apply a variant. The server ensures that the variant is still valid before returning a successful response to the client. If the variant expiration is count based, Simone server decrements the count by one. This allows Simone server to honor the variant expiration set during its creation. When the variant count reaches zero, Simone server evicts the variant from its datastore and publishes a DELETE event to Kafka so that Simone client instances know to remove the variant from their local caches.
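The server side of CONSUME can be sketched in Python as below. The class and method names are hypothetical, the datastore is a plain dictionary standing in for Cassandra, and `publish` is any callable standing in for a Kafka producer.

```python
class SimoneServerSketch:
    """Illustrative CONSUME handling with count-based expiration."""

    def __init__(self, publish) -> None:
        self._remaining = {}     # variant id -> executions left
        self._publish = publish  # stand-in for publishing to the Kafka topic

    def create(self, variant_id: str, execution_count: int) -> None:
        if execution_count <= 0:
            raise ValueError("execution_count must be positive")
        self._remaining[variant_id] = execution_count
        self._publish({"type": "CREATE", "variantId": variant_id})

    def consume(self, variant_id: str) -> bool:
        """Return True if the Variant is still valid and may be applied."""
        remaining = self._remaining.get(variant_id)
        if remaining is None:
            return False  # unknown, deleted, or already expired
        remaining -= 1
        if remaining == 0:
            # Last allowed execution: evict and tell clients to drop it.
            del self._remaining[variant_id]
            self._publish({"type": "DELETE", "variantId": variant_id})
        else:
            self._remaining[variant_id] = remaining
        return True
```

Keeping the count on the server rather than in each client cache is what lets the limit hold across many service instances: every instance must ask before applying the Variant.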

The APPLY event is sent by Simone client upon successful completion of a simulated request. This is the end of the simulation workflow. Service owners can emit any domain specific logs or information along with this event and testers can consume it through Simone server.
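Putting the client steps together, the request path might look like the following Python sketch. All function and parameter names are hypothetical, and falling back to the normal handler when CONSUME is rejected is an assumption of this sketch rather than documented Simone behavior.

```python
def handle_request(request, cached_variant, *, consume, emit_apply,
                   normal_handler, simulate):
    """Serve one request, applying a simulation only when a Variant matches.

    cached_variant is the result of the local cache lookup (or None).
    consume, emit_apply, normal_handler, and simulate are injected callables.
    """
    if cached_variant is None:
        # Normal production traffic: no call to Simone server at all.
        return normal_handler(request)

    # Synchronous CONSUME: the server confirms the Variant is still valid.
    if not consume(cached_variant["id"]):
        return normal_handler(request)

    response = simulate(request, cached_variant)

    # APPLY marks successful completion of the simulated request.
    emit_apply({"type": "APPLY", "variantId": cached_variant["id"]})
    return response
```

Only requests that match a cached Variant ever pay for the round trip to Simone server.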

In order to increase the reliability of their tests, it is recommended that testers explicitly delete Variants created during the test instead of relying on the expiration strategy. When a Variant is deleted, Simone server publishes a DELETE event to the Kafka topic. Simone client instances, upon receiving this event, remove the variant from their caches.
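One way to follow this recommendation in test code is to wrap Variant creation in a context manager that always deletes the Variant, even when the test fails. The sketch below is in Python; `FakeSimoneClient` is an in-memory stand-in written for this example and does not reflect Simone's real REST API.

```python
from contextlib import contextmanager


class FakeSimoneClient:
    """In-memory stand-in for a REST client; not a real Simone API."""

    def __init__(self) -> None:
        self.variants = {}
        self._next_id = 0

    def create_variant(self, spec: dict) -> str:
        self._next_id += 1
        variant_id = f"variant-{self._next_id}"
        self.variants[variant_id] = spec
        return variant_id

    def delete_variant(self, variant_id: str) -> None:
        # Deleting an already-expired Variant is treated as a no-op here.
        self.variants.pop(variant_id, None)


@contextmanager
def simulation(client, spec: dict):
    """Create a Variant for the duration of a test and always delete it."""
    variant_id = client.create_variant(spec)
    try:
        yield variant_id
    finally:
        client.delete_variant(variant_id)
```

A test would then run its assertions inside `with simulation(client, spec) as variant_id:`, so that cleanup does not depend on the execution count being reached.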

This lifecycle can also be visualized in the Insights view of Simone Web UI as shown in Figure 4 below.

Simone Web UI & Insights

Simone Web UI provides users the ability to view existing Templates and associated metadata about those templates. Users can create, delete, and search for Variants through the Web UI.

Figure 3: Simone Web UI showing a list of templates

The Web UI also provides insights into the Variant lifecycle and the underlying simulation. In addition to the CONSUME and APPLY events mentioned previously, Simone server also publishes three other events to Elasticsearch: CREATE (when a variant is created), DELETE (when a variant is deleted), and RECEIVED (when a variant is received by a given Simone client instance). The RECEIVED event contains the AWS EC2 instance ID of the domain service, which is helpful in troubleshooting issues related to simulations.

Figure 4: Simone Web UI showing insights on lifecycle of a variant

How does Simone help?

Now that you have seen the details, let’s walk through our initial example of simulating the concurrent streams error using Simone, and how that helps testing and certification within Netflix.

A very simple but useful application of Simone is to force a service to return various types of application errors. For example, Netflix has different streaming plans that allow different maximum numbers of concurrent streams. A user on the 2 Streams plan is only allowed to watch on 2 devices simultaneously. Without Simone, a user would have to manually start playback on 2 devices and then attempt playback on a 3rd device to reproduce the error.

Simone allows a user to create a Variant to force all playback attempts for a device to fail the license request with a “CONCURRENT_STREAM_QUOTA_EXCEEDED”. Below is what that Variant would look like.

Figure 5: Create variant via Simone Web UI

Once this Variant is created, any playback attempt from the ESN “NFXXX-XXX-00000001” will fail with the error “CONCURRENT_STREAM_QUOTA_EXCEEDED”. The user will then see an error like the one below.

Figure 6: Simulated error

Concluding Thoughts

To sum up, our goal is to provide our members with the best possible Netflix streaming experience on their devices of choice. Simone is one tool that helps us accomplish that goal by enabling our developers and partners to execute end-to-end simulations in a complex, distributed environment. Simone has unlocked new use cases in the world of testing and certification and highlighted new requirements as we look to increase the testability of our services. We are looking forward to incorporating simulations into more services within Netflix. If you have an interest in this space, we’d love to hear from you!