Load Testing with Locust.io

Overview

With more and more developers utilizing cloud services for hosting, it is critical to understand the performance metrics and limits of your application. First and foremost, this knowledge helps you keep your infrastructure up and running. Just as important, it ensures you are not over-allocating resources and wasting money in the process.

This post will describe how Kongregate utilizes Locust.io for internal load testing of our infrastructure on AWS, and give you an idea of how you can do similar instrumentation within your organization in order to gain a deeper knowledge of how your system will hold up under various loads.

What Is Locust.io?

Locust is a code-driven, distributed load testing suite built in Python. Locust makes it very simple to create customizable clients, and gives you plenty of options to allow them to emulate real users and traffic.

Because Locust is distributed, it is easy to test your system with hundreds of thousands of concurrent users, and the intuitive web-based UI makes it trivial to start and stop tests.

How We Defined Our Tests

We wanted to create a test that would allow us to determine how many more web and mobile users we could support on our current production stack before we needed to either add more web servers or upgrade to a larger database instance.

In order to create a useful test suite for this purpose, we took a look at both our highest-throughput and slowest requests in New Relic. We split those out into web and mobile requests, and started adding Locust tasks for each one until we were satisfied that we had an acceptable test suite.

We configured the weights for each task so that the relative frequencies were correct, and tweaked the rate at which tasks were performed until the number of requests per second for a given number of concurrent users was similar to what we see in production. We also set weights for web vs. mobile traffic so that we could predict what might happen if a mobile game goes viral and starts generating a lot of load on those specific endpoints.

Our application also has vastly different performance metrics for authenticated users (they are more expensive), so we added a configurable random chance for users to create an authenticated session in our on_start function. The Locust HTTP client persists cookies across requests, so maintaining a session is quite simple.

How We Ran Our Tests

We have all of our infrastructure represented as code, mostly via CloudFormation. With this methodology we were able to bring up a mirror of our production stack with a recent database snapshot to test against. Once we had this stack running we created several thousand test users with known usernames and passwords so that we could initiate authenticated sessions as needed.

Initially, we just ran Locust locally with several hundred clients to ensure that we had the correct behavior. After we were convinced that our tests were working properly, we created a pool of EC2 Spot Instances running Amazon Linux for Locust. We knew we would need a fairly large pool of machines to run the test suite, and we wanted to use more powerful instance types for their robust networking capabilities. Load testing that many users has a real cost, and using Spot Instances helped us mitigate it.

In order to ensure we had everything we needed on the nodes, we simply used the following user data script for instance creation on our stock Amazon Linux AMI:
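The script itself is not reproduced in this excerpt; a minimal sketch of what such a user data script might contain (package names are assumptions for a stock Amazon Linux AMI, and the post's era would have used Python 2 and the `locustio` package):

```shell
#!/bin/bash
# Runs once at first boot via EC2 user data.
yum update -y
yum install -y gcc python3 python3-devel
pip3 install locust
```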

To start the Locust instances on these nodes we used Capistrano along with cap-ec2 to orchestrate starting the master node and getting the slaves to attach to it. Capistrano also allowed us to easily upload our test scripts on every run so we could rapidly iterate.
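Under the hood, those Capistrano tasks boil down to commands along these lines (flags from pre-1.0 Locust, contemporary with this post; Locust 1.0+ renamed `--slave` to `--worker`, and the master IP here is a placeholder):

```shell
# On the master node (serves the web UI and aggregates stats):
locust -f locustfile.py --master --host=https://your-staging-host

# On each slave node:
locust -f locustfile.py --slave --master-host=10.0.0.10
```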

Note: If you use EC2 for your test instances, you’ll need to ensure your security group is set up properly to allow traffic between the master and slave nodes. By default, Locust needs to communicate on ports 5557 and 5558.
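In CloudFormation terms, the ingress rule might look something like this self-referencing fragment (the resource names are hypothetical):

```yaml
LocustInternalIngress:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !Ref LocustSecurityGroup
    SourceSecurityGroupId: !Ref LocustSecurityGroup
    IpProtocol: tcp
    FromPort: 5557
    ToPort: 5558
```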

Test Iteration

While attempting to hit our target number of concurrent users, we ran into a few snags. Here are some of the problems we ran into, along with some potential solutions:
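The original list did not survive in this excerpt. As one representative example (our assumption, not necessarily Kongregate's exact list), nearly every large Locust run hits the per-process open-file limit, since each simulated user holds sockets open:

```shell
# Raise the open-file limit before launching Locust on each node;
# the common default of 1024 is exhausted almost immediately at scale.
ulimit -n 100000
locust -f locustfile.py --slave --master-host=10.0.0.10
```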

Outcome

After all was said and done, we ended up running a test with roughly 450,000 concurrent users. This allowed us to discover some improperly tuned Linux kernel settings that were causing 502 Bad Gateway errors, and revealed the breaking points of both our web servers and our database. The test also confirmed that we had made the right choices for the number of web server processes per instance, and for instance types.
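The post does not name the exact kernel settings. For illustration only, parameters commonly tuned when a proxy starts returning 502s under heavy connection load include the listen and SYN backlogs:

```shell
# Illustrative examples only; verify against your own workload.
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
```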

We now have a better idea how our system will respond to viral game launches and other events, we can perform regression tests to ensure that large features don’t slow things down unexpectedly, and we can use the information gathered to further optimize our architecture and reduce overall costs.

