
Earlier this year I met Alex Podelko and contributed a few comments to his blog. A few months later came the invite to speak at CMG’s Performance and Capacity conference (CMG Performance and Capacity 2014) about our take on Performance Engineering and Testing here at Netflix. Keeping in mind that one of our main goals here is to “move fast”, and that performance engineers sometimes struggle with a constantly changing environment like that, I decided to focus my talk on “How to Ensure Performance in a Fast-Paced Environment”. Here’s the full abstract:

Netflix accounts for more than a third of all traffic heading into American homes at peak hours. Making sure users are getting the best possible experience at all times is no simple feat and performance is at the core of this experience. In order to ensure performance and maintain development agility in a highly decentralized environment, Netflix employs a multitude of strategies, such as production canary analysis, fully automated performance tests, simple zero-downtime deployments and rollbacks, auto-scaling clusters and a fault-tolerant stateless service architecture. We will present a set of use cases that demonstrate how and why different groups employ different strategies to achieve a common goal, great performance and stability, and detail how these strategies are incorporated into development, test and DevOps with minimal overhead.

Since today most of my effort is around developing new performance-focused tools and techniques in order to be more productive, evangelize performance engineering and scale our efforts, it made sense to focus the presentation on new things we are developing. It took me a while (and many revisions) to get the presentation the way I wanted. As usual, I changed half the content the night before the event.

The overall feedback was really good. Better than expected, actually. I decided to go over a few things we do that are big no-nos in many large (and old) companies, and sometimes this is not well received. Attendees were really interested in the tools and how we leverage all of them to achieve great performance, especially Canary Analysis, the performance test framework, automated analysis, the Monkeys and Scryer. There were lots of great comments about the presentation itself, which was more “lively” than other presentations, and also about the content. They liked the fact that we do things differently from other organizations, think outside the box and develop things on our own.

I was also scheduled to participate in 3 panels. The first one was about new workloads, “Measuring New Workloads: Cloud Analytics, Mobile, Social”, and Elisabeth Stahl was hosting the session with Steve Weisfeldt from Neotys and me participating. The panel was really interesting and we had a lot of questions around AWS and how we run all* our streaming infrastructure there. There were also many questions on big data and how we leverage it to analyze user data and understand their behavior. Also, lots of questions around client devices and how we do real user monitoring (RUM) on them.

The second panel was “Modern Industry Trends and Performance Assurance”, hosted by Alex Podelko and with Mohit Verma (Tufts Health Plan), Steve Weisfeldt (Neotys), Ellen Friedman (MUFG Union Bank) and me as panelists. We had a great discussion around performance testing: when, why and how to test systems, automating performance tests and automated analysis, what can and cannot be automated, A/B testing, and the value of testing in production and leveraging real user load. Again, there were lots of questions around our take on performance testing, the tools and techniques, especially the test framework, and some questions around the size of our tests and environment. We are pushing the boundaries on performance testing and engineering, and learning along the way. It was clear that we are trying things that other organizations would not even consider, and that puts us in a great place for innovation. One interesting question we got was around automated analysis: what should be automated and what should not. My first response was, obviously, Automate All The Things! But for multiple reasons, that’s not really effective. I came up with a nice way of finding good candidates. If your test goal is to VALIDATE something, a pass or fail, that’s a great candidate for automation. If your test goal is to LEARN something about a system, that’s not a great candidate. What do you think?
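
To make the VALIDATE case concrete, here is a minimal sketch of a pass/fail check. It is only an illustration, not part of our test framework: the class name, the choice of the 95th percentile and the 10% tolerance are all assumptions.

```java
import java.util.Arrays;

/** Hypothetical pass/fail check: a run passes if its p95 latency stays within a tolerance of a baseline. */
final class RegressionCheck {

    /** Nearest-rank percentile (p in (0, 100]) of the given samples, in milliseconds. */
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when the current p95 is at most baseline p95 * (1 + tolerance). */
    static boolean passes(long[] baselineMs, long[] currentMs, double tolerance) {
        long base = percentile(baselineMs, 95);
        long cur = percentile(currentMs, 95);
        return cur <= base * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        // Made-up latencies, only to exercise the check.
        long[] baseline = {10, 12, 11, 13, 12, 14, 11, 12, 13, 40};
        long[] current = {11, 12, 12, 13, 13, 15, 12, 12, 14, 43};
        System.out.println(passes(baseline, current, 0.10) ? "PASS" : "FAIL");
    }
}
```

A check like this has a clear yes/no answer, so a machine can run it on every build. A LEARN-style test has no such threshold to compare against, which is why it still needs a person looking at the data.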

The last panel was around APM, “APM Tools and Technologies: What Do You Need?”, also hosted by Alex and with David Halbig (First Data), Craig Hyde (Rigor), Charles Johnson (Metron) and me as panelists. It was focused mostly on how to analyze, choose and buy APM tools, what they should include or not, and so on. I have to admit that I didn’t have a lot to add to the tool buying discussion, but I tried to point out how we tried a few different tools and none worked really well for us, for one reason or another, so we just decided to fill the gaps and build our own set of tools that would achieve the same goal, transactional and deep stack performance monitoring. I don’t like the idea of spending a lot of effort trying to make a tool work for us when we can create something on our own and make it adapt to us. We already have great monitoring tools in place, like Atlas, and we are creating others to give us more insight into user transactions and demand. Creating our own tools gave us the flexibility we needed to collect only what we need, from the right sources, and easily act on it, manually or in an automated fashion. It also allows us to consume the data the way we see fit and that makes sense for us. Obviously, such an endeavor doesn’t make sense for everyone. You need the scale to support it.

I also attended a few interesting sessions. Alex’s talk on load testing tools shed some light on the various aspects that should be taken into account when choosing a tool. Open source vs. commercial? Availability of experienced professionals? Protocols? Environment? Features? Kudos for mentioning many great open source tools. Another interesting session was Peter Johnson’s (Unisys) workshop-like CMG-T on Java. It was geared towards beginners, but had great content on Java tuning, especially Garbage Collection.

Besides all the presentations and panels, I met so many amazing people there, and had great conversations. I can’t mention everyone here, but I wanted to at least give a shout-out to Kevin Mobley, from CMG’s board. I think we share the same view around performance engineering and had a great chat about his vision for the future of CMG as a group and the conference. I’m happy to collaborate more in the future!

Were you there? What were your thoughts on the presentation and panels? Any interesting questions you would like to bring up for discussion? Just leave a comment!

p.s.: You can find references to the tools and articles in the slide deck. There are also a few backup slides with a few things I could not fit into the presentation.

I have very heterogeneous performance test use cases, from simple performance regression tests that are executed from a Jenkins node to occasional large-ish stress tests that run with over 100K requests per second and more than 100 load generators. With higher loads, many problems arise, like feeding data to load generators, retrieving results, getting a real-time view, analyzing huge data sets and so on.

JMeter is a great tool, but it has its own limitations. In order to scale, I had to work around a few of its limitations and created a test framework to help me execute tests at scale, on Amazon’s EC2.

Having a central data feeder was a problem. Using JMeter’s master node is impossible. A single shared data source might become a bottleneck, so having a way of distributing it was important. I thought about using a feeder model similar to Twitter’s Iago or a clustered, load balanced resource, but settled for something simpler. Since most tests only use a limited data set and loop around it, I just decided to bzip files and upload them to each load generator before the test starts. This way I avoided the problem of making an extra request to get data during execution and requesting the same data multiple times because of the loop. One problem with this approach is that I don’t have centralized control over the data set, since each load generator is using the same input. I mitigate that by managing the data locally on each load generator, with a hash function or introducing random values. I also considered distributing different files to different load generators based on a hash function, but so far, there was no need.
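
The local hash-based split mentioned above could look something like the sketch below. It is an illustration only, not the framework's code: the class and method names are made up, and it assumes every load generator knows its own index and the total number of generators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: every generator gets the same file and keeps only its own share. */
final class DataPartitioner {

    /** Maps a record to one of generatorCount buckets using its hash code. */
    static int bucketFor(String record, int generatorCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(record.hashCode(), generatorCount);
    }

    /** Returns only the records that belong to the given generator index. */
    static List<String> sliceFor(List<String> records, int generatorIndex, int generatorCount) {
        List<String> slice = new ArrayList<>();
        for (String record : records) {
            if (bucketFor(record, generatorCount) == generatorIndex) {
                slice.add(record);
            }
        }
        return slice;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("user-1", "user-2", "user-3", "user-4", "user-5");
        for (int i = 0; i < 2; i++) {
            System.out.println("generator " + i + " -> " + sliceFor(records, i, 2));
        }
    }
}
```

Because the split is a pure function of the record, no coordination between load generators is needed, at the cost of slices that are only roughly equal in size.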

Retrieving results was tricky too. Again, using JMeter’s master node was impossible because of the amount of traffic. I tried having a poller fetch raw results (only timestamp, label, success and response time) in real time, but that affected the results. Downloading all results at the end of the test worked, by checking the status of the test (running or not) every minute and downloading after completion, but I settled on having a custom sampler in a tearDown thread group, compressing and uploading results to Amazon’s S3. This could definitely be a plugin too. It works reasonably well, but I lose the real-time view and have to manually add a file writer and sampler to tests.

For the real-time view, I started with the same approach as jmeter-ec2, polling aggregated data (avg response time, rps, etc) from each load generator and printing that, but it proved useless with a large number of load generators. For now, on Java samplers, I’m using Netflix’s servo to publish metrics in real time (averaged over a minute) to our monitoring system. I’m considering writing a listener plugin that could use the same approach to publish data from any sampler. From the monitoring system I can then analyze and plot real-time data with minor delays. Another option I’m considering is using the same approach, but with StatsD and Graphite.
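
For the StatsD option, publishing a response time is little more than sending a small UDP datagram in the StatsD timer format (name:value|ms). The sketch below is not something I have in the framework: the class name, the metric name and the localhost:8125 target are all assumptions for illustration.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical fire-and-forget publisher for StatsD timer metrics. */
final class StatsdTimer {

    /** Builds a StatsD timer line, e.g. "some.metric:42|ms". */
    static String format(String metric, long millis) {
        return metric + ":" + millis + "|ms";
    }

    /** Sends one payload as a single UDP datagram; there is no delivery guarantee. */
    static void send(String host, int port, String payload) throws IOException {
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(data, data.length, InetAddress.getByName(host), port));
        }
    }

    public static void main(String[] args) throws IOException {
        // Metric name and target are placeholders; 8125 is the usual StatsD port.
        send("localhost", 8125, format("jmeter.sample.response_time", 42));
    }
}
```

UDP keeps the cost on the load generator low, which matters here since fetching results in real time was already shown to affect the test.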

Analyzing huge result sets was, I believe, the biggest challenge. For that, I’ve developed a web-based analysis tool. It doesn’t store raw results, but mostly time-based aggregated statistical data from both JMeter and monitoring systems, allowing some data manipulation for analysis and automatic comparison of result sets. Aggregating and analyzing tests with over 1B samples is a problem, even after constant tuning. Loading all data points into memory to sort them and calculate percentiles is practically impossible, simply because the amount of memory needed is impractical, even with small objects. For now, on large tests, I settled on aggregating data while loading results (second/minute data points) and accepting the statistical problems, like average of averages. Another option would be to analyze results from each load generator independently and aggregate at the end. In the future, I’m considering having results on a Hadoop cluster and using Map/Reduce to get the aggregated statistical data back.
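
The average-of-averages problem is easy to see with numbers. The sketch below uses made-up data and a hypothetical class, and is not how the analysis tool works: it just shows that if each data point keeps a count and a sum instead of only a mean, the overall mean can be recovered exactly after merging.

```java
/** Hypothetical per-interval aggregate that keeps count and sum so means merge correctly. */
final class IntervalStats {
    final long count;
    final double sumMs;

    IntervalStats(long count, double sumMs) {
        this.count = count;
        this.sumMs = sumMs;
    }

    double mean() {
        return sumMs / count;
    }

    /** Exact merge: counts and sums simply add up. */
    IntervalStats merge(IntervalStats other) {
        return new IntervalStats(count + other.count, sumMs + other.sumMs);
    }

    /** The naive approach: every interval weighs the same, no matter how many samples it holds. */
    static double averageOfAverages(IntervalStats... parts) {
        double total = 0;
        for (IntervalStats part : parts) {
            total += part.mean();
        }
        return total / parts.length;
    }

    public static void main(String[] args) {
        IntervalStats busy = new IntervalStats(1000, 10_000); // 1000 samples, 10 ms each on average
        IntervalStats quiet = new IntervalStats(10, 5_000);   // 10 samples, 500 ms each on average
        System.out.println("average of averages: " + averageOfAverages(busy, quiet));
        System.out.println("weighted mean:       " + busy.merge(quiet).mean());
    }
}
```

With these numbers the naive value is 255 ms, while the mean over all 1010 samples is just under 15 ms. Percentiles do not merge this way, so they remain approximate unless something like a histogram is kept per interval.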

The framework also helps me automate most of the test process, like creating a new load generator cluster on EC2, copying test artifacts to load generators, executing and monitoring the test while it’s running, collecting results and logs, triggering analysis, tearing down the cluster and cleaning up after the test completes.

Most of this was written in Java or Groovy and I hope to open-source the analysis tool in the future.

The default behavior of cURL is GET, but you can do POST, DELETE, PUT and more complex requests. If you’re not familiar with cURL, the best place to start is the man page.

Besides “time_total”, curl also provides other timings, like “time_namelookup”, “time_connect”, etc. Checking a post by Joseph, I remembered that curl supports formatted output. This way we can create a “template” for our HTTP timing test:

The Problem

You have a test. You run a few dry-runs, everything is looking great. Then you decide to scale up things and execute a real-life test. That’s when things start to go the wrong way. JMeter crashes and you have no idea why. I’ll give you two real-life examples that happened to me not too long ago.

The first one was a really high throughput, low latency scenario. Messages were not too large, but the size was significant enough for the clients to require a gzip response. I created a quick test plan to simulate that and started the test. The first thing I noticed was that the throughput was way lower than what I was expecting. Time to troubleshoot. I checked the service and everything looked good. Resource utilization was low, dependency response times were low and in-container time was low too. It was a bit odd, since it was just a simple HTTP sampler and the load generator was a relatively large instance on AWS, but let’s check the client.

Bingo! CPU usage was at peak. That’s strange for a 26 ECU instance, especially when the resource utilization on the target instance was significantly lower. I scratched my head a few times, ran a few tests, took a few thread dumps and came to a conclusion. All the CPU time was being spent decompressing something: the HTTP response.

That makes sense. I actually added the following header to the request, in order to get gzip responses:

Accept-Encoding: gzip

So JMeter was spending a huge amount of time decompressing that response, a response that I don’t really care about.

I’ll get to the solution later, but first, the second example.

The second example happened to me this week. A slightly more complex scenario with probably a dozen thread groups. Again, everything ran just fine with a lower load dry-run, but increasing the load for the actual test caused JMeter to crash. This one was easier to figure out, since a nice error message was printed to jmeter.log:

2013/11/16 00:15:41 ERROR - jmeter.threads.JMeterThread: Test failed! java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.readResponse(HTTPSamplerBase.java:1658)
at org.apache.jmeter.protocol.http.sampler.HTTPAbstractImpl.readResponse(HTTPAbstractImpl.java:235)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.sample(HTTPHC4Impl.java:300)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerProxy.sample(HTTPSamplerProxy.java:62)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1088)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1077)
at org.apache.jmeter.threads.JMeterThread.process_sampler(JMeterThread.java:428)
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:256)
at java.lang.Thread.run(Thread.java:722)

That’s strange, a 16Gb heap could not be filled so easily. Anyway, let’s bump it to 28Gb. After a few minutes, bam, same thing, OutOfMemory!

I started checking the usual suspects. I had a couple of Groovy and BeanShell scripts to execute a few tasks before and after the test, but they shouldn’t be part of the actual test loop. Either way, I double-checked everything, converted the Groovy scripts to BeanShell (I’ve had my fair share of Groovy-related problems with JMeter) and tested it again. Same deal. This time though, I decided to take heap and thread dumps when things started to get bad.

First, the heap dump. Nothing conclusive, but strangely, 99.8% of all memory was byte[], tracing back to JMeter’s classes.

I was not expecting JMeter to have a leak like that, so I dismissed the fact, ruled it as inconclusive and went ahead to check the thread dump. Thread dump was even more interesting. Pretty much all threads were stuck at:

parking to wait for java.util.concurrent.locks.AbstractQueuedSynchronizer

And belonged to the same thread group. I had one synchronized timer that was being used to generate bursts every 10 minutes. Interesting. Decided to remove that thread group and execute a test.

Surprisingly, the whole test ran smoothly, simply by removing the thread group with the synchronized timer. So let’s check what’s inside that thread group. Not much, I’m afraid. A Test Action Sampler, that was used to sleep the execution for 10 minutes. A loop controller, that looped a single HTTP Request a couple of times. The HTTP Request had a single header added, the famous:

Accept-Encoding: gzip

That’s interesting. Let’s check the response size, 12.3Mb. That’s a lot, but the transfer rate is quite fast between AWS nodes. Not a problem. 12Mb times 60 threads is not too much either, but wait, that’s the compressed size. Let’s check the actual size. It took me quite a bit of time to download the entire message, but here it is:

Wow! 625Mb! Well, 625Mb times 60 threads being fired exactly at the same time, that’s roughly 37Gb just for that, excluding all other thread groups. Unacceptable!
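
The back-of-the-envelope math is simple enough to write down. The sketch below only restates the numbers from this post (12.3Mb compressed, 625Mb decompressed, 60 threads, 28Gb heap); the class and method names are made up for illustration.

```java
/** Hypothetical helper that restates the memory arithmetic from this post. */
final class HeapEstimate {

    /** Megabytes held in memory if every thread buffers one full response at the same time. */
    static double totalMb(double responseMb, int threads) {
        return responseMb * threads;
    }

    /** How many times larger the decompressed response is than the compressed one. */
    static double ratio(double decompressedMb, double compressedMb) {
        return decompressedMb / compressedMb;
    }

    public static void main(String[] args) {
        double heapMb = 28 * 1024; // the larger heap tried earlier in the post
        System.out.println("compressed, all threads:   " + totalMb(12.3, 60) + " MB");
        System.out.println("decompressed, all threads: " + totalMb(625, 60) + " MB");
        System.out.println("compression ratio:         " + ratio(625, 12.3));
        System.out.println("fits in heap:              " + (totalMb(625, 60) <= heapMb));
    }
}
```

The compressed payload for all 60 threads fits in well under 1Gb, while the decompressed one is around fifty times larger and cannot fit in the heap no matter how the rest of the test behaves.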

The Solution

You have a service that usually delivers a gzip response to client over HTTP. You want to simulate that behavior using JMeter, so logically you decide to add a header to the request, the magical:

Accept-Encoding: gzip

That’s when everything goes the wrong way and you start having problems like the ones I mentioned above. So how to solve this? I looked for an option to disable message decompression in JMeter, but no luck, besides changing the sampler code itself, which is something I would like to avoid. So I decided to go the easy route and create a simple Java sampler that would do the same thing as JMeter’s HTTP Request. I like httpclient4, so let’s use it to get the test going. At the moment, I’m using JMeter 2.9, which already contains Apache’s httpclient 4.2.3. Here is the sampler code I used:

It’s a bit rough, but as you can see, I just created a simple httpclient, a GET request and added the necessary header information. Checked a few things, like response code and size and returned a sampleResult with that information. Really simple, but the trick is exactly at:

EntityUtils.consume(entity);

I’m consuming the message, meaning I’m downloading it but not doing anything with it after that. So no decompressing, no huge CPU utilization and no OutOfMemoryError. One thing to keep in mind is that since the response is not being decompressed, no response is being returned by the sampler. That means no response parsing. But that’s not a problem for me, since the only thing I care about is the response code.
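
The idea of consuming a body without keeping it can be shown with plain JDK classes. To be clear, this is not the sampler described above and it does not use httpclient: it is a sketch with a made-up class name that reads a stream through a small fixed buffer and only counts the bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: reads a stream to the end without storing or decompressing it. */
final class DiscardingReader {

    /** Drains the stream through one reusable 8 KB buffer and returns the number of bytes read. */
    static long drain(InputStream in) {
        byte[] buffer = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n; // bytes are counted, then overwritten by the next read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for a response body; a real test would pass the raw HTTP entity stream.
        InputStream body = new ByteArrayInputStream(new byte[1_000_000]);
        System.out.println("bytes consumed: " + drain(body));
    }
}
```

Memory use stays at the size of the buffer regardless of how big the response is, and since the raw bytes are never passed through a gzip stream, no CPU is spent on decompression.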

Just tested it using the same sampler that was having OutOfMemory issues before and problem solved!

I’m also planning to put this and a few other things I’ve created for JMeter into a plugin, so next time I face the same problem, I can just reuse the same sampler!

Not too long ago, during a server migration, I decided to move from Apache to Nginx, mostly because I thought it would be a fun weekend project and maybe a little bit because I was looking to save some memory on my VPS. After a few hiccups, Nginx was running smoothly, serving the blog and a few other things.

Hearing Steve Souders (@Souders) and Ilya Grigorik (@igrigorik) from Google talk about PageSpeed at Velocity Conference made me think about it again. I had the PageSpeed module configured on Apache, but didn’t bother to set it up on Nginx after the migration.

Basically it does a lot of cool stuff, like minifying and combining JavaScript and CSS files, cache optimizations, inline resources, etc, everything directly on the web server layer, so you don’t have to worry about it at the application layer.

A complete list of filters that can be used with PageSpeed can be found at:

The only detail I would like to bring up is regarding other Nginx binaries on the same server. Since the install process involves downloading and compiling Nginx, not downloading a binary from a repository, you will probably end up with multiple copies on your server. To avoid confusion, I prefer to remove repository versions before starting the setup.

After compiling and installing a new Nginx with PageSpeed, you will also need to update your nginx.conf and vhosts. Again, everything is described in detail in the README file.