Play! 2.3 Template Improvements

August 5, 2014 · matthew

My thanks go out to Grant Klopper from The Guardian. Last week, he validated a change I made to the Play! Framework back in December. After upgrading from Play! 2.2 to 2.3, Grant noticed dramatic changes in response times and memory – both for the better. The changes were so awesome that Grant was able to shut down 2/3 of the servers running that app. A conversation on Twitter followed with James Roper, a lead developer at Typesafe, wherein the root cause for the improvements was discovered – my template fixes. A subsequent blog post from Typesafe was released about the story, thanking the contributors and soliciting further enhancements to their code base.

Response time of The Guardian website during upgrade to Play 2.3.
Copied from Grant’s Tweet about response times mentioned above.

Without bragging about my awesomeness, I wanted to explain how I found and fixed the issue and then share some benchmarks of my own. Perhaps in this way, I can also solicit enhancements to the Play! framework, which has been a tremendous help in building state-of-the-art applications on the web.

Meet the Issues

Originally, I found the problem out of necessity. As Lucidchart and Lucidpress moved to a more service-oriented architecture, a user service was released. One of its API calls retrieves all the users on an account. Some of our accounts are rather large, reaching into the tens or even hundreds of thousands of users. Every time this endpoint was called for one of our large accounts, the associated servers would strain on CPU and memory, and both response times and customers suffered for it.

Another templating issue could be triggered by just a single item, though that item had to be very large. We found this issue in our cache and PDF generation services. We had been aware of it for quite some time, but were willing to ignore it because those particular services had some tolerance on response times.

These two issues were the focus of my investigation to make Play! templates faster. Long story short, my bug was not a focus for the Play! team. They were working on some other important things, so they recommended I work on it. So I did.

The Results

Let me jump to the end: the results. If you haven’t already, I recommend taking a look at the graphs posted by Grant (at the top of this post). I do have response time stats from benchmarking on my local machine, but I didn’t keep the statistics from our production release. His are a better representation of a production system than what I will post here.

The benchmark project is on GitHub. To benchmark, run the same code against different versions of the Play! framework. The following graphs were all created on the same machine, with these properties:

Lenovo ThinkPad W520

SSD drive

12GB RAM

No other programs running

Using `play stage` and then `./target/start` instead of the `play run` command

In both graphs, the blue columns are Play! 2.1.1 with my changes. The red columns are the Play! 2.1.1 official release.

This first graph shows the time to compute an XML response with a single large item. The speed improvement is about 20% across the board.

This second graph shows the time to compute an XML response with many small items. The speed improvement ranges from 50% to 90%.

Unfortunately, I do not have the graphs of memory, CPU, and response times when we deployed these fixes to Lucid’s servers. There was a noticeable difference in resource utilization, though.

The Fixes

My pull request details the nitty gritty. To boil it down, there were a couple issues:

Every template creates a StringBuilder to render content. This doesn’t sound too bad until you start nesting templates. Every layout template, helper function, and some loops would create new templates. To render the top template (the one for the HTTP response), tens or even hundreds of StringBuilders would be created, and they would all copy the same data.

The solution here was to create a single StringBuilder at the top level, and then pass it down to all child elements for rendering. This eliminated both the extra memory requirements and the terrible performance of copying megabytes of data.
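To make the difference concrete, here is a minimal sketch of the two approaches. It is not Play's actual code: the Node, Text, and Template types and the render functions are hypothetical names invented for illustration.

```scala
// Hypothetical model of nested templates; these are not Play's real classes.
sealed trait Node
final case class Text(value: String) extends Node
final case class Template(children: List[Node]) extends Node

object Rendering {
  // Before: every template allocates its own builder, and its finished
  // string is copied again into each enclosing template's builder.
  def renderCopying(node: Node): String = node match {
    case Text(value) => value
    case Template(children) =>
      val sb = new StringBuilder
      children.foreach(child => sb.append(renderCopying(child)))
      sb.toString
  }

  // After: the caller supplies one builder, and every nested template
  // appends straight into it, so each piece of text is appended once.
  def renderInto(node: Node, out: StringBuilder): Unit = node match {
    case Text(value) =>
      out.append(value)
      ()
    case Template(children) =>
      children.foreach(child => renderInto(child, out))
  }

  // Top-level entry point: the only place a builder is created.
  def renderShared(node: Node): String = {
    val out = new StringBuilder
    renderInto(node, out)
    out.toString
  }
}
```

Both functions produce the same output. In the copying version, text at nesting depth n is copied roughly n times on its way to the top, which is where the wasted memory and CPU come from.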

Many Seq[Any] objects were being created for no good reason. The template compiler, when it found a block of non-Scala content in a template, would create a Seq with just a single element: the string of content. The Seq would be passed to _display_, which would unwrap the Seq and call _display_ on each element.

By modifying the template compiler to insert _display_("<content>") instead of a sequence of the same content, we saved the instantiation of the Seq, as well as the unneeded function call to _display_.
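The before and after shapes of the generated code can be sketched roughly as follows. This is an illustration under assumptions, not the real generated code: display is a simplified, hypothetical stand-in for _display_, and renderWrapped and renderDirect are made-up names.

```scala
object DisplaySketch {
  // Simplified, hypothetical stand-in for the template runtime's _display_.
  def display(out: StringBuilder, value: Any): Unit = value match {
    case items: Seq[_] => items.foreach(item => display(out, item))
    case other =>
      out.append(other.toString)
      ()
  }

  // Before: static text is wrapped in a one-element Seq, which costs an
  // allocation plus a second display call to unwrap it.
  def renderWrapped(out: StringBuilder): Unit =
    display(out, Seq[Any]("<content>"))

  // After: the string is passed directly, with no Seq and a single call.
  def renderDirect(out: StringBuilder): Unit =
    display(out, "<content>")
}
```

The saving per block is tiny, but a template with thousands of static fragments pays it thousands of times per response.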

There were a few other minor changes that helped a little:

Overloading the _display_ function to speed up the processing of rendered content

Specifying the 'empty' element inside each format (HTML, XML, etc.)

Building the tree structure of content from Seq using a subtree instead of appending to another Seq

There’s no doubt that some of my changes made the code uglier. Who wants 6 _display_ functions, when one of them handles the other 5? Why have an extra function to handle the case of 0-1 items in a list, when an empty list would do the same thing?
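As a rough illustration of that trade-off, the sketch below shows why a redundant overload can still pay off. It assumes a simplified display function with only two overloads; the names and signatures are hypothetical and do not match Play's actual _display_ variants.

```scala
object OverloadSketch {
  // Fast path: when the compiler already knows the argument is a String,
  // it binds the call here and no runtime type check is needed.
  def display(out: StringBuilder, text: String): Unit = {
    out.append(text)
    ()
  }

  // General path: handles every case, including the one above, but has to
  // inspect the runtime type of each value it receives.
  def display(out: StringBuilder, value: Any): Unit = value match {
    case text: String  => display(out, text)
    case items: Seq[_] => items.foreach(item => display(out, item))
    case other         => display(out, other.toString)
  }
}
```

The Any overload alone would be enough for correctness, which is exactly why the extra ones look ugly. The specific overloads exist only so that the most common calls skip the pattern match.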