Capacity Planning: Auto-Scaling for Large Loads

In AppScale, we call each App Engine application instance an AppServer. It’s the unit we use to scale the application. Applications need at least one AppServer running to serve requests. Adding more AppServers increases application performance and redundancy.

AppScale adds or removes AppServers as load dictates. This auto-scaling is transparent to the application and to the App Engine developer, aside from the App Engine mechanisms that control and drive scaling.

AppScale uses HAProxy as the load balancer for the AppServers of each running application, and it uses HAProxy statistics to monitor the load that each application sustains. The AppScale auto-scaler tracks the load on each application and makes scaling decisions for each application independently. Starting with AppScale 3.3, we changed the way we trigger the scaling (up and down) of AppServers. The change lets applications respond more quickly to bursts of requests.
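To make the monitoring step concrete, here is a minimal sketch of reading load figures from HAProxy's CSV statistics output. The column names (pxname, svname, qcur, scur) are standard HAProxy stats fields, but the proxy name, the sample rows, and the choice to read the BACKEND row are assumptions made for illustration, not a description of AppScale's actual code. Real HAProxy output has many more columns than this truncated sample.

```python
import csv
import io

# Illustrative, truncated sample of HAProxy CSV stats (real output has more columns).
# The proxy name "gae_guestbook" and all values are made up for this sketch.
SAMPLE_STATS = """# pxname,svname,qcur,qmax,scur,smax,slim,stot
gae_guestbook,FRONTEND,,,12,40,2000,5120
gae_guestbook,gae_guestbook-10.0.0.5:20000,0,3,6,7,7,2560
gae_guestbook,gae_guestbook-10.0.0.6:20000,0,2,6,7,7,2560
gae_guestbook,BACKEND,2,9,12,14,200,5120
"""


def backend_load(stats_csv: str, proxy_name: str) -> dict:
    """Return current sessions and queued requests for one proxy's BACKEND row."""
    # HAProxy prefixes the header line with "# "; strip it so DictReader sees clean names.
    reader = csv.DictReader(io.StringIO(stats_csv.lstrip("# ")))
    for row in reader:
        if row["pxname"] == proxy_name and row["svname"] == "BACKEND":
            return {
                "current_sessions": int(row["scur"] or 0),  # requests being served
                "queued": int(row["qcur"] or 0),            # requests waiting
            }
    raise KeyError(f"no BACKEND row for proxy {proxy_name!r}")


print(backend_load(SAMPLE_STATS, "gae_guestbook"))
```

The two numbers returned here correspond to the "current sessions" and "current queue" series discussed in the test drive further down.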

Capacity Planning

The original auto-scaling policy monitored the queued requests of each application. It added AppServers only when there were queued requests, and it removed AppServers when there were no queued requests. This mechanism worked fairly well for a variety of loads, and it was simple to understand and tune. However, it proved slow to react to large bursts in load. In these situations, it would take a few minutes to scale, for example, from 10 to 1,000 AppServers.

In 3.3 we switched to what we call Capacity Planning auto-scaling. With it, AppScale monitors the load capacity of an application. This capacity is calculated as the number of AppServers the application currently has, multiplied by the number of requests each AppServer can handle at a time. By default, we configure AppServers to handle seven requests at a time (for threadsafe applications).

With this new method, AppScale uses low and high watermarks, expressed as percentages of the application's current capacity, and adds or removes AppServers to keep the load between them. The defaults are 90% for the high watermark and 70% for the low watermark. When the current load falls outside the configured watermarks, AppScale scales up or down to the number of AppServers that brings the load to 80% of capacity.
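The watermark rule can be sketched in a few lines of Python. This is a minimal illustration built only from the defaults quoted above (7 requests per AppServer, 90%/70% watermarks, 80% target); the function and constant names are hypothetical, and the real auto-scaler may handle boundaries and rounding differently.

```python
import math

# Defaults quoted in the text; the names are invented for this sketch.
REQUESTS_PER_APPSERVER = 7
HIGH_WATERMARK_PCT = 90
LOW_WATERMARK_PCT = 70
TARGET_PCT = 80


def desired_appservers(current_appservers: int, current_load: int) -> int:
    """Return how many AppServers the application should have for this load."""
    capacity = current_appservers * REQUESTS_PER_APPSERVER
    # Compare in integer percent units to avoid floating-point edge cases.
    above_high = current_load * 100 > HIGH_WATERMARK_PCT * capacity
    below_low = current_load * 100 < LOW_WATERMARK_PCT * capacity
    if not (above_high or below_low):
        return current_appservers  # load is inside the band: no change
    # Outside the band: size the pool so the load sits at the target percentage.
    needed = math.ceil(current_load * 100 / (REQUESTS_PER_APPSERVER * TARGET_PCT))
    return max(1, needed)  # an application needs at least one AppServer


print(desired_appservers(10, 60))  # inside the 49..63 band -> 10
print(desired_appservers(10, 64))  # just above the high watermark -> 12
print(desired_appservers(10, 48))  # just below the low watermark -> 9
```

Note that crossing a watermark does not add or remove a single AppServer; it jumps straight to the pool size that corresponds to the target load, which is what makes the policy quick to react to bursts.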

A Simple Example

A threadsafe application currently running 10 AppServers can serve a maximum of approximately 70 concurrent requests without delays. If the load passes the high watermark (63 requests in this case), AppScale will start new AppServers for the application. If the load decreases to less than 49 requests, AppScale will terminate unused AppServers. In this example, if the application is currently serving 60 requests (between the default watermarks), and suddenly 60 more requests arrive, AppScale will start as many AppServers as needed to reach the desired 80% load capacity. At 80% load each AppServer handles 5.6 requests, so 120 requests need 120 / 5.6 ≈ 21.4, rounded up to 22 AppServers total. Hence AppScale will start 12 extra AppServers.

Similarly, if the application load then drops to, say, 25 requests, AppScale will scale back to 5 running AppServers (25 / 5.6 ≈ 4.5, rounded up), thus terminating 17 unused AppServers. During down-scaling, there is also a redundancy setting to take into account: AppScale will always keep the minimum number of AppServers that the user configures, to avoid service interruptions in case of node failures. If unconfigured, this minimum is 2 AppServers.
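The arithmetic in this example, including the minimum kept during down-scaling, can be replayed with a short script. It is a sketch under the defaults stated above; the names are hypothetical, and it assumes each load value already falls outside the watermarks, so a scaling decision is triggered every time. The final load of 3 requests is an extra case added here to show the minimum taking effect.

```python
import math

# Defaults quoted in the text; the names are invented for this sketch.
REQUESTS_PER_APPSERVER = 7
TARGET_PCT = 80
DEFAULT_MIN_APPSERVERS = 2  # used when the user configures no minimum


def target_appservers(load: int, min_appservers: int = DEFAULT_MIN_APPSERVERS) -> int:
    """AppServers needed to serve `load` at the target percentage, never below the minimum."""
    needed = math.ceil(load * 100 / (REQUESTS_PER_APPSERVER * TARGET_PCT))
    return max(min_appservers, needed)


running = 10
for load in (120, 25, 3):
    target = target_appservers(load)
    print(f"load={load:>3}: {running} -> {target} AppServers ({target - running:+d})")
    running = target
# load=120: 10 -> 22 AppServers (+12)
# load= 25: 22 -> 5 AppServers (-17)
# load=  3: 5 -> 2 AppServers (-3)
```

With a load of 3 requests a single AppServer would be enough, but the default minimum of 2 keeps a second one running for redundancy.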

Test Drive

Let’s take the new autoscaling system for a test drive. We will use an application that is part of our regression testing suite, and we’ll load it from 0 to over 1,500 requests/s. The application starts with a single AppServer. We ran this load test on our QA system, which is a private Eucalyptus cloud at AppScale headquarters. We cap the test deployment at 25 medium-sized virtual machine instances.

In the following graphs, we show requests/s, which is the rate of client requests at any time; current sessions, which are the requests being served at that time; current queue, which are the requests waiting to be served; and estimated session capacity, which is the autoscaler's estimate of capacity. The low and high watermarks show as a band of color around the estimated session capacity.

The graph shows the autoscaler working to keep the current sessions within the low and high watermarks (visible around the estimated capacity) to ensure all requests are served in a timely fashion. Also note the queued requests: as the load grows quickly, the autoscaler reacts, but starting from a single AppServer leaves very little capacity to adapt, so some requests get queued while new AppServers are started. Aside from the very beginning, there are no queued requests. Finally, as the load subsides, the estimated capacity (and number of AppServers) tracks the reduction in session count, terminating idle resources until the test completes.

This graph shows the number of AppServers running and pending (that is, AppServers requested by the autoscaler but not yet running). There is a steady climb in the number of AppServers to cover the growing load. Once the load subsides, the number of AppServers decreases.

As mentioned above, the previous auto-scaling policy worked well, so most of our customers will see no noticeable change in behavior. Applications that experience large bursts in load (several times the base load), however, will benefit from this change: the AppScale auto-scaler now reacts to them much faster.