Sunday, July 17, 2016

Your Local Coffee Shop Performs Resource Scaling

Ever since I moved to Canada, about 8 years ago, I became an avid Starbucks customer, primarily, because it was one of the few places, where I could find a decent iced coffee. As a Greek, I was bound by destiny and tradition to keep drinking iced coffee (asking for a cold beverage in -35℃ in Alberta, it was funny to look at baristas rendered speechless). When I moved to York University for my postdoc and found the closest Starbucks to initiate my everyday routine, Starbucks happened to launch the “Mobile Order & Pay” feature. At the same time, with the new group at CERAS lab, I got better acquainted with Cloud Computing and with such concepts as Self-Adaptive Systems and Cloud Elasticity. Given these two facts, one morning I was waiting in a rather long line at Starbucks, when I noticed one of the employees coming out with a cart full of empty cups and taking orders from the people in the line. I also noticed that baristas are highly efficient and multi-tasking when crafting the beverages, but customers are rather slow in comparison when ordering or paying. “Here is resource scaling and process adaptation in practice!”, I thought. The employees noticed the delay in ordering and they decided to speed up the process, parallelized it with paying and took advantage of the much faster crafting process.Stepping back a few years, back when I started my graduate studies at the University of Alberta, my then supervisor, Dr. Eleni Stroulia, in order to introduce us to web services and processes recommended us to read Gregor Hohpe’s paper “Your Coffee Shop Doesn’t Use Two-Phase Commit” in IEEE Software Design [1]. In that paper, the author makes a simile of how Starbucks executes orders (the choice of coffee shop is completely coincidental, I swear!) with software processes, fault tolerance and rollback. In this post, I will make a similar attempt again using Starbucks as my example to explain how resource scaling in cloud works. The post is split in four parts where I lay out the details on resources and processes (part 1) as employed by the Starbucks system and the equivalent web software system on cloud, monitoring and analysis of performance metrics (part 2), planning and execution of scaling and adaptive actions (part 3) and the economics of scaling (part 4) in both systems.

1. Processes, Resources and Topologies

Starbucks is primarily a service, which means that in the center of its processes are people and human tasks. The people that participate in the service are the customers, who issue the orders and pay, and the employees, who can be distinguished in tellers and baristas (i.e., the ones who prepare the beverages or other orders). From a system’s perspective, the customers provide the input to the service and the baristas along with special equipment and raw material (coffee, milk etc.) are the resources with which the requests are executed and served.

Figure 1. The Starbucks system and the flow of orders.

Figure 1 shows the overall flow of the Starbucks system. As the clients start their interaction with the system, they enter a queue. The first interface of the system is the cash register. At this point, a client may issue an order. The order can be anything or any combination between hot or cold beverages, food and dessert items or packaged goods. Different orders may require different processes, different equipment and obviously different preparation time. This is an advantage for the baristas since they can parallelize several orders and speed up the whole order process. For example, some hot beverages need either or both the espresso machine and the milk steamer, while cold beverages may need the blender. For many of these drinks the set of required equipment may be completely independent which allows the baristas to execute them simultaneously. The same assumption holds for drinks and food items, as the latter may need to be heated. Some orders may be so simple that can be executed on the spot by the cashier, including brewed coffee, tea or some packaged items. In such orders, the wait time is almost insignificant. The orders are received and executed in a “first-in-first-out” basis. However, due to the variation in execution time, a customer may receive an order later than another, even though it may have been placed earlier than the latter. An interesting characteristic of the Starbucks system concerning its human resources is that there is relatively little training involved and as a result every employer can assume any role at any time with little or no impact to the process. This additional flexibility allows the system to reassign its resources to address its needs as they appear, e.g. assign more baristas or assign more tellers to take orders, given the equipment restrictions.Once the order has been placed, payment must be received. As with the orders, payment also comes in many forms, which can also affect the time in which the payment will be processed. For example, cash or Starbucks cards are quickly processed, while credit cards and debit cards take more time, even without the odd failure. In general, once the order has been placed, its execution starts immediately and the payment has been usually processed before the drink has been prepared. This means that the client may need to wait more before he or she receives the end product of the order, depending also on the backlog of the orders that has been accumulated by the time the order was placed.

On the Cloud…

Using the Starbucks system as basis, I will explain in similar terms how software services operate using cloud resources. One basic difference is that such systems do not rely as heavily on humans, as most operations are automatic (or at the very least interactive) and executed by software. Nevertheless, the input can come either from other software systems or from humans (as the Starbucks clients). The difference is that software clients have much higher capacity than humans in issuing requests as they have little think time; once a response is received it can be quickly processed and a new request may be issued immediately. Other than that, the systems are basically similar; we have requests coming in (like orders), the requests may be of different nature, thus requiring different resources and taking variable time to be processed, and the clients remain in a queue, while waiting for their requests to finish.

Figure 2. The topology of a web software system on the cloud and the flow of requests.

Figure 2 shows the equivalent system of a web application deployed on a cloud, along with the flow of request processing. The system we are considering is a simple three-tier architecture, where the clients issue requests through an interface, the requests are dispersed by a load balancer to copies of the application in a number of application servers, where they are processed, and, if there is a need, a database is accessed to fetch or store data. The load balancer serves as the queue of the Starbucks system. In design time, we may set the load balancer to distribute requests in a generic manner (e.g. round robin, or by busyness) or be more sophisticated and distribute the requests to specific server clusters according to their individual demands for resources (e.g. CPU, memory, disk etc.). The latter case, which is closer to the Starbucks scenario with multiple types of orders, has interesting and important economic extensions, which we will discuss in Part 4. Exactly like the Starbucks system, requests affect each other as they take up resources, which may lead to delays and longer queues.Unlike the Starbucks system, it is not as easy or seamless to repurpose resources and assign them a different task on runtime; an application server cannot become a database server with the snap of a finger. However, to compensate for this challenge, cloud environments offer additional flexibility in commissioning and decommissioning resources. Since we deal with virtual and not physical resources on the cloud layer, we can easily boot up a new server and have it working in matter of minutes or even seconds. When we no longer need it, we can stop it without affecting the functionality or the performance of the overall system. This strategy is not equally possible in the Starbucks system, because we cannot hire or fire people on the spot for a very short period, neither can we call in an employee in an instant during rush hour. We will further discuss this special ability of cloud computing in Part 3.

2. Monitoring and analysis

When considering the quality of service, we need to pinpoint those metrics, which need to be monitored with respect to the system’s health, in order to identify any potential performance problems. Performance is crucial for interactive systems as this will be perceived as quality by the end clients. In the Starbucks system, quality is determined by customers based on the several wait times they have to endure, as shown in Figure 3 by the red-yellow stars. Customers will have to wait to order, the original queue, to pay, based on payment processing times, and finally to receive their order, after all the preparations and crafting have finished.

Figure 3. The Starbucks system along with the monitored metrics.

Interestingly enough, order and pay wait times do not depend entirely on the system’s response capacity, given the primarily interactive nature of the system. While order wait time is partly waiting in line, a big chunk of it is waiting for the customer to decide what they want to order and then actually order it. For those familiar with the Starbucks menu, you can imagine that this is not always a trivial task. After the order has been placed, the cashier has to make out all the details for a particular order in an interactive manner (“Is 2% milk fine?”, “Would you like that sweetened?”). Remarkably, perceived quality is not so much determined by the actual order wait time, but more based on the wait time in the queue. This is because in the queue the customer is inactive, while during the order there is interaction, which is understandable and acceptable. Pay wait time depends on the particular payment type. Usually, cash payment may take longer than card payments, which also take long based on the responsiveness of the credit card service, not considering any potential failures and retries. Payment with the Starbucks rewards and cash cards is most often the fastest one.The processing and wait times concerning the back-end processes of the system, i.e., the preparation of the orders, are usually regulated and within constraints imposed by the resources. More often than not, baristas are quite efficient in crafting beverages, even more than the customers ordering. The performance of the equipment is standardized. Therefore, the overall performance of the process can be improved only by adding more resources, baristas or equipment. However, adding too many resources can actually create problems, as it was suggested in Fred Brooks’ book “The Mythical Man-Month”.

On the Cloud…

To a large degree, the quality of any system is perceived based on its responsiveness and effectiveness in a timely fashion. The same holds for most software systems on the cloud. As shown in Figure 4 (also as red-yellow stars), the software system’s final response time depends on the performance of the individual resources participating in a request, including computation, storage and network resources. Similarly to the Starbucks system, the client is also responsible for producing some wait time while preparing to issue the request. However, this time is not perceived by the system, since the response time is measured from the moment that the request is received, and there is no or insignificant waiting in the queue, as requests are usually processed in parallel by multithreaded applications. Nevertheless, this time, known as think time, is important when considered along with the response time; if the response time is lower (or even significantly lower) than the think time, it is not perceived as strongly by the customer. In the opposite case, if the think time is low (especially when we talk about software clients), even a slight increase in the response time will be noticed by the client. This property is helpful, when considering and setting performance goals.

Figure 4. A software system on the cloud along with the monitored metrics.

Concerning the performance of the back-end resources, we have to take into account the demand of the various requests in CPU, memory, disk, network and so on. These demands can be roughly estimated for each class of requests during the implementation of the application in the profiling process. Given the performance specification of the cloud resources and the demands of requests, we can also estimate the overall need of the application for resources and predict its performance under certain workloads. With this knowledge, we can also address any fluctuations on the workload by dynamically allocating or deallocating resources. Unlike the Starbucks system, resources in the cloud can change in number and in size in a more flexible and volatile manner, since we are talking about virtual resources, and there are less significant constraints with respect to their number since they do not affect each other to a large degree. The only constraints are imposed by the underlying hardware, which may or may not lie within the application owner’s control, and eventually is a matter of cost with respect to how many resources will be allocated.

3. Scaling Planning and Execution

Having a good understanding about the Starbucks system and its resources, and having set up the monitors for the performance metrics, we can now better understand and identify the motivation behind some of the changes that Starbucks performed to their process. Figure 5 shows the changes introduced in the Starbucks system, marked as numbered circles.

Figure 5. The adapted Starbucks system with the changes in numbered circles.

The carefully placed monitors (aka employees looking at very long lines and disgruntled patrons muttering under their breath about the same problem), revealed that the bottleneck, or at least one of the bottlenecks, of the Starbucks system is the order queue. Therefore, the first adaptive action that the employees took was the “cup cart” (Figure 5, change 1), i.e. a small wheel cart with empty cups of all sizes and types. An employee would walk along the line with the cart and he or she would mark down a cup with each customer’s order(s) and then pass them down to the baristas. The wheel cart does not accept orders for food, packaged items and brewed coffee or tea as these can be served straight at the cashier. With this change, Starbucks effectively separated the order queue from the pay queue and managed to parallelize the two processes. Having placed the order, the customers feel more relaxed as they know that their order is already being processed as they wait in line to pay. In practice, by the time the customers pay, their order may be ready for pick up, which increases their perception of quality. The second adaptive change took advantage of novel technology, where Starbucks introduced the “Mobile Order and Pay” service through the Starbucks mobile application (Figure 5, change 2). The concept is that a customer can place an order to a specific Starbucks store, pay through their Starbucks account and then go to pick up their order from the store. In this way, they can completely skip the queue and pick up their order right when they step in the store. The order and pay queues become completely invalid and the only thing that remains is the preparation time. This also becomes manageable, as the application returns an approximate time by which the order will be ready, taking into account average preparation times and customer arrival patterns.After both these changes, the only wait time that remains is the preparation time. As it has already been mentioned, this time can be reduced if necessary by commissioning more resources, human or equipment. However, several restrictions apply to these scaling actions. For example, we cannot dynamically increase the size of our equipment for a few ours and then release whatever we no longer need. The acquisition of equipment is planned based on average customer arrival and as a result there are moments when equipment is underutilized or others when it is not enough and wait time is temporarily increased for customers. In addition, we cannot increase the number of baristas to more than 2 or 3 at a given time depending on the size of the store. What Starbucks would do with respect to employees is identify specific times during the day or during the year (e.g. Christmas or other holidays) with increased traffic and assign more baristas or cashiers. This strategy has become a norm to almost all services and has been transferred to software services as well.

On the Cloud…

Thanks to the flexibility of cloud computing, there is a number of adaptive changes we can make to address potential performance issues of the deployed system. Figure 6 shows some of these changes noted in numbered circles. The first change that comes in mind when considering a high response time for a software system on the cloud is to add more resources, mainly virtual machines (Figure 6, change 1). Unlike the Starbucks system, space restrictions are less prominent in cloud systems. Although these restrictions can be imposed by hardware (a physical server cannot accommodate an infinite number of virtual machines), it doesn’t always come to that. And when it does, the software can be moved to a public cloud, which usually has much larger capacity than private infrastructure and space is not an issue. On the other hand, the problem then becomes one of cost; reserving more and more virtual machines translates into more money that needs to be paid to the public cloud provider. This is a consequence, which we will discuss in the next part concerning the economics of scaling. Cost is actually the motive for the second change (Figure 6, change 2); removing a virtual machine from a cluster when it is not fully utilized. If the workload can be shared by the remaining resources, the spare virtual machine can be removed to save costs.

Figure 6. The adapted software system with the changes as numbered circles.

Another possible change concerns the redistribution of requests to specialized clusters (Figure 6, change 3). We assume that the load balancer of the software system already distributes the incoming requests to specialized clusters according to their demands for specific resources. However, this is done in a very straightforward manner and the balancer sends the CPU intensive requests to the CPU cluster, the memory intensive requests to the memory cluster and so on. If a particular cluster is saturated, the first thought would be to add resources, as in change 1. Alternatively, since the virtual machines in the clusters actually possess all resources (CPU, memory, disk) in different configurations, we can avoid adding unnecessary resources and actually redirect requests from the saturated cluster to one that is not as utilized. However, one needs to be careful and not send too many requests to other clusters to avoid saturating their otherwise limited resources. Therefore, this action requires changes to the balancer software to become more sophisticated with management responsibilities as well.Finally, if the bottleneck is in the database requests, we can scale the data layer in a similar manner (Figure 6, change 4). Thanks to recent advancements with Big Data and NoSQL technologies in databases, it is possible to partition data and distribute it in multiple stores making both reading and writing to the database faster.

4. On the Economics of Scaling

Out of the potential bottlenecks we identified in Part 2, we have seen that Starbucks have paid particular attention and applied actions to address the order and pay queues. Performance improvement aside, looking at the economic aspect of scaling, this focus makes sense for another reason. Waiting in long lines to order may prompt the customers to abandon the endeavour and leave the store altogether. This automatically means loss of revenue for Starbucks, but it may also imply a steep fall to the customers’ long-term perceived quality, which may prevent them from visiting this particular store in the future or Starbucks altogether. On the other hand, once the customers have placed their order and wait to pay, they are less likely to leave the queue and even less after they have paid. Formally, the probability that a client will leave the system prematurely decreases as he or she progresses further within the system. Long pickup times may affect the long-term quality, but customers will rarely abandon something they have already paid for. By eliminating the order and/or the pay queue, either with the wheel cart or with the mobile app, Starbucks minimizes the risk of losing clients, who have already entered the system, and increases its total expected revenue.The adaptation costs of the changes described above are also of particular interest. The wheel cart solution is virtually inexpensive, since the cart and the cups already exist, as is the human resource (cashier or barista) at the moment of the change. The redistribution of human resources may affect the rest of the system, but since the order queue has been identified as the current bottleneck, reassigning one employ to this extra task will be of more benefit than cost. The mobile app has obvious additional costs (infrastructure, developers, maintenance and so on), but it is a global solution, which can be applied to all (or potentially all) stores. Furthermore, it eliminates two potential bottlenecks, order and pay queues, and it is a parallel and alternative process, which does not affect the other orders to a large degree. Finally, adding and removing resources dynamically and just-in-time has obvious costs, but more importantly carries high economic risk; reserving equipment for a period of high traffic that would actually be shorter than expected may result in higher costs. Overall, the cost to benefit ratio of adding physical resources may be high enough, so that the system would prefer a few short periods with higher delays than adding resources indiscriminately.

On the Cloud…

Unlike the Starbucks system, adding and removing cloud resources is a convenient change and inexpensive considering the low cost for hardware and the fact that cloud computing is an economy of scale. The last statement means that since a physical host can host a large number of virtual machines, the more VMs are commissioned by clients the more the cost spreads across these machines. However, the concept of economies of scale also applies on the software in a negative manner; the fewer requests a VM serves the more expensive they are. Therefore, it is desirable that our clusters operate close to full capacity so that their costs spread out to more requests. This is especially prominent in small systems, where the clusters are small and an extra VM will increase the average cost per request too much. As a result, when a small number of requests will trigger a scaling action to preserve performance, we may be reluctant to add new resources and prefer to wait until more requests arrive with the risk of increasing the system’s response time. However, a considerably increased response time may result in dropped requests, similar to Starbucks customers leaving the queue. Both these phenomena can result in a significant decrease to the long-term perceived quality of the system. In fact, unlike services like Starbucks, software services usually have written agreements, known as Service Level Agreements (SLAs), with their clients where they guarantee a maximum response time and a minimum availability rate (i.e., percentage of served requests out of total received). Violation of these agreements may even result to financial penalties.Concerning the concept of heterogeneous clusters using VMs optimized for specific resources, as we described our software system, the motivation comes from Amazon’s pricing policies for virtual resources. Within virtual machines of the same type (general purpose, CPU optimized, memory optimized and so on), when we want to double the resources of the VM so does its cost. However, if we want to unilaterally increase only one resource per VM, we can commission a specialized VM for a lower cost than a general purpose VM, which would unnecessarily increase the other resources as well. Although heterogeneous clusters are an optimal solution from cost and performance perspective, they require additional logic on the load balancer to make sure that the requests are distributed according to their needs.

5. Conclusions

The purpose of this post was to show that resource scaling and dynamic adaptation is a reality not only in software and computer systems, but in everyday services and processes as simple as ordering a cup of coffee. Crucial components of the adaptation process are monitoring, where we study the performance of our system and identify potential problems, correct planning and execution of the adaptive actions given the available resources, and eventually the economic considerations of the whole process. The inclusion of novel technologies, including cloud and mobile computing, does not diminish the role of humans in the process. On the contrary, scaling and smart solutions can lead not only to better services for customers, whether these are software or human services, but more importantly they can lead to economic benefits for the companies, the employees and the clients. Cost savings can allow companies to redistribute the budget towards further improving the service, or the quality of work for their employees (increased salaries, better training etc.). In addition, lower costs can lead the market through competition to lower the service prices to the benefit of the clients. These facts show that during dynamic adaptation, a service should be perceived both as a system and as a product with economic considerations.