Saturday, January 16, 2016

I have heard many companies complain about how expensive the cloud became as they moved from development to production systems. In theory, the savings from greatly reduced Site Reliability Engineer staffing and reduced hardware costs should compensate -- the key word is should. In reality, the staffing reduction never happens because those engineers are needed to support other systems that will not be migrated for years.

There is actually another problem: the architecture is not designed for the pricing model.

In the last few years there have been many changes in the application environment, and I suspect many current architectures are locked into past system design patterns. To understand my proposal better, we need to look at the patterns through the decades.

The starting point is the classic client server: Many Clients - one Server - one database server (possibly many databases)

As application volume grew, we ended up with multiple servers to handle multiple clients but retaining a single database.

Many variations arose, especially with databases: federation, sharding, etc. The next innovation was remote procedure calls, with many dialects such as SOAP, REST, and AJAX. The typical manifestation is shown below.

When the cloud came along, the above architecture was too often just moved off physical machines onto cloud machines without any further examination.

Often there will be minor changes: if a queue service was running on-site alongside the application server, it may be spawned off to a separate cloud instance. But applications are often designed for the old all-on-one-machine model. It is rare for an existing application to be significantly design-refactored when it is moved to the cloud. I have also seen new cloud-based applications implemented in the classic single-machine pattern.

The Design Problem

The artifact architecture of an application often consists of dozens, sometimes over 100, libraries (for example, C++ DLLs). It is a megalith, rooted in the original design targeting a single PC.

Consider the following case: suppose that instead of running these 100 libraries on high-end cloud machines with, say, 20 instances, you run each library on its own lightweight machine. Some libraries may only need two or three lightweight machines to handle the load. Others may need 20 instances because they are computationally intense hot spots. If you are doing auto-scaling, the time to spin up a new instance is much less when instances are library-based -- because each holds only one library.

For the sake of argument, suppose that each of the 100 libraries requires 0.4 GB to run. Loading all of them in one instance therefore takes 40 GB (100 x 0.4).

Looking at the current AWS EC2 pricing, we could use 100 t2.nano instances at $0.0065 x 100 = $0.65/hour, with 1 vCPU each (100 vCPUs total). Holding the full 40 GB in one instance would require a c3.8xlarge at $1.68/hour -- roughly three times the cost, with only 32 cores instead of 100. Three times the cost and a third of the cores... per unit of compute, our bill could be roughly nine times what is needed.
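The arithmetic above can be sketched in a few lines. This uses the 2016 prices quoted in the text; the numbers are illustrative, not current pricing.

```python
# Comparing 100 library components on many small instances versus one
# large instance, using the 2016 AWS prices quoted above (illustrative).
N_LIBRARIES = 100

# t2.nano: 1 vCPU, enough memory for one 0.4 GB library
nano_price = 0.0065                           # $/hour per instance
nano_fleet_cost = N_LIBRARIES * nano_price    # $0.65/hour, 100 vCPUs total

# c3.8xlarge: 32 vCPUs, enough RAM to hold the whole 40 GB megalith
big_price = 1.68                              # $/hour

cost_ratio = big_price / nano_fleet_cost      # ~2.6x the hourly price
core_ratio = N_LIBRARIES / 32                 # ~3.1x the vCPUs
cost_per_core_ratio = cost_ratio * core_ratio # ~8x the cost per vCPU
print(f"fleet: ${nano_fleet_cost:.2f}/hr, megalith: ${big_price:.2f}/hr")
print(f"cost per vCPU ratio: {cost_per_core_ratio:.1f}x")
```

So the "nine times" figure in the text is the product of paying about three times the price for about a third of the cores.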

What about scaling? With the megalith, you have to spin up a complete new instance. With the decomposition into library components, you only need to spin up new instances of the library that needs them. In other words, scaling up becomes significantly more expensive with the megalith model.

What is another way to describe this? Microservices

This is a constructed example, but it does illustrate that moving an application to the cloud may require appropriate redesign, with a heavy focus on building components that run independently on the cheapest instances. Each swarm of these component-instances is load balanced, with very fast creation of new instances.

Faster instance creation actually saves more money, because the triggering condition can be set higher (and thus triggered less often, with fewer false positives). You want to create instances so they are there when the load builds to require them. The longer an instance takes to spin up, the longer the lead time you need, which means the lower on the build curve you must set the trigger point.
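The trigger-point trade-off can be made concrete with a toy model. All numbers here are illustrative assumptions (a 90% capacity ceiling, load growing 3% per minute), not measurements.

```python
# Toy model: where to set an auto-scaling trigger given instance spin-up
# time. The longer the spin-up, the earlier (lower) the trigger must fire.
def trigger_utilization(capacity_limit: float,
                        load_growth_per_min: float,
                        spinup_minutes: float) -> float:
    """Utilization at which to trigger a scale-up so the new instance
    is ready before load reaches capacity_limit."""
    lead = load_growth_per_min * spinup_minutes  # load gained during spin-up
    return capacity_limit - lead

# Megalith image taking 10 minutes to boot: trigger near 60% utilization.
megalith_trigger = trigger_utilization(0.90, 0.03, 10)
# Single-library instance taking 2 minutes: trigger near 84% utilization.
library_trigger = trigger_utilization(0.90, 0.03, 2)
print(round(megalith_trigger, 2), round(library_trigger, 2))
```

The higher trigger means fewer premature scale-ups, which is where the extra savings come from.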

There are additional savings on deployments, because you can deploy at the library level to specific machines instead of deploying a big image. Deploys are faster; rollbacks are faster.

Amazon actually uses this approach internally, with hundreds of services (each on its own physical or virtual machine) backing its web site. A new feature is rarely integrated into the "stack"; instead it is added as a service that can be turned on or off in production by setting appropriate cookies. There is limited need for a sandbox environment because the new feature is not there for the public -- only for internal people who know how to turn it on.

What is the key rhetorical question to keep asking?

Why are we putting most of the application on one instance instead of dividing it to save money? This question should be asked constantly during design reviews.

In some ways, a design goal would be to design the application so it could run on a room full of Raspberry Pis.

This design approach does increase complexity -- just as multi-threading and/or async operations add complexity, but with a significant payback. Designing libraries to minimize the number of inter-instance calls while also minimizing resource requirements is a design challenge that will likely require mathematical / operations research skills.

How to convert an existing application?

A few simple rules to get the little gray cells firing:

Identify methods that are static - those are ideal for mini-instances

Backtrack from these methods into the callers and build up clusters of objects that can function independently.

There may be refactoring needed, because designs often go bad under pressure to deliver functionality

You want to minimize external (inter-component-instance) calls from each of these clusters

If the system does not end up decomposed into dozens of component-instance deployments, there may be a problem

If changing the internal code of a method requires a full deployment, there is a problem
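The first rule above -- identifying static methods as mini-instance candidates -- can be bootstrapped mechanically. A minimal sketch in Python, assuming a codebase you can import (the OrderMath class here is a hypothetical stand-in for real code):

```python
# Scan a module for static methods: methods with no instance state are
# the easiest candidates to extract into their own mini-instances.
import inspect
import sys

class OrderMath:                    # hypothetical stand-in for real code
    @staticmethod
    def tax(amount, rate):          # no self, no state: extraction candidate
        return amount * rate
    def save(self):                 # stateful: belongs to a cluster of objects
        pass

def find_static_methods(module):
    """Yield (class_name, method_name) for every @staticmethod in module."""
    for cls_name, cls in inspect.getmembers(module, inspect.isclass):
        for name, member in vars(cls).items():
            if isinstance(member, staticmethod):
                yield cls_name, name

print(list(find_static_methods(sys.modules[__name__])))
```

For C++ DLLs the same backtracking would be done with a call-graph tool rather than reflection, but the principle -- start from stateless entry points and work outward -- is the same.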

One of the anti-patterns for effective, frugal cloud-based design is actually object-oriented (as opposed to cost-oriented) design. I programmed in Simula and worked in GPSS -- the "Adam and Eve" of object programming. All of the early literature was based on the single-CPU reality of computing then. I have often had to go in and totally refactor an academically correct object-oriented system design in order to get performance. Today, a refactor would also need to get lower costs.

The worst case of system code that I refactored for performance was implemented as an entity model in C++: a single call from a web front end went through some 20 classes/instances in a beautiful conceptual model, with something like 45 separate calls to the database. My refactoring resulted in one class and a single stored procedure (whose result was cached for 5 minutes before rolling off or being marked stale).

I believe that similar design inefficiencies are common in cloud architecture.

When you owned the hardware, each machine increased the labor cost to create, license, update, and support it. You had considerable financial and human pressure to minimize machines. When you move to the cloud with good script automation, having 3 instances or 3,000 instances should be approximately the same work. You now have financial pressure to shift to the model that minimizes costs -- and that will often mean many, many more machines.