Determinism and the Semantics of Performance in the Cloud

SaaS companies are judged harshly by the perceived reliability and performance of their services. When your service becomes a critical part of a customer’s infrastructure, their fate becomes wedded to the SLAs you deliver. Keep in mind that a service’s performance will be measured not by its average speed but by the consistency of its speed. Customers should, and will, demand performance measured in percentiles; guarantees of response at the 99.99th percentile are not unheard of. That level of guarantee is achievable only with careful architectural practice.
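
To make the percentile framing concrete, here is a minimal Python sketch (the latencies are simulated and the lognormal distribution is purely illustrative) showing why the mean and the upper percentiles of the same service can tell very different stories:

    import random

    def percentile(samples, p):
        # Nearest-rank method: the first sample covering p percent of the data.
        ordered = sorted(samples)
        rank = max(0, min(len(ordered) - 1, int(round(p / 100.0 * len(ordered))) - 1))
        return ordered[rank]

    # Simulated response times in milliseconds for a hypothetical service.
    latencies = [random.lognormvariate(1.0, 0.5) for _ in range(100000)]

    print("mean   : %.2f ms" % (sum(latencies) / len(latencies)))
    print("p99    : %.2f ms" % percentile(latencies, 99))
    print("p99.99 : %.2f ms" % percentile(latencies, 99.99))

The p99.99 figure, not the mean, is the number an SLA of the kind described above must be written against.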

The Fundamentals of Performance

Within the request-response cycle a handful of factors govern application performance; chief among them are the performance vulnerabilities described in the sections that follow.

Deterministic Performance: Eliminate Your Vulnerabilities

Performance vulnerabilities are factors that are difficult or impossible to control. Achieving consistent performance requires minimizing the request-response cycle’s exposure to such vulnerabilities.

Classically such vulnerabilities come in the following forms:

Reading data from a non-volatile storage device

Writing data to a non-volatile storage device

Reading/writing to a non-volatile transactional database

Communicating with remote services

The greatest vulnerability (aside from dependability) posed by the methods employed in the request-response cycle is unpredictability of performance. The measure of that predictability is determinism.

Determinism

Determinism is the best measure we have for predicting the effort and expense of making a process consistently performant. It does not represent a magnitude; rather, it qualifies our ability to control and predict the performance-critical behaviors of a subsystem. Some subsystems are amenable to the kinds of analysis that produce clear strategies for consistent performance (e.g. a memory-backed storage subsystem); others, however, present far too many variables and conditions to permit any guarantee of consistency. Determinism is the key distinction between these two kinds of subsystem.

Reading from a non-volatile storage device

Many things can complicate the read performance of a non-volatile storage device (e.g. a disk drive), but these are the chief factors:

Is the data available from high-speed memory (i.e. cached)?

Is the read request queued behind many others (disk contention)?

Is the data stored in a highly noncontiguous form (fragmentation)?

Read speeds vary enormously, from a few microseconds for a read from cache to hundreds of milliseconds for reads directly from the media. Non-volatile storage mechanisms classically exhibit highly non-deterministic performance, with deltas as great as the difference between driving to the neighborhood store and flying to the moon. Given this, you certainly don’t want to read from a disk drive during the request-response cycle unless absolutely necessary.
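
The cache-versus-media delta is easy to observe. The Python sketch below (the file path and size are arbitrary; forcing a genuinely cold first read requires dropping the OS page cache, which on Linux needs root) times two sequential scans of the same file:

    import os, time

    PATH = "/tmp/readtest.bin"   # arbitrary scratch file
    CHUNK = 1 << 20              # read in 1 MiB chunks

    if not os.path.exists(PATH):
        with open(PATH, "wb") as f:
            f.write(os.urandom(256 * CHUNK))   # a 256 MiB test file

    def timed_scan():
        start = time.perf_counter()
        with open(PATH, "rb", buffering=0) as f:
            while f.read(CHUNK):
                pass
        return time.perf_counter() - start

    # For a genuinely cold first scan on Linux, drop the page cache first (root):
    #   sync; echo 3 > /proc/sys/vm/drop_caches
    print("first scan : %.3f s" % timed_scan())
    print("second scan: %.3f s" % timed_scan())   # warm: served from the page cache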

Writing to a non-volatile storage device

The performance characteristics of writes to a non-volatile storage device depend almost wholly on the operating system’s caching strategy: write-through or write-back (lazy writes)? A write-through strategy bypasses the file system cache, writing data directly to the device’s media. This leaves writes just as vulnerable to non-deterministic timing as reads; in fact, under write-through you are guaranteed writes will always be slower than reads, with no chance of cache acceleration. Virtually all modern OS file systems, however, implement lazy writes by default: data is written into high-speed memory first and later flushed to disk by a background process. Lazy writes are extremely fast and exhibit highly deterministic timing. Performing writes to a non-volatile storage device within a request-response cycle is therefore unlikely to undermine deterministic performance guarantees.
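
The difference between the two strategies can be demonstrated in a few lines of Python. The sketch below approximates write-through by following the write with fsync(); the file path is arbitrary:

    import os, time

    PATH = "/tmp/writetest.bin"   # arbitrary scratch file
    DATA = b"x" * 4096            # one 4 KiB block

    def timed_write(flush_to_media):
        with open(PATH, "wb") as f:
            start = time.perf_counter()
            f.write(DATA)                # lands in the OS page cache: the lazy write
            if flush_to_media:
                f.flush()
                os.fsync(f.fileno())     # force the block onto the device, write-through style
            return time.perf_counter() - start

    print("lazy write   : %.6f s" % timed_write(False))   # memory speed, deterministic
    print("flushed write: %.6f s" % timed_write(True))    # bounded by the device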

Reading/writing to a non-volatile transactional database

The factors affecting the performance characteristics of a database are manifold, far too many to list here. Principally they fall into these categories:

Reading from non-volatile storage

Writing to non-volatile storage (non-cached writes)

Acquisition of resource locks

Maintenance of database indexes

Query plan generation

It can easily be argued that the latter three factors depend thoroughly upon the first two, which implies database performance is just as non-deterministic as its underlying non-volatile store. Again, you don’t know whether you’re driving to the store or flying to the moon; you’re just along for the ride. There is also the further risk of high contention on transactional locks between simultaneous users of the database, something extremely difficult to coordinate and control.
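
A small illustration of that dependence, using Python’s built-in sqlite3 module as a stand-in for any transactional database: identical commit-per-request workloads against an on-disk store and a purely in-memory one:

    import os, sqlite3, time

    DB_PATH = "/tmp/txtest.db"    # arbitrary scratch database
    if os.path.exists(DB_PATH):
        os.remove(DB_PATH)

    def timed_inserts(conn, n=500):
        conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
        conn.commit()
        start = time.perf_counter()
        for _ in range(n):
            with conn:   # one transaction per insert, as a commit-per-request service would issue
                conn.execute("INSERT INTO t (v) VALUES (?)", ("payload",))
        return time.perf_counter() - start

    disk = sqlite3.connect(DB_PATH)      # every commit must reach non-volatile storage
    mem = sqlite3.connect(":memory:")    # commits never leave RAM

    print("on-disk  : %.3f s" % timed_inserts(disk))
    print("in-memory: %.3f s" % timed_inserts(mem))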

Communication with remote services

Distributed systems make extensive use of remote services to satisfy local requests. What level of vulnerability and non-determinism does this introduce? Assuming a well-behaved remote service on the same logical network that is guaranteed to process requests in some bounded time X, the remainder of the time will be spent:

Resolving the hostname to an IP address

Negotiating a TCP connection

Transmitting request data over the network

Reading response data from the network

So what risks lie here? First, DNS can be assumed highly reliable, since the proper functioning of any network infrastructure depends on it, so its performance behavior should carry a strong guarantee of determinism. It’s also safe to assume that under all but pathological conditions the majority of TCP negotiation attempts will be serviced in microseconds. Transmitting small amounts of data (as is often the case with service requests, especially REST-ful ones) should likewise exhibit a consistent performance envelope. Reading the response from the network varies with the size of the payload, but the correlation with performance is roughly linear.
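
These first two assumptions are directly measurable. A minimal Python sketch (example.com and port 80 are placeholders for a real peer on your own network):

    import socket, time

    HOST, PORT = "example.com", 80   # placeholders: use a real peer on your network

    start = time.perf_counter()
    addr = socket.gethostbyname(HOST)    # DNS resolution
    dns_ms = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    sock = socket.create_connection((addr, PORT), timeout=5)   # TCP three-way handshake
    tcp_ms = (time.perf_counter() - start) * 1000
    sock.close()

    print("DNS resolve: %.1f ms" % dns_ms)
    print("TCP connect: %.1f ms" % tcp_ms)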

The Ideal Service

We have discussed the key vulnerabilities affecting deterministic performance of a service’s request-response cycle. Now we formalize the characteristics of the ideal service that effectively mitigates non-deterministic performance behavior.

Request-Response Cycle: The Holy of Holies

Ideally the request-response cycle is treated as sacrosanct: nothing that could compromise the performance of the service should execute within a service transaction. This implies the following must all hold during a request-response cycle:

Local data reads are served directly from memory, or are obviated by an in-memory object cache

Local writes utilize a write-back (lazy-writes) caching mechanism

There is no direct use of a database, unless it is non-transactional and backed by a memory-based data store

Remote calls are made only to services also exhibiting deterministic performance behavior

This would certainly bring a large measure of determinism to service performance, but it is also a very tall order, highly restrictive…but who says you can’t cheat?
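
As a sketch of what the first of those rules looks like in code, here is a Python outline (the class and function names are illustrative, not a prescribed API) in which the request-response cycle touches only memory, and disk is read exactly once, at start-up:

    import threading

    class InMemoryStore:
        def __init__(self, initial):
            self._lock = threading.Lock()
            self._data = dict(initial)        # the entire working set lives in RAM

        def get(self, key):
            with self._lock:
                return self._data.get(key)    # deterministic: no disk, no network

    def load_all_records():
        # Out-of-cycle initialization: the one place a disk read is acceptable.
        # A real service would bulk-load from its non-volatile store here.
        return {"user:1": {"name": "alice"}}

    STORE = InMemoryStore(load_all_records())

    def handle_get(key):
        # The request-response cycle touches memory only.
        return STORE.get(key)

    print(handle_get("user:1"))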

Out of cycle transactions

How does one implement a service that meets the immediate objectives of deterministic performance while still retaining some of the classic features of an application, including the persistence of data to a traditional transactional database? Two techniques: the contract of durability and detached transactions.

The Contract of Durability

Implicit in a successful request-response cycle is the contract of durability: an assurance, to some limited degree of certainty, that all side effects of the transaction remain persistent. Put more simply, if data has been produced or changed as the result of a request, there should be some guarantee that it persists beyond the transaction. This is the contract of durability.

How can the contract of durability be met without introducing non-deterministic performance behavior? There are 3 methods of persisting transaction data:

Via an ideal remote service

Via write-back caching to a non-volatile storage device

Via memory-backed non-transactional data store

Unfortunately the contract of durability meets the immediate expectations of the client but not the obligations of the application: storing data in a centrally accessible, often transactional, data store. Requiring the request-response cycle itself to store data in a centralized database would undoubtedly introduce non-deterministic performance behavior. This brings us to the second required technique: detached transactions.

Detached Transactions

Committing data to a transactional data store outside the context of the request-response cycle in which it was generated is called a detached transaction. This technique lets us distribute data to a central data store without making the client pay, in latency, for that distribution. In other words, no one need know we write to a traditional transactional (and likely highly contended) database. Further, the ability of the service to serve requests is not compromised by database outages (well, up to a point). The perception of a transactional yet deterministically performant service can be achieved without the inordinate expense and effort of wringing the same perception from a traditional database.

There are, however, inherent risks to this technique, chiefly: what guarantee is there that the detached transaction will complete successfully? For this technique to achieve consistently successful results, the constraints otherwise enforced by the database must also be enforced within the request-response cycle itself. This isn’t as onerous a task as it seems, requiring a determinable amount of effort to capture and implement database constraints as business rules.
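
A minimal Python sketch of a detached transaction, assuming a long-lived service process (all names here are illustrative): the handler enforces the database’s constraints in-cycle, enqueues the write in memory, and responds; a background thread later commits to the possibly slow, possibly contended store.

    import queue, sqlite3, threading

    WRITE_QUEUE = queue.Queue()

    def handle_update(user_id, name):
        # In-cycle: enforce the database's constraints as business rules here,
        # so the detached commit cannot fail validation later.
        if not name:
            raise ValueError("name must be non-empty")
        WRITE_QUEUE.put((user_id, name))   # an in-memory enqueue: fast and deterministic
        return "ok"                        # the client never waits on the database

    def detached_writer(db_path):
        # Out-of-cycle: drain queued work into the transactional store.
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
        conn.commit()
        while True:
            user_id, name = WRITE_QUEUE.get()
            with conn:   # this commit may be slow or contended; no client is waiting on it
                conn.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (user_id, name))

    threading.Thread(target=detached_writer, args=("/tmp/detached.db",), daemon=True).start()
    handle_update(1, "alice")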

There is one strong exception to this technique: transactions that reference or expect data from previous transactions. When such requirements exist, transactional data must be distributed throughout the entire service cluster, or ultimately, despite all concerns, committed directly to and read from a central data store. Operations that depend upon the results of previous transactions should therefore be avoided; where they are a critical requirement, you must implement a distributed data store to retain deterministic performance characteristics.

The Cloud: A Strategy Not A Tactic

What does this all have to do with the “cloud”? Can you take the principles of deterministic computing into a cloud environment? What must be done to achieve this?

What the cloud isn’t!

The cloud isn’t a magic place that makes things faster by turning a knob. There exists no magic that can take a monolithic application bound to a shared database and make it scale horizontally.

What the cloud is!

A bunch of cheap on-demand computers with the means to deploy whatever you want on them…so long as what you deploy runs on commodity hardware. Further, a cloud environment often provides some form of infrastructure services, such as network-accessible storage or monitoring tools.

Why the cloud then?

All this cheap computing provides the opportunity to solve problems in a massively parallel fashion. The cloud is not about fast, it’s about doing as much as possible simultaneously on many machines; scale out, not up!

These clouds have no silver lining

Getting your software into the cloud requires the dismissal of some classic iron-clad guarantees:

Service instances all share the same database and/or see an identical view of the data

Performance problems can be remedied by enhancing the base hardware platform

Infrastructure services are under direct control, outages are planned, service interruptions are infrequent

All resources of the local hardware are completely at your disposal

The ramifications of these non-guarantees:

Data must be distributed to, and synchronized from, many different sources

The software stack must be transparent and flexible, with allowances for performance tuning via source modifications or fine-grained replacement of third-party libraries and algorithms

Code must be written to tolerate temporary service failures and the highly non-deterministic behavior of some cloud-based infrastructure components (a minimal retry sketch follows this list)

I/O bandwidth is shared with other users of the physical machine; there's no guarantee what share of those I/O resources you'll have at any moment
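
As referenced above, a minimal Python sketch of tolerating transient failures, using exponential backoff with jitter (the wrapped operation and the exception type are placeholders):

    import random, time

    def call_with_retries(op, attempts=5, base_delay=0.1):
        # Retry a flaky operation with exponential backoff plus jitter;
        # jitter avoids synchronized retry storms across service instances.
        for attempt in range(attempts):
            try:
                return op()
            except OSError:                  # stand-in for a transient infrastructure failure
                if attempt == attempts - 1:
                    raise                    # retries exhausted: surface the failure
                time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

    # Hypothetical usage; fetch_from_infrastructure_service is a placeholder:
    # result = call_with_retries(lambda: fetch_from_infrastructure_service())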

Surviving the Storm

Can deterministic performance be achieved in the cloud environment? Indeed it’s possible if the following cardinal rules are followed:

Only interact with services that you've determined to exhibit deterministic performance and that are under your direct control

Use non-volatile storage for reads sparingly. Assume you have only a limited number of I/O operations per second available to you (typically fewer than 200); a minimal budgeting sketch follows this list

Using a local non-volatile data store within the request-response cycle is risky; limit its use to service initialization and out-of-cycle persistence

Interaction with cloud infrastructure components should happen outside the request-response cycle as much as possible
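
Finally, a minimal Python sketch of budgeting that limited I/O allowance, as referenced in the rules above: a token bucket that refuses disk operations beyond a configured rate (the 200-IOPS figure mirrors the assumption above; all names are illustrative):

    import threading, time

    class IopsBudget:
        # A token bucket limiting disk operations to a fixed rate.
        def __init__(self, ops_per_second=200):   # 200 mirrors the figure assumed above
            self.rate = float(ops_per_second)
            self.tokens = self.rate
            self.stamp = time.monotonic()
            self.lock = threading.Lock()

        def acquire(self):
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.rate, self.tokens + (now - self.stamp) * self.rate)
                self.stamp = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return True
                return False   # caller should defer or degrade rather than pile onto the disk

    BUDGET = IopsBudget(200)
    if BUDGET.acquire():
        pass   # safe to issue one disk operation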