October 24, 2010

Abstract:

Memcached is one of the most common in-memory cache implementations. It was originally developed by Danga Interactive for LiveJournal, but is now used by many other sites as a side cache to speed up read-mostly operations. It gained popularity in the non-Java world, too, especially since it is a language-neutral side cache for which few alternatives existed.

As a side cache, memcached clients rely on the database as the system of record. The database is still used for write, update, and complex query operations. Since the memcached specification includes no query operations, memcached is not a database alternative, unlike most of the NoSQL offerings. This also excludes memcached from being a real solution for write scalability. As a result, many high-traffic sites have started to move away from memcached and replace it with other NoSQL alternatives, as noted in a recent High Scalability post, “MySQL and Memcached: End of an Era?”

The transition away from memcached to NoSQL could represent a large investment, as many sites are already heavily invested in memcached usage. In this post, I'll illustrate an alternative approach in which we extend the use of memcached for write scaling and add other goodies such as high availability and elasticity by plugging in GigaSpaces as the backend datastore, avoiding the need for a rewrite. The pure-Java implementation can also be seen as a benefit, as it increases the adoption of memcached within the Java community and leverages Java's portability to other platforms.

Memcached overview

The diagram below shows a typical use of memcached. It outlines a simple deployment topology often referred to as a side cache. This topology is very popular for addressing read-mostly scenarios.

Typical use of memcached for scaling read-mostly applications

In this mode, read operations first check the memcached servers for an artificial key derived from the requested data. If the key doesn't exist on the memcached server, a query is made to the database, and the result is then stored under that artificial key in the memcached instances. Subsequent calls with that query are served by the memcached daemon and thus avoid a round trip to the database. Updates typically remove the data from the memcached instances first, then write the new data to the database and, in the most common setup, to the memcached instance at the same time. Wikipedia has a code snippet that provides a more detailed example of this in action.
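To make the flow concrete, here is a minimal Java sketch of the read path described above, using the XMemcached client library. The key scheme and the loadFromDatabase helper are hypothetical stand-ins, not part of any specific API:

import net.rubyeye.xmemcached.MemcachedClient;
import net.rubyeye.xmemcached.XMemcachedClient;

public class SideCacheExample {
    private static final int TTL_SECONDS = 3600;

    // Read-through: check the cache first, fall back to the database on a miss.
    static Object readUser(MemcachedClient cache, long userId) throws Exception {
        String key = "user:" + userId;            // artificial key derived from the query
        Object value = cache.get(key);
        if (value == null) {                      // cache miss: go to the system of record
            value = loadFromDatabase(userId);     // hypothetical database query
            cache.set(key, TTL_SECONDS, value);   // store it for subsequent reads
        }
        return value;
    }

    // Stand-in for a real SQL query against the database.
    static Object loadFromDatabase(long userId) {
        return "row-for-" + userId;
    }

    public static void main(String[] args) throws Exception {
        MemcachedClient cache = new XMemcachedClient("localhost", 11211);
        System.out.println(readUser(cache, 1L));  // first call hits the database
        System.out.println(readUser(cache, 1L));  // second call is served from memcached
        cache.shutdown();
    }
}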

The main benefit of this model lies in its simplicity. The memcached API is extremely simple to use for this specific scenario. The fact that it relies on a simple and open protocol, as opposed to a rich client bound to a specific language and implementation, makes it portable across a wide variety of languages and environments. It is also known to be fairly scalable for read operations, due to its inherently shared-nothing model.

The advantages that make memcached so simple and popular are the exact same things that make it fairly limited for any scenario that goes beyond read-mostly/side-cache use. Some of those limitations were the main driver that forced popular sites to switch to NoSQL alternatives, as noted by Digg and Twitter:

The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes. This is completely wrong for large-scale web applications, where response time is critical.

We have a lot of data, the growth factor in that data is huge and the rate of growth is accelerating. We have a system in place based on shared mysql + memcache but it's quickly becoming prohibitively costly (in terms of manpower) to operate. We need a system that can grow in a more automated fashion and be highly available.

Summarizing the points from Digg and Twitter, the main limitations of the current memcached implementations are the lack of support for:

Write scalability

Elasticity

High availability

I would also add to that list the lack of a consistency model, the limited query support, the purely client-side and optional sharding model, and the absence of built-in management and monitoring.

Marrying memcached and NoSQL

Many of the limitations of memcached outlined above are addressed by various NoSQL/in-memory data grid implementations. The main difference is that many of the NoSQL alternatives were designed to act as a database and not just as a side cache. As such, they were geared toward write scaling and built with elasticity and high availability in mind.

The main motivation behind the integration between the two is to address the current limitations of memcached without forcing a complete and expensive re-write. Another motivation is that we can leverage the rich client and language support that comes along with the memcached protocol as a simple way to extend the use of NoSQL alternatives to other languages.

While the pattern of integrating memcached with an existing NoSQL backend is fairly generic, in this post I will refer specifically to the GigaSpaces integration.

GigaSpaces memcached support

Joseph Ottinger published a good summary of the rationale behind the GigaSpaces memcache support in a post on TheServerSide.com, “Did Someone Say GigaSpaces Now Has Memcached Support?” In this post, I'll try to provide a reference to the underlying GigaSpaces implementation in the context of the memcached usage pattern that I outlined above.

The GigaSpaces memcached support is fairly simple. It consists of a listener service (written in Java) that implements the memcached protocol for client applications and maps the protocol into equivalent GigaSpaces operations. The integration works in three modes of operation:

Memcached-compatible – reliability and write-scaling

The memcached-compatible mode works in exactly the same way that memcached clients work today.

In this mode, each GigaSpaces node runs an embedded instance of the memcached service. The communication between the memcache client and the GigaSpaces node is handled through the memcached protocol. The GigaSpaces memcached service communicates with the GigaSpaces backend directly in-memory.

By using GigaSpaces as the backend datastore, we gain immediate benefits from the robustness and richness of the GigaSpaces environment. This includes write scaling and reliability through the built-in database integration for pre-loading data from a database and write-behind for storing data back into the database. In addition, we gain built-in management, monitoring, and deployment automation (through the GigaSpaces dev-ops API). The fact that the entire stack is built in Java also lets us leverage the Java tool ecosystem (debugging tools, monitoring tools, profilers) as well as the portability of the Java Virtual Machine.

As the memcached protocol doesn't include dynamic discovery, memcached clients are statically bound to the backend server, so this mode doesn't expose GigaSpaces' elasticity and dynamic scaling.

Memcached load-balancer – elasticity + write-scaling, reliability

In this mode of operation, the memcached server instance does not use an embedded GigaSpaces instance but instead references a remote GigaSpaces cluster. The GigaSpaces memcached server instance thus becomes a memcached load-balancer between the memcached clients and the GigaSpaces data partitions. Each GigaSpaces cluster can have many memcached load-balancers. As with HTTP load-balancers, the clients only need to point to a single IP address, and every memcached request is mapped to a GigaSpaces partition according to the key used in any given operation.
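For illustration only, the routing such a load-balancer performs is conceptually similar to the following sketch; the actual GigaSpaces partition-routing algorithm is an internal detail and may differ:

import java.util.Arrays;
import java.util.List;

public class KeyRouter {
    // Map a memcached key to one of the cluster partitions by hashing the key.
    // Every request for the same key lands on the same partition.
    static String pickPartition(String key, List<String> partitionAddresses) {
        int index = Math.floorMod(key.hashCode(), partitionAddresses.size());
        return partitionAddresses.get(index);
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("10.0.0.1:4174", "10.0.0.2:4174");
        System.out.println(pickPartition("user:1", partitions));
    }
}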

The main benefit of this mode is its simplicity and elasticity. Elasticity is achieved by the fact that the GigaSpaces cluster can grow as needed; the GigaSpaces proxy will discover new instances dynamically and route memcached requests to the appropriate node. The same applies to fail-over: if one of the GigaSpaces nodes fails, the proxy will route requests to a hot backup automatically.

The memcached load-balancer can also be hosted as a service within the GigaSpaces container. To gain even better scalability of the load-balancer, we can combine the memcached-compatible mode (our first example) with this load-balancer mode, where each node in the GigaSpaces cluster has an embedded memcached service configured as a load-balancer. In that configuration, a client can connect to any one of the nodes in the cluster and still gain access to the entire cluster.

The downside of this approach is performance, as each memcached request has to go through two network hops. For that reason we often configure a GigaSpaces LocalCache to eliminate the extra hop for read operations. In this way, subsequent get operations on the same key are resolved within the load-balancer itself. Any update on that key through any of the partitions will be propagated to the local cache automatically as well.

Another potential use of this mode is as a gateway between two network domains. It is much easier to cross firewall and network domain boundaries through a single gateway than to expose the entire cluster.

Memcached RPC – Map/Reduce support, multi-lang data feeder

One of the interesting and lesser-known uses of the memcached protocol is for remote procedure calls or Map/Reduce operations. In this mode, a set operation on the memcached server is translated into a command, and a get operation is translated into the return value of that command. This is extremely useful for data-feeder and aggregated-query scenarios. Non-Java feeders can also leverage the GigaSpaces support for dynamic languages to pass in code that will be executed in the server, close to the data.

The execution of the code is done through the GigaSpaces polling container. In this case, we would configure the polling container to poll for the memcached entries.

NOTE: We may be releasing a new version with built-in support for this specific pattern, so the API is still subject to change.
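To make the idea concrete, here is a rough client-side sketch of that convention, again using a standard memcached client. The “cmd:”/“result:” key naming and the command string are purely hypothetical:

import net.rubyeye.xmemcached.MemcachedClient;
import net.rubyeye.xmemcached.XMemcachedClient;

public class MemcachedRpcSketch {
    public static void main(String[] args) throws Exception {
        MemcachedClient client = new XMemcachedClient("localhost", 11211);

        // A set() submits a command entry; a server-side polling container
        // would consume it, execute the logic close to the data, and write
        // the return value under a matching result key.
        String requestId = "42";
        client.set("cmd:" + requestId, 60, "countUsersByCountry:US");

        // A get() retrieves the command's return value. In practice the
        // client would retry until the server-side execution has completed.
        Object result = client.get("result:" + requestId);
        System.out.println(result);

        client.shutdown();
    }
}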

Simple Example:

The following example illustrates the first two modes outlined in this post – the memcached-compatible and the load-balancer modes.

For the sake of simplicity we will use a single memcached server instance. A more advanced deployment of a complete cluster is referenced at the end of this post.

Running a single memcached instance:

In this example we will run a GigaSpaces instance using the gs-memcached{.sh/.bat} utility. The gs-memcached utility takes one argument that points to a Space URL:

> gs-memcached <Space URL>

Note that the gs-memcached utility is only available since the GigaSpaces 8.0 release. For earlier versions, use the puInstance utility instead.

Running a single memcached server in compatible mode:

To run a memcached server that references an embedded GigaSpaces instance, we use a Space URL of the form “/./<name>”.

In our example we will use “/./MyMemcached”, which starts a memcached server that points to an embedded GigaSpaces instance named “MyMemcached”:

> gs-memcached{.sh/.bat} "/./MyMemcached"

or simply

> gs-memcached{.sh/.bat}

which uses the default Space URL “/./memcached”.

Running a single memcached load-balancer:

To run memcached in load-balancer mode, we reference a remote GigaSpaces cluster by setting the Space URL accordingly, as outlined below:

> gs-memcached{.sh/.bat} "jini://*/*/remoteGigaSpaces"

A simple memcached client:

There is nothing really unique about our memcached client. We can use any standard memcached library and point it to the host:port of our server. The default port is set to 11211.

import net.rubyeye.xmemcached.MemcachedClient;
import net.rubyeye.xmemcached.XMemcachedClient;

// Connect to the memcached endpoint exposed by the GigaSpaces listener
MemcachedClient memClient = new XMemcachedClient("localhost", 11211);
memClient.set("1", 3600 * 60, "some value"); // store under key "1" with a TTL (in seconds)
memClient.get("1");                          // read the value back
memClient.delete("1");                       // remove the entry

Monitoring and Management:

One of the benefits of using GigaSpaces as the backend data store is that we can leverage its existing monitoring and management facilities to track the statistics and activity of our memcached daemon. We can use any of the GigaSpaces management tools for that purpose, including the rich client UI, the web-based UI, and the command-line and API-based management and deployment tools.

The diagram below shows the real-time statistics gathered from the gs-ui utility.

Advanced setup:

In a more advanced setup, we would want to leverage the GigaSpaces dev-ops API and SLA-driven container to automate the deployment and manage the SLA of that deployment. A detailed description of that mode is provided in the GigaSpaces memcached documentation.

Final words – QCon next week

We often tend to look at any new technology as a sort of new religion. Quite often our needs change, and with that we tend to switch from one technology (religion) to another. Memcached and NoSQL are a good example of such a transition. I personally believe that to gain the benefit of a new technology we don't need to completely abandon previous technologies, but instead should learn from them to make the new technologies and techniques better. That is particularly true when it comes to data technology – consider SQL, which has been around for almost four decades!

A new writeup that came out last week on GigaOM, “Will Scalable Data Stores Make NoSQL a Non-Starter?,” is just a reminder of how quickly technology tends to shift. That makes taking a more evolutionary path toward the revolution we're going through more important than ever before.

During the QCon session “Yes, SQL!” next week, Uri Cohen will survey some of the NoSQL query languages available today – starting with memcached, MongoDB, Cassandra, and Redis – and outline a model for leveraging the best of each in our existing JEE world. I hope it will lead to an interactive learning session and debate, as it often does at QCon.

December 10, 2009

The QCon conference in San Francisco has always been one of my favorites. Floyd does a great job of bringing an interesting blend of people from across the spectrum of the industry (Java, .NET, Ruby) into one place. He also brought in some interesting speakers that you don't normally see at this type of developer conference, such as the VCs, whose talk I found particularly interesting. This conference is a great environment for opening your mind to ideas and thoughts outside of your day-to-day realm. It took me a few days to let all the experiences from the various discussions at the conference sink in.

Obviously it is impossible to summarize three days' worth of discussion in a single post, or even in a series of posts. I therefore picked out a few topics that I thought were the most interesting. I'll start with the VCs' keynote speech.

In this keynote speech, Kevin Efrusy from Accel Partners and Salil Deshpande from Bay Partners shared their successful experiences with open source companies such as SpringSource, Hyperic, and Grails, and tried to draw a pattern for building a successful business model in the current market economy. Below are the main points that I took from their discussion.

OSS/SaaS/Cloud has a Common Driver

OSS/SaaS/Cloud reduce the barriers to entry for consuming new technology. As a result, we are seeing a major shift in the technology selection process compared with previous years. Technology is now being selected by those who are going to actually use it, rather than by business managers. These users value simplicity, openness, and productivity more than big brands. They are much more open to new technology as long as it serves their productivity needs. This shift in decision-making culture is also reflected in company structure: it is now much more common to see senior management drawn from a similar profile of technical leadership, rather than from business school graduates. Geva Perry gave an interesting explanation for this. In today's world, innovation is key to the success, or even survival, of many companies. In such an era of innovation, technical leadership tends to have better intuition for making the choices that will make their product more successful than others.

Stephan, the lead architect of Unibet, an online gambling company, provided an interesting insight during his presentation on how he makes technology choices:

Open source software and open standards should always be the first choice.

Avoid vendor lock-in. Software that is used should have a right-to-use license without any cost attached.

Commercial, proprietary software needs to show exceptional business value (over free solutions) in order to be considered.

It's Not Just About Price

Contrary to what most people think, the actual cost of an OSS/SaaS product can be similar to that of any commercial offering, or even higher once you start to measure ROI. The core difference is that with OSS or SaaS, you pay only when you get real value. You also get to choose when you are willing to pay.

OSS/SaaS or Cloud = Cheap Marketing

Kevin made an interesting observation with respect to the business value of OSS/SaaS. If you take away the “religious” aspect of those disruptive models, one of the main business values behind OSS/SaaS can be summed up as “cheap marketing”: a quick channel to a large community that you would probably never reach otherwise.

How to Monetize on the Success of an Open Source Product?

The general rule of thumb is to monetize the things your customers consider high value, not the things that are of low value, like development tools or training.

Examples of high-value features are those that are relevant for the production system but less relevant for development, such as:

Deployment automation

Support (SLA)

Monitoring/Administration/Automation

Security

Examples of low-value features:

Development tool

Training

Charging for these low-value items can be perfectly fine for seeding your company, but it is not scalable as a long-term strategy. Nor are the two mutually exclusive: you could still run a training business alongside your other sources of revenue. The important thing is not to rely on training as the main source of revenue for your company's growth.

How to Beat the Big Players

Rely on one of the disruptive forces (OSS, SaaS, Cloud). Leverage the low marketing cost of a community-driven project to gain fast awareness (mostly through word of mouth).

Start with small components (a feature rather than a platform) and grow slowly through the value chain. An exception to this rule is JBoss, which owes its success to the adoption of J2EE. It is less likely that this model can repeat itself, as there is nothing similar to J2EE on the horizon.

Once you reach the right level of adoption, you need to start building value quickly to be able to monetize the community. The right acceleration model is acquisition of other tools in that area.

Focus first on adoption (at the expense of short-term revenue), and monetize later. It is very likely that when you start building your community, you won't have a clear answer on how and where monetization will happen. The answer often comes somewhere down the road, and it will probably involve a long trial-and-error process until you figure out the right combination that drives revenue out of your community.

Interesting examples in this regard are LinkedIn and Facebook, which are both now profitable and growing fairly fast. When they started, they didn't really know what their main source of revenue was going to be. Google is another good example.

Main Hot Trends:

“Big data on the cloud” was marked as one of the hot trends. Unfortunately, I haven't found my notes on the rest of the trends that were mentioned; I hope that either Kevin or Salil will comment on this directly.

November 30, 2009

One of the core assumptions behind many of today's databases is that disks are reliable. In other words, your data is “safe” if it is stored on a disk, and indeed most database solutions rely heavily on that assumption. Is it a valid assumption?

Disks are not reliable!

Recent analyses of disk failure patterns (see references below) break the disk reliability assumption completely. Interestingly, even though these studies were published in 2007, most people I speak to aren't aware of them at all. In fact, I still get comments from people who base their reliability solution on disk-based persistence, insisting that whether or not you are reliable amounts to whether or not your data is synced to disk.

Below are some of the points I gathered from the research, which should surprise most of us:

Actual disk failure rates are around 3% per year (vs. vendor estimates of 0.5% - 0.9%) – up to a sixfold difference between reported and actual disk failures. The main reason is that manufacturers can't test their disks for years, so their figures are extrapolations of statistical behavior, which prove to be wildly inaccurate.

There is NO correlation between failure rate and disk type – whether it is SCSI, SATA, or Fibre Channel. Most data centers are built on the assumption that investing in high-end disks and storage devices will increase reliability – well, it turns out that high-end disks exhibit more or less the same failure patterns as regular disks! John Mitchell made an interesting comment on this matter during our QCon talk, when someone pointed to RAID as their solution for reliability. John noted that since RAID is based on exact hardware replicas living in the same box, there is a very high likelihood that if a particular disk fails, its replica will fail in the same way: the disks are all the same model, handle the same load, and share the same power and temperature.

There is NO correlation between high disk temperature and failure rates – I must admit that this was a big surprise to me, but it turns out there is no correlation between disk temperature and disk failure. In fact, most failures happen when disks are at low or average temperature.

Why are existing database clustering solutions so breakable?

Most existing database clustering solutions rely on a shared disk storage to maintain their cluster state, as can be seen in the diagram below.

Typical database clustering architecture based on shared storage

The core, implicit assumption behind Oracle RAC, IBM DB2, and many other database clustering solutions is that failure can be avoided by purchasing high-end disk storage and expensive hardware (fiber optics, etc.). As the research mentioned above shows, this core assumption doesn't correlate with the failure statistics. Hence I argue that the database clustering model is inherently breakable.

Failure can happen in various shapes and forms:

Expect network connections to fail (independent of the redundancy in networking)

Expect a single machine to go out

Expect human error – for example, failure to use caution while pushing a rack of servers

How to design your application for resilience in the face of failure

Jason outlined some of the lessons from his experience at Amazon:

Expect and tolerate failures – build your infrastructure to cope with catastrophic failure scenarios: "Distributed systems will experience failures, and we should design systems which tolerate rather than avoid failures. It ends up being less important to engineer components for low failure rates (MTBF) than it is to engineer components with rapid recovery rates (MTTR)." (from Architecture Summit 2008 - Write up)

Code for large-scale failures – design your application to cope with the fact that services can come and go.

Expect and handle data and message corruption – don't rely on TCP alone to ensure the integrity of your data. Add validation in your application code to make sure that messages are not corrupted (a minimal checksum sketch follows this list).

Code for elasticity – load/demand may grow unexpectedly. Elasticity is essential to dealing with unexpected demand without bringing the system to its knees. It is also important that when you exceed your capacity, the system won’t crash but instead report denial of service.

Monitor, extrapolate and react – monitoring at the right level enables you to detect failures before they happen and take the appropriate corrective actions.

Code for frequent single-machine failures – a rack failure can cause the immediate disappearance of multiple machines at the same time.

Game days – simulate failures and learn how to continuously improve your operational procedure for handling failures and architecture.
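As a minimal sketch of the message-validation point above, assuming nothing beyond the JDK: attach a CRC32 checksum to each outgoing payload and verify it on receipt, rather than trusting the transport alone.

import java.util.zip.CRC32;

public class MessageChecksum {
    // Compute a CRC32 checksum over the payload; the sender ships it alongside the message.
    static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    // The receiver recomputes the checksum and rejects the message on a mismatch.
    static boolean isIntact(byte[] payload, long expected) {
        return checksum(payload) == expected;
    }

    public static void main(String[] args) {
        byte[] message = "some payload".getBytes();
        long crc = checksum(message);
        System.out.println(isIntact(message, crc)); // true unless the payload was corrupted
    }
}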

Memory can be more reliable than disk

Many people assume that memory is unreliable as a data store. That assumption holds true if your data “lives” on a single machine; in that case, if the machine fails or crashes, your data goes with it. But what if you distribute the data across a cluster of nodes and maintain more than one copy of the data over the network? In this case, if a node crashes the data is not gone; it lives elsewhere and can be continuously served from one of its replicas.
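A rough back-of-the-envelope illustration (assuming independent failures and a one-hour window to restore a lost replica): with a 3% annual failure rate per node, the chance that the backup also fails within the hour after the primary dies is about 0.03/8760, so the annual probability of losing both copies of a partition is roughly 0.03 × (0.03/8760) – on the order of one in ten million, several orders of magnitude better than the 3-in-100 annual failure rate of a single disk.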

I would argue that, under these conditions, memory can be more reliable than disk. One of the reasons is the hardware itself – unlike disk drives, which rely on mechanical moving parts, RAM is just a silicon chip, so the chances of failure can be significantly smaller. Unfortunately, I don't have the data to back up this argument, but I think it's very likely the case. In an article published on InfoQ titled “RAM is the new disk”, Tim Bray makes an interesting observation on this matter:

Memory is the new disk! With disk speeds growing very slowly and memory chip capacities growing exponentially, in-memory software architectures offer the prospect of orders-of-magnitude improvements in the performance of all kinds of data-intensive applications. Small (1U, 2U) rack-mounted servers with a terabyte or more of memory will be available soon, and will change how we think about the balance between memory and disk in server architectures. Disk will become the new tape, and will be used in the same way, as a sequential storage medium (streaming from disk is reasonably fast) rather than as a random-access medium (very slow). Tons of opportunities there to develop new products that can offer 10x-100x performance improvements over the existing ones.

As a matter of fact, we (GigaSpaces/Cisco) are working these very days on benchmarking the Cisco UCS, where we are evaluating the possibility of managing hundreds of gigabytes, and potentially even terabytes, of data in a single node. The results look extremely promising, so that future is not far ahead.