
March 07, 2013

As the technology for building and running cloud infrastructure matures, it is starting to spread into more industries, revolutionizing how even the most conservative organizations run their operations.

One of the areas undergoing a transformation is carrier backbone services. For those who are not familiar, carrier backbone services include cell and network services (DHCP, DNS, ...), content serving (SMS, MMS, ...), activation services, CRM, call centers, etc. Moving these critical carrier services from their existing environment tends to be labor-intensive and proprietary. Choosing to move applications into a more open and virtualized environment such as the cloud can yield significant cost savings. An open cloud environment also enables carrier organizations to reduce their time to market for delivering new services.

At a time when the Telco/Carrier market is increasingly competitive, moving to a cloud-based carrier backbone can be more than a cost-saving initiative. It can be a differentiator from the competition, and it is critical for the survival and success of the business.

That said, running carrier-grade services requires special care to meet the required SLAs in terms of latency, deterministic behavior, performance, location awareness, etc. These challenges are unique enough not to fit into your mom-and-pop cloud.

The purpose of the Carrier Grade Cloud and Carrier Grade PaaS is to address these gaps and challenges.

In this post I'll try to provide a more detailed overview of what Carrier Grade Cloud and PaaS actually mean. I will use examples based on GigaSpaces' joint work with Alcatel-Lucent, which recently launched a new product in this space named CloudBand that uses Cloudify for its Carrier PaaS layer.

What does Carrier Grade mean? Learning from the Weather Channel experience during the Sandy superstorm

The Weather Channel's experience during Sandy is an excellent example of the need for carrier grade services. Below are some of the key statistics from Hurricane Sandy:

1000% - The Weather Company’s traffic increase during Hurricane Sandy

110 GB - The amount of data served every second during Sandy

170,000 - Peak number of simultaneous streams of video served during Sandy

1 - The number of data centers that went down during the storm

To address this demand during the storm, the Weather Channel was running out of 13 data centers managed by Verizon across North America, with load balancing between them. During the storm, Verizon increased bandwidth capacity to meet the peak demand.

This sort of increased traffic behavior wasn't unique to the Weather Channel.

So what can be learned from this experience? What makes a service Carrier Grade?

Learning from the Weather Channel experience, we can define a Carrier Grade service as one with the following attributes:

Critical to the business function

Designed for massive scale

Designed to deal with major usage spikes

Location sensitive

Designed to provide deterministic response under extreme conditions

This is obviously a fairly simplistic definition, but for the sake of this discussion I think it will suffice.

What Makes a Cloud/PaaS Carrier Grade?

There are various attributes that make a Cloud/PaaS carrier grade, as I noted earlier. The two most important attributes, IMO, are the network and multi-site deployments. Let me explain why:

The Network

One of the elements that is extremely important in a Carrier Grade environment is the ability to assert control over the network.

That includes control over:

Isolation

Bandwidth

Latency

Cross-Cloud/Data Center Deployments:

Another critical element of a successful Carrier environment is multi-site deployment. As seen with the Weather Channel's use of 13 sites, multi-site deployment is important for continuous availability and scaling. Serving content closer to the end user's location also optimizes latency and helps deal with the challenges of data delivery.

So how are things done today?

The current Carrier backbone runs on physical appliances, which basically maps to lots of iron. In this environment, scaling capacity means buying more appliances. While this model works, it has two main drawbacks:

cost (infrastructure/operation)

lack of agility (i.e., it takes months and sometimes years to launch a new service in this environment).

Alcatel CloudBand -- Carrier Grade IaaS/PaaS

Alcatel CloudBand is a new platform that lets Telco apps easily leverage carrier cloud services.

It is comprised of a few main elements.

Multi node/site IaaS -- a multi-site/Cloud infrastructure. The CloudBand infrastructure is essentially a policy-based management layer over a large number of cloud nodes. Each cloud node can run either an OpenStack or CloudStack-based infrastructure. These nodes can live in many disparate data centers. Alcatel CloudBand glues all of the disparate nodes together into a single big cloud that is accessible through an OpenStack API.

CPaaS -- stands for Carrier Grade PaaS, which is essentially the framework enabling the on-boarding of carrier services onto the CloudBand infrastructure via a simple click-and-run user interface. Cloudify is integrated as an integral part of the CloudBand offering.

CloudBand's Unique Approach: Putting Network and Application Together

One of the unique aspects of the CloudBand architecture is its holistic approach to Network and Application. Standard cloud infrastructures tend to look at the two pieces as separate black boxes that run one on top of the other.

What does this new approach to Network and Applications really mean?

Two example scenarios that I often use to describe the value of putting network and application together are disaster recovery and cloud bursting. In today's cloud, DR involves lots of wiring: I need to explicitly point one segment of the application to a particular cloud zone and another segment to a different zone. Beyond the complexity of setting these zones up, it also means that a good degree of manual intervention is required to handle a recovery or scaling process in this environment.

Taking an automated SLA-driven approach to IaaS

Instead of explicitly identifying the zones in our cloud, with automated SLAs we can simply ask the cloud to figure out the right zone for the job based on our application SLA. For example, a user could simply say something like "deploy RingTone service with continuous-availability=true, redundancy=3, and distance-between-sites=100km". Most of that information is known to the CloudBand management at deployment time, so it can allocate machine instances based not solely on image ID and zone ID, but also on those SLA requirements.
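
To make this concrete, here is a minimal sketch of what such an SLA-driven deployment call could look like. The API below is hypothetical (it is not the actual CloudBand interface); it only illustrates the shift from pinning a deployment to image and zone IDs toward declaring the SLA and letting the management layer choose the placement:

```java
// Hypothetical SLA-driven deployment API -- illustrative only.
public class SlaDrivenDeploymentSketch {

    // What the deployer declares instead of explicit image/zone IDs.
    public record DeploymentSla(boolean continuousAvailability,
                                int redundancy,
                                int minDistanceBetweenSitesKm) {}

    // Stands in for the cloud management endpoint.
    public interface CarrierCloud {
        void deploy(String serviceName, DeploymentSla sla);
    }

    // "Deploy RingTone service" with the SLA from the example above;
    // the infrastructure picks the sites and machines that satisfy it.
    public static void deployRingTone(CarrierCloud cloud) {
        cloud.deploy("RingTone", new DeploymentSla(true, 3, 100));
    }
}
```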

Integrating the PaaS with the network

Many of the current PaaS solutions were designed to work with a simple cloud infrastructure.

If we design our PaaS solutions to work on top of a more intelligent infrastructure like CloudBand, one that can accept SLA-driven calls to coordinate infrastructure management, a revolution can happen. We can start offloading to the infrastructure some of the responsibility for allocating the right machine instance to a particular application tier. The infrastructure could be made aware that we're deploying a data service and would therefore ensure that the nodes of that database don't reside on the same physical machine, or even the same data center. Another area where responsibility could be delegated to the infrastructure is network isolation. Instead of dealing with security groups, the system can attach a particular network to a given application or a tier within that application, and the infrastructure will make sure that any machine allocated for this service is attached to that network.
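
As a sketch of what delegating these decisions might look like, consider a declarative per-tier policy. Again, the names are hypothetical, not a real CloudBand or OpenStack API; the point is that anti-affinity and network attachment become declarations the infrastructure enforces, rather than logic the application scripts by hand:

```java
// Hypothetical declarative placement/network policy -- illustrative only.
public class TierPolicySketch {

    public enum AntiAffinity { NONE, DIFFERENT_MACHINE, DIFFERENT_DATA_CENTER }

    public static class TierPolicy {
        final String tierName;
        final AntiAffinity antiAffinity; // e.g. database nodes never share a host
        final String networkName;        // every instance of the tier joins this network

        TierPolicy(String tierName, AntiAffinity antiAffinity, String networkName) {
            this.tierName = tierName;
            this.antiAffinity = antiAffinity;
            this.networkName = networkName;
        }
    }

    // A data tier whose nodes must not share a data center, attached to a
    // dedicated network; the infrastructure enforces both on every allocation.
    public static final TierPolicy DATA_TIER =
            new TierPolicy("mysql", AntiAffinity.DIFFERENT_DATA_CENTER, "app-data-net");
}
```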

Final Words

For years there has been discussion of the missing piece in the cloud puzzle - the network. Today we're at a point where this gap is starting to be filled by projects like Quantum in OpenStack. In addition to OpenStack, the Telco industry is also launching a new initiative titled NFV, which stands for Network Function Virtualization. NFV was born in October 2012, when AT&T, BT, China Mobile, Deutsche Telekom and many other Telcos introduced the NFV Call to Action document. It basically aims to combine new network APIs with virtualization and thus provide a standard model for a virtualized Carrier Cloud.

While it seems that the industry is moving in the right direction toward virtualization of the backbone systems, most of the effort seems to be focused on standardization at the lower levels of the stack. Very little has been done to draw the real end game, i.e., what an end-to-end Carrier backbone would look like with that new virtualized infrastructure in place. More importantly, we haven't yet begun to think about the implications of that infrastructure change for the applications and services on top of it.

This is what excites me about the CloudBand project. CloudBand doesn't just end up with yet another fancy infrastructure piece that we don't necessarily know what to do with. It takes a holistic approach and maps those fancy features into a real end-to-end solution, which at the most basic level means that setting up data and network clusters, disaster recovery or cloud bursting scenarios can now be fully automated in a much simpler fashion than in most current cloud infrastructure environments.

At a more strategic level, that means Carriers can now rely on the cloud as an infrastructure that can manage their backbone services, and thus leverage cloud economics to meet their cost and business challenges.

June 01, 2009

In times like these, improving application performance isn't a major focus for most IT organizations. The common perception is that as long as you're meeting the bare minimum demanded by your users, you're okay - anything beyond that is a luxury you can’t afford. Well, I happen to think this perception is dead wrong: these days, you just can’t afford not to invest in high performance. The reason is simple: high performance == higher utilization.

I'll explain what I mean. If you do something that makes an application run 10 times faster (this is a typical performance boost experienced by GigaSpaces users - and by the way, XAP 7.0 will be even faster), without changing your loads or service levels, then that application will consume 90% fewer resources. In other words, you can consolidate the servers running this application at a ratio of 10:1. The amazing thing is, this isn't instead of the server consolidation you'll get from vendors like VMware - it comes on top of and in addition to it, because it helps you cram more virtual machines and more applications onto every piece of physical hardware.

A great example of this is an eBay subsidiary, Marktplaats, which has moved its application to XAP and now expects to reduce its data center from a few hundred servers to only a handful - a whopping 18:1 consolidation ratio. Marktplaats says this reduction is largely a result of the huge performance boost it experienced, made possible by XAP's In-Memory Data Grid and parallel processing capabilities.

XAP also makes it possible for extreme performance to thrive in unexpected places. One example is an XTP trading platform which, thanks to GigaSpaces XAP, has become SaaS-enabled, a major differentiator for the platform's makers, Orbyte Solutions. Another is our recently-announced joint solution with Mule, the open source ESB, which proves that "high performance SOA" is not an oxymoron :)

There are many challenges - there is no out-of-the-box infrastructure for hosting the typical J2EE and SOA stack in the cloud. There is no WebLogic, WebSphere, ALBPM, or message bus like Tibco available in the cloud.

A development team could certainly move all of this into the cloud, but the configuration, licensing issues etc. are all something the team would have to solve on its own. This is far too bleeding edge for many people.

...The problem of adding additional resources dynamically (e.g., more WebCache instances or WebLogic servers) requires a sophisticated distributed system management infrastructure where the entity being managed is no longer a physical or virtual box, but rather an array of boxes acting collectively as a single system...

Below are the main takeaways from Grig's summary, which I found relevant for this discussion:

1) Deploy multiple Web servers

2) Deploy multiple load balancers

3) Deploy several database servers

4) Another way of dealing with databases is to not use them

Challenges summary

It's easy to see that there is a common theme behind all those comments. Taking existing enterprise applications to the cloud can be very difficult, simply because a) most of today's enterprise applications were built using frameworks and technologies not yet supported as first-class citizens by cloud providers, and b) most of those applications were not designed to take advantage of the cloud's elasticity.

Rather than pointing to my direct response to each of those challenges, I thought it would be better to provide a short summary of the main possible solutions that came up in this discussion.

Does it have to be that difficult?

No. Below are two main approaches to those challenges.

- Packaging static images

The simplest approach would obviously be to package your local IT environment into images that could be easily ported to the cloud and run in the exact same way as in your local IT environment, right? Well, yes, you can package anything into an image bundle and host your virtual machines in a reserved mode with a fixed IP configuration. However, being technically able to do that doesn't mean it makes sense. I would question what the difference is between this environment and any other hosting environment, and what you expect to gain by moving to such a hosted environment vs. running it in your local IT environment.

If you try deploying your existing IT application on the cloud using static images, then most likely you'll end up "porting" not just the application but also the problems you were facing in your local IT environment; i.e., your application will be over-provisioned based on the peak load and you'll end up with a poorly utilized environment.

- Fully elastic application

The main driver for moving to a cloud-based environment in the first place was to be able to grow as you need and pay for what you use.

The question is whether you can deploy your application without changes while at the same time leveraging the elasticity that the cloud brings.

Sounds impossible? Well, a good example that does just that is storage. With storage you can take your existing application, run it with your local (static) disk, and then plug in a network storage device and run the same application on that device, without changing the application. In that world, instead of taking your existing local disk and virtualizing it, you take the application and plug it into another device that has virtualization built in.

We can use the same approach as with storage; i.e., take your existing application code and run it on top of a different underlying implementation that enables you to capture the elasticity of the cloud without forcing you to re-write your entire application. If you're running in a JEE environment, it should be fairly easy, as the sketch below illustrates. If your application has strong "ties" to back-end systems, you can use a hybrid model, where your application front-end runs on the cloud while staying connected to the back-end systems through a secure communication channel.
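
Here is a minimal illustration of the principle in a JEE setting. The application depends only on the standard DataSource interface; the JNDI name, table and schema below are made up for the example. Whether the name resolves to a local database or to an elastic, cloud-hosted implementation is a configuration decision, not a code change:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

// The code is written against the standard DataSource interface only.
public class AccountDao {
    private final DataSource dataSource;

    public AccountDao() throws NamingException {
        // "jdbc/AppDS" is bound by the container; swapping its target
        // (local disk-backed DB vs. elastic cloud service) needs no code change.
        this.dataSource = (DataSource) new InitialContext().lookup("jdbc/AppDS");
    }

    public boolean accountExists(String id) throws SQLException {
        try (Connection con = dataSource.getConnection();
             PreparedStatement stmt =
                     con.prepareStatement("SELECT 1 FROM accounts WHERE id = ?")) {
            stmt.setString(1, id);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next();
            }
        }
    }
}
```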

Is it really that simple?

Yes. My experience with the integration work we have done between our Cloud Computing Framework (CCF) and EC2 showed me that getting a production-ready JEE application up and running, including a load balancer, self-healing, auto-scaling, security, a database and even data grid capabilities plugged in, is actually much simpler than with any other environment I'm aware of. This is due to the built-in automation, the predefined images, and the fact that I don't need to download and set up anything to get the entire system running. In fact, it's so simple that we decided to build our entire Demo-as-a-Service framework around it; a framework that is used quite successfully and constantly with customers, prospects, and now with partners.

Are there any production references?

Yes. Read Jim Liddle's blog post for a good example of an enterprise JEE deployment that is already running on EC2, in production, on top of our new Cloud Computing Framework. In his post, Jim describes how this Telco operator was able to address, in a relatively simple manner, some of the common challenges mentioned above, such as security, flexibility, cost, development complexity and lock-in, as well as high availability and scalability.

What next?

In my talk at the CloudSlam event I'll try to provide a more detailed practical guide demonstrating how you can take a step-by-step approach to porting an existing JEE application to the cloud. I hope to get lots of questions and feedback during the discussion so that I can share those with you in one of my follow-up posts.

November 10, 2008

Last week I took part in an interesting discussion with a group of architects, and the question of build vs. buy came up. It came up specifically in the context of the recent experience of many new Internet companies. I was wondering why it is that so many of them seem to spend so much on developing their own proprietary infrastructure, when it's clear that their needs are not that unique and that such development is not really part of their core IP. Many of them seem to continuously go through difficult experiences until they get their infrastructure right. And it seems that they all stumble into the same pitfalls along the way.

The typical answers as to why they build vs. buy were:

It's core to our intellectual property and therefore we have to own all of our infrastructure

We didn't find a solution that fits our needs, since our needs are very unique

We had a bad experience with Product FooBar, which made us reevaluate build vs. buy

I can see how I'd react in exactly the same way; the most basic human instinct, when entering uncharted territory, is to rely only on yourself.

But looking at the amount of repeated failure over the past few years, it's pretty clear that this pattern isn't really proving itself too well either. Even when we choose to build it ourselves, according to our own specific in-house requirements, we still end up falling into the same trap over and over again.

Where to draw the line of build vs. buy?

To answer that question, I looked at Fred Brooks's article "No Silver Bullet," which was pointed out to me again by one of our lead architects a few weeks ago.

One of the interesting points was the drastic impact of the economy on the build vs. buy decision pattern:

"The development of the mass market is, I
believe, the most profound long-run trend in software engineering. The cost of
software has always been development cost, not replication cost. Sharing that
cost among even a few users radically cuts the per-user cost. Another way of
looking at it is that the use of N copies of a software system effectively
multiplies the productivity of its developers by N. That is an enhancement of
the productivity of the discipline and of the nation.

The key issue, of course, is applicability. Can I use an available off-the-shelf package to perform my task? A surprising thing has happened here. During the 1950's and 1960's, study after study showed that users would not use off-the-shelf packages for payroll, inventory control, accounts receivable, and so on. The requirements were too specialized, the case-to-case variation too high. During the 1980's, we find such packages in high demand and widespread use.

What has changed? Not the packages, really. They may be somewhat more generalized and somewhat more customizable than before, but not much. Not the applications, either. If anything, the business and scientific needs of today are more diverse and complicated than those of 20 years ago.

The big change has been in the hardware/software cost ratio. In 1960, the buyer of a two-million dollar machine would have felt that he could afford $250,000 more for a customized payroll program, one that slipped easily and nondisruptively into the computer-hostile social environment. Today, the buyer of a $50,000 office machine cannot conceivably afford a customized payroll program, so he adapts the payroll procedure to the packages available. Computers are now so commonplace, if not yet so beloved, that the adaptations are accepted as a matter of course."

The impact of cloud computing on the buy vs. build decision

I think Fred's analysis above is much more than just a historic curiosity. Exactly the same process is playing out today, with the advent of cloud computing and virtualization techniques that are turning IT infrastructure into a commodity, on the road to becoming a utility, and dramatically reducing its total cost.

As Fred says in his paper - when the hardware gets cheap, development becomes very expensive. Under these new conditions, we're all going to have to change how we evaluate off-the-shelf products compared to the alternative of developing in-house. Proper TCO measurements need to be put in place at an early stage of the decision-making process.

For example, it will no longer be sufficient to choose a product based on the "best performance" or even "best reliability," because each of those factors has a direct cost associated with it. Instead, we are forced to have a better picture of the business requirements, so that we can choose the right product to meet our business needs. It's not always going to be the case that the best product from a technical perspective is the right product - and the cheapest product won't be the right product either.

"The hardest single
part of building a software system is deciding precisely what to
build. No other part of the conceptual work is as difficult
as establishing the detailed technical requirements, including all the
interfaces to people, to machines, and to other software systems. No other part
of the work so cripples the resulting system if done wrong. No other part is
more difficult to rectify later."

It is quite surprising to see how much of the current decision-making process is not based on real business requirements. It is even more surprising to see how little we as architects and business people know about our own systems' requirements and real application behavior.

A good example given in the architect meeting is user experience. One participant in the discussion said that at one point he was focused on making the latency of serving his site's pages as fast as possible, and did a good job at that. But at the end of the day, when measured against a competing site that was performing slower, users felt the competing site performed better. The reason was simple: the other site focused on user experience, which led to fewer clicks per request, rather than on how fast a single request executes.

If using off-the-shelf products can cut costs dramatically, why are there so many product failures?

Fred provides an interesting answer to that question as well:

"Much of present-day software-acquisition procedure
rests upon the assumption that one can specify a satisfactory system in
advance, get bids for its construction, have it built, and install it. I think
this assumption is fundamentally wrong, and that many software-acquisition
problems spring from that fallacy. Hence, they cannot be fixed without
fundamental revision--revision that provides for iterative development and
specification of prototypes and products."

Final words

You might be thinking by now that these are all new lessons learned from the recent changes in the economy, right? Wrong. Go check when Fred Brooks' article was written.

If anything, I would strongly recommend that everyone reading this post spend time reading Fred's article from start to finish, because I've only covered a small part of the philosophy behind his paper. I think the paper's viewpoint is extremely relevant today - perhaps even more relevant than it was when he originally wrote it.

March 29, 2008

With the recent acquisition of MySQL by Sun, there has been talk about the MySQL open source database now becoming relevant to large enterprises, presumably because it now benefits from Sun's global support, professional services and engineering organizations. In a blog post about the acquisition, Sun CEO Jonathan Schwartz wrote that this is one of his objectives.

While the organizational aspects may have been addressed by the acquisition, MySQL faces some technology limitations which hinder its ability to compete in the enterprise. Like other relational databases, MySQL becomes a scalability bottleneck because it introduces contention among the distributed application components.

There are basically two approaches to this challenge that I'll touch on in this post:

1. Scale your database through database clustering

2. Scale your application, while leaving your existing database untouched, by front-ending the database with an In-Memory Data Grid (IMDG) or caching technology. The database acts as a persistence store in the background. I refer to this approach as Persistence as a Service (PaaS).

While both options are valid (with pros and cons), in this post I'll focus mostly on the second approach, which introduces some thought-provoking ideas for addressing the challenge.

Disclaimer: While there are various alternative in-memory data grid products, such as Oracle Coherence and IBM ObjectGrid, in this post I'll focus on the GigaSpaces solution, because for obvious reasons I happen to know it better. Having said that, I try to cover the core principles presented here in generic terms as much as possible.

Scaling your database through database clustering:

There are two main approaches for addressing scalability through database clustering:

Database replication is used to address concurrent access to the same data. Database replication enables us to load-balance access to the shared data elements among multiple replicated database instances. In this way we can distribute the load across database servers and maintain performance even as the number of concurrent users increases. (A toy read/write-splitting sketch follows the limitations list below.)

Limitations:

Limited to "read mostly" scenarios: when it comes to inserts and updates, replication overhead may be a bigger constraint than working with a single server (especially with synchronous
replication)

Performance: Constrained by disk I/O performance.

Consistency: asynchronous replication leads to inconsistency, as each database instance might hold a different version of the data. The alternative -- synchronous replication -- may cause significant latency.

Utilization/Capacity: replication assumes that all nodes hold the entire data set. This creates two problems: 1) each table holds a large amount of data, which increases query/index complexity; 2) we need to provision (and pay for) more storage capacity in direct proportion to the number of replicated database instances.

Complexity: most database replication implementations are hard to configure and are known to cause stability issues.

Non-Standard: each database product has different replication semantics, configuration and setup. Moving from one implementation to another might become a nightmare.
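
To make the replication model concrete, here is a toy read/write-splitting router. It is a sketch, not production code: writes always go to the primary, reads are spread across replicas, and it assumes the application tolerates replication lag on reads:

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Toy router over a replicated database: one primary, N read replicas.
public class ReplicatedDataSourceRouter {
    private final String primaryUrl;
    private final List<String> replicaUrls;

    public ReplicatedDataSourceRouter(String primaryUrl, List<String> replicaUrls) {
        this.primaryUrl = primaryUrl;
        this.replicaUrls = replicaUrls;
    }

    // Inserts/updates must hit the primary to avoid write conflicts.
    public String urlForWrite() {
        return primaryUrl;
    }

    // Reads are load-balanced across replicas (which may lag the primary).
    public String urlForRead() {
        return replicaUrls.get(ThreadLocalRandom.current().nextInt(replicaUrls.size()));
    }
}
```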

Database partitioning ("sharding"): database shards/partitions enable the distribution of data on multiple nodes. In other words, each node holds part
of the data. This is a better approach for scaling both read and write
operations, as well as more efficient use of capacity, as it
reduces the volume of data in each database instance.

Limitations:

Limited to applications whose data can be easily partitioned.

Performance: we are still constrained by disk I/O performance

Requires changes to the data model: we need to modify the database schema to fit a partitioned model. Many database implementations require that knowledge of which partition the data resides in be exposed to the application code, which brings us to the next point.

Requires changes to application code: requires a different model for executing aggregated queries (map/reduce and the like).

Static: in most database implementations, adding or changing partitions involves down-time and re-partitioning.

Complex: setting up database partitions is a fairly complex task, due to the number of moving parts and the potential for failure during the process.

Non-standard: as with replication, each database product has different partitioning semantics, configuration and setup. Partitioning introduces more severe limitations, as it often requires changes to our database schema and application code when moving from one database product to another.
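
And here is the shard-routing sketch promised above: a minimal hash-based router that decides which database instance holds a given record. Real implementations use richer schemes (consistent hashing, directory lookups), but the principle is the same:

```java
import java.util.List;

// Minimal hash-based shard routing -- illustrative only.
public class ShardRouter {
    private final List<String> shardUrls; // one database URL per partition

    public ShardRouter(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    // Route a record to a shard based on its partitioning key.
    public String shardFor(Object routingKey) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return shardUrls.get(Math.floorMod(routingKey.hashCode(), shardUrls.size()));
    }
}
```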

Time for a change - is database clustering the best we can do?

The fundamental problems with both database replication and database partitioning are the reliance on the performance of the file system/disk and the complexity involved in setting up database clusters. No matter how you turn it around, file systems are fairly ineffective when it comes to concurrency and scaling. This is pure physics: how fast can disk storage be when every data access must go through serialization/de-serialization to files, as well as mapping from binary format to a usable format? And how concurrent can it be when every file access relies on moving a physical needle between different file sectors? This puts hard limits on latency, and latency is often severely affected by lack of scalability. Putting the two together, file systems -- and the databases that rely heavily on them -- suffer from limited performance and scalability.

These database patterns evolved under the assumption that memory is scarce and expensive, and that network bandwidth is a bottleneck. Today, memory resources are abundant and available at a relatively low cost. So is bandwidth. These two facts allow us to do things differently than we used to, back when file systems were the only economically feasible option.

Scaling through In Memory Caching/Data Grid

It is not surprising that, to enhance scalability and performance, many Web 2.0 sites use an in-memory caching solution as a front-end to the database. One such popular solution is memcached. Memcached is a simple open source distributed caching solution that uses a protocol-level interface to reference data that resides in an external memory server. Memcached enables rudimentary caching and is designed for read-mostly scenarios. It is used mainly as an addition to the LAMP stack.

The simplicity of memcached is both an advantage and a drawback. Memcached is very limited in functionality. For example, it doesn't support transactions, advanced query semantics, or local cache. In addition, its protocol-based approach requires the application to be explicitly exposed to the cache topology, i.e., it needs to be aware of each server host and explicitly map operations to a specific node. These limitations prevent us from fully exploiting the memory resources available to us. Instead, we are still heavily relying on the database for most operations.

Enter in-memory Data Grids.

In-memory data grids (IMDG) provide object-based database capabilities in memory, and support core database functionality such as advanced indexing and querying, transactional semantics and locking. IMDGs also abstract the data topology from application code. With this approach, the database is not completely eliminated, but put in the *right* place. I refer to this model as Persistence as a Service (PaaS). I covered the core principles of this model in this post. Below I'll respond to some of the typical questions I am asked when I present this approach.

How does Persistence as a Service work?

With PaaS, we keep the existing databases as-is: same data, same schema and so on. We use a "memory cloud" (i.e., an in-memory data grid) as a front-end to the database. The IMDG loads its initial state from the database and from that point on acts as the "system of record" for our application. In other words, all updates and queries are handled by the IMDG. The IMDG is also responsible for keeping the database in sync. To reduce performance overhead, synchronization with the database is done asynchronously. The rate at which the database is kept in sync is configurable.

The in-memory data model can be different from the one stored in the database. In most cases, the memory-based data model will be partitioned to gain maximum scalability and performance, while the database remains unchanged.
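
The following sketch shows the shape of this flow. The interface is hypothetical (the actual GigaSpaces API differs); what matters is that the application reads and writes against memory, while the database is updated behind the scenes:

```java
// Hypothetical IMDG facade -- a sketch of the PaaS flow, not a real API.
interface DataGrid {
    <T> void write(T entry);                  // safe once in memory and its sync backup
    <T> T readById(Class<T> type, Object id); // served from memory, not the database
}

class Trade {
    String tradeId;
    double amount;
}

class TradeService {
    private final DataGrid grid;

    TradeService(DataGrid grid) {
        this.grid = grid;
    }

    void bookTrade(Trade trade) {
        // Acknowledged once the in-memory primary (and its backup) hold it;
        // a mirror service later flushes it to the database in batches.
        grid.write(trade);
    }

    Trade lookup(String tradeId) {
        // Queries never hit the database.
        return grid.readById(Trade.class, tradeId);
    }
}
```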

How does PaaS improve performance compared to a relational database?

Performance gains over relational databases are achieved because:

PaaS relies on memory as the system of record, and memory is significantly faster and more concurrent than file systems.

Data can be accessed by reference, i.e., there is no need for continuous serialization of data, as with a file system.

Data manipulation is performed directly on the in-memory objects. Complex manipulation is easily achieved by running either Java/.Net/C++ code or a SQL query. There is no need for serialization/de-serialization of data or network calls during the process.

Reduced contention: instead of placing all data in a single table, and consequently having many clients accessing that table, we split it into many small tables, each of which will be accessed by a smaller number of clients.

Parallel aggregated queries: queries that need to span multiple partitions to perform join/sum/max operations can be executed in parallel across the nodes. The fact that the queries run on smaller data sets reduces the time it takes to perform the actual operation on each node. In addition, the fact that queries execute on multiple machines leverages the full CPU and memory power of those machines. (See the scatter-gather sketch after this list.)

In-process local cache: read-mostly operations are cached in the client application's local address space. This means that subsequent reads will be executed locally.

Avoid Object-Relational Mapping (ORM): read operations are performed directly from memory in object format. Thus, there is no need for O/R mapping overhead at this level. O/R mapping happens in the background, either during the initial load process or during the asynchronous update of the database.
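
Here is the scatter-gather sketch referenced above, written with plain Java executors. An IMDG handles the routing and partition-side execution for you, but the shape of the computation (parallel partial aggregation, then a client-side reduce) is the same:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy scatter-gather: each Callable stands in for the same aggregation
// running against one partition's (smaller) data set.
public class ParallelSum {

    public static long sumAcrossPartitions(List<Callable<Long>> perPartitionQueries)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(perPartitionQueries.size());
        try {
            long total = 0;
            for (Future<Long> partial : pool.invokeAll(perPartitionQueries)) {
                total += partial.get(); // client-side reduce step
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```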

If you keep the database in sync, isn't your solution limited by database performance?

No, because:

Data is sent asynchronously and in batches.

Updates are performed in parallel by all partitions.

Updates to the database are executed collocated on the same machine as the database, through a mirror service. This reduces the network overhead to the database and benefits from specific optimizations such as batch operations.

The database is not used for high-availability purposes. This means that in-flight transactions are not stored in the database, only the end results of business transactions. This reduces the number of updates sent to the underlying database. Also keep in mind that queries don't really hit the database, only updates and inserts. All this together means that the IMDG acts as a smart buffer to the database. It is common for the IMDG to receive 10x more read/update hits than the underlying database sees.

The database and the application are now decoupled, giving you more options for optimization. For example, there are scenarios where writing to the database is required to ensure the durability of the data. In this scenario, you can store the data directly in a persistent log (to ensure durability), which can be updated at a relatively high rate, and read the data from the persistent log back into the database as an off-line operation. With these options in place we can easily get to 30,000 to 40,000 updates per second with a single instance of MySQL. If this is not sufficient, you can always add database clustering to speed up database access.

Doesn't asynchronous replication mean that data might be lost in case of failure?

No, because asynchronous replication refers to the transfer of data between the IMDG and the database. The IMDG, however, maintains in-memory backups that are synchronously updated. This means that if one of the nodes in a partitioned cluster fails before replication to the underlying database takes place, its backup will be able to instantly continue from that exact point.

What happens if one of my memory partitions fails?

The backup of that partition takes over and becomes the primary. The data grid's cluster-aware proxy re-directs the failed operation to the hot backup implicitly. This enables a smooth transition of the client application during failure -- as if nothing happened. Each primary node may have multiple backups to further reduce the chance of total failure. In addition, the cluster manager detects the failure and provisions a new backup instance on one of the available machines.

What happens if the database fails?

The IMDG maintains a log of all updates and can re-play them as soon as the database becomes available again. It is important to note that during this time the system continues to operate unaffected. The end user will not notice this failure!

How do I maintain transactional integrity?

The IMDG supports the standard two-phase commit protocol and XA transactions. Having said that, this model should be avoided as much as possible, because it introduces dependencies among multiple partitions and creates a single point of distributed synchronization in our system. Using a classic distributed transaction model doesn't take advantage of the full linear scalability potential of the partitioned topology. Instead, the recommended approach is to break transactions into small, loosely-coupled services, each of which can be resolved within a single partition. Each partition can maintain transaction integrity using local transactions. This model ensures that in partial failure scenarios the system is kept in a consistent state.

How is transactional integrity maintained with the database?

As noted above, distributed transactions might introduce a severe performance and scalability bottleneck, especially if done with the database. In addition, attempting to execute transactions with the database violates one of the core principles behind PaaS: asynchronous updates to the database. To avoid this overhead, the IMDG ensures that transactions are resolved purely in-memory and are sent to the database in a single batch. If the update to the database fails, the system will re-try the operation until it succeeds.

This model relies heavily on partitioning. How do I handle queries that need to span multiple partitions?

Aggregated queries are executed in parallel on all partitions. You can combine this model with stored procedure-like queries to perform more advanced manipulations, such as sum and max. See more details below.

What about stored procedures and prepared statements?

Because the data is stored in memory, we avoid the use of a proprietary language for stored procedures. Instead, we can use either native Java/.Net/C++ or dynamic languages, such as Groovy and JRuby, to manipulate the data in memory. The IMDG provides native support for executing dynamic languages, routes the query to where the data resides, and enables aggregation of the results back to the client. A reducer can be invoked on the client side to execute second-level aggregation. See a code example that illustrates how this model works here.

Can I change these prepared statements and stored procedure equivalents without bringing down the data?

Yes. When you change the script, it is reloaded to the server while the server is up, without the need to bring down the data. The same capability exists in case you need to refresh collocated services code on the server side.

Do I need to change my application code to use an IMDG?

It depends. There are cases in which introducing an IMDG can be completely seamless, and there are cases in which you will need to go through a re-write, depending on the programming model:

The IMDG abstracts transaction handling from the code. The domain model is based on POJOs and therefore doesn't require explicit changes, only annotations (which can also be provided through an external XML file). If your application already uses a DAO pattern, then only the DAO implementation needs to change, which narrows down the scope of changes required to use an IMDG-specific interface. This option is highly recommended for best performance and scalability.

See details here.

Do I need to change my code if I switch from one topology to another?

No. The topology is abstracted from the application code. The only caveat is that your code needs to be implemented with partitioning in mind; i.e., moving from a central server or a replicated topology to partitioning doesn't require changes to the code as long as your data includes an attribute that acts as a routing index.
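
As a sketch of what "an attribute that acts as a routing index" looks like in practice, here is a GigaSpaces-style POJO (treat the annotation details as illustrative; they vary by version). All entries sharing a customerId land in the same partition, so operations on one customer stay local, and moving from a replicated to a partitioned topology requires no code change:

```java
import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceRouting;

// 'customerId' is the routing index for this entry type.
@SpaceClass
public class Order {
    private String orderId;
    private String customerId;

    @SpaceId
    public String getOrderId() { return orderId; }
    public void setOrderId(String orderId) { this.orderId = orderId; }

    @SpaceRouting
    public String getCustomerId() { return customerId; }
    public void setCustomerId(String customerId) { this.customerId = customerId; }
}
```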

How are IMDGs and PaaS different from in-memory databases (IMDB)?

In an in-memory database, the relational model itself prevents us from taking full advantage of the fact that the data is stored as objects in memory. When we use in-memory storage in an IMDG, by contrast, we don't need the O/R mapping layer. In addition, we don't need separate languages to perform data manipulation. We can use the native application code, or dynamic languages, for that purpose.

Moreover, one of the fundamental problems with in-memory databases is that relational SQL semantics are not geared to deal with distributed data models. For example, an application that runs on a central server and uses things like Join, which often maintains references among tables, or even aggregated queries such as Sum and Max, doesn't map well to a distributed data model. This is why many existing IMDB implementations only support very basic topologies and often require significant changes to the data schema and application code. This reduces the motivation for using in-memory relational databases, as they lack transparency.

The GigaSpaces in-memory data grid implementation, for example, exposes a JDBC interface and provides SQL query support. Applications can therefore benefit from the best of both worlds: you can read and write objects directly through the GigaSpaces API, query those same objects using SQL semantics, and view and manipulate the entire data set using regular database viewers.
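
As a hedged sketch of the "both worlds" point (GigaSpaces-style API; exact signatures vary by version), the same Order objects written through the object API can also be queried with SQL semantics:

```java
import com.j_spaces.core.client.SQLQuery;
import org.openspaces.core.GigaSpace;

// Query the grid with SQL semantics over objects written via the API.
public class OrderQueries {

    public Order[] ordersFor(GigaSpace gigaSpace, String customerId) {
        SQLQuery<Order> query = new SQLQuery<Order>(Order.class, "customerId = ?");
        query.setParameter(1, customerId);
        return gigaSpace.readMultiple(query, Integer.MAX_VALUE);
    }
}
```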

Can I use an existing Hibernate mapping to map data from the database to the IMDG?

Yes. In addition, with PaaS, the Hibernate mapping overhead is reduced as most of it happens in the background, during initial load or during the asynchronous update to the database.

Starting with GigaSpaces 6.5, both Hibernate (Java) and nHibernate (.Net) are supported. C++ applications defer to the default Hibernate implementation. In addition, with GigaSpaces' new integration with Microsoft Excel, .Net users can easily access data in the IMDG directly from their Excel spreadsheets without writing code!

Final words:

While this approach is generic and can be applied to any database product, MySQL is the most interesting to discuss, as it is widely adopted by those who need cost-effective scalability the most, such as web services, social networks and other Web 2.0 applications. In addition, MySQL has faced several challenges in penetrating large enterprises. With the acquisition by Sun, MySQL becomes a viable option for such organizations, but it still requires the capabilities mentioned above to compete effectively with rival databases. The combination of IMDG/PaaS with MySQL provides a good solution for addressing some of the bigger challenges in cloud-based deployments. More on that in a future post.