pureScale can now be deployed on virtual machines using sockets (on VMware ESXi and KVM)

Incremental backup support

DB2 Spatial Extender support

Online (in-place) REORG now supported

Federated two-phase commit

Some of these features make it much easier to get started with pureScale, such as the ability to run pureScale over a basic TCP/IP network or on virtual machines under KVM or VMware ESXi. This is great for anyone who wants to explore pureScale technology using just a basic network or a virtualized environment!

For existing pureScale users, two enhancements in particular, incremental backups and online REORG, will be a huge improvement in terms of system and database maintenance and administration.

If, like me, you are really looking forward to trying all these new features, go and download the just-released DB2 Cancun Release 10.5 FP4 at this link:

I thought you might be interested in a nice little demo tool that is being developed by Jorge Mira and Christopher La Pat here in the lab.

This lightweight demo tool's purpose is to graphically demonstrate the key elements of pureScale WLB in real time, on any pureScale database, without changing the database or the application running on it. It is not for production use!

Introducing psMon:

psMon allows for platform-independent monitoring of a pureScale cluster, graphically demonstrating the key elements of pureScale WLB in real time. psMon provides the user with a view of a number of useful pieces of data regarding the current operating status of a pureScale cluster. The data displayed is broken down into two graphs:

A view of the system resources (CPU, memory, and the current pureScale priority).

The current transaction throughput, optionally broken down to show commits and rollbacks.

In addition to the graphs, tables below the system resource graphs provide information on which applications are using the highest amounts of system resources.

The entire client application is built in Java and is therefore platform independent. It is a lightweight executable JAR file that can easily be run from almost any computer and can connect to any machine on which the PSMServer application has been deployed.

Today I saw that the status of one of our testing pureScale instances was red.

Yes, red, it had a red sign, as this instance is on a PureData System for Transactions. The DB2 pureScale instances panel lets you have a quick look at the status of your instances and the databases deployed on them.

Connecting via ssh to the instance and running db2instance -list showed that this particular instance was not in good shape:

db2vr1@compute05:/home/db2vr1> db2instance -list
The member, CF, or host information could not be obtained. Verify the cluster manager resources are valid by entering db2cluster -cm -verify -resources. Check the db2diag.log for more information.

Following the tip, I ran db2cluster -cm -verify -resources, which showed that the cluster state was inconsistent. At this point I had a look at the db2diag.log and could see some errors related to cluster resources.

After seeing that the issue was with the pureScale cluster resources, I decided to try to stop and restart the cluster services with the following commands:

db2vr1@compute05:/home/db2vr1/sqllib/bin> ./db2cluster -cfs -start -all
All specified hosts have been started successfully.

2) Verifying and repairing the instance

At this point, I tried to verify the status of the cluster again with the db2instance command:

db2vr1@compute05:/home/db2vr1/sqllib/bin> db2instance -list
The member, CF, or host information could not be obtained. Verify the cluster manager resources are valid by entering db2cluster -cm -verify -resources. Check the db2diag.log for more information.

Now, I tried again to repair the cluster resources with the db2cluster command:

We have heard that a "tens of kilometres" limit applies to the distance between the two sides of a Geographically Dispersed pureScale Cluster (GDPC). But why?

This is based on a physical limitation, namely the speed of light in glass (fibre), which is about 5 µs/km. From this we can calculate the round-trip times from member to CF at these distances:

3 km = 30 µs

10 km = 100 µs

50 km = 500 µs

100 km = 1000 µs (or 1 ms)

300 km = 3000 µs (or 3 ms)

This will have a significant effect on the performance of the cluster, especially when we start to get into tens of kilometres. The "normal" times for RDMA actions are of the order of 15 µs, so to those we need to add the latency for the distance. Compared to a normal pureScale cluster (all in one location), an RDMA action will be slower at a distance as follows.
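As a quick sanity check on these numbers, here is a small sketch of the arithmetic. The 5 µs/km and 15 µs figures come from the text above; the function names are my own:

```python
# Round-trip latency over fibre: light in glass covers ~5 µs per km,
# and a member-to-CF round trip travels the distance twice.
FIBRE_US_PER_KM = 5   # one-way propagation delay in glass, per km
RDMA_BASE_US = 15     # typical RDMA action time with no distance added

def round_trip_us(distance_km):
    """Round-trip propagation delay in microseconds."""
    return 2 * distance_km * FIBRE_US_PER_KM

def rdma_at_distance_us(distance_km):
    """Approximate RDMA action time once the distance latency is added."""
    return RDMA_BASE_US + round_trip_us(distance_km)

for km in (3, 10, 50, 100, 300):
    print(f"{km:>3} km: round trip {round_trip_us(km):>5} µs, "
          f"RDMA ~{rdma_at_distance_us(km):>5} µs")
```

At 100 km, for example, the round trip alone is 1000 µs, dwarfing the 15 µs RDMA action itself.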

As with all of the pureSystem family, the use of patterns to automate repeatable tasks is a feature of pureData systems.

With pureData there are two types of patterns available:

1) Topology patterns. Topology patterns will install all of the software required to run pureScale and create the pureScale instance on a number of compute nodes. You can deploy a topology pattern of 2, 4, or 6 nodes. The 2-node topology pattern consists of an instance of 2 members, with a cluster caching facility (CF) co-located on each of the two compute nodes. This is the smallest practical HA installation of DB2. The 4-node topology pattern consists of 2 members and 2 CFs on separate compute nodes. The 6-node topology pattern is a 4-member cluster with 2 CFs on separate compute nodes. Of course, the more nodes you deploy to, the higher the performance and resilience.

2) Database patterns. A database pattern is essentially a method of storing and reusing database configuration settings. Database patterns are used to create and configure a database within the instance topology. PureData systems will come with an IBM transaction processing database pattern. You can also "roll your own".

The use of topology and database patterns allows instances and databases to be deployed in a matter of minutes, repeatably and dependably.

I am referring to the standard form of database disaster recovery whereby the production system is duplicated at a second "standby" site. The changes made at the production site are made at the standby site keeping it more or less up to date with the production site. This hardware and software, often costing as much as the production system, only gets used in the case of a pretty major disaster when it takes over the function of the production site. Fortunately these events are quite rare. In my experience the rarity of the disaster scenario makes it psychologically difficult to spend all of that money on stuff that will "probably never get used". This can be an even more onerous decision when the database is clustered and several servers are involved. So what can we do?

Those clever folks at the IBM labs have done it again. Why not simply stretch the cluster out so that half is at one data center and half is at another? If one data center fails for whatever reason, the application keeps going. The best ideas are usually simple! Of course we retain the existing features of pureScale: high availability, capacity on demand, etc. This is the Geographically Dispersed pureScale Cluster (GDPC). More details are available in this white paper

As Winston Churchill said (and famously paraphrased by a certain Roy Keane years later) "He who fails to plan plans to fail". This applies to any task that is more than trivial. Whether it is creating a new lawn or installing pureScale...

If you are going to be installing pureScale then the installation itself is really easy. Where the preparation comes in is in getting the environment set up beforehand. I strongly recommend that you take a careful look here at the DB2 infocenter topic "Planning for a IBM DB2 pureScale Feature for Enterprise Server Edition deployment". This should be your first step towards installing pureScale, taken before any of the logistics fall into place. There are various considerations and checklists there, such as:

Using a user managed file system or allowing pureScale to create one (I recommend the latter).

DB2 client considerations, essentially to help you figure out what kind of client connectivity you want (workload balancing, automatic client reroute, etc.).

We are currently setting up the TPC-C benchmark on the cluster. TPC-C is the standard benchmark for OnLine Transaction Processing ( http://www.tpc.org/tpcc/ ). We will be doing some test runs on the pureScale cluster and some tuning to see what kind of throughput we can get out of the cluster for typical OLTP workloads. We will start out with 4 nodes first with the default parameters, then start tuning and tweaking. Please let me know if you want to know more.

We are currently building nanoclusters in several locations worldwide, including here in Dublin. Somehow the lab guys have come up with a way to get pureScale to work on about $500 worth of hardware! Respect! The nanocluster is a pureScale cluster built on three Intel Atom boxes, with one acting as storage and running the demo software, and the other two each hosting one CF and one member. Gigabit Ethernet is used between the nodes.

The nanoclusters are (obviously) not a supported configuration and are not for production use.

Nanoclusters are a great way to get pureScale out there for demos and for clients, ISVs and partners to try it out for themselves. The cluster comes pre-packaged with demos and instructional software.

If you want to get your hands on one of these little beauties, please contact your friendly local sales or avalanche teams, or me, for more information or a demo.

Unfortunately we can't give the nanoclusters out on loan for extended periods but we will also be releasing instructions and code to allow you to build your own.

1) Connection-based WLB. In summary, this is based on routing new connections to the servers with the lowest load. A list of information about the servers is maintained and updated regularly: active members return their load information (hostname, port number, CPU load, memory load) to a coordinating member, which constructs the server list and sends it to the other members. For each server, a weight is calculated by an algorithm from that server's load information and the total number of servers. A higher weight means there is a lower workload on that machine, so more work is sent to it. The % workload being handled by a server is approximated from the number of connections the server is currently serving out of the total number the entire cluster is serving. The % of workload to be sent to a member is that member's weight divided by the total of the member weights. New connections are sent to servers where the "% workload being handled" is under the "% of workload to be sent to member".
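As an illustration only, here is a hypothetical sketch of that routing rule. All names and the fallback behaviour are my assumptions, not the actual DB2 algorithm:

```python
# Hypothetical connection-based WLB sketch: a new connection goes to a
# member whose current share of connections is below its target share
# (member weight divided by the total of the weights).

def pick_member(weights, connections):
    """weights: {member: weight}; connections: {member: current count}."""
    total_weight = sum(weights.values())
    total_conns = sum(connections.values())
    # Check the highest-weighted (least loaded) members first.
    for member, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        target_share = w / total_weight
        current_share = connections[member] / total_conns if total_conns else 0.0
        if current_share < target_share:
            return member
    # Everyone is at or above target: fall back to the highest weight.
    return max(weights, key=weights.get)

weights = {"member0": 60, "member1": 30, "member2": 10}
conns = {"member0": 5, "member1": 5, "member2": 5}
print(pick_member(weights, conns))  # member0: 33% of connections vs a 60% target
```

Here member0's weight entitles it to 60% of the work but it only holds a third of the connections, so the new connection is routed there.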

2) Transaction-based WLB. This works in a similar way to the above and involves the server list. Because we are not dealing purely with new connections as above, existing connections need to be actively rerouted to different members to rebalance the workload. This works as follows: a transport pool is maintained on each member, and each connection can be moved from one member to another (by disassociation from a transport on the first server and association with a transport on the second). After every 8 transactions or 2 seconds, whichever comes first, each server will attempt to rebalance workloads by moving logical connections if necessary.
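The "every 8 transactions or 2 seconds, whichever comes first" trigger can be sketched as follows; the class and method names are my own, not DB2 internals:

```python
import time

# Hypothetical sketch of the transaction-based rebalance trigger:
# a member re-evaluates its logical connections after every 8
# transactions or 2 seconds, whichever comes first.
class RebalanceTrigger:
    TXN_LIMIT = 8
    SECONDS_LIMIT = 2.0

    def __init__(self, now=time.monotonic):
        self._now = now          # injectable clock, eases testing
        self._txns = 0
        self._last = now()

    def record_transaction(self):
        """Count one transaction; return True when it is time to rebalance."""
        self._txns += 1
        now = self._now()
        if self._txns >= self.TXN_LIMIT or now - self._last >= self.SECONDS_LIMIT:
            self._txns = 0       # reset both counters after firing
            self._last = now
            return True
        return False
```

With a steady transaction stream the counter fires on the 8th transaction; on a quiet member the 2-second clock fires instead.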

Notes: WLB for pureScale involving J2EE is configured in the J2EE driver file. db2pd -serverlist shows the currently cached server list on this member (note that priority and weight are synonymous).

The question of preventing a split-brain scenario comes up again and again with regard to pureScale (PS).

The scenario is as follows:

In a standard PS setup we have a primary and a standby CF. If the connection between these two machines fails but both keep going, then the secondary node would "think" that the primary has failed and perform a failover. Now both CFs would take control of the shared data (the database) and the database would end up in a big mess. This would happen if the networking between the two machines broke down, or if one got really busy and couldn't respond to the other fast enough.

Of course if this were true then we would be in big trouble, but fortunately it is not. A technology called I/O fencing is used to ensure the above scenario can't happen.

I/O fencing is implemented via SCSI-3 Persistent Reserve technology. The core of the technology involves "registration" and "reservation" rights to disk partitions. Registration allows access to data. Many nodes (members and CFs) can have "registration" access, but only one can hold the "reservation" on a partition. Registered nodes can eject others. Ejection is a final and atomic action. An ejected node cannot eject another node.

Cluster services software on each node manages various failover scenarios in the cluster. There are numerous failover scenarios, and these things are worked out to the nth degree. In outline, if any failures are detected then all nodes work out what to do in a similar way. First, what is a quorum? A quorum is a group of nodes in a cluster that can communicate with each other; the number of nodes in a quorum must be more than half of the total in the cluster, or, if exactly half, the group must hold the "reserve" on the tie-break partition. If I am part of a quorum I can continue and take part in failover and recovery, the first part of which is to eject or fence any nodes that are not part of the quorum. This prevents the "bad" nodes from updating the shared data. If I am a "bad" node, i.e. not part of a quorum, I wait to regain access to the other nodes, and when I regain access I must undo anything I have done locally since the problem started (tidy up). I can then rejoin the cluster.
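The quorum rule reduces to a simple check. This is a sketch of the logic only, not actual cluster services code:

```python
# Quorum rule: a group of nodes has quorum if it holds more than half
# of the cluster, or exactly half while holding the reserve on the
# tie-break partition.
def has_quorum(group_size, cluster_size, holds_tiebreak=False):
    if group_size * 2 > cluster_size:
        return True
    if group_size * 2 == cluster_size and holds_tiebreak:
        return True
    return False

# A 4-node cluster split 2/2: only the side holding the tie-break survives,
# so the two halves can never both keep writing to the shared data.
print(has_quorum(2, 4, holds_tiebreak=True))   # True
print(has_quorum(2, 4, holds_tiebreak=False))  # False
print(has_quorum(3, 4))                        # True
```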

First, to say what it is not suited to: data warehouse type applications. pureScale is a shared-disk solution and as such not really suitable for data warehousing, because large transactions tend to be the main workload in such an environment.

It is suited to OLTP loads.

Some questions:

Do you need to come up with a database solution for your application? This could be a new build or a replacement of old hardware and software. Do you have an application that generates a lot of small or smallish transactions? Do you need continuous availability and built-in resilience? Do you need to be able to ramp up the capacity of your system easily in the future, rather than buying all of the hardware and licenses you might need over the next 2 to 5 years now?

If the answer is yes to most of these questions, then pureScale is for you.

I guess you might ask "why is this relevant?". Well, 10 microseconds is approximately the time taken for a pureScale member to communicate with the central cache to look for a piece of data. Let's call this a "pureScale communication" for the sake of simplicity. More on the technicalities of pureScale communication, Remote Direct Memory Access (which facilitates this communication) etc. in the next blog entry, but for now...

...have you ever stopped to think what length of time 10 microseconds represents?

A microsecond is 10 to the minus 6 seconds or one second divided by a million. I think this is so small a number that it is hard for us to understand. I looked for some examples to illustrate just how fast this is and there are some here on wikipedia but nothing that is intuitively understandable (at least not to me).

I thought I could find something to explain this, and here's a couple of things that are quite quick:

Quick as the blink of an eye? So how does a pureScale communication compare? The blink of an eye takes about 350,000 microseconds. That's 35,000 pureScale communications!
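The arithmetic behind that comparison, using the figures from the text:

```python
# One blink of an eye (~350,000 µs) measured in ~10 µs pureScale
# communications.
BLINK_US = 350_000
PURESCALE_COMM_US = 10

comms_per_blink = BLINK_US // PURESCALE_COMM_US
print(comms_per_blink)  # 35000
```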

Just a brief look at the architecture of a pureScale cluster at a very high level. Questions welcome.

A DB2 pureScale cluster is made up of a number of servers connected together, a shared area of disk, and software that all work together to provide a high-performance, resilient database.

The cluster is made up of a number of "controllers", or coupling facilities (CFs), and a number of members.

Coupling facilities:

Provide centralized locking and caching.

Do not run DB2 and do not perform what you might think of as the normal work of the database i.e. processing queries.

Members:

Run DB2 and do what you might think of as the normal work of the database i.e. processing queries

Access the centralized locking and caching provided by the CF.

Use Remote Direct Memory Access (RDMA) to communicate very quickly with the rest of the cluster.

There are normally two CFs: a primary one and a standby one, which can take over in case of any fault on the primary.

There are normally 2 or more members. The number of members can be increased to add more processing power to the cluster.