This blog is by a long-time Oracle storage professional who has history with both NetApp and EMC.


November 11, 2011

Appliances such as microwave ovens, refrigerators, iPods, iPads and TVs are excellent examples of the ease-of-use approach. Bringing the inherently complex world of Oracle databases together with the ease-of-use approach of appliances is challenging. By definition, if Oracle Exadata is an appliance then its use should be simple, require relatively little maintenance and, like a refrigerator, do its job, which in this case is to run databases at extreme performance levels. If Oracle Exadata isn't an appliance, then what is it?

I found this question quite compelling. Remember that I grew up in a truly appliance-oriented environment (NetApp); see my first blog posts for more on my background there. For this reason, I think I understand pretty well what an appliance is.

At NetApp (and during my early days at NetApp, filers were true appliances by any measure), the appliance concept meant that the device was a toaster: One lever to push down, and one knob to turn. That's it. Plug it in. It works. No step 2.

Cisco really originated the term appliance. The Cisco router replaced the previous form of router, which was typically a UNIX box running routed. As such, the Cisco router pretty much defined the concept of what it means to be a true appliance.

The folks at Cisco made the following argument: We don't need all of the infrastructure of UNIX to do routing. A UNIX box has to do a lot of things. A router really only has to do one thing: Networking. We could make a dramatically simplified device which would be able to do routing really well, at a much lower cost than a UNIX box.

Based upon this concept, an appliance has the following characteristics:

Extremely simple interface. It should be vastly simpler than doing it the non-appliance way. For example, a Cisco router is vastly simpler than running routed on a UNIX box.

A single purpose. The device must be dedicated to doing one thing, but doing it extremely well. Like the way a Cisco router is much better at routing than a UNIX box running routed. Or like the way a NetApp filer is much better at NFS file serving than a UNIX box running nfsd. You get the idea. By dramatically reducing the number of functions the device performs, you also dramatically reduce the amount of code that must run on the device. (The original NetApp ONTAP OS was a single-threaded, 16-bit OS with only a few hundred thousand lines of code.) This leads to the next feature of an appliance:

Vastly reduced cost. The original NetApp filer was about a $5,000 device. An equivalent UNIX box used as an NFS file server ran around $50,000. Similar cost differences existed for Cisco routers vs. UNIX boxes as routers.

Transformative technology. An appliance, if it is truly an appliance, becomes the obvious and natural way to do things. Within a very short period of time after introducing the router, Cisco controlled the router market. They completely displaced the previous way of doing routing. The same thing occurred in file serving with NetApp.

By any reasonable measure, Oracle Exadata fails all of these tests:

It has as complex an interface as any Oracle database server (which is to say it runs the most complex and expensive piece of software ever written for general purpose use). Certainly not appliance-like.

An Oracle Exadata rack contains general-purpose compute servers, which can be used to run basically anything you want. Certainly you can load any Oracle application on it, and no one would claim that an Oracle database server is an appliance!

Oracle Exadata is manifestly more expensive than a normal, open-systems database server, and vastly more expensive (assuming intelligent management) than using VMware vSphere to virtualize Oracle database servers.

Oracle Exadata is possibly addictive in the Big Blue sense, but it is certainly not a transformative technology in the way that a Cisco router, iPad, or iPhone is.

In terms of an analogy that works, I like to use cars. The two companies in the car business that manufacture appliance cars are Honda and Toyota. The Honda Civic is an appliance car, as is the Toyota Camry. Either one of these cars provides all of the appliance advantages:

They have a radically simplified interface. Everything about these cars is designed to make them effortless to operate. Because they are so simple, they are also very reliable and efficient.

They are single-purpose vehicles. They get you from point A to point B. That's it. Nothing fancy.

They are sold at a very reasonable cost, relative to non-appliance vehicles (such as BMW, or Mercedes for example).

Once you have driven a Honda Civic or Toyota Camry, assuming you are an appliance driver (and there are a lot of folks who are appliance drivers), these cars are completely addictive. You simply trade one in for the new model once the old one wears out (and they take a long, long time to wear out). I have known folks who have driven these cars (in various model years) their entire lives.

Using the car analogy, Exadata is definitely not a Honda or a Toyota. It is not even a BMW or a Mercedes. It is a Ferrari. It is a tricked-out, high-performance machine. It is very fast, no question. It is *&^% expensive though. And it is very, very complex and demanding to drive.

November 09, 2011

There seems to be a lot of confusion about licensing when customers consider running Oracle databases on VMware. Part of the confusion is created by Oracle on purpose (classic FUD) by suggesting that licensing is more expensive on VMware than on physical servers. The reality couldn't be more different - I strongly believe that many customers can actually *save* on database licenses by going virtual. But to understand how to achieve this, you need to know a few things - I hope I can clear this up in a short explanation. I will keep the discussion to Oracle database licenses and ignore applications/middleware etc. for now.

License models

Customers typically license their basic database under one of three options:

License by CPU (core) - the more CPU cores, the more licenses are needed. There is a processor core factor that depends on the type of CPU; it can be 0.25, 0.5, 0.75 or 1.0.

License by named user - the more named users, the more licenses are needed. The number of CPUs does not matter, nor does the total number of databases. Typically one license pack covers 25 users.

Enterprise License - the customer negotiates a contract for the whole company and can afterwards deploy as many databases on as many servers/CPUs as they want.

If a customer uses option 2 or 3, then it does not matter whether they run virtual or physical. But there are also no license savings possible without re-negotiating their contracts. I don't want to go as far as suggesting that customers change their license models, so we leave this as-is for now.

In my experience, most enterprise customers use either CPU licensing or enterprise contracts. Some have different licensing methods for different business units. Oracle can be very creative in customer-specific contracts, so expect to find a different situation for each individual customer.

But let's assume CPU licensing for the sake of this discussion.
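The per-core arithmetic is simple enough to sketch. This is a hypothetical illustration in Python: the core factor tiers (0.25/0.5/0.75/1.0) are the ones mentioned above, but the function name and server sizes are made up for the example.

```python
# Sketch of CPU (core) licensing: licenses = physical cores x core factor.
# The core factor values are from the post; server sizes are hypothetical.

def cpu_licenses_needed(cores: int, core_factor: float) -> float:
    """Full licenses required for one server under CPU (core) licensing."""
    return cores * core_factor

# A 16-core Intel x86 server (core factor 0.5) needs 8 full licenses;
# the same core count on a factor-1.0 RISC CPU would need 16.
print(cpu_licenses_needed(16, 0.5))  # 8.0
print(cpu_licenses_needed(16, 1.0))  # 16.0
```

The same formula applies per server and is then summed over every server the database could run on - which is exactly what makes clusters expensive, as we'll see below.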

Maintenance & support

Customers typically buy the CPU licenses but then have to pay maintenance for as long as they use the licenses. Yearly maintenance cost is about 25% of the license list price. I have no information on typical discounts; I expect customers to get at least a 50% discount off the price list (but only on licenses, not on maintenance, AFAIK).

Database Edition and options

The plain database license comes in 3 versions (for servers):

Standard Edition One - Maximum 2 processors, no options allowed. Only used for testing and very small deployments

Standard Edition (SE) - Maximum 4 processors, no extra-cost options allowed. Suitable for smaller, non-mission-critical databases

Enterprise Edition (EE) - No limitations, and on top of EE you can license many additional features. Most customers will use this, at least for production databases

On top of the basic Database license, most customers use a set of options, each requiring additional licenses per CPU. The most common options are:

Real Application Clusters (RAC) - allows multiple servers to run the same database (active-active clustering) for scale-out performance and high availability.

Real Application Clusters One Node - the same, but the database can only run actively on one node at a time. For high availability only.

Active Data Guard - remote replication using log shipping. Note that standard Data Guard is free, but Active Data Guard allows the standby database to be opened for read-only purposes and offers some extra features.

Partitioning - allows tables to be split up into smaller chunks. Absolutely required when running large databases where no downtime can be tolerated. Eases administration work and offers some performance benefits.

Real Application Testing - allows workloads to be recorded and replayed on another database for performance and functionality testing

In my experience, nearly all customers have Partitioning. Most customers have the Tuning/Diagnostics Packs. Some customers have RAC. Some customers have the other options. There are more options available, but these are the most common.

Many customers have 3 or more options - sometimes the options cost more than the base database license - especially if they use RAC they will have most of the other options, too.

Running on a cluster

If a database runs on a cluster, then Oracle assumes the database can make use of any processor in the cluster. This is independent of the kind of cluster used (it can be MSCS, HP MC/SG, VMware, Oracle RAC, etc.).

This is basically the foundation for all the FUD and confusion. For example, suppose you deploy a VMware farm (cluster) of 16 servers on which the virtual machines run all kinds of stuff (file/print, Exchange, apps, etc.), and only one tiny virtual machine in the corner, with a single virtual CPU, runs a small Oracle database. You would expect to pay for only one CPU core - but Oracle's reasoning is that this tiny VM can be dynamically moved (VMotion) to any node in the cluster and run on any processor. Therefore, all CPUs have to be fully licensed by Oracle. So in this case, running the single database on a (small) physical server would be cheaper than running it on a VM in the farm.
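To see how painful this rule is, here is a small back-of-the-envelope sketch. The 8-cores-per-server figure and the function name are assumptions for illustration; the 16-server farm comes from the example above.

```python
# Oracle's cluster rule as described above: every core the VM *could* move to
# must be licensed, not the cores it actually uses. Hypothetical sizes:
# 16 servers of 8 cores each vs. one small 2-core physical box.

def cores_to_license(servers: int, cores_per_server: int) -> int:
    # License everything the VM can VMotion to.
    return servers * cores_per_server

in_farm = cores_to_license(servers=16, cores_per_server=8)    # 128 cores
standalone = cores_to_license(servers=1, cores_per_server=2)  # 2 cores
print(in_farm, standalone)  # one tiny DB VM: 128 cores vs 2 on a small box
```

Which is why the fix described later in this post is to fence Oracle into its own dedicated cluster.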

Total cost of the stack

In a typical database server deployment, the cost of the database licensing is far greater than the cost of the hardware and OS licenses combined. I have no hard numbers, but I estimate the average DB license cost (plus options) to be 10 times the cost of the server plus OS.

So a $5,000 server would typically carry $50,000 in licenses. And because maintenance is 25% yearly, the total cost of licenses over a 3 to 5 year period is even higher - for a 5-year TCO the total license cost might be $75,000 (an assumption - it could also be closer to $100,000 - and no, I didn't make a mistake with an extra zero; Oracle *really* is this expensive).
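Working that through as a rough sketch: at list price, with no discounts applied, the five-year total actually lands even above the range quoted above, which is presumably why the discounted figure is lower. Variable names here are hypothetical.

```python
# Rough 5-year cost sketch using the assumptions above: a $5,000 server,
# $50,000 of database licenses, yearly maintenance at 25% of license cost.
# Discounts are ignored, so treat this as an upper-bound illustration.

server_cost = 5_000
license_cost = 50_000
maintenance = 0.25 * license_cost * 5   # 62,500 of maintenance over 5 years
total_licensing = license_cost + maintenance

print(total_licensing)                  # 112500
print(total_licensing / server_cost)    # 22.5x the hardware cost
```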

Utilization

It is very hard to size a typical Oracle database-backed application. There are no good methods or calculations to figure out how much CPU power, disk I/O and memory are needed to run a given app. So historically, project teams size their database servers for peak loads, and because they cannot predict how big the peak load is, they double the resources "just in case". The end result is that most database servers are way oversized in terms of CPU and memory.

Most physically deployed database servers will average about 10-15% CPU load (or less). However, they will peak to higher loads at certain times, such as Monday morning when many users log in, or when month/quarter/year-end batch processing starts, etc.

Then, the utilization numbers can be inflated by other tasks running on the same processors. Common causes of "artificially high" CPU loads on database servers include host-based backups, host mirroring and replication, and middleware or application processes running alongside the database. All of these cause the processors, expensively licensed for database processing, to do other stuff.

So if a server is running at 15% utilization, then the utilization caused by the database workload itself might only be 10% and the rest caused by other stuff (whether needed or not).

Needless to say, Oracle likes customers to use their expensively licensed CPUs for other tasks, because it forces them to buy additional CPUs sooner and therefore drives Oracle's license revenue.

Isn't life great for an Oracle rep? ;-)

Number of databases

Most customers run many databases. For the average enterprise customer that I visit, 100+ databases is a normal number. A big global company that I visited runs 3000+ Oracle databases worldwide (and this is only the scope of that specific project team). Imagine the cost of licensing all these databases on individual servers...

Why so many? Well, customers do not like multiple applications sharing one database (and often this is not even supported). So if you run SAP ERP, Oracle JD Edwards, your own banking app and a few others, they all require their own production database.

For each production database, you might find an acceptance environment, a test system, a development server, maybe a staging area for loading data into the data warehouse, maybe a firefighting environment, a standby for D/R, a training system and so on. Customers will rarely share production environments on the same server (unless virtualized, or at least with workload management segregation). Sometimes they share a server among a few non-prod databases. So for, say, 100 databases, the average customer runs between 30 and 50 (physical) servers.

Power of big numbers

It does not require rocket science to understand that many of these databases do not require peak performance at the same time. A development system typically drives workload during daytime (when developers are coding new application features). A data warehouse runs queries during the day and loads in the evening. For a production system it depends on the business process. An acceptance system might sit idle for weeks and then suddenly peak for a few days preparing for a new version deployment into the live production system. And so on.

So what if you could share resources across databases - without influencing code levels, security, stability and so on?

If that were possible, you would no longer size for "peak load times two". You would size for what you expect and assume an average utilization of, say, 70% over the whole landscape. If one database needs extra horsepower, there is enough available in the landscape.

How much license cost would you save by bringing down the number of CPUs so that utilization goes up from 10% to 70%?
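That question has a quick back-of-the-envelope answer, assuming CPU demand scales roughly linearly with utilization (a simplification, of course):

```python
# If the same aggregate workload runs at 70% average utilization instead of
# 10%, the number of cores (and per-core licenses) shrinks proportionally.

old_util, new_util = 0.10, 0.70
consolidation_ratio = old_util / new_util  # ~0.14: keep roughly 1 core in 7
license_saving = 1 - consolidation_ratio   # ~86% of per-core licenses saved
print(round(consolidation_ratio, 2), round(license_saving, 2))
```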

What would be the effect on power, cooling, floor space, hardware investments, time-to-market?

What would be the business advantage of not limiting production performance of a single server, by whatever was sized during initial deployment? Risk avoidance?

What would be the business advantage of solving future performance issues by just adding the latest and greatest Intel server to the cluster and VMotioning the troubled database over?

Wasn't this exactly why we started server virtualization in the first place, about 8 years ago? And why EMC acquired VMware?

Wouldn't you think the average Oracle sales rep is scared to death when his customer starts considering running his databases on a virtual (cloud) platform? Would it make sense for him to drive his customers mad with FUD around licensing, support issues and whatever else he can think of to prevent his customers from going this way? Even threatening to drop all support if they continue in that direction?

If Oracle is scared of losing license revenue, wouldn't you think there is a huge potential for savings for our customers here?

The journey to the private database cloud

So how should we deal with this?

A few starting points:

Oracle supports VMware. Period. Any claim to the contrary from Oracle reps can be taken with a grain of salt (to be more specific: it's nonsense).

Oracle does NOT certify VMware. Then again, Oracle does not certify anything except its own hardware and software. But IMO, support is all you need, and the discussion around certification leads nowhere.

Oracle might ask the customer to recreate issues on a physical server if they suspect problems with the hypervisor. Isn't it great that we can do this easily with Replication Manager? ;-)

Oracle only supports Oracle RAC on VMware for one specific version (11.2.0.2). Running RAC on VMware with any other version is not recommended because of support issues. This is expected to change in the future.

Both EMC and VMware offer additional support guarantees for customers deploying Oracle on VMware. So where Oracle pulls back, EMC and VMware will fix any issue anyway.

Performance is no longer an issue. With vSphere 5, a single virtual machine can have 32 virtual processors, 1 TB of RAM and drive 1 million IOPS. Only the most demanding workloads would not fit in this footprint. But with customers running hundreds of databases, maybe we should start with the 95%+ that DO fit and make significant savings there. By the time we're done, VMware will have vSphere 6, and who knows what happens then.

How to get around the licensing issue

As I said, Oracle requires licenses for all servers in a cluster. So how do you limit the number of licenses? By deploying an Oracle-only VMware cluster. Run only Oracle databases here. No apps, no middleware, no file servers - and try to move off everything that does not relate to database processing. No host replication, no storage mirroring, etc.

Say you have a legacy environment with 10 servers, each with 16 cores, so you have 160 cores licensed with Oracle EE and a bunch of options. Average CPU load is 15%, but let's assume 20% to be conservative.

I claim that a single VMware cluster of 3 servers, each with 32 cores, will easily do the job. Now we have 3 * 32 = 96 cores to be licensed. 96/160 = 0.6 = 60%, so we saved 40% on licensing right away. The average CPU load on the whole cluster will probably still be much less than 70%, so we can gradually add more databases until we average out at 70%.

If the old system was not running Intel x86 but SPARC, PA-RISC or POWER CPUs, then the processor core factor was probably 1.0 or 0.75. Intel x86 has 0.5. So for 96 Intel cores you would need to pay for 48 full licenses. Another 33% savings.
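The arithmetic of this example, worked through in a short sketch (numbers come from the two paragraphs above; 0.75 is just one of the possible RISC core factors mentioned):

```python
# Legacy: 10 servers x 16 cores; new: 3 VMware servers x 32 cores.
legacy_cores = 10 * 16                 # 160
new_cores = 3 * 32                     # 96
core_ratio = new_cores / legacy_cores  # 0.6 -> 40% fewer cores to license

# Apply processor core factors: 0.75 assumed for the old RISC CPUs
# (could also be 1.0), 0.5 for Intel x86.
legacy_licenses = legacy_cores * 0.75  # 120 full licenses
new_licenses = new_cores * 0.5         # 48 full licenses

print(core_ratio)                      # 0.6
print(new_licenses / legacy_licenses)  # 0.4 -> 60% fewer licenses overall
```

So the 40% core reduction and the better core factor compound: 0.6 x 2/3 = 0.4 of the original license count.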

The 40% savings on licensing will easily justify an investment in a nice new EMC storage infrastructure with EFDs, FAST VP and all the other goodies. Do you think the customer will push us hard over a $0.01 lower per-GB price from a competing HDS or NetApp if we just saved them millions in Oracle licenses?

But the story does not end here.

Additional savings

Let's assume the customer needed high availability and scale-out performance and was running Oracle RAC. RAC is the most expensive licensed option, and you need at least two licenses for a two-node cluster. But VMware offers HA (High Availability clustering) as well. Using VMware HA instead of RAC, you would have to fail over and recover the database in case of an outage - if the customer cannot tolerate this, then he needs to stick with RAC (only for mission-critical databases!). But most customers can live with 5 minutes of downtime when a server CPU fails, and in that case replacing RAC with VMware HA can save them another big bunch of dollars.

Let's assume that with virtualization you justified the investment in a nice EMC infrastructure with Flash drives to replace the competitive gear. Now the Oracle cluster is no longer limited by storage I/O and can drive more workload out of the same 3 VMware servers in the cluster. But you can also replace host mirroring (where applicable). You can implement snapshot backups to take the I/O load away from the production servers. You removed the middleware and apps stuff from the database servers - reducing CPU utilization and allowing even more headroom for DB consolidation - all without buying extra licenses from Oracle.

You have a customer who wants even more?

What if they create TWO VMware database clusters? One for production (running Oracle Enterprise Edition (EE) with all the options they need) and one for non-prod (running Oracle Standard Edition (SE) without options - good enough for test/dev and smaller, non-mission-critical workloads). I bet the number of non-prod databases will be much higher than prod. By removing the expensive options AND moving from Enterprise to Standard Edition, you save another ton of money on Oracle licensing, as SE is much cheaper than EE. But be aware - the devil is in the details, and using Standard Edition is not for the faint of heart (for example, you could no longer clone a partitioned database to an SE server because of the missing license and functionality). Still, if the customer is keen on saving as much as possible, then this might be the final silver bullet...

Do they run a huge enterprise data warehouse? Carefully find out whether they have trouble with it and see if you can position Greenplum - saving another bag of money and speeding up their BI queries. But be careful: in an Oracle-religious shop it might backfire on you...

Reality Check

I have already had this discussion with a few enterprise customers, and found that although the story is easy in theory, the reality is different. If a customer has already purchased the 160 CPU licenses from Oracle, then the Oracle rep will not happily give money back in return for the shelfware licenses. So in that case the customer can only save on maintenance and support. But having enough licenses on the shelf, he would not have to purchase any more for the next 5 to 10 years. So talk cost avoidance instead of immediate savings. And again, if they are licensed by user or have a site license, then saving on licenses will be a tough discussion. Still, the savings on power/cooling/hardware/floor space would be significant enough to proceed anyway.

And don't forget the other benefits of private cloud, which we all know how to position: they are no different for Oracle than for other business applications.

Final thought

For this to work you need a customer that is willing to work with you and be open about how they negotiated with Oracle, and a team of DB engineers to work with you to make it happen. If internal politics cause significant roadblocks, then you will get nowhere.

It's not an easy sell but the rewards can be massive. We're only just starting to figure out how to convince customers and drive this approach. Feedback welcome and let me know if you need support.

I have been saying for a while that most Oracle databases do not need the level of uptime and fault tolerance that RAC provides, so Darryl and I are certainly thinking along the same lines.

I think that the value propositions for vSphere and NFS are very similar. Given that I have spent the bulk of my career pushing NFS and NAS for Oracle database storage, the synergy is obvious, at least to me.

Both technologies are about having a "good enough" infrastructure for an Oracle database which, while certainly important to the business, is not the back-end for an online catalog, or an online securities trading app.

For databases like that, I would recommend neither vSphere nor NFS. But the vast, vast majority of databases running Oracle do not fall into this category.

In the case of NFS, for years I made the statement that 90% of all Oracle databases running in datacenters all over the world could be run over an NFS mount with absolutely no change in performance, reliability, or user experience. (There would, on average, be a big reduction in cost and an improvement in manageability, though.)

vSphere is exactly the same. For anything other than the most barn-burning performance, with absolutely the highest standards of fault tolerance, vSphere works just fine. It provides a very high level of reliability - a few minutes of downtime a year - at a vast reduction in cost and complexity compared to RAC.

Because the value propositions are so similar, I believe that the combination of NFS and vSphere is going to become increasingly popular. We'll certainly see how it turns out.

It may sound strange given EMC's reputation and history, but EMC has a strong partnership with Oracle in the area of NAS. We began working with Oracle on the Direct NFS client ("dNFS") back in 2007, when dNFS was introduced as a major new feature in Oracle Database 11g Release 1. At that time, EMC was a co-presenter (together with Oracle) at Oracle OpenWorld 2007 on this subject.

The design goals for dNFS included:

Improve the performance of network I/O by shortening the code path from the database to storage

Simplify administration of NFS in terms of both network port scaling and mount point parameters

Make administration of NFS uniform across all platforms

dNFS succeeds at all of these. dNFS dramatically improves the latency of network I/O from Oracle by eliminating most context switches and making the code path to the disk much shorter. dNFS also provides better port scaling than kernel NFS, with much simpler administration. No fussy EtherChannel network switch configuration is required. All of the port scaling and port failover is handled within the Oracle environment.

dNFS also makes administration completely uniform across all platforms. Amusingly, Windows is included! NFS now works on Windows, at least with an Oracle database.

In September 2010, Oracle announced that dNFS would be improved in Oracle Database 11g Release 2. One major improvement is the addition of dNFS clonedb, a thinly-provisioned, rapid database replication feature. This feature also works very well with storage-based replication.

The basis for dNFS clonedb is a database copy. This copy can be a backup, a storage-based snapshot or clone, or an operating system copy. (Of course, EMC likes to leverage our storage-based snaps and clones.)

Once a copy exists, it can be used to create a clonedb instance. The steps to do this are contained in My Oracle Support article 1210656.1. Effectively, this creates a read/write virtual database which takes up minimal space, is created almost instantly, and contains only the space required to store any changes to the database. Given that storage-based snapshots are also space- and time-efficient (that is, they take up very little space and can be created very quickly), there is a great deal of synergy between these technologies.

EMC did some performance testing with dNFS clonedb. The network diagram for the testbed is here:

First, we established a baseline in terms of performance. We ran an OLTP workload against the production database with no additional operations. This produced the following:

Notice the perfectly clean scaling of this performance chart. We then tested the performance of creating a snapshot to serve as the source for the clonedb database. This produced:

Note the slight response-time hit when the snapshot was taken. However, transactional throughput was not affected. Basically, this is at most a minimal performance hit. Finally, we tested performance while creating clonedb database instances from the storage snapshot. This produced:

Note that there was no performance hit at all during clonedb creation. One additional test was performed: we measured the storage space occupied by the clonedb when it was created. A 10 TB database was used as the source. The total space occupied by the clonedb was only 7 MB.

See the link above for the full presentation of this technical session. Further blog posts in this series will contain summaries of other EMC technical sessions at OOW 2011.

disclaimer: The opinions expressed here are my personal opinions. I am a blogger who works at EMC, not an EMC blogger. This is my blog, and not EMC's. Content published here is not read or approved in advance by EMC and does not necessarily reflect the views and opinions of EMC.