Will the Single Box System make a Comeback? December 8, 2011

For about 12 months now I’ve been saying to people(*) that I think the single box server is going to make a comeback and that nearly all businesses won’t need the awful complexity that comes with the current clustered/Exadata/RAC/SAN solutions.

Now, this blog post is more a line in the sand than a well-researched or even fully thought-out white paper – so forgive me the obvious mistakes everyone makes in a first draft of an argument, before they check their basic facts. It’s the principle that I want to lay down.

I think we should be able to build incredibly powerful machines based on PC-type components, machines capable of satisfying the database server requirements of anything but the most demanding or unusual business systems. And possibly even them. Heck, I’ve helped build a few pretty serious systems where the CPU, memory and inter-box communication are PC-like already. If you take the storage component out of needing to be centralised (and thus shared), I think a major change is just over the horizon.

At one of his talks at the UKOUG conference this year, Julian Dyke showed a few tables of CPU performance, based on a very simple PL/SQL loop test he has been using for a couple of years now. The current winner is 8 seconds by a… Core i7 2600K, i.e. a PC chip, and one that is popular with gamers. It has 4 cores, runs two threads per core at 3.4GHz and can boost a single core to 3.8GHz. These modern chips are very powerful. However, chips are no longer getting faster so much as wider – more cores, more ability to do lots of the same thing at the same speed.
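Julian’s actual benchmark is a short PL/SQL loop run inside the database; the script below is only a rough shell analogue of the same idea – timing a tight single-threaded arithmetic loop – with an iteration count picked arbitrarily for illustration:

```shell
#!/bin/sh
# Time a tight integer loop: a crude single-threaded CPU benchmark.
# (The real test is PL/SQL run inside Oracle; this only illustrates
# the principle of timing pure CPU work, with no IO involved.)
start=$(date +%s)
i=0
total=0
while [ "$i" -lt 1000000 ]; do
  total=$((total + i))
  i=$((i + 1))
done
end=$(date +%s)
echo "sum=$total elapsed=$((end - start))s"
```

Run it a few times on different machines and compare; like Julian’s test, it measures nothing but how fast one core can grind through a loop.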

Memory prices continue to tumble, especially with smart devices and SSD demands pushing up the production of memory of all types. Memory has fairly low energy demands so you can shove a lot of it in one box.

Another bit of key hardware for gamers is the graphics card – if you buy a top-of-the-range graphics card for a PC that is a couple of years old, the graphics card probably has more pure compute grunt than your CPU, and a massive amount of data is pushed to and fro across the PCIe interface. I was saying something about this to some friends a couple of days ago, but James Morle brought it back to mind when he tweeted about this attempt at a standard for using PCI-e for SSD. A PCI-e 16X interface has a theoretical throughput of 4000MB per second – each way. This compares to 600MB per second for SATA III, which is enough for a modern SSD. A single modern SSD. {What I am not aware of is the latency for PCI-e, but I’d be surprised if it was not pretty low.} I think you can see where I am going here.
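Using the theoretical figures quoted above (4000MB/s each way for PCI-e 16X versus 600MB/s for SATA III – both the post’s quoted numbers, not measurements), the gap is easy to put a number on:

```shell
# Compare theoretical one-way bandwidth: PCI-e 16X vs SATA III.
pcie_x16=4000   # MB/s each way, PCI-e 16X (figure quoted above)
sata3=600       # MB/s, SATA III (figure quoted above)
echo "PCI-e 16X offers over $((pcie_x16 / sata3))x the bandwidth of SATA III"
```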

Gamers and image editors have probably been most responsible for pushing along this increase in performance and intra-system communication.

SSD storage is being produced in packages with a form factor and interface that enable an easy swap into the place of spinning rust – for example, a SATA3 interface and a 3.5-inch hard disk chassis shape. There is no reason that SSD (or other memory-based) storage cannot be manufactured in all sorts of different form factors; there is no physical constraint of having to house a spinning disc. Cost per GB of course keeps heading towards the basement. TB units will soon be here, but maybe we need cheap 256GB units more than anything. So, storage is going to be compact and able to come in form factors like long, thin slabs or even odd shapes.

So when will we start to see cheap machines something like this: four sockets for 8/16/32-core CPUs; 128GB of main memory (which will soon be pretty standard for servers); memory-based storage units that clip to the external housing (to provide all the heat loss they require) and combine many chips to give 1GB-per-second IO rates, connected via the PCIe 16X or 32X interface. You don’t need an HBA – your storage is internal. You will have multipath 10GbE going in and out of the box to allow for normal network connectivity and backup, plus remote access to local files if need be.

That should be enough CPU, memory and IO capacity for most business systems {though that quote from the 1960s about how few companies could possibly need a computer springs to mind}. You don’t need shared storage for this; in fact I am of the opinion that shared storage is a royal pain in the behind, as you are constantly having to deal with the complexity of shared access and with maximised contention, all on the flimsy excuse of “sweating your assets”. And paying for the privilege of that overly complex, shared, contended solution.

You don’t need a cluster as you have all the CPU, working memory and storage you need in a 1U server. “What about resilience, what if you have a failure?”. Well, I am swapping my opinion on RAC back to where it was in 2002 – it is so damned complex it causes way more outage than it saves, especially when it comes to upgrades. Talking to my fellow DBA types, the pain of migrations, the bugs that come and go from version to version, and the mix of CRS, RDBMS and ASM versions are taking up massive amounts of their time. Data Guard is way simpler, and I am willing to bet that for 99.9% of businesses, other IT factors cause costly system outages an order of magnitude more often than the difference between what a good MAA Data Guard solution can give you and what a good stretched RAC one can.

I think we are already almost at the point where most “big” systems that use SAN or similar storage don’t need to be big. If you need hundreds of systems, you can virtualize them onto a small number of “everything local” boxes.

A reason I can see it not happening is cost. The solution would just be too cheap; hardware suppliers will resist it because, hell, how can you charge hundreds of thousands of USD for what is in effect a PC on steroids? But desktop games machines will soon have everything 99% of business systems need except component redundancy – and if your backups are on fast SSD and you use a way simpler Active/Passive/MAA Data Guard type configuration (or the equivalent for your RDBMS technology) rather than RAC and clustering, you don’t need that total redundancy. Dual power supplies and a spare chunk of solid-state you can swap in for a failed RAID 10 element are enough.


“it is so damned complex it causes way more outage than it saves” – Yeah, well: I’ve been saying precisely that since day one of RAC, but does anyone listen?
No, of course not: what we need is to generate more con-sultancy business to get all this extreme complexity going, isn’t it? And stuff the consequences for the average client when the “extreme consultant” is gone after having dropped a white elephant in the IT department.
But that will never stop the Julians of this world from trying to force-sell an extended RAC solution to a client who doesn’t need one now, has never needed one and will never need one, isn’t it? Ah well, gotta recoup all that investment in RAC of the last 10 years somehow, at whatever cost (to clients and to Oracle itself)…
Yes, I am totally in agreement with you and in total disagreement with overly complex solutions that achieve nothing but sell consultancy time. Sorry, but it needs to be said.

I turned off every single OPS system I came across (for you youngsters, Oracle Parallel Server was Oracle’s attempt at clustering before RAC) because it was, sorry, cr4p. When RAC was new, it was just too unstable to label “High Availability”, and you still had to design a system specifically for it to stand anything but the most fanciful chance of giving performance as good as a single node. 90% of the time, if you shut down RAC and ran just one node, reliability and performance both increased.

However, come the end of 9i and into 10g, it seemed to me that if you had plenty of experience and you knew how to tweak your db design and application code to be RAC-friendly, two-node RAC was good. Four-node RAC was fine once you had spent 6 months finding the specific way you had to set it up for your hardware and OS. It was a way to get CPU grunt and memory allocation from four cheap boxes {well, three, as you had to plan for one to go pop, otherwise what was the point} that you just could not afford to provide with one big one. That, and Enterprise storage was so expensive and tricky to manage that you tried to have only a few of those beasts. RAC made sense for a while.

But if you can get away from shared storage? Heck, I really do think it will be truly required by only 1 or 2% of very large, very rich companies. In which case, they are probably going to buy Exa-whatever.

- “If you are going to use simple loops to evaluate performance, Access may be your database of choice.”
Hehe, play nicely :-)
It’s actually a very sensible test. It is testing only “CPU speed”, nothing more, but it is testing the CPU speed that Oracle can see. Also, and this is a very important consideration, it is such innocuous code that Julian can persuade DBAs to run it on their live systems.

– “The thing about gamers and video, they don’t care so much if you lose a couple of bits here and there”
The thing about DW and BI is they don’t care so much if you lose a couple of records or RI here and there. Wild guessing may be your database of choice.

Actually, one of the best ways to test how fast it is possible for a given CPU/mem config to get I/O into and out of it is to do a dd of /dev/zero into /dev/null for a fixed size: no disk involved, so you can see how fast the CPU/mem combo can run the I/O portion of the code and suck data into and out of memory.
In our Power6 box I get 800MB/s for a single CPU/mem combo. If I then fire two processes, I get an aggregate of twice that. I stopped the testing at 4 concurrent processes in our DW lpar as the rest of the box simply “froze”!
Of course: this is not our true I/O capacity. With only two 4Gbps FC HBAs, the best I can hope for is 800MB/s aggregate – no matter how many CPU/mem combos I throw at it.
Which is kinda confirmed experimentally: when restoring with RMAN and 4 streams I get – sustained – 700MB/s aggregate I/O (hey, it takes time to uncompress the backup, hence no 800!)
Sometimes a closed loop is a good way of testing max theoretical raw performance! ;-)
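The dd test described above can be sketched as follows {illustrative only – GNU dd style options, and the figures reported will of course vary by box}:

```shell
# Push 1GB of zeroes through memory only: no disk is involved, so the
# rate dd reports shows how fast one CPU/mem combo can drive the IO
# code path. Fire several of these concurrently to test aggregate rates.
dd if=/dev/zero of=/dev/null bs=1M count=1024
```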

Should have said earlier – nice test. I usually resort to asking the sys admins.
One thing to remember is that, more and more, DBAs have no direct access to the unix box without jumping through hoops. I’ve even seen situations where the DBA has no unix access and installs have to be done with a sys admin at their side. What a waste of resource.

A return to single node systems with Active Dataguard for DR will only come to pass if the profit margin is greater than selling a huge multi-node RAC system with interconnects everywhere. So it’s never going to happen.

Salesmen outperform Infrastructure Architects nine times out of ten, regardless of The Truth(tm).

Good points on RAC, Martin. After I moved from a shop which uses RAC heavily to a shop which does not use RAC at all but uses active/passive, I can say, “hell yeah, this is what I can really call uptime”. The only thing I miss about RAC is having ASM (somehow I started not to like filesystems), but I don’t think it can be an option for non-GI environments anymore like it used to be.

One more thing I wanted to add is that I have never seen a customer who adds nodes (like Larry first advertised) when the hardware is not enough; what they always do is buy newer, more powerful servers with the same number of nodes. So that selling point is just not real at all.

I personally think that if you have enough budget for redundant hardware and you really care about uptime, RAC is not the way to go.

Good points, Coskan. Particularly the one about the h/w upgrade. I wish I had a dollar for every time I’ve heard the “we can always add-on memory/CPU/servers” mantra.

99 times out of 100 one finds that, due to the planned obsolescence of modern hardware, it is a completely false argument: the memory will require complete replacement as it’ll be a different model, the CPU will require a complete replacement as it’ll be a different chip level requiring a new mobo or at least a slot upgrade, and the server will be a different model as the old one is not made anymore!

With the result that the whole “add-on later” thing never happens: it’s almost always a complete replacement.

Realistically speaking if one “adds-on” more than 24 months after initial install, one is up for a major upgrade instead of an “add-on”. And if one needs to add-on less than 24 months after initial install, then the question should be asked: “who configured that system in the first place with so little growth capability?”. Where is the ROI in that?

The whole argument for “add-on later” suffers from a darn little inconvenient truth, called “R-E-A-L-I-T-Y”! ;-)

Of course: one could argue that RAC is supposed to facilitate using various levels of server hardware all accessing the same database. Hence it’d be possible to have different capacity servers using the same database. That would easily allow for upgrading each server as/if needed. Ever tried to do that or heard anyone doing it successfully?
Oooooops…

The use of different levels of servers, and thus an unbalanced RAC, used to be strongly recommended against by Oracle – at least the technicians in Oracle said “don’t do it” anyway. I worked on one such system for a while: one older server and two more modern units with twice the CPU power and memory. It gave a few issues, and I think (can’t back this one up) we saw a heck of a lot of buffer cache churn on the smaller box.

System stats (CPU speed and relative IO speed) are not RAC-aware: you have one set for the instance and thus for all nodes in the cluster. In our case it turned out that the relative speed of single to multiblock reads was the same across all nodes, though slower on the older machine (10ms and 18ms compared to 6ms and 13ms, if I remember correctly). You would think that, as it was the SAN providing the data, the IO speed would be very similar, but no.

I still think that about the only reason for RAC is to allow you to fake up a much larger machine with a set of much cheaper, smaller ones.

Good point by Coskan as well.
And note that “larger” hardware always seems to be available.

With the current state of play, if you can’t find a box strong enough to run your load, you have a code/app problem, not a hardware problem.

That leaves “HA” as the other reason for RAC, but then you discover that “clusters” only cover part of your HA: you will generally need a failover or standby box anyway. And since the additional RAC layers only add to your problems, not really to the solution… why have it in the first place?