I know this is a "software-only" discussion, and I love how you separated "MD" from "LVD", but you failed to even list the equivalent hardware options! I would be interested to find out whether you were confusing "Software RAID Cards" with "[True] Hardware" RAID Cards.
So, to "rehash" the list again with the "hardware" choices for everyone else:

Worse yet, you're making these kinds of statements in your software RAID-focused article:

"let me advise that you do not select the first option. Linux RAID device drivers are almost universally listed at a tier-3 level of support; essentially, you're on your own. A few made it to tier-2 support, but these are mostly the more expensive controllers on the market."

Are you comparing your experience with AMI and Adaptec's "non-support" of their SCSI RAID products under Linux with the direct support, "tier-1/2-like level" that DPT (before Adaptec bought them out), IBM/Mylex, ICP-Vortex and 3Ware Do have for theirs??? I know many OEMs that sell DPT (again, before Adaptec bought them out), IBM/Mylex, ICP-Vortex and 3Ware because they Do offer "direct support" at a tier-1/2 "level." Heck, you'll recognize them by looking at their driver's source code and seeing a "(C) (company name)" right next to the GPL (or GPL-compatible) license.
And, again, I hope you aren't confusing "Software RAID Cards" (which are just an ATA controller with a BIOS, so the "RAID" is done 100% in an OS-specific software driver) with "[True] Hardware RAID Cards" which aren't so expensive if you'd bother to research. I'm sorry, but many of these great, hardware RAID solutions are well worth the cost! 3Ware solutions, with their non-blocking I/O ASIC + SRAM approach, start at only $100 for a 4-channel/4-drive product (Escalade 6410) nowadays. Heck, they even have a firmware update that supports Ultra133/137GB+ drives! And official, GPL 3Ware drivers have been in the stock kernel since 2.2.15.
And finally, regarding:

"And if there's one problem we absolutely want to avoid, it's Linux device driver problems. Life's too short for this kind of pain."

I have used ICP-Vortex on Linux/x86, DPT on Linux/x86 _and_ Linux/Alpha (let alone NT/x86, NT/Alpha and VMS/Alpha) and IBM/Mylex on Linux/x86, in addition to 3Ware. 0 issues in every case! Cannot say the same for RAID-5 with LVM approaches (let alone MD), especially with the "regular changes" in the kernel's VM in 2.4.x.
For "reliability," I want a well-supported hardware solution at this point. Just because most people blindly choose Promise and Adaptec for controllers (which are the two worst for Linux because of their "retail focus," which means "non-disclosure" of specifications) doesn't mean they don't exist or aren't affordable!

Bryan J. Smith
SmithConcepts, Inc.
Consulting Engineers and IT Professionals

This is to address the issue of the above claim that Adaptec does not support Linux.

I am the Linux driver author/maintainer for the RAID products at Adaptec. Adaptec has Linux as a tier 1 OS. We actively support the Linux OS for all of our RAID and SCSI products. A few years ago this was not the case, but it has been for the last 2 years. We provide drivers, Q/A our products, and provide tech support for these products under Linux.

The 2400A was not a DPT product. It always has been an Adaptec product. Adaptec did use the DPT firmware and the i2o interface. It has produced many new products from that interface. The Linux driver (dpt_i2o) is now part of the generic kernel.

We are also now producing new products based on a new firmware. While it is not the i2o interface, all of these products do use a common interface allowing the same driver to work with them all (aacraid). Adaptec is supporting and helping maintain the aacraid driver that is in the generic kernel.

When I was searching for a hardware RAID controller for a small internal app to run as a proof-of-concept on RH8.0, we chose the Adaptec 2100S controllers. Much of the reasoning came from the O'Reilly text "Managing RAID on Linux", which featured a section on the 2100S controller.

The controllers have worked well on w2k and Linux.
I have no idea as to their RAID 5 performance, as I don't configure storage systems with RAID 5. RAID 10 performance is good.

According to the Orlando Adaptec office, which is both a RAID support call center as well as an ASIC design center, Linux is NOT a supported platform. Furthermore, Adaptec is moving away from the I2O design. This is apparent in their new 4-channel 5000 series.
I know this because I interviewed for a job there as an ASIC designer.

Did you use the notail option to mount the Reiser File Systems? This is not set by default and prevents packing of data (to save space) and increases performance. Check http://www.namesys.com/ for further performance tips.
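For anyone wanting to try it, a sketch of what that might look like in /etc/fstab (the device and mount point here are placeholders):

```
# notail disables tail packing: trades some disk space for performance
/dev/sda2  /data  reiserfs  notail  0  0
```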

I believe XFS is quite the performer out of all the journalled filesystems. I haven't heard great data integrity stories about ext3 (mainly disasters), which doesn't help even if it's fast. It's still a little early for ext3, I think, as I would rather still have my data intact.

LVM is getting close, but has some quirks like "losing your volume set info". I suggest you make a backup of the relevant files - lose those and you have trouble.

I used XFS for my machine mostly because I hate fsck time and I was planning on putting this on a laptop. Since SGI came out with Redhat installation disks (which made it very easy to install), I thought this was the ticket.

While it worked well for normal operation and through a few "crashes", it didn't take very long until a "crash" made XFS very confused and I was hosed. I'm sure a little patience and another computer connected to the web would have enabled me to get the disk back, but that is not the point.

Now I'm using ext3 and even though I still enjoy just turning the computer off at random times, it's done a great job.

The author did not state which journaling mode was used on ext3. It matters.

If the database is seeking all over the filesystem and then running fsync(), then ext3 in data=journal mode can make a huge difference, because all the dirty data is written out *linearly* to the journal, for later asynchronous writeback. This can offer 10x speedups or more.

Of course, the data has to eventually be written back to the main filesystem, but this can happen later, and allows better request merging, and is non-blocking.

If the author was using the default ordered journaling mode, well, then I'm surprised. Probably he was, in which case ext3 will perform better than indicated here.
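The journaling mode is chosen at mount time; a hypothetical /etc/fstab entry (the device and mount point are placeholders):

```
# data=journal: all data goes through the journal before writeback
/dev/sdb1  /var/lib/db  ext3  data=journal  0  0
```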

Flemming 'Not really anon just lazy' Frandsen says:
If you run a DBMS then you do NOT do this on top of RAID 5, it is slow as ***** for databases!!

The reason RAID 5 is so wrong is that a database is typically larger than the available RAM *and* changes to data are small and spread out.
RAID 5 needs the data from all the other disks to write one bit, so if the data isn't cached already (and it isn't with RAW devices - remember, that's what RAW means), then all the disks need to read their lumps, recompute the checksum, and write to two disks (the one with the data and the one with the checksum).

If you want DBMS performance you MIRROR, set up a number of striped disks (not partitions) and let the database run on top of that.

Use at least 4 disks for the database: two for the data mirror and two for the log mirror.

I'm SOOO sick of people saying RAID5 is "slow". I'm also REALLY, REALLY sick of people suggesting software raid is faster or safer than hardware raid under linux.

Both are true in many circumstances, but both are certainly NOT true in other circumstances. The circumstance in which RAID5 is not slow and is certainly MUCH safer than software raid is actually very common - you just need a good, well-supported raid controller.

Dell has shipped very nice "PERC" raid controllers for several years, repackaged first from AMI and later from Adaptec. They have on-board batteries (you *did* buy the battery, didn't you?) that last up to 72 hours and up to 128MB of cache. Controllers can be clustered, in which case even the cache is mirrored. They're supported under Linux using the Megaraid drivers and quite stable.

Computing parity doesn't slow down the server's CPU - all raid stuff is handled on the raid controller. The impact of parity computations on this controller is negligible. Raid5 is effectively as fast as RAID1, since you do the same number of writes per stripe - the only additional overhead is the parity computation and supporting read which is almost *always* pulling from controller cache. Sure it's possible that application IO patterns will cause cache-misses, but for most applications with such a large cache that's not the case.

I will never understand the willingness of some admins to rely on software raid - handled by the kernel - to manage partitions holding that kernel. That just creates a circular dependency - if your kernel crashes it's quite likely your raid will be corrupted...which in turn means it's quite likely your kernel will no longer exist on disk after a reboot. Don't you feel *much* safer relying upon a separate piece of hardware to handle all the disk mirroring and striping? That piece of hardware is identical for most of its users, unlike your kernel, and it's very carefully backed up with a battery. It's also backed up by a company whose livelihood depends upon its reliability.

Software raid is by definition slower and less reliable than hardware raid - the only exception is when you use poor quality hardware raid controllers, and since there are stable, high-quality controllers for intel boxes running linux, using poor quality hardware just means you're ignorant or impoverished.

``I will never understand the willingness of some admins to rely on software raid - handled by the kernel - to manage partitions holding that kernel. That just creates a circular dependency - if your kernel crashes it's quite likely your raid will be corrupted...which in turn means it's quite likely your kernel will no longer exist on disk after a reboot. Don't you feel *much* safer relying upon a separate piece of hardware to handle all the disk mirroring and striping?''

Modern "hardware" is hardwired software. The kernel with its drivers is similarly reliable. I have been using Linux since 2000 on a hundred machines and I have never had a production system crash due to the kernel. The moving parts are what usually fail: fans and drives. I keep my kernel in RAID 1 arrays, so if I ever need to reboot, one of the mirrors will be available. I prefer RAID 1 over RAID 5 for data, because I write once and read thousands of times. I like being able to read several files at once to keep multiple processes happy. We always use journalling filesystems so any sort of a crash is recovered more gracefully.

I would imagine that RAID 10 would be the perfect mix for this then. You have the redundancy of mirroring and you would have the speed advantage of striping. It would just take twice as many drives. But, if your data is that important...

Not to champion RAID 5 for any particular use, but a write doesn't require data from all the other disks. It only needs to read the redundant stripe, XOR it with the old and new data block, and write both. RAID-5's read of the redundant data, and possible re-read of the old data stripe makes it quite slow.

Data from all the disks is only needed during RAID-5 recovery. You read all stripes, XOR them, and what's left is the lost data.
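The arithmetic in both posts is just XOR. A minimal sketch (hypothetical byte strings standing in for disk stripes, not real I/O), showing that a partial-stripe write touches only the old data block and the parity block, while full-array XOR is only needed for recovery:

```python
# RAID-5 parity arithmetic, illustrated at the byte level.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Four data "disks" plus one parity "disk", one stripe of 4 bytes each.
data = [b"\x11\x11\x11\x11", b"\x22\x22\x22\x22",
        b"\x33\x33\x33\x33", b"\x44\x44\x44\x44"]
parity = data[0]
for block in data[1:]:
    parity = xor(parity, block)

# Read-modify-write: update disk 2 without reading disks 0, 1 or 3.
new_block = b"\xAA\xAA\xAA\xAA"
parity = xor(xor(parity, data[2]), new_block)  # old parity ^ old ^ new
data[2] = new_block

# Recovery: XOR all surviving stripes to rebuild a lost disk.
lost = data[1]
survivors = [data[0], data[2], data[3], parity]
rebuilt = survivors[0]
for block in survivors[1:]:
    rebuilt = xor(rebuilt, block)
assert rebuilt == lost  # the missing stripe falls out of the XOR
```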

Then, contrary to popular lore, much RAID-5 gear isn't as failsafe as it's marketed. A power failure between the data write and the redundant write will leave the array in a mid-state. It will fix this, but disks have the greatest tendency to fail on power-up. "No single point of failure" is the hype, but this likely duplex failure is ignored.

My recommendation for "one big disk"? Buy twice what you need, mirror on two controllers, then stripe across the mirror sets.

RAW filesystems exist for data consistency, not performance, as the DBMS needs to know when data has been flushed to disk, rather than sitting in an OS buffer. Once you are using RAW devices, it is up to the DBMS to handle all disk buffering, and it shouldn't be a surprise that (for example) Oracle does not do this as efficiently as the Linux kernel.

RAID doesn't help things here either - head thrashing will kill performance. How about putting data on one drive, indexes on another, and saving another drive for the database journals (which are written serially)? Not being able to do this was one of the minor criticisms I had of the excellent PostgreSQL a while back. Time spent tuning the slowest part of the system really is time well spent, and the rewards can be surprising. Even though disk drives now have much better data transfer rates, it is the seek time which is the real killer.

I'm not interested. MySQL is not a real database, period. The only reason it gets so much use is because all the clueless web host admins use it (you know, the kinds who don't know that perl 4 is obsolete already, and that Java is meant to be used in a servlet container as opposed to CGI, arrrggg), thus all the PHP ppl use it, thus web host admins use it again. It's a vicious loop. You can *hardly* find a professional hosting environment with PostgreSQL, unless of course you do either dedicated hosting or colo.

It's a shame. PostgreSQL is awesome and a pleasure to deal with. It's the closest free DB to Oracle, imo. In fact, psql is a hundred times better than sql*plus even!!! sql*plus is a horrible piece of junk in comparison to psql.

So, what's the point of comparing a non-DB (MySQL) to a real DB (PostgreSQL)? One can wish to enlighten all the MySQL users out there, but this topic has been covered at length before...

As Dynamic Content Web Developer, I have worked both with mySQL and PostgreSQL.

Well, to be a database, a program should at least be able to STORE DATA... PostgreSQL 6.5 once DESTROYED the whole database with a single INSERT, while other 6.5 and 7.0 installs experienced corruption every now and then...

Sorry, maybe MySQL is not a real DB, but in the past PostgreSQL was not even a good data store!! Having lost data in the past, I now go with MySQL because I simply don't trust PostgreSQL any more.

Maybe it is mature on other platforms, but on Linux it is very immature.

From personal experience with xfs, stay away from xfs for another 6 months or so, at least. xfs on Linux is still very raw. I can't tell you how many corruptions I've seen using it. Some corruption causing bugs in xfs tools were fixed just recently.

I think xfs holds some promise, but I really do think ppl have been over-hyping it. xfs is not all that.

Advice: if you are crazy and do want to go ahead and use xfs on Linux in a production environment, make sure you:

1. Only use SGI's kernel tree! Either get latest stable kernel from sgi's site, or get the latest kernel and tools from SGI's CVS. Do not apply any patches, especially if you're not a kernel hacker. Use vanilla SGI kernel, period. In general, follow advice on SGI's site for the most stable kernel + xfs tools.

2. Make backups. I mean it. I've seen corruption even on pure vanilla SGI kernel + vanilla SGI xfs tools, both from their cvs.

As a side note, ReiserFS has been getting a ridiculous number of fixes lately. I would stay away from it as well. I know nothing about JFS, but if it's anything like XFS, stay away for now until it gets more mature. If you want a journalized FS, use ext3. That's my suggestion, regardless of these DB tests.

Most important in an FS, and especially in a journalized FS, is data integrity. Benchmarks don't matter one bit if your data integrity is not assured. It's kind of silly actually. I think the author of the article should have done research to find the most stable FS instead of the best performing FS, imo. Of course finding the most stable FS is really hard, because there is no 100% reliable benchmark for stability.

I'm sorry to disagree with you. I have installed about 20+ servers in the last 8 months, all based on XFS, all made by clean ftp.kernel.org source + XFS patch, compiled with the recommended compiler, and always the latest tools. And I have NEVER SEEN EVEN ONE fs corruption while running XFS. And these are servers, not workstations (those I don't count anymore). Add to that the ability to do on-line defragmenting and on-line resizing... I have found the filesystem I needed. The only thing I'm worried about is the speed of development of XFS with regard to ext3. This is what we should watch.

One thing to consider when you are going multi-drive is the MTBF. For seven drives, the risks of failure for one of the drives approaches 100%. This is a very dangerous position to be in for an lvm or a raid 0. While raid5 does provide a hit on write performance, one is able to sleep well at night knowing the chances of two drives failing over the same 4 hour period are low.
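A back-of-envelope sketch of that risk (my own illustrative numbers, not the poster's), assuming independent drive failures:

```python
# Probability that at least one drive in an array fails during a
# service period, given each drive's individual failure probability.

def p_any_failure(n_drives: int, p_single: float) -> float:
    """Chance at least one of n_drives fails, assuming independence."""
    return 1.0 - (1.0 - p_single) ** n_drives

# Hypothetical 3% annual failure rate per drive, over 5 years:
p_five_years = 1.0 - (1.0 - 0.03) ** 5   # per drive, roughly 14%

# With 7 drives the array-level risk climbs well past half.
print(round(p_any_failure(7, p_five_years), 2))
```

The point survives any reasonable choice of per-drive rate: striping without redundancy multiplies exposure, since any single failure loses the whole set.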

Also, when considering high availability situations, if ext3 doesn't meet your criteria, consider another journalling filesystem. A several hundred gigabyte or multi-terabyte filesystem can take anywhere from hours to days to fsck.

Bear in mind that the performance level of downed servers is exactly 0%.

One last point to make. If you haven't done so within the last year, spend a solid week evaluating PostgreSQL. PostgreSQL may be the fastest SQL server out there running in relational mode and may meet some of your needs from time to time.

I have had some software RAID5 problems. I fear that it may be even LESS reliable than RAID0, straight LVM, or straight partitions.

It seems that even in a simple power loss (trip over the power cord) 2 or more disks can be left in a dirty state, corrupting the whole array. Regular old journaling filesystems are pretty robust to this failure... as long as their underlying block device survives!

I'm giving serious consideration to going back to regular filesystems. RAID on anything less than a high-end, ECC, battery-backed buffering controller may just be too risky.

``...the risks of failure for one of the drives approaches 100%. This is a very dangerous position to be in for an lvm or a raid 0. While raid5 does provide a hit on write performance''

Well, the author did say that RAID 5 or 0+1 was how he did things in the real world. With disk and bus speeds getting faster all the time, RAID 5 isn't looking that bad for many applications. The write performance hit is measurable but not horrible, especially with newer controllers.

``one is able to sleep well at night knowing the chances of two drives failing over the same 4 hour period are low.''

But I'd still use journaled on top of RAID 5 or 0+1 in a disk controller/array that allows hot swapping. (As a result, I haven't incurred downtime due to disk failures in, oh, 5+ years.)

We used to use raw RAID devices with Oracle. Disk space management was a nightmare. I pushed for a conversion from raw to journaled filesystems several years ago. You gotta really convince those Oracle DBAs who've had it drummed into their heads by Oracle tech support that you really need to have raw partitions. While our benchmarks didn't show the wide disparity in performance between raw and journaled that the author's runs showed, the journaled filesystems did perform noticeably better. And coupling that with the far, far easier time one has in managing disk space, no one could seriously push for raw any more. Of course, with raw partitions you don't have to wait for fscks either, but you still have to hope that Oracle doesn't need to do media recovery following a system crash. Especially if you've already stashed archive files onto tape. Reading those back from tapes is slower than any fsck I've ever experienced.

And I heartily agree with your plug for PostgreSQL. This is one great package.

Quite interesting that ext3 gives the result seen here. Personally I advise people to stay away from raw devices and RAID 5! Raw was introduced years ago by Oracle as a means to overcome performance issues on cooked filesystems on slow hardware. This has changed a lot over the past 10 years.

RAID 5 is the slowest RAID set and has horrible write times in most cases. Unfortunately it seems like every hardware vendor thinks the answer is RAID 5. What was the question?

Here is how to implement speedy RAID 5 in Oracle (be in archivelog mode, always!):

5 disks: A-E

Create a tablespace TS with 5 datafiles, A.dbf - E.dbf, size 100MB, initial extent 100MB. This will allow striping in a better fashion than RAID 5. You can also use a smaller stripe size and smaller files; just make more files.

One last thing: TPC-C? Nobody uses TPC-C for anything anymore; see www.tpc.org for more details.

If you really need good read performance, make sure that DB_BLOCK_BUFFERS is set high enough and pre-load all tables into cache by doing select * from table; for all your critical tables.

Also set PRE_PAGE_SGA=TRUE in your init.ora file to initialize all the SGA pages at startup.

Want speed on load? alter table <name> nologging;

Also remember to change LOG_CHECKPOINT_INTERVAL when you change the size of your redo log files (the number is in OS blocks, not Oracle blocks). Also set LOG_CHECKPOINT_TIMEOUT to an insanely high value so you don't see checkpoints due to timeouts.
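A hypothetical init.ora fragment pulling the parameters above together (the values are placeholders for illustration, not recommendations):

```
PRE_PAGE_SGA = TRUE
LOG_CHECKPOINT_INTERVAL = 999999999   # measured in OS blocks, not Oracle blocks
LOG_CHECKPOINT_TIMEOUT  = 999999999   # effectively no timeout-driven checkpoints
```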

In addition to this, there are about another gazillion things that can be done, but only after getting a proper bstat/estat or statspack report.

``Here is how to implement speedy RAID 5 in Oracle (be in archivelog mode, always!):
5 disks: A-E
Create a tablespace TS with 5 datafiles, A.dbf - E.dbf, size 100MB, initial extent 100MB. This will allow striping in a better fashion than RAID 5. You can also use a smaller stripe size and smaller files; just make more files.''

And where is the redundancy? This describes a RAID 0 (striped) system, not a RAID 5. RAID 5 has redundancy across the disks: if one fails, you can still be happy.

Not everyone will want to have a single tablespace with uniform extent sizes of 100 MB. That would work if you have one table in your application, no doubt. If you have a 3rd-normal-form type of schema design, you're going to have tables of different sizes.

For example, start with 2 tablespaces, with uniform extent sizes of 128KB and 8MB. The most important things are:

1. make the extent size a multiple of the database block size.
2. make the extent size a multiple of the db_block_size * db_file_multiblock_read_count.

Say the OS read size is 256 KB with an 8 KB db_block_size.
The dbmbrc should be set to 256/8 = 32. Now, that might cause the Oracle Cost-Based Optimizer to choose hash joins and full table scans, so you might choose something less aggressive like 16.
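The arithmetic above as a quick sketch (the sizes are the examples from this thread, not universal values): pick db_file_multiblock_read_count so one multiblock read matches the OS read size, and keep extents a multiple of db_block_size times that count.

```python
# Sizing db_file_multiblock_read_count (dbmbrc) from the OS read size.

os_read_bytes = 256 * 1024        # example 256 KB OS read size
db_block_bytes = 8 * 1024         # example 8 KB db_block_size
dbmbrc = os_read_bytes // db_block_bytes   # 256/8 = 32

# Rule 2 above: extent size should be a multiple of
# db_block_size * dbmbrc. Check the 8 MB uniform extent:
extent_bytes = 8 * 1024 * 1024
assert extent_bytes % (db_block_bytes * dbmbrc) == 0

print(dbmbrc)  # → 32
```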

He did not mention any optimizer settings. That would certainly fall into low hanging fruit.

did he need a shared_pool_size = 128MB?
was he looking at v$sgastat?
he may have been much better off to reduce the shared_pool_size to 64MB (single user, how many statements were really being cached) and allocate the memory to the buffer cache.

did he set the parameter session_cached_cursors = 50?

Did he make sure that asynchronous and Direct IO was used for Raw partitions? Doubtful, as that is what RedHat was selling with their RHAS 2.1 edition.

How did this guy tune Oracle without mentioning the multiblock read count? What type of access paths did his app use?

full table scans
fast full scans
index range scans
hash joins
were bitmap indexes used?
did he flush his buffer cache between runs?

I admire him for publishing results, but - there are other sources of detailed information that cover areas in detail that are way beyond the scope of these 2 articles.

Search for papers written by Jonathan Lewis and Tom Kyte for starters. Check out comp.databases.oracle.server.

Check out the papers up at hotsos.com and oraperf.com.

then again, if everyone knew everything about oracle, I'd be working on something else.
