For the sake of brevity, I'll get right to the I/O setup. The drive configuration is shown to the right. An 80GB EIDE drive is connected to the PATA connector on the motherboard. The BIOS detects it as the first disk, which is perfect for this setup. The four SATA II drives are snapped into a drive cage, and their data connections are plugged into the four SATA ports on the motherboard. My board doesn't support hotswap, so I'll have to power off the system to replace a drive if one fails.

Hardware-Based vs. Software-Based

The big question when doing RAID is: hardware or software? The hardware approach requires a fairly expensive controller card, while the software approach requires a more complex setup and a fast processor for in-memory parity calculations. I did a lot of research and will try to summarize it in a table:

Option                                  Hardware RAID   Software RAID   Fake RAID
Has CPU overhead                        No              Yes             Yes
Requires controller hardware            Yes             No              Yes
Requires OS drivers                     No              Yes             Yes
Platform-independent                    Yes             No              No
Supports all RAID levels                Yes             Yes             Maybe
Data usable with different h/w or s/w   No              Yes             Maybe

The “fake RAID” column exists for cost-saving reasons. It was introduced as a “best of both worlds” solution…combining the low cost of software RAID with the accelerated performance of a specialized disk controller. But it never lived up to that promise, and advances elsewhere soon made it unnecessary: Microsoft introduced software RAID in Windows 2000 Server, and the Linux kernel's md RAID support matured with the 2.4 release in early 2001. The market for fake RAID has shrunk ever since.

As for my system–I'm sticking with a pure software RAID solution, even though I have access to a Fake RAID controller card.

The Software Setup

The software setup involves partitioning all the drives, building a RAID array, formatting it, and adding it to the system configuration.

Partitioning Drives

Drive partitioning is the process of splitting a disk into different logical sections (kind of like different songs on a CD). It has to be done before the drives are usable by an operating system. Normally this is done by the operating system at install time, but I wanted to wait until after the install and configure things myself.

I completed my partitioning with cfdisk. It's a little more robust than fdisk, and the man page for fdisk recommended cfdisk for what I was doing (creating partitions for use on Linux). I want to delete any partitions that already exist, allocate 100% of each disk to a primary partition, and set the partition type to FD–the Linux RAID autodetect partition type.

This is what my console looked like before configuring the first data drive (/dev/sdb).

The partition tool seemed concerned that I wasn't marking the partition as bootable. It warned me when I hit Write, and left a message on my screen after exiting cfdisk. In my case, this is OK. I have a different disk (/dev/sda) that is bootable.

root@werewolf:~# cfdisk /dev/sdb
Disk has been changed.
WARNING: If you have created or modified any
DOS 6.x partitions, please see the cfdisk manual
page for additional information.
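Doing that interactively four times gets tedious. For reference, the same layout can be scripted with sfdisk; this is a sketch of the idea rather than what I actually ran, so double-check it before pointing it at a real disk:

```shell
# Create one primary partition spanning each disk, type FD (Linux RAID autodetect).
# WARNING: this overwrites the partition table on every listed device.
for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    echo ',,fd' | sfdisk "$d"
done
```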

After repeating the above process for all 4 of my data drives (/dev/sdb, /dev/sdc, /dev/sdd, and /dev/sde), I ran fdisk -l to see what my system's partitioning looked like:

But I'm not sure about the chunk size…what's that? According to the Software RAID HowTo, it's the amount of contiguous data written to a single disk before the array moves on to the next one.

Obviously a larger chunk size will reduce the number of disk operations, but it will increase the compute time needed to generate each parity block. There's probably a happy medium in there somewhere, but it's going to be affected by the type of data being written to the array. In my case, I know the array will be used mostly for A/V files (photos, music, and movies), so I'm willing to try a large chunk size. The HowTo recommends 128K, so I'll go with that instead of the default 64K.
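The array-creation command didn't survive into this write-up, but it can be reconstructed from details elsewhere in the post (four data partitions, parity-based RAID 5 implied by the stripe-width of 96 in the formatting section, and the 128K chunk chosen above). It would have looked something like this:

```shell
# Build a RAID 5 array from the four FD-type partitions with a 128K chunk.
# (Reconstruction; the exact command isn't preserved in this post.)
mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=128 \
    /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
```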

Nice! The command gave me my terminal back, rather than locking it up for untold hours. I was afraid I'd have to run it with nohup or put it in a background process. So…aside from the blinking lights on my hard drive bays, how can I tell when my array is created? Lucky for me, there's a file in the /proc directory that I can cat:
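That file is /proc/mdstat. While the array is building, it lists the member partitions along with a progress bar and an estimated completion time. An easy way to keep an eye on it:

```shell
# Re-runs the command every 2 seconds; Ctrl-C stops the watching, not the build.
watch cat /proc/mdstat
```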

Formatting the Array

The command I need to run is:

mkfs.ext3 -b 4096 -E stride=32,stripe-width=96 /dev/md0

This command uses the ext3 filesystem. I got the parameters from a calculator here. It sets the block size to 4096 bytes (only 1024, 2048, and 4096 are available on my system). stride is the number of filesystem blocks per RAID chunk: 128K / 4096 bytes = 32, matching the chunk size I gave when setting up the array. stripe-width is the stride multiplied by the number of data disks: 32 x 3 = 96 (four drives, minus one drive's worth of parity). These hints let ext3 align its writes with the array's stripes.
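The arithmetic behind those two numbers is simple enough to check by hand:

```shell
# Derive the mkfs.ext3 stride/stripe-width values from the array geometry.
chunk_kb=128       # RAID chunk size chosen when creating the array
block_bytes=4096   # ext3 block size (the -b 4096 above)
data_disks=3       # 4 drives minus 1 drive's worth of RAID 5 parity
stride=$(( chunk_kb * 1024 / block_bytes ))
stripe_width=$(( stride * data_disks ))
echo "stride=$stride stripe-width=$stripe_width"
# prints: stride=32 stripe-width=96
```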

Adding the Array to System Startup

All the above details are for setting up an array the first time. But so far we haven't told the OS how to reassemble the array at boot time. We do that with a file called /etc/mdadm.conf. The file format is fully explained in the man page for mdadm.conf. In my case, I need to tell it about the 4 partitions that contain data, and about the array itself (raid level, number of devices, etc.).

Below is my /etc/mdadm.conf file. The last two lines contain my email address and the name of the program that will watch for certain md-related events and email me if something goes wrong.
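The file contents didn't make it into this post, so here is a sketch of the general shape (the email address and program path are placeholders; the ARRAY line can be generated with mdadm --detail --scan):

```
DEVICE /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
ARRAY /dev/md0 level=raid5 num-devices=4 devices=/dev/sdb1,/dev/sdc1,/dev/sdd1,/dev/sde1
MAILADDR someone@example.com
PROGRAM /usr/sbin/my-md-event-handler
```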

This info tells me there are 4 RAID partitions on my system, but only 3 are associated with my /dev/md0 array. They are all active and working, but the array is in the “clean, degraded” state (clean means the data on the remaining disks is consistent; degraded means the array is running with one device missing). The fourth partition isn't even part of the array. How do I add the partition back to the array? It's pretty simple.
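Adding it back is a one-liner. The device name below is illustrative; use whichever partition the statistics show as missing:

```shell
# Re-attach the dropped partition; mdadm recognizes it as a former member.
mdadm /dev/md0 --add /dev/sde1   # device name is illustrative
```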

That's it…the partition is added back in because it was once a part of the array, and mdadm can recover it (i.e. bring it up to date on any changes that have been applied since it was disconnected from the array).

So looking at the statistics, I can see the partition is back, and the array status has changed to “clean, degraded, recovering”.

Replacing a Bad Device

Well, it eventually happens to everyone. One of my RAID drives went bad. I don't understand why–it wasn't doing anything demanding or different. But luckily I got this email, thanks to a properly configured /etc/mdadm/mdadm.conf file:

The important thing is the (F) next to the sdc1 partition. It means the device has failed. I power cycled the machine and the array came up in “degraded, recovering” status, but the rebuild failed after several hours. After two or three attempts, I decided the drive was bad (or at least bad enough to warrant replacing). Here are the steps:

Run mdadm /dev/md0 --remove /dev/sdc1 to remove the failed partition from the array

Replace the faulty drive with a new one

Use cfdisk as described above to set up the new drive like the others

Run mdadm /dev/md0 --add /dev/sdc1 to add the new partition to the array

After that, cat /proc/mdstat reported the array was recovering. It took nearly 6 hours to rebuild the data, but everything went back to normal. No lost data.

Benchmarking for Performance

One thing I've learned by experience is that you should benchmark a filesystem before you start using it. This isn't such a big deal on regular desktop systems where the I/O load is fairly light. But on I/O-bound servers like a database or a media server, it really matters.

The Test

Wikipedia's Comparison of File Systems led me to 3 candidates for my media server: EXT3, JFS, and XFS. EXT3 is the default filesystem on Linux (as of Summer 2009), and JFS and XFS get really good reviews on various forums. But which one has the best performance on a media server? I decided to perform a common set of tests on all 3 filesystems to find out, and wrote a script to automate the whole battery.
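The script itself isn't reproduced here, but judging from the measures in the tables below, it wrapped each step in /usr/bin/time -v (which reports elapsed time, page faults requiring I/O, filesystem inputs/outputs, and CPU use). A sketch of the loop, with the device and mount point as assumptions:

```shell
# For each candidate filesystem: create it, run IOZONE, then create and
# delete a 5GB file, measuring every step with /usr/bin/time -v.
for fs in ext3 jfs xfs; do
    /usr/bin/time -v mkfs.$fs /dev/md0
    mount /dev/md0 /mnt/bench
    /usr/bin/time -v iozone -a -f /mnt/bench/iozone.tmp
    /usr/bin/time -v dd if=/dev/zero of=/mnt/bench/bigfile bs=1M count=5120
    /usr/bin/time -v rm /mnt/bench/bigfile
    umount /mnt/bench
done
```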

Test Results

The script produced some really interesting statistics, which I'll summarize here.

FILESYSTEM CREATION

Measure                EXT3         JFS        XFS
Elapsed Time           16:01.71     0:03.93    0:08.42
Faults needing I/O     7            1          0
# filesystem inputs    944          128        280
# filesystem outputs   92,232,352   781,872    264,760
% CPU use (avg)        11%          29%        1%
# CPU seconds used     109.01       1.10       0.08

I'm not overly concerned with the time it takes to create a filesystem. It's an administrative task that I only perform when setting up a new drive. But I couldn't help noticing the huge difference between EXT3 and the other formats: EXT3 took 16 minutes to create the filesystem while the others took just a few seconds. The number of filesystem outputs was similarly imbalanced. Not a good start for EXT3.

IOZONE EXECUTION

Measure                EXT3         JFS          XFS
Elapsed Time           13:27.74     10:23.15     10:57.16
Faults needing I/O     3            3            0
# filesystem inputs    576          656          992
# filesystem outputs   95,812,576   95,845,872   95,812,568
% CPU use (avg)        29%          27%          29%
# CPU seconds used     230.32       165.07       187.96

IOZONE provides some useful performance statistics for the disks. The above stats were gathered while it ran (the same tests for each filesystem). EXT3 took longer to run the tests (3 minutes and 2.5 minutes longer than JFS and XFS, respectively), and used more CPU time (65 and 42 seconds more, i.e. 39% and 23% extra). JFS has a slight advantage over XFS, but EXT3 is in a distant 3rd place.

5GB FILE CREATION

Measure                EXT3        JFS         XFS
Elapsed Time           1:01.54     1:07.51     00:56.08
Faults needing I/O     0           0           5
# filesystem inputs    312         1200        560
# filesystem outputs   9,765,640   9,785,920   9,765,672
% CPU use (avg)        38%         20%         24%
# CPU seconds used     23.88       13.72       14.00

Creation of multi-gigabyte files will be a routine event on this machine (since it will be recording TV shows daily). Each filesystem took about a minute to create the file. As I expected, EXT3 had significantly higher CPU utilization than JFS and XFS (90% and 58% higher, respectively). The number of CPU seconds used was higher too (74% and 70% higher, respectively). These small numbers don't look significant until you think about how much disk I/O a media server does.

5GB FILE DELETION

Measure                EXT3       JFS        XFS
Elapsed Time           00:00.96   00:00.05   00:00.06
Faults needing I/O     0          0          2
# filesystem inputs    0          0          320
# filesystem outputs   0          0          0
% CPU use (avg)        98%        8%         0%
# CPU seconds used     0.95       0.00       0.00

File deletion is a big deal when running a media server. People want to delete a large file (i.e. a recorded program) and immediately continue using their system. I've experienced long delays with EXT3 before – sometimes 10-15 seconds to delete a file. The statistics here don't reflect delays that long, but they do indicate a problem: the elapsed time is 19x and 16x longer with EXT3 than with JFS and XFS. CPU use and CPU seconds show a similar pattern.

Obviously, EXT3 is out of the running here, so I'll stop talking about it. The real decision is between JFS and XFS. Both have similar statistics, so I decided to search the internet for relevant info. Here are some sources that swayed my opinion:

This article says “Conclusion: For quick operations on large files, choose JFS or XFS. If you need to minimize CPU usage, prefer JFS.”

And the winner is...

The winner is: XFS. I've been using it for several years on my MythTV box with no issues. My recorded programs are stored on an LVM volume formatted with XFS. The volume itself spans 4 drives from different manufacturers, with different capacities and interfaces. My recording and playback performance are great, especially when you consider that my back-end machine serves 4 front-ends (one of which is on the back-end machine). And the filesystem delete performance is perfect: about 1 second to delete a recording (normally a 2-6GB file).

JFS has maturity on its side–it has been used in IBM's AIX for more than 10 years. It offers good performance, has good recovery tools, and has the stamp of approval from MythTV users. But I'm going to run it on a RAID system, and I could find very little written about that combination.

In contrast, XFS has format-time options specifically for RAID setups. There have been reports of 10% CPU savings when you tell XFS about your RAID stripe size at format time. That means more free CPU time for transcoding and other CPU-intensive tasks.
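For reference, a sketch of those options applied to this array (the 128K chunk and three data disks from earlier):

```shell
# su = stripe unit (matches the md chunk size); sw = number of data disks.
mkfs.xfs -d su=128k,sw=3 /dev/md0
```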

RAID Parameter Calculator

Calculating the parameters for a RAID array is a tedious process. Fortunately someone on the MythTV website had already written a shell script to help calculate the proper values for an array. I converted that to JavaScript, and I offer it here for your convenience. If you find any errors or improvements, please let me know.

Note: blocksize refers to the size (in bytes) of a single filesystem block. In Linux, it can't be larger than the size of a memory page (the pagesize). So how do you find your pagesize? In Ubuntu, run getconf PAGESIZE at the command line. In my case, the value is 4096. It might be different on other systems.