Introduction

As a result of my experiences with an ARM-based Thecus NAS and subsequent market research, I had reached the conclusion that commercial home NASes are still relatively expensive (you are paying for their software R&D costs, particularly an idiot-proof web interface, and for ergonomic niceties such as compact size and low noise) and perform poorly when compared with conventional PCs in a dedicated NAS role. So instead of buying a commercial replacement for my Thecus N4100+ (which performs so poorly that refitting it with bigger disks would be a waste of money from a cost-benefit standpoint), I thought I'd have a go at creating my own NAS using DIET-PC and off-the-shelf x86 PC components.

I expected that for a comparable price I would be able to exceed the performance of all but the most expensive commercial home NASes, although the ergonomics of the unit would be somewhat inferior (it would be larger, noisier, and almost certainly lack hot-swap capabilities). So let's see how well I did.

Design Parameters

Capacity for at least four SATA hard disks. We could do RAID-5 with only three, but an unacceptable amount of disk space (33%) would be wasted for parity overhead.

Network throughput to a RAID-5 device at least as good as a Thecus N5200Pro, i.e. in the 30-40 MB/s range. This of course implies gigabit ethernet and a fast (let's face it: x86) CPU.

Cost of not more than AU$700 (approx US$600) for the bare unit (i.e. not including hard disks).

Software RAID. Why? Because true hardware RAID is prohibitively expensive, and because software RAID uses a well-known on-disk format and is therefore the most recoverable option in the event of a catastrophic hardware failure (e.g. you can just transplant the disks into a conventional PC chassis and use any number of mainstream FOSS distros to access the data). FakeRAID is a dumb idea regardless, since the whole idea behind fakeRAID is to relieve the O/S of the burden of disk management (whilst stealing as many CPU cycles as the O/S would have used anyway), and in this instance disk management is the sole purpose of the O/S!

Linux O/S, for no reason other than that I want to use my DIET-PC distro as the basis of the solution. FreeBSD (e.g. FreeNAS), OpenSolaris, etc. would of course be perfectly acceptable otherwise.

x86 CPU, for maximum software availability and cost effectiveness; two or more cores would be beneficial.

The chassis should be as small as possible (not more than twice the size of a commercial 4- or 5-bay NAS).

The device must boot from a solid-state disk such that the O/S is not resident on the conventional hard disks that the NAS will manage. Swapping on hard disks should only occur in extraordinary circumstances (e.g. fscking huge filesystems, which regrettably takes relatively huge amounts of memory).

The NAS should be as quiet and energy efficient as possible, but this consideration is secondary to all of the above.

Hardware

Component Selection

From the outset, I was aiming for a Mini-ITX form-factor x86 mainboard, which is about as small as you can get whilst retaining sufficient power for software RAID. I examined all kinds of Mini-ITX integrated mainboard+CPU options, principally AMD Geode LX800, VIA C7, and Intel Atom solutions. But there were two problems. First, precious few embedded mainboards had more than two on-board SATA ports. Second, all of the low-power embedded CPUs delivered only mediocre software RAID performance, and reduction in power-consumption/heat-output/noise really isn't a major concern for a NAS, given that the hard disks are likely to defeat such objectives in any case.

I was sorely tempted by the Zotac Ion N330, but it had only three internal SATA ports plus one eSATA, which meant that I would need a bizarre loop-eSATA-cable-back-inside-case arrangement in order to support four hard disks. There was also a Via C7 board with eight SATA ports that was specifically designed for use in a NAS, but this was hugely expensive (nearly $600 by itself). Eventually I gave up the idea of using an integrated mainboard+CPU solution and started looking at mainboards that supported conventional LGA775 (Intel) or Socket AM2/AM3 (AMD) CPUs. I was surprised to find that the aggregate cost of separate mainboard and CPU was still comparable to that of the embedded boards, delivering up to 2-3 times the CPU speed at the expense of additional power consumption, heat output and cooler size/noise, which doesn't matter much for a NAS. Eventually I settled on a J&W Tech Minix 780G, which in addition to the required four internal SATA ports had a legacy 40-pin IDE connector that I could use for a boot device. This board also featured fakeRAID capable of mirroring and striping, but I chose not to use this.

For the case, I was simply looking for the smallest and cheapest Mini-ITX or Shuttle case that provided any combination of 5.25" or 3.5" internal or external bays sufficient to house four 3.5" drives. In the end it came down to the classic Morex Venus 669 (no longer made but still widely available - here's a review link) or the new Morex 6600. Although the 6600 was smaller, I was a bit dubious about the desktop stability of its vertical arrangement (it looks like you could accidentally tip it over), and thus opted for the 669. The Morex 669 case turns out to be almost entirely aluminium, so it's very lightweight, but it has a bit of a tinfoil feel about it!

My CPU of choice was the Athlon X2 5050E 2.6 GHz Dual Core; since any conventional desktop CPU was going to be overkill, I thought I should at least go for a low-power (45W) option. RAM was a pair of 1GB 800 MHz SODIMMs; I didn't need that much RAM, but a pair of DIMMs allows dual channel memory access.

For the disks, I just went for the cheapest price-per-gigabyte option available at the time, which was the Samsung EcoGreen HD154UI (1.5 Tbyte 5400 RPM 32 MB cache). I figured that the slow spin rate would be unlikely to pose a problem considering the software RAID and network bottlenecks.

The remaining components were a combination of bits that I already had lying around, such as a Compact Flash to 40-pin IDE adapter, a 1 GB CF card, some 3.5"-to-5.25" HDD adaptor brackets, and SATA data and power cables with right-angle connectors.

Component Specifications

(Apologies to Aus PC-Market for unauthorised use of their images, but I did buy most of my gear from them, and this is free advertising!)

... | smallest/cheapest you can find (cost estimate is for 2 GB, I actually used 1 GB) | 50
Subtotal | 687
4 | HDD | Samsung EcoGreen HD154UI | 1.5 Tbyte 5400 RPM 32 MB cache | 721
Total | 1408

Component Assembly

Assembly of these components was a bit fiddly - lots of cables to cram into very small spaces. There was also the question of how to mount the CF-IDE adapter (since all of the drive bays were already occupied). I ended up fixing it to the left hand side of the drive bay cage (facing the front of the case) using some screws, nuts and stand-offs scavenged from a 2.5"-to-3.5" drive bay adapter that the CF-IDE used to be attached to. With the CF-IDE adapter oriented such that the CF socket was topmost, the two right-side screw holes aligned sufficiently well with the slots in the drive bay cage and the forward hard disk holes that I was able to put long screws all the way through the CF-IDE adapter PCB, through a ~7mm standoff, through the drive bay cage, and into the two hard disks in the two middle bays, thereby serving the dual purpose of holding both the adapter and the disks in place. The two left-side holes didn't align with cage holes, so I used shorter screws and nuts to hold the standoffs in place such that the nuts just press against the drive bay cage. The final result looks a bit drunken, but is very secure, and there's no risk of shorting the CF-IDE adapter PCB.

Drivers

It took a while to puzzle out exactly what chipsets I had and what the corresponding driver was. Here's the result:

Function | Chipset | Kernel driver
Audio | AMD RS780 (Azalia HDA) with Realtek ALC885 Codec | snd-hda-intel and snd-hda-codec-realtek
IDE | ATI SB700 | atiixp (also available as a PATA driver, but I used IDE for more flexible DMA control)
SATA | ATI SB700 | ahci (I configured the BIOS to run SATA in legacy-free AHCI mode)
Network | Marvell Yukon-2 88E8056 | sky2
I2C | ATI SB700 | i2c_piix4
Sensor | Winbond W83627EHG | w83627ehf
USB | ATI SB700 | ehci-hcd and ohci-hcd
Graphics | AMD RS780 (ATI Radeon HD 3200) | radeon (DRM) plus userspace radeonhd Xorg driver

Compact Flash Quirks

I had some difficulty installing a working MBR and boot loader on the Compact Flash. I eventually discovered that I had to override the autodetected settings for the CF in the BIOS and turn 32-bit disk access off. After that both Extlinux and GRUB worked fine. Ultimately I went for GRUB with stage 1.5 installed in the MBR. I set the CF up with a single ext3 partition (conventional MSDOS partition table) that the kernel detects as /dev/hda1. Alarmingly, the BIOS sometimes erroneously lists the CF as an eSATA disk(!) in its summary screen, but the Linux kernel always correctly identifies it as the IDE primary master (hda, not sde).

The second CF-related problem that I encountered was that the CF (or possibly the CF-IDE adapter) did not implement DMA properly, resulting in lengthy timeouts and general misbehavior when accessing /dev/hda. Once I figured out how to set a kernel parameter to force DMA off for that device only ("ide-core.nodma=0.0" - the kernel documentation is out of date!) all problems vanished.

Userspace

Since I had a 64-bit capable CPU, I decided to use a fully 64-bit userspace in order to gain possible performance benefits from 64-bit I/O paths and memory management (I'm not convinced that in practice this will make any appreciable difference, but it gives me an opportunity to put my x86_64 port of DIET-PC 3 to practical use).

Having already tackled a lot of NAS-related issues in my Thecus alternative firmware project, I already had a good template for a DIET-PC 3 NAS appliance build. But dealing with newer hardware and significantly larger capacities highlighted a number of software gaps and shortfalls and more than a few scripting bugs/deficiencies. Specifically:

storage_detect had some bugs and didn't have any ext4 handling.

hotplug wasn't detecting additions/removals of RAID metadevices.

in the mount package, /etc/filesystems didn't include ext4, and a fake mount.cifs helper was needed to support "mount -t cifs ...".

there was no smartctl package for interrogating disk S.M.A.R.T. status.

there was no means of creating or manipulating EFI GUID (GPT) partition tables. I have packaged a crude but small utility "gdisk" to handle this.

the version of e2fsprogs is too old to support some of the newer ext4 features (not fixed yet, I'll have to forward port the e2compr patches).

an irqbalanced package was added to distribute interrupts evenly on SMP systems, for best performance.

my Samba "dfree" helper doesn't work when used with autofs because Samba (both 1.x and 2.x) is stupidly invoking it on the root of the "mount point" (/var/ftp/media) rather than the full path of the directory in question; I'll have to examine the Samba source and see if I can patch my way around this.

the version of the radeonhd driver in xserver-xorg-accel-radeonhd was too old to support Radeon HD 3200.

the mdadm package needed rc scripting to initialise RAID metadevices at boot time, since the kernel RAID autorun feature supports old version 0.9 superblocks only.

the udev package was generating segfaults and had to be downgraded.

the "discover" framework is inadequate for dealing with complex networking arrangements such as bridging and bonding. I am in the process of replacing this with ifupdown, which requires significant alteration of initscript, busybox, hotplug, sysvinit, xdmcp-bind and *-session packages. Bridging is needed for QEMU, and bonding for gigabit link aggregation. More about both of these later.

Fixed a bug in xwchoice.

Fixed a bug in xserver-xorg-vnc.

there were no packages for QEMU components. I have created packages named qemu-img, qemu-tapctl, qemu-roms-x86, qemu-roms-ppc, qemu-roms-sparc and various emulator engine packages (just qemu-i386 to begin with). QEMU also needs libSDL which in turn uses libXrandr, so I also created libsdl12-0 and libxrandr2 packages. See below for more information about my QEMU experiments.

What Can My Firmware do?

CIFS sharing using Samba 2.2.12. An automount hierarchy (/var/ftp/media) including both internal RAID devices and attached removable media devices is shared using "share" level security. The caller has nobody/nogroup privileges if they use anything other than root's correct credentials.

NFS sharing. Much the same thing as above, but using NFS v3.

Anonymous FTP sharing using a chroot jail (/var/ftp, one level up). The FTP daemon is a patched version of wu-ftpd, which I realise is defunct and has a reputation for security vulnerabilities, but this anonymous-only build has most of the complex features disabled at compile time. A very useful feature specific to wu-ftpd is the ability to archive/compress on the fly, e.g. you can do "get directory.zip" (where directory is a directory and directory.zip doesn't exist) to recursively grab an entire directory hierarchy in one hit.

Rsync sharing (unencrypted, using rsyncd), again using the /var/ftp chroot jail and granting only nobody/nogroup privileges. This is great for differential backups (e.g. using DeltaCopy on your Windows boxen to synchronise with NAS target addresses).

P2P sharing using Direct Connect (non-hashing). The NAS sets up its own DC Hub, connects to it and shares the contents of /var/ftp/media.

iTunes sharing using mt-daapd. Needs to be pointed at the top of an MP3 hierarchy.

Provide an X11 GUI (QVWM Window Manager) accessible via both the VGA/DVI console and the VNC protocol. Right now the GUI provides nothing useful other than XTerm and access to QEMU VM consoles, though!

Provide SSH (including SCP but not SFTP) access using Dropbear.

Recover gracefully from unexpected power outages (that is, the firmware itself can - the RAID metadevices and any QEMU VMs will have to fend for themselves). The firmware is a tmpfs UnionFS overlay on top of read-only ext3, which is pretty indestructible.

Download and install component upgrades on the fly using iPKG (if you know what you're doing) in a vaguely Debian-like way. That is, you don't have to rewrite the entire firmware, you just temporarily remount the ext3 read-write, update parts of it, refresh the UnionFS and remount the ext3 read-only again.

Run Windows (or any other QEMU x86 virtual machine)!

What Can't My Firmware Do?

Be configured via a web interface. I haven't written one yet. It would hypothetically be based around a combination of Hiawatha web server and Fast-CGI PHP5.

Provide any assistance whatsoever for initial RAID metadevice setup. Learn to use mdadm, you slackers!

Join a Windows domain (at least, Samba running under the Linux host O/S can't - it's too old and no Winbind bits are provided). IMHO, if it's a domain member then it isn't an appliance anymore!

Utilising Excess CPU Power

Since my NAS is ridiculously over-powered, having basically the same characteristics as my main PC (which is an Intel Core 2 Duo 2.66 with 2 GB of RAM), I started thinking about ways to put excess CPU cycles to use (I use CPU frequency scaling, BTW, so both cores spend most of their time at the minimum 1 GHz setting).

It seemed to me that I could make my NAS double as a second general-purpose computer, without in any way compromising its NAS role, by configuring it to run a Windows virtual machine. A Windows VM is of course not going to have 3D graphics performance sufficiently close to native to play games on, but it would serve quite nicely as a dedicated game server ("dedicated server" builds are available for many popular multiplayer games, which can make for smoother gameplay since the game server no longer has to service human interface I/O events (screen, keyboard, mouse, audio) in addition to network I/O events). A Windows VM could also be useful for running a BitTorrent application, since (in my experience to date) nothing in the Linux universe seems to come close to the performance of optimised Win32 apps (uTorrent and the like).

To achieve this, I could have used VMware Server or Workstation for x86_64, but that would have been bloaty and non-free, and I'm all about the super-lightweight and open source! Since I'm already familiar with QEMU (QEMU VMs are used as dedicated development environments for DIET-PC 3), I thought I'd see how well that performed and whether a QEMU emulation environment could be made to run - and fit - in an embedded context.

Turns out that QEMU embeds very nicely, thank you! After some custom compilation of QEMU 0.10.6 and SDL 1.2.13 to remove unnecessary features, the absolutely essential components of a 32-bit x86 emulator and all dependencies that my DIET-PC NAS build didn't already provide (which turned out to be just libSDL-1.2.so.0, libXrandr.so.2 and libXrender.so.1) comes to about 2.5 MB. Adding extra emulation engines costs about 2 MB each.

Of course, true emulation is unacceptably slow, so I needed to use an x86 acceleration feature to achieve near-native performance. The older scheme is KQEMU, which according to a somewhat dated review still yields the best performance, but since the NAS' AMD CPU has the requisite virtualisation extension, and since this is being loudly trumpeted as the way of the future, I wanted to give KVM (Kernel Virtual Machine) a go instead. I was also keen to have a look at paravirtualisation (Virtio), which is still experimental but is supported (as far as it currently goes) by QEMU 0.10.x. Virtio paravirtualisation is poorly documented, but I eventually figured it out. It's supposed to allow significantly faster access to host resources than emulated drivers that mimic specific chipsets (i.e. traditional "full virtualisation").

As an initial test, I was able to run a copy of my dietpc3-dev-x86 (Debian Lenny) virtual machine with no trouble at all; it fully supported virtio_net (ethernet driver) and virtio_blk (disk driver) without any modification. I then installed a Windows XP SP3 QEMU guest from scratch, and was eventually able to find and install a working virtio_net driver (turns out the latest version can only be found in the driver ISO; kvm-guest-drivers-windows-2.zip is not the latest version!). Unfortunately there is no virtio block driver for Windows yet (as at mid Sep 09), and even when there is it seems unlikely that bootable Windows virtio disks will be supported, at least at first. It appears that virtio disk booting will work for Linux, but only by means of loading a ROM add-on (extlinux.bin) outside of the O/S, which is very clunky.

The end result? My Windows XP VM guest works fine, although its performance is not quite what I hoped. I haven't yet written a DIET-PC script framework to configure VM guest networking using TAP and bridging at boot-time - this is difficult, because many DIET-PC scripts have the hard-coded assumption that there is only a single network interface named eth0. I will have to finish planned work to replace discover with ifup/ifdown first.

Disk Layout

RAID Mode

I primarily intend to use my NAS for archival storage of multimedia content (video and audio), with at least a basic level of storage redundancy and minimum waste of usable disk space. Hence the majority of my four 1.5 TB disks would be set aside for a RAID 5 device geared to very large sequential writes. However, a relatively slow RAID 5 is not the best choice for hosting either a swap file (should the host need to swap) or QEMU VM disk images, which in both cases involve small (~4k) random reads and writes. I want that storage to be redundant too, so an additional small RAID 10 (a.k.a. RAID 1+0) device, for miscellaneous local use by the host O/S, seemed like a good idea.

To add to the confusion there are also three different layouts that you can use for RAID 10 - near, offset and far - and there's likewise little guidance on why and when you'd want to use each of these. I used the "far" layout for my RAID 10, with the default two copies of the data (far2).

Out of general annoyance with denary rounding practices, I made my RAID 5 as close to four true tebibytes (TiB, i.e. 4 * 1024^4 bytes) of usable space (i.e. after the 25% parity overhead) as cylinder boundaries would allow, and used the remaining disk space, approximately 63 GiB after the 50% mirroring overhead, for the RAID 10.
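As a rough check of these figures, here is a minimal sketch of the capacity arithmetic. It assumes the nominal decimal capacity of exactly 1.5 * 10^12 bytes per disk (real disks are slightly larger, and cylinder alignment costs a little), so the results are approximate.

```python
# Capacity arithmetic for four 1.5 TB disks split into RAID 5 + RAID 10.
# DISK_BYTES is the nominal decimal capacity (an assumption).

TIB = 1024 ** 4
GIB = 1024 ** 3

DISKS = 4
DISK_BYTES = 1_500_000_000_000  # nominal 1.5 TB


def raid5_parity_fraction(n_disks):
    """Fraction of raw space consumed by parity in an n-disk RAID 5."""
    return 1 / n_disks


raid5_usable = 4 * TIB                         # target: four true tebibytes
raid5_per_disk = raid5_usable / (DISKS - 1)    # one disk's worth goes to parity
leftover_per_disk = DISK_BYTES - raid5_per_disk
raid10_usable = leftover_per_disk * DISKS / 2  # two copies of everything

print(f"RAID 5 parity overhead, 3 disks: {raid5_parity_fraction(3):.0%}")
print(f"RAID 5 parity overhead, 4 disks: {raid5_parity_fraction(4):.0%}")
print(f"RAID 5 component per disk: {raid5_per_disk / GIB:.1f} GiB")
print(f"RAID 10 usable space: {raid10_usable / GIB:.1f} GiB")
```

With the nominal capacity the RAID 10 comes out at a little over 63 GiB, which agrees with the figure quoted above.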

Superblock Format

There is an old-style (0.90) and a new-style (1.x) RAID superblock format. The most important differences are that kernel RAID autorun only works with 0.90 superblocks, such that RAIDs with 1.x superblocks have to be assembled and started from userspace instead, and that the 0.90 format limits the size of a single component to 2 TiB (similar to the MSDOS partition table limit (see below)). See this page for the full story on RAID superblock formats, and this page for a breakdown of the 1.x layouts with advice on when to use them.

Since my RAID 10 is - and will continue to be - small, and since it is convenient to have it available as early as possible (e.g. for activating a swap file that resides on it), I used a 0.90 superblock for it. The RAID 5 as it currently stands could also have used 0.90, but I used 1.1 instead in anticipation of possible future resizing that may result in components larger than 2 TiB. I used 1.1 rather than 1.0 or 1.2 so that components will never be misidentified as raw filesystems by tools that identify content from magic numbers at the start of a partition (e.g. blkid, vol_id, file) and to make the process of enlarging the RAID a little easier (by not having to relocate the RAID superblock).
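To make the choices concrete, here is a sketch of how the level, layout, superblock and chunk settings described in this article would map onto mdadm options. The component partition names (/dev/sd[a-d]1 and /dev/sd[a-d]2) are hypothetical, and the helper mdadm_create is purely illustrative; it only builds and prints command lines and never runs them. These are not necessarily the exact commands I used.

```python
# Sketch only: builds mdadm command lines, does not execute anything.
# Partition names are hypothetical; adjust to the real disk layout.


def mdadm_create(md_dev, level, metadata, chunk_kib, components, layout=None):
    """Return an argv list for 'mdadm --create' with the given settings."""
    argv = [
        "mdadm", "--create", md_dev,
        f"--level={level}",
        f"--metadata={metadata}",        # superblock format
        f"--chunk={chunk_kib}",          # chunk size in KiB
        f"--raid-devices={len(components)}",
    ]
    if layout is not None:
        argv.append(f"--layout={layout}")  # e.g. f2 = "far", two copies
    return argv + list(components)


# Small RAID 10 for the host's own use: 0.90 superblock, far2 layout.
raid10 = mdadm_create("/dev/md0", 10, "0.90", 256,
                      [f"/dev/sd{d}1" for d in "abcd"], layout="f2")

# Large RAID 5 for shared data: 1.1 superblock.
raid5 = mdadm_create("/dev/md1", 5, "1.1", 4096,
                     [f"/dev/sd{d}2" for d in "abcd"])

for argv in (raid10, raid5):
    print(" ".join(argv))
```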

Chunk Size

Metadevice Partitioning

I also wanted to make allowances for iSCSI. When exporting a "disk" via iSCSI, it is more efficient to export a block device than a file (though both are possible) so that the client can use low-level block access. Moreover, if your iSCSI client is Windows, it will insist on creating a partition table on the iSCSI "disk". Whichever way you do it, you also have to ensure (unless you export the device read-only) that the iSCSI client will have exclusive write access to the "disk".

This is equally true of virtual disks for QEMU VM guests. It's more efficient to give a VM low-level access to a block device than high-level access to a file on a filesystem, and the VM will likewise want to put a partition table on that device, and require exclusive write access to it. The first and third statements are also true for host swap space.

I was therefore tempted to put partition tables on my RAID metadevices, in particular my RAID 10, so that I could segregate space reserved for iSCSI, QEMU virtual disks, and swap, and allow block-mode access to them. However, I eventually decided against this, on the grounds that (a) it would be difficult to reclaim disk space from over-sized or unwanted partitions, and (b) second-order partitions (a partitioned device that is itself a partition) are difficult to deal with using standard tools such as fdisk. I chose to leave both of my RAID metadevices unpartitioned, create ext4 filesystems on them, create virtual disks as files on the filesystems, and tolerate the slight performance degradation of accessing them in this manner.

Component Disk Partitioning

Another consideration was the format to use for the partition tables of the underlying physical disks from which my RAID devices were constructed. It is not possible to make a partition bigger than 2 TiB using a traditional MSDOS partition table. I could have used MSDOS partition tables, because no single partition can exceed ~1.4 TiB using 1.5 TB disks, but this would not have been very futureproof - what would happen if in future I wanted to replace my 1.5 TB disks with (say) 3.0 TB disks? I would then have to not only resize my RAID 5, but reshape it as well (since I'd have to split the 3 TB into 2 TiB + remainder). I therefore decided to use EFI GUID (GPT) partition tables instead.
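The 2 TiB ceiling follows from the MSDOS partition table storing sector counts as 32-bit values. A minimal sketch of the arithmetic, assuming the traditional 512-byte logical sector size and nominal decimal disk capacities:

```python
# Why an MSDOS partition table tops out at about 2 TiB per partition.
# Assumes 512-byte logical sectors and 32-bit sector counts.

SECTOR_BYTES = 512
MAX_SECTORS = 2 ** 32
TIB = 1024 ** 4

mbr_limit = SECTOR_BYTES * MAX_SECTORS  # upper bound in bytes


def fits_in_mbr_partition(size_bytes):
    """True if a single partition of this size stays within the limit."""
    return size_bytes <= mbr_limit


for label, size in (("1.5 TB", 1.5e12), ("3.0 TB", 3.0e12)):
    print(f"{label} = {size / TIB:.2f} TiB, fits: {fits_in_mbr_partition(size)}")
```

A whole 1.5 TB disk fits comfortably, whereas a whole 3.0 TB disk does not, which is the futureproofing concern described above.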

Performance Tuning

This turns out to be a bit of a black art. Parameters affecting performance need to be adjusted at several different levels and also need to align vertically. Resolving issues at one level just exposes a new performance bottleneck at another.

Here are some things that I looked at:

RAID Chunk Size

The most frequently discussed performance parameter is what RAID chunk size to use ("chunk size" is more commonly referred to as "stripe size", but Linux RAID doesn't use this terminology, and "stripe" means something different when referring to ext2/3/4 filesystems, so I'll keep using the word "chunk"). Unfortunately there is no clear-cut public wisdom on this. The answer you'll usually see is something like "it depends on the nature and use of the data, bigger is often better but you'll have to do your own tests to find the best size".

After performing many tests I eventually concluded that performance differences attributable to chunk size were pretty minor compared with other factors such as filesystem and TCP/IP optimisation, and that it wasn't worth spending any more time on. I settled on 4096k chunks for the RAID 5, and 256k chunks for the RAID 10. In hindsight perhaps the RAID 10 chunk size was a bit too big, but I'm not sure if I can be bothered destroying and recreating it.

Filesystem Parameters

I'm not even going to get into the whole "which filesystem format is best for terabyte-sized devices?" argument. Suffice it to say I chose ext4, because it was convenient (it's standard in recent kernel.org kernels), and it was demonstrably better than ext3.

"-i N" sets the bytes-per-inode ratio, which will determine the filesystem's inode density. If you set this too small, then you are wasting significant space by reserving it for inodes that will never be used (because you will run out of data blocks long before you run out of inodes). My rule of thumb is that if you have 40-50% inode utilisation at the point that the filesystem is 100% full, then you've judged it about right. The ratios I used were one inode per 16 MiB on the RAID 10 (because it will be used for only a very small number of very large files (virtual disks and swap files, upwards of 4 GiB each)), and one inode per 1 MiB on the RAID 5 (because I expect that it will mostly contain audio and video data (typically 5-10 MiB per audio file and 350-700 MiB per video file)). Using these values I'll no doubt fall well short of my 40-50%-inode-utilisation-when-full target, but it will nonetheless result in a couple of orders of magnitude less wastage than the default of one inode per 8 KiB (with which you'd probably still be seeing 0% inode utilisation when the filesystem is full!).

"-m 0" sets the space reserved for the superuser to 0% (the default for a very large filesystem is 1%). The idea of "emergency breathing room" is really not useful unless the filesystem in question contains an operating system, and 1% of something huge is still something huge!

"-j" adds a journal, which is the default for ext4 and thus not strictly necessary. I tossed around the idea of using an external journal, but decided against it (see below regarding the uselessness of journaling).

"-J size=64" sets the size (in MiB, minimum 1024 blocks, maximum 102400 blocks) of the journal smaller than the default (which for large filesystems is 128 MiB). Why? To reduce space wastage, because journaling is actually pretty useless considering the kind of data that I'm writing. Why? Because, by default, only file metadata (data about files, not file contents) is journaled, and I'll be writing very large files, in all likelihood one at a time, such that the rate of inode change is trivial. You can, by means of a "data=journal" mount option, use the journal for file data as well as file metadata, but that would just create a severe bottleneck and cause hard disk head thrash. I carefully selected the 64 MiB value such that it would still meet the minimum journal size requirement in the event that the (RAID 5) filesystem were enlarged to maximum value that the "-E resize=" option will allow (see below).

The "-E stride=N,stripe-width=N" arguments optimise for best performance against a specific RAID layout. "Stride" here means the same thing as chunk size (but in 4096 byte blocks, hence one quarter the value), and "stripe-width" is stride size times the number of data-bearing components in the device: 3 for my RAID 5 (4 - 1 parity), and either 4 or 2 for my RAID 10 depending on whether you treat it as RAID 0 or RAID 1 (I wasn't sure, and used 2; maybe I should have used 4?).

"-E resize=N" sets the maximum number of blocks that the filesystem can be resized to. The default allows resizing up to 1024 times the current size, which is just dumb and is never going to happen, and thereby wastes space on block group descriptor table reservations. My figures allow for a factor of 16 enlargement, which I think is more realistic considering technological obsolescence.

"-G N" sets the "clumpiness" of block group metadata. Apparently when using the "flex_bg" option (on by default for ext4) you can improve performance on large filesystems with high inode transaction rates by creating larger clumps of such data, which you should trade off against the increased risk of filesystem corruption in the (unlikely) event of damage to that small area. IIRC the default is 16, so I've made the RAID 10 less clumpy and the RAID 5 more clumpy.

"-L label" is just a label. "Mine" and "yours" indicate that md0 is for the NAS' own private use, whereas md1 is for shared data.

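The inode-density trade-off described for "-i N" can be put into numbers with a short sketch. It assumes a 256-byte on-disk inode (my assumption; the real size depends on how the filesystem was created) and a filesystem of exactly 4 TiB.

```python
# Space reserved for inode tables at different bytes-per-inode ratios.
# INODE_BYTES is an assumed on-disk inode size.

TIB = 1024 ** 4
GIB = 1024 ** 3
INODE_BYTES = 256


def inode_table_cost(fs_bytes, bytes_per_inode, inode_bytes=INODE_BYTES):
    """Return (inode count, bytes reserved for inode tables)."""
    inodes = fs_bytes // bytes_per_inode
    return inodes, inodes * inode_bytes


fs_bytes = 4 * TIB
for ratio in (8 * 1024, 1024 ** 2):  # 8 KiB default vs. 1 MiB per inode
    inodes, table_bytes = inode_table_cost(fs_bytes, ratio)
    print(f"-i {ratio}: {inodes} inodes, "
          f"{table_bytes / GIB:.1f} GiB of inode tables")
```

Under these assumptions the 8 KiB ratio reserves 128 GiB for inodes on a 4 TiB filesystem, against 1 GiB for the 1 MiB ratio.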
Another option that I could have been explicit about is "-b N", which sets the filesystem block size. However, at present it cannot be increased beyond the default (for any non-tiny filesystem) of 4096.
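For reference, here is a small sketch of the stride and stripe-width arithmetic, using the 4096-byte block size and the chunk sizes given in this article. The function names are illustrative only. For the RAID 10 it prints both candidate stripe widths, since I'm not sure which is right.

```python
# Derive mke2fs "-E stride=,stripe-width=" values from the RAID chunk size.

BLOCK_BYTES = 4096  # ext4 block size


def stride_blocks(chunk_kib):
    """RAID chunk size expressed in filesystem blocks."""
    return chunk_kib * 1024 // BLOCK_BYTES


def stripe_width_blocks(chunk_kib, data_disks):
    """Stride multiplied by the number of data-bearing components."""
    return stride_blocks(chunk_kib) * data_disks


def extended_options(chunk_kib, data_disks):
    """Format the values as an argument for mke2fs -E."""
    return (f"stride={stride_blocks(chunk_kib)},"
            f"stripe-width={stripe_width_blocks(chunk_kib, data_disks)}")


print("RAID 5  (4096k chunks, 3 data disks):", extended_options(4096, 3))
print("RAID 10 (256k chunks, treated as 2): ", extended_options(256, 2))
print("RAID 10 (256k chunks, treated as 4): ", extended_options(256, 4))
```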

Post-Creation Parameters

These options are mostly redundant or not meaningful in the DIET-PC environment, but it doesn't hurt to be explicit. "-c -1" and "-i 0" disable scheduled fsck checks, which are not a good idea considering how long - and how much memory - it takes to fsck terabyte-sized filesystems. "-e remount-ro" tells the kernel to remount the filesystem read-only if an inconsistency is detected, rather than the default which is to (try to) continue as if nothing happened; this seemed a sensible precaution since we never routinely fsck the filesystem. "-o journal_data_writeback" sets a default mount option that changes the default journal behaviour for better performance at the expense of safety; this isn't strictly necessary because DIET-PC uses this mount option explicitly (see below).
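For illustration, the tune2fs options described here could be assembled as follows. The device name /dev/md1 is an assumption and the helper tune2fs_argv is hypothetical; the sketch only prints the command line and does not run it.

```python
# Sketch only: builds a tune2fs command line, does not execute it.


def tune2fs_argv(device):
    """Return an argv list applying the post-creation settings."""
    return [
        "tune2fs",
        "-c", "-1",                      # no fsck triggered by mount count
        "-i", "0",                       # no fsck triggered by elapsed time
        "-e", "remount-ro",              # go read-only on detected errors
        "-o", "journal_data_writeback",  # default mount option
        device,
    ]


print(" ".join(tune2fs_argv("/dev/md1")))  # device name is an assumption
```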

Mount Parameters

I tweaked the DIET-PC storage_detect script such that the options that automount will use with ext4 devices are:

barrier=0,data=writeback,noexec,nosuid,nodev,noatime

"barrier=0" is ext4-specific and seems to improve write performance a little. I don't understand the (seemingly undocumented) barrier concept, but whatever they are, under heavy write load my kernel was generating messages saying that they were being disabled in any case.

"data=writeback" is a well known tweak for ext3/ext4 to change journal behaviour such that data writes may occur out of synch with metadata commits, which improves performance noticeably over the default of "data=ordered", at the expense of crash recovery safety.

"noexec,nosuid,nodev" are routine security paranoia. DIET-PC is careful about dropping privileges, wherever possible, to prevent an unprivileged user who might have gained shell access from executing their own code in order to launch further attacks against the O/S to attempt to gain superuser access. This doesn't stop the user from copying data to /tmp and running it, but considering the resource constraints of an embedded system and the likely size and shared-library-dependency complexity of the code they're trying to use, it will certainly slow them down.

"noatime" turns off completely pointless metadata writes that occur when reading files on a filesystem. Does anyone ever use the "access time" attribute these days? I think not.

TCP/IP Stack Tweaks

Sysctl

For the most part, kernel defaults are pretty good, but they're nonetheless geared mostly toward small- and average-sized packets. The consensus seems to be that for better throughput of bulk data transfers, you need to up the rmem/wmem buffer sizes. Here are the tweaks I used (based mostly on advice from this site). IP forwarding is also required for non-performance-related reasons (to pass packets to/from outside hosts to the QEMU VM).

DIET-PC lacked any facility to perform sysctl tweaks, so I added a sysctl applet to the Busybox build and reconfigured the init script (/etc/rc) to apply settings from /etc/sysctl.conf, if present.
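The "apply if present" logic can be sketched roughly as follows. This is not the actual DIET-PC /etc/rc code; the function name and the DRY_RUN switch are invented for illustration. It relies only on the "-p FILE" option, which both the Busybox and procps versions of sysctl accept.

```shell
#!/bin/sh
# apply_sysctl_conf [FILE]: load kernel settings from FILE if it exists.
# A missing file is not treated as an error. With DRY_RUN=1 the command
# is printed instead of executed (illustrative switch, not a sysctl feature).
apply_sysctl_conf() {
    conf=${1:-/etc/sysctl.conf}
    [ -f "$conf" ] || return 0
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "sysctl -p $conf"
    else
        sysctl -p "$conf"
    fi
}

# Example /etc/sysctl.conf line for the IP forwarding mentioned above:
#   net.ipv4.ip_forward = 1
# Buffer-size keys such as net.core.rmem_max and net.core.wmem_max go in
# the same file; the values used for this NAS are not reproduced here.
```

Calling apply_sysctl_conf with no argument from the init script is enough; systems without an /etc/sysctl.conf simply skip the step.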

Ifconfig

Increasing the transmit queue length is another tweak that is often recommended for fast, reliable network links. These commands were added to "pre-up" directives in the relevant /etc/network/interfaces stanzas.

ifconfig eth0 txqueuelen 1000
ifconfig eth1 txqueuelen 1000

Jumbo Frames

There's also the contentious issue of gigabit jumbo frames (i.e. a Maximum Transmission Unit (MTU) of more than the traditional 1500). Do they help, or not? I'm not sure. They seem to worsen read speeds, and I'm not convinced that there is a real-world improvement in the write throughput of very large files. Even if there is, jumbo frames seem (for writes) to be more important for the client than they are for the NAS. I got better results using a mid-size frame (MTU 4500) rather than the maximum of 9000.
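Before benchmarking with a larger MTU, it is worth checking that frames of that size actually survive the whole path (NIC, switch, NIC). One common check is a ping with fragmentation prohibited and a payload sized to fill exactly one frame. The sketch below assumes a Linux client with iputils ping (Busybox ping lacks "-M"); "nas" is a placeholder hostname and the function name is made up.

```shell
#!/bin/sh
# icmp_payload MTU: largest ICMP echo payload that fits in a single IPv4
# packet of the given MTU, i.e. MTU minus the 20-byte IPv4 header
# (no options) and the 8-byte ICMP header.
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Usage on the client, after setting the same MTU on both ends:
#   ifconfig eth0 mtu 4500
#   ping -M do -c 3 -s "$(icmp_payload 4500)" nas
```

If the ping fails at this size but succeeds at 1472 bytes (the MTU 1500 equivalent), something in the path is not passing jumbo frames.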

Module Parameters

The Intel Pro/1000CT seems to work more efficiently if it is told to use MSI rather than MSI-X (the e1000e driver's default), and to use as much CPU as it likes rather than throttle interrupts. So I set these options via /etc/modprobe.d/e1000e.conf:

options e1000e IntMode=1 InterruptThrottleRate=0

Ethtool

Here are some more NIC-driver-specific tweaks. The first sets magic packet wake-up capability on the Marvell Yukon-2 88E8056, the second maximises the ring buffer sizes on the Intel Pro/1000CT. There are lots of other parameters that can be modified using ethtool, notably including such things as TCP/UDP offload capabilities, but in my case the default values seemed fairly optimal already. These commands were added to "pre-up" directives in the relevant /etc/network/interfaces stanzas.

ethtool -s eth0 wol g
ethtool -G eth1 rx 4096 tx 4096

Tuning the Client

The NAS's TCP/IP stack is only half the problem. There's also the question of the TCP/IP stack at the other end!

It's pretty likely that your client is a Windows O/S of some sort, in which case you should consider using TCPOptimizer to adjust your TCP/IP settings. TCPOptimizer is intended for use with ADSL Internet links rather than high-speed LANs, and the slider in the GUI only goes up to 20 Mbit/s, but using this setting will nonetheless yield considerably better throughput than Windows defaults for LAN file transfers. TCPOptimizer also assumes an MTU of 1500, and will rather unhelpfully insert MTU keys with value 1500 throughout the HKLM\SYSTEM\CurrentControlSet\Services\TcpIp\Parameters area of your registry, thereby defeating your attempts to use jumbo frames. To work around this, after applying the "Optimal settings" for 20 Mbit/s, you should select the "Custom settings" radio button and then assign the correct MTU to the relevant interfaces using the MTU text box on the middle right, and then apply those changes.

To enable jumbo frames in Windows you also have to configure the relevant NIC driver to use a larger than normal MTU, which involves context clicking on the relevant Network Connections folder entry, selecting Properties, then Configure, then the Advanced tab, and then changing the value of some vendor-and-product specific variable (it was labelled "Maximum Frame Size" in my case). There are typically lots of other low-level tuneables here also, including receive and transmit buffer sizes, interrupt throttling settings, and TCP/UDP offload settings. In my case I endeavoured to match what I did on the NAS server side, which was increase buffer sizes, turn off interrupt throttling, and ensure all offload capabilities are turned on.

Unfortunately using ethernet bridging can also defeat attempts to use jumbo frames. In my case I usually have a bridge that includes my physical gigabit adaptor and a virtual TAP adaptor used by my QEMU virtual machines. The TAP adaptor driver doesn't accept MTU values greater than 1500, and the bridge dumbs down to the level of its least capable member, such that in practice it never sends jumbo frames unless I remove the TAP adaptor from the bridge. Sigh.

Link Aggregation

Somewhat disappointed by the results of my initial throughput tests, and thinking that it would be a good idea for dealing with heavy load in any case, I decided to add a second gigabit NIC to my NAS and use the kernel "bonding" driver to aggregate the two network links. This will double your network throughput, right? Well, yes, but not in the way that you probably think it will. But I'm getting ahead of myself.

Since the mainboard's sole expansion slot is a single PCIe 4x, my only option was a PCIe 1x NIC. I bought an Intel Pro/1000CT for AU$65 (including shipping, as at Aug 2009) and threw that in. Works fine using the e1000e driver. If I were running Windows (<= XP/2003), I wouldn't be able to do link aggregation with this NIC, because it doesn't match the Marvell chipset I have on the mainboard, but unlike Windows, Linux does not depend on third-party OEM drivers to do this, and its generic "bonding" feature doesn't care what the underlying hardware is.

Bonding didn't work properly until I set module delay parameters to allow time for the NICs to come up. I did this using the following /etc/modprobe.d/bonding.conf file:

options bonding mode=2 miimon=100 updelay=200 downdelay=200

The bonding driver has a lot of different operational modes, and it takes a few re-readings of Documentation/networking/bonding.txt (in any recent 2.6 kernel tree) to fully understand the differences. The mode that seems most useful for my purposes is balance-xor (mode 2). This mode ensures that each client remains associated with a particular NIC (by default the choice is a hash of MAC addresses), while different clients are distributed across the two NICs. I also tried balance-rr (mode 0), 802.3ad (mode 4) and balance-alb (mode 6), but balance-rr and balance-alb didn't perform well in the single-client scenario, and unfortunately my cheap 8-port Repotec switch doesn't support 802.3ad.

The bottom line is that, while link aggregation increases the total bandwidth available to the NAS, it typically isn't going to improve throughput for a single client much (unless that client is also using link aggregation). In fact due to processing overhead it may even slow things down significantly. Where it will help is when the NAS is simultaneously servicing multiple clients. In my opinion the benefits outweigh the cost, so I decided to keep using it.
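To see why a single client gains nothing from balance-xor, it helps to model how the outgoing NIC is chosen. The sketch below is a simplified version of the default "layer2" transmit hash described in bonding.txt: source MAC XOR destination MAC, modulo the number of slaves, here reduced to the last octet of each address. The exact kernel formula has varied between versions, and the MAC addresses in the usage example are invented.

```shell
#!/bin/sh
# l2_slave SRC_MAC DST_MAC NSLAVES: index (from zero) of the slave NIC a
# simplified layer2 hash would pick: last octet of the source MAC XOR
# last octet of the destination MAC, modulo the slave count.
l2_slave() {
    src=${1##*:}    # keep only the text after the final colon
    dst=${2##*:}
    echo $(( (0x$src ^ 0x$dst) % $3 ))
}

# Usage with made-up addresses (bond MAC first, then the client's MAC):
#   l2_slave 00:11:22:33:44:10 00:aa:bb:cc:dd:21 2
#   l2_slave 00:11:22:33:44:10 00:aa:bb:cc:dd:22 2
```

The result depends only on the two addresses, so every packet sent to a given client leaves through the same NIC. Two clients may or may not land on different NICs, depending on their addresses.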

Implementation Difficulties

Traditional DIET-PC only supports a single ethernet interface named "eth0". I've been unhappy about this limitation for some time. The need for bonding (and bridging, for QEMU), combined with the recent release of Busybox's "ifupdown" applet, was the catalyst I needed to complete work I had already planned: replacing the over-complicated "discover" mechanism that DIET-PC used with the better-known (and simpler) ifup/ifdown mechanism used by Debian and other distros.

This is pretty crude, but it gets the job done. No /etc/network/if[{pre,post}-]{up,down}.d scripts are needed. Some "bond0: received packet with own address as source address" kernel messages appear in the logs, but these seem benign, and from what I've read it's not clear whether they are the product of a genuine misconfiguration or of over-zealous kernel coders.

IRQ Balancing

When using multiple gigabit NICs concurrently, your system will generate a lot of interrupts. By default, on a multiple-core system, the interrupt handling work will fall to your first core (CPU 0), and leave all other cores idle. Obviously, you can get better performance by distributing interrupts more evenly.

You can do this manually by adjusting the values in /proc/irq/number/smp_affinity; the values are CPU masks, with each bit representing a CPU, such that setting the mask to 2^N (two to the Nth power) will bind that interrupt to CPU N (numbering from zero). Each NIC typically has one or two associated interrupts, so by statically assigning them to different cores you will achieve crude load balancing.
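The mask arithmetic is easy to get wrong by hand, because the kernel expects the value in hexadecimal. A minimal sketch of the manual approach follows; the function names are made up, and the IRQ number in the usage example is hypothetical (look up the real ones in /proc/interrupts).

```shell
#!/bin/sh
# cpu_mask N: hexadecimal smp_affinity mask that selects only CPU N
# (numbering from zero), i.e. 2^N written in hex.
cpu_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# pin_irq IRQ N: bind interrupt number IRQ to CPU N. Needs root.
pin_irq() {
    cpu_mask "$2" > "/proc/irq/$1/smp_affinity"
}

# Usage: find the NIC interrupts, then spread them over two cores.
#   grep eth /proc/interrupts
#   pin_irq 27 0      # hypothetical IRQ 27 (eth0) stays on CPU 0
#   pin_irq 28 1      # hypothetical IRQ 28 (eth1) moves to CPU 1
```

Static assignments like these do not survive a reboot, and a running IRQ balancing daemon may overwrite them.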

Alternatively, you can just install irqbalance software to take care of it all for you automatically. This sounded like a much better idea to me, so I created an irqbalanced DIET-PC package that starts the IRQ balance daemon, and am now using this.

Samba Tweaks

Although it's not an efficient protocol for bulk data transfers on high speed networks (FTP is significantly faster!), I expect that most of the time you'll probably use CIFS to access your NAS (by means of a Windows drive mapping). If this is the case, then it helps to include the following lines in /etc/samba/smb.conf (DIET-PC's cifsd package includes these by default):

socket options = IPTOS_LOWDELAY
use sendfile = yes
block size = 4096

Outcomes

Here are my best ATTO results for a CIFS drive mapping from the NAS to a (TCP/IP stack optimised) Windows XP client.

[Table of ATTO benchmark screenshots, one for RAID 10 and one for RAID 5 in each configuration; images not reproduced]

Configurations tested:

Single NIC, MTU 1500, untuned, Overlapped I/O
Single NIC, MTU 1500, untuned
Single NIC, MTU 4500, untuned
Dual NIC, MTU 4500, tuned

Yes, I realise that these results look a bit insane. Obviously we are seeing a lot of write-behind caching effects - writes are being reported as complete to the client when they are not, resulting in writes that are apparently faster than reads, which never happens in reality. Unfortunately there's not much that can be done about this - the "Direct I/O" check box only inhibits client-side buffering, not server-side buffering, or the disks' own 32 MiB cache. Best to ignore the red bars and concentrate on the green.

Real-World Tests

Repeatedly copying three seasons of a TV show, totalling approximately 18 GB (17 GiB) in 51 files (typically 350 MB each) to a NAS CIFS share has never yielded throughput better than 43.7 MB/sec, with typical results closer to 30 MB/sec. This belies the ATTO statistics which claim throughput up to approximately 75 MB/sec.
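To put those rates in perspective, the copy time for the 18 GB test set can be worked out at each throughput figure. The sketch below is plain arithmetic using decimal units (18 GB taken as 18000 MB); the function name is made up.

```shell
#!/bin/sh
# copy_seconds SIZE_MB RATE_MBPS: seconds (rounded) needed to move SIZE_MB
# megabytes at a sustained rate of RATE_MBPS megabytes per second.
copy_seconds() {
    awk -v size="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", size / rate }'
}

# Usage:
#   copy_seconds 18000 30      # typical measured rate
#   copy_seconds 18000 43.7    # best measured rate
#   copy_seconds 18000 75      # rate suggested by ATTO
```

At 30 MB/sec the copy takes 600 seconds (ten minutes) and at 43.7 MB/sec about 412 seconds, whereas the ATTO figure of 75 MB/sec would imply only 240 seconds.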

Such copies will push the gigabit network utilisation in Windows Task Manager up to 30-45%. I suspect that 100% on this graph actually represents the maximum theoretical full-duplex throughput of 2 gigabit/s (i.e. 1 up plus 1 down), in which case you'd never get more than 50% for a write-only activity in any case. This suggests to me that the gigabit network is basically being pushed to its limit no matter what performance tuning activities I undertake, and is always going to be the bottleneck for my NAS solution. The best tests that I can manage for local (non-network) writes, which are to cram as much data as possible into the tmpfs /tmp (i.e. RAM) and then move it onto a RAID mount point, suggest that the RAID 5 will write at up to 80 MB/sec, and the RAID 10 at approximately 200 MB/sec.