Potpourri I (Wed, 18 Apr 2012)
<h1>Google rant</h1>
<p>Seems venting can be worthwhile once in a while; the content
of <a href="http://richardhartmann.de/blog/posts/2012/04/13-woegle-is-me/">this post</a>
made its way into Google. No promises of any kind were made, but
it's good to know that this is getting some exposure within Google
:)</p>
<p><a href=
"https://code.google.com/p/chromium/issues/detail?id=123913">That
was quick.</a></p>
<h1>moreutils et al</h1>
<p>Steve Kemp <a href=
"http://blog.steve.org.uk/moreutils_makes_a_lot_of_sense_to_me.html">
is right</a>; <a href=
"http://packages.debian.org/search?keywords=moreutils">moreutils</a>
is very useful in some situations. And yes, it's a pity that we all
have our own scripts without a central repository to host them all.
The problem is one of diminishing returns and information overload;
if any given collection becomes too large, it becomes tedious to
look through it. You may have the Most Awesome Ever solution to any
given problem, but unless whoever needs it can find it quickly,
it's useless to them.</p>
<p>Thorsten Glaser suggested <a href=
"https://evolvis.org/projects/shellsnippets/">this collection</a>,
but it suffers from the same basic problem: Not enough content? You
won't find what you need. Too much content? You won't find what you
need. Finding the right balance is highly non-trivial.</p>
<h1>Linux storage</h1>
<p>Russell Coker shared <a href=
"http://etbe.coker.com.au/2012/04/17/zfs-btrfs-cheap-servers/">yet
another chapter</a> of his continuing quest for modern, reliable,
software-based storage on Linux. The short version is "ZFS is not
<em>that</em> good on Linux while btrfs is not ready for prime
time, yet." That's, unfortunately, not much of a surprise.
<code>fsck.btrfs</code> is still too new and the lack of RAID 5/6
is an absolute show-stopper as far as I am concerned. <a href=
"https://en.wikipedia.org/wiki/Copy-on-write">COW</a> solves the
writing speed issues so, hopefully, that old argument can finally
be laid to rest. And on top of RAID 10 "wasting" more disks, the
minimum number of failed disks that can lead to data loss is two for
RAID 10 whereas it's three for RAID 6. No-brainer, imo.</p>
<p>Anyone who does not agree with Dell's pricing structure should
seriously consider looking at <a href=
"http://www.supermicro.com">SuperMicro</a>, by the way. Cheap(ish),
reliable, high disk densities.</p>
<h1>Password hashes</h1>
<p>Saku Ytti talks about <a href=
"http://blog.ip.fi/2012/04/we-dont-understand-hashes.html">password
hashing</a> and the relative merits of different algorithms.</p>
<p>While I agree that <a href=
"http://en.wikipedia.org/wiki/Bcrypt">bcrypt</a> incorporates a
good approach inasmuch as it's designed to be slow to compute and can
be made slower as needed, I think he focuses on one aspect while
disregarding all others.</p>
<p>Still, the premise that deterring brute force attacks on known
hashes is the only concern is wrong. Avoiding collisions,
reasonable certainty that there are no computational shortcuts,
irreversibility, pseudo-random output, full use of the output's
possible values, and thorough analysis by experts in the field are
important concerns, as well.</p>
<p>Brute forcing known hashes is hardly the only attack. When brute
forcing a service, i.e. against an unknown hash, no matter if the
system is remote or local, the computational cost of hashing is
mostly irrelevant. Rainbow tables help mitigate computational cost
when trying to extract clear text in case the hash is known. And if
there are a lot of collisions or an easy way to find them, knowing
the clear text password is not needed. Simply use whatever
generates the correct output and you're done.</p>
<p>Again, I am not saying that bcrypt is a bad idea; it's just that
there are more concerns than computational cost alone.</p>
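<p>To make the cost point concrete anyway: bcrypt's work factor is
tunable, so the same password can be made arbitrarily more expensive
to hash. A rough sketch using htpasswd; the -B/-C flags assume a
reasonably recent apache2-utils:</p>
<pre>
<code># bcrypt at cost 10 vs cost 14; each +1 roughly doubles the work
time htpasswd -nbB -C 10 alice 'correct horse battery staple'
time htpasswd -nbB -C 14 alice 'correct horse battery staple'
</code>
</pre>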
<h1>Travel</h1>
<p>If you know your way around Novosibirsk, please contact me. We
have almost no idea what to do while there other than visiting the
railroad museum; part of the reason why we are stopping there is to
have a break after ~50 hours on the train and because Novosibirsk
is just an awesome name.</p>
<p>Our focus while there will be on seeing interesting industrial
or research sites. Given that there are large industrial complexes
and formerly secret Cold War research cities situated around
Novosibirsk, it would be a pity not to see any of them. We may end
up hiring a taxi via the hotel and having it drive us around a
bit. Somewhat boring and potentially costly, but a feasible
backup.</p>
<h1>Blogging frequency</h1>
<p>I have to admit I am a bit worried about spamming the
various aggregators which are fed from this blog. Writing this
catchall post is an attempt to mitigate this. While I am getting a
surprising amount of positive feedback, I would be interested to
hear anything negative, if applicable, as well. You know how to
reach me.</p>
RAID sucks (Sun, 19 Feb 2012)
<h1>Intro</h1>
<p>RAID sucks, and so do all other Free alternatives in the Linux
world.</p>
<p>Having been in exactly the same situation several times in the
past, I have been following <a href=
"http://etbe.coker.com.au/2012/02/06/reliability-raid/">Russell</a>
<a href=
"http://etbe.coker.com.au/2012/02/10/starting-with-btrfs/">Coker's</a>
<a href=
"http://etbe.coker.com.au/2012/02/11/magic-btrfs-raid/">posts</a>
regarding data integrity with interest. FWIW, the correct answer to
"BTRFS (sic) and Xen" is: subvolumes.</p>
<p>Now that Fedora has, <a href=
"http://www.h-online.com/open/news/item/Fedora-not-to-switch-to-Btrfs-in-version-16-1319827.html">
once again</a>, decided to <a href=
"http://www.h-online.com/open/news/item/Fedora-puts-back-Btrfs-deployment-yet-again-1436704.html">
postpone</a> their switch to Btrfs as default file system, I
decided to write up my own take on this topic.</p>
<p>Any and all of my considerations are made under the assumption that
important data is backed up while semi-important data is at least
living on two different machines. Beware, this is a long post, but
I do like to think it's well worth reading.</p>
<h1>RAID</h1>
<h2>Disk failures</h2>
<p>Let's recap the current situation for "traditional" file systems
like xfs, ext{2..4}, etc. In case you need to freshen up on the
technical details, go <a href=
"http://en.wikipedia.org/wiki/RAID">here</a>.</p>
<ul>
<li>Single disks: Fine for many use cases like laptops, desktops,
etc.; they are what most of us are using for most storage
needs.</li>
<li>RAID 0: If you are using that, you are most likely doing it
wrong. Barring increasing caches for SSD-based volumes on machines
which you simply take out of your cluster if there is <em>any</em>
kind of problem, I don't know of a single valid use case. Other
than to get rid of data, that is.</li>
<li>RAID 1: The default for small servers and important machines
which don't need a lot of disk space.</li>
<li>RAID 5: OK for personal storage servers, avoid once disks
become larger than 500GB-1TB, depending on personal preference. To
write data, you need to calculate new parity and thus read back
data from the corresponding stripes, impacting write performance
significantly.</li>
<li>RAID 6: The RAID 10 people will disagree, but I like this RAID
level best. While your write performance will definitely take a
hit, you can mitigate this by decreasing your volume sizes. The
extra cost in terms of power, disks, controllers, and rack space is
a price we <em>gladly</em> pay for the ability to lose two disks
and still retain all data; see the mdadm sketch after this list.</li>
<li>RAID 10: Great write performance until the day when two disks
in the same RAID position die right after each other and your
company's main mail storage dies. Yes, this has happened to me and
it <strong>sucks</strong>.</li>
<li>RAID 2,3,4,50,60,foo: Mostly irrelevant in the real world;
disregard.</li>
</ul>
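<p>For reference, a minimal sketch of setting up such a software
RAID 6 on Linux with mdadm; the device names are placeholders, and
chunk size, file system, and mount options depend on your
workload:</p>
<pre>
<code># six whole disks, two of which may fail without data loss
mdadm --create /dev/md0 --level=6 --raid-devices=6 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

cat /proc/mdstat                      # watch the initial sync

mkfs.ext4 /dev/md0                    # put a file system on top
mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # persist the array
</code>
</pre>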
<h2>Silent corruption</h2>
<p>All the RAID levels with redundancy are fine and dandy if you
lose a whole disk and put in a new one. Of course, that does not help
you the tiniest bit once a disk silently returns bad data. Controllers
which manage RAID 1 or 10 may or may not compare the data while
it's being read and may or may not tell you about discrepancies.
They can then toss a coin and give you either result. There is no
way to determine which is correct. RAID 2, 3, 4, 5, and 6 could, in
theory, verify the data as it's being read, but that would limit
your read rate to that of a single disk, so no one does that.</p>
<p>We actually had a massive web presence fail overnight, once. No
one knew what the cause was as there hadn't been any kind of access
logged. As that particular deployment is done directly from a VCS,
we simply ran a diff and found one change. We traced the failure
down to a syntax error; a one character change made everything
fail. Looking at the ASCII table, it was clear that one single bit
had flipped. This sucked. A lot.</p>
<h2>Mitigation</h2>
<p>So you end up <a href=
"http://en.wikipedia.org/wiki/Data_scrubbing">scrubbing</a> your
RAID sets on a weekly basis, recovering from silent corruption with
the help of your RAID 6's two parity stripes. And even though you
schedule the scrubbings with the least priority, you still take a
performance hit when randomly seeking a lot.</p>
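<p>With Linux software RAID, a scrub is a simple sysfs write; a
small sketch, assuming an md array at /dev/md0 (Debian's mdadm
package ships a checkarray cron job doing essentially this):</p>
<pre>
<code># read every stripe and verify it against the parity
echo check > /sys/block/md0/md/sync_action

cat /proc/mdstat                      # scrub progress
cat /sys/block/md0/md/mismatch_cnt    # mismatches found so far

# "repair" rewrites parity instead of merely counting mismatches
echo repair > /sys/block/md0/md/sync_action
</code>
</pre>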
<h1>ZFS</h1>
<p>Some smart people at Sun set out to fix these problems and more,
and fix them they did. The "Z" in ZFS is meant to imply that this is
the last file system you will ever need and if not for its license,
this might have worked. Of course, "last" and "forever" mean "not
now" and "10 years" respectively, in computer terms. The limits
within ZFS have been chosen so that the entropy needed to create a
disk able to max out a ZFS volume would, literally, require our
oceans to boil. Let's just say that you won't encounter these
limits any time soon.</p>
<p>While that is a nice thing to know, it does not say anything
about data integrity.</p>
<p>There are several mechanisms ensuring data integrity built into
ZFS, the most fundamental being extensive checksumming. Checksums
are vastly superior to the live comparisons which RAID would have to
perform as ZFS does not need to sacrifice any significant i/o to
read checksum data. With HDD access, you are i/o-bound, and not
CPU-bound, anyway, so performing the checksum calculations is
laughably cheap when factoring in the added safety. Everything, be
it data, metadata, inodes, you name it, is checksummed. Every read
operation <em>must</em> go through the checksumming functions,
ensuring that all data which reaches your user space is correct. If
ZFS can't verify the checksum, it will not deliver any data at all,
ensuring noticeable, as opposed to silent, corruption. So how does
ZFS recover when the checksum does not match the data?</p>
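<p>In practice, this is what checking on a pool looks like; "tank"
is a placeholder pool name:</p>
<pre>
<code># walk every block in the pool and verify it against its checksum
zpool scrub tank

# per-device READ/WRITE/CKSUM counters, plus any files ZFS could
# not repair from redundancy
zpool status -v tank
</code>
</pre>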
<h2>copies={1..3}</h2>
<p>The easy way is to store several copies of your data within your
volume. Set copies=2 or 3 on a ZFS volume or subvolume and all
data that is being written <em>from then on</em> will be copied
twice or three times. Trivial in principle, but powerful when built
into a file system.</p>
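<p>A quick sketch of what that looks like; the dataset names are
made up, and note again that the setting only affects data written
after it is set:</p>
<pre>
<code># new dataset keeping two copies of every block
zfs create -o copies=2 tank/photos

# or raise it on an existing dataset; only future writes are affected
zfs set copies=3 tank/photos
zfs get copies tank/photos
</code>
</pre>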
<h2>RAIDz{1..3}</h2>
<p>This is where things start to get interesting. RAIDz1, 2, and 3
will allow you to lose 1, 2, or 3 disks respectively while still
retaining your data; at the obvious expense of the storage capacity
of as many disks. Basically, RAIDz2 is ZFS' variant of RAID 6, but
it's so much better than mere RAID 6.</p>
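<p>Creating such a pool is a one-liner; again, the device names are
placeholders:</p>
<pre>
<code># six-disk RAIDz2: any two disks can fail without losing data
zpool create tank raidz2 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

zpool status tank
</code>
</pre>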
<h2>Moo!</h2>
<p>RAIDz will not assign fixed slices and fill them up like RAID
does. It will look at each write and use appropriately sized
stripes dynamically. As a direct result, data which has been
written at roughly the same time will always be near other data
written at the same time. This is very nice for functionality like
snapshots, writeable subvolumes and other things and is called
<a href="http://en.wikipedia.org/wiki/Copy-on-write">COW</a>, ZFS'
variant of super cow powers. This also enables ZFS to write data,
read it back and verify the checksum and only <em>then</em> point
to the new data. Atomic commits done right.</p>
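<p>Snapshots and writable clones fall out of this almost for free;
a minimal sketch with made-up dataset names:</p>
<pre>
<code># cheap, atomic point-in-time snapshot
zfs snapshot tank/home@before-upgrade

# roll the dataset back to it...
zfs rollback tank/home@before-upgrade

# ...or branch off a writable clone instead
zfs clone tank/home@before-upgrade tank/home-experiment
</code>
</pre>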
<h2>Keep rollin'</h2>
<p>As ZFS has to keep track of what data is still in use in which
volumes and subvolumes anyway, it knows which regions are free to
be reclaimed. Instead of filling up random places, ZFS will roll
over your disks, overwriting unused data on the fly.</p>
<p>This bears repeating: Contrary to RAID 5 and 6, ZFS will never
need to read old data in order to write new data. Write performance
galore and anyone still stuck on RAID 10 can finally enjoy the
increased data security.</p>
<h1>Btrfs</h1>
<p>To be completely honest, I do not know Btrfs as well as I know
ZFS. If I get anything wrong, correct me. If I offend anyone with
my outsider's interpretation, that is not my intention.</p>
<p>Btrfs was initiated by Oracle way before it ever even thought
about buying Sun, or history might have run a different course, for
worse or a lot worse. The closing of Java, Solaris, ZFS, Hudson,
OpenOffice (since then "gifted" aka thrown away), and others makes
me think the latter. As there are few technological developments
which caused me as much stress, overtime, and pain as OCFS2, I am
naturally wary of file systems sponsored by Oracle and of their
focus on data integrity/availability. The fact that there's still
no way to fsck a Btrfs volume could be birthing pains or yet
another facet of Oracle's stance on this topic, I honestly don't
know. Either way, it's a good idea to be wary for now.</p>
<p>I can't say too much on Btrfs' technical underpinnings, but I
know it uses COW and that it has its own variant of RAID. Toss in
snapshots (writeable?), subvolumes and integrated block device
management, and you have the building blocks of a decent file
system.</p>
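<p>The integrated device management looks roughly like this; a
hedged sketch with placeholder devices, not a recommendation to run
it in production yet:</p>
<pre>
<code># mirror both data and metadata across two disks at mkfs time
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
mount /dev/sdb /mnt/btrfs

# add a third disk later and rebalance onto it
btrfs device add /dev/sdd /mnt/btrfs
btrfs filesystem balance /mnt/btrfs
</code>
</pre>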
<p>Still, Btrfs is <em>not</em> ready for prime time. Btrfs has
been "one or two years in the future" for a few years now so I will
not be holding my breath. And once the first distributions start
using Btrfs by default, people <em>will</em> lose data. A year or
three after that, I will feel comfortable using it myself and in
production.</p>
<h1>Commercial solutions</h1>
<p>You can either hand a lot of money to proprietary vendors who
will not tell you a thing about how things work internally (and give
you no way to fix things directly) or you can buy solutions that employ
ZFS in the back-end. I prefer my storage hardware to be relatively
dumb while keeping the intelligent bits where I can see and poke
them so this is not really part of these considerations.</p>
<h1>So what can you do?</h1>
<h2>Test your disks before deploying them</h2>
<p>Easy, but vitally important:</p>
<pre>
<code># WARNING: badblocks -w is destructive; only run this on empty disks
disk=/dev/foo

smartctl -a $disk                 # record SMART state before testing
smartctl -t long $disk            # kick off the long self-test

# full write/read pass; vendor, model, serial, and timestamp need to
# be set beforehand and merely name the log file
badblocks -swo ${vendor}_${model}_${serial}_${timestamp}.badblocks.swo $disk

smartctl -a $disk                 # compare SMART state after testing
</code>
</pre>
<p>If you are lazy, get a copy of <a href=
"https://github.com/RichiH/disktest">disktest</a>, a tool I wrote
to do exactly this. It does not yet wait for the long self-test to
finish, can not read out all vendor names, and does not have a way
to document who ran the test, but it's a start. Patches and
feedback are, obviously, welcome.</p>
<p>And yes, I will package this soonish.</p>
<h2>Using your disks</h2>
<p>This list is surprisingly short. You can use</p>
<ul>
<li>distinct disks</li>
<li>RAID as per above</li>
<li>ZFS-FUSE</li>
<li>Debian/kFreeBSD</li>
<li>Nexenta</li>
</ul>
<p>And that's it.</p>
<p>At work, I prefer small RAID 6 volume sets. It sucks, but there
is nothing better.</p>
<p>For personal use, I have a machine with a FUSE-based RAIDz2
mounted with copies=2. Three disks can fail and my data is still
secure. While this setup is slow as molasses due to FUSE, speed is
not a consideration for me here; data safety is. A migration to
Debian/kFreeBSD in the medium term is still likely.</p>
<p>If you have things to add or disagree with me, I would love to
hear from you.</p>
<p><strong>Update:</strong> Jan Christian Kaessens pointed out
that, contrary to my experience, zfsonlinux did not bring down his
system in flames. As zfsonlinux is a kernel module, potential
performance is a lot better. Also, it supports ZFS Pool Version 28
as opposed to FUSE-ZFS' 23. While the really nice features are in
v29 and v30 (which are closed thanks to Oracle), v28 still has some
nice changes.</p>
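<p>If you want to check what you are actually running, both are
easy to query; "tank" is a placeholder pool name:</p>
<pre>
<code># pool versions this zfs implementation supports
zpool upgrade -v

# the version a given pool is currently at
zpool get version tank
</code>
</pre>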