large file server/backup system: technical opinions?

Karl-Heinz Herrmann [kh1 at khherrmann.de]

Sun, 20 Jan 2008 18:09:08 +0100

Hi TAG,

at work we are suffering from the ever-increasing amount of data.
This is a Medical Physics group working with MRI (magnetic resonance
imaging) data. In worst-case scenarios we can produce something like
20 GB of data per hour of scan time. Luckily we are not scanning all the
time :-) Data access safety is mostly taken care of by firewalls and
access control outside our responsibility. But storage and backups
are our responsibility.

Currently we have about 4-6 TB distributed over two
"fileservers" (hardware RAID5 systems), and two systems are making daily
backups of the most essential part of these data (home, original
measurement data). The backup runs are taking more than a full night by
now, and the backup machines can't handle anything else while BackupPC
is still sorting out the new data. The machine the backup is taken from
is fine by morning.

We will have a total of three number-crunching machines over the year,
and at least these should have speedy access to these data. Approx. 20
hosts are accessing the data as well.

Now we got 10k EUR (~15k USD) for new backup/file storage and are
thinking about our options:

* A RAID system with iSCSI connected to the two (optimally all three)
number crunchers, which would export the data to the other hosts via
NFS. (Is eSATA any good?)
* An actual machine (2-4 cores, 2-4 GB RAM) with hardware RAID (~24x 1 TB)
serving the files AND doing the backup (e.g. one RAID onto another
RAID on these disks).
* A storage solution using Fibre Channel to the two number crunchers.
But who does the backup then? The oldest number cruncher might be
able to handle this nightly along with some computing all day. But it
hasn't got the disk space right now.

The surrounding systems are all Ubuntu desktops, the number crunchers
will run 64-bit Ubuntu, and the data sharing would be done via NFS --
mostly because I do not know of a better/faster production solution.

The occasional Windows access can be provided via Samba over NFS on one
of the machines (as it is done now).

Now, I've no experience with iSCSI or Fibre Channel under Linux. Will
these work without too much trouble setting things up? Any specific
controllers to get/not to get? Would simultaneous iSCSI access from
two machines to the same RAID actually work?
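[For what it's worth, the initiator side of iSCSI on Linux is fairly simple. A hypothetical, untested sketch using the open-iscsi tools (Ubuntu package `open-iscsi`); the target address and IQN below are invented:]

```
# Discover targets exported by the iSCSI box, then log in to one.
iscsiadm -m discovery -t sendtargets -p 192.168.1.50
iscsiadm -m node -T iqn.2008-01.de.example:raid1 -p 192.168.1.50 --login
# The LUN then shows up as a normal block device (e.g. /dev/sdc)
# and can be partitioned, formatted, and exported over NFS as usual.
```

Note that two initiators logged in to the same LUN see a raw shared block device: without a cluster filesystem (GFS, OCFS) on top, mounting it simultaneously from two machines will corrupt it.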

I also assume all of the boxes have 2x 1 Gbit Ethernet, so we might be
able to set up load balancing -- but the IP and load balancing
would also have to be taught to our switches, I guess -- and these are
"outside our control", but we can talk to the admins. Is a new multi-core
system (8-16 cores, plenty of RAM) able to saturate the 2x Gbit? Will
something else max out first (HyperTransport, ...)?
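[For a rough sense of scale, here is my own back-of-envelope calculation, assuming a 6 TB dataset and ~70% of wire speed actually usable in practice:]

```shell
# Hours for one full pass over a 6 TB dataset over one and two
# bonded gigabit links, assuming 70% efficiency.
awk 'BEGIN {
  bytes = 6e12                     # assumed dataset size: 6 TB
  for (links = 1; links <= 2; links++) {
    rate = links * 1e9 / 8 * 0.7   # usable bytes/s per configuration
    printf "%dx 1 Gbit: %.1f hours\n", links, bytes / rate / 3600
  }
}'
# -> 1x 1 Gbit: 19.0 hours
# -> 2x 1 Gbit: 9.5 hours
```

So even perfect bonding only halves a transfer window that is already close to a full night; whether one server can actually fill both links is a separate question.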

Any ideas -- especially ones I did not yet think of -- or experiences
with any of this exotic hardware are very much welcome....

Not sure about your budget, but if you got a tape library such as an
SL500 and some tape drives and used Veritas NetBackup, it would take
care of that, no problem.

Although a tape library for 4-6 TB is probably overkill; if you had
100 TB+ you would want tape.

But if you want a real solution, I'd go with an SL500 and 2-4 LTO-3 or
LTO-4 drives. LTO-3 tape is 400 GB uncompressed, LTO-4 is 800 GB, but
LTO-3 is currently the sweet spot at $38-40/tape.
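[The media cost is easy to estimate. A sketch using the ~6 TB figure from the original post and the LTO-3 numbers above:]

```shell
# Tapes and media cost for one full uncompressed copy of ~6 TB
# on 400 GB LTO-3 at ~$40/tape.
awk 'BEGIN {
  dataset_gb = 6000; tape_gb = 400; dollars = 40
  tapes = int((dataset_gb + tape_gb - 1) / tape_gb)   # round up
  printf "%d tapes, ~$%d in media\n", tapes, tapes * dollars
}'
# -> 15 tapes, ~$600 in media
```

So the recurring media cost is small; the library and drives dominate the price.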

Have a look at Coraid; they make very reasonably priced appliances with
up to 15 TB capacity, depending on the RAID config you create.
It's AoE storage but has been working reasonably well for us. Don't
expect stellar performance, but it should be sufficient for your backup
needs.

I have them deployed in two different configs:

* GFS-clustered with a 5-node cluster on top
* Standalone node with hot-standby

The latter option provides good performance; GFS1 suffers from
lock contention due to heavy writing and many, many files in our setup.
I would definitely not recommend it if you need speedy access.

We currently buy Dells at a reasonable price point with 4 TB of storage
each; maybe that would be interesting for the number crunchers?
Fibre storage and backup is going to be a tight fit with your budget...

I'd try to move the backup schedule to continuous if at all possible,
though whether that is feasible is impossible to tell from your problem
description.
That way you'd open up your backup window. It does assume separate
architectures for backup and crunching.
Depending on the time requirements for the data on the fileservers,
could you move it to the backup?

I've used an old SCSI tape drive way back when these had 4 GB/tape --
and frankly the data handling was a pain in the ass (tar streams).

> Not sure about your budget, but if you got a tape library such as an
> SL500 and some tape drives and used Veritas NetBackup, it would take
> care of that, no problem.

Well -- the SL500 would be outside our budget. Also, the specs are quite
a bit more than what we would need in the coming few years.

The SL48 on the other hand might just about fall into budget range.

> Although a tape library for 4-6 TB is probably overkill; if you had
> 100 TB+ you would want tape.

Right now we have these roughly 6 TB on drives. This is growing, and we
have to archive the old data; we can't just throw it out at some
point.

> But if you want a real solution, I'd go with an SL500 and 2-4 LTO-3
> or LTO-4 drives. LTO-3 tape is 400GB uncompressed, LTO-4 is 800GB,
> but LTO-3 is currently the sweet spot for $38-40/tape.

One thing is not yet quite clear to me. I connect that SL500 (or SL48)
via FC or SCSI to a computer. Does the whole SL500 then look like one
giant tape? Or how is it represented to the outside? So for
archive/retrieval I would definitely need additional software
(like the Veritas NetBackup you mentioned)?

With the current budget of ~$15k we basically need both -- new disk
space and a way to back up the new disk space. So we might have to
stick with BackupPC as software and two RAIDs -- one for data, one for
backup -- and plan for a tape archiving system next year.

Can tape handle something like "RAID1"? I've no good feeling about
putting data on tape and deleting all other copies. That's also the
reason why I would try to expand the hard drive space first, so we can
at least accommodate this year's growing data needs (including a second
copy on different hard drives).

Is there also software around which would transparently pull old data
off a disk array and store it on tape, and retrieve the files if you
access them? Research Center Juelich had that years ago when I was
doing my PhD there. What price tags would we be talking about then?

Ah.. I see: Sun's "BakBone NetVault" can do D2D2T..... I'll go read
some more.... Thanks again for pointing these tape systems out to me.

> Hi Justin,
>
> thanks for your suggestion to look into the SUN Tape Solutions.
>
> I've used an old SCSI tape drive way back when these had 4GB/tape --
> and frankly the data handling was a pain in the ass (tar streams).

Without enterprise backup software such as NetBackup, Legato, or
others, it is very painful.

>> Not sure about your budget, but if you got a tape library such as an
>> SL500 and some tape drives and used Veritas NetBackup, it would take
>> care of that, no problem.
>
>
> Well -- the SL500 would be outside our budget. Also the specs are quite
> a bit more than what we would need in the few coming years.
>
> The SL48 on the other hand might just about fall into budget range.

Ok.

>> Although a tape library for 4-6TB is probably over-kill, if you had
>> 100TB+ you may want tape
>
> Right now we have about these 6TB on drives. This is growing and we
> have to archive the old data, we can't just throw them out at some
> time.

That is where tape comes in; 6 TB is nothing, and if it's compressible
data you'll see great returns.

>> But if you want a real solution, I'd go with an SL500 and 2-4 LTO-3
>> or LTO-4 drives. LTO-3 tape is 400GB uncompressed, LTO-4 is 800GB,
>> but LTO-3 is currently the sweet spot for $38-40/tape.
>
> One thing is not yet quite clear to me. I connect that SL500 (or SL48)
> via FC or SCSI to a computer. Does the whole SL500 then look like one
> giant tape? Or how is it represented to the outside? So for
> archive/retrieval I would definitely need additional software
> (like the Veritas NetBackup you mentioned)?

The SL500 connects via either Fibre Channel or SCSI -- that is the
robotic controller, which is at the top of the unit.

The drives are connected separately, via either Fibre Channel or SCSI.
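[In other words, the library is not one giant tape: the changer robot and each drive appear as separate devices. A hypothetical, untested sketch with the standard mtx/mt tools -- the device names are invented:]

```
mtx -f /dev/sg3 status            # ask the changer what is in its slots
mtx -f /dev/sg3 load 5 0          # move the tape in slot 5 into drive 0
mt -f /dev/nst0 status            # drive 0 is an ordinary tape device
tar -cf /dev/nst0 /data/dicom     # ...so any tape-aware tool can use it
mtx -f /dev/sg3 unload 5 0        # put the tape back
```

Backup software like NetBackup automates this slot/drive bookkeeping and keeps the catalog of what is on which tape.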

> With the current budget of ~15k$ we need basically both -- new disk
> space and a way to back up the new disk space. So we might have to
> stick to backuppc as software and two raids -- one data, one backup and
> plan for a tape archiving system next year.

One nice thing about tape is that it does not require power, and it is
also nice in the event of a disaster, someone accidentally running
rm -rf on the wrong directory, ext3 filesystem corruption, etc.

> Can tape handle something like "RAID1"? I've no good feeling about
> putting data on tape and deleting all other copies. That's also the
> reason why I would try to expand the hard drive space first, so we can
> at least accommodate this year's growing data needs (including a second
> copy on different hard drives).

You could back up what you have on disk and then run incrementals over
it. LTO-2/3/4 technology is quite good; as long as you clean your tape
drives regularly, they're fairly reliable.

> Is there also software around which would transparently pull old data
> of a disk array and store them on tape? and retrieve the files if you
> access them? Research center Juelich had that years ago when I was
> doing my PhD there. What price tags would we be talking then?

Some companies actually do this for their web orders etc. -- you would
need to create scripts that pull the files off the tapes and back up as
needed. It is a single command either way in NetBackup (bpbackup or
bprestore).

>
> Ah.. I see: Suns "BakBone NetVault" can do D2D2T..... I'll go read some
> more.... Thanks again for pointing these Tape systems out to me.

NetBackup can also do this.

Including the veritas-bu mailing list on this thread as well, they may
also have some good insight into your problem.

> with up to 15 TB capacity depending on the RAID config you create.
> It's AoE storage but has been working reasonably well for us, don't
> expect stellar performance but it should be sufficient for your backup
> needs.

I had never heard of AoE before.... the kernel module works reliably, I
understand from the above? When they say 2x 1 Gbit Ethernet -- can this
be easily load-balanced? Or would that be useful only for connecting to
two different hosts?

> I have them deployed in two different configs:
> * GFS-clustered with a 5-node cluster on top
> * Standalone node with hot-standby

Can you comment on GFS vs. NFS for a small number (~10) of hosts with
mostly read access? Might GFS be something to consider for NFS
replacement?

> We currently buy Dells at a reasonable price point with 4 TB of
> storage each; maybe that would be interesting for the number crunchers?

we are shopping for an AMD 8x quad-core as soon as they exist in a
bug-free stepping, and we want to put some 64 GB of RAM in it. We
thought quite a while about cluster vs. SMP multi-core systems. Finally
we decided that for regular image reconstruction and post-processing it
doesn't matter; and some people in our workgroup do finite element grid
calculations and inverse problems (EEG source localisations), and for
that, LOTS of RAM to keep the grid data out of swap is a very good
thing. Also, standard tools like Matlab and its toolboxes are able to
make good use of multiple cores, and less so of distributed clusters, it
seems. The other number cruncher will probably be Intel, with fewer
cores but better performance per core for the less parallelisable stuff.

> Fibre storage and backup is going to be a tight fit with your budget...

We would have tried to put the two FC controllers in the budget for the
two number crunchers. Otherwise yes, that would be too big a chunk out
of the $15k.

> Additionally I would want to separate the workloads:
> * fast disk access for number crunching
> * reliable but slow access for backup

Hm.. yes. Right now we're planning to run some scratch drives (maybe
even RAID0) in the crunchers for fast local access. Once done, the data
can be put out on some storage via NFS.

> I'd try to move the backup schedule to continuous if at all
> possible, though whether that is feasible is impossible to tell from
> your problem description.

Not with the current software (BackupPC). We have one rather mediocre
box which has handled a secondary backup without a hitch for quite some
time now. But that's the offsite remote backup, which doesn't do much
otherwise. The "primary backup" is simply a second RAID in the main
fileserver, and while that is running the fileserver is awfully slow. So
we need to get the backup process away from the data being backed up. Or
maybe plenty of cores and two individual RAID controllers might help?

During daytime lots of files will change, but basically there wouldn't
be a serious problem with backing stuff up as soon as it changes. Could
you recommend software that does that?

> That way you'd open up your backup window. It does assume separate
> architectures for backup and crunching.
> Depending on the time requirements for the data on the fileservers
> you could move them to the backup?

That's under discussion here. We've plenty of DICOM files (i.e. medical
images), which are basically sets of files in a directory; sizes vary
from a few kB to maybe 3 MB each. Now, we don't use much of the older
data, so it could be moved into some kind of long-term storage, and some
time penalty for getting it back wouldn't hurt much. But attached to
these is a database keeping track of metadata, and we have to be careful
not to break anything. The DICOM server CTN handles this database,
accepts files from other DICOM nodes (like the MR scanner), and stores
the files. Unfortunately the guys writing CTN forgot the cleaning tools
(move, remove, ...), and we are putting some effort into writing tools
right now.

Also, from analyzing the disk space usage, these DICOM images seem to
grow steadily but rather slowly. The major mass of data recently is raw
data from the scanner, which can easily be 4 to 6 GB each and represent
a short experiment of 15 minutes. We will work more and more on these
raw data for experimental image reconstruction. One of the number
crunchers' jobs will be to read these 5 GB chunks and spit out 50-300
single images, so: reading large continuous data and writing many small
files (no GFS for that, I presume).

Coming back to the Coraid AoE boxes...... apart from the extensibility
of plugging in another AoE device once the first is full, I can't really
see a big difference from an actual computer (let's say 4x4 cores),
8-16 GB RAM, and 24x 1 TB drives connected to 2 PCI Express (x8) RAID
controllers. We got an offer for something similar at 12k EUR (a little
bit too much right now, but drive costs should be dropping). But the
Coraids had price tags of some $5-7k without the drives, and we don't
get the computer running the backup software. The hot-swap bays and
redundant power were also there. Am I missing something the Coraid can
do natively that a computer running Linux could not easily replicate?

>> with up to 15 TB capacity depending on the RAID config you create.
>> It's AoE storage but has been working reasonably well for us, don't
>> expect stellar performance but it should be sufficient for your backup
>> needs.
>>
>
> I had never heard of AoE before.... the kernel module works reliably,
> I understand from the above? When they say 2x 1 Gbit Ethernet -- can
> this be easily load-balanced? Or would that be useful only for
> connecting to two different hosts?
>

New versions support load balancing out of the box without any advanced
trickery; make sure you get those if you buy them.
The older ones had a hardware issue: the secondary NIC shared the
PCI bus with something else (forgot what), which ate up so much PCI
bandwidth that load-balancing the network traffic would actually result
in a reduction of performance.

If you use them, make sure you use the kernel module supplied by Coraid;
the in-kernel one usually lags several versions behind, and the Coraid
ones perform much better in general.

> Can you comment on GFS vs. NFS for a small number (~10) of hosts with
> mostly read access? Might GFS be something to consider for NFS
> replacement?
>

Sure. GFS is a filesystem designed by Sistina and bought by Red Hat.
Its primary goal is to allow several hosts in a cluster to share the
same storage pool over the network and write to it concurrently. Nodes
in the cluster see each other's updates. It performs reasonably well in
general, but very poorly under specific workloads.
Red Hat is aware of this problem and has redesigned the cluster locking
and filesystem semantics to counter it; this is integrated into the
mainline kernel (as GFS2).
Sadly no one considers it production quality yet(!).

If you do straight read-only access with only ~10-20 hosts, NFS is
definitely the way to go.
If properly tuned it scales nicely for read-only use and performs
better than GFS.
The downside is that you introduce a single point of failure with the
NFS server, but the downside of a GFS cluster is the overhead of locking
between nodes.
Apart from that, GFS needs a cluster and thus a cluster architecture;
the main requirement is that each node must be able to power down a
non-communicating node ("fencing") to prevent runaway nodes from causing
filesystem corruption. It's not really complicated or expensive, but it
adds up.

We've been able to scale read-only NFS to roughly 250-400 hosts without
any problems, though not with the data volumes you are talking about.

>> We currently buy Dells at a reasonable price point with 4 TB of
>> storage each; maybe that would be interesting for the number crunchers?
>>
> we are shopping for an AMD 8x quad-core as soon as they exist in a
> bug-free stepping, and we want to put some 64 GB of RAM in it. We
> thought quite a while about cluster vs. SMP multi-core systems. Finally
> we decided that for regular image reconstruction and post-processing it
> doesn't matter; and some people in our workgroup do finite element grid
> calculations and inverse problems (EEG source localisations), and for
> that, LOTS of RAM to keep the grid data out of swap is a very good
> thing. Also, standard tools like Matlab and its toolboxes are able to
> make good use of multiple cores, and less so of distributed clusters,
> it seems. The other number cruncher will probably be Intel, with fewer
> cores but better performance per core for the less parallelisable stuff.
>

I would need a much better understanding of the processes and workloads
involved to be able to say something meaningful in a technical sense.
Wouldn't it be possible to split and parallelize the work so it
processes chunks of data? That would allow you to use more, but
lower-powered, machines.

In my experience, anything that you need to buy at the top of the
performance spectrum is overpaid.
If you can work out a way to do the same work with 8 quad-core servers
with 8 GB RAM each, you might spend 25% of what you would spend on a
top-of-the-line server.

>> Additionally I would want to separate the workloads:
>> * fast disk access for number crunching
>> * reliable but slow access for backup
>>
>
> Hm.. yes. Right now we're planning to run some scratch drives (maybe
> even RAID0) in the crunchers for fast local access. Once done, the
> data can be put out on some storage via NFS.
>

Sounds like a plan, depending on the data security the company or
organisation needs.
If the data is hard or impossible to reproduce, some people are bound to
get extremely pissed if one disk in the stripe set fails.

> Not with the current software (BackupPC). We have one rather mediocre
> box which has handled a secondary backup without a hitch for quite some
> time now. But that's the offsite remote backup, which doesn't do much
> otherwise. The "primary backup" is simply a second RAID in the main
> fileserver, and while that is running the fileserver is awfully slow.
> So we need to get the backup process away from the data being backed
> up. Or maybe plenty of cores and two individual RAID controllers might
> help?
>

Maybe the problem is in the current software you are using? I've never
heard of BackupPC.
For continuous backup I'd look at the usual suspects first: rsync,
tar, etc.

Continuous backup needs to be designed carefully: if someone actually
deletes a file from primary storage and the backup is near-instantaneous,
you will find that you have no way to restore "human errors".

> during daytime lots of files will change but basically there wouldn't
> be a serious problem with backing stuff up as soon as it changes. Could
> you recommend software doing that?
>

I'm hearing rumours that they are working on two-way syncing and even
three-way syncing, but I haven't had time to research it yet.
Be aware that DRBD is near-instantaneous and thus suffers from the
problem above.

> That's under discussion here. We've plenty of DICOM files (i.e. medical
> images), which are basically sets of files in a directory; sizes vary
> from a few kB to maybe 3 MB each. Now, we don't use much of the older
> data, so it could be moved into some kind of long-term storage, and
> some time penalty for getting it back wouldn't hurt much. But attached
> to these is a database keeping track of metadata, and we have to be
> careful not to break anything. The DICOM server CTN handles this
> database, accepts files from other DICOM nodes (like the MR scanner),
> and stores the files. Unfortunately the guys writing CTN forgot the
> cleaning tools (move, remove, ...), and we are putting some effort into
> writing tools right now.
>

Great.
It's been a struggle to get that through here as well, but we currently
have redistribution software written which does the following:
* calculate the fill ratio (%) compared to the other nodes in the
storage pool
* redistribute (i.e. pull in case of a lower ratio, push in case of a
higher ratio)
* update the meta-index
* delete the data after verification on the new location

We run that every time we expand the storage pool with extra machines.
This is the distributed storage pool which has replaced the Coraids, by
the way.
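[The decision step above can be sketched as a toy calculation; the node names and fill numbers here are invented:]

```shell
# Compare each node's fill ratio against the pool average and decide
# whether it should push data away or pull data in.
awk 'BEGIN {
  used["node1"] = 9; cap["node1"] = 10
  used["node2"] = 3; cap["node2"] = 10
  used["node3"] = 5; cap["node3"] = 10
  tu = 0; tc = 0
  for (n in used) { tu += used[n]; tc += cap[n] }
  avg = tu / tc                       # pool-wide fill ratio
  for (n in used) {
    r = used[n] / cap[n]
    act = (r > avg) ? "push" : ((r < avg) ? "pull" : "hold")
    printf "%s: %.0f%% full (pool avg %.0f%%) -> %s\n", n, r*100, avg*100, act
  }
}' | sort
# -> node1: 90% full (pool avg 57%) -> push
# -> node2: 30% full (pool avg 57%) -> pull
# -> node3: 50% full (pool avg 57%) -> pull
```

The real work, of course, is in moving the data and updating the meta-index atomically; the decision itself is trivial.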

> Also, from analyzing the disk space usage, these DICOM images seem to
> grow steadily but rather slowly. The major mass of data recently is
> raw data from the scanner, which can easily be 4 to 6 GB each and
> represent a short experiment of 15 minutes. We will work more and more
> on these raw data for experimental image reconstruction. One of the
> number crunchers' jobs will be to read these 5 GB chunks and spit out
> 50-300 single images, so: reading large continuous data and writing
> many small files (no GFS for that, I presume).
>

Mmm, an interesting problem:
* You need to keep the originals (4-6 GB) around to do processing on.
* Each processing job spits out 50-300 files between 5 kB and 3 MB.
* The crunchers are CPU-bound but need fast disk access to the originals.
* Everything needs to be accessible for at least 6-12 months.
* You need backup.

Correct?

> Coming back to the Coraid AoE boxes...... apart from the extensibility
> of plugging in another AoE device once the first is full, I can't
> really see a big difference from an actual computer (let's say 4x4
> cores), 8-16 GB RAM, and 24x 1 TB drives connected to 2 PCI Express
> (x8) RAID controllers. We got an offer for something similar at 12k
> EUR (a little bit too much right now, but drive costs should be
> dropping). But the Coraids had price tags of some $5-7k without the
> drives, and we don't get the computer running the backup software. The
> hot-swap bays and redundant power were also there. Am I missing
> something the Coraid can do natively that a computer running Linux
> could not easily replicate?
>

The difference is price; IIRC we've been buying them for $7k including
a full set of 750 GB SATA disks.
They're listed on the Coraid site for $4k without drives.
I've bought large RAID systems as well, and the price you quote doesn't
sound strange to me; I would definitely not bet on getting a better
offer for that configuration. Hard disk prices are dropping, but not
that fast.

Coraids can be attached to a single host; depending on traffic, a
single host could be the basis for 3-4 Coraids.

Apart from that, you're right; internally it actually looks like the
Coraids run a modified version of Plan 9, judging from the CLI.