
An anonymous reader writes "Hi there! I'm looking for a simple solution to back up a big data set consisting of files between 3MB and 20GB, 24TB in total, onto multiple hard drives (USB, FireWire, whatever). I am aware of many backup tools that split the backup onto multiple DVDs with the infamous 'insert disc N and press continue', but I haven't come across one that can do it with external hard drives ('insert next USB device...'). OS not relevant, but Linux (console) or Mac OS (GUI) preferred. Did I miss something, or is there no such thing already and am I doomed to code it myself?"

USB would be just about the worst way to go for something like this; it's too slow, and he's going to be swapping worse than when we used to have to back things up to CDs.

I'm guessing he's going USB because he doesn't have the cash to buy a NAS of that size, but you can always jury-rig a NAS; it's really not hard. We did something similar at the last shop I worked at, when the boss scored a ton of SCSI drives at an auction and ended up with nearly a TB of NAS when the average HDD was 40GB. Here is how you do it.

You take a couple of full-size towers, the bigger the better, preferably twinkies as it makes the job a LOT easier. You strip 'em to the frames and use a couple of spot welds to make them into one giant case, along with another couple of welds to mount a shitload of drive cages into the case. Then you take a cheap server or even desktop board (all that matters is that it has a shitload of PCI slots) which you fill with controller cards, SCSI in our case but SATA today, mount the board along with a big PSU to feed the drives, and voila! One big-ass DIY NAS unit that can hold a huge pile of drives. Just to finish our white-trash conversion we tied on a Walmart box fan to keep the sucker cool and stuck it in a corner; worked great.

The only software that I think would work with USB is Paragon Drive Backup [drive-backup.com], as you can have it split by just about any size you want. They also have their own Linux-based recovery media, but damned if I know whether you can get the software as a Linux installer; I never ran into a situation where I needed it that way. I know it's worked great for me making OS images and backing up files and folders onto USB drives, but if you're going to be splitting onto a ton of little drives then you are just going to have to swap; no way out of that. If you want to fill the drives up then set Paragon to a small size, say 700MB, but good luck checking your backup, as the amount of swapping you're going to do is just insane.

If the OP's porn collection can be logically broken up at some level, e.g.:

/porn/blonde, /porn/brunette, /porn/redhead

then the backup software could create one job for each directory, and multiple USB disks could be attached at once, giving increased throughput. USB 3 also increases speed to the point where the 7200RPM disk itself becomes the bottleneck.

So at 100MB/second write speed per disk with 4 disks going at once (assuming the source disks are capable of supplying this volume of data and there are no other throughput limitations), you could do it in about 16 hours, or 24 hours with more realistic margins.
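As a sanity check, the arithmetic above fits in a couple of lines of shell (24TB counted in vendor-decimal megabytes):

```shell
# Back-of-the-envelope copy time: 24 TB across 4 disks at 100 MB/s each.
total_mb=$((24 * 1000 * 1000))      # 24 TB expressed in MB (decimal units)
disks=4
per_disk=100                        # sustained MB/s per disk
seconds=$(( total_mb / (disks * per_disk) ))
echo "$(( seconds / 3600 )) hours"  # integer hours, ignoring overheads
```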

If it turns out that the source data is not porn (unlikely) and is highly compressible, then it could be done in far less time.

I have a setup here where the server's video media is about 8TB in size. It backs up via rsync to the backup server in another room, which contains a large number of internal and external drives, none over 2TB in capacity. The main drive has data separated into subfolders, and the rsync jobs back up specific folders to specific drives.

A few times I've had to do some rearranging of data on the main and backup drives when a volume filled up. So it helps to plan ahead to save time down the road. But it works well for me here.

The only thing you need to worry about with rsync is users moving large trees or renaming root folders in large trees. This tends to cause rsync to delete a few TB of data and then turn around and copy it all over again onto the backup drive. It doesn't track files and folders by inode; it just goes by exact location and name.
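The rename problem is easy to demonstrate with throwaway temp directories; this sketch assumes rsync is installed and that --delete mirroring is in use, as described above:

```shell
# A top-level rename looks like "delete everything, re-copy everything" to rsync.
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/videos"
echo "movie" > "$src/videos/film.mkv"

rsync -a --delete "$src/" "$dst/"   # first pass: videos/ is copied over

mv "$src/videos" "$src/media"       # a user "tidies up" with a rename...
rsync -a --delete "$src/" "$dst/"   # ...and rsync deletes videos/ on the
                                    # backup and re-transfers it all as media/
ls "$dst"
```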

I help mitigate this by hiding the root folders from the users. The share points are a couple levels deeper so they can't cause TOO big of a problem if someone decides to "tidy up". If they REALLY need something at a lower level moved or renamed, I do it myself, on both the source and the backup drives at the same time.

Another alternative is to get something like a Drobo where you can have a fairly inexpensive large pool of backup storage space that can match your primary storage. This prevents the problem of smaller backup volumes filling up and requiring data shuffling, but does nothing for the issue of users mucking with the lower levels of the tree.

Agreed. Best thing I ever did was get a computer case with a SATA sled bay, like one of these [newegg.com]. It won't help with breaking up the files, but a plain SATA connection will be many times faster and many times cheaper than getting external USB drives (because you don't have to keep paying for external case + power supply). After you copy it over, you just store the bare drives in a nice safe place.

This assumes it's a one-time or rare thing. If you do want access, or the backup process is a regular thing, then a NAS or RAID setup is probably more convenient so that you don't have to keep swapping drives in and out.

If you're just using RAID to make a bunch of disks look like a single logical unit, consider mhddfs [mindlesstechie.net]. It's a FUSE filesystem which makes a bunch of disks look like a single unit. I've used it for storing backups - it works as advertised.

IIRC there were one or two caveats like a lack of hard link support so make sure you try all your use cases before relying on it.

Although SATA is more widespread and avoids any reduction in performance you might get from putting an intermediate layer in front of the drive's native interface. A large external drive is going to require a wall wart, and all of those will need to be looked after.

The problem with case+power supply is not the cost but the fact that it is something else to lose. This goes for the extra cabling too.

Plus with a bare drive you can buy with performance in mind, since the drive…

If he's looking for reliability in a backup, then his choice of disks is going to be a factor. A drive with consumer-grade URE rates stands a real chance of hitting an unrecoverable error over this many writes and reads. USB-grade drives (Caviar Green, anyone?) aren't known for their reliability. Something like a Hitachi Ultrastar RE has a much lower chance of encountering a URE, so it will be much more reliable.

Yes, Bacula is the only real solution out there that isn't going to cost you an arm and a leg and that allows you to switch easily between backup media. As long as your MySQL catalog is intact, restoration is a cinch...

Did I mention it supports backup archiving as well, if you want duplicate copies for tapes being shipped off-site...

Yes, Bacula is the only real solution out there that isn't going to cost you an arm and a leg, and that allows you to switch easily between any backup medium.

Except for good old tar, which is present on all systems.

Most people are probably not aware that tar can create split (multi-volume) archives. Add the following options to tar: -M -L <max-size-in-KiB-per-volume> -F myscript.sh, where myscript.sh supplies the name to use for the next tar file in the series. It can be as easy as a loop that checks whether a tar file already exists and returns the next hooked-up volume where it doesn't. Or it could even unmount the current volume and automount the next one for you, or display a dialogue telling you to replace the drive.

One advantage is that you can easily extract from just one of the tar files; you don't need all of them, or the first-and-last like with most backup systems. Each tar file is valid on its own; at most you need two tar files to extract any given file (when it spans a volume boundary), and usually just one.

One caveat: GNU tar will not combine multi-volume mode with its built-in compression (-z/-j), so compress the files beforehand, or compress each volume after it is written.
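A runnable miniature of the multi-volume mechanism described above (GNU tar; the 100 KiB volume size and file names are purely illustrative — in real use each new volume would be a freshly attached drive):

```shell
#!/bin/bash
# Demonstrate GNU tar multi-volume splitting with a deliberately tiny volume size.
cd "$(mktemp -d)"
mkdir data
dd if=/dev/zero of=data/big.bin bs=1024 count=300 2>/dev/null  # ~300 KiB payload

# Volume-change script: tar calls this when a volume fills; we hand back the
# next archive name on the file descriptor tar passes in $TAR_FD.
cat > next-vol.sh <<'EOF'
#!/bin/bash
echo "backup-vol${TAR_VOLUME}.tar" >&"$TAR_FD"
EOF
chmod +x next-vol.sh

# -M: multi-volume, -L: volume size in KiB, -F: script run between volumes
tar -c -M -L 100 -F ./next-vol.sh -f backup-vol1.tar data
ls backup-vol*.tar   # several volumes; any single one can be listed on its own
```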

Assuming you're not worried about backup speed, you could use a four-bay external hard-drive enclosure in combination with rsync and LVM on any Linux variety. I don't know if they all do, but the MediaSonic HF2-SU3S2 supports 3TB hard drives per bay, which means that two of them could be used in conjunction to provide 24TB of backup storage. Since you can make one large volume out of the full 24TB using LVM, you could even use something like dd to write to the disk (rsync with the archive option would be a be…

If not a RAID (those tend to fail just as hard), get at least two, possibly three copies of each file on separate drives. The last thing you want is to wait for a RAID to rebuild and watch it fail during recovery, with your only copy of a file on it.

For that much data you want a RAID, since drives tend to fail if left sitting on the shelf, and they also tend to fail (for different reasons) if they are kept spinning. Basically: buy a RAID enclosure, insert drives so it looks like one giant drive, then copy files. For 24TB you can use eight 4TB drives in a 6+2 RAID-6 setup. Then if any two of the drives fail you can still recover the data.
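For reference, the software-RAID equivalent of that enclosure on Linux is a one-liner with mdadm. This is a sketch only, not the enclosure route the comment describes: the device names are assumptions, and it needs root and real (empty!) disks, so don't run it blindly:

```shell
# Sketch: 6+2 RAID-6 across eight 4TB drives (assumed to be /dev/sdb..sdi).
mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]
mkfs.ext4 /dev/md0                 # any two drives can fail without data loss
mount /dev/md0 /mnt/backup
```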

Yeah... though I suspect with the price premium for 4TB drives (they're huge) and the cost of an 8-port RAID6-capable RAID card, you're considerably above the budget he was going for. If this is like "projects" or something, I'd probably suggest the human archiving method: split your live disk into three areas, "work in progress", "to archive" and "archive". Your WIP you back up completely every time; your "to archive" you add to the latest archive disk (plain, no RAID) and make an index of it so you c…

More to the point: do what the parent post said, but use something like FreeBSD or Solaris and ZFS with a raidz3 setup (like RAID-6 but with a third parity disk), which gives you block-level dedup, snapshotting, compression, encryption, etc.

As mentioned already, RAID is not a backup solution. While it will likely work fine for a while, the risk [datamation.com] of a catastrophic failure rises as drive capacity increases. From the linked article:

With a twelve-terabyte array, the chances of complete data loss during a resilver operation begin to approach one hundred percent - meaning that RAID 5 has no functionality whatsoever in that case. There is always a chance of survival, but it is very low.

Granted, this is talking about RAID 5, so let's naively assume that doubling the parity disks for RAID 6 will halve the risk... but then since we're trying to duplicate 24 terabytes instead of twelve, we can also assume the risk doubles again, and we're back to being practically guaranteed a failure.

Bottom line is that 24 terabytes is still a huge amount of data. There is no reliable solution I can think of for backing it all up that will be cheap. At that point, you're looking at file-level redundancy managed by a backup manager like Backup Exec (or whatever you prefer) with the data split across a dozen drives. As also mentioned already, the problem becomes much easier if you're able to reduce that volume of data somewhat.

Quite the contrary, and that's my point. The errors here aren't just "let's try again" failures. They're unrecoverable, final, data-is-gone-forever errors, and the chances of encountering one are very high with so much data. Resilvering such a large array is practically impossible (as described in the article I linked to). Without resilvering, and with blocks spread among disks, losing one disk means you've lost a little bit of everything, so all your data is corrupt, rather than just the fraction that was…

You didn't read what I said. Yes, ZFS + snapshots, but you also need at least Sun Cluster replication and tape backup. ZFS + snapshots doesn't save you from fires, floods, software bugs and ill will. It does save you from idiots and disk failure, though.

Why not tape, a backup RAID, a SAN or some other dedicated backup hardware solution? 24TB is well within the range where a professional solution is warranted. Given a hard disk size of ~1TB, making a single backup to 24 disks isn't a backup; it's throwing data in a garbage can. More than likely at least one of those disks will die before its time.

Yup, spool to tape. Get an SDLT600 tape cabinet and call it done. If you get a 52-tape robot cabinet you will have space to hold not only a complete backup but a second full backup plus incrementals, all running automatically. Plus it has the highest reliability.

And anyone whining about the cost: if your 24TB of data is not worth that much, then why are you bothering to back it up?

No kidding. For $2400, you get 24 1TB HDDs and a bookkeeping nightmare if you ever actually resort to the "backup." For $3k, you get a network-ready tape autoloader with 50-100TB capacity and easy access through any number of highly refined backup and recovery systems.

Now, if the USB requirement is because that's the only way to access the files you want to steal from an employer or government agency, then the time required to transfer across the USB will almost guarantee you get caught. Even over the weekend. You should come up with a different method for extracting the data.

However, to have a reasonable retrieval rate (going through 24TB of data will take some days over USB2), you'd better split the dataset into multiple smaller sets. That also has the advantage that if one disk crashes (and consumer-grade USB disks will crash!), you don't lose your entire dataset.

For that reason (disk failure), do not use a Linux disk-spanning feature. A file system is lost when one of the disks it writes on is lost, unless you use a feature that can handle lost disks (RAID/raidz).

And last but not least: test your backup. I have seen cheap USB interfaces fail to write the data to disk without any good error message. All looks OK until you retrieve the data and find some files are corrupted.

USB 2.0 provides 480Mbps of (theoretical) bandwidth. So unless you go Gigabit all over your network (not unreasonable), you won't beat it with a NAS. Even then, it's only 1-and-a-bit times as fast as USB working flat-out (and the difference being if you have multiple USB busses, you can get multiple drives working at once). And USB 3.0 would beat it again. And 10Gb between the client and a server is an expensive network to deploy still.

You don't need a Gigabit connection everywhere, just on your computer and the NAS directly connected to your computer.

USB2 is not a very good option. For some reason, I've been getting poor performance from Linux with storage mounted via USB. Your best bet is eSATA. If you can't install eSATA but have a Gigabit Ethernet connection, then go that route. USB2 is the connection of last resort when talking about backing up 24TB.

USB 2.0 provides 480Mbps of (theoretical) bandwidth. So unless you go Gigabit all over your network (not unreasonable), you won't beat it with a NAS. Even then, it's only 1-and-a-bit times as fast as USB working flat-out (and the difference being if you have multiple USB busses, you can get multiple drives working at once).

The 480Mbps is nowhere near what you will see in practice, unlike network speeds, which come far closer to the rated maximum. Most USB drives I've seen top out somewhere between 25 and 30MByte/s, and if there are no other bottlenecks it isn't unusual to see 100MByte/s from a gigabit switched network. My main desktop pulls things from the fileserver at around 80MByte/s, which is as fast as local reads tend to be on that array. So you are right about 100Mbit networks: there the network is the bottleneck, not USB, bu…
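Those real-world figures translate into very different total times for 24TB; a quick check with awk (30 MB/s for USB2 and 100 MB/s for gigabit, both sustained, per the numbers above):

```shell
# Wall-clock time to move 24 TB at typical sustained rates.
for rate in 30 100; do               # MB/s: USB2 real-world vs. gigabit LAN
  awk -v r="$rate" 'BEGIN { printf "%d MB/s: %.1f days\n", r, 24e6 / r / 86400 }'
done
```

At ~30 MB/s the copy takes over nine days; at gigabit speed it drops to under three.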

What you're attempting isn't easy; it's actually difficult. Buy a cheap and big refurbished workstation or rackmount server, install a few extra SATA controllers and maybe a new power supply, hook up 12 2TB drives, install Debian, check out LVM, and you're all set.

Messing around with 12-24 external HDDs and their power supplies is a big hassle and asking for trouble. Don't do it. Do seriously consider building your own NAS. You'll be thankful in the end, and it won't take much longer; it might even go faster and be cheaper if you can get the parts quickly.

Way to redefine the problem instead of working within the specifications.

Perhaps:

1. The poster ALREADY has a NAS and wants an airgapped or even offsite/offline backup.

2. External HDDs are fast, common, reasonably cheap, and do not have a single point of failure (e.g., the tape drive in many suggested alternatives).

I'm interested in this question. I use this general setup, but on a smaller scale. I cannot put a NAS in a safety deposit box. I cannot ensure that my "backup" NAS would not be drowned in a flood, burned in a fire, fried by a lightning strike...

Let's pretend the poster is not an idiot, and answer the actual question. If he has 24TB of data, IT'S ALREADY ON DAS/NAS. Geesh.

Don't assume he was the one that created his current storage solution. It could be a turnkey solution that he purchased, like one of those movie storage devices we read about on slashdot earlier this year.

If he installed his current storage configuration himself, then why did he need to ask this question on Slashdot? I don't see any particularly bad answers, and no one is insulting…

Because downloading 3.6TB to restore from a backup for just one day is pretty ridiculous for someone on home broadband?

Backup to external servers is ridiculous for anyone without university-sized access to the net. Hell, the school I work for tries to back up 10GB to a remote server each night, and it often fails because it takes too long (and we're only allowed to do that because we're a school; the limits for even business use on the same connection are about 100GB a month).

USB is for a second working copy. Backups should also ensure durability of the copy, while USB HDDs have a shorter lifespan than a normal HDD, which in turn has a shorter lifespan than tape, the usual medium for durable backups.

Backup tapes were designed precisely for the problem you have. LTO-5 tapes are about 1.5TB, if I remember right. Stored correctly, they shouldn't give any problems when you come to retrieve whatever is backed up. Most archiving efforts use backup tape, and they can't all be wrong. :)

Actually handling all those tapes and recovering data from them is very expensive in manpower and time, and can be very awkward. Those tapes, and tape drives, are also _expensive_. They're useful for sites that require secure or encrypted off-site storage, but for most environments today they are pointless. Easily detachable physical storage has become very inexpensive, far more economical, and far less vulnerable to the mishandling of SCSI connection…

Actually, for a data set this large it will probably work out only very slightly more expensive - and the benefit to be gained is worth it IMHO (in speed if nothing else - USB disks are *slow* and eat a lot of CPU). I live in the UK so I'll work in GBP. I think US prices are likely to be cheaper but the relative sizes will be similar.

I'd figure around ~£1100 for the drive and SAS interface, plus £500-700 for 24TB worth of media. Throw in an extra 2TB drive to spool to before you write to tape as well.

The slow disk is why you use rsync or other efficient mirroring technologies. The tapes have a limited lifespan, they require significant maintenance, and they have been prone to far too many mechanical failures and expensive downtime in my experience. The disks can actually be simultaneously connected for casual "read" access with a reasonable USB hub and possibly an additional USB card.

You've also left out the cost of recovery time for users. Swapping tapes to recover arbitrary files is rather awkward…

Whether tape or disk is appropriate really depends what you are intending to use the backup for and how important your data is. You might even choose to use a mixture of the two.

If it's your only backup, I would suggest that it's not wise to leave it permanently online in the way you suggest; that leaves you open to any number of potential issues which your backup is supposed to protect you from (OS bug, misconfiguration, lightning strike, power failure, overheating,...). Tape libraries have the same issue.

I have just seen "PAR" mentioned a couple of times here on Slashdot; I haven't used it, but it seems great for this: http://en.wikipedia.org/wiki/Parchive [wikipedia.org]. You need enough redundancy to allow one USB drive to fail. And I would rather get a SATA bay and use "internal" drives than deal with external USB drives. Get "green" drives; they are slow but cheap.
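If you go the Parchive route, the usual tool is par2 from par2cmdline. A sketch only, assuming that tool is installed and that ~10% redundancy covers your expected loss (size -r to at least one drive's share of the set); the paths are illustrative:

```shell
# Sketch (assumes par2cmdline is installed): create ~10% recovery volumes
# for a directory tree, then verify/repair after damage.
cd /data/archive
par2 create -r10 archive.par2 *
par2 verify archive.par2          # check integrity against the recovery set
par2 repair archive.par2          # reconstruct damaged or missing files
```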

A 24TB NAS is not very hard to assemble. Relatively cheap, and it basically transfers data at gigabit speed, assuming you populate it with fast disks. Set one up with RAID and you're away. Personally, I would do it with a low-end server and a big-ass RAID array. That way, you can really control its behaviour via the OS. Linux is perfect for this kind of thing.

Get an old computer... anything will work, really. You have to know someone that has one lying in their basement. Plug your drives into that, share the drives on your network, use any general backup software and sequentially back up what you need over the network. Now it will do it overnight and you really don't care how long it takes; it can even do it every night. If you want it safe from fire and such, build a box out of 2x4s and drywall scraps from Home Depot. Make it 5 sheets thick and it'll withstand any house fire you could possibly have.

build a box out of 2x4s and drywall scraps from Home Depot. Make it 5 sheets thick and it'll withstand any house fire you could possibly have

I find that statement suspect. I am not saying you are wrong but extraordinary claims require extraordinary evidence.

I have seen some pretty nasty house fires, the kind where the fire department sprays water on the neighbors' houses to keep them from catching rather than trying to do anything about the one that is actually burning. With all the modern synthetic materials in furniture, carpeting, and other flooring, a house fire can hit 600 degrees and stay that way for hours.

Sometimes the easiest way to duplicate (back up) data is to simply duplicate the hardware it's already on. If it's on a 16-disk (x 2TB) NAS system, build another one. If it's on tape, buy more tapes. If it's on random HDDs scattered all over the place, then you have bigger problems to deal with first (like building a NAS box)!

I do things like this all the time with a data set about half that size, ~12TB. You didn't say anything about what the data is, but from the request and the fact you mentioned USB, I would gather this is your typical warez hoard: mp3/flac, mkv, apps, and also a personal picture and video collection of family.

Here is a checklist I would execute, similar to mine. I find the most reliable way to keep your data over the years is by following a checklist or procedure and choosing when to move to the next storage…

# rsync -avz /this /that. Split your directories to correspond to the sizes of your drives. If on Linux, run smartctl -H /dev/sdX to check your disk health, and if possible, take the HDDs out of their USB enclosures and connect them directly to SATA for faster transfer speeds. These drives will, nine times out of ten, mount just like a normal drive, since usually they are just a normal drive housed in an enclosure.

Plug all the disks into a USB hub. Ensure that each one has a unique volume name, e.g. bak1, bak2...
The old-skool way is to make a little tar script and use volume spanning.
Otherwise, configure all the disks as a single JBOD and run DejaDup.

Populate it with 4TB drives and create two RAID5 arrays (or one RAID6), and you've got 24 or 28TB of backup space, without having to change drives or break your backup into smaller chunks.

But really, your backup methodology is broken; you need to organize the data into manageable chunks, because aside from a large dedicated backup server/SAN, there is no reliable (don't tell me tape is reliable) backup solution for such a large quantity of data in a single chunk.

What I do for backups: in my 24-bay server I have eight large drives in a (HARDWARE) RAID5 array (were 4TB drives available at the time, I'd have gone RAID6) and rsync the virtualized server contents to that, then archive them into tarballs and send copies across the LAN to another server that is running (HARDWARE) RAID5 as well. Every once in a while I back up the critical data (source, scripts, financial data, production web sites, /etc, and so forth, but not the program binaries nor system binaries, which are easily recreated or reinstalled, respectively) to optical media and external hard drives.

So what I have, in summary, is:

* Massive server with a backup array separate from the production array
* Separate backup server running another array (again, using a quality HARDWARE RAID controller. Safeguard your data and don't bother with Intel, Adaptec, Promise, or Highpoint "hybrid" RAID)
* Periodic backups of non-recreatable data to USB drives and optical media that are moved off-site.

It's not mentioned by the author, so I might be assuming too much, but if he's trying to write to USB drives as opposed to a RAID of some sort, I figured he wanted to be able to read the drives individually, perhaps on a different machine without a network connection between them.

The Drobo won't allow that; the file system is spread across all the drives.

I guess it kind of depends on what the author needs to do with the drives when he's finished writing to them.

We did this exact thing using WD Green drives for our 18TB backup problem. We got two of 'em, planning on using their built-in rsync for onsite/off-siting the data. Unfortunately, the units never broke 1MB/s transfer, and no amount of work with Drobo yielded reliably faster performance. Both of our units are now sitting unused ($2500 each!), and we put the drives into an 8-bay RAID-50 USB3 enclosure. The new unit runs about 150x faster and cost $400 (prices are for enclosures only; drives were additional).

Most disappointing was Drobo's support: they just seemed to shrug a lot, and were hyper-aggressive about closing trouble tickets.

I have a Synology NAS and I'm very pleased with it, though I don't have anywhere near the volume of data the OP has. One thing with a NAS is that you'll be subject to the network's available bandwidth and, depending on your setup, this could make backing up lots of data pretty darn tedious. And it might annoy the admin (and other users). So while a decent portable RAID might be the better option, it might be better to find one that just plugs in rather than using the network. You might find one that can be set up to use…

The 200GB range drives in my main server have been trundling along for many years while I have a pile of 0.5-2TB hard drives I need to go through and get warrantied (three of them Caviar blacks). Not impressed with the big drives.

Are you REALLY sure that you want to use USB HDDs? The cost savings of using a box of HDDs may well be offset by the hassle in finding the backup software, the manual labor of swapping them, finding the correct drive to retrieve a certain file, etc.

How about a pair of Synology DS1512+ NASes? In addition to getting all of the storage online at all times, you get RAID support, etc.

No reason why they can't all be attached at once. With 3TB disks and 8 USB3 ports, you'd only need to plug them all in to do the backup, then remove them all to take them offsite when the backup is done.

A few portable NASes holding 4 disks each might be a better option, but don't exclude USB for its simplicity.

Yes. The above tar command is really from a time when cp did not have -r and -p options (and still may not on some systems, so it's worth knowing). OTOH, you can add the z option (compress) if you're doing something networky (though you'll probably want to throw in netcat or ssh too in that case). Of course, if you're doing that, rsync is probably the better option if available, and it leads to some interesting backup options going forward.

I mean, if I copied 200 gigs across 3 drives in a JBOD RAID, could I plug just one drive in to access the information on another machine? Suppose my laptop only has 2 USB ports and I do not have a hub, plus I'm running a different OS; does this mean I can't look for information on the set?

I have never used JBOD for RAID. I have, however, used regular mirrored and striped RAIDs with and without fault tolerance (RAID 5 and 10, or a mirrored stripe, for instance) and know this can b…

Seems like a very bad idea to me. You'll have trouble creating a JBOD device without connecting all the drives simultaneously. Also, you're basically increasing the chance that the entire JBOD volume will be broken as the number of drives goes up. If you've got one drive failing, you'll be lucky to get any data back at all.

To my mind, Bacula would be a good choice, as you can set up virtual tapes that correspond to the drives, and you can set the backup to wait for the operator to swap over the drive…