Posted by CmdrTaco on Wednesday March 07, 2007 @11:00AM from the hey-look-it's-chris dept.

eldavojohn writes "Google is transferring data the old-fashioned way — by mailing hard drive arrays around to collect information and then sending copies to other institutions, all in the name of science and education. From the article: 'The program is currently informal and not open to the general public. Google either approaches bodies that it knows have large data sets or is contacted by scientists themselves. One of the largest data sets copied and distributed was data from the Hubble telescope — 120 terabytes of data. One terabyte is equivalent to 1,000 gigabytes. Mr. DiBona said he hoped that Google could one day make the data available to the public.'"

The 'morons' are the IEEE and their standards are recognised by the ISO. What you consider 'established usage' was not established enough to be accepted by the hard drive manufacturers, which was the primary place where the prefixes were in use. We're not going to go back to the inconsistent, ambiguous de facto standards just because the new ones annoy you. Do you also reject the redefinition of the foot to a standard length?

Yea, yea, yea. And you also believe a hacker isn't someone who maliciously breaks into computer systems, just a curious, innocent person, right... crackers are the criminals! Give it up. The general public is never going to adopt "tebibyte" into the language because terabyte sounds much more fucking cool.

WHO CARES?!? I have worked with mathematicians who did not squabble over these terms, so why the hell are we?!? My mother, who can hardly turn a computer on, knows damn well that 1000 megabytes is roughly 1 gigabyte. Now let's get back to the topic. You'd think Google would have some brilliant way to push a terabyte through the "tubes" instead of just mailing drives. How archaic.

That's not the problem. The problem is that when you buy an X GB drive, you don't know what you're getting until you read the fine print. Some manufacturers ship different sizes of the same labeled drive, differing only in whether it's "1 GB = 1,000,000 KB" or "1 GB = 1,000,000,000 B".

So if you buy a set for RAID one day, the next day they may no longer stock the drive you need, and your vital information is put at unnecessary risk because... what, because the hard drive manufacturers can't decide whether they want to screw you out of 7% (using 1 GB = 1 billion bytes) or 5% (using 1 GB = 1 million kilobytes, which they curiously agree equals 1.024 billion bytes; what a coincidence that the KB is 2^10 but the GB is 10^9)?

Think about that for a moment before you lambast the argument for proper labeling of drives.
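
To put numbers on it, here's a quick back-of-the-envelope check of those two percentages in Python (just a sketch; the two definitions are the ones quoted above):

    # Two ways manufacturers have defined "1 GB", vs. the binary gigabyte
    binary_gb = 2 ** 30        # 1,073,741,824 bytes
    decimal_gb = 10 ** 9       # "1 GB = 1,000,000,000 B"
    mixed_gb = 10 ** 6 * 1024  # "1 GB = 1,000,000 KB", with 1 KB = 1024 B

    for name, size in [("decimal", decimal_gb), ("mixed", mixed_gb)]:
        shortfall = (1 - size / binary_gb) * 100
        print(f"{name}: {size:,} bytes, {shortfall:.1f}% short of 2^30")
    # decimal: 1,000,000,000 bytes, 6.9% short of 2^30
    # mixed: 1,024,000,000 bytes, 4.6% short of 2^30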

I'm old and interested enough to know what REALLY happened: first, as taught in every schoolbook and computer manual of the era (see Apple, Amiga, Microsoft, Commodore), 1024 bytes = 1 kilobyte, 1024 kilobytes = 1 megabyte, and so on, because computers could only calculate in powers of 2 (1 and 0), and 20 MB (20,480 kilobytes) was about the largest hard drive you could get.

A Kilobyte is 1024 (2^10) bytes. A Megabyte is 1024 Kilobytes or 1,048,576 bytes (2^20), and a Gigabyte is 1024 Megabytes or 1,073,741,824 bytes (2^30).

The annoying part for me today is that flash memory comes in powers of two (64 MB, 128 MB, 256 MB, 512 MB, etc.), whether for cameras or USB thumb drives, yet the units are metric, not binary (the packaging states 1 MB = 1 million bytes).

When I see a power of 2 next to the units, I expect the units to be in a power of 2 too.

This is absolutely the most cost-effective way of transferring large amounts of data like this. If you do the calculations on terabyte-size files, sneakernet (or FedEx net) is actually faster and less expensive. We also went to one of Jim Gray's seminars when he was here giving an Organick Memorial Lecture, and he made an incredibly compelling demonstration using a variety of data types. We ended up talking with him for some time afterward about new projects we are starting that will also generate terabytes of data, and his suggestion was to pass applications rather than data, which was interesting.

This is becoming more and more the norm in scientific research and Google's work is quite welcome.
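
For a rough sense of why the math favors shipping, here's a sketch with assumed but plausible numbers (the article's 120 TB data set, an overnight courier, and a few link speeds):

    # Effective bandwidth of shipping 120 TB overnight vs. sending it over a link
    data_bits = 120e12 * 8   # 120 TB in bits
    ship_secs = 24 * 3600    # assume an overnight courier: ~1 day door to door
    print(f"sneakernet: {data_bits / ship_secs / 1e9:.1f} Gb/s effective")

    for name, rate in [("OC-3 (155 Mb/s)", 155e6), ("GigE", 1e9), ("10 GigE", 10e9)]:
        print(f"{name}: {data_bits / rate / 86400:.1f} days")
    # sneakernet: 11.1 Gb/s effective
    # OC-3 (155 Mb/s): 71.7 days
    # GigE: 11.1 days
    # 10 GigE: 1.1 days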

FedEx delivered what appeared to be a ton of broken office chairs to Google headquarters this morning. When asked for the sender's ID, the severely beaten FedEx courier would only reply that the sender wished to remain anonymous.

Here's what happened when I FedExed my RMA to Newegg, packed very carefully. Note the bent motherboard - I didn't even know you could do that. The good news is that FedEx paid part of my claim... they paid $100 plus the $8.33 that the FedEx store charged me to fax in the claim forms. The bad news is that they did not refund my original shipping or pay more than $100 on the over $280 of damage that they did. It also took about 4 hours of phone calls just to convince FedEx that I was not the seller.

As a former UPS employee (I worked as a package handler, the guy that beats the shit out of your boxes as he loads them on the truck), I will never ship anything of value without paying extra for the insurance. When you do that, a couple of things happen:

the item goes into a big bag (by itself, not mixed with other items) with red/white stripes, so employees know not to mess with it

it gets hand-carted to the destination truck, and is the last thing to be loaded, and first unloaded

only seasoned workers ever touch your package, and generally care about the state that it's in

Or maybe you are just lucky. I don't ship that often, and FedEx has to date managed only once or twice to get a package through without undue delay or damage. As for the $100 they paid this time, I had never before had any of their insurance honored, so I wasn't about to pay for it again on the off chance that it'd work out for once. You can hardly hold the end result against my decision, obviously made without knowledge of the outcome.

Besides, insurance is meant to cover damage due to normal mishandling, such as dropping a box by mistake, not the kind of (at least nearly) intentional damage that must have been involved in my case. Or maybe you have a theory of how my box got squashed that badly in the normal course of FedEx's business.

I still don't know where you get that idea. Insurance is meant to handle any kind of damage, including being completely destroyed in plane crashes, car accidents, train derailments, theft, loss, and anything else that can happen in transit.

What I mean is that being run over by a truck is not within the realm of what a person buying FedEx insurance contemplates. When you ship something, you can assume a certain level of negligence is possible - such as dropping your package from a height of 4 feet or setting a package that leaks liquids on top of it. You don't normally think that FedEx will be so careless as to run your package over with a truck.

I use the word "intentional" because it wouldn't surprise me if the kind people at my local FedEx facility had done it on purpose.

I'm with you, although I have seen FedEx and UPS both damage a lot of packages. I think that their automated systems are a lot rougher on packages than Airborne Express / DHL or the USPS's Parcel Post. But if you don't insure it, you're accepting that risk when you give them the goods. A while back I bought a radio-controlled airplane, pre-assembled. It came in a big box, most of which contained the wing. So it was fairly fragile, but well packed, in tri-wall. Got it sent UPS, with insurance for the full value.

The insurance remedy was to return it to the origination address and ask to see an original purchase receipt before awarding the insurance claim.

Sorry to nitpick, but this scam has been around for ages - you broke something, oh no! I'll send it to myself and pretend UPS did it. Hell, I even saw it in Seinfeld. Not that you were doing this, but what you tried is pretty suspicious to an outside observer.

They need SOME proof of value, or even that the box was actually full, to fight this type of fraud.

Customers in general ought not to be held to know FedEx's corporate structure. I did indeed use the Newegg-provided label. As to my prior shipment broken by UPS, of course I realize that there is the potential for scams. I was shipping Christmas presents to myself because it was cheaper and, on average, safer than trying to check them on my return flight. See my other replies in this thread for more on the FedEx $100 insurance situation.

Customers in general ought not to be held to know FedEx's corporate structure.

I don't know if, in this age, this is wise. With so many corporations buying up major parts of our lives like food, communications, salaries, and transportation, I would challenge you to take a look at the structure of the different entities that affect you daily. The unfortunate fact is that every decision you make needs to be researched to find the most appropriate course of action based on who is behind the marketing.

I would agree with you, except that I don't think the average consumer should be held to that level of sophistication. This is mostly a cheapest-cost-avoider issue, for me. Who can more efficiently discover the relevant information? Clearly, the answer here is FedEx.

I remember an article I read on this, I think back in the year 2000. There was a research scientist who built a standardized platform (that is to say, a specific PC case with a certain number of hard drive bays and certain network cards) so that he could exchange data with other universities. They would fill the networked PC with data, and they could ship it to any of the participating projects, knowing that they'd get back the same hardware in return.

Yeah, there have been a number of folks using variations on this theme for a while now. It's been interesting that network performance really has not followed the same performance curve as storage and CPU throughput. Add to that the growing amount of data being pushed through "consumer" pipes as people get broadband and hit sources such as YouTube and company, and you have the makings of a bandwidth crunch. This, of course, is the reason for separate academic and government Internet paths.

In fact, at some universities engaging in data intensive projects, it is not uncommon for them to occupy the entire bandwidth of the university in off hours to transfer data around the country to various collaborators.

Even using the full bandwidth between Internet2-connected unis, it would still take 2-3+ days to transfer 250 TB of data.
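
That figure checks out if "Tb" here means terabytes; a quick sanity check (assuming a dedicated 10 Gb/s path, as mentioned below):

    # 250 terabytes over a dedicated 10 Gb/s path
    days = 250e12 * 8 / 10e9 / 86400
    print(days)  # ~2.3 days (250 terabits would take only ~7 hours)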

10 Gb/s is close to the max you can do with one frequency. That will all change once they start pumping multiple colors down their fiber; their bandwidth will explode.

Internet bandwidth hasn't kept up, but local bandwidth definitely has. My network throughput is more than capable of transmitting data faster than my hard drives are able to write it. And I wouldn't even agree about the net bandwidth. I have a 15 Mb/s connection where I used to have a 56k.

We have been sending two DVDs, with about 6-8 GB of data, around every month for updates. Now we are trying rsync, which in our view has been more convenient.

The article and the GP are about sending large amounts of data, as in terabytes. In this discussion, 8 GB is tiny, and is easily downloaded much faster than even express mail. Besides, rsync won't really help if all your data is unique (such as astronomical data). Rsync really helps when very little of your data set changes between updates, such as backups.
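
The rough arithmetic behind that, as a sketch (the 8 GB figure is the GP's; the link speed and change fraction are assumptions):

    # Full copy vs. rsync-style delta for a monthly 8 GB update
    data_bytes = 8e9
    link = 10e6 / 8   # assume a 10 Mb/s link, in bytes/s
    changed = 0.05    # assume 5% of the data changed since last month

    print(f"full copy: {data_bytes / link / 3600:.1f} h")             # ~1.8 h
    print(f"delta only: {data_bytes * changed / link / 60:.1f} min")  # ~5.3 min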

The page you linked to had a smart idea. Rather than just sending the raw disks, create some sort of architecture inside to allow rapid transmission of the data from the vehicle on arrival. I could see specialized vehicles, hardened against accidents, with an inverter to power the drives and external fiber-optic ports hooked up to massive, high-speed RAID arrays, to rapidly dump the contents to a system at the destination and upload content for the next stop.

As always the people of the world own the data. The copyright holders are, however, given a short term monopoly on making copies of it, with certain exceptions.

I hope Google isn't going to claim it does, like they seem to want to with the old books they're scanning.

Google has not, as far as I know, claimed "ownership" or even copyright on anything they've scanned. They have, however, created their own database of metadata about the works, which they use to enable people to more easily find specific items in the original data.

Every time you download a Hubble picture, will it have a Google watermark?

Umm, maybe. Why do I care if they add watermarks to it? If they are in the way...

The ownership of data is presumably a case-by-case thing that depends on what the data is and how it was acquired. For example, Google does not own the copyright on out-of-copyright books that it scans (nobody does, by definition). At best, it might own the copyright on the scan that it did, but that's really unlikely: copyright protects creative expression, and a straight scan doesn't add any.

However, they probably have some rights under unfair competition law, because they have gone through a lot of work.

So, if Google takes the raw data and does that color assignment itself, well, the result is theirs.

I'm not so sure that the result is theirs, necessarily. They'd need to properly attribute it. Many science archives have rules about how to properly attribute their work.

Don't get me wrong -- many of the scientists want people to use their data (e.g., see The Astronomer's Data Manifesto [ivoa.net]), but they also want to know who's using it, because that's how they justify the value of their projects and the costs incurred.

Attribution is different from copyright. For example, say you have a novel scientific idea which you write about in some scientific journal, and that I read your article and publish my own article, using your idea without attribution. Now, what I've done would reasonably upset you, but there is no law (at least in the US) that requires me to attribute your ideas to you. In fact, under those facts, I completely own the copyright in my article and you have no legal remedy. Now, there may be repercussions, but not legal ones.

I really don't like the idea of a "private" (yes, I know it's publicly traded) company having control of this public information.

You do know many government agencies already outsource IT and other projects to "private" companies who have all this government generated information, right?

The data was paid for by taxpayers. Google will inevitably make money from this; otherwise they wouldn't be doing it.

Yeah, and right now Microsoft makes money off of selling them the OS and office suite. This isn't a question of whether the government will be paying for the ability of its employees to do word processing; it is just a matter of how much and which companies will be getting the money. I don't trust Google any less than I do MS, which currently supplies that software.

So, what you're saying is that this public data shouldn't be copied? It's not like they're taking all of the data and destroying the originals.

There's destroying and then there's locking away. There are people pushing for laws that say one person's copy of a public domain work is copyrighted by that person for the typical term, and that no one else may make a copy from that copy without permission. It's specifically about granting broadcasters copyright over their rebroadcast of a public domain work, but it's the same principle.

Don't say I didn't warn you guys about this "don't be evil" thing. First they start swapping TB for "academic" purposes, then maybe some avian influenza shows up in some apartments around Mountain View, and the next thing you know, there'll be a smallpox outbreak and we will coincidentally receive advertisements on Gmail that we can buy the cure for a few thousand dollars from one of their AdSense "partners."

we use binary units because formatted capacity is measured in binary units.

It seems you haven't read my previous post I was linking to. Please do. :) Your affirmation is wrong. The correct affirmation would be "we use binary units because some OSes report formatted capacity in binary units".

Proof that I've read your post in its entirety is that I was going to write "MS Windows" (like I did in the aforementioned post) instead of "some OSes". :) My server at home runs FreeBSD; I launched fdisk and it reports size in "Meg", neither MB nor MiB. So I can't say. :) What command did you enter?

Just because they want to help and release lots of open source software doesn't mean they have to release the family jewels.

If the average Slashdotter applied the same flawed logic to Microsoft, you'd have to say they're big open source sponsors too. After all, Microsoft has released GB of free source code for utilities, etc. for decades. Sure, the code mostly only works with their proprietary "family jewels" (the OS and development tools), but why quibble?

I've been thinking that the only home use for lots of HD storage space would be A/V. Now, I guess when 10 PB of HD are $100-1120, we'll be able to get copies of these 120 TB of Hubble data, or TBs of other datasets, to fill up those future home PB HDs. One day we'll need home exabyte HDs to store and play around with public PB datasets. I can only hope that bandwidth can keep up. How long would it take to transfer a 120 TB BitTorrent file over either cable or DSL?
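
Answering that last question with rough consumer speeds of the day (assumed numbers; real sustained throughput would be lower still):

    # 120 TB over typical consumer links
    data_bits = 120e12 * 8
    for name, rate in [("15 Mb/s cable", 15e6), ("1.5 Mb/s DSL", 1.5e6)]:
        years = data_bits / rate / (86400 * 365)
        print(f"{name}: {years:.1f} years")
    # 15 Mb/s cable: 2.0 years
    # 1.5 Mb/s DSL: 20.3 years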

I understand the whole "HDD w/ a common filesystem = more compatibility" thing, but wouldn't it be easier to simply send along some tapes of a type appropriate to the format/type that the scientific institution uses? LTO-3 can do 800 GB compressed, SDLT can do up to 600... and neither is susceptible to data loss when it gets bounced too hard by FedEx/UPS/DHL/whatever. (Plus it would make for a lighter package, wouldn't require some poor IT schmuck to disassemble a server or wait forever for USB to transfer all of it, etc...)

wouldn't it be easier to simply send along some tapes of a type appropriate to the format/type that the scientific institution uses?

There are basically two reasons one would choose to use HDDs over tapes: compatibility and price.

Compatibility: Sure, one scientific institution may have standardized on a specific type of tape, but what about all the rest? Pretty much everyone in the world can read a standard HDD formatted with a well-known filesystem.

The reason for not using tapes is exactly the compression: compressing the data and then streaming it to the tape takes a lot of time, and that same process has to be repeated on the other end.

Besides, using HDDs for transfer means immediate access to the same data on the other end, with speeds that are unmatched by tape backup systems. It might also be worth noting that data sets this large are usually stored on large RAID systems, like this one from LSI Logic [lsilogic.com].
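
A rough comparison for the article's 120 TB data set (a sketch; the tape figures are LTO-3's commonly cited native specs, and the drive capacity, throughput, and parallelism are assumptions):

    import math

    # Media count and sequential write time for 120 TB
    data_gb = 120_000
    tape_gb, tape_mb_s = 400, 80   # LTO-3 native capacity and native speed
    disk_gb, disk_mb_s = 500, 60   # assume 500 GB drives at ~60 MB/s sustained

    print(math.ceil(data_gb / tape_gb), "tapes vs.",
          math.ceil(data_gb / disk_gb), "disks")   # 300 tapes vs. 240 disks
    print(f"{data_gb * 1000 / tape_mb_s / 3600:.0f} h to one tape drive")  # ~417 h
    print(f"{data_gb * 1000 / disk_mb_s / 3600 / 10:.0f} h across 10 parallel disks")  # ~56 h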

The "TeraScale SneakerNet" paper posted earlier [arxiv.org] anticipates and answers that. They ship a fully assembled computer with processor, RAM, OS and network interface. Plug it in to the wall, plug it in to the network and assuming you had previously agreed on a networking protocol, you're rolling as soon as it boots! No restoration, no decompressing, immediate access to the data.

Does anyone have a Linux distro for this specific purpose? Preferably tiny enough to fit onto a USB key and optimized for bandwidth.

Why is a kilobyte 1024 bytes, if "kilo" means 1000, both according to the SI and the Greeks ("kilo" is derived from khilioi)? If 1 kg = 1000 g, 1 kV = 1000 V, and 1 km = 1000 m, why should hard disks break the pattern?

When we're talking about addressable computer memory, approximating the kilobyte as 1024 bytes is a convenience, but since the terabyte gives such a huge error, and makes absolutely no sense for data transfer or disk sizes, it's really time we stopped this illogical naming convention just because some engineers found a term convenient 40 years ago.
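
The error the parent is pointing at compounds with each prefix; a quick illustration:

    # How far each binary value drifts from its SI prefix
    for i, prefix in enumerate(["kilo", "mega", "giga", "tera"], start=1):
        error = (2 ** (10 * i) / 10 ** (3 * i) - 1) * 100
        print(f"{prefix}byte: {error:.1f}% larger than the SI value")
    # kilobyte: 2.4%, megabyte: 4.9%, gigabyte: 7.4%, terabyte: 10.0%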

But I have long since buried my problem with using the SI prefixes with "byte" to mean powers of 2; actually, I'm not sure I ever had one, I just accepted it. I am happy with 1024 B = 1 KB, 1024 KB = 1 MB, 1024 MB = 1 GB, and 1024 GB = 1 TB. The usable space is lower in the case of non-volatile storage anyway; 1 TB never means 1024 GB and might be closer to 1000 GB (I don't know).

Because only real nerds have a problem with 1 KB being 1024 bytes rather than 1000 bytes, and "kibibytes" or whatever you want to call them is a really stupid name. Who wants to have to deal with buying 1.073741-gigabyte DIMMs for their PC when we can just agree instead that a gigabyte is a power of two, not a power of ten? As for why it's different for disks than for RAM, disk manufacturers discovered a long time ago that they could make more money by using SI rather than binary measures for disk size, because it makes the drives appear larger than they really are.

When we're talking about addressable computer memory, approximating the kilobyte as 1024 bytes is a convenience, but since the terabyte gives such a huge error, and makes absolutely no sense for data transfer or disk sizes, it's really time we stopped this illogical naming convention just because some engineers found a term convenient 40 years ago.

Yes, it's so funny when all these guys keep arguing why 1024 bytes should really be 1000 bytes because they don't want to accept that it's history, it's practical, and it works.

There are more uses than just sending data. I'm using removable hard drive trays instead of dual-booting my machine. Swap the tray, reboot, I'm running Ubuntu. Repeat, and it's XP. I only keep that one because it came free with the PC; I boot it up now and then to keep it updated. It makes life easy when you know that you can't possibly fsck up your regular installation when playing with a new distribution or whatever. I've never needed to send one to anyone else, but that might be a huge help for supporting family.