purkinje writes "Astronomy is getting a major data-gathering boost from computer science, as new tools like real-time telescopic observations and digital sky surveys provide astronomers with an unprecedented amount of information — the Large Synoptic Survey Telescope, for instance, generates 30 terabytes of data each night. Using informatics and other data-crunching approaches, astronomers — with the help of computer science — may be able to get at some of the biggest, as-yet-unanswerable cosmological questions."

Actually, the speed at which science funding migrates from one flavour-of-the-month to the next clearly exceeds the speed of light. If we could turn that speed violation into workable time travel we could start processing the data mountain (astronomy data is not alone here) about 3000 years ago so that it is complete by lunch-time Sunday.

My biggest concern would be if there is too much information. What if the scientists are using the wrong search queries and missing something important? Or maybe something important is just buried on page 931 of a 2,000-page data report.
Still, it's better than the opposite problem, of just not having the data to search.

There's no such thing as too much data in a case like this, assuming that they can store it all. Even if it's too much to parse now, it won't be in a few years. Get as much data as we can now, while there's funding for it.

Disk I/O and the ability to back up that data can be a bitch, especially if the delta changes overlap within a 24-hour period. Of course, there are ways of addressing this problem with multiple servers, but that comes at a financial cost. Also, SAN and DAS technology still lags behind in I/O compared to the explosive growth in storage capacity.

Personally, I have clients that deal with 30+ TB worth of science data. Data retention is a major headache for me because as of four years ago, they only needed 2TB of

What if the scientists are using the wrong search queries and missing something important? Or maybe something important is just buried on page 931 of a 2,000-page data report?

Which is pretty much the same problem astronomy has had since roughly forever... Looking in the wrong place. Looking at the wrong time. Looking in the wrong wavelength. Looking for the wrong search terms. Looking on the wrong page... It's all pretty much the same. The sky and the data will be there tomorrow and they'll try again.

Many sciences are experiencing this trend. A branch of biochemistry known as metabolomics is a growing field right now (in which I happen to be participating). Using tools like liquid chromatography coupled to mass spectrometry, we can get hundreds of megabytes of data per hour. Even worse is the fact that a large percentage of that data is explicitly relevant to a metabolomic profile. The only practical way of analyzing all of this information is through computational analysis, either through statistical techniques used to condense and compare the data, or through searches on painstakingly generated metabolomic libraries.
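
For the curious, here's roughly what the "library search" half of that looks like; a minimal sketch assuming a tiny hand-made table of reference masses and a ppm tolerance, not any real metabolomics pipeline:

```python
# Minimal sketch: match observed LC-MS peaks against a reference library
# by mass within a ppm tolerance. The library entries and tolerance are
# illustrative only, not taken from any real pipeline.
import numpy as np

def match_peaks(observed_mz, library_mz, library_names, tol_ppm=10.0):
    """Return (observed m/z, metabolite name) pairs within tol_ppm."""
    hits = []
    for mz in observed_mz:
        # ppm error of every library entry relative to this peak
        ppm = np.abs(library_mz - mz) / mz * 1e6
        idx = np.argmin(ppm)
        if ppm[idx] <= tol_ppm:
            hits.append((mz, library_names[idx]))
    return hits

# Illustrative reference masses (monoisotopic) and a fake peak list
library_mz = np.array([180.0634, 146.0691, 132.0535])
library_names = ["glucose", "glutamine", "asparagine"]
observed = np.array([180.0630, 250.1102, 146.0700])

print(match_peaks(observed, library_mz, library_names))
```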

That is just my corner of the world, but I imagine that many of the low-hanging fruits of scientific endeavor have already been picked. Going forward, I believe that the largest innovations will come from the people willing to tackle data sets that a generation ago would have been seen as insurmountable.

Yes, the piracy sciences have been particularly hard hit. Modern piracy engineering can easily generate the equivalent of 10 Blu-rays, or 500 gigabytes, per day. Modern data reduction tools such as x264 have been developed to deal with this data overload, and can frequently reduce a 50GB Blu-ray by more than 10:1, down to 8GB or less, without a significant loss of information in the processed data.

Hm, small world--I'm also in metabolomics (more on the computational end than the biological side of things, what I like to call computational metabolomics). I was going to write a post similar to your own, but more generalized for those who aren't familiar with the biology behind it.
The issue now is that well established informatics/statistical/computer science approaches are used as general tools in biology/astronomy/biochemistry, and there is a great need to formulate novel algorithms to take advantage

I download Linux distro torrents faster than "hundreds of megabytes per hour". At that speed, a full day's worth of data is only a few GB, or roughly 10,000 times less than discussed in TFA. Still, analysing even a few GB of data a day is no task for mere men.

"Still, analysing even a few GB of data a day is no task for mere men."

Unless it's a Word document or PowerPoint presentation in which someone has embedded an uncompressed video or a bunch of uncompressed images. Then you can get through it in about 5 minutes flat, not counting the half hour it takes Word/PowerPoint to load.

No, in all seriousness though, it really depends on what the data is. That's why I'm not keen on this arbitrary "many gigabytes of data" metric with which articles like this are supposed to wow us.

Annnnd... we have a winner. GalaxyZoo uses tens of thousands of underutilized, superfluous, non-specialized 'carbon units' for pattern recognition, which they're really really really good at, that is, 800 ms after looking at an image -> elliptical, spiral, irregular... "Hmmm, hey, that's funny... wait... WTF --- let's post this to the forum, where hundreds of other random carbon units will weigh in, and a For-Real Astronomer(TM) will be checking it out inside 24 hours if it creates enough buzz..." See Hanny's Voorwerp [wikipedia.org] for the quintessential example.

Software that could 'be surprised' would be nice, but it's a long, long way off.

I'm not an expert in astronomy, but in general, I don't think you can collect too much data, as long as it's stored in an at least somewhat intelligible format. This way, even if professional astronomers miss something today, amateurs and/or future astronomers will have tons of data to pick apart and scavenge tomorrow.

Plus, more data should make it easier to test hypotheses with more certainty. Hopefully, the data will be made publicly available after the gatherers have had a shot or two at it.

30TB per day works out to about 10 petabytes per year. If you compare this to the total amount of data produced in a year (from all human sources), around a zettabyte, it's not that huge. In fact, IIRC, the yearly traffic of the internet is around 250 exabytes. The people with the really hard job of data processing are internet search engines. Not only do they have to go through several orders of magnitude more data, they have to do it faster, and with much less clearly defined queries.
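
A quick back-of-envelope check of those figures (using the poster's own numbers and decimal units):

```python
# Rough arithmetic on the numbers above (decimal TB/PB/ZB, the poster's estimates).
TB, PB, ZB = 1e12, 1e15, 1e21

per_year = 30 * TB * 365
print(per_year / PB)   # ~10.95 PB/year from 30 TB/night
print(per_year / ZB)   # ~1.1e-5, i.e. about a hundred-thousandth of a zettabyte
```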

Sounds like another task for IBM's Watson.
The way I understand the problem, most scientists must work closely with skilled CS folk to generate these kinds of answers from such large datasets, or they must be half CS folk themselves in order to traverse data at that scale. Quite an undertaking when professionals should be focused on one area, let alone conveying the ideas of one field to the other the way they themselves see and understand them.
However, the dawn of asking Watson or Enterprise to figure something

We just need network speeds and disk drive sizes to keep doubling at the rate they have, and we'll be laughing about how we thought 30TB/night was going to be a problem.

SDO finally launched last year with a data rate of over 1TB/day... and all through planning, people were complaining about the data rates... it's a lot, but it's not as insurmountable as it might've been 8 years ago, when we were looking at 80 to 120GB disks.

Although, it'd be nice if monitor resolutions had kept growing... if anything, they've gotten worse the last couple of years.

(Disclaimer: I work in science informatics; I've run into Kirk Bourne at a lot of meetings, and we used to work in the same building, but we deal with different science disciplines)

In fact, they just started blasting the site. I actually live next door to the LSST's architect, which is pretty cool.

Astronomers generate a tremendous amount of data, bested only by particle physicists. Storing it all is a challenge, to put it mildly. Backup is basically impossible. The real problem is that the data lines that go from the summit to the outside world are still not fast. The summits here are pretty remote and even when you get to a major road, it's still in farm country. And then getting it out of the country is tough--all of our network traffic to North America hits a major bottleneck in Panama, so if you're trying to mirror the database or access the one in Chile, it can be frustratingly slow.

As far as I understand it, the data will also be available to the general public. I assume that means they will need to have a global network of caches?

Possibly. It depends on how much the general public actually wants to download the data; if it is just selected images instead of the bulk (most of which will be boring "not much happening here" stuff) then serving it from a single site will be quite practical.

Astronomers generate a tremendous amount of data, bested only by particle physicists.

Earth scientists will merrily generate far more — they're purely limited by what they can store and process, since deploying more sensors is always possible — but they're mostly industrially funded, so physicists and astronomers pretend to not notice.

The first Pan-STARRS scope with its 1.3-gigapixel camera has been doing science for a little while now, and I think it might do something like 2.5TB a night. That's still a lot of disk (and keep in mind that they originally planned to have 4 of those scopes), but I think their pipeline reduces it all to coordinates for each bright thingy in the frame and then throws away the actual image (though I could be wrong).
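
A toy illustration of that kind of reduction (definitely not the actual Pan-STARRS pipeline; the sigma threshold and the use of scipy here are my own assumptions): threshold the frame, label the bright blobs, keep only their centroids, and throw the pixels away.

```python
# Toy version of the reduction described above (NOT the Pan-STARRS pipeline):
# keep centroid coordinates of bright blobs, drop the raw pixels.
import numpy as np
from scipy import ndimage

def bright_source_coords(image, sigma=5.0):
    """Return (y, x) centroids of regions brighter than median + sigma*std."""
    threshold = np.median(image) + sigma * image.std()
    mask = image > threshold
    labels, n = ndimage.label(mask)              # connected bright regions
    if n == 0:
        return np.empty((0, 2))
    centroids = ndimage.center_of_mass(image, labels, range(1, n + 1))
    return np.array(centroids)

# A synthetic frame: background noise plus two fake "stars"
rng = np.random.default_rng(0)
frame = rng.normal(100, 5, size=(256, 256))
frame[50, 60] += 500
frame[200, 180] += 800

print(bright_source_coords(frame))   # a handful of coordinates instead of ~65k pixels
```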

i) telescope time is a scarce resource. If I need an image of galaxy X I might have to wait years to get telescope time for it. If galaxy X has already been observed once and the data stored, then I can do my new research (e.g. datamining) on the existing data. Nobody knows in advance which data is going to be interesting to future researchers, so triage is almost impossible.

ii) telescopes have finite lifetimes. Once the telescope/instrument ceases to

OK, I know this doesn't solve the problem of actually ANALYZING the data, but for storing and moving the data around, what's the best compression algorithm for astronomical (I mean the discipline, not the size!) data?

I used to work for a company that developed a really good compression algorithm using wavelets. At the time it was the only one to be accepted by A-list movie directors (the people with the real power in Hollywood); they refused to go with any of the JPEG or MPEG variants (this was before JPEG

Normally they don't. Compression algorithms, almost by definition, create artifacts that are difficult if not impossible to distinguish from potentially interesting data. So science imagery is almost always saved in 'raw' format, unless you have no other option, like with your Galileo example. Imagine applying dead-pixel detection to an astronomy image: 'poof!', all the stars magically disappear!

Not all compression algorithms are lossy, though the lossless ones aren't nearly as space-efficient. But some form of lossy compression might work too; it would be easy to filter the images so, for instance, any "nearly-black" pixel is set to black. Add some RLE and you have compression. The key to lossy compression is having a way to determine what type of data isn't as important and approximating that data.
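
Here's a sketch of exactly that scheme, purely as an illustration (real archives generally stick to lossless formats, as the post above notes): clamp "nearly-black" pixels to zero, then run-length encode.

```python
# Sketch of the scheme described above: clamp "nearly-black" pixels to zero
# (the lossy step), then run-length encode the flattened image. Illustrative
# only; not a recommendation for real science data.
import numpy as np

def threshold_and_rle(image, black_level=3):
    """Set pixels below black_level to 0, then return (value, run_length) pairs."""
    flat = image.flatten().copy()
    flat[flat < black_level] = 0          # the lossy step
    runs = []
    current, count = flat[0], 1
    for v in flat[1:]:
        if v == current:
            count += 1
        else:
            runs.append((int(current), count))
            current, count = v, 1
    runs.append((int(current), count))
    return runs

img = np.array([[0, 1, 0, 0],
                [0, 0, 200, 0],
                [0, 2, 0, 0]], dtype=np.uint8)
print(threshold_and_rle(img))   # mostly one long run of zeros plus the bright pixel
```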

Which is why I said "for instance". I don't know what the researchers are looking for, but I'm pretty sure the researchers themselves have a decent understanding of what data they want. In contrast to what dargaud mentioned above, most researchers set out to find specific data to prove or disprove a hypothesis; they only need a specific subset of all data collected. Very few researchers try to discover things in a random set of raw data. If all you want to know is the number of stars in a specific picture, you can kee