Posted
by
kdawson
on Tuesday October 13, 2009 @04:08AM
from the peta-here-a-peta-there dept.

Hugh Pickens writes "The NY Times reports that researchers and workers in fields as diverse as biotechnology, astronomy, and computer science will soon find themselves overwhelmed with information, so the next generation of computer scientists will have to learn to think at Internet scale: petabytes of data. For the most part, university students have used rather modest computing systems to support their studies, but these machines fail to churn through enough data to really challenge and train young minds to ponder the mega-scale problems of tomorrow. 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer, a director at IBM's Almaden Research Center. This year, the National Science Foundation funded 14 universities that want to teach their students how to grapple with big data questions. Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night. 'Science these days has basically turned into a data-management problem,' says Jimmy Lin, an associate professor at the University of Maryland."

Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night.

Err, no it doesn't, and no they aren't. The telescope hasn't even been built yet; first light isn't scheduled until late in 2015.

That's absolutely not true. The process is vastly different when you're working with 100 MB versus 10 petabytes. Take databases, for instance. If you have 100 MB of data, you can just store the entire database on one server. With 100 PB of data, it's difficult even to find hardware capable of storing that much. You need to start looking at distributed systems, and distributed systems is such a broad field in itself.
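
(As a sketch of what "distributed" ends up meaning in practice: once the data can't fit on one box, the usual first step is to partition rows across servers by a stable hash of the key, so every client independently agrees on which server owns which record. The function and key names below are made up for illustration, not from any particular system.)

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to one of num_shards servers via a stable hash.

    md5 is used here only because it is stable across processes and
    machines, unlike Python's built-in hash(), which is randomized
    per interpreter run.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Every client computes the same shard for a given key, so reads and
# writes for that key always land on the same server.
assert shard_for("claim:12345", 8) == shard_for("claim:12345", 8)
```

Of course this is only the start of the problem: naive modulo hashing reshuffles almost every key when you add a server, which is why real systems reach for consistent hashing and replication on top of this idea.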

When I graduated in 2005, a lot of the techniques I was taught worked great for database systems that handled a few hundred thousand rows. Then I got a job at an Internet company that had tables with over 80 million rows, and all that normalization stuff I learned in school had to be thrown out. Times may have changed now, but when I was in school, not only did I not learn how to handle "Internet scale" data sets, I was taught the wrong methods for handling large data sets.

Undergrad students should at least get a basic intro to large data sets, if not a class completely dedicated to learning how to work with them. School is supposed to prepare you for the workforce, so at least give students the option of taking a class that covers those topics if they want to go into those industries. I sure wish I'd had that option.

Really, really big numbers can be hard for the human brain to get a grip on. But more to the point, operating at large scales presents problems unique to the scale. Think of baking cookies. Doing this in your kitchen is a familiar thing to most people, but the kitchen method doesn't translate well to an industrial scale. Keebler doesn't use a million-gallon bowl and cranes with giant beaters on the end. They don't have ovens the size of cruise ships. Just because you can make awesome cookies in your kitchen doesn't qualify you one bit to work for Keebler.
Whether it's cookies or scientific inquiry, it's a good idea to prepare students to process things on the appropriate scale.

One example: I deal with healthcare claims. We keep everything normalized on insertion, but we also create some redundant, denormalized tables (data warehousing). Almost every query needs the same basic claim information, and this way I'm doing it in a query with one or two joins instead of ten.

If something goes south with my denormalized tables, or I need an unusual field, I still have my source data in a pure form. For a standard query, though, I can operate an order of magnitude faster thanks to redundant tables that only have to be written once, on insert, by a trigger.
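
(To make the pattern concrete, here's a toy version using Python's built-in sqlite3. The schema and names are invented for illustration, not my actual claims tables: normalized source tables stay authoritative, and an AFTER INSERT trigger populates a flat reporting table so the common query needs no joins at all.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized source tables: the authoritative, pure form of the data.
cur.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE claims (
    id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patients(id),
    amount REAL)""")

# Redundant, denormalized reporting table, written once per insert.
cur.execute("""CREATE TABLE claim_summary (
    claim_id INTEGER, patient_name TEXT, amount REAL)""")

# The trigger pays the join cost a single time, at insert.
cur.execute("""
CREATE TRIGGER denormalize_claim AFTER INSERT ON claims
BEGIN
    INSERT INTO claim_summary
    SELECT NEW.id, p.name, NEW.amount
    FROM patients p
    WHERE p.id = NEW.patient_id;
END""")

cur.execute("INSERT INTO patients VALUES (1, 'Alice')")
cur.execute("INSERT INTO claims VALUES (100, 1, 250.0)")

# The everyday query now reads one flat table, zero joins.
row = cur.execute(
    "SELECT patient_name, amount FROM claim_summary").fetchone()
```

The trade-off is classic space-for-time: storage is cheap, insert does a little extra work, and every read after that gets faster, while the normalized tables remain available for audits and odd one-off queries.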