Time domain is new. LSST stars (?) – Petabytes. LSST – total dataset in the 100 Petabyte range.

Scientific data doubles every year – successive generations of inexpensive sensors (new CMOS) and exponentially faster computing. This changes the nature of scientific computing across all areas of science, but it's harder to extract knowledge.

Challenges:
- access – move analysis to the data
- discovery – typically discovery is done at the edges, so more data doesn't give us much more… but opening up other, orthogonal dimensions gives us more discoveries; federation still requires data movement
- analysis – only max N log N algorithms possible
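To illustrate the N log N constraint mentioned above: a back-of-the-envelope sketch (my own, not from the talk) of why anything worse than O(N log N) is hopeless at survey scale. The sustained rate of 10^9 record-operations per second is an assumed round number.

```python
# Rough illustration: wall-clock cost of N log N vs N^2 algorithms on
# large catalogs, assuming a machine sustaining 1e9 record-ops/sec
# (an arbitrary round number, not a measured figure).
import math

OPS_PER_SEC = 1e9

def hours(n_ops):
    """Convert an operation count into hours at the assumed rate."""
    return n_ops / OPS_PER_SEC / 3600

for n in (1e6, 1e9, 1e12):   # million, billion, trillion records
    nlogn = n * math.log2(n)
    n2 = n * n
    print(f"N={n:.0e}: N log N ~ {hours(nlogn):.2e} h, N^2 ~ {hours(n2):.2e} h")
```

At a trillion records, N log N is hours of work while N² is hundreds of billions of hours – which is why the notes say only N log N (at most) algorithms are possible.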

Data analysis on all scales, across multiple locations. The assumption has been that there's one optimal solution, that we just need a large enough data set… with unlimited computing power. But this isn't true any more – randomized incremental algorithms.
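One concrete flavor of the "randomized incremental" idea: instead of one exact pass over everything, keep a fixed-size uniform random sample of a stream (reservoir sampling) and answer queries approximately from it, refining as more data arrives. This is my own sketch of the technique, not the speaker's code; all names and sizes are illustrative.

```python
# Reservoir sampling: a uniform random sample of size k from a stream of
# unknown length, built incrementally in one pass.
import random

def reservoir_sample(stream, k, rng=None):
    rng = rng or random.Random()
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)   # item survives with probability k/(i+1)
            if j < k:
                sample[j] = x
    return sample

data = range(1_000_000)                    # stand-in for a huge archive
s = reservoir_sample(data, 1000, random.Random(42))
estimate = sum(s) / len(s)                 # approximate mean from the sample
print(f"approx mean {estimate:.0f} vs true {999_999 / 2:.0f}")
```

The sample answers the query to within sampling error using 0.1% of the data – the trade the notes point at when unlimited computing power is off the table.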

Analysis of scientific data – working with Jim Gray, coping with the data explosion, starting in astro with the SDSS… the first data release from Sloan was 100GB; now the dataset is about 100TB. Interactions with every step of the scientific process.

Jim Gray:
- scientific computing revolving around data – take analysis to the data; need a scale-out solution for analysis
- scientists give the database designer their top 20 questions in English, and the database designer or computer scientist can design the database accordingly
- build a little something that works today, then build bigger/scale and build something working tomorrow – go from working to working – build for what the world looks like today, not for tomorrow

Public use of the SkyServer:
- prototype in data publishing
- 500 million web hits in 6 years; 1M distinct users, but only 15k astronomers in the world
- 50,000 lectures to high schools
- delivered >100B rows of data

Interactive workbench:
- sign up, get your own database, run a query, pipe results to your own database… analysis tools transfer only the plot, not the entire database, over the wires
- 2,400 power users

GalaxyZoo – built on SkyServer; 27M visual galaxy classifications by the public; Dutch school teacher discovery.

Virtual Observatory:
- collaboration of 20 groups in 15 countries – the International Virtual Observatory Alliance
- interfaces were different and there was no central registry, but the underlying basic data formats were agreed upon
- sociological barriers are much more difficult than technical challenges

Technology:
- petabytes – save, move, do some processing near the instrument (for example, in Chile)
- funding organizations have to understand the computing costs over time
- open-ended modular system
- need a Journal for Data (an overlay to bridge the gap so that data sets don't get lost) – curation is key; who does the long-term curation of data?

Pan-STARRS:
- detect killer asteroids
- >1 Petabyte/year
- 80TB SQL Server database built at JHU – the largest astro database in the world

Life Under Your Feet (http://lifeunderyourfeet.org/en/default.asp) – the role of soil in global change; a few hundred wireless computers with 10 sensors each, long-term continuous data, a complex database of sensor data, built from the SkyServer.

Once a project is online, growth is linear; exponential growth comes from new technologies, new cameras… the future might come from individual amateur astronomers using 20MB cameras on their telescopes and systematically gathering data.

More growth is coming from simulations (software is also an instrument). (Example of one so big the tape robot was inaccessible, so the data is never used.)

Also need interactive, immersive usages (like for turbulence) – store every time slice in the database – turbulence.pha.jhu.edu (try it today!)

Amdahl’s Laws for a balanced system – we’ve gone farther and farther from these
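The balanced-system rule of thumb behind this point: roughly one bit of sequential I/O per CPU instruction per second. A quick sketch of that "Amdahl number" – the hardware figures below are hypothetical examples I've supplied, not numbers from the talk.

```python
# Amdahl's balance rule of thumb: a balanced system sustains about 1 bit
# of I/O per instruction per second. Ratios far below 1 mean the CPUs
# are starved for data.

def amdahl_number(io_bytes_per_sec, instructions_per_sec):
    """Bits of I/O per instruction per second; ~1.0 for a balanced system."""
    return (io_bytes_per_sec * 8) / instructions_per_sec

# Hypothetical node: 100 GFlop/s treated as the instruction rate,
# one disk delivering 100 MB/s of sequential I/O.
a = amdahl_number(100e6, 100e9)
print(f"Amdahl number ~ {a:.3f}")   # far below 1 => I/O-bound analysis
```

Ratios like this are how "we've gone farther and farther from these" gets quantified: compute has scaled far faster than disk bandwidth.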

Comparisons of simulations and their data generation vs. what's available in the computers. Data analysis is maxing out the hardware because of the 10–100 TB sizes – no one can really do anything with over 50 TB. IO limitations for analysis. We're a factor of 500 off from what we'd need for a 200 TFlop Amdahl-balanced machine.

They built a high-IO system using cheap components.

The large datasets are here; the solutions are not – systems are choking on IO. Scientists are cheap. Data collection is separated from data analysis (big experiments just collect data and store it; scientists come along later and analyze it – decoupled).

How do users interact with petabytes?
- can't wait 2 weeks to do a SQL query on a petabyte
- python crawlers
- partition queries
- MapReduce/Hadoop – but these can't do (or make very difficult) the complex joins you need for data analysis
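A minimal sketch of the "partition queries" item above: split one huge scan into range-bounded queries that could run in parallel, then merge the partial results. sqlite3 stands in for the real warehouse here; the table and column names are invented for the illustration.

```python
# Partitioned querying: one big COUNT split into per-range queries whose
# results merge by addition. Table/columns are made up for this sketch.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE objects (objid INTEGER PRIMARY KEY, mag REAL)")
db.executemany("INSERT INTO objects VALUES (?, ?)",
               [(i, 15 + (i % 100) / 10) for i in range(10_000)])

def partitioned_count(db, lo, hi, n_parts, mag_cut=20.0):
    """Count objects brighter than mag_cut, one query per objid range."""
    step = (hi - lo) // n_parts
    total = 0
    for p in range(n_parts):
        a = lo + p * step
        b = hi if p == n_parts - 1 else lo + (p + 1) * step  # cover remainder
        (c,) = db.execute(
            "SELECT COUNT(*) FROM objects "
            "WHERE objid >= ? AND objid < ? AND mag < ?", (a, b, mag_cut)
        ).fetchone()
        total += c          # merge step: partial counts simply add
    return total

print(partitioned_count(db, 0, 10_000, n_parts=8))
```

Counts and sums merge trivially like this; the complex joins flagged in the notes don't decompose so cleanly across partitions, which is exactly why they're painful in MapReduce-style systems.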

William Gibson – “The future is already here. It’s just not very evenly distributed”

Data cloud vs. HPC or HTC. A journal for data? – with ApJ? Example: a postdoc writes a paper and only a table goes into the supplementary data; the journal can't take the terabytes of real data – we need another archive for this, linked to the science article.

This is my blog on library and information science. I'm into Sci/Tech libraries, special libraries, personal information management, sci/tech scholarly comms.... My name is Christina Pikas and I'm a librarian in a physics, astronomy, math, computer science, and engineering library. I'm also a doctoral student at Maryland. Any opinions expressed here are strictly my own and do not necessarily reflect those of my employer or CLIS. You may reach me via e-mail at cpikas {at} gmail {dot} com.