Copy of 35Min SciDB at NERSC, Analyze and Share Terabytes of Data

Transcript of Copy of 35Min SciDB at NERSC, Analyze and Share Terabytes of Data

SciDB Testbed at NERSC
Powerful Back-end to an Online Web Service
Easy-to-use, fast, interactive analytic framework

Want to accelerate your Discovery Process?
To Try out SciDB at NERSC
Email yyao@lbl.gov
Any Science Project Welcome - Share/Analyze Terabytes of Science Data

Scientific Discovery Through Data - an Iterative Process
Decision Paralysis

Array Like Science Data - More Common than you think
SciDB for Array Data - Easy and Powerful

Case Study: OpenMSI
Online Mass Spectrometry Analysis

Old Style Science Gateway
Search Catalog. Download File. Do-your-own Analysis

Climate Simulation Output - Terabytes of Output Per Run

Brain MRI Image
Many people -> 4th Dimension
Axes: X, Y, Z, Person

Gene Labeling - Large Sparse Array
Gene (Billions)
Feature (Thousands to Millions)

Science Gateway
Interactive Analysis

SciDB Testbed @ NERSC
Partner up with Science Teams
10+ Science Projects
Complicated Algorithms
Multiple Science Domains: Astronomy, Climate, Bio-imaging, Genomics

Smart New Science Gateway with SciDB as Backend
Allow Complicated Queries that Aggregate TB of data, and return an Answer
SciDB lets you search through 100s of GB of raw data and find image features inside it
Axes: X, Y, M/Z

Spectrum Taken from Sloan Digital Sky Survey
Spectra for 0.5 Billion Objects

Student Intern
Infrastructure/middle-ware is very important for efficiency
TB data, no MPI, don't worry about Parallel IO.
The Ideal Analysis Framework

Write your own data analysis code for an HPC system?
Understand the Parallel Architecture
Learn about MPI-IO or some file format
Worry about parallel programming
What the #?X is an OST?
Why are my files corrupted?
Why isn't my sort working with 1000 cores?

Pre-processing/Loading
Query: get all interesting data, aggregate over some dimension, then do a K-means clustering
OLD / NEW

Yushu Yao/LBNL-NERSC

Catalog of Billions of Stars
Axes: Declination (-90 Deg to 90 Deg), Right Ascension (0 Deg to 360 Deg)

Big Data needs advanced mathematics: Statistics / Machine learning / Mining at Scale
Big Data Tasks in SciDB@NERSC Projects

SciDB Case Studies and Performance Comparisons

Match Supernova Observation with Simulation
Simulation of Supernova Explosion produces many (thousands to millions) spectra
Web user can search/plot like any other gateway
NEW: Web user can upload a spectrum, and Odetta will find the most "similar" spectrum in the database:
Very Compute/IO intensive
For 55K spectra, SciDB returns a result in 10 sec (compared to 20 min in PostgreSQL+Python)
For 1 million spectra, SciDB returns a result in 2 min

MATCH
Given 2 sets of observed objects, return the objects observed in both sets (~300GB for 1 billion stars)
Spatial query not efficient in SQL

Cross Matching Catalogs of Stars in the Sky
In SciDB, you can lay out the stars in a 2D table and overlay them. In parallel.
50 times faster to match 1 billion stars in SciDB (5 min) than PostgreSQL (5 hr)

Metagenome Analysis Workflow
Aggregate, e.g. count(score>10)
Biclustering
Sparse array, each cell contains some properties (e.g. a score).
3.5 Billion non-empty cells (0.5TB)

When to Use SciDB on Your Data
10+GB Data (and will grow)
Looks like an array (dense or sparse)
Write Once (or accumulate slowly) and read a lot
Lots of Filtering and Aggregating
Want to do Joins like SQL
Do most of the calculation inside the Database
Linear Algebra on your Data

20 Jesup Nodes (8 Core, 24GB Memory each)
Too little memory!!!
Storage
Commodity 512GB SSD (OCZ Vertex 4)
NGF PROJECT
Carver IB Network

Next Steps for the NERSC SciDB Test Bed
Get Broader Audience:
Automate creation/resizing of SciDB Clusters
User-controlled SciDB Instances (start/stop their own SciDB cluster)
NGF-backed storage
Same Hardware (Almost)
Still the same:
Kick-off a new project by holding their hand to
Load first batch of data
Do the first round of analysis

When not to use SciDB
SciDB is for Analysis, NOT Transactions
For crunching through large data and returning 1 small result, not returning millions of small results at high throughput
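
The catalog cross-match described earlier (lay the stars out in a 2D table and overlay them) can be illustrated with a small sketch. This is plain Python, not SciDB's query language or API, and it is only one way to read that idea: bin each star into a 2D grid cell by (Right Ascension, Declination), then compare a star only against stars in the same or neighboring cells. The function names, the tuple layout, and the tolerance value are assumptions made for illustration. The comparison is a simple flat-sky box test and ignores the wrap-around at 0/360 Deg and the shrinking of RA distances near the poles.

```python
from collections import defaultdict
from math import floor


def grid_cell(ra, dec, cell_deg):
    """Map a sky position in degrees to an integer 2D grid cell (hypothetical helper)."""
    return (floor(ra / cell_deg), floor(dec / cell_deg))


def cross_match(catalog_a, catalog_b, tol_deg=0.001):
    """Return (id_a, id_b) pairs whose RA and Dec both agree within tol_deg.

    Each catalog is an iterable of (object_id, ra_deg, dec_deg) tuples.
    Assumption: a flat-sky box test, with no RA wrap-around handling.
    """
    # Cells are twice the tolerance wide, so any match lies in the same
    # cell or in one of the 8 neighboring cells.
    cell_deg = 2 * tol_deg

    # "Lay out" catalog B on the 2D grid.
    grid = defaultdict(list)
    for id_b, ra_b, dec_b in catalog_b:
        grid[grid_cell(ra_b, dec_b, cell_deg)].append((id_b, ra_b, dec_b))

    # "Overlay" catalog A: each star only looks at nearby cells.
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        cx, cy = grid_cell(ra_a, dec_a, cell_deg)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for id_b, ra_b, dec_b in grid.get((cx + dx, cy + dy), ()):
                    if abs(ra_a - ra_b) <= tol_deg and abs(dec_a - dec_b) <= tol_deg:
                        matches.append((id_a, id_b))
    return matches
```

Because each star is compared only with the handful of stars in nearby cells, the work grows roughly with the catalog size instead of with the product of the two catalog sizes, and separate regions of the grid can be processed independently, which is what makes the overlay easy to run in parallel.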