Title: Data Explosion: Science with Terabytes
1. Data Explosion: Science with Terabytes
- Alex Szalay, JHU, and Jim Gray, Microsoft Research
2. Living in an Exponential World
- Astronomers have a few hundred TB now
  - 1 pixel (byte) / sq arc second ~ 4 TB
  - Multi-spectral, temporal → 1 PB
- They mine it looking for new (kinds of) objects, or more of interesting ones (quasars), density variations in 400-D space, correlations in 400-D space
- Data doubles every year
- Data is public after 1 year
  - So, 50% of the data is public
  - Same access for everyone
3. The Challenges
- Exponential data growth, distributed collections, soon Petabytes
- Data Collection → Discovery and Analysis → Publishing
- New analysis paradigm: data federations, move analysis to the data
- New publishing paradigm: scientists are publishers and curators
4. New Science: Data Exploration
- Data growing exponentially in many different areas
- Publishing so much data requires a new model
- Multiple challenges for different communities
  - publishing, data mining, data visualization, digital libraries, education, web services poster-child
- Information at your fingertips
  - Students see the same data as professional astronomers
- More data coming: Petabytes/year by 2010
  - We need scalable solutions
  - Move analysis to the data!
- Same thing happening in all sciences
  - High energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing, ...
- Data Exploration: an emerging new branch of science
  - Currently has no owner
5. Advances at JHU
- Designed and built the science archive for the SDSS
  - Currently 2 Terabytes, soon to reach 3 TB
- Built fast spatial search library
- Created novel pipeline for data loading
- Built the SkyServer, a public access website for SDSS, with over 45M web hits and millions of free-form SQL queries
- Built the first web services used in science
  - SkyQuery, ImgCutout, various visualization tools
- Leading the Virtual Observatory effort
- Heavy involvement in Grid Computing
- Exploring other areas
6. Collaborative Projects
- Sloan Digital Sky Survey (11 institutions)
- National Virtual Observatory (17 institutions)
- International Virtual Observatory Alliance (14 countries)
- Grid For Physics Networks (10 institutions)
- Wireless sensors for Soil Biodiversity (BES, Intel, UCB)
- Digital Libraries (JHU, Cornell, Harvard, Edinburgh)
- Hydrodynamic Turbulence (JHU Engineering)
- Informal exchanges with NCBI
7. Directions
- We understand how to mine a few terabytes
- We built an environment; now our tools allow new breakthroughs in astrophysics
- Open collaborations beyond astrophysics (turbulence, sensor-driven biodiversity, bioinformatics, digital libraries, education)
- Attack problems on the 100 Terabyte scale, prepare for the Petabytes of tomorrow
8. The JHU Core Group
- Faculty
- Alex Szalay
- Ethan Vishniac
- Charles Meneveau
- Graduate Students
- Tanu Malik
- Adrian Pope
- Postdoctoral Fellows
- Tamas Budavari
- Research Staff
- George Fekete
- Vivek Haridas
- Nolan Li
- Will O'Mullane
- Maria Nieto-Santisteban
- Jordan Raddick
- Anirudha Thakar
- Jan Vandenberg
9. Examples
- Astrophysics inside the database
- Technology sharing in other areas
- Beyond Terabytes
10. I. Astrophysics in the DB
- Studies of galaxy clustering
  - Budavari, Pope, Szapudi
- Spectro Service: publishing spectral data
  - Budavari, Dobos
- Cluster finding with a parallel DB-oriented workflow system
  - Nieto-Santisteban, Malik, Thakar, Annis, Sekhri
- Complex spatial computations inside the DB
  - Fekete, Gray, Szalay
- Visual tools with the DB
  - ImgCutout (Nieto), Geometry viewer (Szalay), MirageSQL (Carlisle)
11. The SDSS Photo-z Sample
  Sample cut                Galaxies
  All                       50M
  mr < 21                   15M
  10 stripes                10M
  0.1 < z < 0.3, -20 > Mr   2.2M
  -20 > Mr > -21            1,182k
  -21 > Mr > -23            931k
  -21 > Mr > -22            662k
  -22 > Mr > -23            269k
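The successive cuts above are exactly the kind of selection that runs directly inside the archive. As a rough, hypothetical illustration (the table and column names Galaxy, Photoz, modelMag_r, absMagR, and z are assumptions, not the actual SkyServer schema), one luminosity bin of the photo-z sample could be counted with a query like this:

```sql
-- Hypothetical sketch: count one photo-z luminosity bin inside the DB.
-- Table and column names are illustrative, not the real SDSS schema.
SELECT COUNT(*) AS nGalaxies
FROM   Galaxy g
JOIN   Photoz p ON p.objID = g.objID
WHERE  g.modelMag_r < 21              -- mr < 21 flux limit
  AND  p.z BETWEEN 0.1 AND 0.3        -- photometric-redshift shell
  AND  p.absMagR BETWEEN -22 AND -21; -- luminosity cut -21 > Mr > -22
```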
12. The Analysis
- eSpICE: I. Szapudi, S. Colombi and S. Prunet
  - Integrated with the database by T. Budavari
- Extremely fast processing
  - 1 stripe with about 1 million galaxies is processed in 3 mins
  - Usual figure was 10 min for 10,000 galaxies, i.e. about 70 days for a full stripe
- Each stripe processed separately for each cut
- 2D angular correlation function computed
- w(θ) averaged with rejection of pixels along the scan
  - Correlations due to flat field vector
  - Unavoidable for drift scan
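For reference, the quantity being computed is the two-point angular correlation function; the block below states its definition and one commonly used pair-count estimator (Landy and Szalay 1993). This is background notation only: the slides do not say which estimator eSpICE implements internally.

```latex
% w(theta): excess probability, over a uniform random distribution, of
% finding two galaxies separated by an angle theta on the sky:
\[
  dP = \bar{n}^{2}\,\bigl[\,1 + w(\theta)\,\bigr]\; d\Omega_{1}\, d\Omega_{2}
\]
% A widely used pair-count estimator (Landy & Szalay 1993), with DD, DR,
% RR the normalized data-data, data-random, and random-random pair counts:
\[
  \hat{w}(\theta) \;=\; \frac{DD(\theta) - 2\,DR(\theta) + RR(\theta)}{RR(\theta)}
\]
```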
13. Angular Power Spectrum
- Use photometric redshifts for LRGs
- Create thin redshift slices and analyze angular clustering
- From characteristic features (baryon bumps, etc.) we obtain angular diameter vs. distance → Dark Energy
- Healpix pixelization in the database
  - Each redshift slice is generated in 2 minutes
  - Using SpICE over 160,000 pixels in N^1.7 time
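A minimal sketch of what "Healpix pixelization in the database" can look like, assuming each galaxy row already carries a precomputed HEALPix index and a photometric redshift; the table and column names (Galaxy, healpixID, photoz) are illustrative, not the actual schema:

```sql
-- Counts-in-pixels map for one thin redshift slice, built server-side.
-- Assumes a precomputed HEALPix index column; names are illustrative.
SELECT healpixID,
       COUNT(*) AS galaxyCount
FROM   Galaxy
WHERE  photoz BETWEEN 0.20 AND 0.25   -- one redshift shell
GROUP  BY healpixID;
```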
14. Large Scale Power Spectrum
- Goal: measure cosmological parameters
  - Cosmological constant or Dark Energy?
- Karhunen-Loeve technique
  - Subdivide slices into about 5K-15K cells
  - Compute correlation matrix of galaxy counts among cells from fiducial P(k) + noise model
  - Diagonalize matrix
  - Expand data over KL basis
- Iterate over parameter values
  - Compute new correlation matrix
  - Invert, then compute log likelihood
- Vogeley and Szalay (1996)
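The steps above can be summarized in two formulas; the notation below is a schematic restatement of the Karhunen-Loeve approach of Vogeley and Szalay (1996), not a transcription of the actual analysis code.

```latex
% Let d be the vector of galaxy counts in the ~5K-15K cells and C_fid its
% covariance built from the fiducial P(k) plus the noise model.  The KL
% basis diagonalizes C_fid, and the data are expanded over that basis:
\[
  C_{\mathrm{fid}}\,\psi_i = \lambda_i\,\psi_i , \qquad B_i = \psi_i^{\top} d .
\]
% For each trial parameter set \Theta, rebuild the covariance C(\Theta),
% invert it, and evaluate the Gaussian log-likelihood of the KL modes:
\[
  \ln \mathcal{L}(\Theta) = -\tfrac{1}{2}\Bigl[ \ln\det C(\Theta)
      + B^{\top} C(\Theta)^{-1} B \Bigr] + \mathrm{const}.
\]
```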
15. Ωb/Ωm vs. Ωmh
[Figure: likelihood contours in the Ωb/Ωm vs. Ωmh plane]
- SDSS only: Ωmh = 0.26 +/- 0.04, Ωb/Ωm = 0.29 +/- 0.07
- SDSS: Pope et al. (2004); WMAP: Verde et al. (2003), Spergel et al. (2003)
16. Numerical Effort
- Most of the time spent in data manipulation
  - Fast spatial searches over data and MC (SQL)
- Diagonalization of 20K x 20K matrices
- Inversions of a few 100K 5K x 5K matrices
- Has the potential to constrain the Dark Energy
- Accuracy enabled by large data set sizes
- But new kinds of problems
  - Errors driven by the systematics, not by sample size
  - Scaling of analysis algorithms critical!
- Monte Carlo realizations with a few 100M points in SQL
17. Cluster Finding
- Five main steps (Annis et al. 2002); a schematic SQL version of the pipeline is sketched below
  - Get Galaxy List
    - fieldPrep: extracts from the main data set the measurements of interest
  - Filter
    - brgSearch: calculates the unweighted BCG likelihood for each galaxy (unweighted by galaxy count) and discards unlikely galaxies
  - Check Neighbors
    - bcgSearch: weights the BCG likelihood with the number of neighbors
  - Pick Most Likely
    - bcgCoalesce: determines whether a galaxy is the most likely galaxy in the neighborhood to be the center of the cluster
  - Discard Bogus
    - getCatalog: removes suspicious results and produces and stores the final cluster catalog
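The sketch below shows one way the five steps could be staged as successive SELECT ... INTO tables inside SQL Server. It is an illustration only: the table names, the likelihood function dbo.fBcgLikelihood, the Neighbors pair table, and the numeric thresholds are all hypothetical, not the actual pipeline.

```sql
-- Hypothetical staging of the five cluster-finding steps as SQL.
-- All names, the UDF dbo.fBcgLikelihood, and the thresholds are invented.

-- 1. fieldPrep: keep only the measurements of interest
SELECT objID, ra, dec, r, g - r AS color
INTO   GalaxyInput
FROM   Galaxy
WHERE  r < 22;

-- 2. brgSearch: unweighted BCG likelihood, discard unlikely galaxies
SELECT objID, ra, dec, dbo.fBcgLikelihood(r, color) AS bcgLikelihood
INTO   BcgCandidate
FROM   GalaxyInput
WHERE  dbo.fBcgLikelihood(r, color) > 0.1;

-- 3. bcgSearch: weight the likelihood by the neighbor count
SELECT c.objID, c.ra, c.dec,
       c.bcgLikelihood * COUNT(n.neighborObjID) AS weightedLikelihood
INTO   WeightedCandidate
FROM   BcgCandidate c
JOIN   Neighbors n ON n.objID = c.objID
GROUP  BY c.objID, c.ra, c.dec, c.bcgLikelihood;

-- 4/5. bcgCoalesce + getCatalog: keep only local maxima, store the catalog
SELECT w.objID, w.ra, w.dec, w.weightedLikelihood
INTO   ClusterCatalog
FROM   WeightedCandidate w
WHERE  NOT EXISTS (
        SELECT 1
        FROM   Neighbors n
        JOIN   WeightedCandidate w2 ON w2.objID = n.neighborObjID
        WHERE  n.objID = w.objID
          AND  w2.weightedLikelihood > w.weightedLikelihood);
```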
18. SQL Server Cluster
- Applying a zone strategy, P gets partitioned homogeneously among 3 servers
  - S1 provides a 1 deg buffer on top
  - S2 provides a 1 deg buffer on top and bottom
  - S3 provides a 1 deg buffer on bottom
- [Figure: area P split into stripes P1, P2, P3, native to Servers 1, 2, and 3, with overlapping 1 deg buffers]
- Total duplicated data: 4 x 13 deg². Total duplicated work (1 object processed more than once): 2 x 11 deg²
- Maximum time spent by the thicker partition: 2h 15' (other 2 servers: 1h 50')
- A minimal zone/buffer sketch in SQL follows below
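A minimal sketch of the zone/buffer idea, under stated assumptions: the zone height, declination limits, and table names (GalaxyInput, ZonedGalaxy, PartitionServer2) are invented for illustration and do not reproduce the actual partitioning scripts.

```sql
-- Zone strategy sketch: assign objects to declination zones, then give
-- each server its native declination range plus a 1-degree buffer so
-- neighborhood searches never cross server boundaries.
DECLARE @zoneHeight float; SET @zoneHeight = 0.5;   -- zone height in degrees

SELECT objID, ra, dec,
       FLOOR((dec + 90.0) / @zoneHeight) AS zoneID
INTO   ZonedGalaxy
FROM   GalaxyInput;

-- Server 2 is "native" for declinations [10, 20) but also loads the
-- 1-degree buffers above and below its range.
SELECT *
INTO   PartitionServer2
FROM   ZonedGalaxy
WHERE  dec >= 10.0 - 1.0
  AND  dec <  20.0 + 1.0;
```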
19. SQL Server vs. Files
- SQL Server: resolving a target of 66 deg² requires
  - Step A: Find Candidates
    - Input data: 108 MB covering 104 deg² (72 bytes/row, 1,574,656 rows)
    - Time: 6 h on a dual 2.6 GHz
    - Output data: 1.5 MB covering 84 deg² (40 bytes/row, 40,123 rows)
  - Step B: Find Clusters
    - Input data: 1.5 MB
    - Time: 20 minutes
    - Output: 0.43 MB covering 66 deg² (40 bytes/row, 11,249 rows)
  - Total time: 6h 20'
  - Some extra space is required for indexes and some other auxiliary tables
  - Scales linearly with the number of servers
- Files: resolving a target of 66 deg² requires
  - Input data: 66 x 4 x 16 MB ≈ 4 GB
  - Output data: 66 x 4 x 6 KB ≈ 1.5 MB
  - Time: 73 hours; using 10 nodes, 7.3 hours
- Notes (Files vs. SQL)
  - Buffer: 0.25 deg vs. 0.5 deg
  - brgSearch z(0..1) step: 0.01 vs. 0.001
  - Files would require 20-60 times longer to solve this problem with a 0.5 deg buffer and 0.001 steps
20. II. Technology Sharing
- Virtual Observatory
- SkyServer database/website templates
  - Edinburgh, STScI, Caltech, Cambridge, Cornell
- OpenSkyQuery/OpenSkyNodes
  - International standard for federating astro archives
  - Interoperable SOAP implementations working
- NVO Registry Web Service (O'Mullane, Greene)
- Distributed logging and harvesting (Thakar, Gray)
- MyDB workbench for science (O'Mullane, Li)
- Publish your own data
  - À la Spectro Service, but for images and databases
- SkyServer → Soil Biodiversity
21. National Virtual Observatory
- NSF ITR project, "Building the Framework for the National Virtual Observatory", is a collaboration of 17 funded and 3 unfunded organizations
  - Astronomy data centers
  - National observatories
  - Supercomputer centers
  - University departments
  - Computer science/information technology specialists
- PIs: Alex Szalay (JHU), Roy Williams (Caltech)
- Connect the disjoint pieces of data in the world
- Bridge the technology gap for astronomers
- Based on interoperable Web Services
22. International Collaboration
- Similar efforts now in 14 countries
  - USA, Canada, UK, France, Germany, Italy, Holland, Japan, Australia, India, China, Russia, Hungary, South Korea, ESO
- Total awarded funding world-wide is over $60M
- Active collaboration among projects
  - Standards, common demos
  - International VO roadmap being developed
  - Regular telecons over 10 time zones
- Formal collaboration
  - International Virtual Observatory Alliance (IVOA)
- Aiming to have production services by Jan 2005
23. Boundary Conditions
- Standards driven by evolving new technologies
- Exchange of rich and structured data (XML)
- DB connectivity, Web Services, Grid computing
- Application to astronomy domain
- Data dictionaries (UCDs)
- Data models
- Protocols
- Registries and resource/service discovery
- Provenance, data quality
- Dealing with the astronomy legacy
- FITS data format
- Software systems
24. Main VO Challenges
- How to avoid trying to be everything for everybody?
- Database connectivity is essential
  - Bring the analysis to the data
  - Core web services
  - Higher level applications built on top
- Use the 90-10 rule
  - Define the standards and interfaces
  - Build the framework
  - Build the 10% of services that are used by 90%
  - Let the users build the rest from the components
25. Core Services
- Metadata information about resources
- Waveband
- Sky coverage
- Translation of names to universal dictionary (UCD)
- Registry
- Simple search patterns on the resources
- Spatial Search
- Image mosaic
- Unit conversions
- Simple filtering, counting, histograms
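As an example of the kind of "Spatial Search" core service listed above, a cone search can be pushed entirely into the database. The sketch assumes a SkyServer-style HTM table-valued function and PhotoObj columns; the function name, argument units (arcminutes), and columns are assumptions rather than a documented interface.

```sql
-- Hedged cone-search sketch: all objects within 3 arcmin of a position,
-- assuming an HTM-based helper like fGetNearbyObjEq(ra, dec, radius_arcmin)
-- and a PhotoObj table; names and units are assumptions.
SELECT p.objID, p.ra, p.dec, p.r, n.distance
FROM   dbo.fGetNearbyObjEq(185.0, -0.5, 3.0) AS n
JOIN   PhotoObj p ON p.objID = n.objID
ORDER  BY n.distance;
```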
26. Higher Level Services
- Built on Core Services
- Perform more complex tasks
- Examples
- Automated resource discovery
- Cross-identifications
- Photometric redshifts
- Image segmentation
- Outlier detections
- Visualization facilities
- Expectation
- Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)
27. Web Services in Progress
- Registry
- Harvesting and querying
- Data Delivery
- Query driven Queue management
- Spectro service
- Logging services
- Graphics and visualization
- Query driven vs interactive
- Show spatial objects (Chart/Navi/List)
- Footprint/intersect
- It is a fractal
- Cross-matching
- SkyQuery and SkyNode
- Ferris-wheel
- Distributed vs parallel
28. MyDB: eScience Workbench
- Prototype of bringing analysis to the data
- Everybody gets a workspace (database); see the query sketch below
  - Executes analysis at the data
  - Stores intermediate results there
  - Long queries run in batch
  - Results shared within groups
  - Only fetch the final results
- Extremely successful: matches the pattern of work
- Next steps: multiple locations, single authentication
- Farther down the road: parallel workflow system
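A minimal sketch of the MyDB query pattern, assuming a CasJobs-style personal database addressed as MYDB and illustrative table/column names; only the shape of the pattern is implied, not the exact syntax of the production service.

```sql
-- MyDB pattern sketch: run the heavy selection next to the archive,
-- park the intermediate result in the user's own database, and fetch
-- only a small summary.  MYDB.MyGalaxies is an illustrative name.
SELECT objID, ra, dec, petroMag_r
INTO   MYDB.MyGalaxies              -- intermediate result stays server-side
FROM   Galaxy
WHERE  petroMag_r BETWEEN 17 AND 18;

-- Later (possibly as a separate batch job) fetch only the final numbers
SELECT COUNT(*) AS n, AVG(petroMag_r) AS meanMag
FROM   MYDB.MyGalaxies;
```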
29. eEducation Prototype
- SkyServer Educational Projects, aimed at advanced high school students, but covering middle school as well
- Teach how to analyze data and discover patterns, not just astronomy
- 3.7 million project hits, 1.25 million page views of educational content
- More than 4,000 textbooks
- On the whole web site: 44 million web hits
- Largely a volunteer effort by many individuals
- Matches the 2020 curriculum
30. Soil Biodiversity
- How does soil biodiversity affect ecosystem functions, especially decomposition and nutrient cycling, in urban areas?
- JHU is part of the Baltimore Ecosystem Study, one of the NSF LTER monitoring sites
- High resolution monitoring will capture
  - Spatial heterogeneity of the environment
  - Change over time
31. Sensor Monitoring
- Plan: use 400 wireless (Intel) sensors, monitoring
  - Air temperature, moisture
  - Soil temperature and moisture, at least at two depths (5 cm, 20 cm)
  - Light (intensity, composition)
  - Gases (O2, CO2, CH4, ...)
- Long-term continuous data
- Small (hidden) and affordable (many)
  - Less disturbance
- 200 million measurements/year
- Collaboration with Intel and UCB (PI: Szlavecz, JHU)
- Complex database of sensor data and samples; a schema sketch follows below
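To make the last bullet concrete, here is one hypothetical shape such a sensor database could take; the table and column names are invented for illustration and are not the project's actual design.

```sql
-- Hypothetical sensor-measurement schema sized for ~200M readings/year.
-- All names and types are illustrative, not the project's real design.
CREATE TABLE Sensor (
    sensorID  int          PRIMARY KEY,
    latitude  float        NOT NULL,
    longitude float        NOT NULL,
    depthCm   float        NULL           -- NULL for above-ground sensors
);

CREATE TABLE Measurement (
    sensorID  int          NOT NULL REFERENCES Sensor(sensorID),
    obsTime   datetime     NOT NULL,
    quantity  varchar(32)  NOT NULL,      -- e.g. 'soilTemp', 'CO2'
    reading   float        NOT NULL,
    PRIMARY KEY (sensorID, obsTime, quantity)
);
```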
32. III. Beyond Terabytes
- Numerical simulations of turbulence
  - 100 TB on multiple SQL Servers
  - Storing each timestep, enabling backtracking to initial conditions
- Also a fundamental problem in cosmological simulations of galaxy mergers
- Will teach us how to do scientific analysis of 100 TBs
- By the end of the decade: several PB / year
- One needs to demonstrate fault tolerance and fast enough loading speeds
33. Exploration of Turbulence
- For the first time, we can now put it all together
  - Large scale range, scale ratio O(1,000)
  - Three-dimensional in space
  - Time evolution and Lagrangian approach (follow the flow)
- Unique turbulence database
  - We will create a database of O(2,000) consecutive snapshots of a 1,024³ simulation of turbulence, close to 100 Terabytes; a possible layout is sketched below
  - Analysis cluster on top of the DB
  - Treat it as a physics experiment, change configurations every 2 months
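One possible layout for such a snapshot store, purely as a sketch: a row per grid point per timestep, keyed by a space-filling-curve index so that spatially nearby points cluster on disk. The table, column names, and the example index range are assumptions, not the design actually adopted.

```sql
-- Sketch of a snapshot table for the 1,024^3 velocity field; all names
-- and the example key range are illustrative.
CREATE TABLE VelocityField (
    timestep int     NOT NULL,
    zIndex   bigint  NOT NULL,    -- Morton (z-order) code of (i, j, k)
    vx       real    NOT NULL,
    vy       real    NOT NULL,
    vz       real    NOT NULL,
    PRIMARY KEY (timestep, zIndex)
);

-- A local analysis query then becomes a clustered range scan: the grid
-- points around one location at one timestep.
SELECT zIndex, vx, vy, vz
FROM   VelocityField
WHERE  timestep = 1200
  AND  zIndex BETWEEN 123456000 AND 123456999;
```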
34. LSST
- Large Synoptic Survey Telescope (2012)
- Few PB/yr data rate
- Repeat SDSS in 4 nights
- Main issue is with data management
- Data volume similar to high energy physics, but need object granularity
- Very high resolution time series, moving objects
- Need to build 100TB scale prototypes today
- Hierarchical organization of data products
35. The Big Picture
[Diagram: facts flow in from Experiments & Instruments, Other Archives, Literature, and Simulations; questions go in, answers come out: new SCIENCE!]
The Big Problems
- Data ingest
- Managing a petabyte
- Common schema
- How to organize it?
- How to reorganize it
- How to coexist with others
- Query and Vis tools
- Support/training
- Performance
- Execute queries in a minute
- Batch query scheduling