Title: Data Explosion: Science with Terabytes
1. Data Explosion: Science with Terabytes
- Alex Szalay, JHU, and Jim Gray, Microsoft Research
2. Living in an Exponential World
- Astronomers have a few hundred TB now
  - 1 pixel (byte) / sq arc second ~ 4 TB
  - Multi-spectral, temporal → 1 PB
- They mine it looking for new (kinds of) objects, or more of interesting ones (quasars), density variations in 400-D space, correlations in 400-D space
- Data doubles every year
- Data is public after 1 year
  - So, 50% of the data is public
  - Same access for everyone
3. The Challenges
- Exponential data growth, distributed collections, soon Petabytes
- Data Collection → Discovery and Analysis → Publishing
- New analysis paradigm: data federations, move analysis to the data
- New publishing paradigm: scientists are publishers and curators
4. New Science: Data Exploration
- Data growing exponentially in many different areas
- Publishing so much data requires a new model
- Multiple challenges for different communities
  - publishing, data mining, data visualization, digital libraries, education, web services poster-child
- Information at your fingertips
  - Students see the same data as professional astronomers
- More data coming: Petabytes/year by 2010
  - We need scalable solutions
  - Move analysis to the data!
- Same thing happening in all sciences
  - High energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing, ...
- Data Exploration: an emerging new branch of science
  - Currently has no owner
5. Advances at JHU
- Designed and built the science archive for the SDSS
  - Currently 2 Terabytes, soon to reach 3 TB
- Built fast spatial search library
- Created novel pipeline for data loading
- Built the SkyServer, a public access website for SDSS, with over 45M web hits and millions of free-form SQL queries
- Built the first web services used in science
  - SkyQuery, ImgCutout, various visualization tools
- Leading the Virtual Observatory effort
- Heavy involvement in Grid Computing
- Exploring other areas
6. Collaborative Projects
- Sloan Digital Sky Survey (11 institutions)
- National Virtual Observatory (17 institutions)
- International Virtual Observatory Alliance (14 countries)
- Grid For Physics Networks (10 institutions)
- Wireless sensors for Soil Biodiversity (BES, Intel, UCB)
- Digital Libraries (JHU, Cornell, Harvard, Edinburgh)
- Hydrodynamic Turbulence (JHU Engineering)
- Informal exchanges with NCBI
7. Directions
- We understand how to mine a few terabytes
- We built an environment; now our tools allow new breakthroughs in astrophysics
- Open collaborations beyond astrophysics (turbulence, sensor-driven biodiversity, bioinformatics, digital libraries, education)
- Attack problems on the 100 Terabyte scale, prepare for the Petabytes of tomorrow
8. The JHU Core Group
- Faculty
- Alex Szalay
- Ethan Vishniac
- Charles Meneveau
- Graduate Students
- Tanu Malik
- Adrian Pope
- Postdoctoral Fellows
- Tamas Budavari
- Research Staff
- George Fekete
- Vivek Haridas
- Nolan Li
- Will O'Mullane
- Maria Nieto-Santisteban
- Jordan Raddick
- Anirudha Thakar
- Jan Vandenberg
9. Examples
- Astrophysics inside the database
- Technology sharing in other areas
- Beyond Terabytes
10. I. Astrophysics in the DB
- Studies of galaxy clustering
  - Budavari, Pope, Szapudi
- Spectro Service: publishing spectral data
  - Budavari, Dobos
- Cluster finding with a parallel DB-oriented workflow system
  - Nieto-Santisteban, Malik, Thakar, Annis, Sekhri
- Complex spatial computations inside the DB
  - Fekete, Gray, Szalay
- Visual tools with the DB
  - ImgCutout (Nieto), Geometry viewer (Szalay), MirageSQL (Carlisle)
11. The SDSS Photo-z Sample
  Sample cut                Galaxies
  All                       50M
  mr < 21                   15M
  10 stripes                10M
  0.1 < z < 0.3, -20 > Mr   2.2M
  -20 > Mr > -21            1,182k
  -21 > Mr > -23            931k
  -21 > Mr > -22            662k
  -22 > Mr > -23            269k
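The successive cuts above are exactly the kind of selection that runs directly inside the archive. As a rough, hypothetical illustration (the table and column names Galaxy, Photoz, modelMag_r, absMagR, and z are assumptions, not the actual SkyServer schema), one luminosity bin of the photo-z sample could be counted with a query like this:

```sql
-- Hypothetical sketch: count one photo-z luminosity bin inside the DB.
-- Table and column names are illustrative, not the real SDSS schema.
SELECT COUNT(*) AS nGalaxies
FROM   Galaxy g
JOIN   Photoz p ON p.objID = g.objID
WHERE  g.modelMag_r < 21              -- mr < 21 flux limit
  AND  p.z BETWEEN 0.1 AND 0.3        -- photometric-redshift shell
  AND  p.absMagR BETWEEN -22 AND -21; -- luminosity cut -21 > Mr > -22
```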
12. The Analysis
- eSpICE: I. Szapudi, S. Colombi and S. Prunet
  - Integrated with the database by T. Budavari
- Extremely fast processing
  - 1 stripe with about 1 million galaxies is processed in 3 mins
  - Usual figure was 10 min for 10,000 galaxies, i.e. about 70 days for a full stripe
- Each stripe processed separately for each cut
- 2D angular correlation function computed
- w(θ) averaged with rejection of pixels along the scan
  - Correlations due to flat field vector
  - Unavoidable for drift scan
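For reference, the quantity being computed is the two-point angular correlation function; the block below states its definition and one commonly used pair-count estimator (Landy and Szalay 1993). This is background notation only: the slides do not say which estimator eSpICE implements internally.

```latex
% w(theta): excess probability, over a uniform random distribution, of
% finding two galaxies separated by an angle theta on the sky:
\[
  dP = \bar{n}^{2}\,\bigl[\,1 + w(\theta)\,\bigr]\; d\Omega_{1}\, d\Omega_{2}
\]
% A widely used pair-count estimator (Landy & Szalay 1993), with DD, DR,
% RR the normalized data-data, data-random, and random-random pair counts:
\[
  \hat{w}(\theta) \;=\; \frac{DD(\theta) - 2\,DR(\theta) + RR(\theta)}{RR(\theta)}
\]
```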
13. Angular Power Spectrum
- Use photometric redshifts for LRGs
- Create thin redshift slices and analyze angular clustering
- From characteristic features (baryon bumps, etc.) we obtain angular diameter vs. distance → Dark Energy
- Healpix pixelization in the database
  - Each redshift slice is generated in 2 minutes
  - Using SpICE over 160,000 pixels in N^1.7 time
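A minimal sketch of what "Healpix pixelization in the database" can look like, assuming each galaxy row already carries a precomputed HEALPix index and a photometric redshift; the table and column names (Galaxy, healpixID, photoz) are illustrative, not the actual schema:

```sql
-- Counts-in-pixels map for one thin redshift slice, built server-side.
-- Assumes a precomputed HEALPix index column; names are illustrative.
SELECT healpixID,
       COUNT(*) AS galaxyCount
FROM   Galaxy
WHERE  photoz BETWEEN 0.20 AND 0.25   -- one redshift shell
GROUP  BY healpixID;
```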
14. Large Scale Power Spectrum
- Goal: measure cosmological parameters
  - Cosmological constant or Dark Energy?
- Karhunen-Loeve technique
  - Subdivide slices into about 5K-15K cells
  - Compute correlation matrix of galaxy counts among cells from fiducial P(k) + noise model
  - Diagonalize matrix
  - Expand data over KL basis
- Iterate over parameter values
  - Compute new correlation matrix
  - Invert, then compute log likelihood
- Vogeley and Szalay (1996)
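The steps above can be summarized in two formulas; the notation below is a schematic restatement of the Karhunen-Loeve approach of Vogeley and Szalay (1996), not a transcription of the actual analysis code.

```latex
% Let d be the vector of galaxy counts in the ~5K-15K cells and C_fid its
% covariance built from the fiducial P(k) plus the noise model.  The KL
% basis diagonalizes C_fid, and the data are expanded over that basis:
\[
  C_{\mathrm{fid}}\,\psi_i = \lambda_i\,\psi_i , \qquad B_i = \psi_i^{\top} d .
\]
% For each trial parameter set \Theta, rebuild the covariance C(\Theta),
% invert it, and evaluate the Gaussian log-likelihood of the KL modes:
\[
  \ln \mathcal{L}(\Theta) = -\tfrac{1}{2}\Bigl[ \ln\det C(\Theta)
      + B^{\top} C(\Theta)^{-1} B \Bigr] + \mathrm{const}.
\]
```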
15. Ωb/Ωm vs. Ωmh
[Figure: likelihood contours in the Ωb/Ωm vs. Ωmh plane]
- SDSS only: Ωmh = 0.26 +/- 0.04, Ωb/Ωm = 0.29 +/- 0.07
- SDSS: Pope et al. (2004); WMAP: Verde et al. (2003), Spergel et al. (2003)
16. Numerical Effort
- Most of the time spent in data manipulation
  - Fast spatial searches over data and MC (SQL)
- Diagonalization of 20K x 20K matrices
- Inversions of a few 100K 5K x 5K matrices
- Has the potential to constrain the Dark Energy
- Accuracy enabled by large data set sizes
- But new kinds of problems
  - Errors driven by the systematics, not by sample size
  - Scaling of analysis algorithms critical!
- Monte Carlo realizations with a few 100M points in SQL
17. Cluster Finding
- Five main steps (Annis et al. 2002); a schematic SQL version of the pipeline is sketched below
  - Get Galaxy List
    - fieldPrep: extracts from the main data set the measurements of interest
  - Filter
    - brgSearch: calculates the unweighted BCG likelihood for each galaxy (unweighted by galaxy count) and discards unlikely galaxies
  - Check Neighbors
    - bcgSearch: weights the BCG likelihood with the number of neighbors
  - Pick Most Likely
    - bcgCoalesce: determines whether a galaxy is the most likely galaxy in the neighborhood to be the center of the cluster
  - Discard Bogus
    - getCatalog: removes suspicious results and produces and stores the final cluster catalog
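The sketch below shows one way the five steps could be staged as successive SELECT ... INTO tables inside SQL Server. It is an illustration only: the table names, the likelihood function dbo.fBcgLikelihood, the Neighbors pair table, and the numeric thresholds are all hypothetical, not the actual pipeline.

```sql
-- Hypothetical staging of the five cluster-finding steps as SQL.
-- All names, the UDF dbo.fBcgLikelihood, and the thresholds are invented.

-- 1. fieldPrep: keep only the measurements of interest
SELECT objID, ra, dec, r, g - r AS color
INTO   GalaxyInput
FROM   Galaxy
WHERE  r < 22;

-- 2. brgSearch: unweighted BCG likelihood, discard unlikely galaxies
SELECT objID, ra, dec, dbo.fBcgLikelihood(r, color) AS bcgLikelihood
INTO   BcgCandidate
FROM   GalaxyInput
WHERE  dbo.fBcgLikelihood(r, color) > 0.1;

-- 3. bcgSearch: weight the likelihood by the neighbor count
SELECT c.objID, c.ra, c.dec,
       c.bcgLikelihood * COUNT(n.neighborObjID) AS weightedLikelihood
INTO   WeightedCandidate
FROM   BcgCandidate c
JOIN   Neighbors n ON n.objID = c.objID
GROUP  BY c.objID, c.ra, c.dec, c.bcgLikelihood;

-- 4/5. bcgCoalesce + getCatalog: keep only local maxima, store the catalog
SELECT w.objID, w.ra, w.dec, w.weightedLikelihood
INTO   ClusterCatalog
FROM   WeightedCandidate w
WHERE  NOT EXISTS (
        SELECT 1
        FROM   Neighbors n
        JOIN   WeightedCandidate w2 ON w2.objID = n.neighborObjID
        WHERE  n.objID = w.objID
          AND  w2.weightedLikelihood > w.weightedLikelihood);
```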
18. SQL Server Cluster
- Applying a zone strategy, P gets partitioned homogeneously among 3 servers
  - S1 provides a 1 deg buffer on top
  - S2 provides a 1 deg buffer on top and bottom
  - S3 provides a 1 deg buffer on bottom
- [Figure: area P split into stripes P1, P2, P3, native to Servers 1, 2, and 3, with overlapping 1 deg buffers]
- Total duplicated data: 4 x 13 deg². Total duplicated work (1 object processed more than once): 2 x 11 deg²
- Maximum time spent by the thicker partition: 2h 15' (other 2 servers: 1h 50')
- A minimal zone/buffer sketch in SQL follows below
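A minimal sketch of the zone/buffer idea, under stated assumptions: the zone height, declination limits, and table names (GalaxyInput, ZonedGalaxy, PartitionServer2) are invented for illustration and do not reproduce the actual partitioning scripts.

```sql
-- Zone strategy sketch: assign objects to declination zones, then give
-- each server its native declination range plus a 1-degree buffer so
-- neighborhood searches never cross server boundaries.
DECLARE @zoneHeight float; SET @zoneHeight = 0.5;   -- zone height in degrees

SELECT objID, ra, dec,
       FLOOR((dec + 90.0) / @zoneHeight) AS zoneID
INTO   ZonedGalaxy
FROM   GalaxyInput;

-- Server 2 is "native" for declinations [10, 20) but also loads the
-- 1-degree buffers above and below its range.
SELECT *
INTO   PartitionServer2
FROM   ZonedGalaxy
WHERE  dec >= 10.0 - 1.0
  AND  dec <  20.0 + 1.0;
```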
19. SQL Server vs. Files
- SQL Server: resolving a target of 66 deg² requires
  - Step A: Find Candidates
    - Input data: 108 MB covering 104 deg² (72 bytes/row, 1,574,656 rows)
    - Time: 6 h on a dual 2.6 GHz
    - Output data: 1.5 MB covering 84 deg² (40 bytes/row, 40,123 rows)
  - Step B: Find Clusters
    - Input data: 1.5 MB
    - Time: 20 minutes
    - Output: 0.43 MB covering 66 deg² (40 bytes/row, 11,249 rows)
  - Total time: 6h 20'
  - Some extra space is required for indexes and some other auxiliary tables
  - Scales linearly with the number of servers
- Files: resolving a target of 66 deg² requires
  - Input data: 66 x 4 x 16 MB ≈ 4 GB
  - Output data: 66 x 4 x 6 KB ≈ 1.5 MB
  - Time: 73 hours; using 10 nodes, 7.3 hours
- Notes (Files vs. SQL)
  - Buffer: 0.25 deg vs. 0.5 deg
  - brgSearch z(0..1) step: 0.01 vs. 0.001
  - Files would require 20-60 times longer to solve this problem with a 0.5 deg buffer and 0.001 steps
20. II. Technology Sharing
- Virtual Observatory
- SkyServer database/website templates
  - Edinburgh, STScI, Caltech, Cambridge, Cornell
- OpenSkyQuery/OpenSkyNodes
  - International standard for federating astro archives
  - Interoperable SOAP implementations working
- NVO Registry Web Service (O'Mullane, Greene)
- Distributed logging and harvesting (Thakar, Gray)
- MyDB workbench for science (O'Mullane, Li)
- Publish your own data
  - À la Spectro Service, but for images and databases
- SkyServer → Soil Biodiversity
21. National Virtual Observatory
- NSF ITR project, "Building the Framework for the National Virtual Observatory", is a collaboration of 17 funded and 3 unfunded organizations
  - Astronomy data centers
  - National observatories
  - Supercomputer centers
  - University departments
  - Computer science/information technology specialists
- PIs: Alex Szalay (JHU), Roy Williams (Caltech)
- Connect the disjoint pieces of data in the world
- Bridge the technology gap for astronomers
- Based on interoperable Web Services
22. International Collaboration
- Similar efforts now in 14 countries
  - USA, Canada, UK, France, Germany, Italy, Holland, Japan, Australia, India, China, Russia, Hungary, South Korea, ESO
- Total awarded funding world-wide is over $60M
- Active collaboration among projects
  - Standards, common demos
  - International VO roadmap being developed
  - Regular telecons over 10 time zones
- Formal collaboration
  - International Virtual Observatory Alliance (IVOA)
- Aiming to have production services by Jan 2005
23. Boundary Conditions
- Standards driven by evolving new technologies
- Exchange of rich and structured data (XML)
- DB connectivity, Web Services, Grid computing
- Application to astronomy domain
- Data dictionaries (UCDs)
- Data models
- Protocols
- Registries and resource/service discovery
- Provenance, data quality
- Dealing with the astronomy legacy
- FITS data format
- Software systems
24. Main VO Challenges
- How to avoid trying to be everything for everybody?
- Database connectivity is essential
  - Bring the analysis to the data
  - Core web services
  - Higher level applications built on top
- Use the 90-10 rule
  - Define the standards and interfaces
  - Build the framework
  - Build the 10% of services that are used by 90%
  - Let the users build the rest from the components
25. Core Services
- Metadata information about resources
- Waveband
- Sky coverage
- Translation of names to universal dictionary (UCD)
- Registry
- Simple search patterns on the resources
- Spatial Search
- Image mosaic
- Unit conversions
- Simple filtering, counting, histograms
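As an example of the kind of "Spatial Search" core service listed above, a cone search can be pushed entirely into the database. The sketch assumes a SkyServer-style HTM table-valued function and PhotoObj columns; the function name, argument units (arcminutes), and columns are assumptions rather than a documented interface.

```sql
-- Hedged cone-search sketch: all objects within 3 arcmin of a position,
-- assuming an HTM-based helper like fGetNearbyObjEq(ra, dec, radius_arcmin)
-- and a PhotoObj table; names and units are assumptions.
SELECT p.objID, p.ra, p.dec, p.r, n.distance
FROM   dbo.fGetNearbyObjEq(185.0, -0.5, 3.0) AS n
JOIN   PhotoObj p ON p.objID = n.objID
ORDER  BY n.distance;
```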
26. Higher Level Services
- Built on Core Services
- Perform more complex tasks
- Examples
- Automated resource discovery
- Cross-identifications
- Photometric redshifts
- Image segmentation
- Outlier detections
- Visualization facilities
- Expectation
- Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)
27. Web Services in Progress
- Registry
- Harvesting and querying
- Data Delivery
- Query driven Queue management
- Spectro service
- Logging services
- Graphics and visualization
- Query driven vs interactive
- Show spatial objects (Chart/Navi/List)
- Footprint/intersect
- It is a fractal
- Cross-matching
- SkyQuery and SkyNode
- Ferris-wheel
- Distributed vs parallel
28. MyDB: eScience Workbench
- Prototype of bringing analysis to the data
- Everybody gets a workspace (database); see the query sketch below
  - Executes analysis at the data
  - Stores intermediate results there
  - Long queries run in batch
  - Results shared within groups
  - Only fetch the final results
- Extremely successful: matches the pattern of work
- Next steps: multiple locations, single authentication
- Farther down the road: parallel workflow system
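A minimal sketch of the MyDB query pattern, assuming a CasJobs-style personal database addressed as MYDB and illustrative table/column names; only the shape of the pattern is implied, not the exact syntax of the production service.

```sql
-- MyDB pattern sketch: run the heavy selection next to the archive,
-- park the intermediate result in the user's own database, and fetch
-- only a small summary.  MYDB.MyGalaxies is an illustrative name.
SELECT objID, ra, dec, petroMag_r
INTO   MYDB.MyGalaxies              -- intermediate result stays server-side
FROM   Galaxy
WHERE  petroMag_r BETWEEN 17 AND 18;

-- Later (possibly as a separate batch job) fetch only the final numbers
SELECT COUNT(*) AS n, AVG(petroMag_r) AS meanMag
FROM   MYDB.MyGalaxies;
```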
29. eEducation Prototype
- SkyServer Educational Projects, aimed at advanced high school students, but covering middle school as well
- Teach how to analyze data and discover patterns, not just astronomy
- 3.7 million project hits, 1.25 million page views of educational content
- More than 4,000 textbooks
- On the whole web site: 44 million web hits
- Largely a volunteer effort by many individuals
- Matches the 2020 curriculum
30. Soil Biodiversity
- How does soil biodiversity affect ecosystem functions, especially decomposition and nutrient cycling, in urban areas?
- JHU is part of the Baltimore Ecosystem Study, one of the NSF LTER monitoring sites
- High resolution monitoring will capture
  - Spatial heterogeneity of the environment
  - Change over time
31. Sensor Monitoring
- Plan: use 400 wireless (Intel) sensors, monitoring
  - Air temperature, moisture
  - Soil temperature and moisture, at least at two depths (5 cm, 20 cm)
  - Light (intensity, composition)
  - Gases (O2, CO2, CH4, ...)
- Long-term continuous data
- Small (hidden) and affordable (many)
  - Less disturbance
- 200 million measurements/year
- Collaboration with Intel and UCB (PI: Szlavecz, JHU)
- Complex database of sensor data and samples; a schema sketch follows below
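To make the last bullet concrete, here is one hypothetical shape such a sensor database could take; the table and column names are invented for illustration and are not the project's actual design.

```sql
-- Hypothetical sensor-measurement schema sized for ~200M readings/year.
-- All names and types are illustrative, not the project's real design.
CREATE TABLE Sensor (
    sensorID  int          PRIMARY KEY,
    latitude  float        NOT NULL,
    longitude float        NOT NULL,
    depthCm   float        NULL           -- NULL for above-ground sensors
);

CREATE TABLE Measurement (
    sensorID  int          NOT NULL REFERENCES Sensor(sensorID),
    obsTime   datetime     NOT NULL,
    quantity  varchar(32)  NOT NULL,      -- e.g. 'soilTemp', 'CO2'
    reading   float        NOT NULL,
    PRIMARY KEY (sensorID, obsTime, quantity)
);
```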
32. III. Beyond Terabytes
- Numerical simulations of turbulence
  - 100 TB on multiple SQL Servers
  - Storing each timestep, enabling backtracking to initial conditions
- Also a fundamental problem in cosmological simulations of galaxy mergers
- Will teach us how to do scientific analysis of 100 TBs
- By the end of the decade: several PB / year
- One needs to demonstrate fault tolerance and fast enough loading speeds
33. Exploration of Turbulence
- For the first time, we can now put it all together
  - Large scale range, scale ratio O(1,000)
  - Three-dimensional in space
  - Time evolution and Lagrangian approach (follow the flow)
- Unique turbulence database
  - We will create a database of O(2,000) consecutive snapshots of a 1,024³ simulation of turbulence, close to 100 Terabytes; a possible layout is sketched below
  - Analysis cluster on top of the DB
  - Treat it as a physics experiment, change configurations every 2 months
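One possible layout for such a snapshot store, purely as a sketch: a row per grid point per timestep, keyed by a space-filling-curve index so that spatially nearby points cluster on disk. The table, column names, and the example index range are assumptions, not the design actually adopted.

```sql
-- Sketch of a snapshot table for the 1,024^3 velocity field; all names
-- and the example key range are illustrative.
CREATE TABLE VelocityField (
    timestep int     NOT NULL,
    zIndex   bigint  NOT NULL,    -- Morton (z-order) code of (i, j, k)
    vx       real    NOT NULL,
    vy       real    NOT NULL,
    vz       real    NOT NULL,
    PRIMARY KEY (timestep, zIndex)
);

-- A local analysis query then becomes a clustered range scan: the grid
-- points around one location at one timestep.
SELECT zIndex, vx, vy, vz
FROM   VelocityField
WHERE  timestep = 1200
  AND  zIndex BETWEEN 123456000 AND 123456999;
```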
34. LSST
- Large Synoptic Survey Telescope (2012)
- Few PB/yr data rate
- Repeat SDSS in 4 nights
- Main issue is with data management
- Data volume similar to high energy physics, but need object granularity
- Very high resolution time series, moving objects
- Need to build 100TB scale prototypes today
- Hierarchical organization of data products
35. The Big Picture
[Diagram: facts flow in from Experiments & Instruments, Other Archives, Literature, and Simulations; questions go in, answers come out: new SCIENCE!]
The Big Problems
- Data ingest
- Managing a petabyte
- Common schema
- How to organize it?
- How to reorganize it
- How to coexist with others
- Query and Vis tools
- Support/training
- Performance
- Execute queries in a minute
- Batch query scheduling