Title: Public Access to Large Astronomical Datasets
1. Public Access to Large Astronomical Datasets
- Alex Szalay, Johns Hopkins
- Jim Gray, Microsoft Research
2. Outline
- Trends
- The Sloan Digital Sky Survey
- The Cosmic Genome Project
- The SDSS database design
- The World-Wide Telescope
- Virtual Observatory: federating archives over the world
- Exploring Web Services
- SkyQuery, Image Cutout
3. Living in an Exponential World
- Astronomers have a few hundred TB now
- 1 pixel (byte) / sq arc second ~ 4 TB
- Multi-spectral, temporal, ... -> 1 PB
- They mine it looking for new (kinds of) objects, or more of the interesting ones (quasars), density variations in 400-D space, correlations in 400-D space
- Data doubles every year
- Data is public after 1 year
- So, 50% of the data is public
- Some have private access to 5% more data
- So: 50% vs 55% access for everyone
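The 50% figure follows directly from the doubling: if the cumulative volume grows as 2^t (t in years), everything taken up to one year ago is 2^(t-1), which is half of today's total. A minimal Python sketch of that arithmetic; the function name and parameters are illustrative and not from the slides:

```python
def public_fraction(doubling_time_years: float = 1.0,
                    embargo_years: float = 1.0) -> float:
    """Fraction of all data collected so far that is already public.

    Assumes the cumulative volume grows as 2 ** (t / doubling_time) and
    that data becomes public a fixed embargo period after it is taken.
    """
    return 2.0 ** (-embargo_years / doubling_time_years)


if __name__ == "__main__":
    print(f"public share: {public_fraction():.0%}")
```

With a longer embargo or faster doubling, the public share shrinks; with these assumptions a two-year embargo would leave only a quarter of the data public.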
4. Science is hitting a wall
- FTP and GREP are not adequate
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years.
- Oh, and 1 PB is about 10,000 disks
- At some point you need indices to limit search, and parallel data search and analysis
- This is where databases can help
- You can FTP 1 MB in 1 sec
- You can FTP 1 GB / min (~$1/GB)
- 1 TB: 2 days and $1K
- 1 PB: 3 years and $1M
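The GREP numbers are order-of-magnitude estimates for a sequential scan. The sketch below assumes a single reader at a sustained 10 MB/s (an assumed rate, since the slide does not state one) and contrasts the scan time with the number of comparisons a binary search over a sorted index would need. Function names are illustrative.

```python
import math

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def scan_seconds(size_bytes: float, rate_bytes_per_s: float = 10e6) -> float:
    """Time to read size_bytes sequentially at an assumed sustained rate."""
    return size_bytes / rate_bytes_per_s


def index_probes(n_records: int) -> int:
    """Comparisons a binary search needs over n_records sorted keys."""
    return max(1, math.ceil(math.log2(n_records)))


if __name__ == "__main__":
    for label, size in [("1 TB", 1e12), ("1 PB", 1e15)]:
        t = scan_seconds(size)
        print(f"{label}: {t / SECONDS_PER_DAY:,.1f} days "
              f"({t / SECONDS_PER_YEAR:.2f} years)")
    print("probes for 1e9 records:", index_probes(10**9))
```

Under this assumed rate a petabyte scan takes roughly three years, the same order as the slide's figure, while an index lookup over a billion records needs only a few dozen comparisons. Splitting the scan across many disks divides the scan time by the number of disks, which is the "parallel data search" half of the argument.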
5. Making Discoveries
- When and where are discoveries made?
- Always at the edges and boundaries
- Going deeper, using more colors.
- Metcalfe's law
- Utility of computer networks grows as the number of possible connections: O(N^2)
- VO: federation of N archives
- Possibilities for new discoveries grow as O(N^2)
- Current sky surveys have proven this
- Very early discoveries from SDSS, 2MASS, DPOSS
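The O(N^2) claim is the count of distinct archive pairs, N(N-1)/2. A small sketch, using the three surveys named above only as labels (the helper name is illustrative):

```python
from itertools import combinations


def pairwise_links(n_archives: int) -> int:
    """Number of distinct archive pairs: n * (n - 1) / 2, i.e. O(N^2)."""
    return n_archives * (n_archives - 1) // 2


if __name__ == "__main__":
    archives = ["SDSS", "2MASS", "DPOSS"]
    pairs = list(combinations(archives, 2))
    # The closed form agrees with explicit enumeration.
    assert len(pairs) == pairwise_links(len(archives))
    for n in (3, 10, 100):
        print(n, "archives ->", pairwise_links(n), "possible pairings")
```

Each pair is a potential cross-survey comparison, so adding one archive to a federation of N adds N new pairings rather than one.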
6. Publishing Data
Roles        Authors         Publishers        Curators         Consumers
Traditional  Scientists      Journals          Libraries        Scientists
Emerging     Collaborations  Project www site  Bigger Archives  Scientists
7. Changing Roles
- Exponential growth
- Projects last at least 3-5 years
- Data sent upwards only at the end of the project
- Data will never be centralized
- More responsibility on projects
- Becoming Publishers and Curators
- Larger fraction of budget spent on software
- A lot of development is duplicated, wasted
- All documentation is contained in the archive
- More standards are needed
- Easier data interchange, fewer tools
- More templates are needed
- Develop less software on your own
8. Emerging New Concepts
- Standardizing distributed data
- Web Services, supported on all platforms
- Custom configure remote data dynamically
- XML: Extensible Markup Language
- SOAP: Simple Object Access Protocol
- WSDL: Web Services Description Language
- Standardizing distributed computing
- Grid Services
- Custom configure remote computing dynamically
- Build your own remote computer, and discard
- Virtual Data: new data sets on demand
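To make the XML/SOAP layering concrete, here is a minimal sketch of how a request to a web service is wrapped in a SOAP 1.1 envelope. The `GetCutout` operation, its parameters, and the `example.org` namespace are hypothetical and stand in for whatever a real service's WSDL would declare; only the envelope namespace is the standard SOAP 1.1 one.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/cutout"  # hypothetical service namespace


def build_request(ra: float, dec: float, width_px: int) -> str:
    """Serialize a call to a hypothetical GetCutout operation as a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}GetCutout")
    # Each argument becomes a child element of the operation element.
    for name, value in (("ra", ra), ("dec", dec), ("width", width_px)):
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_request(181.3, -0.76, 256))
```

In practice a client toolkit would generate this envelope from the service's WSDL description, which is what makes the services usable from any platform.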
9. Features of the SDSS
- Goal: create the most detailed map of the Northern sky in 5 years
- 2.5m telescope, Apache Point, NM
- 3 degree field of view
- ¼ of the whole sky
- Two surveys in one: photometric survey in 5 bands, spectroscopic redshift survey
- Automated data reduction: 150 man-years of development
- Very high data volume: 40 TB of raw data, 5 TB processed catalogs
- Data is public
- The University of Chicago
- Princeton University
- The Johns Hopkins University
- The University of Washington
- New Mexico State University
- Fermi National Accelerator Laboratory
- US Naval Observatory
- The Japanese Participation Group
- The Institute for Advanced Study
- Max Planck Inst, Heidelberg
- Sloan Foundation, NSF, DOE, NASA
10. The Imaging Survey
- Continuous data rate of 8 Mbytes/sec
- Northern Galactic Cap: drift scan of 10,000 square degrees
- 24k x 1M pixel panoramic images in 5 colors: broad-band filters (u,g,r,i,z)
- exposure time: 55 sec
- pixel size: 0.4 arcsec
- astrometry: 60 mas
- calibration: 2%
- done only in best seeing (20 nights/year)
- Southern Galactic Cap: multiple scans (> 30 times) of the same stripe
11. The Spectroscopic Survey
[Figure: Elliptical galaxy]
- Expanding universe: redshift = distance
- SDSS Redshift Survey
- 1 million galaxies
- 100,000 quasars
- 100,000 stars
- Two high throughput spectrographs
- spectral range: 3900-9200 Å
- 640 spectra simultaneously
- R=2000 resolution, 1.3 Å
- Features
- Automated reduction of spectra
- Very high sampling density and completeness
12. Data Flow
13. Public Data Release
- June 2002: EDR
- Early Data Release
- January 2003: DR1
- Contains 30% of final data
- 200 million photo objects
- 4 versions of the data
- Target, best, runs, spectro
- Total catalog volume 1.7 TB
- See the Terascale Sneakernet paper
- Published releases served forever
- EDR, DR1, DR2, ...
- Soon to include email archives, annotations
- O(N^2) only possible because of Moore's Law!
14. Why Is Astronomy Data Special?
- It has no commercial value
- No privacy concerns
- Can freely share results with others
- Great for experimenting with algorithms
- It is real and well documented
- High-dimensional (with confidence intervals)
- Spatial
- Temporal
- Diverse and distributed
- Many different instruments from many different places and many different times
- The questions are interesting
- There is a lot of it (petabytes)
15. Virtual Observatory
- Many new surveys are coming
- SDSS is a dry run for the next ones
- LSST will be 1TB/night
- All the data will be on the Internet
- But how? FTP, web service
- Data and apps will be associated with the instruments
- Distributed world wide
- Cross-indexed
- Federation is a must, but how?
- Will be the best telescope in the world
- World Wide Telescope
16. SkyQuery Experimental Federation
- Federated 5 Web Services
- Portal unifies 3 archives and a cutout service to visualize results
- Fermilab/SDSS, JHU/FIRST, Caltech/2MASS Archives
- Multi-survey spatial join and SQL select
- Distributed query optimization (T. Malik, T. Budavari) in 6 weeks
- http://www.skyquery.net/
- Cutout web service: annotated SDSS images
- http://skyservice.pha.jhu.edu/sdsscutout/
SELECT o.objId, o.ra, o.r, o.type, o.I, t.objId, t.j_m
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5
  AND AREA(181.3,-0.76,6.5)
  AND o.type = 3
  AND (o.I - t.j_m) > 2
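The core of such a query is a spatial join between catalogs. The sketch below is a deliberately naive stand-in, not SkyQuery's XMATCH (whose definition the slide does not give): it pairs objects whose angular separation is under a tolerance, and it reuses 3.5 as a tolerance in arcseconds purely for illustration. The catalog rows are made-up coordinates.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


def cross_match(left, right, tolerance_arcsec=3.5):
    """Naive O(N*M) join: pair objects closer than the tolerance.

    left/right are iterables of (obj_id, ra_deg, dec_deg) tuples.
    """
    return [(lid, rid)
            for lid, lra, ldec in left
            for rid, rra, rdec in right
            if angular_sep_arcsec(lra, ldec, rra, rdec) < tolerance_arcsec]


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    optical = [("o1", 181.3000, -0.7600), ("o2", 181.4000, -0.7000)]
    infrared = [("t1", 181.3005, -0.7601), ("t2", 181.9000, -0.2000)]
    print(cross_match(optical, infrared))
```

A nested loop like this is only workable for tiny inputs; a federated service has to use spatial indices and choose which archive to query first, which is where the distributed query optimization comes in.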
17. Summary
- The data is public and largely self-documenting
- Get your own copy!
- The SDSS database and web app are interesting
- Data mining challenge
- Data visualization challenge
- Educational challenge
- Web services poster-child
- Information at your fingertips
- Students see the same data as professional astronomers
- More data coming
- 1.7 TB public data by Jan 2003, 6 TB coming
- The World-Wide Telescope
- Federating the astronomy archives is a CS challenge