Title: María Nieto-Santisteban
1. Large-Scale Cross-Matching with Open SkyQuery
- María Nieto-Santisteban
- Ani Thakar
- Alex Szalay, et al.
- The Johns Hopkins University
AISRP 2008 @ College Park, University of Maryland
2. Goals
- Implement an access and cross-matching engine
that facilitates access to large digital archives
and enables new scientific discoveries by
cross-correlating multi-wavelength datasets
3. Handling the Large-Scale
- The 20 Spatial Queries
- Partitioning & Parallelization
- Asynchronous Data Access
- Efficient Cross-Match
- Workflow Management
- Cluster Management
- Data transport
4. The 20 Spatial Queries
- Single/Multi Catalog Regions
- Cone search: find objects within a circle
- Find objects within a circle satisfying highly multi-dimensional constraints
- Find the closest neighbor
- Find objects within a region
- Find objects in/outside masked regions
- Find objects near the edges of a region
- Compute the area of a region
- Find surveys covering a given region
- Find the intersection between several surveys
- Count objects from a list of regions
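The cone search listed above can be illustrated with a minimal Python sketch. It is not the Open SkyQuery implementation: the function names `angular_separation_deg` and `cone_search`, and the in-memory list of `(id, ra, dec)` tuples in degrees, are assumptions made for the example.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cone_search(objects, ra0, dec0, radius_deg):
    """Return the (id, ra, dec) rows lying within radius_deg of (ra0, dec0)."""
    return [row for row in objects
            if angular_separation_deg(row[1], row[2], ra0, dec0) <= radius_deg]
```

This brute-force version scans every row, so it only shows what the query means; at catalog scale the scan would have to be restricted by a spatial index.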
5. The 20 Spatial Queries
- Find these 1k - 100k objects in these catalogs
- For all catalogs, extract a random sample of existing objects within a given region
- Cross-match 2 catalogs within a given region
- Cross-match n catalogs, n > 2, within a given region
- Find objects which are in A, B, and C but not in D
- Given a sparse grid, find the closest grid point for all objects in the catalog
- Find multiple detections of the same object with given magnitude variations
- Find all quasars within a region and compute their distance to surrounding galaxies
- More . . . open to discussion
6. Partitioning & Parallelization
- Zones (spatial partitioning and indexing algorithm)
- Partition and bin the data in declination zones
- ZoneID = floor((dec + 90.0) / zoneHeight)
- Some tricks required to handle spherical geometry
- Place the data close on disk
- Clustered Index on ZoneID and RA
- Fully implemented in SQL
- Efficient
- Cone searches
- Cross-Match (especially)
- Enables Parallelization
- Execute the query on a data partition
- Partition the query and execute it on the full
dataset
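The zone idea above can be sketched in a few lines of Python. The slide states that the real implementation is done fully in SQL, so this is only an illustration under assumptions: the names `zone_id`, `build_zone_index`, `ra_half_width`, and `cross_match` are hypothetical, catalogs are in-memory lists of `(id, ra, dec)` tuples in degrees, and a per-zone list sorted by RA stands in for the clustered index on ZoneID and RA. RA wraparound at 0/360 degrees is deliberately not handled; it is one of the spherical-geometry "tricks" the slide mentions.

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict

def zone_id(dec, zone_height):
    """ZoneID = floor((dec + 90.0) / zoneHeight), as on the slide."""
    return math.floor((dec + 90.0) / zone_height)

def build_zone_index(catalog, zone_height):
    """Group rows by zone and sort each zone by RA (stand-in for a clustered index)."""
    zones = defaultdict(list)
    for obj_id, ra, dec in catalog:
        zones[zone_id(dec, zone_height)].append((ra, dec, obj_id))
    for rows in zones.values():
        rows.sort()
    return zones

def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def ra_half_width(dec, radius_deg):
    """Largest RA offset a point inside the search circle can have."""
    ratio = math.sin(math.radians(radius_deg)) / max(math.cos(math.radians(dec)), 1e-12)
    return 180.0 if ratio >= 1.0 else math.degrees(math.asin(ratio))

def cross_match(cat_a, cat_b, radius_deg, zone_height):
    """Return (id_a, id_b) pairs closer than radius_deg, visiting only nearby zones."""
    index_b = build_zone_index(cat_b, zone_height)
    matches = []
    for id_a, ra, dec in cat_a:
        width = ra_half_width(dec, radius_deg)
        first = zone_id(dec - radius_deg, zone_height)
        last = zone_id(dec + radius_deg, zone_height)
        for zone in range(first, last + 1):
            rows = index_b.get(zone, [])
            # Binary search the RA-sorted zone for the candidate window.
            start = bisect_left(rows, (ra - width,))
            stop = bisect_right(rows, (ra + width, math.inf))
            for ra_b, dec_b, id_b in rows[start:stop]:
                # Exact distance test on the few remaining candidates.
                if separation_deg(ra, dec, ra_b, dec_b) <= radius_deg:
                    matches.append((id_a, id_b))
    return matches
```

The point of the sketch is the pruning: each object is compared only with rows in the zones its search circle overlaps, and within those zones only with rows inside a narrow RA window. Because each zone can be processed on its own, the same layout also lends itself to the partitioned execution described above.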
7. CasJobs, Asynchronous Data Access
- Solution to SDSS's increasing size and demand
- Astronomer's workbench
- Unlimited queries against the large SDSS databases
- Minimize data movement
- Personal database, MyDB, under the user's full control
- Full power to create tables, stored procedures, functions, load personal data, etc.
- Collaborative environment
- Easy access to prior data releases
- Job tracking system
- Accessed through a Graphical User Interface
- Accessed through a Web Services (WS) interface
- Not exclusive to SDSS or astronomy!
8. Efficient Cross-Matching
- Matching stars between 2MASS and SDSS DR5: 74 M x 54 M rows, 4.5 h instead of 2 days
- 1 degree match between SDSS DR5 and a sparse grid: 350 M x 50 k rows, 7 h instead of 1 year
- LSST simulations for alert detection: 6 M x 125 k rows, 40 s
- Pan-STARRS on-the-fly association: 1.3 billion objects x 120 million detections, 1.5 h
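To put the timings above in perspective, the speedups they imply can be worked out directly. The sketch below only does arithmetic on the figures quoted on the slide; the helper names and the 365-day year are assumptions.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * HOURS_PER_DAY  # assumption: a 365-day year

def speedup(baseline_hours, new_hours):
    """How many times faster the new run is than the baseline."""
    return baseline_hours / new_hours

def rows_per_second(rows, elapsed_hours):
    """Average throughput over the whole run."""
    return rows / (elapsed_hours * 3600)

two_mass_vs_sdss = speedup(2 * HOURS_PER_DAY, 4.5)   # 2 days down to 4.5 h
sparse_grid = speedup(1 * HOURS_PER_YEAR, 7)         # 1 year down to 7 h
panstarrs_rate = rows_per_second(120e6, 1.5)         # detections associated per second

print(f"2MASS x SDSS DR5: about {two_mass_vs_sdss:.1f}x faster")
print(f"SDSS DR5 x sparse grid: about {sparse_grid:.0f}x faster")
print(f"Pan-STARRS association: about {panstarrs_rate:.0f} detections/s")
```

Under these assumptions the first two cases come out at roughly a 10x and a 1,250x speedup, and the Pan-STARRS run averages on the order of 22,000 detections per second.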
9. Graywulf
- Date: Tue, 2 Nov 2004 14:26:37 -0800
- From: Jim Gray <gray_at_microsoft.com>
- To: Maria A. Nieto-Santisteban <nieto_at_skysrv.pha.jhu.edu>
- Subject: RE: Scaleout
- I think your cluster finding work, the loader, the sector stuff, the match stuff, ... all are examples of map-reduce.
- I would like to build a system to describe these parallel workflows and run them on a replicated database, then take the outputs and glue them together (map-reduce).
- Make sense?
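The pattern described in the message (run the same job on each partition, then glue the outputs together) can be sketched as follows. This is a toy illustration and not Graywulf: `partition_by_dec` and `map_reduce` are hypothetical names, the data is an in-memory list of `(id, ra, dec)` tuples, and a thread pool stands in for a cluster of database servers.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition_by_dec(objects, n_parts):
    """Split (id, ra, dec) rows into n_parts declination bands of equal height."""
    band = 180.0 / n_parts
    parts = [[] for _ in range(n_parts)]
    for row in objects:
        # Clamp so that dec = +90 falls into the last band.
        idx = min(int((row[2] + 90.0) / band), n_parts - 1)
        parts[idx].append(row)
    return parts

def map_reduce(partitions, map_fn, reduce_fn, initial):
    """Run map_fn on every partition in parallel, then fold the partial results."""
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(map_fn, partitions))
    return reduce(reduce_fn, partial_results, initial)
```

For example, `map_reduce(parts, len, lambda a, b: a + b, 0)` counts all objects by summing per-partition counts. Equal-height bands are used here only for simplicity; they hold very different numbers of objects on a real sky, so balanced partitions would need boundaries chosen from the data.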
10. Graywulf
[Diagram: User, SkyQuery, Graywulf, HP cluster, DB cluster]
11. Architecture
[Diagram: Cluster Manager (CLM), Workflow Manager (WFM), Performance Monitor, Application, Query Manager (QM), Web Based Interface (WBI)]
12. Open SkyQuery Next Generation
[Diagram: Cluster Manager (CLM), Workflow Manager (WFM), Performance Monitor, Linked Servers, 2MASS, SDSS, myDB, VoSpace, MatchDB1 through MatchDBn, OSQ, Query Manager (QM), DRL, Web Based Interface (WBI)]
13. Pan-STARRS
[Diagram: Cluster Manager (CLM), Workflow Manager (WFM), Performance Monitor, Linked Servers, partitions P1 through Pm (Objects_p1, Detections_p1, Meta through Objects_pm, Detections_pm, Meta), PS1 database (Objects, Meta, Detections), Query Manager (QM), DRL, Web Based Interface (WBI). Legend: database, full table, partitioned table, partitioned view]
14. Pan-STARRS Prototype in Context
15. Pan-STARRS Prototypes
SDSS includes a mirror of 11.3 < δ < 30 objects to δ < 0
- Total CSV data loaded: 300 GB
- CSV bulk insert load: 8 MB/s
- Binary bulk insert: 18-20 MB/s
- Creation started: October 15th 2007
- Finished: October 29th 2007
- Includes
- 10 epochs of single image detections (2 x 5 filters)
- 5 epochs of Stack detections (1 x 5 filters)
16. Size of PS1 Prototype Database
Table sizes are in GB
9.6 TB of data in a distributed database
17. Well-Balanced Partitions
18. VO Space @ JHU
- C# 2.0 implementation based on the new Windows Communication Foundation (WCF)
- Self-contained SQL Server 2005 backend
- VOPipe Architecture
  - Higher level services for data/work flows
  - Basis for next generation VO services such as Open SkyQuery
19. Education and Public Outreach
- Visualization tool for Open SkyQuery
- Lesson plan for high school astronomy
- Hubble Diagram with SDSS & GALEX
20. Summary
- The 20 Spatial Queries
- Partitioning & Parallelization
- 10 TB distributed DB, partitioned and well balanced
- Asynchronous Data Access
- CasJobs
- Efficient Cross-Match
- 1.3 billion x 120 million in 1.5 h
- Work in progress
- Workflow Management
- Cluster Management
- Data transport
- 20 Spatial Queries benchmark
21. Related Publications
- 20 Spatial Queries for an Astronomer's Bench(mark), M. Nieto-Santisteban, T. Scholl, A. Szalay, A. Kemper, in Proceedings of Astronomical Data Analysis Software and Systems XVII, London, UK, 23rd - 26th September 2007.
- Probabilistic Cross-Identification of Astronomical Sources, T. Budavári, A. Szalay, and M. Nieto-Santisteban, in Proceedings of Astronomical Data Analysis Software and Systems XVII, London, UK, 23rd - 26th September 2007.
- The Pan-STARRS Object Data Manager Database, J. Heasley, M. Nieto-Santisteban, A. Szalay, A. Thakar, 210th AAS Meeting, Honolulu, HI, USA, 5th - 10th May 2007.
- LSST, the Spatial Cross-Match Challenge, María A. Nieto-Santisteban, Alexander S. Szalay, Aniruddha R. Thakar, Jim Gray, in Proceedings of Astronomical Data Analysis Software and Systems XVI, Tucson, AZ, USA, 15th - 18th October 2006.
- When Database Systems Meet the Grid, María A. Nieto-Santisteban, Jim Gray, Alexander Szalay, James Annis, Aniruddha R. Thakar, William J. O'Mullane, in Proceedings of ACM CIDR 2005, Asilomar, CA, January 2005.
22. Related Publications
- Cross-matching Multiple Spatial Observations and Dealing with Missing Data, J. Gray, A. Szalay, T. Budavári, R. Lupton, M. Nieto-Santisteban, A. Thakar, Microsoft Technical Report MSR-TR-2006-175, December 2006.
- The Zones Algorithm for Finding Points-Near-a-Point or Cross-Matching Spatial Datasets, Jim Gray, María A. Nieto-Santisteban, Alexander S. Szalay, Microsoft Technical Report MSR-TR-2006-52, April 2006.
- Batch is back: CasJobs, serving multi-TB data on the Web, William O'Mullane, Nolan Li, Maria A. Nieto-Santisteban, Ani Thakar, Alexander S. Szalay, Jim Gray, in the Proceedings of the 2005 IEEE International Conference on Web Services (ICWS 2005), Orlando, FL, July 2005.
- Large-Scale Query and XMatch, Entering the Parallel Zone, María A. Nieto-Santisteban, Aniruddha R. Thakar, Alexander S. Szalay, Jim Gray, in Astronomical Data Analysis Software and Systems XV, ASP Conference Series, Vol. 351, Proceedings of the Conference Held 2-5 October 2005 in San Lorenzo de El Escorial, Spain. Edited by Carlos Gabriel, Christophe Arviset, Daniel Ponz, and Enrique Solano. San Francisco: Astronomical Society of the Pacific, 2006, p. 493.