1
Large-Scale Cross-Matching with Open SkyQuery

María Nieto-Santisteban
Ani Thakar
Alex Szalay, et al.
The Johns Hopkins University

AISRP 2008 @ College Park, University of Maryland
2
Goals

Implement an access and cross-matching engine
that facilitates access to large digital archives
and enables new scientific discoveries by cross
correlating multi-wavelength datasets

3
Handling the Large-Scale

The 20 Spatial Queries
Partitioning & Parallelization
Asynchronous Data Access
Efficient Cross-Match
Workflow Management
Cluster Management
Data transport

4
The 20 Spatial Queries

Single/Multi Catalog Regions
Cone search: find objects within a circle
Find objects within a circle satisfying highly
multi-dimensional constraints
Find the closest neighbor
Find objects within a region
Find objects in/outside masked regions
Find objects near the edges of a region
Compute the area of a region
Find surveys covering a given region
Find the intersection between several surveys
Count objects from a list of regions
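
The first few queries in this list (cone search, closest neighbor) can be sketched in a few lines. The following is a brute-force illustration only, not the engine described in the talk: the function names, the dictionary layout of the catalog rows, and the sample coordinates are all assumptions made for the example.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) points,
    computed with the haversine formula. Inputs are in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return every object within radius_deg of (ra, dec)."""
    return [obj for obj in catalog
            if angular_separation_deg(ra, dec, obj["ra"], obj["dec"]) <= radius_deg]


def closest_neighbor(catalog, ra, dec):
    """Return the single object nearest to (ra, dec)."""
    return min(catalog,
               key=lambda obj: angular_separation_deg(ra, dec, obj["ra"], obj["dec"]))


# Hypothetical three-object catalog, coordinates in degrees.
catalog = [
    {"id": 1, "ra": 10.00, "dec": 20.00},
    {"id": 2, "ra": 10.05, "dec": 20.02},
    {"id": 3, "ra": 190.0, "dec": -5.0},
]
hits = cone_search(catalog, 10.0, 20.0, 0.1)
nearest = closest_neighbor(catalog, 190.1, -5.0)
```

A full scan like this touches every row for every query, which is the cost that spatial partitioning and indexing are meant to avoid.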

5
The 20 Spatial Queries

Find these 1k - 100k objects in these catalogs
For all catalogs, extract a random sample of
existing objects within a given region
Cross-match 2 catalogs within a given region
Cross-match n catalogs, n > 2, within a given
region
Find objects which are in A, B, and C but not in
D
Given a sparse grid, find the closest grid point
for all objects in the catalog
Find multiple detections of the same object with
given magnitude variations
Find all quasars within a region and compute
their distance to surrounding galaxies
more . . . open to discussion

6
Partitioning & Parallelization

Zones (spatial partitioning and indexing
algorithm)
Partition and bin the data in declination zones
ZoneID = floor((dec + 90.0) / zoneHeight)
Some tricks required to handle spherical geometry
Place the data close on disk
Clustered Index on ZoneID and RA
Fully implemented in SQL
Efficient
Cone searches
Cross-Match (especially)
Enables Parallelization
Execute the query on a data partition
Partition the query and execute it on the full
dataset
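
As a rough illustration of the zone idea on this slide, the sketch below bins objects by ZoneID = floor((dec + 90) / zoneHeight), keeps each zone sorted by RA (standing in for the clustered index on ZoneID and RA), and uses that layout for cone searches and a cross-match. The slide states that the real implementation is written entirely in SQL; this Python version is only a model of the logic. The zone height, function names, tuple layout, and sample coordinates are assumptions, and the 1/cos(dec) widening of the RA window is one common way to handle the spherical geometry, not necessarily the authors' exact trick.

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict

ZONE_HEIGHT = 0.5  # degrees; illustrative value, not taken from the slides


def zone_id(dec, zone_height=ZONE_HEIGHT):
    """Declination zone number: floor((dec + 90) / zoneHeight)."""
    return math.floor((dec + 90.0) / zone_height)


def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def build_zone_index(catalog, zone_height=ZONE_HEIGHT):
    """Bin (id, ra, dec) rows by zone and sort each zone by RA,
    mimicking a clustered index on (ZoneID, RA)."""
    zones = defaultdict(list)
    for obj_id, ra, dec in catalog:
        zones[zone_id(dec, zone_height)].append((ra, dec, obj_id))
    for rows in zones.values():
        rows.sort()
    return zones


def ra_windows(ra, dec, radius):
    """RA intervals that may contain matches: widened by 1/cos(dec) and
    split in two when the window wraps around RA = 0/360."""
    cos_dec = math.cos(math.radians(min(90.0, abs(dec) + radius)))
    pad = radius / cos_dec if cos_dec > 1e-9 else 180.0
    if pad >= 180.0:
        return [(0.0, 360.0)]
    lo, hi = ra - pad, ra + pad
    if lo < 0.0:
        return [(lo + 360.0, 360.0), (0.0, hi)]
    if hi > 360.0:
        return [(lo, 360.0), (0.0, hi - 360.0)]
    return [(lo, hi)]


def zone_cone_search(zones, ra, dec, radius, zone_height=ZONE_HEIGHT):
    """Ids of indexed objects within radius degrees of (ra, dec)."""
    hits = []
    z_min = zone_id(max(dec - radius, -90.0), zone_height)
    z_max = zone_id(min(dec + radius, 90.0), zone_height)
    for z in range(z_min, z_max + 1):
        rows = zones.get(z, [])
        for lo, hi in ra_windows(ra, dec, radius):
            start = bisect_left(rows, (lo,))
            stop = bisect_right(rows, (hi, math.inf))
            # Exact distance test on the few candidates in the window.
            for cand_ra, cand_dec, cand_id in rows[start:stop]:
                if separation_deg(ra, dec, cand_ra, cand_dec) <= radius:
                    hits.append(cand_id)
    return hits


def cross_match(cat_a, zones_b, radius, zone_height=ZONE_HEIGHT):
    """Pairs (id_a, id_b) of objects closer than radius degrees."""
    return [(id_a, id_b)
            for id_a, ra, dec in cat_a
            for id_b in zone_cone_search(zones_b, ra, dec, radius, zone_height)]


# Hypothetical catalogs of (id, ra, dec) in degrees.
cat_b = [(101, 10.0, 20.0), (102, 359.999, 0.0), (103, 50.0, -30.0)]
cat_a = [(1, 10.0005, 20.0003), (2, 0.0005, 0.0), (3, 200.0, 45.0)]
pairs = cross_match(cat_a, build_zone_index(cat_b), radius=0.002)
```

Because every zone is an independent slice of the sky, the same code can be run per zone range on separate servers, which is the parallelization opportunity the slide points to.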

7
CasJobs, Asynchronous Data Access

Solution to the SDSS increasing size and demand
Astronomer's workbench
Unlimited queries against the large SDSS
databases
Minimize data movement
Personal database, MyDB, under the user's full
control
Full power to create tables, stored procedures,
functions, load personal data, etc.
Collaborative environment
Easy access to prior data releases
Job tracking system
Accessed through a Graphical User Interface
Accessed through a web service (WS) interface
Not exclusive of SDSS nor Astronomy!

8
Efficient Cross-Matching
Matching stars between 2MASS and SDSS DR5: 74 M x 54 M rows, 4.5 h instead of 2 days
1 degree match between SDSS DR5 and a sparse grid: 350 M x 50 k rows, 7 h instead of 1 year
LSST simulations for alert detection: 6 M x 125 k rows, 40 s
Pan-STARRS on-the-fly association: 1.3 billion objects x 120 million detections, 1.5 h
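
To put the two "instead of" figures in perspective, the arithmetic below converts them into speedup factors and counts the pairs a naive all-against-all comparison of the 2MASS and SDSS DR5 tables would have to test. It assumes a 365-day year and treats the quoted times as exact; the helper name is made up for the example.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: the slide only says "1 year"


def speedup(before_hours, after_hours):
    """How many times faster the new run is than the old one."""
    return before_hours / after_hours


# 2MASS x SDSS DR5: 4.5 h instead of 2 days.
twomass_sdss = speedup(2 * HOURS_PER_DAY, 4.5)

# SDSS DR5 x sparse grid: 7 h instead of 1 year.
sparse_grid = speedup(DAYS_PER_YEAR * HOURS_PER_DAY, 7)

# Pairs a naive nested-loop comparison of 74 M x 54 M rows would test.
naive_pairs = 74_000_000 * 54_000_000

print(f"2MASS x SDSS DR5: about {twomass_sdss:.1f}x faster")
print(f"SDSS DR5 x sparse grid: about {sparse_grid:.0f}x faster")
print(f"Naive pair count for 74 M x 54 M rows: {naive_pairs:.3e}")
```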
9
Graywulf

Date: Tue, 2 Nov 2004 14:26:37 -0800
From: Jim Gray <gray@microsoft.com>
To: Maria A. Nieto-Santisteban <nieto@skysrv.pha.jhu.edu>
Subject: RE: Scaleout
I think your cluster finding work, the loader,
the sector stuff, the
match stuff, ... all are examples of map-reduce.
I would like to build a system to describe these
parallel workflows and run them on a replicated
database, then take the outputs and glue them
together (map-reduce).
Make sense?
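
The pattern described in the message (run the same step on each data partition, then glue the outputs together) can be sketched as follows. This is a toy model under stated assumptions, not Graywulf itself: threads stand in for database servers, the row layout (id, ra, dec, mag), the magnitude cut, and all function names are invented for the example.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def partition_by_zone(rows, zone_height, n_partitions):
    """Split (id, ra, dec, mag) rows into contiguous declination bands."""
    n_zones = math.ceil(180.0 / zone_height)
    zones_per_part = math.ceil(n_zones / n_partitions)
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        # Clamp dec = +90 into the last zone so it stays in range.
        z = min(math.floor((row[2] + 90.0) / zone_height), n_zones - 1)
        parts[z // zones_per_part].append(row)
    return parts


def run_map_reduce(partitions, map_fn, reduce_fn, max_workers=4):
    """Apply map_fn to every partition in parallel, then combine."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(map_fn, partitions))
    return reduce_fn(outputs)


def select_bright(partition, mag_limit=20.0):
    """Map step: ids of objects brighter than mag_limit in one partition."""
    return [obj_id for obj_id, _ra, _dec, mag in partition if mag < mag_limit]


def glue(outputs):
    """Reduce step: concatenate the per-partition results."""
    return sorted(chain.from_iterable(outputs))


# Hypothetical rows: (id, ra, dec, mag).
rows = [
    (1, 10.0, -80.0, 18.5),
    (2, 20.0, -10.0, 21.0),
    (3, 30.0, 5.0, 19.9),
    (4, 40.0, 85.0, 22.3),
    (5, 50.0, 90.0, 17.0),
]
partitions = partition_by_zone(rows, zone_height=0.5, n_partitions=4)
bright_ids = run_map_reduce(partitions, select_bright, glue)
```

The map function never looks outside its own partition, so the same workflow description could in principle be dispatched to one database server per partition.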

10
Graywulf
[Diagram with labels: User, SkyQuery, Graywulf, HP cluster, DB cluster]
11
Architecture
Cluster Manager (CLM)
Workflow Manager (WFM)
Perf. Monitor
Application
Query Manager (QM)
Web Based Interface (WBI)
12
Open SkyQuery Next Generation
[Architecture diagram with labels: Cluster Manager (CLM), Workflow Manager (WFM), Perf. Monitor, Linked Servers, 2MASS, SDSS, myDB, VoSpace, MatchDB1 to MatchDBn, OSQ, Query Manager (QM), DRL, Web Based Interface (WBI)]
13
Pan-STARRS
[Architecture diagram with labels: Cluster Manager (CLM), Workflow Manager (WFM), Performance Monitor, Linked Servers, P1 (Objects_p1, Detections_p1, Meta) to Pm (Objects_pm, Detections_pm, Meta), PS1 (Objects, Meta, Detections), PS1 database, Query Manager (QM), DRL, Web Based Interface (WBI)]
[Legend: Database, Full table, Partitioned table, Partitioned View]
14
Pan-STARRS Prototype in Context
15
Pan-STARRS Prototypes
SDSS includes a mirror of 11.3 < δ < 30
objects to δ < 0

Total CSV data loaded: 300 GB
CSV bulk insert load: 8 MB/s
Binary bulk insert: 18-20 MB/s
Creation started: October 15th 2007
Finished: October 29th 2007
Includes:
10 epochs of single image detections (2 x 5 filters)
5 epochs of stack detections (1 x 5 filters)
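
As a back-of-the-envelope check on the load rates quoted on this slide, the snippet below estimates how long a raw transfer of 300 GB would take at each rate. It assumes 1 GB = 1000 MB and a sustained rate, and it covers insert throughput only, not the other steps involved in building the database; the function name is hypothetical.

```python
MB_PER_GB = 1000  # assumption: decimal units


def load_hours(size_gb, rate_mb_per_s):
    """Hours needed to move size_gb at a sustained rate_mb_per_s."""
    seconds = size_gb * MB_PER_GB / rate_mb_per_s
    return seconds / 3600


csv_hours = load_hours(300, 8)           # CSV bulk insert at 8 MB/s
binary_slow_hours = load_hours(300, 18)  # binary bulk insert, lower bound
binary_fast_hours = load_hours(300, 20)  # binary bulk insert, upper bound

print(f"CSV bulk insert: about {csv_hours:.1f} h")
print(f"Binary bulk insert: about {binary_fast_hours:.1f} to {binary_slow_hours:.1f} h")
```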

16
Size of PS1 Prototype Database
Table sizes are in GB
9.6 TB of data in a distributed database
17
Well-Balanced Partitions
18
VO Space @ JHU

- C# 2.0 implementation based on the new Windows
Communication Foundation (WCF)
- Self-contained SQL Server 2005 backend
VOPipe Architecture
- Higher level services for data/work flows
- Basis for next generation VO services such as
Open SkyQuery

19
Education and Public Outreach

Visualization tool for Open SkyQuery
Lesson plan for high school astronomy
Hubble Diagram with SDSS and GALEX

20
Summary

The 20 Spatial Queries
Partitioning & Parallelization
10 TB distributed DB with well-balanced partitions
Asynchronous Data Access
CasJobs
Efficient Cross-Match
1.3 billion x 120 million in 1.5 h
Work in progress
Workflow Management
Cluster Management
Data transport
20 Spatial Queries benchmark
