Title: Large Scale On-Demand Cross-Match
1- Large Scale On-Demand Cross-Match
- with Open SkyQuery
2- Need for Large Xmatch
- Science case for large xmatch
- Holy grail of multi-wavelength astronomy
- Compare data from multiple NASA and ground-based archives
- Federation/xmatch fundamental to NVO mission
- Why on-demand?
- Large fraction of data outside of data centers
- This will get larger in the future
- Time for data to make it to data center already prohibitive
- Iterative xmatch refinement
- Re-try until you get it right
- Specify additional constraints from source catalogs
- Often use uploaded/private datasets
3- Examples
- SDSS DR5 vs 2MASS
- 2.5 TB (200M obj) vs 200 GB (400M obj)
- Full or large xmatch takes days at the moment
- Typically it takes 6 months to 1 year for data to be published, even longer to distribute to mirrors
- If you think SDSS is bad, wait till LSST gets here (2013)!
- Large Synoptic Survey Telescope (lsst.org)
- One SDSS every 3-4 nights!
- Petabytes of data, all public
- Fast xmatch needed both in pipeline and science
use cases
4- Open SkyQuery
Open SkyQuery Architecture
- Federated query/cross-match for Virtual Observatory (AISR project)
- Distributed Web services architecture
- Portal routes queries to one or more SkyNodes
- Performance query gets counts from each node
- Portal prepares ExecPlan with nodes in ascending order of counts
- Probabilistic (fuzzy) join at each node with results from previous node
- Non-matching sources drop out at each node
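The planning and cascading steps above can be sketched in a few lines of Python. The `Node` class, its `count`/`xmatch` methods, the 3.5-arcsec default radius, and the flat-sky distance are illustrative assumptions, not the actual SkyNode Web service interface.

```python
from math import hypot

class Node:
    """Toy stand-in for a SkyNode holding a small (ra, dec) catalog."""
    def __init__(self, name, catalog):
        self.name = name
        self.catalog = catalog          # list of (ra, dec) in degrees

    def count(self):
        # Stand-in for the performance query that returns row counts.
        return len(self.catalog)

    def xmatch(self, sources, radius_deg):
        # Keep only sources with a counterpart within the match radius
        # (flat-sky distance here; real nodes use spherical geometry).
        if sources is None:             # first node seeds the result set
            return list(self.catalog)
        return [s for s in sources
                if any(hypot(s[0] - c[0], s[1] - c[1]) <= radius_deg
                       for c in self.catalog)]

def run_execplan(nodes, radius_deg=3.5 / 3600):
    # ExecPlan: visit nodes in ascending order of counts, so the smallest
    # candidate list flows through the chain first.
    plan = sorted(nodes, key=Node.count)
    results = None
    for node in plan:
        results = node.xmatch(results, radius_deg)
        if not results:
            break                       # every source dropped out
    return results
```

Ordering by ascending count keeps the intermediate result sets small, which is the whole point of the portal's ExecPlan step.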
5- Open SkyQuery
- ADQL Query
- Basic SQL
- REGION
- XMATCH
- Limited XMatch
- 5k row limit for each SkyNode
- Avoids long xmatches
- Large-Scale xmatch is not possible today!
SELECT o.objId, o.ra, o.dec, t.ra, t.dec, t.objId, o.type
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t, USNOB:PhotoPrimary p
WHERE XMATCH(o, t, !p) < 3.5
AND Region('CIRCLE J2000 182.5 -0.89 8')
AND o.type = 3
6- How to get to large xmatch
- 1. Data partitioning
- Parallel data access for individual SkyNodes
- Declination zone-based spatial partitioning
- 2. Asynchronous workflow
- Handle long xmatch jobs in batch mode
- Browser clients cannot render large outputs
- Send output to VO-store for user to pick up
- 3. Fast data transport
- Speed up data exchange between SkyNodes
- 4. Xmatch algorithm optimizations
- Pipelining, caching etc.
7- 1. Data Partitioning with Zones
- High Speed Data Access (AISR project)
- Two-Step Process
- 1. Distribute data homogeneously among servers.
- Each server has roughly the same number of objects.
- Objects inside servers are spatially related.
- Balances the workload among servers.
- Queries redirected to the server holding the data.
- 2. (Re)Define zones inside each server dynamically.
- Zones are defined according to some search radius to solve specific problems
- Finding galaxy clusters, gravitational lenses, etc.
- Facilitates cross-match queries from other NVO data nodes.
8- Mapping the Sphere into Zones
- Each Zone is a declination stripe of height h.
- In principle, h can be any number
- In practice, 30 arcsec
- South-pole zone Zone 0.
- Each object belongs to one Zone
- ZoneID = floor((dec + 90) / h)
- Each server holds N contiguous Zones.
- N is determined by # of objects each zone contains and # of servers in cluster.
- Not all servers contain the same number of zones.
- Not all servers cover the same declination range.
- Straightforward mapping between queries and
servers
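The zone mapping above can be written down directly; `assign_servers` is an illustrative greedy split into contiguous blocks of roughly equal object counts, not the deployed balancing algorithm.

```python
from math import floor

H_DEG = 30.0 / 3600.0                   # zone height h: 30 arcsec in degrees

def zone_id(dec_deg, h=H_DEG):
    # Zone 0 is the south-pole zone; dec runs from -90 to +90 degrees.
    return floor((dec_deg + 90.0) / h)

def assign_servers(zone_counts, n_servers):
    # Greedy sketch: each server gets a contiguous run of zones, cut
    # whenever the accumulated object count reaches the per-server
    # target. Servers hold contiguous zones but not necessarily the
    # same number of them, matching the description above.
    target = sum(zone_counts) / n_servers
    servers, block, acc = [], [], 0
    for z, count in enumerate(zone_counts):
        block.append(z)
        acc += count
        if acc >= target and len(servers) < n_servers - 1:
            servers.append(block)
            block, acc = [], 0
    servers.append(block)
    return servers
```

Because the zone-to-server mapping is a simple arithmetic rule plus a lookup table, a query's declination range maps straightforwardly to the servers that must answer it.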
9- Partitioning Benchmark
- MaxBCG (Annis et al. 2002)
- Find Brightest Cluster Galaxy (BCG)
- First find cluster candidates
- Then find brightest galaxy in each
- Compare file-based TAM/Grid implementation with SQL
SQL Server vs TAM on entire SDSS DR2 (3326 deg², 1 TB):
- SQL Server: 10 servers (dual 2.6 GHz); 0.5 buffer and z-steps of 0.001: 44 hrs
- TAM: 10 servers (600 MHz PIII); 0.5 buffer and z-steps of 0.01: 9326 hrs (divide by 4 to adjust for CPU differences: 2309 hrs)
- SQL Server ~50x faster
10- SDSS DR3 vs 2MASS Xmatch
- CPU performance
- Highly scalable
- Near linear
- I/O performance
- Also scalable
- Near linear
11- Recent Results with Zones
- SDSS DR5 vs 2MASS xmatch
- 74M (2MASS) vs 54M (SDSS DR5) rows
- All the matching stars in both surveys
- 4.5 hours instead of 2 days
- Most of this time (3.5 hours) needed to prepare data
- Even Michael Way is impressed!
- Preparation of data for zone xmatch
- Separate zone table or not
- How much other information in zone table
- Can really save a lot of xmatch time if done in advance
- Classic space vs time tradeoff
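As a rough illustration of the space-vs-time tradeoff above, the toy sketch below separates the do-once zone-table build (the expensive preparation step) from the repeatable match step. All names, the 30-arcsec zone height, and the flat-sky distance are assumptions for illustration only.

```python
from collections import defaultdict
from math import floor, hypot

H = 30.0 / 3600.0                       # assumed zone height in degrees

def build_zone_table(catalog, h=H):
    # Preparation: bucket (ra, dec) rows by declination zone. Doing
    # this once and storing it is what buys the later speedup.
    zones = defaultdict(list)
    for ra, dec in catalog:
        zones[floor((dec + 90.0) / h)].append((ra, dec))
    return zones

def zone_xmatch(cat_a, zones_b, radius, h=H):
    # Match step: compare each object only against the few zones of the
    # other catalog that the match radius can actually reach.
    dz = int(radius // h) + 1           # zones spanned by the radius
    matches = []
    for ra, dec in cat_a:
        z = floor((dec + 90.0) / h)
        for zb in range(z - dz, z + dz + 1):
            for rb, db in zones_b.get(zb, ()):
                # flat-sky distance; real zone code uses spherical terms
                if hypot(ra - rb, dec - db) <= radius:
                    matches.append(((ra, dec), (rb, db)))
    return matches
```

Building `zones_b` once and reusing it across repeated xmatch runs is the "done in advance" saving the slide refers to.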
12- 2. Asynchronous Workflow
- Start with CasJobs model
- Use MyDB interface and batch system
- Tied to SQL Server at the moment, but not SDSS
- Already deployed for GALEX archive at STScI, PQ at Caltech
- Xmatch queries can invoke sync or async mode
- Next make it distributed
- Single-sign on security for VO users
- Distributed asynchronous workflow
- Route intermediate results from SkyNodes to distributed VOStore
- Send final results to VOStore with link to user
13- CasJobs and MyDB
- Batch Query Workbench for SDSS CAS
- Developed with SDSS and AISR support
- Queries are queued for batch execution
- Load balancing queues on multiple servers
- Limit of 2 simultaneous queries per server
- Synchronous mode: short (1 minute) queue
- Asynchronous mode: batch (8 hour) queue(s)
- MyDB personal database
- 1 GB (more on demand) SQL DB for each user
- Long queries write to MyDB table by default
- User can extract output (download) when ready
- Share MyDB tables with others via groups
14- MyDB Features
- Tables
- Views
- Functions
- Procedures
View data or query, Create, Drop, Download, Publish, Rename, Plot
15- Job Management
- Asynchronous
- Separate query and output jobs
- Procrastinator picks up jobs
- BatchAdmin DB keeps track of jobs, servers, queues and user privileges
- Job History page allows user to monitor, cancel, and resubmit jobs
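The queue-and-poller model described on the last three slides can be sketched as below. The queue names, the `Job` shape, and the in-memory list standing in for the BatchAdmin DB are assumptions, not the CasJobs implementation.

```python
import queue

class Job:
    """Minimal job record: who submitted what, and its current state."""
    def __init__(self, user, sql):
        self.user, self.sql, self.status = user, sql, "queued"

class BatchSystem:
    def __init__(self):
        # "short" ~ the 1-minute sync queue, "batch" ~ the 8-hour queue
        self.queues = {"short": queue.Queue(), "batch": queue.Queue()}
        self.history = []               # stand-in for the BatchAdmin DB

    def submit(self, job, mode="batch"):
        self.queues[mode].put(job)
        self.history.append(job)        # recorded for the Job History page

    def procrastinator_tick(self, run):
        # One polling pass of the "Procrastinator": drain the short
        # queue first, then the batch queue, updating status as it goes.
        for name in ("short", "batch"):
            q = self.queues[name]
            while not q.empty():
                job = q.get()
                job.status = "running"
                run(job)                # execute the query job
                job.status = "finished"
```

Keeping the job record (rather than the job output) in a tracking store is what lets the history page monitor, cancel, and resubmit jobs independently of where results land.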
16- 3. Mega-streaming data transport
- UDT network transfer protocol
- UDP-based Data Transfer (packet switching)
- Developed by UIC/NCDM (Grossman et al.)
- Improves on TCP, achieves much better throughput
- Can wire SDSS DR5 (2 TB) 10-20x faster
- DR5 to Asia in a few hours using UDT and SECTOR
- http://sdss.ncdm.uic.edu
- Preliminary tests with SkyNodes promising
- Geographically distributed SkyNodes would really benefit from fast data transport
- Currently large xmatch only feasible for local SkyNodes
- SOAP protocol from NCDM
- Enhanced SOAP clients and servers compatible with UDT
- Teraflows using high-performance Web services
17- 4. XMatch Algorithm
- Pipelining
- Send partial results upstream in chunks
- Stream aggregation (similar to caching)
- Take advantage of replication of data in large xmatches
- Combine streams with common data
- Optimize fuzzy join
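The pipelining idea above can be sketched as follows, with each stage standing in for one SkyNode's fuzzy-join step; the chunk size and list-based streams are assumptions made for illustration.

```python
def chunked(rows, size):
    # Yield successive fixed-size chunks of a partial result set.
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def pipeline(rows, stages, size=10000):
    # Each stage receives a chunk and returns the subset that matched
    # at that node. Downstream stages start work per chunk instead of
    # waiting for the full table, which is the pipelining win.
    out = []
    for chunk in chunked(rows, size):
        for stage in stages:
            chunk = stage(chunk)
            if not chunk:
                break               # everything in this chunk dropped out
        out.extend(chunk)
    return out
```

Stream aggregation would slot in here as a cache keyed on common chunk contents, so repeated data shared across large xmatches is matched once rather than per request.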
18- Conclusions
- Large scale on-demand cross-match is critical for NVO success and widespread acceptance
- Even more important for upcoming mega-surveys like Pan-STARRS, LSST
- Can't be done currently with Open SkyQuery
- Need data partitioning, asynchronous workflow and fast data transport to get us to the promised land
- Techniques applicable to other domains