Title: Large Scale On-Demand Cross-Match
1- Large Scale On-Demand Cross-Match
- with Open SkyQuery
2- Need for Large Xmatch
- Science case for large xmatch
- Holy grail of multi-wavelength astronomy
- Compare data from multiple NASA and ground-based archives
- Federation/xmatch fundamental to NVO mission
- Why on-demand?
- Large fraction of data outside of data centers
- This will get larger in the future
- Time for data to make it to data center already prohibitive
- Iterative xmatch refinement
- Re-try until you get it right
- Specify additional constraints from source catalogs
- Often use uploaded/private datasets
3- Examples
- SDSS DR5 vs 2MASS
- 2.5 TB (200M obj) vs 200 GB (400M obj)
- Full or large xmatch takes days at the moment
- Typically it takes 6 months to 1 year for data to be published, even longer to distribute to mirrors
- If you think SDSS is bad, wait till LSST gets here (2013)!
- Large Synoptic Survey Telescope (lsst.org)
- One SDSS every 3-4 nights!
- Petabytes of data, all public
- Fast xmatch needed both in pipeline and science
use cases
4- Open SkyQuery
Open SkyQuery Architecture
- Federated query/cross-match for Virtual Observatory (AISR project)
- Distributed Web services architecture
- Portal routes queries to one or more SkyNodes
- Performance query gets counts from each node
- Portal prepares ExecPlan with nodes in ascending order of counts
- Probabilistic (fuzzy) join at each node with results from previous node
- Non-matching sources drop out at each node
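The planning and cascading steps above can be sketched in a few lines of Python. The `Node` class, its `count`/`xmatch` methods, the 3.5-arcsec default radius, and the flat-sky distance are illustrative assumptions, not the actual SkyNode Web service interface.

```python
from math import hypot

class Node:
    """Toy stand-in for a SkyNode holding a small (ra, dec) catalog."""
    def __init__(self, name, catalog):
        self.name = name
        self.catalog = catalog          # list of (ra, dec) in degrees

    def count(self):
        # Stand-in for the performance query that returns row counts.
        return len(self.catalog)

    def xmatch(self, sources, radius_deg):
        # Keep only sources with a counterpart within the match radius
        # (flat-sky distance here; real nodes use spherical geometry).
        if sources is None:             # first node seeds the result set
            return list(self.catalog)
        return [s for s in sources
                if any(hypot(s[0] - c[0], s[1] - c[1]) <= radius_deg
                       for c in self.catalog)]

def run_execplan(nodes, radius_deg=3.5 / 3600):
    # ExecPlan: visit nodes in ascending order of counts, so the smallest
    # candidate list flows through the chain first.
    plan = sorted(nodes, key=Node.count)
    results = None
    for node in plan:
        results = node.xmatch(results, radius_deg)
        if not results:
            break                       # every source dropped out
    return results
```

Ordering by ascending count keeps the intermediate result sets small, which is the whole point of the portal's ExecPlan step.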
5- Open SkyQuery
- ADQL Query
- Basic SQL
- REGION
- XMATCH
- Limited XMatch
- 5k row limit for each SkyNode
- Avoids long xmatches
- Large-Scale xmatch is not possible today!
SELECT o.objId, o.ra, o.dec, t.ra, t.dec, t.objId, o.type
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t, USNOB:PhotoPrimary p
WHERE XMATCH(o, t, !p) < 3.5
AND Region('CIRCLE J2000 182.5 -0.89 8')
AND o.type = 3
6- How to get to large xmatch
- 1. Data partitioning
- Parallel data access for individual SkyNodes
- Declination zone-based spatial partitioning
- 2. Asynchronous workflow
- Handle long xmatch jobs in batch mode
- Browser clients cannot render large outputs
- Send output to VO-store for user to pick up
- 3. Fast data transport
- Speed up data exchange between SkyNodes
- 4. Xmatch algorithm optimizations
- Pipelining, caching etc.
7- 1. Data Partitioning with Zones
- High Speed Data Access (AISR project)
- Two-Step Process
- 1. Distribute data homogeneously among servers.
- Each server has roughly the same number of objects.
- Objects inside servers are spatially related.
- Balances the workload among servers.
- Queries redirected to the server holding the data.
- 2. (Re)Define zones inside each server dynamically.
- Zones are defined according to some search radius to solve specific problems
- Finding galaxy clusters, gravitational lenses, etc.
- Facilitates cross-match queries from other NVO data nodes.
8- Mapping the Sphere into Zones
- Each Zone is a declination stripe of height h.
- In principle, h can be any number
- In practice, 30 arcsec
- South-pole zone Zone 0.
- Each object belongs to one Zone
- ZoneID = floor((dec + 90) / h)
- Each server holds N contiguous Zones.
- N is determined by # of objects each zone contains and # of servers in cluster.
- Not all servers contain the same number of zones.
- Not all servers cover the same declination range.
- Straightforward mapping between queries and
servers
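The zone mapping above can be written down directly; `assign_servers` is an illustrative greedy split into contiguous blocks of roughly equal object counts, not the deployed balancing algorithm.

```python
from math import floor

H_DEG = 30.0 / 3600.0                   # zone height h: 30 arcsec in degrees

def zone_id(dec_deg, h=H_DEG):
    # Zone 0 is the south-pole zone; dec runs from -90 to +90 degrees.
    return floor((dec_deg + 90.0) / h)

def assign_servers(zone_counts, n_servers):
    # Greedy sketch: each server gets a contiguous run of zones, cut
    # whenever the accumulated object count reaches the per-server
    # target. Servers hold contiguous zones but not necessarily the
    # same number of them, matching the description above.
    target = sum(zone_counts) / n_servers
    servers, block, acc = [], [], 0
    for z, count in enumerate(zone_counts):
        block.append(z)
        acc += count
        if acc >= target and len(servers) < n_servers - 1:
            servers.append(block)
            block, acc = [], 0
    servers.append(block)
    return servers
```

Because the zone-to-server mapping is a simple arithmetic rule plus a lookup table, a query's declination range maps straightforwardly to the servers that must answer it.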
9- Partitioning Benchmark
- MaxBCG (Annis et al. 2002)
- Find Brightest Cluster Galaxy (BCG)
- First find cluster candidates
- Then find brightest galaxy in each
- Compare file-based TAM/Grid implementation with SQL
SQL Server vs TAM on entire SDSS DR2 (3326 deg², 1 TB):
- SQL Server: 10 servers (dual 2.6 GHz); 0.5 buffer and z-steps of 0.001: 44 hrs
- TAM: 10 servers (600 MHz PIII); 0.5 buffer and z-steps of 0.01: 9326 hrs (divide by 4 to adjust for CPU differences: 2309 hrs)
- SQL Server ~50x faster
10- SDSS DR3 vs 2MASS Xmatch
- CPU performance
- Highly scalable
- Near linear
- I/O performance
- Also scalable
- Near linear
11- Recent Results with Zones
- SDSS DR5 vs 2MASS xmatch
- 74M (2MASS) vs 54M (SDSS DR5) rows
- All the matching stars in both surveys
- 4.5 hours instead of 2 days
- Most of this time (3.5 hours) needed to prepare data
- Even Michael Way is impressed!
- Preparation of data for zone xmatch
- Separate zone table or not
- How much other information in zone table
- Can really save a lot of xmatch time if done in advance
- Classic space vs time tradeoff
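As a rough illustration of the space-vs-time tradeoff above, the toy sketch below separates the do-once zone-table build (the expensive preparation step) from the repeatable match step. All names, the 30-arcsec zone height, and the flat-sky distance are assumptions for illustration only.

```python
from collections import defaultdict
from math import floor, hypot

H = 30.0 / 3600.0                       # assumed zone height in degrees

def build_zone_table(catalog, h=H):
    # Preparation: bucket (ra, dec) rows by declination zone. Doing
    # this once and storing it is what buys the later speedup.
    zones = defaultdict(list)
    for ra, dec in catalog:
        zones[floor((dec + 90.0) / h)].append((ra, dec))
    return zones

def zone_xmatch(cat_a, zones_b, radius, h=H):
    # Match step: compare each object only against the few zones of the
    # other catalog that the match radius can actually reach.
    dz = int(radius // h) + 1           # zones spanned by the radius
    matches = []
    for ra, dec in cat_a:
        z = floor((dec + 90.0) / h)
        for zb in range(z - dz, z + dz + 1):
            for rb, db in zones_b.get(zb, ()):
                # flat-sky distance; real zone code uses spherical terms
                if hypot(ra - rb, dec - db) <= radius:
                    matches.append(((ra, dec), (rb, db)))
    return matches
```

Building `zones_b` once and reusing it across repeated xmatch runs is the "done in advance" saving the slide refers to.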
12- 2. Asynchronous Workflow
- Start with CasJobs model
- Use MyDB interface and batch system
- Tied to SQL Server at the moment, but not SDSS
- Already deployed for GALEX archive at STScI, PQ at Caltech
- Xmatch queries can invoke sync or async mode
- Next make it distributed
- Single-sign on security for VO users
- Distributed asynchronous workflow
- Route intermediate results from SkyNodes to distributed VOStore
- Send final results to VOStore with link to user
13- CasJobs and MyDB
- Batch Query Workbench for SDSS CAS
- Developed with SDSS and AISR support
- Queries are queued for batch execution
- Load balancing queues on multiple servers
- Limit of 2 simultaneous queries per server
- Synchronous mode: short (1 minute) queue
- Asynchronous mode: batch (8 hour) queue(s)
- MyDB personal database
- 1 GB (more on demand) SQL DB for each user
- Long queries write to MyDB table by default
- User can extract output (download) when ready
- Share MyDB tables with others via groups
14- MyDB Features
- Tables
- Views
- Functions
- Procedures
View data or query, Create, Drop, Download, Publish, Rename, Plot
15- Job Management
- Asynchronous
- Separate query and output jobs
- Procrastinator picks up jobs
- BatchAdmin DB keeps track of jobs, servers, queues and user privileges
- Job History page allows user to monitor, cancel, and resubmit jobs
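The queue-and-poller model described on the last three slides can be sketched as below. The queue names, the `Job` shape, and the in-memory list standing in for the BatchAdmin DB are assumptions, not the CasJobs implementation.

```python
import queue

class Job:
    """Minimal job record: who submitted what, and its current state."""
    def __init__(self, user, sql):
        self.user, self.sql, self.status = user, sql, "queued"

class BatchSystem:
    def __init__(self):
        # "short" ~ the 1-minute sync queue, "batch" ~ the 8-hour queue
        self.queues = {"short": queue.Queue(), "batch": queue.Queue()}
        self.history = []               # stand-in for the BatchAdmin DB

    def submit(self, job, mode="batch"):
        self.queues[mode].put(job)
        self.history.append(job)        # recorded for the Job History page

    def procrastinator_tick(self, run):
        # One polling pass of the "Procrastinator": drain the short
        # queue first, then the batch queue, updating status as it goes.
        for name in ("short", "batch"):
            q = self.queues[name]
            while not q.empty():
                job = q.get()
                job.status = "running"
                run(job)                # execute the query job
                job.status = "finished"
```

Keeping the job record (rather than the job output) in a tracking store is what lets the history page monitor, cancel, and resubmit jobs independently of where results land.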
16- 3. Mega-streaming data transport
- UDT network transfer protocol
- UDP-based Data Transfer (packet switching)
- Developed by UIC/NCDM (Grossman et al.)
- Improves on TCP, achieves much better throughput
- Can wire SDSS DR5 (2 TB) 10-20x faster
- DR5 to Asia in a few hours using UDT and SECTOR
- http://sdss.ncdm.uic.edu
- Preliminary tests with SkyNodes promising
- Geographically distributed SkyNodes would really benefit from fast data transport
- Currently large xmatch only feasible for local SkyNodes
- SOAP protocol from NCDM
- Enhanced SOAP clients and servers compatible with UDT
- Teraflows using high-performance Web services
17- 4. XMatch Algorithm
- Pipelining
- Send partial results upstream in chunks
- Stream aggregation (similar to caching)
- Take advantage of replication of data in large xmatches
- Combine streams with common data
- Optimize fuzzy join
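The pipelining idea above can be sketched as follows, with each stage standing in for one SkyNode's fuzzy-join step; the chunk size and list-based streams are assumptions made for illustration.

```python
def chunked(rows, size):
    # Yield successive fixed-size chunks of a partial result set.
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def pipeline(rows, stages, size=10000):
    # Each stage receives a chunk and returns the subset that matched
    # at that node. Downstream stages start work per chunk instead of
    # waiting for the full table, which is the pipelining win.
    out = []
    for chunk in chunked(rows, size):
        for stage in stages:
            chunk = stage(chunk)
            if not chunk:
                break               # everything in this chunk dropped out
        out.extend(chunk)
    return out
```

Stream aggregation would slot in here as a cache keyed on common chunk contents, so repeated data shared across large xmatches is matched once rather than per request.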
18- Conclusions
- Large scale on-demand cross-match is critical for NVO success and widespread acceptance
- Even more important for upcoming mega-surveys like Pan-STARRS, LSST
- Can't be done currently with Open SkyQuery
- Need data partitioning, asynchronous workflow and fast data transport to get us to the promised land
- Techniques applicable to other domains