1
  • Large Scale On-Demand Cross-Match
  • with Open SkyQuery

2
Need for Large Xmatch
  • Science case for large xmatch
  • Holy grail of multi-wavelength astronomy
  • Compare data from multiple NASA and ground-based
    archives
  • Federation/xmatch fundamental to NVO mission
  • Why on-demand?
  • Large fraction of data outside of data centers
  • This will get larger in the future
  • Time for data to make it to data center already
    prohibitive
  • Iterative xmatch refinement
  • Re-try until you get it right
  • Specify additional constraints from source
    catalogs
  • Often use uploaded/private datasets

3
Examples
  • SDSS DR5 vs 2MASS
  • 2.5 TB (200M obj) vs. 200 GB (400M obj)
  • Full or large xmatch takes days at the moment
  • Typically it takes 6 months to 1 year for data to
    be published, even longer to distribute to mirrors
  • If you think SDSS is bad, wait till LSST gets
    here (2013)!
  • Large Synoptic Survey Telescope (lsst.org)
  • One SDSS every 3-4 nights!
  • Petabytes of data, all public
  • Fast xmatch needed both in pipeline and science
    use cases

4
Open SkyQuery
Open SkyQuery Architecture
  • Federated query/cross-match for Virtual
    Observatory (AISR project)
  • Distributed Web services architecture
  • Portal routes queries to one or more SkyNodes

  • Performance query gets counts from each node
  • Portal prepares ExecPlan with nodes in ascending
    order of counts
  • Probabilistic (fuzzy) join at each node with
    results from the previous node
  • Non-matches are dropped at each node
5
Open SkyQuery
  • ADQL Query
  • Basic SQL
  • REGION
  • XMATCH
  • Limited XMatch
  • 5k row limit for each SkyNode
  • Avoids long xmatches
  • Large-Scale xmatch is not possible today!

SELECT o.objId, o.ra, o.dec, t.ra, t.dec, t.objId, o.type
FROM   SDSSPhotoPrimary o, TWOMASSPhotoPrimary t, USNOBPhotoPrimary p
WHERE  XMATCH(o, t, !p) < 3.5
AND    Region('CIRCLE J2000 182.5 -0.89 8')
AND    o.type = 3
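In this example the XMATCH constraint asks for objects that match between
SDSS and 2MASS but have no USNO-B counterpart (the ! negation), within the
3.5 tolerance, restricted to the given REGION circle and to SDSS galaxies
(type = 3).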
6
How to get to large xmatch
  • 1. Data partitioning
  • Parallel data access for individual SkyNodes
  • Declination zone-based spatial partitioning
  • 2. Asynchronous workflow
  • Handle long xmatch jobs in batch mode
  • Browser clients cannot render large outputs
  • Send output to VO-store for user to pick up
  • 3. Fast data transport
  • Speed up data exchange between SkyNodes
  • 4. Xmatch algorithm optimizations
  • Pipelining, caching etc.

7
1. Data Partitioning with Zones
  • High Speed Data Access (AISR project)
  • Two-Step Process
  • 1. Distribute data homogeneously among servers.
  • Each server holds roughly the same number of
    objects.
  • Objects inside servers are spatially related.
  • Balances the workload among servers.
  • Queries redirected to the server holding the
    data.
  • 2. (Re)Define zones inside each server
    dynamically.
  • Zones are defined according to some search radius
    to solve specific problems
  • Finding galaxy clusters, gravitational lenses, etc.
  • Facilitates cross-match queries from other NVO
    data nodes.

8
Mapping the Sphere into Zones
  • Each Zone is a declination stripe of height h.
  • In principle, h can be any number
  • In practice, 30 arcsec
  • South-pole zone = Zone 0.
  • Each object belongs to one Zone
  • ZoneID = floor((dec + 90) / h) (see the SQL sketch
    below)
  • Each server holds N contiguous Zones.
  • N is determined by
  • the number of objects each zone contains and the
    number of servers in the cluster.
  • Not all servers contain the same number of zones.
  • Not all servers cover the same declination range.
  • Straightforward mapping between queries and
    servers
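
A minimal T-SQL sketch of this zone assignment (the table and column
names, such as PhotoObj and ZonedObjects, are hypothetical, not from the
slides):

DECLARE @h float
SET @h = 30.0 / 3600.0        -- 30 arcsec zone height, in degrees

SELECT objID, ra, dec,
       FLOOR((dec + 90.0) / @h) AS zoneID   -- south-pole zone is Zone 0
INTO   ZonedObjects
FROM   PhotoObj

-- A server holding zones [zMin, zMax] then answers only the queries whose
-- declination range maps into that zone interval.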

9
Partitioning Benchmark
  • MaxBCG (Annis et al. 2002)
  • Find Brightest Cluster Galaxy (BCG)
  • First find cluster candidates
  • Then find brightest galaxy in each
  • Compare file-based TAM/Grid implementation with
    SQL

SQL Server vs. TAM on the entire SDSS DR2 (3326 deg2, 1 TB):
  • SQL Server: 10 servers, dual 2.6 GHz; for 0.5 buffer
    and z-steps of 0.001: 44 hrs
  • TAM: 10 servers, 600 MHz PIII; for 0.5 buffer and
    z-steps of 0.01: 9326 hrs (divide by 4 to adjust for
    CPU differences: 2309 hrs)
  • SQL Server is 50x faster
10
SDSS DR3 vs 2MASS Xmatch
  • CPU performance
  • Highly scalable
  • Near linear
  • I/O performance
  • Also scalable
  • Near linear

11
Recent Results with Zones
  • SDSS DR5 vs 2MASS xmatch
  • 74M (2MASS) vs. 54M (SDSS DR5) rows
  • All the matching stars in both surveys
  • 4.5 hours instead of 2 days
  • Most of this time (3.5 hours) needed to prepare
    data
  • Even Michael Way is impressed!
  • Preparation of data for zone xmatch (see the SQL
    sketch below)
  • Separate zone table or not
  • How much other information in zone table
  • Can really save a lot of xmatch time if done in
    advance
  • Classic space vs time tradeoff
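
A minimal sketch of such an advance preparation step, building on the
hypothetical ZonedObjects table from the earlier sketch (the zone-table
and column names here are likewise hypothetical):

-- Precompute a compact zone table so the xmatch never scans the full
-- photometric table: the space-vs-time tradeoff noted above.
SELECT zoneID, objID, ra, dec,
       COS(RADIANS(dec)) * COS(RADIANS(ra)) AS cx,   -- unit-vector components,
       COS(RADIANS(dec)) * SIN(RADIANS(ra)) AS cy,   -- precomputed for fast
       SIN(RADIANS(dec))                    AS cz    -- angular-distance tests
INTO   SdssZone
FROM   ZonedObjects

CREATE CLUSTERED INDEX ix_SdssZone ON SdssZone (zoneID, ra)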

12
2. Asynchronous Workflow
  • Start with CasJobs model
  • Use MyDB interface and batch system
  • Tied to SQL Server at the moment, but not SDSS
  • Already deployed for GALEX archive at STScI, PQ
    at CalTech
  • Xmatch queries can invoke sync or async mode
  • Next make it distributed
  • Single-sign on security for VO users
  • Distributed asynchronous workflow
  • Route intermediate results from SkyNodes to
    distributed VOStore
  • Send final results to VOStore with link to user

13
CasJobs and MyDB
  • Batch Query Workbench for SDSS CAS
  • Developed with SDSS and AISR support
  • Queries are queued for batch execution
  • Load balancing queues on multiple servers
  • Limit of 2 simultaneous queries per server
  • Synchronous mode: short (1 minute) queue
  • Asynchronous mode: batch (8 hour) queue(s)
  • MyDB personal database
  • 1 GB (more on demand) SQL DB for each user
  • Long queries write to a MyDB table by default
    (example below)
  • User can extract output (download) when ready
  • Share MyDB tables with others via groups
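
A sketch of how a long batch query might land its output in MyDB; the
mydb. target prefix follows CasJobs usage, while SdssZone is the
hypothetical zone table from the earlier sketches:

-- Batch query: results go into a MyDB table instead of the browser;
-- the user extracts (downloads) the table when the job finishes.
SELECT s.objID, s.ra, s.dec, s.zoneID
INTO   mydb.MyZoneSample                   -- lands in the user's MyDB
FROM   SdssZone s
WHERE  s.dec BETWEEN -1.25 AND 1.25        -- an arbitrary example cut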

14
MyDB Features
  • Tables
  • Views
  • Functions
  • Procedures

View data or query, Create, Drop, Download, Publish, Rename, Plot
15
Job Management
  • Asynchronous
  • Separate query and output jobs
  • Procrastinator picks up jobs
  • BatchAdmin DB keeps track of jobs, servers,
    queues and user privileges
  • Job History page allows user to monitor, cancel,
    and resubmit jobs

16
3. Mega-streaming data transport
  • UDT network transfer protocol
  • UDP-based Data Transfer (packet switching)
  • Developed by UIC/NCDM (Grossman et al.)
  • Improves on TCP, achieves much better throughput
  • Can wire SDSS DR5 (2 TB) 10-20x faster
  • DR5 to Asia in a few hours using UDT and SECTOR
  • http://sdss.ncdm.uic.edu
  • Preliminary tests with SkyNodes promising
  • Geographically distributed SkyNodes would really
    benefit from fast data transport
  • Currently large xmatch only feasible for local
    SkyNodes
  • SOAP protocol from NCDM
  • Enhanced SOAP clients and servers compatible with
    UDT
  • Teraflows using high-performance Web services

17
4. XMatch Algorithm
  • Pipelining
  • Send partial results upstream in chunks
  • Stream aggregation (similar to caching)
  • Take advantage of replication of data in large
    xmatches
  • Combine streams with common data
  • Optimize fuzzy join
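
A minimal sketch of the zone-based fuzzy join itself, using the
hypothetical SdssZone / TwomassZone tables from the earlier sketches and a
3.5 arcsec match radius; because the zone height (30 arcsec) exceeds the
radius, each object only needs to be compared against its own zone and the
two neighbouring zones:

DECLARE @r float
SET @r = 3.5 / 3600.0                     -- 3.5 arcsec match radius, in degrees

SELECT s.objID AS sdssID, t.objID AS twomassID
FROM   SdssZone s
JOIN   TwomassZone t
  ON   t.zoneID BETWEEN s.zoneID - 1 AND s.zoneID + 1   -- neighbouring zones only
 AND   t.dec    BETWEEN s.dec - @r AND s.dec + @r       -- coarse box filter
 AND   t.ra     BETWEEN s.ra  - @r AND s.ra  + @r       -- (a full version widens
                                                        --  this by 1/cos(dec))
WHERE  s.cx * t.cx + s.cy * t.cy + s.cz * t.cz          -- exact angular test via the
       > COS(RADIANS(@r))                               -- unit-vector dot product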

18
Conclusions
  • Large scale on-demand cross-match is critical for
    NVO success and widespread acceptance
  • Even more important for upcoming mega-surveys
    like PAN-STARRS, LSST
  • Can't be done currently with Open SkyQuery
  • Need data partitioning, asynchronous workflow and
    fast data transport to get us to the promised
    land
  • Techniques applicable to other domains