PS1 PSPS Object Data Manager Design - PowerPoint PPT Presentation

1 / 53

About This Presentation

Title:

PS1 PSPS Object Data Manager Design

Description:

ODM Prototype [MN] Hardware/Scalability [JV] How Design ... Naming convention. Uncompress batch. Read batch. Verify Batch. Batch. Verification. Verify Manifest ... – PowerPoint PPT presentation

Number of Views:31

Avg rating:3.0/5.0

Slides: 54

Provided by: juliea2

Category:

more less

Transcript and Presenter's Notes

Title: PS1 PSPS Object Data Manager Design

1
PS1 PSPSObject Data Manager Design

PSPS Critical Design Review
November 5-6, 2007
IfA

2
Outline

ODM Overview
Critical Requirements Driving Design
Work Completed
Detailed Design
Spatial Querying AS
ODM Prototype MN
Hardware/Scalability JV
How Design Meets Requirements
WBS and Schedule
Issues/Risks
AS Alex, MN Maria, JV Jan

3
ODM Overview

The Object Data Manager will
Provide a scalable data archive for the
Pan-STARRS data products
Provide query access to the data for Pan-STARRS
users
Provide detailed usage tracking and logging

4
ODM Driving Requirements

Total size 100 TB,
1.5 x 1011 P2 detections
8.3x1010 P2 cumulative-sky (stack) detections
5.5x109 celestial objects
Nominal daily rate (divide by 3.5x365)
P2 detections 120 Million/day
Stack detections 65 Million/day
Objects 4.3 Million/day
Cross-Match requirement 120 Million / 12 hrs
2800 / s
DB size requirement
25 TB / yr
100 TB by of PS1 (3.5 yrs)

5
Work completed so far

Built a prototype
Scoped and built prototype hardware
Generated simulated data
300M SDSS DR5 objects, 1.5B Galactic plane
objects
Initial Load done Created 15 TB DB of simulated
data
Largest astronomical DB in existence today
Partitioned the data correctly using Zones
algorithm
Able to run simple queries on distributed DB
Demonstrated critical steps of incremental
loading
It is fast enough
Cross-match gt 60k detections/sec
Required rate is 3k/sec

6
Detailed Design

Reuse SDSS software as much as possible
Data Transformation Layer (DX) Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
Schema and Test Queries
Database Management System
Scalable Data Architecture
Hardware
Query Manager (QM CasJobs for prototype)

7
High-Level Organization
8
Detailed Design

Reuse SDSS software as much as possible
Data Transformation Layer (DX) Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
Schema and Test Queries
Database Management System
Scalable Data Architecture
Hardware
Query Manager (QM CasJobs for prototype)

9
Data Transformation Layer (DX)

Based on SDSS sqlFits2CSV package
LINUX/C application
FITS reader driven off header files
Convert IPP FITS files to
ASCII CSV format for ingest (initially)
SQL Server native binary later (3x faster)
Follow the batch and ingest verification
procedure described in ICD
4-step batch verification
Notification and handling of broken publication
cycle
Deposit CSV or Binary input files in directory
structure
Create ready file in each batch directory
Stage input data on LINUX side as it comes in
from IPP

10
DX Subtasks
DX
Initialization Job FITS schema FITS reader CSV
Converter CSV Writer
Batch Ingest Interface with IPP Naming
convention Uncompress batch Read batch Verify
Batch
Batch Verification Verify Manifest Verify FITS
Integrity Verify FITS Content Verify FITS
Data Handle Broken Cycle
Batch Conversion CSV Converter Binary
Converter batch_ready Interface with DLP
11
DX-DLP Interface

Directory structure on staging FS (LINUX)
Separate directory for each JobID_BatchID
Contains a batch_ready manifest file
Name, rows and destination table of each file
Contains one file per destination table in ODM
Objects, Detections, other tables
Creation of batch_ready file is signal to
loader to ingest the batch
Batch size and frequency of ingest cycle TBD

12
Detailed Design

Reuse SDSS software as much as possible
Data Transformation Layer (DX) Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
Schema and Test Queries
Database Management System
Scalable Data Architecture
Hardware
Query Manager (QM CasJobs for prototype)

13
Data Loading Pipeline (DLP)

sqlLoader SDSS data loading pipeline
Pseudo-automated workflow system
Loads, validates and publishes data
From CSV to SQL tables
Maintains a log of every step of loading
Managed from Load Monitor Web interface
Has been used to load every SDSS data release
EDR, DR1-6, 15 TB of data altogether
Most of it (since DR2) loaded incrementally
Kept many data errors from getting into database
Duplicate ObjIDs (symptom of other problems)
Data corruption (CSV format invaluable in
catching this)

14
sqlLoader Design

Existing functionality
Shown for SDSS version
Workflow, distributed loading, Load Monitor
New functionality
Schema changes
Workflow changes
Incremental loading
Cross-match and partitioning

15
sqlLoader Workflow

Distributed design achieved with linked servers
and SQL Server Agent
LOAD stage can be done in parallel by loading
into temporary task databases
PUBLISH stage writes from task DBs to final DB
FINISH stage creates indices and auxiliary
(derived) tables

Loading pipeline is a system of VB and SQL
scripts, stored procedures and functions

16
Load Monitor Tasks Page
17
Load Monitor Active Tasks
18
Load Monitor Statistics Page
19
Load Monitor New Task(s)
20
Data Validation

Tests for data integrity and consistency
Scrubs data and finds problems in upstream
pipelines
Most of the validation can be performed within
the individual task DB (in parallel)

21
Distributed Loading
Samba-mounted CSV/Binary Files
Load Monitor
Master
LoadAdmin
Slave
Slave
LoadSupport
LoadSupport
LoadSupport
View of Master Schema
Task DB
Task DB
Task DB
Publish

Finish
22
Schema Changes

Schema in task and publish DBs is driven off a
list of schema DDL files to execute (xschema.txt)
Requires replacing DDL files in schema/sql
directory and updating xschema.txt with their
names
PS1 schema DDL files have already been built
Index definitions have also been created
Metadata tables will be automatically generated
using metadata scripts already in the loader

23
Workflow Changes
LOAD

Cross-Match and Partition steps will be added to
the workflow
Cross-match will match detections to objects
Partition will horizontally partition data, move
it to slice servers, and build DPVs on main

Export
Check CSVs
Create Task DBs
Build SQL Schema
Validate
XMatch
PUBLISH
Partition
24
Matching Detections with Objects

Algorithm described fully in prototype section
Stored procedures to cross-match detections will
be part of the LOAD stage in loader pipeline
Vertical partition of Objects table kept on load
server for matching with detections
Zones cross-match algorithm used to do 1 and 2
matches
Detections with no matches saved in Orphans table

25
XMatch and Partition Data Flow
26
Detailed Design

Reuse SDSS software as much as possible
Data Transformation Layer (DX) Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
Schema and Test Queries
Database Management System
Scalable Data Architecture
Hardware
Query Manager (QM CasJobs for prototype)

27
Data Storage Schema
28
PS1 Table Sizes Spreadsheet
29
PS1 Table Sizes - All Servers

Table Year 1 Year 2 Year 3 Year 3.5
Objects 4.63 4.63 4.61 4.59
StackPsfFits 5.08 10.16 15.20 17.76
StackToObj 1.84 3.68 5.56 6.46
StackModelFits 1.16 2.32 3.40 3.96
P2PsfFits 7.88 15.76 23.60 27.60
P2ToObj 2.65 5.31 8.00 9.35
Other Tables 3.41 6.94 10.52 12.67
Indexes 20 5.33 9.76 14.18 16.48
Total 31.98 58.56 85.07 98.87
Sizes are in TB
30
Data Storage Test Queries

Drawn from several sources
Initial set of SDSS 20 queries
SDSS SkyServer Sample Queries
Queries from PS scientists (Monet, Howell,
Kaiser, Heasley)
Two objectives
Find potential holes/issues in schema
Serve as test queries
Test DBMS iintegrity
Test DBMS performance
Loaded into CasJobs (Query Manager) as sample
queries for prototype

31
Data Storage DBMS

Microsoft SQL Server 2005
Relational DBMS with excellent query optimizer
Plus
Spherical/HTM (C library SQL glue)
Spatial index (Hierarchical Triangular Mesh)
Zones (SQL library)
Alternate spatial decomposition with dec zones
Many stored procedures and functions
From coordinate conversions to neighbor search
functions
Self-extracting documentation (metadata) and
diagnostics

32
Documentation and Diagnostics
33
Data Storage Scalable Architecture

Monolithic database design (a la SDSS) will not
do it
SQL Server does not have cluster implementation
Do it by hand
Partitions vs Slices
Partitions are file-groups on the same server
Parallelize disk accesses on the same machine
Slices are data partitions on separate servers
We use both!
Additional slices can be added for scale-out
For PS1, use SQL Server Distributed Partition
Views (DPVs)

34
Distributed Partitioned Views

Difference between DPVs and file-group
partitioning
FG on same database
DPVs on separate DBs
FGs are for scale-up
DPVs are for scale-out

Main server has a view of a partitioned table
that includes remote partitions (we call them
slices to distinguish them from FG partitions)
Accomplished with SQL Servers linked server
technology
NOT truly parallel, though

35
Scalable Data Architecture

Shared-nothing architecture
Detections split across cluster

Objects replicated on Head and Slice DBs
DPVs of Detections tables on the Headnode DB
Queries on Objects stay on head node

Queries on detections use only local data on
slices

36
Hardware - Prototype
Storage
S3 PS04
4
10A 10 x 13 x 750 GB 3B 3 x 12 x 500 GB
2A
Server Naming Convention
Function
S2 PS03
4
LX Linux L Load server S/Head DB server M
MyDB server W Web server
PS0x 4-core PS1x 8-core
2A
L2/M PS05
S1 PS12
8
4
A
2A
Head PS11
8
W PS02
4
LX PS01
L1 PS13
8
4
B
2B
2A
A
Web
Staging
Loading
DB
MyDB
Function
9 TB
39 TB
0 TB
10 TB
Total space
RAID10
RAID5
RAID10
RAID10
RAID config
12D/4W
14D/3.5W
Disk/rack config
37
Hardware PS1

Ping-pong configuration to maintain high
availability and query performance

2 copies of each slice and of main (head) node
database on fast hardware (hot spares)

3rd spare copy on slow hardware (can be just
disk)
Updates/ingest on offline copy then switch copies
when ingest and replication finished

Synchronize second copy while first copy is
online
Both copies live when no ingest

3x basic config. for PS1

38
Detailed Design

Reuse SDSS software as much as possible
Data Transformation Layer (DX) Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
Schema and Test Queries
Database Management System
Scalable Data Architecture
Hardware
Query Manager (QM CasJobs for prototype)

39
Query Manager

Based on SDSS CasJobs
Configure to work with distributed database, DPVs
Direct links (contexts) to slices can be added
later if necessary
Segregates quick queries from long ones
Saves query results server-side in MyDB
Gives users a powerful query workbench
Can be scaled out to meet any query load
PS1 Sample Queries available to users
PS1 Prototype QM demo

40
ODM Prototype Components

Data Loading Pipeline
Data Storage
CasJobs
Query Manager (QM)
Web Based Interface (WBI)
Testing

41
Spatial Queries (Alex)
42
Prototype (Maria)
43
Hardware/Scalability (Jan)
44
How Design Meets Requirements

Cross-matching detections with objects
Zone cross-match part of loading pipeline
Already exceeded requirement with prototype
Query performance
Ping-pong configuration for query during ingest
Spatial indexing and distributed queries
Query manager can be scaled out as necessary
Scalability
Shared-nothing architecture
Scale out as needed
Beyond PS1 we will need truly parallel query plans

45
WBS/Development Tasks
2 PM 3 PM 1 PM 3 PM 3 PM 1 PM 2 PM 2 PM 2
PM 2 PM 4 PM 4 PM
Refine Prototype/Schema
Staging/Transformation
Initial Load
4 PM
Load/Resolve Detections

Workflow Systems
Logging
Data Scrubbing
SSIS (?) C

Resolve/Synchronize Objects
Create Snapshot
Replication Module
Query Processing
2 PM

QM/Logging

Hardware
Redistribute Data
Total Effort 35 PM Delivery 9/2008
Documentation
Testing
46
Personnel Available

2 new hires (SW Engineers) 100
Maria 80
Ani 20
Jan 10
Alainna 15
Nolan Li 25
Sam Carliles 25
George Fekete 5
Laszlo Dobos 50 (for 6 months)

47
Issues/Risks

Versioning
Do we need to preserve snapshots of monthly
versions?
How will users reproduce queries on subsequent
versions?
Is it ok that a new version of the sky replaces
the previous one every month?
Backup/recovery
Will we need 3 local copies rather than 2 for
safety
Is restoring from offsite copy feasible?
Handoff to IfA beyond scope of WBS shown
This will involve several PMs

48
Mahalo!
49
Query Manager
MyDB table that query results go into
Check query syntax
Name that this query job is given
Context that query is executed in
Get graphical query plan
Run query in quick (1 minute) mode
Query buffer
Load one of the sample queries into query buffer
Submit query to long (8-hour) queue
50
Query Manager
51
Query Manager
MyDB context is the default, but other contexts
can be selected
User can browse DB Views, Tables, Functions and
Procedures
The space used and total space available
Multiple tables can be selected and dropped at
once
Table list can be sorted by name, size, type.
52
Query Manager
The query that created this table
53
Query Manager
Context to run search on
Search radius
Table to hold results

Write a Comment

User Comments (0)