Title: JSOC Pipeline Processing Overview
1. JSOC Pipeline Processing Overview
- Rasmus Munk Larsen, Stanford University
- rmunk_at_quake.stanford.edu
- 650-725-5485
2. Overview
- Hardware overview
- JSOC data model
- Pipeline infrastructure subsystems
- Pipeline modules
3. JSOC Connectivity
[Diagram: network links between sites. Labels: Stanford, DDS, NASA AMES, LMSAL, MOC, 1 Gb private line, White Net]
4. JSOC Hardware configuration
5. JSOC data model: Motivation
- Evolved from MDI dataset concept to
  - Enable record-level access to meta-data for queries and browsing
  - Accommodate more complex data models required by higher-level processing
- Main design features
  - Lesson learned from MDI: separate meta-data (keywords) and image data
    - No need to re-write large image files when only keywords change (lev1.8 problem)
    - No out-of-date keyword values in FITS headers; can bind to most recent values on export
  - Data access through query-like dataset names
    - All access in terms of (sets of) data records, which are the atomic units of a data series
    - A dataset name is a query specifying a set of data records
      - jsochmi_lev1_V3000-3020 (21 records from a series with known epoch and cadence)
      - jsochmi_lev0_fgt_obs2008-11-07_020000/8hcam doppler (8 hours' worth of filtergrams)
  - Storage and tape management must be transparent to user
    - Chunking of data records into storage units for efficient tape/disk usage done internally
    - Completely separate storage unit and meta-data databases: more modular design
    - MDI data and modules will be migrated to use new storage service
  - Store meta-data (keywords) in relational database
    - Can use power of relational database to search and index data records
    - Easy and fast to create time series of any keyword value (for trending etc.)
6. JSOC data model
- JSOC data will be organized according to a data model with the following classes
  - Series: A sequence of like data records, typically data products produced by a particular analysis
    - Attributes include: Name, Owner, primary search index, Storage unit size, Storage group
  - Record: Single measurement/image/observation with associated meta-data
    - Attributes include: ID, Storage Unit ID, Storage Unit Slot
    - Contain Keywords, Links, Data segments
    - Records are the main data objects seen by module programmers
  - Keyword: Named meta-data value, stored in database
    - Attributes include: Name, Type, Value, Physical unit
  - Link: Named pointer from one record to another, stored in database
    - Attributes include: Name, Target series, target record id or primary index value
    - Used to capture data dependencies and processing history
  - Data Segment: Named data container representing the primary data on disk belonging to a record
    - Attributes include: Name, filename, datatype, naxis, axis[0..naxis-1], storage format
    - Can be either structure-less (any file) or n-dimensional array stored in tiled, compressed file format
  - Storage Unit: A chunk of data records from the same series stored in a single directory tree
    - Attributes include: Online location, offline location, tape group, retention time
    - Managed by the Storage Unit Manager in a manner transparent to most module programmers
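The classes listed above can be pictured as plain C structures. The following is only a minimal sketch: every type, field and function name is hypothetical and chosen to mirror the attribute lists on this slide, not taken from the DRMS library itself. The fixed bound on the axis array is likewise an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical declarations for illustration; not the DRMS library's own. */
typedef enum { KW_INT, KW_DOUBLE, KW_STRING } KeywordType;

typedef struct {
    const char *name;
    KeywordType type;
    union { long long i; double d; const char *s; } value;
    const char *unit;               /* physical unit */
} Keyword;

typedef struct {
    const char *name;
    const char *target_series;
    long long   target_recnum;      /* target record id */
} Link;

typedef struct {
    const char *name;
    const char *filename;
    const char *datatype;
    int         naxis;
    int         axis[8];            /* axis[0..naxis-1]; the bound 8 is arbitrary */
    const char *storage_format;
} Segment;

typedef struct {
    const char *name;
    const char *owner;
    const char *primary_index;      /* primary search index keyword */
    int         unit_size;          /* storage unit size */
    const char *storage_group;
} Series;

typedef struct {
    long long      id;
    long long      storage_unit_id;
    int            storage_unit_slot;
    const Keyword *keywords; size_t n_keywords;
    const Link    *links;    size_t n_links;
    const Segment *segments; size_t n_segments;
} Record;

/* Return the keyword called `name`, or NULL if the record has none. */
const Keyword *record_find_keyword(const Record *rec, const char *name)
{
    assert(rec != NULL && name != NULL);
    for (size_t i = 0; i < rec->n_keywords; i++)
        if (strcmp(rec->keywords[i].name, name) == 0)
            return &rec->keywords[i];
    return NULL;
}
```

A record is the unit a module works with; keywords, links and segments hang off it, while the storage unit id and slot are the only trace of where the data physically lives.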
7. JSOC data model
[Diagram: JSOC data series, the records of one series, and the contents of a single record]
- JSOC Data Series: hmi_lev0_cam1_fg, aia_lev0_cont1700, hmi_lev1_fd_M, hmi_lev1_fd_V, aia_lev0_FE171
- Data records for series hmi_lev1_fd_V: hmi_lev1_fd_V12345 through hmi_lev1_fd_V12353
- Single hmi_lev1_fd_V data record
  - Keywords
    - RECORDNUM 12345 (unique serial number)
    - SERIESNUM 5531704 (slots since epoch)
    - T_OBS 2009.01.05_232240_TAI
    - DATAMIN -2.537730543544E03
    - DATAMAX 1.935749511719E03
    - ...
    - P_ANGLE LINKORBIT,KEYWORDSOLAR_P
  - Links
    - ORBIT: hmi_lev0_orbit, SERIESNUM 221268160
    - CALTABLE: hmi_lev0_dopcal, RECORDNUM 7
    - L1: hmi_lev0_cam1_fg, RECORDNUM 42345232
    - R1: hmi_lev0_cam1_fg, RECORDNUM 42345233
  - Data Segments: V_DOPPLER
  - Storage Unit Directory
8. JSOC subsystems
- SUMS: Storage Unit Management System
  - Maintains database of storage units and their location on disk and tape
  - Manages JSOC storage subsystems: disk array, robotic tape library
    - Scrubs old data from disk cache to maintain enough free workspace
    - Loads and unloads tapes to/from tape drives and robotic library
  - Allocates disk storage needed by pipeline processes through DRMS
  - Stages storage units requested by pipeline processes through DRMS
  - Design features
    - RPC client-server protocol
    - Oracle DBMS (to be migrated to PostgreSQL)
- DRMS: Data Record Management System
  - Maintains database holding
    - Master tables with definitions of all JSOC series and their keyword, link and data segment definitions
    - One table per series containing record meta-data, e.g. keyword values
  - Provides distributed transaction processing framework for pipeline
  - Provides full meta-data searching through JSOC query language
    - Multi-column indexed searches on primary index values allow fast and simple querying for common cases
    - Inclusion of free-form SQL clauses allows advanced querying
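The slide does not say how SUMS decides what to scrub from the disk cache. The sketch below assumes one plausible policy: units whose retention time has expired are dropped from disk, earliest expiry first, until a requested amount of space has been freed. The `StorageUnit` structure and `scrub_cache` function are hypothetical names for illustration, not part of SUMS.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical storage-unit bookkeeping entry (not the SUMS schema). */
typedef struct {
    long id;
    long bytes;      /* size on disk */
    long expires;    /* end of retention period, in seconds since some epoch */
    int  online;     /* 1 while a copy is held in the disk cache */
} StorageUnit;

/* qsort comparator: earliest expiry first. */
static int by_expiry(const void *a, const void *b)
{
    const StorageUnit *x = a, *y = b;
    return (x->expires > y->expires) - (x->expires < y->expires);
}

/* Assumed policy: take expired units offline, oldest first, until at least
 * `needed` bytes have been freed. Sorts `units` in place and returns the
 * number of bytes actually freed (possibly less than `needed`). */
long scrub_cache(StorageUnit *units, size_t n, long now, long needed)
{
    long freed = 0;
    assert(units != NULL);
    qsort(units, n, sizeof *units, by_expiry);
    for (size_t i = 0; i < n && freed < needed; i++) {
        if (units[i].expires > now)
            break;                  /* remaining units are still retained */
        if (units[i].online) {
            units[i].online = 0;
            freed += units[i].bytes;
        }
    }
    return freed;
}
```

Units still inside their retention period are never touched, so a request for more space than the expired units hold simply returns a smaller number and leaves the decision to the caller.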
9. Pipeline software/hardware architecture
[Diagram: layered architecture. Box and arrow labels:]
- Pipeline program module, with JSOC Science Libraries, Utility Libraries and File I/O
- DRMS Library
  - OpenRecords, CloseRecords
  - GetKeyword, SetKeyword
  - GetLink, SetLink
  - OpenDataSegment, CloseDataSegment
  - Data Segment I/O
  - Record Cache (Keywords, Links, Data paths)
- DRMS socket protocol
- Data Record Management Service (DRMS)
- Storage Unit Management Service (SUMS)
  - AllocUnit, GetUnit, PutUnit
  - Storage unit transfer
- JSOC Disks
- Robotic Tape Archive
- Database Server
  - SQL queries
  - Record Catalogs
  - Series Tables, Record Tables, Storage Unit Tables
10. JSOC Pipeline Workflow
[Diagram: pipeline workflow. Box labels:]
- Pipeline Operator
- Pipeline processing plan
- Processing script, "mapfile": list of pipeline modules with needed datasets for input, output
- PUI: Pipeline User Interface (scheduler)
- DRMS session: Module1, Module2, Module3
- Processing History Log
- DRMS: Data Record Management service
- SUMS: Storage Unit Management System
11. Analysis modules: co-I contributions and collaboration
- Contributions from co-I teams
  - Software for intermediate and high level analysis modules
  - Data series definitions
    - Keywords, links, data segments, size of storage units, primary index keywords etc.
  - Documentation
  - Test data and intended results for verification
  - Time
    - Explain algorithms and implementation
    - Help with verification
    - Collaborate on improvements if required (e.g. performance or maintainability)
- Contributions from HMI team
  - Pipeline execution environment
  - Software and hardware resources (development environment, libraries, tools)
  - Time
    - Help with defining data series
    - Help with porting code to JSOC API
    - If needed, collaborate on algorithmic improvements, tuning for JSOC hardware, parallelization
    - Verification
12. HMI module status and MDI heritage
[Diagram: flow chart from primary observables to intermediate and high level data products. Box labels:]
- Primary observables
- Intermediate and high level data products
- Internal rotation
- Heliographic Doppler velocity maps
- Spherical harmonic time series
- Mode frequencies and splitting
- Internal sound speed
- Full-disk velocity, sound speed maps (0-30Mm)
- Local wave frequency shifts
- Ring diagrams
- Doppler velocity
- Carrington synoptic v and cs maps (0-30Mm)
- Time-distance cross-covariance function
- Tracked tiles of Dopplergrams
- Wave travel times
- High-resolution v and cs maps (0-30Mm)
- Egression and ingression maps
- Wave phase shift maps
- Deep-focus v and cs maps (0-200Mm)
- Far-side activity index
- Stokes I,V
- Line-of-sight magnetograms
- Line-of-sight magnetic field maps
- Stokes I,Q,U,V
- Full-disk 10-min averaged maps
- Vector magnetograms, fast algorithm
- Vector magnetic field maps
- Vector magnetograms, inversion algorithm
- Coronal magnetic field extrapolations
- Tracked tiles
- Tracked full-disk 1-hour averaged continuum maps
- Coronal and solar wind models
- Continuum brightness
- Solar limb parameters
- Brightness feature maps
- Brightness images
13. Example: Global Seismology Pipeline
14. Questions to be discussed at working sessions
- List of standard science data products
  - Which data products, including intermediate ones, should be produced by JSOC to accomplish the science goals of the mission?
  - What cadence, resolution, coverage etc. should each data product have?
  - Which data products should be computed on the fly and which should be archived?
  - What are the challenges to be overcome for each analysis technique?
- Detailing each branch of the processing pipeline
  - What are the detailed steps in each branch?
  - Can some of the computational steps be encapsulated in general tools that can be shared among different branches (example: tracking)?
  - What are the CPU and I/O resource requirements of computational steps?
- Contributed analysis modules
  - What groups or individuals will contribute code, and incorporate it in the pipeline?
  - If multiple candidate techniques and/or implementations exist, which should be included in the pipeline?
  - What is the test plan and what data is needed to verify the approach?
15. JSOC Series Definition
16. Global Database Tables
17. Database tables for example series hmi_fd_v
- Tables specific to each series contain per-record values of
  - Keywords
  - Record numbers of records pointed to by links
  - DSIndex: an index identifying the SUMS storage unit containing the data segments of a record
- Series sequence counter used for generating unique record numbers
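As a rough picture of what one row of such a per-series table holds, and of how the sequence counter hands out record numbers, consider the in-memory sketch below. It stands in for the database table only for illustration: the struct layout, the example keyword columns and the function name `series_new_row` are all assumptions, not the actual DRMS schema.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ROWS 128   /* arbitrary bound for this sketch */

/* One row of a hypothetical per-series table. */
typedef struct {
    long long recnum;      /* unique record number */
    long long dsindex;     /* SUMS storage unit holding the record's data segments */
    double    datamin;     /* example keyword columns */
    double    datamax;
    long long orbit_link;  /* record number of the record a link points to */
} SeriesRow;

typedef struct {
    long long seq;         /* series sequence counter */
    SeriesRow rows[MAX_ROWS];
    size_t    n_rows;
} SeriesTable;

/* Append a row with a freshly generated record number.
 * Returns NULL when the table is full. */
SeriesRow *series_new_row(SeriesTable *t, long long dsindex)
{
    assert(t != NULL);
    if (t->n_rows == MAX_ROWS)
        return NULL;
    SeriesRow *row = &t->rows[t->n_rows++];
    row->recnum = ++t->seq;         /* counter only ever moves forward */
    row->dsindex = dsindex;
    row->datamin = row->datamax = 0.0;
    row->orbit_link = -1;           /* -1: link not yet set */
    return row;
}
```

Several records can share one DSIndex, which is how the chunking of records into storage units shows up at the table level.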
18. Pipeline batch processing
- A pipeline batch is encapsulated in a single database transaction
  - If no module fails, all data records are committed and become visible to other clients of the JSOC catalog at the end of the session
  - If a failure occurs, all data records are deleted and the database is rolled back
  - It is possible to commit data produced up to intermediate checkpoints during sessions
[Diagram: Pipeline batch, atomic transaction. Box labels: Register session; Module 1; Module 2.1; Module 2.2; Module N; Commit Data, Deregister; DRMS API (under each step); Input data records; Output data records; DRMS Service, Session Master; Record, Series Database; SUMS]
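The all-or-nothing behaviour of a batch can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the DRMS implementation: `Session`, `run_batch` and the two example modules are hypothetical, and the "catalog" is just an array standing in for the database.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RECORDS 64   /* arbitrary bound for this sketch */

/* Hypothetical session state: committed records are visible to everyone,
 * pending records only inside the session that created them. */
typedef struct {
    long long catalog[MAX_RECORDS]; size_t n_catalog;
    long long pending[MAX_RECORDS]; size_t n_pending;
    long long next_recnum;
} Session;

typedef int (*Module)(Session *);   /* returns 0 on success */

int session_create_record(Session *s)
{
    assert(s != NULL);
    if (s->n_catalog + s->n_pending >= MAX_RECORDS)
        return -1;
    s->pending[s->n_pending++] = s->next_recnum++;
    return 0;
}

void session_commit(Session *s)
{
    for (size_t i = 0; i < s->n_pending; i++)
        s->catalog[s->n_catalog++] = s->pending[i];
    s->n_pending = 0;
}

void session_rollback(Session *s)
{
    s->n_pending = 0;               /* discard everything since the last commit */
}

/* Run modules in order; commit only if every one of them succeeds. */
int run_batch(Session *s, const Module *modules, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (modules[i](s) != 0) {
            session_rollback(s);
            return -1;
        }
    }
    session_commit(s);
    return 0;
}

/* Example modules, for illustration only. */
int module_ok(Session *s)   { return session_create_record(s); }
int module_fail(Session *s) { session_create_record(s); return -1; }
```

An intermediate checkpoint, as mentioned in the last bullet, would correspond to calling `session_commit` between modules, so that a later failure only discards records created after that point.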
19. Example of module code
- A module doing a (naïve) Doppler velocity calculation could look as shown below
- Usage:
    doppler DRMSSESSION=helios:33546 "2009.09.01_160000_TAI" "2009.09.01_170000_TAI"

    #include <stdio.h>

    extern CmdParams_t cmdparams;   /* command line args */
    extern DRMS_Env_t *drms_env;    /* DRMS environment */

    int module_main(void)
    {
      DRMS_RecordSet_t *filtergrams, *dopplergram;
      int first_frame, status;
      char query[1024], *start, *end;

      start = cmdparms_getarg(&cmdparams, 1);
      end   = cmdparms_getarg(&cmdparams, 2);
      /* Build the dataset query; the "[", "=" and "]" delimiters are assumed. */
      snprintf(query, sizeof query, "hmi_lev0_fg[T_Obs=%s-%s]", start, end);
      filtergrams = drms_open_records(drms_env, query, "RD", &status);
      if (filtergrams->num_recs == 0)
      {
        printf("Sorry, no filtergrams found for that time interval.\n");
        return -1;
      }
      first_frame = 0;
      /* Start looping over record set. */
      for (;;)
      {
        first_frame = find_next_framelist(first_frame, filtergrams);
        if (first_frame == -1)   /* No more complete framelists. Exit. */
          break;
        dopplergram = drms_create_records(drms_env, "hmi_fd_v", 1, &status);
        if (status)
          return -1;
        compute_dopplergram(first_frame, filtergrams, dopplergram);
        drms_close_records(drms_env, dopplergram);
      }
      return 0;
    }
20. Example continued

    int compute_dopplergram(int first_frame, DRMS_RecordSet_t *filtergrams,
                            DRMS_RecordSet_t *dopplergram)
    {
      int i, k, n_rows, n_cols, tuning;
      DRMS_Segment_t *fg[10], *dop;
      short *fg_data[10];
      char *pol;
      double *dop_data;

      /* Get pointers for doppler data array. */
      dop = drms_open_datasegment(dopplergram->records[0], "v_doppler", "RDWR");
      n_cols = drms_getaxis(dop, 0);
      n_rows = drms_getaxis(dop, 1);
      dop_data = (double *)drms_getdata(dop, 0, 0);

      /* Get pointers for filtergram data arrays. */
      for (i = first_frame; i < first_frame + 10; i++)
      {
        k = i - first_frame;   /* index into the 10-element local arrays */
        fg[k] = drms_open_datasegment(filtergrams->records[i], "intensity", "RD");
        fg_data[k] = (short *)drms_getdata(fg[k], 0, 0);
        pol = drms_getkey_string(filtergrams->records[i], "Polarization");
        tuning = drms_getkey_int(filtergrams->records[i], "Tuning");
        printf("Using filtergram (%s, %d)\n", pol, tuning);
      }
      /* Do the actual Doppler computation. */
      calc_v(fg_data, dop_data);
      return 0;
    }