Title: Virtual Data in CMS Analysis
1 Virtual Data in CMS Analysis
- A. Arbree, P. Avery, D. Bourilkov, R. Cavanaugh, G. Graham, J. Rodriguez, M. Wilde, Y. Zhao
- CMS and GriPhyN
- CHEP03, La Jolla, California
- March 25, 2003
2 Virtual Data
- Webster dictionary: virtual. Function: adjective. Etymology: Middle English, possessed of certain physical virtues, from Medieval Latin virtualis, from Latin virtus strength, virtue
- Most scientific data are not simple measurements: they are produced from increasingly complex computations (e.g. reconstructions, calibrations, selections, simulations, fits etc.)
- HEP (and other sciences) increasingly CPU/Data intensive
- Programs are significant community resources (transformations)
- So are the executions of those programs (derivations)
- Management of dataset transformations important!
- Derivation: instantiation of a potential data product
- Provenance: exact history of any existing data product
We already do this, but manually!
3 Virtual Data Motivations
"I've detected a muon calibration error and want to know which derived data products need to be recomputed."
"I've found some interesting data, but I need to know exactly what corrections were applied before I can trust it."
[Diagram: Data is consumed-by/generated-by a Derivation and is the product-of a Transformation; a Derivation is the execution-of a Transformation]
"I want to search a database for 3 rare electron events. If a program that does this analysis exists, I won't have to write one from scratch."
"I want to apply a forward jet analysis to 100M events. If the results already exist, I'll save weeks of computation."
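The first question above (which derived products must be recomputed?) is a reachability query over recorded derivations. The following is a minimal Python sketch of that idea, not Chimera's API: the `Derivation` record layout, the function `downstream_products`, and all file and program names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """One recorded execution: which transformation ran on which data."""
    name: str
    transformation: str
    inputs: tuple
    outputs: tuple


def downstream_products(derivations, bad_item):
    """Return every data product that depends, directly or not, on bad_item.

    bad_item may be a data product or a transformation name, so the same
    query covers a bad input file and a faulty program.
    """
    tainted = set()
    queue = deque([bad_item])
    while queue:
        item = queue.popleft()
        for dv in derivations:
            if item in dv.inputs or item == dv.transformation:
                for out in dv.outputs:
                    if out not in tainted:
                        tainted.add(out)
                        queue.append(out)
    return tainted


# Hypothetical catalog contents, for illustration only.
catalog = [
    Derivation("d1", "muon_calib", ("raw.dat",), ("calib.dat",)),
    Derivation("d2", "reco", ("calib.dat",), ("reco.dat",)),
    Derivation("d3", "jet_analysis", ("reco.dat",), ("jets.root",)),
    Derivation("d4", "electron_id", ("raw.dat",), ("electrons.root",)),
]

# Everything produced by, or downstream of, the muon calibration step.
print(sorted(downstream_products(catalog, "muon_calib")))
```

In this toy catalog the electron branch is untouched by the calibration error, so it would not need to be recomputed.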
4 Virtual Data Motivations
- Data track-ability and result audit-ability: a "Virtual Logbook"
- In the nature of science
- Reproducibility of results
- Tools and data sharing and collaboration (data with recipe)
- Individuals discover other scientists' work and build from it
- Different teams can work in a modular, semi-autonomous fashion: reuse previous data/code/results or entire analysis chains
- Repair and correction of data (c.f. make)
- Workflow management, performance optimization: data staged-in from remote site OR re-created locally on demand?
- Transparency with respect to location and existence
5 Introducing CHIMERA: The GriPhyN Virtual Data System
- Virtual Data Language
- textual (concise, for human consumption)
- XML (uses XML schema, for component integration)
- Virtual Data Interpreter
- implemented in Java
- Java API and command-line toolkit
- Virtual Data Catalog: tracks data provenance (acts like a metadata repository), with different back-ends for persistency
- PostgreSQL and MySQL DB
- file based (for easy testing)
6 Virtual Data in CHIMERA
- A function call paradigm
- Virtual data: data objects with a well defined method of (re)production
- Transformation: namespace::identifier:version
- Abstract description of how a script/executable is invoked
- Similar to a "function declaration" in C/C++
- Derivation: namespace::identifier:version range
- Invocation of a transformation with specific arguments
- Similar to a "function call" in C/C++
- Can be either past or future
- a record of how logical files were produced
- a recipe for creating logical files at some point in the future
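The declaration/call analogy can be illustrated with two small record types. This is a minimal Python sketch, not the Chimera Java API: the class layout, the `describe` method, the `namespace::identifier:version` rendering, and the example namespace `cms` and version `1.0` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """Like a function declaration: a name plus formal arguments."""
    namespace: str
    identifier: str
    version: str
    formal_args: tuple

    @property
    def fqn(self):
        # Assumed naming scheme: namespace::identifier:version
        return f"{self.namespace}::{self.identifier}:{self.version}"


@dataclass
class Derivation:
    """Like a function call: a transformation bound to actual arguments."""
    name: str
    transformation: Transformation
    actual_args: dict

    def __post_init__(self):
        # Reject arguments the transformation never declared.
        unknown = set(self.actual_args) - set(self.transformation.formal_args)
        if unknown:
            raise ValueError(f"unknown arguments: {sorted(unknown)}")

    def describe(self):
        args = ", ".join(f"{k}={v}" for k, v in self.actual_args.items())
        return f"{self.name} -> {self.transformation.fqn}({args})"


# Hypothetical example values.
pythia = Transformation("cms", "pythia", "1.0", ("a2", "a1", "param"))
x1 = Derivation("x1", pythia, {"a2": "file2", "a1": "file1", "param": "160.0"})
print(x1.describe())
```

The same `Derivation` record serves both roles named above: stored after a run it is a record of how the logical files were produced, and stored before a run it is a recipe for producing them later.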
7 Virtual Data Language
  TR pythia( out a2, in a1, none param="160.0" ) {
    argument arg = ${param};
    argument file = ${a1};
    argument file = ${a2};
  }
  TR cmsim( out a2, in a1 ) {
    argument files = ${a1};
    argument file = ${a2};
  }
  DV x1->pythia( a2=@{out:file2}, a1=@{in:file1} );
  DV x2->cmsim( a2=@{out:file3}, a1=@{in:file2}, @{in:cardfile} );
[Diagram, labeled "Build-style recipe" and "Make-style recipe": file1 -> x1 -> file2; file2, cardfile -> x2 -> file3]
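The chain on this slide (file1 -> x1 -> file2, then file2 and cardfile -> x2 -> file3) can be turned into an execution order by matching each derivation's inputs to the derivation that produces them. The following Python sketch shows a build-style recipe, i.e. every step needed for a target regardless of what already exists. The dictionary layout and function names are hypothetical; only the derivation and file names come from the slide.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Derivation name -> (input logical files, output logical files).
derivations = {
    "x1": ({"file1"}, {"file2"}),
    "x2": ({"file2", "cardfile"}, {"file3"}),
}


def execution_order(derivations):
    """Order derivations so each runs after the producers of its inputs."""
    producer = {f: n for n, (_, outs) in derivations.items() for f in outs}
    graph = {
        name: {producer[f] for f in inputs if f in producer}
        for name, (inputs, _) in derivations.items()
    }
    return list(TopologicalSorter(graph).static_order())


def build_recipe(target, derivations):
    """Build-style: all derivations needed for target, in execution order."""
    producer = {f: n for n, (_, outs) in derivations.items() for f in outs}
    needed, pending = set(), [target]
    while pending:
        f = pending.pop()
        name = producer.get(f)  # None for primary inputs such as file1
        if name is not None and name not in needed:
            needed.add(name)
            pending.extend(derivations[name][0])
    return [n for n in execution_order(derivations) if n in needed]


print(build_recipe("file3", derivations))
```

A make-style recipe differs only in that it stops walking backwards at any file that already exists.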
8 Abstract and Concrete DAGs
- Abstract DAXs (Virtual Data DAG)
- abstract directed acyclic graph with
- logical names for files/executables
- (complete build-style recipe as DAX)
- Resource locations unspecified
- File names are logical
- Data destinations unspecified
- Concrete DAGs (stuff for DAGMan)
- CONDOR style DAG for grid execution
- (check RC, skip steps, make-style)
- Resource locations determined
- Physical file names specified
- Data delivered to and returned from physical locations
[Diagram: VDL -> (XML) -> VDC -> (XML) -> Abstract Planner -> DAX (logical) -> Concrete Planner (consults RC) -> DAG (physical) -> DAGMan]
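The step from abstract to concrete plan can be sketched as follows: walk back from the requested files, stop wherever the replica catalog (RC) already has a copy, assign physical names to whatever must still be produced, and emit the dependencies in DAGMan's `JOB` / `PARENT ... CHILD` form. This is a simplified Python illustration under stated assumptions, not the Chimera planner: the function `concretize`, the job tuple layout, and the `<name>.sub` submit file names are hypothetical, and transfer and registration jobs are left out.

```python
def concretize(abstract_jobs, targets, replica_catalog, site_prefix):
    """Reduce an abstract plan to the jobs still needed and name files physically.

    abstract_jobs:   list of (name, input files, output files) in execution order
    targets:         logical files the user asked for
    replica_catalog: logical file name -> physical URL of an existing replica
    site_prefix:     URL prefix under which new files will be written
    """
    producer = {f: job for job in abstract_jobs for f in job[2]}

    # Make-style pruning: walk back from the targets, stopping at replicas.
    needed, pending = set(), list(targets)
    while pending:
        f = pending.pop()
        if f in replica_catalog:
            continue
        if f not in producer:
            raise LookupError(f"no replica and no producer for {f!r}")
        name, inputs, _ = producer[f]
        if name not in needed:
            needed.add(name)
            pending.extend(inputs)

    kept = [job for job in abstract_jobs if job[0] in needed]

    # Logical -> physical file names.
    physical = dict(replica_catalog)
    for _, _, outputs in kept:
        for f in outputs:
            physical.setdefault(f, f"{site_prefix}/{f}")

    # DAGMan description of the remaining jobs and their dependencies.
    made_by = {f: name for name, _, outputs in kept for f in outputs}
    lines = [f"JOB {name} {name}.sub" for name, _, _ in kept]
    for name, inputs, _ in kept:
        for f in sorted(inputs):
            if f in made_by:
                lines.append(f"PARENT {made_by[f]} CHILD {name}")
    return lines, physical


abstract = [
    ("x1", ["file1"], ["file2"]),
    ("x2", ["file2", "cardfile"], ["file3"]),
]
rc = {
    "file1": "gsiftp://testulix/mydir/file1",
    "cardfile": "gsiftp://testulix/mydir/cardfile",
}
lines, physical = concretize(abstract, ["file3"], rc, "gsiftp://testulix/mydir")
print("\n".join(lines))
```

If `file2` were already registered in the replica catalog, job `x1` would be dropped and only `x2` would remain in the concrete DAG.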
9 Nitty-Gritty
- Transformation catalog (expects pre-built executables)
  # poolname  ltransformation  physical transformation       environment string
  local       hw               /bin/echo                     null
  local       pythcvs          /workdir/lhc-h-6-cvs          null
  local       pythlin          /workdir/lhc-h-6-link         null
  local       pythgen          /workdir/lhc-h-6-run          null
  local       pythtree         /workdir/h2root.sh            null
  local       pythview         /workdir/root.sh              null
  local       GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3;VDS_HOME=/vdshome
  local       globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
  ufl         hw               /bin/echo                     null
  ufl         GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3.1_04;VDS_HOME=/vdshome
  ufl         globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
- Pool configuration
  # pool  universe  job-manager-string              url-prefix               workdir ...
  ufl     vanilla   testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir  /mydir
  ufl     standard  testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir  /mydir
  ufl     globus    testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir  /mydir
  ufl     transfer  testulix/jobmanager             gsiftp://testulix/mydir  /mydir
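A catalog of this shape maps a (pool, logical transformation) pair to a physical executable and its environment. The following Python sketch parses such rows; it is not the Chimera implementation. It assumes whitespace-separated columns, `#` for comment lines, and environment strings of the form `KEY=value;KEY=value` with `null` meaning no environment. The function name and sample text are hypothetical.

```python
def parse_transformation_catalog(text):
    """Map (pool, logical name) -> (physical path, environment dict)."""
    catalog = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pool, logical, physical, env_string = line.split(None, 3)
        env = {}
        if env_string != "null":
            # Assumed format: KEY=value pairs separated by semicolons.
            for pair in env_string.split(";"):
                key, _, value = pair.partition("=")
                env[key] = value
        catalog[(pool, logical)] = (physical, env)
    return catalog


SAMPLE = """\
# pool  ltransformation  ptransformation           environment
local   hw               /bin/echo                 null
local   pythgen          /workdir/lhc-h-6-run      null
ufl     globus-url-copy  /vdt/bin/globus-url-copy  GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
"""

tc = parse_transformation_catalog(SAMPLE)
path, env = tc[("ufl", "globus-url-copy")]
print(path, env)
```

Keying on the pool as well as the logical name is what lets the same logical transformation resolve to different executables at different sites.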
10 Data Analysis in HEP
- Decentralized, chaotic
- Needs a system flexible enough to accommodate a large user base and use cases that we can't foresee
- Ability to build scripts/executables on the fly, including user supplied code/parameters (possibly linking with preinstalled libraries on the execution sites)
11 Prototypes
- First for SC2002, second for CHEP03
[Diagram: prototype workflow with nodes CVS tag, CVS, FORTRAN code, C code, datacards, PYTHIA wrapper, root wrapper, compile/link, libraries version N, executable, ntuples, h2root, root trees, plots, event displays]
12 Prototypes
- CHIMERA/ROOT prototype for generating events
with PYTHIA/CMKIN, histogramming and visualization
13 A virtual space of simulated data is created for future use by scientists...
[Diagram: tree of virtual data products, each labeled by its parameters]
- mass = 160
- mass = 160, decay = bb
- mass = 160, decay = ZZ
- mass = 160, decay = WW
- mass = 160, decay = WW, WW → leptons
- mass = 160, decay = WW, WW → eνμν
- mass = 160, event = 8
- mass = 160, decay = WW, event = 8
- mass = 160, decay = WW, WW → eνμν, event = 8
- mass = 160, plot = 1
- mass = 160, decay = WW, plot = 1
- mass = 160, decay = WW, WW → eνμν, plot = 1
14 Search for WW decays of the Higgs boson where the Ws decay to electron and muon: mass = 160, decay = WW, WW → eνμν
[Diagram: the tree of virtual data products from slide 13, repeated]
15 Scientist obtains an interesting result and wants to track how it was derived.
[Diagram: the tree of virtual data products from slide 13, repeated]
16 Now the scientist wants to dig deeper...
[Diagram: the tree of virtual data products from slide 13, repeated]
17 ...The scientist adds a new derived data branch... and continues to investigate!
[Diagram: the tree of virtual data products from slide 13, with one new branch: mass = 160, decay = WW, WW → eνμν, Pt > 20]
18 A Collaborative Data-flow Development Environment: Complex Data Flow and Data Provenance in HEP
[Diagram: data tiers Raw, ESD, AOD and TAG, leading to Plots, Tables, Fits]
- History of a Data Analysis (like CVS)
- "Check-point" a Data Analysis
- Analysis Development Environment
- Audit a Data Analysis
[Diagram: Real Data and Simulated Data, with Comparisons of Plots, Tables, Fits]
19 Outlook
- Work in progress on both the CHIMERA and CMS sides: this is a snapshot
- A CHIMERA/ROOT prototype for building executables on the fly, generating events with PYTHIA/CMKIN, plotting and visualization is available (CHIMERA is a great integration tool)
- The full CMS Monte Carlo chain is working under CHIMERA (next talk)
- Possible future directions
- Workflow management: automatic generation, inheritance
- Store metadata about derivations (like annotations) in a searchable catalog
- Handle Datasets, not just Logical File Names
- Integration with CLARENS (remote access), with ROOT/PROOF (run in parallel)
- A picture is better than 1000 words: Prototype Demo