Virtual Data in CMS Analysis

R.Cavanaugh, G.Graham, J.Rodriguez, M.Wilde, Y.Zhao. CMS & GriPhyN. CHEP03, La Jolla, California ... Most scientific data are not simple 'measurements' produced ...

1
Virtual Data in CMS Analysis
  • A.Arbree, P.Avery, D.Bourilkov,
  • R.Cavanaugh, G.Graham, J.Rodriguez, M.Wilde,
    Y.Zhao
  • CMS & GriPhyN
  • CHEP03, La Jolla, California
  • March 25, 2003

2
Virtual Data
  • Webster dictionary: virtual. Function: adjective.
    Etymology: Middle English, possessed of
    certain physical virtues, from Medieval Latin
    virtualis, from Latin virtus strength, virtue
  • Most scientific data are not simple
    "measurements": they are produced from increasingly
    complex computations (e.g. reconstructions,
    calibrations, selections, simulations, fits etc.)
  • HEP (and other sciences) are increasingly CPU/Data
    intensive
  • Programs are significant community resources
    (transformations)
  • So are the executions of those programs
    (derivations)
  • Management of dataset transformations is important!
  • Derivation: Instantiation of a potential data
    product
  • Provenance: Exact history of any existing data
    product

We already do this, but manually!
3
Virtual Data Motivations
"I've detected a muon calibration error and want
to know which derived data products need to be
recomputed."
"I've found some interesting data, but I need to
know exactly what corrections were applied before
I can trust it."
[Diagram relating Data, Derivation, and Transformation; edge labels: consumed-by/generated-by, product-of, execution-of]
"I want to search a database for 3 rare electron
events. If a program that does this analysis
exists, I won't have to write one from scratch."
"I want to apply a forward jet analysis to 100M
events. If the results already exist, I'll save
weeks of computation."
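
The calibration-error question on this slide amounts to a downstream walk over recorded provenance. A minimal sketch in Python, assuming a hypothetical in-memory list of derivation records; the derivation and file names (`reco`, `muon_calib_v1`, and so on) are illustrative and do not come from CHIMERA or CMS:

```python
from collections import deque

# Hypothetical derivation records: (derivation name, input files, output files).
DERIVATIONS = [
    ("reco", ["raw.dat", "muon_calib_v1"], ["reco.root"]),
    ("select", ["reco.root"], ["muons.root"]),
    ("plot", ["muons.root"], ["pt_plot.eps"]),
    ("sim", ["cards.txt"], ["sim.root"]),
]


def downstream_products(bad_input, derivations):
    """Return every data product that depends, directly or indirectly, on bad_input."""
    stale = []
    seen = {bad_input}
    queue = deque([bad_input])
    while queue:
        current = queue.popleft()
        for _name, inputs, outputs in derivations:
            if current in inputs:
                for out in outputs:
                    if out not in seen:
                        seen.add(out)
                        stale.append(out)
                        queue.append(out)
    return stale


# Products to recompute after a muon calibration error.
print(downstream_products("muon_calib_v1", DERIVATIONS))
```

The breadth-first walk is only one possible choice; a real catalog would answer the same question with a database query over the consumed-by/generated-by relations.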
4
Virtual Data Motivations
  • Data track-ability and result audit-ability:
    "Virtual Logbook"
  • In the nature of science
  • Reproducibility of results
  • Tools and data sharing and collaboration (data
    with recipe)
  • Individuals discover other scientists' work and
    build from it
  • Different teams can work in a modular,
    semi-autonomous fashion: reuse previous
    data/code/results or entire analysis chains
  • Repair and correction of data (c.f. make)
  • Workflow management, performance optimization:
    data staged-in from remote site OR re-created
    locally on demand?
  • Transparency with respect to location and
    existence

5
Introducing CHIMERA: The GriPhyN Virtual Data
System
  • Virtual Data Language
  • textual (concise, for human consumption)
  • XML (uses XML schema, for component integration)
  • Virtual Data Interpreter
  • implemented in Java
  • Java API and command-line toolkit
  • Virtual Data Catalog: tracks data provenance (acts
    like a metadata repository); different back-ends
    for persistency
  • PostgreSQL and MySQL DB
  • file based (for easy testing)

6
Virtual Data in CHIMERA
  • A function call paradigm
  • Virtual data: data objects with a well defined
    method of (re)production
  • Transformation: namespace::identifier:version
  • Abstract description of how a script/executable
    is invoked
  • Similar to a "function declaration" in C/C++
  • Derivation: namespace::identifier:version range
  • Invocation of a transformation with specific
    arguments
  • Similar to a "function call" in C/C++
  • Can be either past or future
  • a record of how logical files were produced
  • a recipe for creating logical files at some point
    in the future
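
The declaration/call analogy on this slide can be sketched with two small Python classes. This is an illustration only, not the CHIMERA Java API: the class layout, the `executed` flag, and the `cms` namespace are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """Abstract description of how an executable is invoked (the 'declaration')."""
    namespace: str
    identifier: str
    version: str
    formal_args: tuple  # pairs of (direction, name), e.g. ("out", "a2")

    def fqn(self):
        # Assumed rendering of the namespace/identifier/version triple.
        return f"{self.namespace}::{self.identifier}:{self.version}"


@dataclass
class Derivation:
    """Invocation of a transformation with specific arguments (the 'call')."""
    name: str
    transformation: Transformation
    actual_args: dict
    executed: bool = False  # False: recipe for the future; True: record of the past

    def _files(self, direction):
        return [self.actual_args[name]
                for d, name in self.transformation.formal_args if d == direction]

    def inputs(self):
        return self._files("in")

    def outputs(self):
        return self._files("out")


# Hypothetical example values.
pythia = Transformation("cms", "pythia", "1.0", (("out", "a2"), ("in", "a1")))
x1 = Derivation("x1", pythia, {"a2": "file2", "a1": "file1"})

print(pythia.fqn())
print(x1.inputs(), x1.outputs(), x1.executed)
```

Keeping the transformation immutable while the derivation carries the `executed` flag mirrors the slide's point that one derivation can be either a past record or a future recipe.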

7
Virtual Data Language
TR pythia( out a2, in a1, none param="160.0" ) {
  argument arg = ${param};
  argument file = ${a1};
  argument file = ${a2};
}
TR cmsim( out a2, in a1 ) {
  argument files = ${a1};
  argument file = ${a2};
}
DV x1->pythia( a2=@{out:file2}, a1=@{in:file1} );
DV x2->cmsim( a2=@{out:file3}, a1=[ @{in:file2}, @{in:cardfile} ] );

[Diagram: file1 -> x1 -> file2, cardfile -> x2 -> file3; labels: "Build-style recipe", "Make-style recipe"]
8
Abstract and Concrete DAGs
  • Abstract DAXs (Virtual Data DAG)
  • abstract directed acyclic graph with
    logical names for files/executables
  • (complete build-style recipe as DAX)
  • Resource locations unspecified
  • File names are logical
  • Data destinations unspecified
  • Concrete DAGs (stuff for DAGMan)
  • Condor-style DAG for grid execution
  • (check RC (replica catalog), skip steps, make-style)
  • Resource locations determined
  • Physical file names specified
  • Data delivered to and returned from physical
    locations

[Diagram: VDL -> XML -> VDC -> XML -> Abs. Plan -> DAX (logical) -> C. Plan (with RC) -> DAG (physical) -> DAGMan]
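
The make-style step from abstract to concrete plan (check the replica catalog, skip steps whose outputs already exist) can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual planner: the dictionaries and the `concretize` function are hypothetical, and only the job names, file names, and the `testulix` host are borrowed from the slides.

```python
# Hypothetical abstract plan: job -> (logical inputs, logical outputs).
ABSTRACT_DAG = {
    "x1": (["file1"], ["file2"]),
    "x2": (["file2", "cardfile"], ["file3"]),
}

# Hypothetical replica catalog: logical file name -> physical URL.
REPLICA_CATALOG = {
    "file1": "gsiftp://testulix/mydir/file1",
    "file2": "gsiftp://testulix/mydir/file2",
    "cardfile": "gsiftp://testulix/mydir/cardfile",
}


def concretize(requested, abstract_dag, replicas):
    """Return the jobs that must run to produce `requested`, in execution order."""
    producer = {out: job
                for job, (_ins, outs) in abstract_dag.items()
                for out in outs}
    plan = []

    def visit(lfn):
        if lfn in replicas:
            return  # already materialized somewhere: skip the producing step
        job = producer.get(lfn)
        if job is None:
            raise KeyError(f"no replica and no recipe for {lfn!r}")
        if job in plan:
            return
        for dep in abstract_dag[job][0]:
            visit(dep)
        plan.append(job)

    visit(requested)
    return plan


# file2 already has a replica, so only x2 needs to run.
print(concretize("file3", ABSTRACT_DAG, REPLICA_CATALOG))
```

Dropping `file2` from the catalog makes the same call return both jobs, with `x1` ahead of `x2`, which is the build-style recipe for `file3`.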
9
Nitty-Gritty
  • Transformation catalog (expects pre-built
    executables)
    poolname  ltransformation  physical transformation       environment String
    local     hw               /bin/echo                     null
    local     pythcvs          /workdir/lhc-h-6-cvs          null
    local     pythlin          /workdir/lhc-h-6-link         null
    local     pythgen          /workdir/lhc-h-6-run          null
    local     pythtree         /workdir/h2root.sh            null
    local     pythview         /workdir/root.sh              null
    local     GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3;VDS_HOME=/vdshome
    local     globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
    ufl       hw               /bin/echo                     null
    ufl       GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3.1_04;VDS_HOME=/vdshome
    ufl       globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
  • Pool configuration
    pool  universe  job-manager-string              url-prefix               workdir  ...
    ufl   vanilla   testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir  /mydir
    ufl   standard  testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir  /mydir
    ufl   globus    testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir  /mydir
    ufl   transfer  testulix/jobmanager             gsiftp://testulix/mydir  /mydir
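
As an illustration of how a planner might use the transformation catalog, the sketch below resolves a (pool, logical transformation) pair to a physical path and environment. It assumes whitespace-separated columns and an environment string of the form `KEY=VALUE;KEY=VALUE`; both the format details and the `parse_catalog` helper are assumptions for this example, not the actual catalog reader.

```python
# A few rows in the assumed four-column layout:
# poolname, logical transformation, physical transformation, environment string.
TC_TEXT = """\
local  hw               /bin/echo                 null
local  globus-url-copy  /vdt/bin/globus-url-copy  GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
ufl    hw               /bin/echo                 null
"""


def parse_catalog(text):
    """Map (pool, logical name) -> (physical path, environment dict)."""
    catalog = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        pool, logical, physical, env = line.split()
        if env == "null":
            env_vars = {}
        else:
            env_vars = dict(item.split("=", 1) for item in env.split(";"))
        catalog[(pool, logical)] = (physical, env_vars)
    return catalog


catalog = parse_catalog(TC_TEXT)
path, env = catalog[("local", "globus-url-copy")]
print(path)
print(env)
```

Keying on the (pool, logical name) pair reflects that the same logical transformation, such as `hw`, may resolve to a different executable on each site.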

10
Data Analysis in HEP
  • Decentralized, chaotic
  • A system flexible enough to accommodate a large
    user base and use cases that we can't foresee
  • Ability to build scripts/executables on the
    fly, including user supplied code/parameters
    (possibly linking with preinstalled libraries on
    the execution sites)

11
Prototypes
  • First for SC2002, second for CHEP03

[Workflow diagram; node labels: CVS tag, CVS, ntuples, h2root, C code, datacards, FORTRAN code, root trees, PYTHIA wrapper, root wrapper, compile/link, executable, libraries version N, event displays, plots]
12
Prototypes
  • CHIMERA/ROOT prototype for generating events
    with PYTHIA/CMKIN, histogramming and visualization

13
A virtual space of simulated data is created
for future use by scientists...
Diagram nodes:
  • mass 160
  • mass 160, decay bb
  • mass 160, decay ZZ
  • mass 160, decay WW
  • mass 160, decay WW, WW → leptons
  • mass 160, decay WW, WW → eνμν
  • mass 160, event 8
  • mass 160, decay WW, event 8
  • mass 160, decay WW, WW → eνμν, event 8
  • mass 160, plot 1
  • mass 160, decay WW, plot 1
  • mass 160, decay WW, WW → eνμν, plot 1
14
Search for WW decays of the Higgs boson where
the Ws decay to electron and muon: mass 160,
decay WW, WW → eνμν
[Diagram: same virtual data tree as on slide 13]
15
Scientist obtains an interesting result and
wants to track how it was derived.
[Diagram: same virtual data tree as on slide 13]
16
Now the scientist wants to dig deeper...
[Diagram: same virtual data tree as on slide 13]
17
...The scientist adds a new derived data
branch...
  • mass 160, decay WW, WW → eνμν, Pt > 20
...and continues to investigate!
[Diagram: same virtual data tree as on slide 13, with the new branch added]
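
The walk-through on slides 13 to 17 (a space of potential data products, tracing how one was derived, then adding a branch) can be mimicked with a toy catalog that materializes products only on request. Everything here is a hypothetical sketch: the `VirtualDataSpace` class, the toy events, and their numbers are invented for illustration and carry no physics meaning.

```python
class VirtualDataSpace:
    """Toy catalog: products are defined by recipes and computed only on demand."""

    def __init__(self):
        self._recipes = {}       # key -> (parent key or None, function)
        self._materialized = {}  # key -> data already produced
        self.runs = []           # keys in the order they were actually computed

    def define(self, key, parent, func):
        self._recipes[key] = (parent, func)

    def get(self, key):
        if key not in self._materialized:
            parent, func = self._recipes[key]
            upstream = None if parent is None else self.get(parent)
            self._materialized[key] = func(upstream)
            self.runs.append(key)
        return self._materialized[key]

    def provenance(self, key):
        """Chain of products from the root down to `key`."""
        chain = []
        while key is not None:
            chain.append(key)
            key = self._recipes[key][0]
        return chain[::-1]


# Hypothetical toy events.
EVENTS = [
    {"decay": "WW", "pt": 35.0},
    {"decay": "WW", "pt": 12.0},
    {"decay": "ZZ", "pt": 50.0},
]

space = VirtualDataSpace()
space.define("mass 160", None, lambda _: list(EVENTS))
space.define("decay WW", "mass 160",
             lambda evs: [e for e in evs if e["decay"] == "WW"])
# The new derived branch added by the scientist.
space.define("Pt > 20", "decay WW",
             lambda evs: [e for e in evs if e["pt"] > 20])

selected = space.get("Pt > 20")
print(len(selected))
print(space.provenance("Pt > 20"))
print(space.runs)
```

A second call to `space.get("decay WW")` returns the stored result without appending to `runs`, which is the reuse that makes the data "virtual" rather than recomputed.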
18
A Collaborative Data-flow Development
Environment: Complex Data Flow and Data
Provenance in HEP
[Diagram labels: Raw, ESD, AOD, TAG; Plots, Tables, Fits; Real Data; Simulated Data; Comparisons: Plots, Tables, Fits]
  • History of a Data Analysis (like CVS)
  • "Check-point" a Data Analysis
  • Analysis Development Environment
  • Audit a Data Analysis
19
Outlook
  • Work in progress on both the CHIMERA and CMS sides: a
    snapshot
  • A CHIMERA/ROOT prototype for building
    executables on the fly, generating events with
    PYTHIA/CMKIN, plotting and visualization is
    available (CHIMERA is a great integration tool)
  • The full CMS Monte Carlo chain is working under
    CHIMERA (next talk)
  • Possible future directions
  • Workflow management, automatic generation,
    inheritance
  • Store metadata about derivations (like
    annotations) in a searchable catalog
  • Handle Datasets, not just Logical File Names
  • Integration with CLARENS (remote access), with
    ROOT/PROOF (run in parallel)
  • A picture is better than 1000 words: Prototype
    Demo