Title: KANGA: ROOT Access to BABAR Data for Physics Analysis
1. KANGA: ROOT Access to BABAR Data for Physics Analysis
- David Kirkby, UC Irvine
- for the BABAR Computing Group
- CHEP 03 - Data Management Persistency
- 25 March 2003
Primary reference: T. J. Adye, A. Dorigo, R. Dubitzky, A. Forti, S. J. Gowdy, G. Hamel de Monchenault, R. G. Jacobsen, D. Kirkby, S. Kluth, E. Leonardi, A. Salnikov, L. Wilden, Comput. Phys. Commun. 150, 197-214 (2003).
2. The BABAR Experiment
- The BABAR experiment records e+e- collisions at the SLAC PEP-II collider.
- BABAR has 600 collaborators from 77 institutions in 10 countries. Approximately half are from US institutions.
3. The BABAR Detector
- The BABAR detector has 200k channels, read out at 100 Hz into a typical raw-data event size of 25 kB.
- The experiment wrote 300 TB to tape for the 40 fb-1 recorded during 2001, with 10 TB kept on disk at SLAC.
- Projected luminosity increases will deliver an integrated 500 fb-1 by the end of 2006.
4. BABAR Physics Analysis and Data Access
- BABAR has published 36 physics papers since Feb 2001.
- The typical physics analysis only needs access to a micro-DST for sparse subsets of data and Monte Carlo.

Event-data tiers and per-event sizes:
- Tag: 0.7 kB/evt
- Micro-DST (incl. truth subset): 3.0 kB/evt
- Analysis objs.: 8.5 kB/evt
- Event summary data / Reconstructed data: 120 kB/evt
- Raw/Simulated hit data: 53 kB/evt
- Monte Carlo truth data: 15 kB/evt

- Until 1999, data was stored exclusively in an Objectivity (Objy) database (now >750 TB). Raw, Sim and Reco data are no longer kept.
5. BABAR Analysis Framework
- BABAR analysis uses a standard software framework with Begin/NextEvent/Finalize transitions.
- Each transition is passed through a sequence of execution modules sharing a common base class.
- Special modules handle data I/O and the conversion between persistent and transient object representations.
- User modules deal only with transient object representations.
- Data access is handled differently for event and non-event (conditions) sources.
- This framework design completely decouples the reconstruction and analysis code from the data-store technology, at some cost in performance.
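The module pattern above can be sketched as follows. This is a minimal illustration, not BABAR's actual API: class and method names (`AppModule`, `Framework`, `CountModule`) are hypothetical stand-ins for the real framework classes.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Event { int id = 0; };   // stand-in for the transient event seen by user modules

// Common base class: every execution module implements the same transitions.
class AppModule {
public:
  virtual ~AppModule() = default;
  virtual void beginJob() {}
  virtual void nextEvent(Event&) {}
  virtual void endJob() {}
};

// The framework drives each transition through the module sequence in order.
class Framework {
public:
  void append(std::unique_ptr<AppModule> m) { modules_.push_back(std::move(m)); }
  void run(std::vector<Event>& events) {
    for (auto& m : modules_) m->beginJob();
    for (auto& e : events)
      for (auto& m : modules_) m->nextEvent(e);   // every module sees every event
    for (auto& m : modules_) m->endJob();
  }
private:
  std::vector<std::unique_ptr<AppModule>> modules_;
};

// Example user module: works only on the transient Event, knows nothing of storage.
struct CountModule : AppModule {
  int seen = 0;
  void nextEvent(Event&) override { ++seen; }
};
```

Because I/O lives in dedicated modules at the ends of the sequence, swapping the persistent store (Objy to ROOT) only replaces those modules; user modules like `CountModule` are untouched.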
6. Motivation for KANGA
- An Aug 99 review of BABAR computing examined the challenges involved in producing first physics results under conference-deadline pressure. Access to data, both at SLAC and at remote sites, was identified as a critical bottleneck in physics analysis.
- Objectivity (Objy) performance problems were recognized as a weakness of the computing model at the time: in particular, the limitations imposed by large files (2 GB for analysis data) and poor lock-server scaling with many (~100) clients.
- The review committee recommended that BABAR develop a limited-function, short-to-medium-term solution.
7. KANGA Design Requirements
- This recommendation led to the following design requirements:
  1. Access to the identical micro-DST data available from Objy; no support for access to lower-level data.
  2. Compatibility with the existing framework and user analysis code, with changes almost transparent to analysis users (a relink is required).
  3. Fast event filtering using simple attribute (TAG) data.
  4. Simple and efficient distribution of data to remote (non-SLAC) sites.
8. The Implementation: KANGA(ROO)
- Kind ANd Gentle Analysis (without Relying On Objectivity).
- The key technical decision was to use ROOT objects and files for the persistent data store.
- In general, there are many tradeoffs involved in the Objy/ROOT decision. Our decision was made in the context of a limited-function, short-term solution that would enhance the capabilities of a continuing Objy data store, and that could be completed quickly.
- KANGA was implemented and deployed in 4 months by a small (5-person) team in 1999.
9. Event Data Overview
- KANGA event data is stored in ROOT TTree objects. Each branch represents a small set of persistent classes, with one branch instance per event.
- Events from one run are usually grouped into a single file containing 2 trees (Analysis objs., Tag attributes). Typical event size is 1.7 kB for data (21.6 GB per fb-1) and 4.7 kB for Monte Carlo.
- Tag attributes are stored as built-in types.
[Diagram: a KANGA file (~10^6 of these now) contains an Analysis Objs tree (class-1 ... class-n) and a Tag attributes tree (attr-1 ... attr-m).]
10. Event Data Architecture
- BABAR event data I/O is managed by special-purpose framework execution modules. Only those modules dealing directly with persistent analysis objects and Tag attributes were re-implemented for KANGA.
[Diagram: production chain RAW -> Input Module -> Reco. Modules -> Output Module -> mDST; analysis chain mDST -> Input Module -> Analysis Modules.]
A significant factor in the rapid deployment of
KANGA was the earlier design decision to
completely decouple the event store technology
from the analysis framework.
11. Event Data Attribute Tags
- The design requirement of fast selection on a sparse set of event attributes (total energy, number of muons, etc.) required a small compromise in the persistent/transient decoupling to gain improved efficiency.
- Instead of converting attributes, an adapter pattern implements the transient interface directly in terms of the persistent objects.
- This compromise ties the transient class directly to the ROOT persistent class, but without exposing the persistent class to user code.
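The adapter compromise can be sketched like this. All names here (`PersistentTag`, `AbsEventTag`, `RooEventTagAdapter`) are illustrative, not the actual BABAR classes: the point is that the transient interface is implemented directly on top of the persistent record, so fast tag-based filtering pays no conversion cost.

```cpp
#include <cassert>
#include <cstdint>

// Persistent tag record: built-in types, exactly as stored in the ROOT Tag tree.
struct PersistentTag {
  float totalEnergy = 0.f;
  std::int32_t nMuons = 0;
};

// Transient interface seen by user code.
class AbsEventTag {
public:
  virtual ~AbsEventTag() = default;
  virtual float totalEnergy() const = 0;
  virtual int nMuons() const = 0;
};

// Adapter: implements the transient interface in terms of the persistent
// object, without exposing the persistent class to user modules.
class RooEventTagAdapter : public AbsEventTag {
public:
  explicit RooEventTagAdapter(const PersistentTag& t) : tag_(t) {}
  float totalEnergy() const override { return tag_.totalEnergy; }
  int nMuons() const override { return tag_.nMuons; }
private:
  const PersistentTag& tag_;    // ties transient directly to persistent layout
};
```

User code filters on `AbsEventTag` only; the coupling to the ROOT persistent class is confined to the adapter.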
12. Event Data Object References
- Direct references (e.g., by pointer) between transient classes require special handling to be persisted.
- We implemented a general mechanism to support persistence of references between transient objects valid in a single execution context. In practice, this limits references to be within an event and does not support inter-event references.
- BABAR transient classes do not use direct references, relying instead on indirect indexing, so this feature is not currently being exploited.
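The indirect-indexing style can be sketched as follows (types and names are hypothetical): a reference is just an index into a per-event list, which trivially survives a write/read round trip and is meaningful only within the current event's context.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Track { double pt = 0.0; };   // stand-in transient object

// An index-based reference: persists as a plain integer, valid only
// within the event that owns the track list.
struct CandidateRef {
  std::size_t index;
  const Track& resolve(const std::vector<Track>& tracks) const {
    return tracks.at(index);   // resolution requires the event's own list
  }
};
```

A raw `Track*` would be meaningless after persistence; the integer index is not, which is why the special reference-persistence machinery goes unused.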
13. Event Data Schema Evolution
- The schema describes the organization of data in a persistent object.
- Schema evolution is desirable to support improvements in data representation and the pruning of obsolete data.
- ROOT I/O supports schema evolution for TObject subclasses via user-managed version numbers for each persistent class, which are used to dispatch the appropriate input-streamer code at object-read time.
- KANGA additionally requires updated classes to implement a standard (frozen) interface for persistent->transient conversion.
14. Event Data Schema Evolution (cont.)
- After schema evolution, only new objects are written by new code.
- New and existing code must be linked against all versions of the persistent classes. No change is required to user modules.
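A hedged sketch of version-dispatched reading, in the spirit of the scheme above: the version number stored with each object selects the streamer code, and every schema version converts to the same frozen transient interface. The layouts (`HitV1`, `HitV2`) and function are invented for illustration; real ROOT streamers operate on buffers, not raw structs.

```cpp
#include <cassert>

struct TransientHit { double x = 0, y = 0; };   // frozen transient interface

// Hypothetical persistent layouts for two schema versions.
struct HitV1 { float x = 0; };             // old schema: no y member
struct HitV2 { float x = 0, y = 0; };      // new schema

// Dispatch on the stored class version, as ROOT does at object-read time.
TransientHit readHit(int classVersion, const void* buf) {
  TransientHit t;
  switch (classVersion) {
    case 1:                                // old data: y takes its default
      t.x = static_cast<const HitV1*>(buf)->x;
      break;
    case 2: {
      auto* v2 = static_cast<const HitV2*>(buf);
      t.x = v2->x;
      t.y = v2->y;
      break;
    }
  }
  return t;
}
```

Since both branches produce the same `TransientHit`, code linked against all persistent versions can read old and new files while user modules stay unchanged.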
15. Conditions Data Overview
- Non-event data tracks slowly-varying (<1 Hz) data-taking conditions, e.g. high voltages, gas flows, temperatures.
- Calibration results are also considered conditions.
- Conditions data is accessed using time as a key, unlike event data.
- The full BABAR conditions DB is implemented in Objy and supports a flexible revision mechanism.
16. KANGA Conditions Data
- KANGA supports access to the limited set of conditions needed for typical physics analysis.
- Access is read-only and limited to a single revision. The most recent revision of the specific conditions is automatically extracted from Objy and stored in a single ROOT file of ~20 MB. Separate files are used for data and MC.
- The ROOT persistent implementation uses a binary tree (BTree class) for efficient time-key lookup with 1 s resolution.
- Correct association of event and non-event ROOT files requires some non-trivial bookkeeping.
17. Event Collections
- Physics analysis typically involves analyzing sparse subsets of the events in a data file, but different analyses require different subsets.
- The sparse collections used for analysis are grouped into ~100 skims. Skims were initially written using self-contained copies of each event; grouping correlated skims into ~20 streams limited the event-duplication overhead to ~200%.
- More recently, pointer-based collections were implemented. These are more efficient for bulk storage and distribution, but carry additional bookkeeping overhead. We are now moving in this direction.
18. KANGA Bookkeeping and Production
- The set of available KANGA event-data files and their processing history is tracked in a relational DB managed with Perl scripts (the SkimTools package).
- This DB is used to schedule and monitor jobs for producing KANGA files from Objy (as well as physics skims from unfiltered data and MC).
- Users can query this database to prepare a TCL fragment that configures their analysis job to analyze a dataset.
- The DB is ~400 MB in size. Tables and scripts are compatible with Oracle and MySQL.
19. Data Export
- Straightforward and efficient data export was a primary requirement of the KANGA design. Goals:
  - only transfer files that are new (once created, a file is assumed to never change);
  - mirror the SLAC filesystem layout to simplify logical-to-physical name mapping between sites.
- The initial implementation, based on rsync, was not efficient for typical directories containing O(1000) files. The present implementation uses the relational DB to efficiently generate lists of new files to transfer.
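Under the two stated assumptions (files are immutable once created, and sites mirror the SLAC layout), the transfer list reduces to a set difference between the files the bookkeeping DB knows about and those already present at the site. A minimal sketch, with an invented function name:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Files listed by the DB but absent at the remote site are exactly the
// new files to transfer; immutability means nothing else needs checking.
std::vector<std::string> newFiles(const std::set<std::string>& knownAtSlac,
                                  const std::set<std::string>& presentAtSite) {
  std::vector<std::string> out;
  std::set_difference(knownAtSlac.begin(), knownAtSlac.end(),
                      presentAtSite.begin(), presentAtSite.end(),
                      std::back_inserter(out));
  return out;
}
```

This is why the DB-driven approach beats per-directory rsync scans: the list of candidates comes from one query rather than stat-ing O(1000) files per directory.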
20. Experience and Outlook
- Since May 2002, the primary KANGA event store is based at Rutherford (RAL). RAL currently stores 22 TB of data and Monte Carlo (8B events) in 1.1M files.
- A survey in early 2002 found that at least 19 institutions operated a local KANGA event store, including 5 with the majority of data available.
- Head-to-head comparisons of analysis results obtained with KANGA and Objy provide a valuable QA tool.
21. Experience and Outlook (cont.)
- Although conceived as a short-term solution, KANGA is still with us 3 years later. The burden of duplicated support and storage is becoming unsustainable.
- BABAR is now implementing a new Computing Model in which ROOT is the primary event-store technology.
- This migration involves the eventual complete phase-out of Objectivity from the event store, and possibly significant changes to the original KANGA design to support other features of the new Computing Model.