Title: Advanced software methods for physics analysis
1. Advanced software methods for physics analysis
2. Introduction
- Unprecedented data analysis complexity is experienced in BaBar today, and will be at the LHC in the future:
  - huge data samples
  - a large number of analyses
  - a high level of analysis complexity
- The analysis software and data model must have a high level of reliability
- Provide the user community with a set of powerful, generic, simple-to-use tools to analyze data
- Core software technicalities must be hidden from the end user behind well-defined interfaces
- Code reuse must be pursued so that users can share what has been developed
- A good Object Oriented architecture is a solution
  - N.B. "good" is more important than "OO" here
3. Outline
- Analysis Data Model
- BaBar experience with the new Analysis Model
- Analysis toolkits
- An example of fitting and toy Monte Carlo
- Conclusions
4. BaBar experience
- BaBar has collected up to now ≈ 200×10⁶ BB̄ pairs
- Physics events are organized according to a well-established model, with different components at increasing levels of detail:
  - Tag (boolean bits, floats and integers) for fast event filtering
  - Aod (micro-DST) containing physics-level objects
  - Esd (mini-DST) containing intermediate-level reconstruction objects
- Reconstructed tracks and neutral objects are accessible to users through the information provided by BtaCandidates, which represent particle candidates
- The event store technology was originally based on Objectivity
  - It proved to be unfit for analysis purposes
- Almost every physics analysis is based on the Tag and Aod levels
  - ROOT-IO was supported early for those components
5. The original Analysis Model
- The physics code in the BaBar Framework:
  - performs particle identification and builds intermediate composite particles (π⁰, K_S, D, resonances)
  - composite particles are built by means of a standard (i.e. validated) and configurable set of tools
  - this task is intrinsically very expensive in terms of CPU time because of the high combinatorics involved
  - the output of the process (containers of composite particle candidates) was available only at the transient level
- Very often physics analyses take advantage of an intermediate data reduction step (skimming), which:
  - performs further data reduction on a per-analysis basis
  - finally manages the production of large PAW/ROOT n-tuples containing specific analysis-level physics variables, composite candidate kinematics and vertexing
6. Old model achievements
- This model has so far been very successful in producing high-quality physics results
  - More than 90 physics papers submitted, which include discoveries!
- But experts believe that it wouldn't scale well to 5× the current data set
- Moreover, analysis at the n-tuple level has the disadvantage that existing (and validated) code is lost
  - This leads to replication of code
  - Sharing analysis tools and strategies is challenging
7. The new Analysis Model
- Makes the Event Store content configurable:
  - support for the storage of composite candidates
  - more flexibility: support for the addition of custom user data
- Expands the role of skimming (collections of pre-selected events):
  - rewrite just the customized components
  - borrow components from parent collections
  - provide both pointer-copy and full deep-copy for light skims
  - better reading performance, easy off-site portability
- A possible scenario:
  - analysis working groups may instrument their skims with high-level added data and specific composite candidates
  - final users may produce their own reduced skims (re-skimming), with their own choice of user-added data and composite candidates depending on their actual needs
  - frequent centralized skimming to reduce the delay in the availability of improvements, new skims, etc.
- Interactive access to skim data (ROOT) is supported for analysis!
  - Production of large n-tuples becomes less appealing!
8. Composite particle candidates
- PAW n-tuples were used as an additional storage format to cope with the lack of needed functionality
  - Analysts used to store composite candidates and high-level physics quantities in the n-tuples, often replicating most of the Event Store content
  - Every analysis working group has its own large, heterogeneous PAW/ROOT files (disk-space inefficient) and formats
- With the new model, the candidate particles' genealogy, the fit algorithm and the constraints are written to the DST
  - This design choice trades some reading performance for disk-space savings
- When reading back, all the persistent candidates are translated into transient objects
  - The performance cost of re-fitting all the transient candidates may be mitigated by supporting dynamic data access, also known as "reconstruction on demand" (see the sketch below)
9. User added data: requirements
- Support for storing user-defined variables
  - Widely used physics analysis variables in BaBar: m_ES, ΔE, Δm_{D*,D}, cos θ_{B,D*ℓ}, …
- Guarantee flexibility and uniformity in the interface
  - Support for native types (float, int, …)
  - As well as for more complex objects (ThreeVector, LorentzVector, …)
- User data should refer to analysis objects or be global to the entire event
- Non-intrusive design
  - To extend functionality (the good old Open/Closed OO principle)
- Simplicity from the point of view of the user
  - Technological details are hidden as much as possible
10. User Added Data interface
- The requirements are fulfilled by systematic use of generic programming (C++ templates)
- User variables of generic type:
  - global event data
  - data associable to generic objects
  - specializations for the BtaCandidate case (the BaBar particle candidate object)
- All native and most common structured variable types, as well as BtaCandidate objects, can be stored in the DST
- The user has to deal with only a few concepts and classes:
  - variables, by means of the UsrVariable<T> template class
  - blocks aggregating variable values and associations:
    - the UsrEventBlock class to aggregate quantities relevant to the whole event
    - the UsrCandBlock class to group variables associated to candidates
  - static put/get functions to interact with the event store
11 User defined candidate variables
begin of the job Declare variables
- UsrVariableltfloatgt mES( mES )
- UsrVariableltfloatgt deltaE( deltaE )
- UsrCandBlock B0Data
- for each BtaCandidate cand
- bool ok true
- mES ... deltaE ... // compute the values
- ok B0Data.put( cand, mES )
- ok B0Data.put( cand, deltaE )
-
- for each BtaCandidate cand
-
- bool found B0Data.get( cand, mES )
- // get candidate mES
- found B0Data.get( cand, deltaE )
- // get candidate deltaE
Analysis groups writing data
Analyst reading data
12. Status of the new Analysis Model
- The model is complete and in production
- Physics results for the next conferences are
being produced using the new Analysis Model
13. A toolkit for fits and toy Monte Carlo
- The toolkit provides:
  - a language to describe and model complex parametric fit problems in C++
  - utilities to study the frequentist properties of a fit
- It is not intended to provide new mathematical algorithms
  - The underlying minimization engine is Minuit
- Interest in this tool from:
  - BaBar, GEANT4, LCG-PI (to be released)
14. Main functionalities
- Description of Probability Distribution Functions (PDFs)
  - the most common PDFs provided (Gaussian, Poisson, etc.)
  - random number generators for each provided PDF
  - utilities to combine and manipulate PDFs
- Manipulation of symbolic expressions
  - simplifies the definition of PDF models and fit functions
- Fitter tools
  - different Unbinned Maximum Likelihood (UML) fitters and a chi-square fitter are supported
- Toy Monte Carlo
  - utilities to generate random data samples to validate the fit results (pull distributions, fit bias estimates, etc.)
- User-defined components can easily be plugged in
15. Design choices
- The code is optimized for speed
  - Toy Monte Carlo studies of complex fits are very CPU-intensive
- This can be achieved without losing good OO design:
  - avoid virtual functions where not necessary
  - use template generic programming (see the sketch below)
  - the Boost C++ library provides powerful tools
  - metaprogramming permits type manipulation at compile time
  - users don't see these technical details in the interface
- External package dependencies are well isolated:
  - random number generator engines (ROOT, CLHEP, …)
  - Minuit wrapper (ROOT, …)
  - other minimizers may be adopted (NAG, …)
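A minimal illustration of the design point above (not the toolkit's actual code): because the PDF type is a template parameter, the call pdf( data[i] ) is bound at compile time and can be inlined inside the hot loop of a toy Monte Carlo, with no virtual dispatch involved.

    #include <cmath>

    // The PDF type is resolved at compile time: direct, inlinable calls.
    template < typename PDF >
    double sumLogLikelihood( const PDF& pdf, const double* data, int n ) {
        double logLike = 0;
        for ( int i = 0; i < n; ++i )
            logLike += std::log( pdf( data[i] ) );  // no vtable lookup
        return logLike;
    }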
16PDF interface
A PDF implements the () operator P f( x,
y, ) Users can define new PDFs
respecting the above interface
- struct Flat
- PdfFlat( double a, double b )
- min( a ), max( b )
- double operator()( double x ) const
-
- return ( x lt min x gt max ?
- 0
- 1 / ( max - min ) )
-
- double min, max
- struct Poissonian
- PdfPoissonian( double m )
- mean( m )
- double operator()( int n ) const
-
- return ( exp( - mean )
- pow( mean, n ) /
- factorial( n ) )
-
- double mean
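A hypothetical usage of the Flat PDF above: the () operator returns the normalized density, zero outside the support.

    Flat flat( 0., 2. );
    double pIn  = flat( 1. );  // 1 / ( 2 - 0 ) = 0.5
    double pOut = flat( 3. );  // 0, since 3 > max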
17Random number generators
Implements the generate method r.generate(
x, y, )
- templatelt typename Generator DefaultGengt
- struct RandomGeneratorlt Flat, Generator gt
-
- RandomGenerator( const Flat pdf )
- _min( pdf.min ), _max( pdf.max )
- void generate( double x ) const
- x GeneratorshootFlat( _min, _max )
-
- private
- const double _min, _max
- Users can define new generators with the
preferred method
- Numerical implementations are provided
- trapezoidal PDF sampling
- hit or miss technique
RANDOM_GENERATOR_SAMPLE(MyPdf, Bins, Min, Max)
RANDOM_GENERATOR_HITORMISS(MyPdf, Min, Max, fMax)
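Continuing the interface sketched above, a hypothetical usage drawing one value uniformly distributed in [-5, 5]:

    Flat pdf( -5., 5. );
    RandomGenerator< Flat > rnd( pdf );  // uses the default engine
    double x;
    rnd.generate( x );                   // x is now flat in [ -5, 5 ]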
18. Combining PDFs

A mixture of two PDFs (Argus shoulder plus Gaussian peak):

    Argus shoulder( 5.20, 5.28, -0.1 );
    Gaussian peak( 5.28, 0.05 );
    typedef Mixture< Gaussian, Argus > Mix;
    Mix pdf( peak, shoulder, 0.1 );
    RandomGenerator< Mix > rnd( pdf );
    double x;
    rnd.generate( x );

The product of two independent PDFs:

    Gaussian sigX( 5.28, 0.05 );
    Gaussian sigY( 0, 0.015 );
    typedef Independent< Gaussian, Gaussian > SigXY;
    SigXY sigXY( sigX, sigY );
    RandomGenerator< SigXY > rndXY( sigXY );
    double x, y;
    rndXY.generate( x, y );

- Transformation of variables is also supported
  - Random variables are generated in the original coordinate system, then transformed

Define the relevant language of the problem in C++!
19. Fit PDF parameters and run toy MC

    const int sig = 100;
    double mean = 0, sigma = 1;
    Gaussian pdf( mean, sigma );
    Likelihood< Gaussian > like( pdf );
    UMLParameterFitter< Likelihood< Gaussian > > fitter( like );
    fitter.addParameter( "mean", pdf.mean );
    fitter.addParameter( "sigma", pdf.sigma );

    Poissonian num( sig );  // alternative: Constant
    Gaussian pdfExp( mean, sigma );
    Experiment< Poissonian, Gaussian > experiment( num, pdfExp );

    for ( int i = 0; i < 50000; ++i ) {
      Sample< Likelihood< Gaussian >::types > sample;
      experiment.generate( sample );
      double par[ 2 ] = { mean, sigma }, err[ 2 ] = { 1, 1 }, logLike;
      logLike = fitter.fit( par, err, sample );  // fit each toy sample
    }
20. Parameter fit results (pulls)

[Pull distributions of the fitted mean and sigma.]

There is a bias (as expected): the ML estimate of the variance is σ̂² = (1/n) Σᵢ (xᵢ − μ̂)², whereas the unbiased estimator is σ̂² = (1/(n−1)) Σᵢ (xᵢ − μ̂)².
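For reference, the pull plotted here follows the standard convention (not spelled out on the slide): for each toy experiment,

    \[
      \mathrm{pull} = \frac{\hat{\theta} - \theta_{\mathrm{true}}}{\hat{\sigma}_{\hat{\theta}}}
    \]

so an unbiased estimator with correctly estimated errors gives a standard Gaussian pull distribution (mean 0, width 1); the shift seen in the sigma pull reflects the 1/n bias noted above.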
21. Combined yield and parameter fit

2D Gaussian signal over a 2D flat background; simultaneous fit of the yields and of the Gaussian mean:

    const int sig = 10, bkg = 5;
    typedef Poissonian Fluctuation;
    Fluctuation fluctuationSig( sig ), fluctuationBkg( bkg );
    typedef Independent< Gaussian, Gaussian > PdfSig;
    typedef Independent< Flat, Flat > PdfBkg;
    Gaussian g1( 0, 1 ), g2( 0, 0.5 );
    Flat f1( -5, 5 ), f2( -5, 5 );
    PdfSig pdfSig( g1, g2 );
    PdfBkg pdfBkg( f1, f2 );
    typedef Experiment< Fluctuation, PdfSig > ToySig;
    typedef Experiment< Fluctuation, PdfBkg > ToyBkg;
    ToySig toySig( fluctuationSig, pdfSig );
    ToyBkg toyBkg( fluctuationBkg, pdfBkg );
    Experiment2< ToySig, ToyBkg > toy( toySig, toyBkg );  // as in the UML yield fit slide

    Gaussian G1( 0, 1 );
    PdfSig pdfSig1( G1, g2 );
    typedef ExtendedLikelihood2< PdfSig, PdfBkg > Likelihood;
    Likelihood like( pdfSig1, pdfBkg );
    UMLYieldAndParameterFitter< Likelihood > fitter( like );
    fitter.addParameter( "mean", G1.mean );

    double pull1, pull2, pull3;
    for ( int i = 0; i < 50000; ++i ) {
      Sample< Likelihood::types > sample;
      toy.generate( sample );
      double s[] = { sig, bkg, 0 };
      double err[] = { 1, 1, 1 };
      double logLike = fitter.fit( s, err, sample );
      pull1 = ( s[ 0 ] - sig ) / err[ 0 ];
      pull2 = ( s[ 1 ] - bkg ) / err[ 1 ];
      pull3 = ( s[ 2 ] - 0 ) / err[ 2 ];
    }
22. Conclusions
- Object Oriented analysis and design has been adopted in the last decade as the software technology for physics applications
  - but many different OO solutions exist for complex problems like physics analysis
  - special care is needed to evaluate code flexibility, extensibility, interdependency, …
- The potential advantages of OO are fully exploited only with a careful evaluation of the required use cases
  - Not always obvious; experience is important!
  - but more than 10 years of development have brought significant experience and solutions
- OO is a mature technology in the HEP field.
23. Backup slides
24. How to improve the Analysis Model?
- Protect existing physics-level code against changes
  - New functionality is added by means of a non-intrusive design
  - Same interface and same modus operandi as before
  - Backward compatibility with existing physics code
- Replace Objectivity and the production of large n-tuples
  - Support a complete ROOT-IO persistency
    - except detector conditions, which still depend on Objectivity; a future migration to ROOT is foreseen
- Support interactive access (ROOT/CINT) from the same ROOT files
  - Usable as a data inspection tool
25The new Event Store structure
Mini
ROOT persistency only
Micro
Structure has been rationalized leaving the same
transient interface
26. User-defined event variables

The interface for global event variables is the same, mutatis mutandis. At the beginning of the job, declare the variables:

    UsrEventBlock AWG_Tag;
    UsrVariable<int> multiplicity( "multiplicity" );
    AWG_Tag.addVariable( multiplicity );

In the event() code:

    multiplicity = some_algo();
    AWG_Tag.put( multiplicity );
    UsrIfd<UsrEventBlock>::put( event, AWG_Tag, "AWGTAG" );

This provides the analysis groups with the functionality to easily make their own TAGs.
27. UML yield fit

    const int sig = 10, bkg = 5;
    typedef Independent< Gaussian, Gaussian > PdfSig;
    typedef Independent< Flat, Flat > PdfBkg;
    PdfSig pdfSig( Gaussian( 0, 1 ), Gaussian( 0, 0.5 ) );
    PdfBkg pdfBkg( Flat( -5, 5 ), Flat( -5, 5 ) );
    typedef ExtendedLikelihood2< PdfSig, PdfBkg > Likelihood;
    Likelihood like( pdfSig, pdfBkg );
    UMLYieldFitter< Likelihood > fitter( like );

    typedef Poissonian Fluctuation;  // alternative: Constant
    Fluctuation fluctuationSig( sig ), fluctuationBkg( bkg );
    typedef Experiment< Fluctuation, PdfSig > ToySig;
    typedef Experiment< Fluctuation, PdfBkg > ToyBkg;
    ToySig toySig( fluctuationSig, pdfSig );
    ToyBkg toyBkg( fluctuationBkg, pdfBkg );
    Experiment2< ToySig, ToyBkg > toy( toySig, toyBkg );
28Yield fit Results (Pulls)
ltsgt 10
ltbgt 5
Discrete structure because of low statistics
Poisson fluctuation
29. Symbolic function package

Symbolic expressions make the definition of PDFs easier:

    X x;  // declare the variable x

    // normalize using symbolic integration at construction time
    PdfNonParametric<X> f1( sqr( sin(x) * cos(x) ), 0, 4 * M_PI );

    // recompute the normalization at every call, since the
    // parameter tau may change from call to call
    Parameter tau( 0.123 );
    PdfParametric<X> f2( x * exp( - tau * x ), 0, 10 );
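A hypothetical usage of the objects built above, assuming they expose the standard PDF () operator of the PDF interface slide:

    double p1 = f1( 1.0 );  // normalized value of sin^2(x)cos^2(x) at x = 1
    tau = 0.5;              // change the parameter value ...
    double p2 = f2( 1.5 );  // ... f2 recomputes its normalization here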
30. Example of χ² fit

    X x;
    Parameter a( 0 ), b( 1 ), c( 0 );
    Function<X> parabola( c + x * ( b + x * a ) );
    UniformPartition partition( 100, -1.0, 1.0 );
    Chi2< Function<X> > chi2( parabola, partition );
    Chi2Fitter< Chi2< Function<X> > > fitter( chi2 );
    fitter.addParameter( "a", a.ptr() );
    fitter.addParameter( "b", b.ptr() );
    fitter.addParameter( "c", c.ptr() );
    SampleErr<double> sample( partition.bins() );
    // fill the sample...
    double par[] = { a, b, c }, err[] = { 1, 1, 1 };
    fitter.fit( par, err, sample );
31. Possible future improvements
- Upper limit extraction based on toy Monte Carlo
- Support for χ² fits with correlated errors and a covariance matrix
- Provide more standard PDFs
  - Crystal Ball, Chebyshev polynomials, …
- Managing singular PDFs
  - Dirac delta components
- Managing (un)folding
32. Conclusions
- The new analysis model addresses the limitations of the current analysis model:
  - ROOT I/O based Event Store persistency
  - backward compatibility with existing physics code
  - flexibility and extensibility of the Event Store contents, and a central role for skim production
- Iterative skimming of the events, with progressive data reduction and enrichment in specific information, provides:
  - easier off-site portability of the data
  - reduced resource needs (CPU and disk space)
- Moreover, it favors the development of common tools and common analysis strategies among groups
33. Conclusion
- We designed a new tool to model fit problems
- Using template generic programming we obtained:
  - Generality
    - Users can plug in new components (PDFs, transformations, random generators, etc.)
    - Easy to incorporate external contributions into the tool
  - Light weight
    - Most of the code is contained in header (include) files
    - Mild external dependencies
  - Ease of use
    - Very compact and expressive code
  - CPU speed
    - Virtual function calls are extremely limited
    - Most of the methods are inlined
- Interest has been expressed by:
  - the Geant4 statistical testing toolkit
  - LCG/PI (LHC Computing Grid - Physics Interfaces)
- We will focus on a release version shortly