Title: TMVA Toolkit for Multivariate Data Analysis with ROOT

1. TMVA: Toolkit for Multivariate Data Analysis with ROOT
Helge Voss, MPI-K Heidelberg, on behalf of
Andreas Höcker, Fredrik Tegenfeldt, Joerg Stelzer,
and contributors A. Christov, S. Henrot-Versillé, M. Jachowski,
A. Krasznahorkay Jr., Y. Mahalalel, X. Prudent, P. Speckmayer,
M. Wolter, A. Zemla

- Supply an environment to easily
  - apply different sophisticated data selection algorithms
  - have them all trained, tested and evaluated
  - find the best one for your selection problem

http://tmva.sourceforge.net/
arXiv: physics/0703039
2. Motivation / Outline

- ROOT is the analysis framework used by most (HEP) physicists
- Idea: rather than just implementing new MVA techniques and making them
  somehow available in ROOT (as e.g. TMultiLayerPerceptron does), have one
  common platform/interface for all MVA classifiers
  - easy to use and compare different MVA classifiers
  - train/test on the same data sample and evaluate consistently
- Outline
- introduction
- the MVA classifiers available in TMVA
- demonstration with toy examples
- summary
3. Multivariate Event Classification

- All multivariate classifiers condense (correlated) multi-variable input
  information into a single scalar output variable: a mapping R^n → R with
  y(Bkg) → 0 and y(Signal) → 1
- One variable to base your decision on
4. What is in TMVA

- TMVA currently includes
  - Rectangular cut optimisation
  - Projective and multi-dimensional likelihood estimators
  - Fisher discriminant and H-Matrix (χ² estimator)
  - Artificial Neural Networks (3 different implementations)
  - Boosted/bagged Decision Trees
  - Rule Fitting
  - Support Vector Machines
- All classifiers are highly customizable
- Common pre-processing of the input: de-correlation, principal component
  analysis
- Support of arbitrary pre-selections and individual event weights
- The TMVA package provides training, testing and evaluation of the
  classifiers
- Each classifier provides a ranking of the input variables
- Classifiers produce weight files that are read by the Reader class for MVA
  application
- Integrated in ROOT (since release 5.11/03) and very easy to use!
5. Preprocessing the Input Variables: Decorrelation

- Commonly realised for all methods in TMVA (centrally in the DataSet class)
- Removal of linear correlations by rotating the variables
  - using the square root of the correlation matrix
  - using Principal Component Analysis

[Figure: input variable distributions: original, SQRT-decorrelated,
PCA-decorrelated]

- Note that this de-correlation is complete only if
  - the input variables are Gaussian
  - the correlations are linear
- In practice the gain from de-correlation is often rather modest, or even
  harmful
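For reference, the square-root variant can be sketched in formulas (standard
linear algebra in our own notation, not TMVA's):

    x' = C^{-1/2} x, \qquad C = S D S^{T} \;\Rightarrow\; C^{-1/2} = S D^{-1/2} S^{T}

where C is the covariance (or correlation) matrix of the inputs, S the matrix
of its eigenvectors and D the diagonal matrix of its eigenvalues; the
components of x' are then linearly uncorrelated.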
6. Cut Optimisation

- Simplest method: cut in a rectangular volume
  - scan the signal efficiency from 0 to 1 and maximise the background
    rejection
  - from this scan, the optimal working point in terms of S, B numbers can be
    derived
- Technical problem: how to perform the optimisation
  - TMVA uses random sampling, Simulated Annealing or a Genetic Algorithm
  - speed improvement in the volume search: training events are sorted in
    Binary Search Trees
- Works in the normal variable space or in the de-correlated variable space
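As an illustration, booking the genetic-algorithm variant in the Factory
could look as follows; this is a sketch, and the exact option names should be
checked against the TMVA Users Guide:

    // book rectangular cut optimisation, fitted with a Genetic Algorithm
    factory->BookMethod( TMVA::Types::kCuts, "CutsGA",
                         "!V:FitMethod=GA:EffMethod=EffSel" );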
7. Projective Likelihood Estimator (PDE Approach)

- Combine the probabilities from the different variables for an event to be
  signal- or background-like
- Optimal if there are no correlations and the PDFs are correct (known)
  - usually this is not true → development of different methods
- Technical problem: how to implement the reference PDFs
  - 3 ways: counting, function fitting, parametric fitting (splines, kernel
    estimators, ...)
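The combination itself is the usual likelihood ratio: for event i with
variable values x_k(i),

    y_L(i) = \frac{L_S(i)}{L_S(i) + L_B(i)}, \qquad
    L_{S(B)}(i) = \prod_{k=1}^{N_{var}} p_{S(B),k}\big(x_k(i)\big)

where p_{S,k} and p_{B,k} are the one-dimensional reference PDFs of variable
k for signal and background.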
8. Multidimensional Likelihood Estimator

- Generalisation of the 1D PDE approach to N_var dimensions
- Optimal method in theory, if the true N-dim PDF were known
- Practical challenge: derive the N-dim PDF from the training sample

[Figure: signal (S) and background (B) training events in the (x1, x2)
plane, with the counting volume around a test event]

- TMVA implementation: range search (PDERS)
  - count the number of signal and background events in the vicinity of a
    test event → fixed-size or adaptive volume (the latter gives kNN-type
    classifiers)
  - volumes can be rectangular or spherical
  - multi-D kernels (Gaussian, triangular, ...) can be used to weight events
    within a volume
  - the range search is sped up by sorting the training events in Binary
    Trees (Carli-Koblitz, NIM A501, 576 (2003))
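In the simplest counting version (no kernel weighting, and equal-size signal
and background training samples assumed for readability), the response for a
test event at point x is

    y_{PDERS}(x) = \frac{n_S\big(V(x)\big)}{n_S\big(V(x)\big) + n_B\big(V(x)\big)}

with n_S and n_B the numbers of signal and background training events inside
the volume V(x) around the test event.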
9. Fisher Discriminant (and H-Matrix)

- Well-known, simple and elegant classifier
- Determine the linear variable transformation in which
  - linear correlations are removed
  - the mean values of signal and background are pushed as far apart as
    possible
- The computation of the Fisher response is very simple: a linear combination
  of the event variables with the Fisher coefficients
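Explicitly, in the standard Fisher notation:

    y_{Fi}(i) = F_0 + \sum_{k=1}^{N_{var}} F_k\, x_k(i), \qquad
    F \propto W^{-1}\,(\bar{x}_S - \bar{x}_B)

where W is the sum of the signal and background covariance matrices,
\bar{x}_{S(B)} are the class mean vectors, and the F_k are the Fisher
coefficients.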
10. Artificial Neural Network (ANN)

- Obtain a non-linear classifier response by feeding linear combinations of
  the input variables into nodes with a non-linear activation function
- The nodes (or neurons) are arranged in layers
  → feed-forward Multilayer Perceptrons (3 different implementations in TMVA)
- Training: adjust the weights using events of known type such that signal
  and background are best separated
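For a single hidden layer and sigmoid activation A(t) = 1/(1 + e^{-t}), the
response of such a network can be written as (generic MLP, our notation):

    y_{ANN}(x) = A\Big( \sum_j w_j^{(2)}\, A\big( \sum_i w_{ij}^{(1)} x_i + \theta_j \big) + \theta \Big)

with the weights w and biases \theta adjusted during training.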
11. Decision Trees

- Sequential application of cuts splits the data into nodes, and the final
  nodes (leaves) classify an event as signal or background
- Training: growing a decision tree
  - start with the root node
  - split the training sample according to a cut on the best variable at this
    node
  - splitting criterion: e.g. maximum of the Gini index,
    purity × (1 − purity)
  - continue splitting until the minimum number of events or the maximum
    purity is reached
  - classify the leaf nodes according to the majority of events (or give them
    a weight); unknown test events are classified accordingly

[Figure: decision tree before and after pruning]

- Bottom-up pruning
  - remove statistically insignificant nodes → avoid overtraining
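With the node purity p = S/(S+B), the Gini index is p(1 − p), and the cut at
a node is chosen to maximise the gain of the split into left and right
daughter nodes (our notation):

    \Delta G = G_{parent} - \frac{n_L}{n}\, G_L - \frac{n_R}{n}\, G_R, \qquad G = p\,(1-p)

where n_L and n_R are the numbers of events sent to the two daughters.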
12. Boosted Decision Trees

- Decision trees have been well known for a long time, but were hardly used
  in HEP (although they are very similar to simple cuts)
- Disadvantage: instability, since small changes in the training sample can
  give large changes in the tree structure
- Boosted Decision Trees (1996): combine several decision trees into a forest
  - the classifier output is the (weighted) majority vote of the individual
    trees
  - the trees are derived from the same training sample with different event
    weights
    - e.g. AdaBoost: wrongly classified training events are given a larger
      weight
    - bagging (re-sampling with replacement) → random weights
- Remark: bagging/boosting → create a basis of classifiers
  - the final classifier is a linear combination of the base classifiers
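For AdaBoost, if tree m has misclassification rate err_m on the weighted
training sample, the events and trees are weighted as follows (textbook
AdaBoost; TMVA's conventions may differ in detail):

    \alpha_m = \ln\frac{1 - err_m}{err_m}, \qquad
    w_i \to w_i\, e^{\alpha_m} \ \text{(misclassified events)}, \qquad
    y_{BDT}(x) = \sum_m \alpha_m\, h_m(x)

with h_m(x) = \pm 1 the response of the individual tree m.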
13. Rule Fitting (Predictive Learning via Rule Ensembles)

- Follows RuleFit from Friedman-Popescu (Friedman-Popescu, Tech. Rep., Stat.
  Dept., Stanford U., 2003)
- The classifier is a linear combination of simple base classifiers, called
  rules, which here are sequences of cuts
- The procedure is:
  - create the rule ensemble → generated from a set of decision trees
  - fit the coefficients → gradient-directed regularization (Friedman et al.)
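Schematically, following the RuleFit paper (linear terms in the inputs enter
alongside the rules):

    y_{RF}(x) = a_0 + \sum_m a_m\, r_m(x) + \sum_k b_k\, x_k

where each rule r_m(x) \in \{0, 1\} flags whether the event passes the
corresponding cut sequence, and the coefficients a_m, b_k are fitted with the
gradient-directed regularization mentioned above.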
14. Support Vector Machines

- Find the hyperplane that best separates signal from background
  - best separation: maximum distance between the closest events (support
    vectors) and the hyperplane
  - gives a linear decision boundary
- Non-linear cases:
  - transform the variables into a higher-dimensional feature space where a
    linear boundary (hyperplane) can separate the data
  - the transformation is done implicitly using kernel functions, which
    effectively introduce a metric for the distance measure that mimics the
    transformation
  - choose a kernel and fit the hyperplane
- Available kernels: Gaussian, polynomial, sigmoid

[Figure: separating hyperplane and support vectors in the (x1, x2) plane,
before and after the kernel transformation]
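The resulting decision function has the standard SVM form (generic notation;
the Gaussian kernel is shown as an example):

    y_{SVM}(x) = \sum_i \alpha_i\, y_i\, K(x_i, x) + b, \qquad
    K(x, x') = \exp\big( -|x - x'|^2 / 2\sigma^2 \big)

where the sum runs over the support vectors x_i with class labels
y_i = \pm 1.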
15. A Complete Example Analysis

void TMVAnalysis( )
{
   // output file for the training results
   TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );

   // create the Factory that steers training, testing and evaluation
   TMVA::Factory *factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

   // register the signal and background training trees
   TFile *input = TFile::Open( "tmva_example.root" );
   TTree *signal     = (TTree*)input->Get( "TreeS" );
   TTree *background = (TTree*)input->Get( "TreeB" );
   factory->AddSignalTree    ( signal,     1. );
   factory->AddBackgroundTree( background, 1. );

   // define the input variables (simple expressions are allowed)
   factory->AddVariable( "var1+var2", 'F' );
   factory->AddVariable( "var1-var2", 'F' );
   factory->AddVariable( "var3", 'F' );
   factory->AddVariable( "var4", 'F' );

   factory->PrepareTrainingAndTestTree( "",
      "NSigTrain=3000:NBkgTrain=3000:SplitMode=Random:!V" );

   // book the classifiers to be compared
   factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
      "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
   factory->BookMethod( TMVA::Types::kMLP, "MLP",
      "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

   // train, test and evaluate all booked methods
   factory->TrainAllMethods();
   factory->TestAllMethods();
   factory->EvaluateAllMethods();

   outputFile->Close();
   delete factory;
}
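A note on usage: saved as TMVAnalysis.C, the macro can be run from the ROOT
prompt via "root -l TMVAnalysis.C". Training fills TMVA.root with the
evaluation histograms and writes the classifier weight files into the
weights/ directory that the application example below reads from.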
16. Example Application

void TMVApplication( )
{
   // create the Reader and declare the input variables
   TMVA::Reader *reader = new TMVA::Reader( "!Color" );

   Float_t var1, var2, var3, var4;
   reader->AddVariable( "var1+var2", &var1 );
   reader->AddVariable( "var1-var2", &var2 );
   reader->AddVariable( "var3", &var3 );
   reader->AddVariable( "var4", &var4 );

   // book the trained classifier from its weight file
   reader->BookMVA( "MLP method", "weights/MVAnalysis_MLP.weights.txt" );

   TFile *input = TFile::Open( "tmva_example.root" );
   TTree* theTree = (TTree*)input->Get( "TreeS" );

   Float_t userVar1, userVar2;
   theTree->SetBranchAddress( "var1", &userVar1 );
   theTree->SetBranchAddress( "var2", &userVar2 );
   theTree->SetBranchAddress( "var3", &var3 );
   theTree->SetBranchAddress( "var4", &var4 );

   // loop over the events and evaluate the classifier response
   for (Long64_t ievt = 3000; ievt < theTree->GetEntries(); ievt++) {
      theTree->GetEntry( ievt );
      var1 = userVar1 + userVar2;
      var2 = userVar1 - userVar2;
      cout << reader->EvaluateMVA( "MLP method" ) << endl;
   }
   delete reader;
}
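Two remarks: the loop starts at event 3000, skipping the events used for
training, and in a real analysis one would not print the MLP response but cut
on it at the working point chosen from the background rejection versus signal
efficiency curve.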
17. A Purely Academic Toy Example

- Use a data set with 4 linearly correlated, Gaussian distributed variables

  ---------------------------------------
  Rank   Variable   Separation
  ---------------------------------------
     1   var3       3.834e-01
     2   var2       3.062e-01
     3   var1       1.097e-01
     4   var0       5.818e-02
  ---------------------------------------
18. Validating the Classifier Training

- Check the training itself: projective likelihood PDFs, MLP training,
  BDTs, ...

[Figure: training validation plots; for the BDTs the average number of nodes
before/after pruning is 4193 / 968]
19. Classifier Output

- TMVA output distributions

[Figure: classifier output distributions for Likelihood, PDERS, Fisher,
Neural Network, Boosted Decision Trees and Rule Fitting; annotations mark
effects due to correlations and with correlations removed]
20. Evaluation Output

- TMVA output distributions for Fisher, Likelihood, BDT and MLP
- For this case the Fisher discriminant provides the theoretically best
  possible method → same performance as the de-correlated Likelihood
- Cuts and Likelihood without de-correlation are inferior
- Note: just about all realistic use cases are much more difficult than this
  one
21. Evaluation Output (taken from the TMVA printout)

Evaluation results ranked by best signal efficiency and purity (area); a
better classifier appears higher in the table:

  -----------------------------------------------------------------------
  MVA          Signal efficiency at bkg eff. (error)    Sepa-    Signifi-
  Methods      @B=0.01    @B=0.10    @B=0.30    Area    ration   cance
  -----------------------------------------------------------------------
  Fisher       0.268(03)  0.653(03)  0.873(02)  0.882   0.444    1.189
  MLP          0.266(03)  0.656(03)  0.873(02)  0.882   0.444    1.260
  LikelihoodD  0.259(03)  0.649(03)  0.871(02)  0.880   0.441    1.251
  PDERS        0.223(03)  0.628(03)  0.861(02)  0.870   0.417    1.192
  RuleFit      0.196(03)  0.607(03)  0.845(02)  0.859   0.390    1.092
  HMatrix      0.058(01)  0.622(03)  0.868(02)  0.855   0.410    1.093
  BDT          0.154(02)  0.594(04)  0.838(03)  0.852   0.380    1.099
  CutsGA       0.109(02)  1.000(00)  0.717(03)  0.784   0.000    0.000
  Likelihood   0.086(02)  0.387(03)  0.677(03)  0.757   0.199    0.682
  -----------------------------------------------------------------------

Testing efficiency compared to training efficiency (over-training check):

  -----------------------------------------------------------------------
  MVA          Signal efficiency from test sample (from training sample)
  Methods      @B=0.01          @B=0.10          @B=0.30
  -----------------------------------------------------------------------
  Fisher       0.268 (0.275)    0.653 (0.658)    0.873 (0.873)
  MLP          0.266 (0.278)    0.656 (0.658)    0.873 (0.873)
  LikelihoodD  0.259 (0.273)    0.649 (0.657)    0.871 (0.872)
  PDERS        0.223 (0.389)    0.628 (0.691)    0.861 (0.881)
  RuleFit      0.196 (0.198)    0.607 (0.616)    0.845 (0.848)
  HMatrix      0.058 (0.060)    0.622 (0.623)    0.868 (0.868)
  BDT          0.154 (0.268)    0.594 (0.736)    0.838 (0.911)
  CutsGA       0.109 (0.123)    1.000 (0.424)    0.717 (0.715)
  Likelihood   0.086 (0.092)    0.387 (0.379)    0.677 (0.677)
  -----------------------------------------------------------------------

A large difference between test and training efficiency (e.g. for PDERS and
BDT) signals over-training.
22. More Toys: Linear, Cross and Circular Correlations

- Illustrate the behaviour of linear and nonlinear classifiers

[Figure: toy data set with circular correlations (same for signal and
background)]
23. Illustration: Events Weighted by MVA Response

- Example: how do the classifiers deal with the correlation patterns?

[Figure: events weighted by classifier response; linear classifiers (Fisher,
Likelihood, de-correlated Likelihood) and non-linear classifiers (Decision
Trees, PDERS)]
24. Final Classifier Performance

- Background rejection versus signal efficiency curve

[Figure: circular example]
25. More Toys: Schachbrett (Chess Board)

- Performance achieved without parameter adjustments: PDERS and BDT are best
  "out of the box"
- After some parameter tuning, SVM and ANN (MLP) also perform well

[Figure: event distribution, theoretical maximum, and events weighted by the
SVM response]
26. TMVA Users Guide

We (finally) have a Users Guide!
Available from tmva.sf.net

TMVA Users Guide: 78 pp., incl. code examples
arXiv: physics/0703039
27. Summary

- TMVA unifies highly customizable, well-performing multivariate
  classification algorithms in a single user-friendly framework
- This ensures objective classifier comparisons and simplifies their use
- TMVA is available from tmva.sf.net and in ROOT (> 5.11/03)
- A typical TMVA analysis requires user interaction with a Factory (for
  classifier training) and a Reader (for classifier application)
  - a set of ROOT macros displays the evaluation results
- We will continue to improve flexibility and add new classifiers
  - Bayesian classifiers
  - Committee method → combination of different MVA techniques
  - C-code output for trained classifiers (for selected methods)
28. More Toys: Linear, Cross and Circular Correlations

- Illustrate the behaviour of linear and nonlinear classifiers

[Figure: toy data sets; linear correlations (same for signal and background),
linear correlations (opposite for signal and background), circular
correlations (same for signal and background)]
29. Illustration: Events Weighted by MVA Response

- How well do the classifiers resolve the various correlation patterns?

[Figure: events weighted by classifier response for the three toy data sets;
linear correlations (same for signal and background), linear correlations
(opposite for signal and background), circular correlations (same for signal
and background)]
30. Final Classifier Performance

- Background rejection versus signal efficiency curves

[Figure: linear, cross and circular examples]
31. Stability with Respect to Irrelevant Variables

- Toy example with 2 discriminating and 4 non-discriminating variables

[Figure: performance when using only the two discriminating variables in the
classifiers versus using all variables]
32. Using TMVA in Training and Application

- Scripts can be ROOT macros, C++ executables or Python scripts (via PyROOT),
  or any other high-level language that interfaces with ROOT
33. Introduction: Event Classification

- Different techniques use different ways to exploit (all) the features
  → compare and choose
  - rectangular cuts? a linear boundary? a nonlinear one?
- How to place the decision boundary?
  → let the machine learn it from training events