Statistical Toolkit Recent updates

1
Statistical Toolkit Recent updates

B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon,
P. Viarengo

CMS Software Meeting 25 July 2005
http://www.ge.infn.it/geant4/analysis/HEPstatistics
2
Goodness-of-Fit component
B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon,
P. Viarengo (CERN, INFN Genova, IST Genova)
G.A.P. Cirrone et al., A Goodness-of-Fit
Statistical Toolkit, IEEE Transactions on Nuclear
Science, Vol. 51, Issue 5, Oct. 2004, Pages
2056 - 2063
Component for modeling multi-parametric fit
problems
F. Fabozzi, L. Lista (INFN Napoli)
F. Fabozzi and L. Lista, A generic toolkit for
multivariate fitting designed with template
metaprogramming, to be published in IEEE
Transactions on Nuclear Science, 2005
3
Vision: the basics

Have a vision for the project
General purpose tool for statistical analysis
Toolkit approach (choice open to users)
Open source product

Clearly define scope, objectives

Rigorous software process

Software quality
Flexible, extensible, maintainable system

Build on a solid architecture

4
Architectural guidelines

The project adopts a solid architectural approach
to offer the functionality and the quality needed
by the users
to be maintainable over a long time scale
to be extensible, to accommodate future
evolutions of the requirements
Component-based architecture
to facilitate re-use and integration in diverse
frameworks
Dependencies
no dependence on any specific analysis tool
the core mathematical component is independent
of the user's representation of his/her
analysis objects
the user layer bridges the core statistical
component and the user's analysis
multiple implementations of the user layer (AIDA,
ROOT, FITS, GSL etc.)
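
The decoupling described on this slide can be illustrated with a minimal sketch. It is not the toolkit's actual design or API: `BinnedData`, `MaxBinDifference`, `ToyHistogram`, `toCore` and `Comparator` are hypothetical names invented for illustration. The core algorithm sees only plain bin contents, and a small user-layer function is the single place that knows about the analysis tool's histogram type.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Core representation: plain numbers, no dependence on any analysis tool.
struct BinnedData {
    std::vector<double> contents;
};

// One interchangeable core algorithm (hypothetical): largest per-bin difference.
struct MaxBinDifference {
    static double distance(const BinnedData& a, const BinnedData& b) {
        assert(a.contents.size() == b.contents.size());
        double d = 0.0;
        for (std::size_t i = 0; i < a.contents.size(); ++i) {
            d = std::max(d, std::fabs(a.contents[i] - b.contents[i]));
        }
        return d;
    }
};

// Stand-in for an analysis tool's histogram class (hypothetical).
struct ToyHistogram {
    std::vector<double> bins;
    std::size_t numberOfBins() const { return bins.size(); }
    double binContent(std::size_t i) const { return bins[i]; }
};

// User layer: the only code that knows about ToyHistogram.
inline BinnedData toCore(const ToyHistogram& h) {
    BinnedData data;
    for (std::size_t i = 0; i < h.numberOfBins(); ++i) {
        data.contents.push_back(h.binContent(i));
    }
    return data;
}

// The comparator is parameterised on the algorithm; the user object is
// converted through whichever toCore() overload matches its type.
template <class Algorithm>
struct Comparator {
    template <class UserObject>
    double compare(const UserObject& a, const UserObject& b) const {
        return Algorithm::distance(toCore(a), toCore(b));
    }
};
```

Supporting another analysis tool in this sketch means adding one more `toCore` overload; the algorithm and the comparator stay untouched.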

5
(No Transcript)
6
(No Transcript)
7

Simple user layer
Shields the user from the complexity of the
underlying algorithms and design
The user only deals with his/her analysis objects
and the choice of comparison algorithm

8
Software process

Unified Software Development Process, specifically
tailored to the project
practical guidance and tools from the RUP
both rigorous and lightweight
mapping onto ISO 15504
significant experience gained in the group from
other projects
Incremental and iterative life-cycle model

9
GoF algorithms (currently implemented)

Algorithms for binned distributions
Anderson-Darling test
Chi-squared test
Fisz-Cramer-von Mises test
Tiku test (Cramer-von Mises test in chi-squared
approximation)
Algorithms for unbinned distributions
Anderson-Darling test
Cramer-von Mises test
Goodman test (Kolmogorov-Smirnov test in
chi-squared approximation)
Kolmogorov-Smirnov test
Kuiper test
Tiku test (Cramer-von Mises test in chi-squared
approximation)
The most complete statistics software for the
comparison of two distributions (also among
commercial/professional statistics tools)
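
As an illustration of what an unbinned two-sample test measures, the following is a minimal sketch of the Kolmogorov-Smirnov distance, i.e. the largest absolute difference between the two empirical cumulative distribution functions. The function name `ksDistance` is an assumption made for this example and is not taken from the toolkit; the conversion of the distance into a p-value is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-sample Kolmogorov-Smirnov distance: sup |F_a(x) - F_b(x)|,
// where F_a and F_b are the empirical CDFs of the two samples.
double ksDistance(std::vector<double> a, std::vector<double> b) {
    assert(!a.empty() && !b.empty());
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    std::size_t i = 0, j = 0;
    double d = 0.0;
    while (i < a.size() && j < b.size()) {
        // Step both ECDFs past the next observed value (ties included).
        const double x = std::min(a[i], b[j]);
        while (i < a.size() && a[i] <= x) ++i;
        while (j < b.size() && b[j] <= x) ++j;
        const double fa = static_cast<double>(i) / a.size();
        const double fb = static_cast<double>(j) / b.size();
        d = std::max(d, std::fabs(fa - fb));
    }
    // Once one sample is exhausted its ECDF is 1 and the gap can only shrink.
    return d;
}
```

Two fully separated samples give a distance of 1, and two identical samples give 0.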

10
Recent extensions: algorithms

Fisz-Cramer-von Mises test
exact asymptotic distribution (earlier: critical
values)
Anderson-Darling test
exact formulation (exists for unbinned
distributions only; earlier: approximated
formulation for both binned and unbinned
distributions)
Tiku test
Cramer-von Mises test in a chi-squared
approximation
New tests: weighted Kolmogorov-Smirnov, weighted
Cramer-von Mises
various weighting functions available in
literature
In preparation: Watson test
can be applied in case of cyclic observations
(like Kuiper test)
It is already the most complete software for the
comparison of two distributions (even among
commercial/professional statistics tools)
goal: provide all 2-sample GoF algorithms
currently existing in statistics literature
Publication in preparation to describe the new
algorithms
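
The remark about cyclic observations can be made concrete with a sketch of the two-sample Kuiper statistic, V = D+ + D-, which adds the largest positive and the largest negative deviation between the two empirical CDFs instead of taking only the larger of the two. This sum does not depend on where the origin of a cyclic variable is placed. The name `kuiperDistance` is hypothetical; this is an illustration of the statistic, not the toolkit's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Two-sample Kuiper statistic V = D+ + D-, with
// D+ = sup (F_a - F_b) and D- = sup (F_b - F_a).
double kuiperDistance(std::vector<double> a, std::vector<double> b) {
    assert(!a.empty() && !b.empty());
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    std::size_t i = 0, j = 0;
    double dPlus = 0.0, dMinus = 0.0;
    while (i < a.size() && j < b.size()) {
        // Advance both empirical CDFs past the next observed value.
        const double x = std::min(a[i], b[j]);
        while (i < a.size() && a[i] <= x) ++i;
        while (j < b.size() && b[j] <= x) ++j;
        const double fa = static_cast<double>(i) / a.size();
        const double fb = static_cast<double>(j) / b.size();
        dPlus = std::max(dPlus, fa - fb);
        dMinus = std::max(dMinus, fb - fa);
    }
    return dPlus + dMinus;
}
```

For the samples {1, 4} and {2, 3} the deviations are +0.5 and -0.5, so V = 1, whereas the Kolmogorov-Smirnov distance for the same data is only 0.5.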

11
Recent extensions: user layer

First release: user layer for AIDA analysis
objects
LCG Architecture Blueprint, Geant4 requirement
July 2005: added user layer for ROOT histograms
CMS requirement (requested by Pedro Arce)
Other user layer implementations foreseen
easy to add
sound architecture decouples the mathematical
component and the user's representation of
analysis objects
different requirements from various user
communities: satisfy all of them without
introducing dependencies on any analysis tools
More in Andreas' talk

12
Release

Releases are publicly downloadable from the web
code, documentation etc.
For the convenience of LCG users, releases are
also distributed with LCG AA software as
external contributions
New release with algorithm and user layer
extension in the next days
thanks to Pedro Arce for beta-testing!
new naming of packages for additional user layers
Another new release with new algorithms planned
in autumn
publication on recent extensions
Releases include extensive user documentation
statistics algorithms
how to use the software
The project is systematically accompanied by
publications in refereed journals to document the
recognition of its scientific value

13
Power of GoF tests

Do we really need such a wide collection of GoF
tests? Why?
Which is the most appropriate test to compare two
distributions?
How good is it at recognizing truly equivalent
distributions and rejecting fake ones?

Which test to use?

No comprehensive study of the relative power of
GoF tests exists in literature
novel research in statistics (not only in physics
data analysis!)
Compare the various GoF tests w.r.t. a set of
typical functional distributions
Provide guidance to the users based on sound
quantitative arguments
Preliminary results available, publication in
preparation
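
A power comparison of this kind is typically done by Monte Carlo: draw many pairs of samples from two known parent distributions and count how often the test rejects the hypothesis that they are the same. The sketch below shows the idea for the Kolmogorov-Smirnov test only. It is an illustration under stated assumptions, not the study's actual procedure: the names `ksDistance` and `estimatePower` are hypothetical, both samples have equal size n, and the rejection threshold is the asymptotic 5% critical value 1.358 * sqrt(2/n).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Two-sample Kolmogorov-Smirnov distance (largest gap between the ECDFs).
double ksDistance(std::vector<double> a, std::vector<double> b) {
    assert(!a.empty() && !b.empty());
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::size_t i = 0, j = 0;
    double d = 0.0;
    while (i < a.size() && j < b.size()) {
        const double x = std::min(a[i], b[j]);
        while (i < a.size() && a[i] <= x) ++i;
        while (j < b.size() && b[j] <= x) ++j;
        d = std::max(d, std::fabs(static_cast<double>(i) / a.size() -
                                  static_cast<double>(j) / b.size()));
    }
    return d;
}

// Fraction of pseudo-experiments in which the KS test rejects "same parent
// distribution" at the 5% level (asymptotic critical value, equal sizes n).
template <class DistA, class DistB>
double estimatePower(DistA distA, DistB distB,
                     std::size_t n, int trials, unsigned seed) {
    assert(n > 0 && trials > 0);
    std::mt19937 gen(seed);
    const double critical = 1.358 * std::sqrt(2.0 / static_cast<double>(n));
    int rejected = 0;
    for (int t = 0; t < trials; ++t) {
        std::vector<double> a(n), b(n);
        for (double& x : a) x = distA(gen);
        for (double& x : b) x = distB(gen);
        if (ksDistance(std::move(a), std::move(b)) > critical) ++rejected;
    }
    return static_cast<double>(rejected) / trials;
}
```

Calling `estimatePower` with two identical distributions estimates the false-rejection rate, which should stay near or below the chosen 5% level; calling it with two different distributions estimates the power against that particular alternative. Repeating the exercise for each test and each pair of distribution shapes gives the kind of quantitative comparison the slide refers to.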

14
Usage

Geant4 physics validation
rigorous approach: quantitative evaluation of
Geant4 physics models with respect to established
reference data
see for instance K. Amako et al., Comparison of
Geant4 electromagnetic physics models against the
NIST reference data, to be published in IEEE
Transactions on Nuclear Science
LCG Simulation Validation project
see for instance A. Ribon, Testing Geant4 with a
simplified calorimeter setup, talk at Geant4
Physics Validation Workshop, Genova, July 2005,
http://www.ge.infn.it/geant4/events/july2005
CMS
validation of new histograms w.r.t. reference
ones in OSCAR Validation Suite
Geant4 regression testing
prototype developed at Gran Sasso Lab, inclusion
in Geant4 regular testing in preparation
(discussed at Geant4 Physics validation Workshop
last week)
Usage also in space science, medicine etc.

15
Outlook

Treatment of errors
1-sample GoF tests (comparison w.r.t. a function)
Comparison of two/multi-dimensional distributions
In all cases goal to provide an extensive set of
algorithms so far published in statistics
literature, with a critical evaluation of their
relative strengths and applicability
Your feedback is very much appreciated
Please let us know your requirements
we do our best to satisfy them, compatible with
available resources
The project is open to developers interested in
statistical methods
it is advanced R&D in statistics, not only a
useful tool for physics analysis
New release coming soon
New publications in preparation

16
http://www.ge.infn.it/geant4/analysis/HEPstatistics/
Will be moved to http://www.ge.infn.it/statisticaltoolkit
17
Acknowledgments

Work supported and partially funded by the
European Space Agency (ESA) under Contract
No. 16339/02/NL/FM
Fred James (CERN) and Louis Lyons (Oxford):
many useful suggestions, discussions,
encouragement...
Thanks to our statistician (Paolo Viarengo, IST
National Institute for Cancer Research) for his
help with the mathematical calculations and the
guidance through statistics literature

18
#include <cstdlib>   // drand48
#include <memory>
#include <iostream>
#include <iomanip>
#include "TH1.h"
#include "StatisticsTesting/StatisticsComparator.h"
#include "ComparisonResult.h"
#include "Chi2ComparisonAlgorithm.h"

using namespace StatisticsTesting;

int main(int, char**)
{
  // Create and fill two 1-dimensional histograms with random numbers.
  // TH1D(name, title, nbins, xlow, xup): ROOT's concrete 1-D histogram class.
  TH1D hA("A", "A", 10, 0.0, 1.0);
  TH1D hB("B", "B", 10, 0.0, 1.0);

  for (int i = 0; i < 1000; ++i) {
    hA.Fill( drand48() );
    hB.Fill( drand48() );
  }

19
Comparing two root histograms
  // Do the (binned) statistical test between the two histograms.
  StatisticsComparator< Chi2ComparisonAlgorithm > comparator;
  ComparisonResult result = comparator.compare(hA, hB);

  // Print the test statistic, the degrees of freedom and the p-value.
  std::cout << "Result of the Chi2 Statistical Test " << std::endl
            << "distance " << result.distance() << std::endl
            << "NDF " << result.ndf() << std::endl
            << "p-value " << result.quality() << std::endl;

  return 0;
}


20
Make and run
source setup.sh
make
.objs/exampleChi2
Result of the Chi2 Statistical Test
distance 4.82038
ndf (number of degree of freedom) 9
p-value 0.849677
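
To show where a "distance" and an "ndf" of this kind can come from, here is a minimal sketch of a chi-squared comparison of two binned distributions. It assumes, as in the example where both histograms hold 1000 entries, that the two histograms have the same total number of entries, and it uses the textbook form: the sum over non-empty bins of (a_i - b_i)^2 / (a_i + b_i), with the number of degrees of freedom equal to the number of non-empty bins minus one. Whether the toolkit's `Chi2ComparisonAlgorithm` uses exactly this formula is not stated in the slides; `Chi2Result` and `chi2Compare` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Chi2Result {
    double distance;  // chi-squared test statistic
    int ndf;          // number of degrees of freedom
};

// Chi-squared comparison of two histograms given as bin contents.
// Assumes both histograms have the same total number of entries.
Chi2Result chi2Compare(const std::vector<double>& a,
                       const std::vector<double>& b) {
    assert(a.size() == b.size());
    double chi2 = 0.0;
    int usedBins = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double sum = a[i] + b[i];
        if (sum <= 0.0) continue;  // bins empty in both histograms carry no information
        const double diff = a[i] - b[i];
        chi2 += diff * diff / sum;
        ++usedBins;
    }
    // One degree of freedom is lost because the totals are constrained to be equal.
    return Chi2Result{chi2, usedBins - 1};
}
```

With 10 non-empty bins this gives ndf = 9, which matches the value printed in the example output. The p-value then follows from the chi-squared distribution with that many degrees of freedom; that step is not shown here.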
