Statistical Toolkit: Recent updates

1
Statistical Toolkit: Recent updates
  • B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon,
    P. Viarengo

CMS Software Meeting, 25 July 2005
http://www.ge.infn.it/geant4/analysis/HEPstatistics
2
Goodness-of-Fit component
B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon,
P. Viarengo (CERN, INFN Genova, IST Genova)
G.A.P. Cirrone et al., "A Goodness-of-Fit
Statistical Toolkit", IEEE Transactions on Nuclear
Science, Vol. 51, Issue 5, Oct. 2004, pages
2056-2063
Component for modeling multi-parametric fit
problems
F. Fabozzi, L. Lista (INFN Napoli)
F. Fabozzi and L. Lista, "A generic toolkit for
multivariate fitting designed with template
metaprogramming", to be published in IEEE
Transactions on Nuclear Science, 2005
3
Vision: the basics
  • Have a vision for the project
  • General purpose tool for statistical analysis
  • Toolkit approach (choice open to users)
  • Open source product

Clearly define scope, objectives
  • Rigorous software process

Software quality
Flexible, extensible, maintainable system
  • Build on a solid architecture

4
Architectural guidelines
  • The project adopts a solid architectural approach
      • to offer the functionality and the quality needed
        by the users
      • to be maintainable over a long time scale
      • to be extensible, to accommodate future
        evolutions of the requirements
  • Component-based architecture
      • to facilitate re-use and integration in diverse
        frameworks
  • Dependencies
      • no dependence on any specific analysis tool
      • the core mathematical component is independent
        of the user's representation of his/her
        analysis objects
      • the user layer bridges the core statistical
        component and the user's analysis
      • multiple implementations of the user layer (AIDA,
        ROOT, FITS, GSL etc.)

7
  • Simple user layer
  • Shields the user from the complexity of the
    underlying algorithms and design
  • Deals only with the user's analysis objects and
    the choice of comparison algorithm

8
Software process
  • Unified Software Development Process, specifically
    tailored to the project
      • practical guidance and tools from the RUP
      • both rigorous and lightweight
      • mapping onto ISO 15504
      • significant experience gained in the group from
        other projects
  • Incremental and iterative life-cycle model

9
GoF algorithms (currently implemented)
  • Algorithms for binned distributions
      • Anderson-Darling test
      • Chi-squared test
      • Fisz-Cramer-von Mises test
      • Tiku test (Cramer-von Mises test in chi-squared
        approximation)
  • Algorithms for unbinned distributions
      • Anderson-Darling test
      • Cramer-von Mises test
      • Goodman test (Kolmogorov-Smirnov test in
        chi-squared approximation)
      • Kolmogorov-Smirnov test
      • Kuiper test
      • Tiku test (Cramer-von Mises test in chi-squared
        approximation)
  • The most complete statistics software for the
    comparison of two distributions (also among
    commercial/professional statistics tools)
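
As an illustration of what an unbinned two-sample test measures, the sketch below computes the Kolmogorov-Smirnov distance, that is, the largest difference between the empirical cumulative distribution functions of two samples. It is a minimal standalone example, not the toolkit's implementation: the function name ksDistance and its interface are hypothetical, and it stops at the distance without converting it into a p-value.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the toolkit): maximum distance between the
// empirical cumulative distribution functions (ECDFs) of two unbinned samples.
double ksDistance(std::vector<double> a, std::vector<double> b) {
    assert(!a.empty() && !b.empty());
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    const double na = static_cast<double>(a.size());
    const double nb = static_cast<double>(b.size());
    std::size_t i = 0, j = 0;
    double d = 0.0;

    // Walk both sorted samples. At each distinct value x, step past every
    // observation <= x in both samples, then compare the two ECDFs at x.
    while (i < a.size() && j < b.size()) {
        const double x = std::min(a[i], b[j]);
        while (i < a.size() && a[i] <= x) ++i;
        while (j < b.size() && b[j] <= x) ++j;
        const double fa = static_cast<double>(i) / na;
        const double fb = static_cast<double>(j) / nb;
        d = std::max(d, std::fabs(fa - fb));
    }
    // Once one sample is exhausted its ECDF is 1 and the other can only
    // approach it, so the maximum has already been recorded.
    return d;
}
```

Two completely separated samples give a distance of 1, identical samples give 0, and partially overlapping samples fall in between.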

10
Recent extensions: algorithms
  • Fisz-Cramer-von Mises test
      • exact asymptotic distribution (earlier: critical
        values)
  • Anderson-Darling test
      • exact formulation (exists for unbinned
        distributions only; earlier: approximated
        formulation for both binned and unbinned
        distributions)
  • Tiku test
      • Cramer-von Mises test in a chi-squared
        approximation
  • New tests: weighted Kolmogorov-Smirnov, weighted
    Cramer-von Mises
      • various weighting functions available in
        literature
  • In preparation: Watson test
      • can be applied in case of cyclic observations
        (like Kuiper test)
  • It is already the most complete software for the
    comparison of two distributions (even among
    commercial/professional statistics tools)
      • goal: provide all 2-sample GoF algorithms
        currently existing in statistics literature
  • Publication in preparation to describe the new
    algorithms
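
For orientation, the unweighted two-sample Cramer-von Mises statistic can be sketched as follows. It sums the squared difference between the two empirical cumulative distribution functions over all pooled observations and scales the sum by NM/(N+M)^2. This is a textbook form written as a standalone example: the name cvmStatistic is hypothetical, ties are handled by letting each tied observation contribute one term (an assumption), and neither the weighting functions nor the asymptotic distribution used by the toolkit are reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the toolkit): unweighted two-sample
// Cramer-von Mises statistic
//   T = N*M / (N+M)^2 * sum over pooled observations of (F_N(x) - G_M(x))^2
double cvmStatistic(std::vector<double> a, std::vector<double> b) {
    assert(!a.empty() && !b.empty());
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    const double na = static_cast<double>(a.size());
    const double nb = static_cast<double>(b.size());
    std::size_t i = 0, j = 0;
    double sum = 0.0;

    while (i < a.size() || j < b.size()) {
        // Next distinct value in the pooled, sorted sample.
        double x;
        if (j == b.size() || (i < a.size() && a[i] <= b[j])) {
            x = a[i];
        } else {
            x = b[j];
        }
        const std::size_t before = i + j;
        while (i < a.size() && a[i] <= x) ++i;
        while (j < b.size() && b[j] <= x) ++j;

        // ECDF difference at x, counted once per observation equal to x.
        const double diff = static_cast<double>(i) / na
                          - static_cast<double>(j) / nb;
        sum += static_cast<double>(i + j - before) * diff * diff;
    }
    return sum * na * nb / ((na + nb) * (na + nb));
}
```

Unlike the Kolmogorov-Smirnov distance, which looks only at the single largest discrepancy, this statistic accumulates discrepancies over the whole range of the data.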

11
Recent extensions: user layer
  • First release: user layer for AIDA analysis
    objects
      • LCG Architecture Blueprint, Geant4 requirement
  • July 2005: added user layer for ROOT histograms
      • CMS requirement (requested by Pedro Arce)
  • Other user layer implementations foreseen
      • easy to add
      • sound architecture decouples the mathematical
        component and the user's representation of
        analysis objects
      • different requirements from various user
        communities: satisfy all of them without
        introducing dependencies on any analysis tools
  • More in Andreas' talk
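
The decoupling described above can be pictured with a small sketch. The core algorithm sees only plain bin contents, and a thin adapter per analysis-object type forms the user layer. All names here (Chi2Core, ToyHistogram, binContents, Comparator) are hypothetical stand-ins chosen for illustration and do not reflect the toolkit's actual classes. The chi-squared form used is the simple one that assumes both histograms have the same total number of entries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Core component (hypothetical): depends only on plain bin contents, not on
// any histogram class. Uses sum_k (a_k - b_k)^2 / (a_k + b_k), which assumes
// equal total counts in the two histograms; empty bin pairs are skipped.
struct Chi2Core {
    static double distance(const std::vector<double>& a,
                           const std::vector<double>& b) {
        assert(a.size() == b.size());
        double chi2 = 0.0;
        for (std::size_t k = 0; k < a.size(); ++k) {
            const double total = a[k] + b[k];
            if (total > 0.0) {
                chi2 += (a[k] - b[k]) * (a[k] - b[k]) / total;
            }
        }
        return chi2;
    }
};

// Stand-in for a user's analysis object (for example an AIDA or ROOT histogram).
struct ToyHistogram {
    std::vector<double> counts;
};

// User layer (hypothetical): one small adapter per analysis-object type.
inline std::vector<double> binContents(const ToyHistogram& h) {
    return h.counts;
}

// Bridges user objects and the core algorithm chosen as template parameter.
template <class Algorithm>
struct Comparator {
    template <class UserObject>
    double compare(const UserObject& a, const UserObject& b) const {
        return Algorithm::distance(binContents(a), binContents(b));
    }
};
```

Supporting another histogram type in this sketch means adding one more binContents overload. Neither Chi2Core nor Comparator changes, which is the property the bullets above attribute to the architecture.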

12
Release
  • Releases are publicly downloadable from the web
      • code, documentation etc.
  • For the convenience of LCG users, releases are
    also distributed with LCG AA software as
    external contributions
  • New release with algorithm and user layer
    extensions in the next few days
      • thanks to Pedro Arce for beta-testing!
      • new naming of packages for additional user layers
  • Another new release with new algorithms planned
    in autumn
      • publication on recent extensions
  • Releases include extensive user documentation
      • statistics algorithms
      • how to use the software
  • The project is systematically accompanied by
    publications in refereed journals to document the
    recognition of its scientific value

13
Power of GoF tests
  • Do we really need such a wide collection of GoF
    tests? Why?
  • Which is the most appropriate test to compare two
    distributions?
  • How good is it at recognizing truly equivalent
    distributions and at rejecting fake ones?

Which test to use?
  • No comprehensive study of the relative power of
    GoF tests exists in literature
  • novel research in statistics (not only in physics
    data analysis!)
  • Compare the various GoF tests w.r.t. a set of
    typical functional distributions
  • Provide guidance to the users based on sound
    quantitative arguments
  • Preliminary results available, publication in
    preparation
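
One way to put numbers on the power of a test is a Monte Carlo study: generate many pairs of samples from two known distributions and count how often the test rejects the hypothesis that they are the same. The sketch below does this for the Kolmogorov-Smirnov test only. It is not the authors' study: the function names, the choice of two unit-width Gaussians separated by a shift, and the use of the asymptotic 5% critical value are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Two-sample Kolmogorov-Smirnov distance between empirical CDFs.
double ksDistance(std::vector<double> a, std::vector<double> b) {
    assert(!a.empty() && !b.empty());
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    const double na = static_cast<double>(a.size());
    const double nb = static_cast<double>(b.size());
    std::size_t i = 0, j = 0;
    double d = 0.0;
    while (i < a.size() && j < b.size()) {
        const double x = std::min(a[i], b[j]);
        while (i < a.size() && a[i] <= x) ++i;
        while (j < b.size() && b[j] <= x) ++j;
        d = std::max(d, std::fabs(static_cast<double>(i) / na
                                  - static_cast<double>(j) / nb));
    }
    return d;
}

// Hypothetical helper: fraction of simulated experiments in which the KS test
// rejects equality at the 5% level, with sample A drawn from N(0,1) and
// sample B from N(shift,1), both of size n.
double estimateKsPower(double shift, std::size_t n, int trials, unsigned seed) {
    assert(n > 0 && trials > 0);
    std::mt19937 rng(seed);
    std::normal_distribution<double> distA(0.0, 1.0);
    std::normal_distribution<double> distB(shift, 1.0);

    // Asymptotic critical distance for equal sample sizes:
    //   sqrt(-ln(alpha/2) / 2) * sqrt(2 / n)
    const double alpha = 0.05;
    const double critical = std::sqrt(-0.5 * std::log(alpha / 2.0))
                          * std::sqrt(2.0 / static_cast<double>(n));

    std::vector<double> a(n), b(n);
    int rejected = 0;
    for (int t = 0; t < trials; ++t) {
        for (double& x : a) x = distA(rng);
        for (double& x : b) x = distB(rng);
        if (ksDistance(a, b) > critical) ++rejected;
    }
    return static_cast<double>(rejected) / trials;
}
```

With shift set to zero the returned fraction estimates the false rejection rate. With a non-zero shift it estimates the power. Repeating the exercise for each test and each pair of distributions gives the kind of quantitative comparison the slide calls for.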

14
Usage
  • Geant4 physics validation
      • rigorous approach: quantitative evaluation of
        Geant4 physics models with respect to established
        reference data
      • see for instance K. Amako et al., "Comparison of
        Geant4 electromagnetic physics models against the
        NIST reference data", to be published in IEEE
        Transactions on Nuclear Science
  • LCG Simulation Validation project
      • see for instance A. Ribon, "Testing Geant4 with a
        simplified calorimeter setup", talk at Geant4
        Physics Validation Workshop, Genova, July 2005,
        http://www.ge.infn.it/geant4/events/july2005
  • CMS
      • validation of new histograms w.r.t. reference
        ones in OSCAR Validation Suite
  • Geant4 regression testing
      • prototype developed at Gran Sasso Lab; inclusion
        in Geant4 regular testing in preparation
        (discussed at Geant4 Physics Validation Workshop
        last week)
  • Usage also in space science, medicine etc.

15
Outlook
  • Treatment of errors
  • 1-sample GoF tests (comparison w.r.t. a function)
  • Comparison of two/multi-dimensional distributions
  • In all cases, the goal is to provide an extensive set
    of the algorithms published so far in statistics
    literature, with a critical evaluation of their
    relative strengths and applicability
  • Your feedback is very much appreciated
  • Please let us know your requirements
      • we do our best to satisfy them, compatibly with
        available resources
  • The project is open to developers interested in
    statistical methods
      • it is advanced R&D in statistics, not only a
        useful tool for physics analysis
  • New release coming soon
  • New publications in preparation

16
http://www.ge.infn.it/geant4/analysis/HEPstatistics/
Will be moved to http://www.ge.infn.it/statisticaltoolkit
17
Acknowledgments
  • Work supported and partially funded by the
    European Space Agency (ESA) under Contract
    No. 16339/02/NL/FM
  • Fred James (CERN) and Louis Lyons (Oxford)
      • many useful suggestions, discussions,
        encouragement...
  • Thanks to our statistician (Paolo Viarengo, IST
    National Institute for Cancer Research) for his
    help with the mathematical calculations and the
    guidance through statistics literature

18
#include <cstdlib>
#include <memory>
#include <iostream>
#include <iomanip>

#include "TH1.h"

#include "StatisticsTesting/StatisticsComparator.h"
#include "ComparisonResult.h"
#include "Chi2ComparisonAlgorithm.h"

using namespace StatisticsTesting;

int main(int, char**)
{
  // Create and fill two 1-dimensional histograms with random numbers.
  // TH1D is assumed here: ROOT's TH1 base class cannot be constructed
  // directly, and ROOT histogram constructors take
  // (name, title, nbins, xlow, xup).
  TH1D hA("A", "A", 10, 0.0, 1.0);
  TH1D hB("B", "B", 10, 0.0, 1.0);

  for (int i = 0; i < 1000; i++) {
    hA.Fill( drand48() );
    hB.Fill( drand48() );
  }

19
Comparing two ROOT histograms
  // Do the (binned) statistical test between the two histograms.
  StatisticsComparator< Chi2ComparisonAlgorithm > comparator;
  ComparisonResult result = comparator.compare(hA, hB);

  // Print the test statistic, the degrees of freedom and the p-value.
  std::cout << "Result of the Chi2 Statistical Test " << std::endl
            << "distance " << result.distance() << std::endl
            << "NDF " << result.ndf() << std::endl
            << "p-value " << result.quality() << std::endl;

  return 0;
}
20
Make and run
source setup.sh
make
.objs/exampleChi2
Result of the Chi2 Statistical Test
distance 4.82038
ndf (number of degree of freedom) 9
p-value 0.849677
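
The printed p-value can be cross-checked independently of the toolkit. For integer degrees of freedom the upper-tail probability of the chi-squared distribution has a closed finite-series form, sketched below. The function name chi2UpperTail is hypothetical, and the toolkit may well compute this quantity differently. Evaluated at a distance of 4.82038 with 9 degrees of freedom, it should come out close to the 0.85 shown above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of the toolkit): probability that a
// chi-squared variable with integer ndf >= 1 exceeds x, using the finite
// series valid for integer degrees of freedom.
double chi2UpperTail(double x, int ndf) {
    assert(x >= 0.0 && ndf >= 1);
    const double pi = 3.14159265358979323846;

    if (ndf % 2 == 0) {
        // Even ndf: exp(-x/2) * sum_{r=0}^{ndf/2-1} (x/2)^r / r!
        double term = 1.0;
        double sum = 1.0;
        for (int r = 1; r <= ndf / 2 - 1; ++r) {
            term *= 0.5 * x / r;
            sum += term;
        }
        return std::exp(-0.5 * x) * sum;
    }

    // Odd ndf, with chi = sqrt(x) and Z the standard normal density:
    //   erfc(chi / sqrt(2))
    //   + 2 * Z(chi) * sum_{r=1}^{(ndf-1)/2} chi^(2r-1) / (1*3*...*(2r-1))
    const double chi = std::sqrt(x);
    const double z = std::exp(-0.5 * x) / std::sqrt(2.0 * pi);
    double term = chi;  // r = 1 term
    double sum = 0.0;
    for (int r = 1; r <= (ndf - 1) / 2; ++r) {
        sum += term;
        term *= x / (2.0 * r + 1.0);  // next term: times chi^2 / (2r + 1)
    }
    return std::erfc(chi / std::sqrt(2.0)) + 2.0 * z * sum;
}
```

A large p-value such as this one means the chi-squared test finds no evidence that the two histograms differ, as expected when both are filled from the same uniform random generator.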