1
Perspective on Future Data Analysis in HENP
  • Computing in High Energy Physics 2003
  • La Jolla 24 March
  • René Brun
  • CERN

2
Data Analysis ??
  • Data Analysis has traditionally been associated
    with the final stages of data processing, i.e.
    Physics Analysis.
  • In this talk, I will cover a more general aspect
    of Data Analysis (in the true sense).
  • How to interact with data at all stages of data
    processing (batch or interactive modes)?
  • Can we imagine an experiment-independent way to
    achieve this?

3
Evolution
  • To understand the possible directions, we must
    understand some messages from the past, the solid
    recipes!
  • One important message is Make it simple.
  • Heavy experiment frameworks are often perceived
    as a serious obstacle and push users to use more
    basic but universal frameworks.

4
Once upon a time (seventies)
  • With the first electronic (as opposed to bubble
    chamber) experiments, data analysis was
    experiment specific, an activity after the data
    taking.
  • The only common software was the histogramming
    package (e.g. HBOOK), the fitting package (e.g.
    Minuit), some plotting packages and independent
    routines in CERNLIB (linear algebra and small
    utilities).
  • Data structures: Fortran common blocks

5
Early Eighties
  • With the growing complexity of the experiments
    and the corresponding software, we saw the
    development of Data Structure management systems
    (HYDRA, ZBOOK → ZEBRA, BOS).
  • These systems are able to write/read complex bank
    collections. ZEBRA had a self-describing bank
    format with built-in support for bank evolution.
  • Most data was processed in batch, but many
    prototypes of interactive systems started to
    appear (HTV, GEP, then PAW..)

6
PAW
  • Designed in 1985. Stable since 1993.
  • Row-Wise-Ntuples: OK for small data sets,
    interactive histogramming with cuts.
  • Column-Wise-Ntuples: a major step illustrating
    the advantage of structured data sets.
  • PAW: a success
  • not so much because of its technical merits
  • but because it was perceived as a widely
    available tool
  • stability over many years an important element

7
1993 → 2000 (1)
  • Move from Fortran to OO
  • Took far more time than expected
  • new language(s)
  • new programming techniques
  • basic infrastructure not available to compete
    with existing libraries and tools
  • conflicts between projects
  • ad-hoc software in experiments

8
1993 → 2000 (2)
  • False hopes with OODBMS (or too early?)
  • OODBMS → Objectivity
  • OO models designed for Objy
  • batch oriented
  • Interactive use via conversion to PAW ntuples
  • a central database does not fit well with GRID
    concepts
  • Licensing problems and more

9
Data Analysis Models

10
From the desktop to the GRID
[Diagram: Desktop, Online/Offline Farms and Local/remote Storage,
connected through the GRID]
New data analysis tools must be able to use remote CPUs, storage
elements and networks in parallel, in a way that is transparent for
a user at a desktop.
11
My laptop in 200X
Using a naïve extrapolation of Moore's law for a
state-of-the-art laptop:

  Year   CPU/GHz   RAM/GB   Disk/GB
  2003     2.4      0.5        60
  2005     5        1         150
  2007    10        2         300
  2009    20        4         600
  2011    40        8        1000

Nice! But less than 1/1000 of what I need.
12
Batch-mode Local analysis
  • Conventional model: the user has full control of
    the event loop.
  • The program produces histograms, ntuples or
    trees.
  • The selection is done via the user's private code
    (see the sketch below).
  • Histograms are then added (by a tool or in the
    interactive session).
  • Ntuples/trees are combined into a chain and
    analyzed interactively.
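As an illustration, a minimal ROOT-style sketch of this model; the
tree name "T", the branches px/py and the file names are assumptions:

  #include "TChain.h"
  #include "TFile.h"
  #include "TH1F.h"
  #include "TMath.h"

  void analyze() {
     TChain chain("T");                // "T" is an assumed tree name
     chain.Add("ev1.root");            // files combined into one chain
     chain.Add("ev2.root");
     Float_t px = 0, py = 0;
     chain.SetBranchAddress("px", &px);
     chain.SetBranchAddress("py", &py);
     TH1F *h = new TH1F("hpt", "transverse momentum", 100, 0, 10);
     Long64_t nentries = chain.GetEntries();
     for (Long64_t i = 0; i < nentries; i++) {
        chain.GetEntry(i);             // the user drives the event loop
        if (px > 0)                    // selection via private user code
           h->Fill(TMath::Sqrt(px*px + py*py));
     }
     TFile out("hists.root", "RECREATE");
     h->Write();                       // histograms ready to be added later
  }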

13
Batch Analysis on the GRID
  • From a user's viewpoint, a simple extrapolation
    of the local batch analysis.
  • In practice, it must involve all the GRID
    machinery: authentication, resource brokers,
    sandboxes.
  • Viewing the current status (histograms) must be
    possible.
  • Advantage: stateless, can process large data
    volumes.

Advanced systems already exist (see talk by
Andreas Wagner)
14
AliEnFS Distributed Analysis
15
Interactive Local Analysis
  • On a public cluster, or the user's laptop.
  • Tools like PAW or its successors are used for
    visualization and ntuple/tree analysis.

16
GRID Interactive Analysis: Case 1
  • Data transfer to the user's laptop (see the
    sketch below)
  • Optional Run/File catalog
  • Optional GRID software

[Diagram: analysis scripts are interpreted or compiled on the local
machine; trees are fetched from a remote file server (e.g. rootd),
with an optional run/file catalog to locate them.]
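From the user's session this can look as simple as the hedged sketch
below; the server name and file path are hypothetical, and
TFile::Open picks the remote transport from the URL:

  #include "TFile.h"
  #include "TTree.h"

  void remote() {
     // The file is served by a rootd daemon; data travels to the laptop
     TFile *f = TFile::Open("root://server.cern.ch/data/run1.root");
     if (!f || f->IsZombie()) return;
     TTree *t = (TTree*) f->Get("T");  // "T" is an assumed tree name
     t->Draw("E");                     // the analysis script runs locally
  }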
17
GRID Interactive Analysis: Case 2
  • Remote data processing
  • Optional Run/File catalog
  • Optional GRID software

[Diagram: commands and scripts are sent to a remote data analyzer
(e.g. proofd); analysis scripts are interpreted or compiled on the
remote machine, which reads the trees; histograms come back to the
user. An optional run/file catalog locates the data.]
18
GRID Interactive Analysis: Case 3
  • Remote data processing
  • Run/File catalog
  • Full GRID software

[Diagram: commands and scripts go to the remote master(s), where
analysis scripts are interpreted or compiled; a remote data analyzer
(e.g. proofd) distributes the work to many slaves, each reading its
own trees; histograms and trees are returned to the user. A run/file
catalog locates the data.]
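A heavily hedged sketch of the intent, assuming a PROOF-style setup;
the master host, file URLs and selector are hypothetical, and the
exact dispatch API depends on the PROOF version:

  #include "TROOT.h"
  #include "TChain.h"

  void runproof() {
     gROOT->Proof("proofmaster.cern.ch"); // connect to the remote master
     TChain chain("T");                   // "T" is an assumed tree name
     chain.Add("root://server.cern.ch/data/run*.root");
     // The selector code runs on the slaves, close to the data;
     // only histograms (and other small objects) travel back:
     chain.Process("MySelector.C");
  }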
19
Data Analysis Projects

20
Tools for data analysis
  • PAW: started in 1985, no major developments since
    1994.
  • HippoDraw: started in 1991.
  • ROOT: started in 1995, continuous development.
  • JAS: started in 1995, continuous development.
  • OpenScientist ?
  • LHC++/Anaphe: 1996 → 2002.
  • PI: a new project in the LHC Computing Grid, just
    starting now.

21
PAW
  • The reference for 18 years (since 1985), used by
    most collaborations
  • ported to many platforms, small (3 to 15 MB)
  • many criticisms during the development phase
  • applauded now that it is stable
  • maintained by Olivier Couet (ROOT team)

Usage still growing
0.1 FTE
22
HippoDraw
  • Author: Paul Kunz
  • showed the way in 1991/1992
  • Usage: Paul + a 50-year-old CERN physicist
  • Seems to be in constant prototyping phases
  • Good to have this type of prototype to illustrate
    new possible interactive techniques.

1 FTE ?
23
ROOT
  • In constant development since 1995
  • Used by many collaborations and outside HEP

More than 10000 distributions of binary tar files
in February
6 ± 2 FTE
24
JAS
  • Started in 1995 (Tony Johnson).
  • Current version: 2. JAS3 presented at this CHEP.
  • For the Java world.
  • How to cooperate with C++ frameworks?

3 FTE ?
25
In AIDA you believe ?
  • The Abstract Interfaces for Data Analysis project
    was started by the defunct LHC++ project and
    continued by Anaphe (now stopped).
  • Supported by JAS and OpenScientist
  • Goal: define abstract interfaces to facilitate
    cooperation between developers and migration of
    users to new products
  • Versions 1, 2 and 3 (version 4 for PI ?)

26
In AIDA I don't believe
  • Abstract Interfaces are fundamental in modern
    systems to make a system more modular and
    adaptable.
  • But, common abstract interfaces are not a good
    idea.
  • They force a lowest common denominator
  • They require international agreements
  • Users will be confused (what is common and what
    is not)
  • you become a slave of the deal, which works
    against creativity
  • It is more important to agree on object
    interchange formats and database access
  • You can easily change a few hundred lines of
    code. You cannot copy Terabytes of data

27
The LCG PI project
  • Fresh from the oven
  • One of the projects recently launched by the
    Applications Area of the LCG project.
  • Ideas
  • promote the use of AIDA (version 4)
  • Python for scripting
  • interface to ROOT CINT
  • in gestation
  • see Vincenzo's talk

28
User & Developer views
  • Users' requests
  • very rarely requests for grandiose new features
  • zillions of tiny new features
  • zillions of tiny improvements
  • want consolidation & stability
  • Developers' view
  • want to implement the sexy features
  • target modularity (more complex installation?)
  • maintenance & helpdesk: a problem or a chance?

29
Lessons from the past
  • It takes time to develop a general tool
  • more than 7 years for PAW, ROOT and JAS
  • User feedback is essential in the development
    phase
  • People like stable systems
  • Efficient access to data sets is a prerequisite
  • 24h × 7 days × 12 months × N years online support
    is vital

30
Develop/Debug/maintain
In an interactive system with N basic functions, the number of
combinations may be unlimited (not N×N, but N!).
10% of the time to develop the first 90% of the code;
90% of the time to develop the remaining 10%.
31
Time to develop
[Chart: time to develop, LCG]
32
Technical aspects

33
Desktop
  • Plug-in Manager and Dictionary
  • GUI
  • Graphics 2-d, 3-d
  • Event Displays
  • Histogramming & Fitting
  • Statistics tools
  • Scripting
  • Data/Program organization

34
Plug-in Manager
[Diagram: experiment and user shared libraries plug into the basic
services (GUI, Math, ...) and general utility libraries via the
plug-in manager; the I/O manager and the interpreter are driven by
the Object Dictionary.]
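For flavour, a sketch of such on-demand loading, modeled on ROOT's
TPluginManager; the handler base class and URL are illustrative:

  #include "TROOT.h"
  #include "TPluginManager.h"

  void plugin() {
     // Ask the plug-in manager for a handler able to open this URL;
     // the matching shared library is loaded only on demand:
     TPluginHandler *h = gROOT->GetPluginManager()->
        FindHandler("TFile", "root://host/file.root");
     if (h && h->LoadPlugin() != -1)
        h->ExecPlugin(2, "root://host/file.root", "READ");
  }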
35
The Object Dictionary
[Diagram: the Object Dictionary (data dictionary + functions
dictionary) serves the I/O system, inspectors & browsers,
interpreted scripts, the GUI, the command line and compiled code.]
36
Scripting for data analysis
  • After the KUIP and Tcl/Tk era
  • A command-line interface is required
  • Scripts
  • interpreted and/or byte-code interpreted
  • automatic compilation and linking
  • call compiled or interpreted code
  • compiled code must be able to call interpreted
    code (GUI and configuration scripts)
  • Big bonus if the compiled and interpreted
    languages are the same (see the sketch below)
  • Scripting and object dictionary in symbiosis
  • Remote execution of scripts (in parallel)
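This is the pay-off of having the same language compiled and
interpreted: with ROOT/CINT the identical source runs either way.
File and function names are illustrative:

  // hello.C
  #include "TH1F.h"

  void hello() {
     TH1F *h = new TH1F("h", "gaussian", 100, -4, 4);
     h->FillRandom("gaus", 10000);  // interpreted code calls compiled code
     h->Draw();
  }

  root [0] .x hello.C     // interpreted by CINT
  root [1] .x hello.C+    // automatically compiled and linked (ACLiC)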

37
Languages & scripting
[Diagram: interactive and batch users drive C++ compiled code, C++
interpreted scripts, Python/Perl scripts, and a GUI based on
signals/slots.]
38
Comparing scripts
A very interesting project from Subir Sarkar: cooperation between
Java and a C++ framework, based on the Object Dictionary.
http://sarkar.home.cern.ch/sarkar/jroot/main.html
39
GUI(s)
  • Constant evolution
  • Microsoft: MFC, Win32 API
  • The Signals/Slots principle is very nice. It
    helps in designing large and modular GUI systems
    (see the sketch below).
  • Interpreters help GUI builders/editors

1983: SMS on VAX/VMS, VT100 terminals
1985: GKS, Tektronix
1989: MOTIF, Unix workstations
1997: Java/Swing, the Web
2001: Qt, Linux laptops
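A minimal sketch of the signal/slot principle using ROOT's TQObject
interface; the handler class is hypothetical, and the dictionary must
know the class for the slot to be dispatched:

  #include <cstdio>
  #include "TGButton.h"
  #include "RQ_OBJECT.h"

  class MyHandler {
     RQ_OBJECT("MyHandler")   // signal/slot support via the dictionary
  public:
     void OnClick() { printf("clicked\n"); }   // the slot
  };

  void wire(TGTextButton *button, MyHandler *handler) {
     // Connect the button's Clicked() signal to the handler's slot;
     // sender and receiver stay decoupled at compile time:
     button->Connect("Clicked()", "MyHandler", handler, "OnClick()");
  }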
40
2-D graphics
  • An area where constant improvements are required.
  • Better plotters, better fonts,...
  • Better drivers: PostScript, SVG, XML, etc.

Publication quality is a must. This requirement
alone explains why many proposed data analysis
systems do not penetrate experiments
41
3-D graphics
  • Data structures: Objects ↔ scene
  • Scene renderers: OpenGL, Open Inventor
  • The most difficult part is detector geometry
    graphics
  • z-buffer algorithms: OK for fast real-time fancy
    graphics, not OK for good debugging (the shape
    outline is important on top of z-buffer views).
  • Vector PostScript (or PDF/SVG) must be available
    (not PostScript made of OpenGL triangles)
  • see talks about GraXML and Persint

42
Example with PERSINT/ATLAS
43
Event Displays
  • The most successful event displays so far were
    2-D projections (see Aleph, Atlas/Atlantis)
  • A lot of work with 3-d graphics in many
    experiments (see talks about Iguana)
  • Client-server model
  • Access to framework objects, browsers
  • One could have expected a bigger role for Java!
  • Mismatch with experiment C++ frameworks?
  • Possible directions
  • standardize object exchange (SOAP/XML/Root I/O)
  • standardize low level graphics exchange (HEPREP)

44
Histograming
  • This should be a stable area
  • Thread Safety
  • Binning on parallel systems
  • Merging on batch/parallel systems (see the sketch
    below)
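A sketch of the merging step, assuming each parallel job wrote a
histogram named "hpt" into its own file:

  #include "TFile.h"
  #include "TH1.h"

  void merge() {
     TH1 *sum = 0;
     const char *files[] = { "job1.root", "job2.root", "job3.root" };
     for (int i = 0; i < 3; i++) {
        TFile f(files[i]);
        TH1 *h = (TH1*) f.Get("hpt");
        if (!h) continue;
        if (!sum) {                   // first file: clone as the total
           sum = (TH1*) h->Clone();
           sum->SetDirectory(0);      // detach from the closing file
        } else {
           sum->Add(h);               // bin-by-bin addition
        }
     }
     TFile out("merged.root", "RECREATE");
     if (sum) sum->Write();
  }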

45
Fitting
  • Minuit: the standard (see the sketch below)
  • Fumili was nice and fast
  • An upgrade of Minuit with new algorithms,
    including Fumili, is in the pipeline
  • several GUIs on top
  • RooFit: a very powerful package developed by
    BaBar
  • see talk on RooFit by D. Kirkby
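A sketch of the everyday fitting path, where minimization is
delegated to Minuit behind the histogram interface; the model and
starting values are illustrative:

  #include "TH1F.h"
  #include "TF1.h"

  void fit(TH1F *h) {
     // Gaussian peak on a linear background, fitted in [0, 10]:
     TF1 *model = new TF1("model", "gaus(0) + pol1(3)", 0, 10);
     model->SetParameters(100, 5, 0.5, 1, 0); // starting values for Minuit
     h->Fit("model", "R");                    // "R": respect the fit range
  }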

46
Statistics Math
  • Many tools and algorithms exist
  • GSL ?
  • GNU R-Math project
  • TerraFerma Initiative
  • Subject of discussions at many workshops
  • confidence limits workshops
  • ACAT FermiLab and Moscow
  • Durham
  • Need to be federated in a coherent framework

47
Lost with Complexity?
  • In large collaborations, users are often lost
    when confronted with the complexity of big
    simulation and reconstruction programs
  • What is the data organization?
  • How are algorithms organized? The hierarchy?
  • The problem is amplified by the use of
    dynamically configurable systems, dynamic linking
    and polymorphism
  • Browsing data and algorithms is a must

48
Folders / white boards
Folders help in understanding complex hierarchical structures.
Language independent. Could be GRID-aware.
49
Why Folders ?
This diagram shows a system without folders. The
objects have pointers to each other to access
each other's data. Pointers are an efficient
way to share data between classes. However, a
direct pointer creates a direct coupling between
classes. This design can become a very tangled
web of dependencies in a system with a large
number of classes.
50
Why Folders ?
In the diagram below, a reference to the data is
in the folder and the consumers refer to the
folder rather than each other to access the data.
A naming and search service provides an
alternative. It loosely couples the classes and
greatly enhances I/O operations. In this way,
folders separate the data from the algorithms and
greatly improve the modularity of an application
by minimizing the class dependencies.
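A minimal sketch of the pattern with ROOT's TFolder, assuming
hypothetical folder and histogram names; the consumer finds the data
by name instead of holding a pointer to the producer:

  #include "TROOT.h"
  #include "TFolder.h"
  #include "TH1F.h"

  void publish() {
     // Producer side: register the object in a named folder
     TFolder *top =
        gROOT->GetRootFolder()->AddFolder("MyExp", "experiment data");
     TH1F *h = new TH1F("hpt", "pt", 100, 0, 10);
     top->Add(h);

     // Consumer side: look it up by name, with no compile-time
     // coupling to the producer class
     TH1F *found = (TH1F*) gROOT->FindObjectAny("hpt");
     if (found) found->Draw();
  }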
51
Tasks/Algorithms
In the same way that Folders can be used to
organize the data, one can use Tasks to organize
a hierarchy of algorithms. Tasks can be
organized into a hierarchical tree of tasks and
displayed in the browser. A Task is an abstraction with standard
functions (Begin, Execute, Finish). Each Task-derived class may
contain other Tasks that can be executed recursively, such that a
complex program can be dynamically built and executed by invoking
the services of the top-level task or one of its subtasks.
Tasks help in understanding the organization and sequence of
execution of large programs (see the sketch below).
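A sketch of such a hierarchy with ROOT's TTask; the concrete task
names are hypothetical:

  #include "TTask.h"

  class RecoTask : public TTask {
  public:
     RecoTask(const char *name) : TTask(name, "reconstruction step") {}
     void Exec(Option_t *option = "") {
        // the algorithm for this step runs here
     }
  };

  void build() {
     RecoTask *top      = new RecoTask("Reco");
     RecoTask *tracking = new RecoTask("Tracking");
     RecoTask *calo     = new RecoTask("Calorimeter");
     top->Add(tracking);      // subtasks form a browsable tree
     top->Add(calo);
     top->ExecuteTask();      // runs the whole hierarchy recursively
  }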
52
Directions

53
Exchange/Compatibility
  • If we assume that several data analysis tools
    will be around (HEP-made or commercial), it is
    important to exchange objects between these tools
    (drag & drop, network or files).
  • SOAP and XML have emerged as standards to
    exchange small volumes of objects.
  • Several technical solutions are possible. The
    winning solutions will be the ones able to
    automate the process by exploiting all the
    information in the object dictionary.
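As one example of dictionary-driven exchange, a hedged sketch that
ships a histogram to another tool over a socket; host and port are
hypothetical:

  #include "TSocket.h"
  #include "TMessage.h"
  #include "TH1F.h"

  void ship(TH1F *h) {
     TSocket sock("viewer.cern.ch", 9090);  // the receiving tool
     if (!sock.IsValid()) return;
     TMessage msg(kMESS_OBJECT);
     msg.WriteObject(h);    // serialized using the object dictionary
     sock.Send(msg);        // the receiver can reconstruct the object
  }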

54
Follow Microsoft ?
  • SOAP/XML is one of the key components of .NET
    (and also of the MS competition).
  • MS is preparing a new OS (Longhorn ?) for 2005.
    This new OS will introduce a distributed object
    database.
  • This may have a serious impact on the GRID
    software and on our tools.

55
Access Patterns
  • Understand data access patterns
  • to objects in one file
  • to subsets of objects in many collections
  • relations with run/file catalogs
  • persistent reference pointers
  • Optimize design of containers for
  • processing in batch
  • interactive parallel processing
  • cache management and proxies

56
Query processor
  • Extend/develop powerful query systems that
    minimize the amount of programming and
  • optimize I/O (read only what is strictly
    necessary)
  • are able to process data in parallel, hiding the
    complexity of parallelism from the end user
  • can be executed again and again, possibly
    learning from the previous passes
  • are robust against network failures, CTRL-C,
    programming errors
  • can be run in GUI, interpreted or compiled mode
    (see the sketch below)
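A small sketch of the "minimal programming, minimal I/O" idea with a
tree query; file, tree and branch names are assumptions:

  #include "TFile.h"
  #include "TTree.h"

  void query() {
     TFile f("events.root");
     TTree *t = (TTree*) f.Get("T");
     // The query names only "pt" and "ntrack", so only those two
     // branches are read from disk:
     t->Draw("pt", "ntrack > 10");
     // In an explicit event loop the same saving is requested by hand:
     t->SetBranchStatus("*", 0);    // disable everything ...
     t->SetBranchStatus("pt", 1);   // ... then enable what is needed
  }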

57
Event Collections
  • Develop/Extend objects able to keep a summary of
    previous runs
  • Event collections with their iterators, well
    matched to the query processor (event/run number,
    UUID, tree entry serial number).
  • Special objects: masks, bit-slice indexes to
    speed up searches in large collections (see the
    sketch below).
  • The system must be able to run with and without
    the run/file catalog
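A sketch with ROOT's TEventList as one concrete form of event
collection; names are assumptions:

  #include "TFile.h"
  #include "TTree.h"
  #include "TEventList.h"
  #include "TDirectory.h"

  void collect() {
     TFile f("events.root");
     TTree *t = (TTree*) f.Get("T");
     // First pass: record the entry numbers that pass the selection
     t->Draw(">>elist", "ntrack > 10");
     TEventList *el = (TEventList*) gDirectory->Get("elist");
     // Later passes iterate only over the stored entries
     t->SetEventList(el);
     t->Draw("pt");
  }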

58
Exploiting meta information
  • The normal data analysis mode requires access to
    the user classes.
  • However, experience shows that users also expect
    (as was the case for PAW) to be able to process
    their data sets without the classes/shared
    libraries used to generate them, while still
    supporting automatic schema evolution.
  • The class meta information is saved in the data
    set. Simple queries involving only data class
    attributes must be possible without the code.
  • This requirement has consequences on the way the
    object dictionary is used.
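As a concrete illustration, a hedged sketch of a session inspecting a
file without the experiment libraries; file, tree and member names
are illustrative:

  #include "TFile.h"
  #include "TTree.h"

  void inspect() {
     TFile f("events.root");   // no experiment library loaded
     f.ls();                   // contents listed from saved meta information
     TTree *t = (TTree*) f.Get("T");
     t->Print();               // branch structure from the stored dictionary
     t->Draw("fTracks.fPx");   // query on a plain data member, no user code
  }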

59
Dependencies & Simplicity
  • Minimize component dependencies to facilitate
    software distribution/portability
  • The winning tools will be the ones that
  • are easy to port to new systems (OS/compilers)
  • depend only on other systems also easy to port
  • are used in real conditions to guarantee feedback
  • are able to evolve very quickly to adapt to new
    situations and new requirements.

60
Integration with GRID software
  • The data analysis software is an integral part of
    the GRID software. It drives the process, not the
    inverse.
  • This implies a close cooperation between teams
    working on tools for data analysis and teams
    working on the GRID plumbing (resource brokers,
    authentication, etc.) and on high-level GRID
    tools like Condor.
  • The Batch line and the Interactive line must be
    developed in a complementary way.

61
Trends & Summary
  • More and more GRID-oriented data analysis:
    parallelism on the GRID (batch/interactive),
    access to catalogs, resource brokers, process
    migration, progress monitors, proxies/caches,
    virtual data sets
  • More and more experiment-independent software
  • Efficient access to large and structured event
    collections
  • Interaction with user & experiment classes
  • Histogram & Ntuple viewers, Data Presenters
62
Acknowledgements
  • For a long time, data analysis has been the fifth
    wheel of the car. Many thanks to the organizing
    committee for giving me the opportunity to
    present my views on the subject.

Enjoy this conference