Title: Computing in High Energy Physics 2003
1 Perspective on Future Data Analysis in HENP
- Computing in High Energy Physics 2003
- La Jolla, 24 March
- René Brun
- CERN
2 Data Analysis ??
- Data Analysis has traditionally been associated with the final stages of data processing, i.e. Physics Analysis.
- In this talk, I will cover a more general aspect of Data Analysis (in the true sense).
- How to interact with data at all stages of data processing (batch or interactive modes)?
- Can we imagine an experiment-independent way to achieve this?
3 Evolution
- To understand the possible directions, we must understand some messages from the past, the solid recipes!
- One important message is: make it simple.
- Heavy experiment frameworks are often perceived as a serious obstacle and push users towards more basic but universal frameworks.
4 Once upon a time (seventies)
- With the first electronic (as opposed to bubble chamber) experiments, data analysis was experiment specific, an activity after the data taking.
- The only common software was the histogramming package (e.g. Hbook), the fitting package (e.g. Minuit), some plotting packages, and independent routines in cernlib (linear algebra and small utilities).
- Data structures: Fortran common blocks
5 Early Eighties
- With the growing complexity of the experiments and of the corresponding software, we see the development of data-structure management systems (HYDRA, ZBOOK → ZEBRA, BOS).
- These systems are able to write/read complex bank collections. ZEBRA had a self-describing bank format with built-in support for bank evolution.
- Most data is processed in batch, but many prototypes of interactive systems start to appear (HTV, GEP, then PAW...).
6 PAW
- Designed in 1985. Stable since 1993.
- Row-wise ntuples: OK for small data sets, interactive histogramming with cuts.
- Column-wise ntuples: a major step illustrating the advantage of structured data sets.
- PAW a success:
  - not so much because of its technical merits
  - but perceived as a tool widely available
  - stability over many years an important element
7 1993 → 2000 (1)
- Move from Fortran to OO
- Took far more time than expected:
  - new language(s)
  - new programming techniques
  - basic infrastructure not available to compete with existing libraries and tools
  - conflicts between projects
  - ad-hoc software in experiments
8 1993 → 2000 (2)
- False hopes with OODBMS (or too early?)
- OODBMS → Objectivity
  - OO models designed for Objy
  - batch oriented
  - interactive use via conversion to PAW ntuples
  - a central database does not fit well with GRID concepts
  - licensing problems and more
9 Data Analysis Models
10 From the desktop to the GRID
[Diagram: desktop ↔ GRID, connecting local/remote storage and online/offline farms]
New data analysis tools must be able to use remote CPUs, storage elements and networks in parallel, in a way that is transparent to a user at a desktop.
11 My laptop in 200X
Using a naïve extrapolation of Moore's law for a state-of-the-art laptop:
Year   CPU/GHz   RAM/GB   Disk/GB
2003   2.4       0.5      60
2005   5         1        150
2007   10        2        300
2009   20        4        600
2011   40        8        1000
Nice! But less than 1/1000 of what I need.
12 Batch-mode Local Analysis
- Conventional model: the user has full control over the event loop.
- The program produces histograms, ntuples or trees.
- The selection is via the user's private code.
- Histograms are then added (with a tool or in the interactive session).
- Ntuples/trees are combined into a chain and analyzed interactively, as sketched below.
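A minimal sketch of the last two steps in ROOT style; the tree name "T", the file names and the branch names are invented for illustration:

  // Chain the per-job output files into one logical tree,
  // then histogram a variable with a cut.
  #include "TChain.h"

  void chainAnalysis() {
     TChain chain("T");              // "T" is the name of the tree in each file
     chain.Add("job1.root");         // output of batch job 1
     chain.Add("job2.root");         // output of batch job 2
     chain.Draw("px", "pt > 1.0");   // histogram px for entries passing the cut
  }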
13 Batch Analysis on the GRID
- From a user viewpoint, a simple extrapolation of the local batch analysis.
- In practice, it must involve all the GRID machinery: authentication, resource brokers, sandboxes.
- Viewing the current status (histograms) must be possible.
- Advantage: stateless, can process large data volumes.
Advanced systems already exist (see talk by Andreas Wagner).
14 AliEnFS Distributed Analysis
15 Interactive Local Analysis
- On a public cluster, or the user's laptop.
- Tools like PAW or a successor are used for visualization and ntuple/tree analysis.
16 GRID Interactive Analysis: Case 1
- Data transfer to the user's laptop
- Optional run/file catalog
- Optional GRID software
[Diagram: trees on a remote file server (e.g. rootd) are transferred to the laptop, optionally via a run/file catalog; analysis scripts are interpreted or compiled on the local machine]
17 GRID Interactive Analysis: Case 2
- Remote data processing
- Optional run/file catalog
- Optional GRID software
[Diagram: commands and scripts are sent to a remote data analyzer (e.g. proofd), optionally via a run/file catalog; analysis scripts are interpreted or compiled on the remote machine; trees and histograms come back]
18 GRID Interactive Analysis: Case 3
- Remote data processing
- Run/file catalog
- Full GRID software
[Diagram: commands and scripts go through a run/file catalog to remote master(s), where analysis scripts are interpreted or compiled; the masters distribute the work to slaves, each processing its own trees; histograms and trees are returned through the master]
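In case 3 the natural unit of work is a selector object that the master ships to the slaves. A skeleton in the style of ROOT's TSelector (class name and method bodies are illustrative, not the definitive interface):

  #include "TSelector.h"

  // Begin runs once, Process runs for every entry (on each slave in the
  // parallel case), Terminate presents the merged results.
  class MySelector : public TSelector {
  public:
     void   Begin(TTree *tree);        // book histograms
     Bool_t Process(Long64_t entry);   // read the entry, apply cuts, fill
     void   Terminate();               // draw or save the output
  };

  // Usage looks the same locally and through a remote master:
  //   chain.Process("MySelector.C");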
19 Data Analysis Projects
20 Tools for data analysis
- PAW: started in 1985, no major developments since 1994
- HippoDraw: started in 1991
- ROOT: started in 1995, continuous development
- JAS: started in 1995, continuous development
- Open Scientist ?
- LHC++/Anaphe: 1996 → 2002
- PI: new project in the LHC Computing Grid, just starting now
21 PAW
- The reference for 18 years (since 1985); used by most collaborations
- ported to many platforms; small (3 to 15 MB)
- many criticisms during the development phase
- applauded now that it is stable
- maintained by Olivier Couet (ROOT team)
Usage still growing
0.1 FTE
22 HippoDraw
- Author: Paul Kunz
- showed the way in 1991/1992
- Usage: Paul + a 50-year-old CERN physicist
- Seems to be in constant prototyping phases
- Good to have this type of prototype to illustrate possible new interactive techniques.
1 FTE ?
23 ROOT
- In constant development since 1995
- Used by many collaborations, and outside HEP
More than 10000 distributions of binary tar files in February
6 ± 2 FTE
24 JAS
- Started in 1995 (Tony Johnson)
- Current version: 2. JAS3 presented at this CHEP
- For the Java world.
- How to cooperate with C++ frameworks?
3 FTE ?
25 In AIDA you believe ?
- The Abstract Interfaces for Data Analysis project was started by the now-defunct LHC++ project and continued by Anaphe (now stopped).
- Supported by JAS and Open Scientist
- Goal: define abstract interfaces to facilitate cooperation between developers and to ease the migration of users to new products
- Versions 1, 2 and 3 (version 4 for PI?)
26 In AIDA I don't believe
- Abstract interfaces are fundamental in modern systems: they make a system more modular and adaptable.
- But common abstract interfaces are not a good idea:
  - they force a lowest common denominator
  - they require international agreements
  - users will be confused (what is common and what is not)
  - you become the slave of a deal, which works against creativity
- It is more important to agree on object interchange formats and database access.
- You can easily change a few hundred lines of code; you cannot copy terabytes of data.
27 The LCG PI project
- Fresh from the oven
- One of the projects recently launched by the Applications Area of the LCG project.
- Ideas:
  - promote the use of AIDA (version 4)
  - Python for scripting
  - interface to ROOT/CINT
- in gestation
- see the talk by Vincenzo
28 User & Developer views
- Users' requests:
  - very rarely requests for grandiose new features
  - zillions of tiny new features
  - zillions of tiny improvements
  - want consolidation & stability
- Developers' view:
  - want to implement the sexy features
  - target modularity (more complex installation?)
  - maintenance & helpdesk: a problem or a chance?
29 Lessons from the past
- It takes time to develop a general tool: more than 7 years for PAW, ROOT and JAS
- User feedback is essential in the development phase
- People like stable systems
- Efficient access to data sets is a prerequisite
- 24h × 7 days × 12 months × N years online support is vital
30 Develop/Debug/Maintain
In an interactive system with N basic functions, the number of combinations may be unlimited (not N×N, but N!). 10% of the time is needed to develop the first 90% of the code, and 90% of the time to develop the remaining 10%.
31 Time to develop
[Chart: development timelines of the analysis tools, with the LCG timescale indicated]
32 Technical aspects
33 Desktop
- Plug-in Manager and Dictionary
- GUI
- Graphics 2-D, 3-D
- Event Displays
- Histogramming & Fitting
- Statistics tools
- Scripting
- Data/Program organization
34 Plug-in Manager
[Diagram: the plug-in manager sits between the basic services (GUI, math, interpreter, I/O manager, object dictionary) and the general-utility, experiment and user shared libraries]
35 The Object Dictionary
[Diagram: the object dictionary (data dictionary + functions dictionary) serves I/O, inspectors & browsers, compiled code, and interpreted scripts / GUI / command line]
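As an illustration of what the dictionary enables, a hedged ROOT-style sketch; the class chosen is arbitrary:

  #include "TROOT.h"
  #include "TClass.h"

  void dictionaryDemo() {
     // The dictionary maps a class name to its full description...
     TClass *cl = gROOT->GetClass("TH1F");
     // ...and can create an instance from the name alone, which is
     // what browsers, inspectors and the I/O system build upon.
     if (cl) { void *obj = cl->New(); }
  }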
36 Scripting for data analysis
- After the KUIP and Tk/Tcl era
- Command-line interface required
- Scripts:
  - interpreted and/or byte-code interpreted
  - automatic compilation and linking
  - call compiled or interpreted code
  - compiled code must be able to call interpreted code (GUI and configuration scripts)
- Big bonus if the compiled and interpreted languages are the same (see the sketch below)
- Scripting and object dictionary symbiosis
- Remote execution of scripts (in parallel)
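A sketch of what "same compiled and interpreted language" buys in practice, using ROOT's CINT and its automatic compilation; the script names are invented:

  root [0] .x ana.C     // source interpreted by CINT
  root [1] .x ana.C+    // same source compiled and linked on the fly (ACLiC)
  // and from compiled code, calling back into the interpreter:
  //   gROOT->ProcessLine(".x config.C");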
37 Languages & scripting
[Diagram: interactive and batch users drive C++ compiled code, C++ interpreted scripts, Python/Perl scripts, and a GUI with signals/slots]
38 Comparing scripts
Very interesting project from Subir Sarkar: cooperation between Java and a C++ framework, based on the Object Dictionary.
http://sarkar.home.cern.ch/sarkar/jroot/main.html
39 GUI(s)
- Constant evolution
- Microsoft: MFC, Win32 API
- The signals/slots principle is very nice; it helps in designing large and modular GUI systems (sketched below)
- Interpreters help GUI builders/editors
1983  VAX/VMS SMS   VT100
1985  GKS           Tektronix
1989  MOTIF         Unix workstations
1997  Java/Swing    the Web
2001  Qt            Linux/laptops
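A hedged sketch of the signals/slots principle in the style of ROOT's GUI classes; the parent frame, the receiver object "panel" and the slot DoFit() are invented:

  #include "TGButton.h"

  // The button emits "Clicked()"; Connect wires the signal to a slot on
  // a receiver object without either class referencing the other.
  TGTextButton *fitButton = new TGTextButton(frame, "Fit");
  fitButton->Connect("Clicked()", "MyPanel", panel, "DoFit()");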
40 2-D graphics
- An area where constant improvements are required:
  - better plotters, better fonts, ...
  - better drivers: PostScript, SVG, XML, etc.
Publication quality is a must. This requirement alone explains why many proposed data analysis systems do not penetrate experiments.
41 3-D graphics
- Data structures: objects ↔ scene
- Scene renderers: OpenGL, Open Inventor
- Most difficult is detector geometry graphics
- z-buffer algorithms are OK for fast, fancy real-time graphics, but not for careful debugging (the shape outline is important on top of z-buffer views)
- Vector PostScript (or PDF/SVG) must be available (not PostScript made of OpenGL triangles)
- see talks about GraXML and Persint
42 Example with PERSINT/ATLAS
43 Event Displays
- The most successful event displays so far were 2-D projections (see Aleph, Atlas/Atlantis)
- A lot of work with 3-D graphics in many experiments (see talks about Iguana)
- Client-server model
- Access to framework objects, browsers
- One could have expected a bigger role for Java!
  - Mismatch with experiment C++ frameworks?
- Possible directions:
  - standardize object exchange (SOAP/XML/ROOT I/O)
  - standardize low-level graphics exchange (HepRep)
44 Histogramming
- This should be a stable area
- Thread safety
- Binning on parallel systems
- Merging on batch/parallel systems (sketched below)
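Merging is the operation that batch and parallel systems exercise hardest. A minimal ROOT-flavoured sketch; the histograms stand for partial results of independent jobs:

  #include "TH1F.h"
  #include "TList.h"

  // Fold the statistics of partial histograms into one result.
  void mergeHistos(TH1F *h1, TH1F *h2, TH1F *h3) {
     TList parts;
     parts.Add(h2);       // partial result from job 2
     parts.Add(h3);       // partial result from job 3
     h1->Merge(&parts);   // h1 now holds the combined contents
  }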
45 Fitting
- Minuit: the standard (everyday usage sketched below)
- Fumili was nice and fast
- An upgrade of Minuit with new algorithms, including Fumili, is in the pipeline
- several GUIs on top
- a very powerful package developed by BaBar: see the talk on RooFit by D. Kirkby
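For scale, the everyday idiom this machinery serves, sketched in ROOT style; the model formula, range and starting values are arbitrary:

  #include "TH1F.h"
  #include "TF1.h"

  void fitDemo(TH1F *h) {
     h->Fit("gaus");                                   // built-in Gaussian, Minuit underneath
     TF1 *f = new TF1("f", "[0]*exp(-[1]*x)", 0, 10);  // user model, 2 parameters
     f->SetParameters(100, 0.5);                       // starting values
     h->Fit(f, "R");                                   // "R": fit only inside the range
  }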
46 Statistics & Math
- Many tools and algorithms exist:
  - GSL ?
  - GNU R-Math project
  - TerraFerma initiative
- Subject of discussions at many workshops:
  - confidence limits workshops
  - ACAT at FermiLab and in Moscow
  - Durham
- These need to be federated in a coherent framework.
47 Lost in Complexity?
- In large collaborations, users are often lost when confronted with the complexity of big simulation and reconstruction programs:
  - What is the data organization?
  - How are the algorithms organized? What is the hierarchy?
- The problem is amplified by the use of dynamically configurable systems, dynamic linking and polymorphism.
- Browsing data and algorithms is a must.
48 Folders / white boards
Folders help in understanding complex hierarchical structures. They are language independent and could be GRID-aware.
49 Why Folders ?
This diagram shows a system without folders. The objects have pointers to each other to access each other's data. Pointers are an efficient way to share data between classes. However, a direct pointer creates a direct coupling between classes. This design can become a very tangled web of dependencies in a system with a large number of classes.
50 Why Folders ?
In the diagram below, a reference to the data is kept in the folder, and the consumers refer to the folder rather than to each other to access the data. A naming and search service provides the alternative: it loosely couples the classes and greatly enhances I/O operations. In this way, folders separate the data from the algorithms and greatly improve the modularity of an application by minimizing the class dependencies.
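A minimal sketch of the pattern in the style of ROOT's TFolder; the folder layout and the container name are invented:

  #include "TROOT.h"
  #include "TFolder.h"
  #include "TObjArray.h"

  void publishTracks(TObjArray *tracks) {
     // Producer: register the container under a named folder
     tracks->SetName("tracks");                   // name used for lookups
     TFolder *top = gROOT->GetRootFolder();       // the top-level "//root" folder
     TFolder *ev  = top->AddFolder("Event", "reconstructed event data");
     ev->Add(tracks);
  }

  void consumeTracks() {
     // Consumer: find the data by name, with no pointer to the producer
     TObjArray *tracks = (TObjArray*)gROOT->FindObjectAny("tracks");
  }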
51 Tasks/Algorithms
In the same way that Folders can be used to organize the data, one can use Tasks to organize a hierarchy of algorithms. Tasks can be organized into a hierarchical tree and displayed in the browser. A Task is an abstraction with standard functions Begin, Execute, Finish. Each Task-derived class may contain other Tasks that can be executed recursively, such that a complex program can be dynamically built and executed by invoking the services of the top-level task or one of its subtasks.
Tasks help in understanding the organization and sequence of execution of large programs; a minimal skeleton is sketched below.
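A hedged sketch in the style of ROOT's TTask; the task names and the empty Exec body are illustrative:

  #include "TTask.h"

  // A concrete task overrides Exec with the algorithm body.
  class RecoTask : public TTask {
  public:
     RecoTask(const char *name, const char *title) : TTask(name, title) {}
     void Exec(Option_t *option = "") { /* algorithm here */ }
  };

  void buildAndRun() {
     RecoTask *top = new RecoTask("reco", "full reconstruction");
     top->Add(new RecoTask("tracking", "track finding"));      // subtask
     top->Add(new RecoTask("calo", "calorimeter clustering")); // subtask
     top->ExecuteTask();   // runs the whole task tree recursively
  }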
52 Directions
53 Exchange/Compatibility
- If we assume that several data analysis tools will be around (HEP-made or commercial), it is important to exchange objects between these tools (drag & drop, network or files).
- SOAP/XML have emerged as standards for exchanging small volumes of objects.
- Several technical solutions are possible. The winning solutions will be the ones able to automate the process by exploiting all the information in the object dictionary.
54 Follow Microsoft ?
- SOAP/XML are among the key components of .NET (and also of the MS competition).
- MS is preparing a new OS (Longhorn?) for 2005. This new OS will introduce a distributed object database.
- This may have a serious impact on the GRID software and on our tools.
55 Access Patterns
- Understand data access patterns:
  - to objects in one file
  - to subsets of objects in many collections
  - relations with run/file catalogs
  - persistent reference pointers
- Optimize the design of containers for:
  - processing in batch
  - interactive parallel processing
  - cache management and proxies
56 Query processor
Extend/develop powerful query systems that:
- minimize the amount of programming
- optimize I/O (read only what is strictly necessary)
- are able to process data in parallel, hiding the complexity of parallelism from the end user
- can be executed again and again, possibly learning from the previous passes
- are robust against network failures, CTRL/C, programming errors
- can be run in GUI, interpreted or compiled mode (see the sketch below)
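The closest existing idiom is a tree query that touches only the branches named in the expression; a one-line hedged example with invented branch names:

  // No user program: only the px, py and nhits branches are read from
  // disk, however many branches the tree actually contains.
  T->Draw("sqrt(px*px + py*py)", "nhits > 10");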
57 Event Collections
- Develop/extend objects able to keep a summary of previous runs.
- Event collections with their iterators, well matched to the query processor (event/run number, UUID, tree entry serial number).
- Special objects: masks, bit-slice indices to speed up searches in large collections (sketched below).
- The system must be able to run both with and without the run/file catalog.
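A hedged ROOT-style sketch of the idea (branch names invented): an expensive selection is evaluated once into an event list, and later passes iterate only over the matches:

  #include "TTree.h"
  #include "TEventList.h"
  #include "TDirectory.h"

  void listDemo(TTree *T) {
     T->Draw(">>sel", "energy > 100");  // run the cut once, keep the entry numbers
     TEventList *sel = (TEventList*)gDirectory->Get("sel");
     T->SetEventList(sel);              // subsequent queries loop only
     T->Draw("mass");                   // over the selected entries
  }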
58 Exploiting meta information
- The normal data analysis mode requires access to the user classes.
- However, experience shows that users also expect (as was the case with PAW) to be able to process their data sets without the classes/shared libraries used to generate them, while still supporting automatic schema evolution (see the sketch below).
- The class meta information is saved in the data set. Simple queries involving only data class attributes must be possible without the code.
- This requirement has consequences for the way the object dictionary is used.
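What this looks like in practice, as a hedged ROOT-style sketch (file, tree and member names invented): the meta information stored with the data is enough to inspect and query it without any user library:

  #include "TFile.h"
  #include "TTree.h"

  void browseWithoutCode() {
     TFile f("run1.root");          // no experiment library loaded
     TTree *T = (TTree*)f.Get("T");
     T->Print();                    // branch structure from the stored dictionary
     T->Show(0);                    // dump the first entry
     T->Draw("fTracks.fPx");        // query a data member by name
  }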
59 Dependencies & Simplicity
- Minimize component dependencies to facilitate software distribution and portability.
- The winning tools will be the ones that:
  - are easy to port to new systems (OS/compilers)
  - depend only on other systems that are also easy to port
  - are used in real conditions, to guarantee feedback
  - are able to evolve very quickly to adapt to new situations and new requirements
60 Integration with GRID software
- The data analysis software is an integral part of the GRID software. It drives the process, not the inverse.
- This implies a close cooperation between the teams working on tools for data analysis and the teams working on the GRID plumbing (resource brokers, authentication, etc.) and on high-level GRID tools like Condor.
- The batch line and the interactive line must be developed in a complementary way.
61 Trends & Summary
- More and more GRID-oriented data analysis
- More and more experiment-independent software
- Efficient access to large and structured event collections
- Interaction with user & experiment classes
- Histogram & ntuple viewers, data presenters
On the GRID side: parallelism (batch/interactive), access to catalogs, resource brokers, process migration, progress monitors, proxies/caches, virtual data sets.
62 Acknowledgements
- For a long time, data analysis has been the fifth wheel of the car. Many thanks to the organizing committee for giving me the opportunity to present my views on the subject.
Enjoy this conference!