Title: Computing in High Energy Physics 2003
1 Perspective on Future Data Analysis in HENP
- Computing in High Energy Physics 2003
- La Jolla, 24 March
- René Brun
- CERN
2 Data Analysis ??
- Data Analysis has traditionally been associated with the final stages of data processing, i.e. Physics Analysis.
- In this talk, I will cover a more general aspect of Data Analysis (in the true sense).
- How to interact with data at all stages of data processing (batch or interactive modes)?
- Can we imagine an experiment-independent way to achieve this?
3 Evolution
- To understand the possible directions, we must understand some messages from the past, the solid recipes!
- One important message is: make it simple.
- Heavy experiment frameworks are often perceived as a serious obstacle and push users towards more basic but universal frameworks.
4 Once upon a time (seventies)
- With the first electronic (as opposed to bubble chamber) experiments, data analysis was experiment specific, an activity after the data taking.
- The only common software was the histogramming package (e.g. Hbook), the fitting package (e.g. Minuit), some plotting packages, and independent routines in cernlib (linear algebra and small utilities).
- Data structures: Fortran common blocks
5 Early Eighties
- With the growing complexity of the experiments and of the corresponding software, we see the development of data-structure management systems (HYDRA, ZBOOK → ZEBRA, BOS).
- These systems are able to write/read complex bank collections. ZEBRA had a self-describing bank format with built-in support for bank evolution.
- Most data is processed in batch, but many prototypes of interactive systems start to appear (HTV, GEP, then PAW...).
6 PAW
- Designed in 1985. Stable since 1993.
- Row-wise ntuples: OK for small data sets, interactive histogramming with cuts.
- Column-wise ntuples: a major step illustrating the advantage of structured data sets.
- PAW a success:
  - not so much because of its technical merits
  - but perceived as a tool widely available
  - stability over many years an important element
7 1993 → 2000 (1)
- Move from Fortran to OO
- Took far more time than expected:
  - new language(s)
  - new programming techniques
  - basic infrastructure not available to compete with existing libraries and tools
  - conflicts between projects
  - ad-hoc software in experiments
8 1993 → 2000 (2)
- False hopes with OODBMS (or too early?)
- OODBMS → Objectivity
  - OO models designed for Objy
  - batch oriented
  - interactive use via conversion to PAW ntuples
  - a central database does not fit well with GRID concepts
  - licensing problems and more
9 Data Analysis Models
10 From the desktop to the GRID
[Diagram: desktop ↔ GRID, connecting local/remote storage and online/offline farms]
New data analysis tools must be able to use remote CPUs, storage elements and networks in parallel, in a way that is transparent to a user at a desktop.
11 My laptop in 200X
Using a naïve extrapolation of Moore's law for a state-of-the-art laptop:
Year   CPU/GHz   RAM/GB   Disk/GB
2003   2.4       0.5      60
2005   5         1        150
2007   10        2        300
2009   20        4        600
2011   40        8        1000
Nice! But less than 1/1000 of what I need.
12 Batch-mode Local Analysis
- Conventional model: the user has full control over the event loop.
- The program produces histograms, ntuples or trees.
- The selection is via the user's private code.
- Histograms are then added (with a tool or in the interactive session).
- Ntuples/trees are combined into a chain and analyzed interactively, as sketched below.
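A minimal sketch of the last two steps in ROOT style; the tree name "T", the file names and the branch names are invented for illustration:

  // Chain the per-job output files into one logical tree,
  // then histogram a variable with a cut.
  #include "TChain.h"

  void chainAnalysis() {
     TChain chain("T");              // "T" is the name of the tree in each file
     chain.Add("job1.root");         // output of batch job 1
     chain.Add("job2.root");         // output of batch job 2
     chain.Draw("px", "pt > 1.0");   // histogram px for entries passing the cut
  }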
13 Batch Analysis on the GRID
- From a user viewpoint, a simple extrapolation of the local batch analysis.
- In practice, it must involve all the GRID machinery: authentication, resource brokers, sandboxes.
- Viewing the current status (histograms) must be possible.
- Advantage: stateless, can process large data volumes.
Advanced systems already exist (see talk by Andreas Wagner).
14 AliEnFS Distributed Analysis
15 Interactive Local Analysis
- On a public cluster, or the user's laptop.
- Tools like PAW or a successor are used for visualization and ntuple/tree analysis.
16 GRID Interactive Analysis: Case 1
- Data transfer to the user's laptop
- Optional run/file catalog
- Optional GRID software
[Diagram: trees on a remote file server (e.g. rootd) are transferred to the laptop, optionally via a run/file catalog; analysis scripts are interpreted or compiled on the local machine]
17 GRID Interactive Analysis: Case 2
- Remote data processing
- Optional run/file catalog
- Optional GRID software
[Diagram: commands and scripts are sent to a remote data analyzer (e.g. proofd), optionally via a run/file catalog; analysis scripts are interpreted or compiled on the remote machine; trees and histograms come back]
18 GRID Interactive Analysis: Case 3
- Remote data processing
- Run/file catalog
- Full GRID software
[Diagram: commands and scripts go through a run/file catalog to remote master(s), where analysis scripts are interpreted or compiled; the masters distribute the work to slaves, each processing its own trees; histograms and trees are returned through the master]
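In case 3 the natural unit of work is a selector object that the master ships to the slaves. A skeleton in the style of ROOT's TSelector (class name and method bodies are illustrative, not the definitive interface):

  #include "TSelector.h"

  // Begin runs once, Process runs for every entry (on each slave in the
  // parallel case), Terminate presents the merged results.
  class MySelector : public TSelector {
  public:
     void   Begin(TTree *tree);        // book histograms
     Bool_t Process(Long64_t entry);   // read the entry, apply cuts, fill
     void   Terminate();               // draw or save the output
  };

  // Usage looks the same locally and through a remote master:
  //   chain.Process("MySelector.C");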
19 Data Analysis Projects
20 Tools for data analysis
- PAW: started in 1985, no major developments since 1994
- HippoDraw: started in 1991
- ROOT: started in 1995, continuous development
- JAS: started in 1995, continuous development
- Open Scientist ?
- LHC++/Anaphe: 1996 → 2002
- PI: new project in the LHC Computing Grid, just starting now
21 PAW
- The reference for 18 years (since 1985); used by most collaborations
- ported to many platforms; small (3 to 15 MB)
- many criticisms during the development phase
- applauded now that it is stable
- maintained by Olivier Couet (ROOT team)
Usage still growing
0.1 FTE
22 HippoDraw
- Author: Paul Kunz
- showed the way in 1991/1992
- Usage: Paul + a 50-year-old CERN physicist
- Seems to be in constant prototyping phases
- Good to have this type of prototype to illustrate possible new interactive techniques.
1 FTE ?
23 ROOT
- In constant development since 1995
- Used by many collaborations, and outside HEP
More than 10000 distributions of binary tar files in February
6 ± 2 FTE
24 JAS
- Started in 1995 (Tony Johnson)
- Current version: 2. JAS3 presented at this CHEP
- For the Java world.
- How to cooperate with C++ frameworks?
3 FTE ?
25 In AIDA you believe ?
- The Abstract Interfaces for Data Analysis project was started by the now-defunct LHC++ project and continued by Anaphe (now stopped).
- Supported by JAS and Open Scientist
- Goal: define abstract interfaces to facilitate cooperation between developers and to ease the migration of users to new products
- Versions 1, 2 and 3 (version 4 for PI?)
26 In AIDA I don't believe
- Abstract interfaces are fundamental in modern systems: they make a system more modular and adaptable.
- But common abstract interfaces are not a good idea:
  - they force a lowest common denominator
  - they require international agreements
  - users will be confused (what is common and what is not)
  - you become the slave of a deal, which works against creativity
- It is more important to agree on object interchange formats and database access.
- You can easily change a few hundred lines of code; you cannot copy terabytes of data.
27 The LCG PI project
- Fresh from the oven
- One of the projects recently launched by the Applications Area of the LCG project.
- Ideas:
  - promote the use of AIDA (version 4)
  - Python for scripting
  - interface to ROOT/CINT
- in gestation
- see the talk by Vincenzo
28 User & Developer views
- Users' requests:
  - very rarely requests for grandiose new features
  - zillions of tiny new features
  - zillions of tiny improvements
  - want consolidation & stability
- Developers' view:
  - want to implement the sexy features
  - target modularity (more complex installation?)
  - maintenance & helpdesk: a problem or a chance?
29 Lessons from the past
- It takes time to develop a general tool: more than 7 years for PAW, ROOT and JAS
- User feedback is essential in the development phase
- People like stable systems
- Efficient access to data sets is a prerequisite
- 24h × 7 days × 12 months × N years online support is vital
30 Develop/Debug/Maintain
In an interactive system with N basic functions, the number of combinations may be unlimited (not N×N, but N!). 10% of the time is needed to develop the first 90% of the code, and 90% of the time to develop the remaining 10%.
31 Time to develop
[Chart: development timelines of the analysis tools, with the LCG timescale indicated]
32 Technical aspects
33 Desktop
- Plug-in Manager and Dictionary
- GUI
- Graphics 2-D, 3-D
- Event Displays
- Histogramming & Fitting
- Statistics tools
- Scripting
- Data/Program organization
34 Plug-in Manager
[Diagram: the plug-in manager sits between the basic services (GUI, math, interpreter, I/O manager, object dictionary) and the general-utility, experiment and user shared libraries]
35 The Object Dictionary
[Diagram: the object dictionary (data dictionary + functions dictionary) serves I/O, inspectors & browsers, compiled code, and interpreted scripts / GUI / command line]
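As an illustration of what the dictionary enables, a hedged ROOT-style sketch; the class chosen is arbitrary:

  #include "TROOT.h"
  #include "TClass.h"

  void dictionaryDemo() {
     // The dictionary maps a class name to its full description...
     TClass *cl = gROOT->GetClass("TH1F");
     // ...and can create an instance from the name alone, which is
     // what browsers, inspectors and the I/O system build upon.
     if (cl) { void *obj = cl->New(); }
  }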
36 Scripting for data analysis
- After the KUIP and Tk/Tcl era
- Command-line interface required
- Scripts:
  - interpreted and/or byte-code interpreted
  - automatic compilation and linking
  - call compiled or interpreted code
  - compiled code must be able to call interpreted code (GUI and configuration scripts)
- Big bonus if the compiled and interpreted languages are the same (see the sketch below)
- Scripting and object dictionary symbiosis
- Remote execution of scripts (in parallel)
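A sketch of what "same compiled and interpreted language" buys in practice, using ROOT's CINT and its automatic compilation; the script names are invented:

  root [0] .x ana.C     // source interpreted by CINT
  root [1] .x ana.C+    // same source compiled and linked on the fly (ACLiC)
  // and from compiled code, calling back into the interpreter:
  //   gROOT->ProcessLine(".x config.C");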
37 Languages & scripting
[Diagram: interactive and batch users drive C++ compiled code, C++ interpreted scripts, Python/Perl scripts, and a GUI with signals/slots]
38 Comparing scripts
Very interesting project from Subir Sarkar: cooperation between Java and a C++ framework, based on the Object Dictionary.
http://sarkar.home.cern.ch/sarkar/jroot/main.html
39 GUI(s)
- Constant evolution
- Microsoft: MFC, Win32 API
- The signals/slots principle is very nice; it helps in designing large and modular GUI systems (sketched below)
- Interpreters help GUI builders/editors
1983  VAX/VMS SMS   VT100
1985  GKS           Tektronix
1989  MOTIF         Unix workstations
1997  Java/Swing    the Web
2001  Qt            Linux/laptops
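A hedged sketch of the signals/slots principle in the style of ROOT's GUI classes; the parent frame, the receiver object "panel" and the slot DoFit() are invented:

  #include "TGButton.h"

  // The button emits "Clicked()"; Connect wires the signal to a slot on
  // a receiver object without either class referencing the other.
  TGTextButton *fitButton = new TGTextButton(frame, "Fit");
  fitButton->Connect("Clicked()", "MyPanel", panel, "DoFit()");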
40 2-D graphics
- An area where constant improvements are required:
  - better plotters, better fonts, ...
  - better drivers: PostScript, SVG, XML, etc.
Publication quality is a must. This requirement alone explains why many proposed data analysis systems do not penetrate experiments.
41 3-D graphics
- Data structures: objects ↔ scene
- Scene renderers: OpenGL, Open Inventor
- Most difficult is detector geometry graphics
- z-buffer algorithms are OK for fast, fancy real-time graphics, but not for careful debugging (the shape outline is important on top of z-buffer views)
- Vector PostScript (or PDF/SVG) must be available (not PostScript made of OpenGL triangles)
- see talks about GraXML and Persint
42 Example with PERSINT/ATLAS
43 Event Displays
- The most successful event displays so far were 2-D projections (see Aleph, Atlas/Atlantis)
- A lot of work with 3-D graphics in many experiments (see talks about Iguana)
- Client-server model
- Access to framework objects, browsers
- One could have expected a bigger role for Java!
  - Mismatch with experiment C++ frameworks?
- Possible directions:
  - standardize object exchange (SOAP/XML/ROOT I/O)
  - standardize low-level graphics exchange (HepRep)
44 Histogramming
- This should be a stable area
- Thread safety
- Binning on parallel systems
- Merging on batch/parallel systems (sketched below)
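Merging is the operation that batch and parallel systems exercise hardest. A minimal ROOT-flavoured sketch; the histograms stand for partial results of independent jobs:

  #include "TH1F.h"
  #include "TList.h"

  // Fold the statistics of partial histograms into one result.
  void mergeHistos(TH1F *h1, TH1F *h2, TH1F *h3) {
     TList parts;
     parts.Add(h2);       // partial result from job 2
     parts.Add(h3);       // partial result from job 3
     h1->Merge(&parts);   // h1 now holds the combined contents
  }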
45 Fitting
- Minuit: the standard (everyday usage sketched below)
- Fumili was nice and fast
- An upgrade of Minuit with new algorithms, including Fumili, is in the pipeline
- several GUIs on top
- a very powerful package developed by BaBar: see the talk on RooFit by D. Kirkby
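For scale, the everyday idiom this machinery serves, sketched in ROOT style; the model formula, range and starting values are arbitrary:

  #include "TH1F.h"
  #include "TF1.h"

  void fitDemo(TH1F *h) {
     h->Fit("gaus");                                   // built-in Gaussian, Minuit underneath
     TF1 *f = new TF1("f", "[0]*exp(-[1]*x)", 0, 10);  // user model, 2 parameters
     f->SetParameters(100, 0.5);                       // starting values
     h->Fit(f, "R");                                   // "R": fit only inside the range
  }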
46 Statistics & Math
- Many tools and algorithms exist:
  - GSL ?
  - GNU R-Math project
  - TerraFerma initiative
- Subject of discussions at many workshops:
  - confidence limits workshops
  - ACAT at FermiLab and in Moscow
  - Durham
- These need to be federated in a coherent framework.
47 Lost in Complexity?
- In large collaborations, users are often lost when confronted with the complexity of big simulation and reconstruction programs:
  - What is the data organization?
  - How are the algorithms organized? What is the hierarchy?
- The problem is amplified by the use of dynamically configurable systems, dynamic linking and polymorphism.
- Browsing data and algorithms is a must.
48 Folders / white boards
Folders help in understanding complex hierarchical structures. They are language independent and could be GRID-aware.
49 Why Folders ?
This diagram shows a system without folders. The objects have pointers to each other to access each other's data. Pointers are an efficient way to share data between classes. However, a direct pointer creates a direct coupling between classes. This design can become a very tangled web of dependencies in a system with a large number of classes.
50 Why Folders ?
In the diagram below, a reference to the data is kept in the folder, and the consumers refer to the folder rather than to each other to access the data. A naming and search service provides the alternative: it loosely couples the classes and greatly enhances I/O operations. In this way, folders separate the data from the algorithms and greatly improve the modularity of an application by minimizing the class dependencies.
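A minimal sketch of the pattern in the style of ROOT's TFolder; the folder layout and the container name are invented:

  #include "TROOT.h"
  #include "TFolder.h"
  #include "TObjArray.h"

  void publishTracks(TObjArray *tracks) {
     // Producer: register the container under a named folder
     tracks->SetName("tracks");                   // name used for lookups
     TFolder *top = gROOT->GetRootFolder();       // the top-level "//root" folder
     TFolder *ev  = top->AddFolder("Event", "reconstructed event data");
     ev->Add(tracks);
  }

  void consumeTracks() {
     // Consumer: find the data by name, with no pointer to the producer
     TObjArray *tracks = (TObjArray*)gROOT->FindObjectAny("tracks");
  }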
51 Tasks/Algorithms
In the same way that Folders can be used to organize the data, one can use Tasks to organize a hierarchy of algorithms. Tasks can be organized into a hierarchical tree and displayed in the browser. A Task is an abstraction with standard functions Begin, Execute, Finish. Each Task-derived class may contain other Tasks that can be executed recursively, such that a complex program can be dynamically built and executed by invoking the services of the top-level task or one of its subtasks.
Tasks help in understanding the organization and sequence of execution of large programs; a minimal skeleton is sketched below.
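A hedged sketch in the style of ROOT's TTask; the task names and the empty Exec body are illustrative:

  #include "TTask.h"

  // A concrete task overrides Exec with the algorithm body.
  class RecoTask : public TTask {
  public:
     RecoTask(const char *name, const char *title) : TTask(name, title) {}
     void Exec(Option_t *option = "") { /* algorithm here */ }
  };

  void buildAndRun() {
     RecoTask *top = new RecoTask("reco", "full reconstruction");
     top->Add(new RecoTask("tracking", "track finding"));      // subtask
     top->Add(new RecoTask("calo", "calorimeter clustering")); // subtask
     top->ExecuteTask();   // runs the whole task tree recursively
  }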
52 Directions
53 Exchange/Compatibility
- If we assume that several data analysis tools will be around (HEP-made or commercial), it is important to exchange objects between these tools (drag & drop, network or files).
- SOAP/XML have emerged as standards for exchanging small volumes of objects.
- Several technical solutions are possible. The winning solutions will be the ones able to automate the process by exploiting all the information in the object dictionary.
54 Follow Microsoft ?
- SOAP/XML are among the key components of .NET (and also of the MS competition).
- MS is preparing a new OS (Longhorn?) for 2005. This new OS will introduce a distributed object database.
- This may have a serious impact on the GRID software and on our tools.
55 Access Patterns
- Understand data access patterns:
  - to objects in one file
  - to subsets of objects in many collections
  - relations with run/file catalogs
  - persistent reference pointers
- Optimize the design of containers for:
  - processing in batch
  - interactive parallel processing
  - cache management and proxies
56 Query processor
Extend/develop powerful query systems that:
- minimize the amount of programming
- optimize I/O (read only what is strictly necessary)
- are able to process data in parallel, hiding the complexity of parallelism from the end user
- can be executed again and again, possibly learning from the previous passes
- are robust against network failures, CTRL/C, programming errors
- can be run in GUI, interpreted or compiled mode (see the sketch below)
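The closest existing idiom is a tree query that touches only the branches named in the expression; a one-line hedged example with invented branch names:

  // No user program: only the px, py and nhits branches are read from
  // disk, however many branches the tree actually contains.
  T->Draw("sqrt(px*px + py*py)", "nhits > 10");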
57 Event Collections
- Develop/extend objects able to keep a summary of previous runs.
- Event collections with their iterators, well matched to the query processor (event/run number, UUID, tree entry serial number).
- Special objects: masks, bit-slice indices to speed up searches in large collections (sketched below).
- The system must be able to run both with and without the run/file catalog.
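A hedged ROOT-style sketch of the idea (branch names invented): an expensive selection is evaluated once into an event list, and later passes iterate only over the matches:

  #include "TTree.h"
  #include "TEventList.h"
  #include "TDirectory.h"

  void listDemo(TTree *T) {
     T->Draw(">>sel", "energy > 100");  // run the cut once, keep the entry numbers
     TEventList *sel = (TEventList*)gDirectory->Get("sel");
     T->SetEventList(sel);              // subsequent queries loop only
     T->Draw("mass");                   // over the selected entries
  }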
58 Exploiting meta information
- The normal data analysis mode requires access to the user classes.
- However, experience shows that users also expect (as was the case with PAW) to be able to process their data sets without the classes/shared libraries used to generate them, while still supporting automatic schema evolution (see the sketch below).
- The class meta information is saved in the data set. Simple queries involving only data class attributes must be possible without the code.
- This requirement has consequences for the way the object dictionary is used.
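What this looks like in practice, as a hedged ROOT-style sketch (file, tree and member names invented): the meta information stored with the data is enough to inspect and query it without any user library:

  #include "TFile.h"
  #include "TTree.h"

  void browseWithoutCode() {
     TFile f("run1.root");          // no experiment library loaded
     TTree *T = (TTree*)f.Get("T");
     T->Print();                    // branch structure from the stored dictionary
     T->Show(0);                    // dump the first entry
     T->Draw("fTracks.fPx");        // query a data member by name
  }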
59 Dependencies & Simplicity
- Minimize component dependencies to facilitate software distribution and portability.
- The winning tools will be the ones that:
  - are easy to port to new systems (OS/compilers)
  - depend only on other systems that are also easy to port
  - are used in real conditions, to guarantee feedback
  - are able to evolve very quickly to adapt to new situations and new requirements
60 Integration with GRID software
- The data analysis software is an integral part of the GRID software. It drives the process, not the inverse.
- This implies a close cooperation between the teams working on tools for data analysis and the teams working on the GRID plumbing (resource brokers, authentication, etc.) and on high-level GRID tools like Condor.
- The batch line and the interactive line must be developed in a complementary way.
61 Trends & Summary
- More and more GRID-oriented data analysis
- More and more experiment-independent software
- Efficient access to large and structured event collections
- Interaction with user & experiment classes
- Histogram & ntuple viewers, data presenters
On the GRID side: parallelism (batch/interactive), access to catalogs, resource brokers, process migration, progress monitors, proxies/caches, virtual data sets.
62 Acknowledgements
- For a long time, data analysis has been the fifth wheel of the car. Many thanks to the organizing committee for giving me the opportunity to present my views on the subject.
Enjoy this conference!