Title: Computing in HEP
1. Computing in HEP
- An Introduction to Data Analysis in High Energy Physics - Max Sang
- Applications for Physics and Infrastructure Group
- IT Division, CERN, Geneva
- max.sang@cern.ch
2. Introduction to HEP
- Accelerators produce high-intensity, high-energy beams of particles such as protons or electrons.
- Detectors are huge, multi-layered electronic devices constructed around the points where the beams collide with targets or with other beams.
- Planned and constructed by multinational collaborations of hundreds of people over several years.
- Once operational, they run for years (e.g. the LEP programme, 1989-2000).
3. The Large Hadron Collider
[Diagram: the LHC ring at CERN - 27 km circumference, 100 m below the surface, with eight underground caverns for detectors. First beam 2006.]
4. CMS
- Under construction now - ready 2006
- 21 m long, 15 m diameter
- 12500 tons
- As much iron as the Eiffel Tower
- 1900 physicists from 31 countries
5. Introduction to HEP (II)
- Events are like photographs of individual subatomic interactions, taken by the detectors.
- Events are produced at high rates (kHz-MHz) for months at a time with minimal human intervention. Analysis continues for years.
- Fundamental physics processes are quantum (probabilistic). They are uncorrelated (consecutive events are unconnected) but occur at a wide range of frequencies - some very rare, and some more interesting than others (see the toy sketch below).
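A toy illustration of that last point in plain Python: uncorrelated events, each with a tiny per-event probability of containing a rare process. The event count and probability are invented for the example.

```python
# Toy sketch: uncorrelated events, each with a tiny (hypothetical)
# chance of containing a rare process.
import random

random.seed(42)
N_EVENTS = 1_000_000
RARE_FRACTION = 1e-5   # invented probability per event

rare = sum(1 for _ in range(N_EVENTS) if random.random() < RARE_FRACTION)
print(f"{rare} rare-process candidates in {N_EVENTS} events")  # ~10
```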
6. Introduction to HEP (III)
- Data are grouped into runs, periods and years. Calibrations, detector faults, beam conditions, etc. are associated with certain time periods, e.g. "the calorimeter was off during run 1234".
- Event generators simulate the collisions and produce the final-state particles.
- These are processed by simulated detectors to produce Monte Carlo data for comparison with what we see in the real thing - an iterative process of comparison, tuning and model verification (sketched below).
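A minimal sketch of that generate-simulate-compare loop. The 91 GeV peak and 1 GeV resolution are invented placeholders, not a real generator or detector model:

```python
# Toy generator -> detector simulation -> Monte Carlo sample.
import random

random.seed(1)

def generate_event():
    # "event generator": an invented mass peak at 91 GeV, width 2.5 GeV
    return random.gauss(91.0, 2.5)

def simulate_detector(true_value, resolution=1.0):
    # "detector simulation": Gaussian measurement smearing
    return random.gauss(true_value, resolution)

mc = [simulate_detector(generate_event()) for _ in range(10_000)]
print(f"Monte Carlo mean = {sum(mc) / len(mc):.2f} GeV")
# In practice this distribution is compared with real data and the
# generator/detector parameters are tuned until they agree.
```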
7. Extracting the Data
- Passage of particles through detector components produces ionisation, which is amplified to a detectable level.
- Front-end electronics turn pulses into digits.
- Hardware processing turns digits into hits.
- Software turns hits into tracks, clusters etc.
- A multi-level trigger/filter decides which events to keep - sometimes only one event in 10⁷ (see the sketch below).
- Online reconstruction → storage.
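A minimal sketch of how a multi-level trigger chain whittles the rate down. The level names and acceptance fractions are illustrative, not real trigger numbers:

```python
# Each trigger level keeps only a fraction of what the previous one passed.
collision_rate_hz = 50e6          # e.g. an LHC-era input rate
acceptance = {                    # invented per-level acceptance fractions
    "Level-1 (hardware)": 1e-3,
    "Level-2 (software)": 1e-2,
    "Event filter": 0.2,
}

rate = collision_rate_hz
for level, fraction in acceptance.items():
    rate *= fraction
    print(f"{level}: {rate:,.0f} Hz")
# 50 MHz -> 50 kHz -> 500 Hz -> 100 Hz written to storage.
```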
8. The LEP Era (Started 1989)
- Four detectors (300 people each) producing:
- 50 kHz collision rate → 5 Hz storage rate.
- Event size 100 kB, reconstructed by a small farm of O(10) very high-end workstations.
- < 500 GB/year/experiment.
- Stored on tape (with disk caching) at CERN.
- Analysed on mainframes by remote batch jobs.
- Ntuples (~100 MB) returned to the user for more (interactive) analysis and calculation. Plots produced for presentations and papers.
9. The LHC Era (Starts 2006)
- 4 detectors (6k people in total).
- 50 MHz collision rate → 100 Hz storage rate.
- 500 GB/s raw data rate after triggering.
- Event size 1-2 MB, reconstructed by a farm of 1k PCs.
- 1 PB/year/experiment in 2007, increasing rapidly. Total by 2015 for all detectors: 100 PB.
- Searches may look for single events in 10⁷. Every user (in 30 countries) will want to eat millions of events at a single sitting, with reasonably democratic data access. (A back-of-envelope check of these volumes follows.)
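A back-of-envelope check of the per-experiment volume, assuming a nominal 10⁷ seconds of data-taking per year (a common rule-of-thumb figure, not from the slide):

```python
# Rough annual data volume from the quoted rates.
storage_rate_hz = 100        # events written per second after triggering
event_size_bytes = 1.5e6     # 1-2 MB per event
seconds_per_year = 1e7       # assumed live time per year (rule of thumb)

volume = storage_rate_hz * event_size_bytes * seconds_per_year
print(f"~{volume / 1e15:.1f} PB/year/experiment")  # ~1.5 PB
```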
10. Physicists are also Programmers
- All data analysis is done using computers.
- The physicists are all programmers, but almost none of them have any formal CS training.
- Some are very experienced (usually in F77) and will write lots of code for reconstruction, triggering etc.
- Others write more modest programs for their own data analysis.
- Some are fresh graduate students who've never written a line of code.
- Our job is to help them do physics.
11. What Software do they Need?
- Experiment-specific code
- Triggering, data acquisition, slow controls, reconstruction, new physics code
- Mostly written by the experimentalists without assistance
- Event generators
- Highly technical, constantly in flux
- Written by phenomenologists
- We don't help with these!
12. What Software do they Need? (II)
- Specialised HEP tools
- Detector simulation tools, relativistic kinematics, ...
- General-purpose scientific tools with a HEP slant
- Data visualisation, histogramming, ...
- General-purpose technical libraries
- Random numbers, matrices, geometry, analytical statistics, 2D and 3D graphics, ...
- We do help with these!
13. The Situation in 1995
- Millions of lines of F77, some of it very technical
- Thousands of man-years of debugging
- Users know and love/hate the software, and they don't want to change
- Serious and unavoidable maintenance commitment for old code - F77 is here to stay!
- Shrinking manpower in IT Division
- Not long until the start of the LHC programme. Change now or wait until 2020!
14. The Old Software
- Largely home-grown in the 70s and 80s:
- Persistent storage and memory management: ZEBRA
- Code management: PATCHY
- Scripting: KUIP/COMIS
- Histograms and ntuples: HBOOK
- Detector simulation: GEANT 3
- Fitting and minimisation: MINUIT
- Mathematics, random numbers, kinematics: MATHLIB
- Graphics: HIGZ/HPLOT
- Visualisation and interactive analysis: PAW
15. The Anaphe Project
- Provide a modern, object-oriented, more flexible, more powerful replacement for CERNLIB with fewer people in less time.
- Identify areas where commercial and/or Open Source products can (or must) be used instead of home-grown solutions.
- Concentrate efforts on HEP-specific tasks.
- Use object-oriented techniques and plan for very long-term maintenance and evolution.
- Detector simulation is a separate (very big) project.
16. Commodity Solutions
- Luckily, computing has also evolved.
- What can we get off the shelf?
- Open Source tools
- Code management (CVS)
- Graphics (Qt, OpenGL)
- Scripting (Python, Perl)
- Commercial products
- Persistency (Objectivity OODB)
- Mathematics (NAG library, CERN edition)
17. HEP Community Developments
- Not everything is being done solely at CERN!
- CLHEP - C++ class libraries for HEP:
- Random numbers
- 3D geometry, vectors, matrices, kinematics (see the sketch below)
- Units and dimensions
- Generic HEP classes (particles, decay chains etc.)
- Generators being moved (slowly) to C++
- The competition: JAS, Open Scientist, ROOT
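To give a flavour of the kinematics such libraries provide, here is a plain-Python sketch (not CLHEP itself) of an invariant-mass calculation from two four-vectors (E, px, py, pz):

```python
# Invariant mass of two four-vectors, in GeV (natural units, c = 1).
import math

def invariant_mass(p1, p2):
    e = p1[0] + p2[0]
    px, py, pz = (p1[i] + p2[i] for i in (1, 2, 3))
    return math.sqrt(max(e**2 - px**2 - py**2 - pz**2, 0.0))

# Two back-to-back 45.6 GeV muons (muon mass neglected):
mu1 = (45.6, 0.0, 0.0, 45.6)
mu2 = (45.6, 0.0, 0.0, -45.6)
print(f"m = {invariant_mass(mu1, mu2):.1f} GeV")  # 91.2 GeV - the Z mass
```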
18. Anaphe C++ Libraries (I)
- Fitting: FML (fitting and minimisation library)
- Flexible, extensible library based on the Gemini engine
- Gemini - core fitting engine based on NAG or MINUIT
- Histograms: HTL (histogram template library)
- Histograms are statistical distributions of measured quantities - the workhorse of HEP analysis. Must be flexible, extensible and very efficient. (A conceptual sketch follows.)
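A conceptual sketch of what a histogram is doing here, in plain Python (illustrative only - HTL is a C++ template library with a different interface):

```python
# Bin a sample of measured values into a fixed-range 1D histogram.
import random

random.seed(7)
NBINS, LO, HI = 50, 80.0, 100.0
bins = [0] * NBINS

def fill(x):
    if LO <= x < HI:                              # ignore out-of-range values
        bins[int((x - LO) / (HI - LO) * NBINS)] += 1

for _ in range(10_000):
    fill(random.gauss(91.0, 2.5))                 # invented mass peak

print(f"{sum(bins)} entries in range; peak bin has {max(bins)}")
```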
19. Anaphe C++ Libraries (II)
- QPlotter: graphics package
- For drawing histograms and more
- Based on Qt (superset of Motif)
- NtupleTag
- Extends the concept of the ntuple (a static table of data) - see the sketch below
- Can add new columns as you work
- Can navigate back to the original events
- Smart clustering of data
- See Zsolt's presentation...
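A sketch of the ntuple idea in plain Python (illustrative, not the NtupleTag interface): a table of per-event quantities, extended with a derived column and filtered with a cut:

```python
# An "ntuple" as a list of per-event records.
import random

random.seed(3)
ntuple = [{"px": random.gauss(0, 5), "py": random.gauss(0, 5)}
          for _ in range(1000)]

# Add a new column as you work: transverse momentum from px, py.
for row in ntuple:
    row["pt"] = (row["px"]**2 + row["py"]**2) ** 0.5

selected = [row for row in ntuple if row["pt"] > 5.0]
print(f"{len(selected)} of {len(ntuple)} events pass the pt > 5 cut")
```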
20. Interactive Analysis
- Analysis in HEP ≈ data mining.
- Extract parameters from large multi-dimensional samples.
- Typical tasks:
- Plot one or more variables with cuts on yet others - exploring the variable space.
- Perform statistical tests on distributions (fitting, moments etc.) - see the example below.
- Produce histograms etc. for papers or talks.
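As a small example of "moments": the first two moments of a sample (mean and standard deviation) in plain Python, on invented data:

```python
# Mean and standard deviation of a measured distribution.
import math
import random

random.seed(11)
sample = [random.gauss(91.0, 2.5) for _ in range(5000)]   # invented data

mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)
print(f"mean = {mean:.2f}, sigma = {math.sqrt(var):.2f}")
```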
21. Interactive Analysis (II)
- Almost all analyses begin as interactive "playing with the data" and progress organically to large, complex, CPU-intensive procedures.
- Step 1: single commands to a script interpreter, e.g. "plot x for all events with y > 5"
- Step 2: multi-command scripts/macros
- Step 3: procedures can be translated into C++ functions and called interactively
- Step 4: the user can build new libraries and interact with them through the command line (etc...)
- The sketch below illustrates the first part of this progression.
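A hedged sketch of Steps 1-2 in Python (the names are illustrative, not Lizard commands): a one-off cut-and-project command hardened into a reusable function:

```python
# Step 1 is typically a one-liner like: plot("x", cut="y > 5")
# Step 2: the same logic as a reusable function.
def select_and_project(events, cut, var):
    """Return var for every event passing cut (a callable)."""
    return [event[var] for event in events if cut(event)]

events = [{"x": i % 10, "y": i % 7} for i in range(100)]  # toy events
xs = select_and_project(events, lambda ev: ev["y"] > 5, "x")
print(f"{len(xs)} events pass the y > 5 cut")
```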
22. Interactive Analysis (III)
- The progression from command line, to macro, to compiled library should be smooth and simple.
- Doing the easy things should be easy, to allow rapid development and prototyping of algorithms.
- Doing complex things then becomes significantly easier than starting from scratch in C++.
- Distributed analysis must also be possible (see Kuba's talk).
23. Lizard (I)
- Interactive environment for data analysis using the other Anaphe components
- First prototype (with limited functionality) available since CHEP 2000
- Re-design started in April 2000
- Beta version: October 2000
- Full version out since June 2001
- Much more work and testing to do, but already approaching (and surpassing) PAW functionality
- Embedded in Python
24. Lizard (II)
- Architecture:
- Everything interacts with everything else through abstract interfaces, so the implementations are hidden.
- Commander C++ classes load the implementation classes at run time and become proxies for them.
- SWIG is used to generate shadow classes from the Commander header files. These are compiled into the Python library and become accessible as new Python objects (the proxying idea is sketched below).
- Swapping components at run time becomes trivial.
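A pure-Python sketch of what a generated shadow class does: forward every call to an underlying implementation chosen at run time. (Illustrative only; real SWIG shadow classes wrap compiled C++ objects.)

```python
class CppHistogramImpl:            # stand-in for a C++ implementation
    def __init__(self):
        self.entries = 0
    def fill(self, x):
        self.entries += 1

class HistogramShadow:             # the Python-side proxy ("shadow")
    def __init__(self, impl):
        self._impl = impl
    def __getattr__(self, name):   # delegate everything to the impl
        return getattr(self._impl, name)

h = HistogramShadow(CppHistogramImpl())
h.fill(3.7)
print(h.entries)                   # 1 - and swapping impls is trivial
```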
25. Lizard Screenshot
26. Behind the Scenes
[Architecture diagram: the user works in Python; SWIG automatically generates shadow classes from the Commander/Controller headers; these drive the C++ (AIDA) interfaces, behind which sit the Anaphe C++ implementations.]
27. AIDA
- Use of abstract interfaces promotes weak coupling between components.
- The AIDA (Abstract Interfaces for Data Analysis) project is extending this to community-wide standard interfaces, which will allow use of C++ components in Java and vice versa.
- Developers only need to learn one way of interacting with a histogram, which works with all compliant implementations (illustrated below).
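A sketch of that abstract-interface idea in Python (simplified, not the real AIDA signatures): user code depends only on the interface, so any compliant implementation can be swapped in.

```python
from abc import ABC, abstractmethod

class IHistogram1D(ABC):                     # invented, AIDA-like interface
    @abstractmethod
    def fill(self, x, weight=1.0): ...
    @abstractmethod
    def mean(self): ...

class ListHistogram(IHistogram1D):           # one possible implementation
    def __init__(self):
        self._data = []                      # (value, weight) pairs
    def fill(self, x, weight=1.0):
        self._data.append((x, weight))
    def mean(self):
        wsum = sum(w for _, w in self._data)
        return sum(x * w for x, w in self._data) / wsum

def analyse(histogram):                      # depends only on the interface
    for x in (1.0, 2.0, 3.0):
        histogram.fill(x)
    print(f"mean = {histogram.mean():.2f}")  # 2.00

analyse(ListHistogram())
```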
28. Summary
- HEP has (and has always had) serious computing requirements.
- The old model (F77 monoliths) is no longer workable in the LHC era.
- New software in C++ and Java uses modern software design to plan for the long term.
- Anaphe is CERN IT Division's contribution:
- Flexible, extensible, modular, efficient
- The LHC is coming and we must be ready!
29. Further information
- More information about the detectors and HEP in general:
- http://cmsinfo.cern.ch
- http://cern.ch/atlas
- CERN IT Division:
- http://cern.ch/IT
- The Anaphe project:
- http://cern.ch/Anaphe