Title: The LHCb Way of Computing
1The LHCb Way of Computing: The approach to its organisation and development
John Harvey, CERN/LHCb
DESY Seminar, Jan 15th, 2001
2Talk Outline
- Brief introduction to the LHCb experiment
- Requirements on data rates and CPU capacities
- Scope and organisation of the LHCb Computing Project
- Importance of reuse and a unified approach
- Data processing software
- Importance of architecture-driven development and software frameworks
- DAQ system
- Simplicity and maintainability of the architecture
- Importance of industrial solutions
- Experimental Control System
- Unified approach to controls
- Use of commercial software
- Summary
3Overview of LHCb Experiment
4The LHCb Experiment
- Special purpose experiment to measure precisely CP asymmetries and rare decays in B-meson systems
- Operating at the most intensive source of Bu, Bd, Bs and Bc, i.e. the LHC at CERN
- LHCb plans to run with an average luminosity of 2x10^32 cm^-2 s^-1
  - Events dominated by single pp interactions - easy to analyse
  - Detector occupancy is low
  - Radiation damage is reduced
- High performance trigger based on
  - High pT leptons and hadrons (Level 0)
  - Detached decay vertices (Level 1)
- Excellent particle identification for charged particles
  - K/π separation for 1 GeV/c < p < 100 GeV/c
5The LHCb Detector
- At high energies b- and b̄-hadrons are produced in the same forward cone
- Detector is a single-arm spectrometer with one dipole
- θmin = 15 mrad (beam pipe and radiation)
- θmax = 300 mrad (cost optimisation)
Polar angles of b- and b̄-hadrons calculated using PYTHIA
6LHCb Detector Layout
8Typical Interesting Event
9The LHCb Collaboration
49 institutes, 513 members
10LHCb in numbers
- Expected rate from inelastic p-p collisions is 15 MHz
- Total b-hadron production rate is 75 kHz
- Branching ratios of interesting channels range between 10^-5 and 10^-4, giving an interesting physics rate of 5 Hz
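As a rough cross-check of these numbers, the sketch below multiplies the quoted b-hadron production rate by the two ends of the quoted branching-ratio range. It is only a back-of-envelope estimate: it ignores detector acceptance and trigger efficiency, which the slide does not give, and the function name is made up for illustration.

```python
# Back-of-envelope check of the rates quoted on the slide.
INELASTIC_RATE_HZ = 15e6       # inelastic p-p collision rate (from the slide)
B_HADRON_RATE_HZ = 75e3        # total b-hadron production rate (from the slide)
BR_LOW, BR_HIGH = 1e-5, 1e-4   # branching-ratio range (from the slide)


def interesting_rate_hz(branching_ratio: float) -> float:
    """Rate of b-hadrons decaying into a channel with this branching ratio.

    Hypothetical helper: a plain product, with no acceptance or efficiency.
    """
    return B_HADRON_RATE_HZ * branching_ratio


b_fraction = B_HADRON_RATE_HZ / INELASTIC_RATE_HZ
rate_low = interesting_rate_hz(BR_LOW)
rate_high = interesting_rate_hz(BR_HIGH)

print(f"fraction of inelastic collisions producing b-hadrons: {b_fraction:.2%}")
print(f"interesting rate between {rate_low:.2f} Hz and {rate_high:.2f} Hz")
```

The simple product gives a range of a fraction of a hertz to a few hertz, which brackets the 5 Hz quoted on the slide.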
11Timescales
- LHCb experiment approved in September 1998
- Construction of each component scheduled to start after approval of corresponding Technical Design Report (TDR)
  - Magnet, Calorimeter and RICH TDRs submitted in 2000
  - Trigger and DAQ TDRs expected January 2002
  - Computing TDR expected December 2002
- Expect nominal luminosity (2x10^32 cm^-2 s^-1) soon after LHC turn-on
  - Exploit physics potential from day 1
  - Smooth operation of the whole data acquisition and data processing chain will be needed very quickly after turn-on
- Locally tuneable luminosity → long physics programme
  - Cope with long life-cycle of 15 years
12LHCb Computing Scope and Organisation
13Requirements and Resources
- More stringent requirements
  - Enormous number of items to control - scalability
  - Inaccessibility of detector and electronics during data taking - reliability
  - Intense use of software in triggering (Levels 1, 2, 3) - quality
  - Many orders of magnitude more data and CPU - performance
- Experienced manpower very scarce
  - Staffing levels falling
  - Technology evolving very quickly (hardware and software)
  - Rely very heavily on very few experts (1 or 2) - bootstrap approach
- The problem - a more rigorous approach is needed, but this is more manpower intensive and must be undertaken under conditions of dwindling resources
14Importance of Reuse
- Put extra effort into building high quality components
- Become more efficient by extracting more use out of these components (reuse)
- Many obstacles to overcome
  - too broad functionality / lack of flexibility in components
  - proper roles and responsibilities not defined (e.g. architect)
  - organisational - reuse requires a broad overview to ensure unified approach
    - we tend to split into separate domains each independently managed
  - cultural
    - don't trust others to deliver what we need
    - fear of dependency on others
    - fail to share information with others
    - developers fear loss of creativity
- Reuse is a management activity - need to provide the right organisation to make it happen
15Traditional Project Organisation
DAQ Hardware
Message System
16A Process for reuse
- Manage: plan, initiate, track, coordinate; set priorities and schedules; resolve conflicts
- Build: develop architectural models; choose integration standards; engineer reusable components
- Support: support development; manage and maintain components; validate, classify, distribute; document, give feedback
- Assemble: design application; find and specialise components; develop missing components; integrate components
Requirements (Existing software and hardware)
Systems
17LHCb Computing Project Organisation
[Organisation chart: Technical Review, National Computing Board and Computing Steering Group, with the project activities grouped under Manage, Assemble, Build and Support]
18Data Processing Software
19Software architecture
- Definition of software architecture [1]
  - Set of significant decisions about the organization of the software system
  - Selection of the structural elements and their interfaces which compose the system
  - Their behavior - collaboration among the structural elements
  - Composition of these structural and behavioral elements into progressively larger subsystems
  - The architectural style that guides this organization
- The architecture is the blue-print (architecture description document)
[1] I. Jacobson et al., The Unified Software Development Process, Addison-Wesley, 1999
20Software Framework
- Definition of software framework [2,3]
  - A kind of micro-architecture that codifies a particular domain
  - Provides the suitable knobs, slots and tabs that permit clients to customise it for specific applications within a given range of behaviour
- A framework realizes an architecture
- A large O-O system is constructed from several cooperating frameworks
- The framework is real code
- The framework should be easy to use and should provide a lot of functionality
[2] G. Booch, Object Solutions, Addison-Wesley, 1996
[3] E. Gamma et al., Design Patterns, Addison-Wesley, 1995
21Benefits
- Having an architecture and a framework
  - Common vocabulary, better specifications of what needs to be done, better understanding of the system.
  - Low coupling between concurrent developments. Smooth integration. Organization of the development.
  - Robustness, resilient to change (change-tolerant).
  - Fostering code re-use
architecture
framework
applications
22What's the scope?
- Each LHC experiment needs a framework to be used in their event data processing applications
  - physics/detector simulation
  - high level triggers
  - reconstruction
  - analysis
  - event display
  - data quality monitoring, ...
- The experiment framework will incorporate other frameworks: persistency, detector description, event simulation, visualization, GUI, etc.
23Software Structure
- Applications: built on top of frameworks and implementing the required physics algorithms (Reconstruction, Simulation, High level triggers, Analysis)
- Frameworks and Toolkits: one main framework; various specialized frameworks (visualization, persistency, interactivity, simulation, etc.)
- Foundation Libraries: a series of basic libraries widely used (STL, CLHEP, etc.)
24GAUDI Object Diagram
[Object diagram. Components shown:]
- Application Manager, Event Selector, Algorithms, Converters
- Message Service, JobOptions Service, Particle Properties Service, other services
- Event Data Service with Transient Event Store, Persistency Service and data files
- Detector Data Service with Transient Detector Store, Persistency Service and data files
- Histogram Service with Transient Histogram Store, Persistency Service and data files
25GAUDI Architecture Design Criteria
- Clear separation between data and algorithms
- Three basic types of data: event, detector, statistics
- Clear separation between persistent and transient data
- Computation-centric architectural style
- User code encapsulated in few specific places: algorithms and converters
- All components with well defined interfaces and as generic as possible
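To make these criteria concrete, here is a minimal sketch in Python of how data, algorithms and converters can be kept apart. It is not the GAUDI API (GAUDI itself is written in C++). All class names, method names and store paths below are assumptions chosen for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Hit:
    """Pure data object: no algorithmic code lives here."""
    x: float
    energy: float


class TransientStore:
    """In-memory store addressed by path; algorithms exchange data only through it."""

    def __init__(self):
        self._objects = {}

    def register(self, path, obj):
        if path in self._objects:
            raise KeyError(f"{path} is already registered")
        self._objects[path] = obj

    def retrieve(self, path):
        return self._objects[path]


class Converter(ABC):
    """Turns a persistent representation into transient objects."""

    @abstractmethod
    def to_transient(self, record): ...


class HitConverter(Converter):
    def to_transient(self, record):
        # The "persistent" form is assumed to be a list of (x, energy) tuples.
        return [Hit(x, energy) for x, energy in record]


class Algorithm(ABC):
    """User code goes here; it sees only the transient store."""

    @abstractmethod
    def execute(self, store): ...


class EnergySum(Algorithm):
    def execute(self, store):
        hits = store.retrieve("/Event/Hits")
        store.register("/Event/TotalEnergy", sum(hit.energy for hit in hits))


def process_event(record, converter, algorithms):
    """Fill a fresh transient store from the record, then run the algorithms in order."""
    store = TransientStore()
    store.register("/Event/Hits", converter.to_transient(record))
    for algorithm in algorithms:
        algorithm.execute(store)
    return store


store = process_event([(0.1, 2.0), (0.4, 3.5)], HitConverter(), [EnergySum()])
print(store.retrieve("/Event/TotalEnergy"))
```

In this sketch the algorithm never touches the persistent format, and the converter never does any physics, so either side can be replaced without changing the other.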
26Status
- Sept 98 - project started, GAUDI team assembled
- Nov 25 98 - 1-day architecture review
  - goals, architecture design document, URD, scenarios
  - chair, recorder, architect, external reviewers
- Feb 8 99 - GAUDI first release (v1)
  - first software week with presentations and tutorial sessions
  - plan for second release
  - expand GAUDI team to cover new domains (e.g. analysis toolkits, visualisation)
- Nov 00 - GAUDI v6
- Nov 00 - BRUNEL v1
  - New reconstruction program based on GAUDI
  - Supports C++ algorithms (tracking) and wrapped FORTRAN
  - FORTRAN gradually being replaced
27Collaboration with ATLAS
- Now ATLAS also contributing to the development of GAUDI
  - Open-Source style, experiment-independent web and release area
- Other experiments are also using GAUDI
  - HARP, GLAST, OPERA
- Since we cannot provide all the functionality ourselves, we rely on contributions from others
  - Examples: scripting interface, data dictionaries, interactive analysis, etc.
- Encouragement to put more quality into the product
- Better testing in different environments (platforms, domains, ...)
- Shared long-term maintenance
- Gaudi developers mailing list
  - tilde-majordom.home.cern.ch/majordom/news/gaudi-developers/index.html
28Data Acquisition System
29Trigger/DAQ Architecture
30Event Building Network
- Requirements
  - 6 GB/s sustained bandwidth
  - Scalable
  - 120 inputs (RUs)
  - 120 outputs (SFCs)
  - commercial and affordable (COTS, Commodity?)
- Readout Protocol
  - Pure push-through protocol of complete events to one CPU of the farm
  - Destination assignment following identical algorithm in all RUs (belonging to one partition) based on event number
- Simple hardware and software
- No central control → perfect scalability
- Full flexibility for high-level trigger algorithms
- Larger bandwidth needed (50%) compared with phased event-building
- Avoiding buffer overflows via throttle to trigger
- Only static load balancing between RUs and SFCs
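The destination assignment can be illustrated with a small sketch. The slide says only that every RU applies an identical algorithm based on the event number. The round-robin (modulo) rule below is an assumption made for illustration, not the rule actually used, and the function name is hypothetical.

```python
from collections import Counter

N_RU = 120                           # readout units (from the slide)
N_SFC = 120                          # sub-farm controllers (from the slide)
TOTAL_BANDWIDTH_BYTES_PER_S = 6e9    # sustained bandwidth (from the slide)


def destination_sfc(event_number: int, n_sfc: int = N_SFC) -> int:
    """Assumed rule: plain round-robin on the event number."""
    return event_number % n_sfc


events = range(1000)

# Every RU evaluates the same pure function on the same event number,
# so all fragments of an event go to the same SFC with no central control.
per_ru_choices = [[destination_sfc(evt) for evt in events] for _ in range(N_RU)]
all_agree = all(choices == per_ru_choices[0] for choices in per_ru_choices)

# Static load balancing: count how many events each SFC receives.
load = Counter(per_ru_choices[0])
imbalance = max(load.values()) - min(load.values())

# Average bandwidth per output link if the load is spread evenly.
per_link_mb_per_s = TOTAL_BANDWIDTH_BYTES_PER_S / N_SFC / 1e6

print(all_agree, imbalance, per_link_mb_per_s)
```

Because the rule depends only on the event number, it cannot react to a slow or busy SFC. That is the "only static load balancing" limitation listed above, and it is why a throttle to the trigger is needed to avoid buffer overflows.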
31Readout Unit using Network Processors
- IBM NP4GS3
  - 4 x 1 Gb full duplex Ethernet MACs
  - 16 RISC processors @ 133 MHz
  - Up to 64 MB external RAM
  - Used in routers
- RU Functions
  - EB and formatting
  - 7.5 µs/event
  - 200 kHz event rate
32Sub Farm Controller (SFC)
- Alteon Tigon 2
  - Dual R4000-class processor running at 88 MHz
  - Up to 2 MB memory
  - GigE MAC + link-level interface
  - PCI interface
- 90 kHz event fragments/s
- Development environment
  - GNU C cross compiler with few special features to support the hardware
  - Source-level remote debugger
[Block diagram of an SFC built from a standard PC: smart NIC on the Readout Network (GbE), NIC on the Subfarm Network (GbE), control NIC on the Controls Network (FEth), PCI bus, PCI bridge, local bus, CPU and memory. Data rates shown: 50 MB/s and 0.5 MB/s.]
33Control Interface to Electronics
- Select a reduced number of solutions to interface front-end electronics to LHCb's control system
  - No radiation (counting room): Ethernet to credit-card PC on modules
  - Low level radiation (cavern): 10 Mbit/s custom serial LVDS twisted pair; SEU-immune antifuse-based FPGA interface chip
  - High level radiation (inside detectors): CCU control system made for CMS tracker; radiation hard, SEU immune, bypass
- Provide support (HW and SW) for the integration of the selected solutions
34Experiment Control System
35Control and Monitoring
[Trigger/DAQ data-flow diagram, with control and monitoring of all stages over a LAN]
- LHC-B Detector (VDET, TRACK, ECAL, HCAL, MUON, RICH): 40 MHz, 40 TB/s
- Level 0 Trigger: fixed latency 4.0 µs, output 1 MHz, 1 TB/s
- Level 1 Trigger: variable latency < 1 ms, output 40 kHz
- Level-0 and Level-1 front-end electronics, with Timing and Fast Control distributing the L0 and L1 decisions
- Front-End Multiplexers (FEM) and front-end links: 6 GB/s
- Read-out Units (RU), with a throttle line back to the trigger
- Read-out Network (RN): 6 GB/s
- Sub-Farm Controllers (SFC)
- Trigger Levels 2 and 3 (event filter) on the CPU farm: variable latency, L2 10 ms, L3 200 ms
- Storage: 50 MB/s
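The rates and bandwidths in this diagram imply an average data volume per event at each stage. The sketch below does that division. The pairing of each rate with a bandwidth (40 MHz with 40 TB/s, 1 MHz with 1 TB/s, 40 kHz with 6 GB/s) is read off the diagram and should be treated as an assumption.

```python
# (stage name, event rate in Hz, bandwidth in bytes/s); pairings assumed from the diagram
stages = [
    ("detector output", 40e6, 40e12),
    ("after Level 0", 1e6, 1e12),
    ("after Level 1", 40e3, 6e9),
]

# Average bytes per event at each stage = bandwidth / rate.
event_size_bytes = {name: bandwidth / rate for name, rate, bandwidth in stages}

# Rate reduction achieved by each hardware trigger level.
l0_reduction = stages[0][1] / stages[1][1]
l1_reduction = stages[1][1] / stages[2][1]

for name, size in event_size_bytes.items():
    print(f"{name}: {size / 1e3:.0f} kB per event")
print(f"Level 0 rate reduction: {l0_reduction:.0f}x")
print(f"Level 1 rate reduction: {l1_reduction:.0f}x")
```

Under these assumptions the data volume per event drops from about 1 MB before Level 1 to about 150 kB on the read-out network. That is consistent with the digitization and zero suppression done in the front-end electronics, although the slide does not state the event size explicitly.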
36Experimental Control System
- The Experiment Control System will be used to control and monitor the operational state of the detector, of the data acquisition and of the experimental infrastructure.
- Detector controls
  - High and low voltages
  - Crates
  - Cooling and ventilation
  - Gas systems etc.
  - Alarm generation and handling
- DAQ controls
  - RUN control
  - Setup and configuration of all readout components (FE, Trigger, DAQ, CPU Farm, Trigger algorithms, ...)
37System Requirements
- Common control services across the experiment
  - System configuration services: coherent information in database
  - Distributed information system: control data archival and retrieval
  - Error reporting and alarm handling
  - Data presentation: status displays, trending tools etc.
  - Expert system to assist shift crew
- Objectives
  - Easy to operate: 2/3 shift crew to run complete experiment
  - Easy to adapt to new conditions and requirements
- Implies integration of DCS with the control of DAQ and data quality monitoring
38Integrated System trending charts
DAQ
Slow Control
39Integrated system error logger
ALEPH error logger, ERRORS MONITOR
ALARM
2-JUN 1130 ALEP R_ALEP_0 RUNC_DAQ ALEPH>> DAQ Error
2-JUN 1130 ALEP TPEBAL MISS_SOURCE TPRP13 <1_missing_Source(s)>
2-JUN 1130 ALEP TS TRIGGERERROR Trigger protocol error(TMO_Wait_No_Busy)
2-JUN 1130 TPC SLOWCNTR SECTR_VME VME CRATE fault in SideA Low
DAQ
Slow Control
40Scale of the LHCb Control system
- Parameters
  - Detector Control: O(10^5) parameters
  - FE electronics: few parameters x 10^6 readout channels
  - Trigger and DAQ: O(10^3) DAQ objects x O(10^2) parameters
- Implies a high level description of control components (devices/channels)
- Infrastructure
  - 100-200 Control PCs
  - Several hundred credit-card PCs
  - By itself a sizeable network (Ethernet)
41LHCb Controls Architecture
[Layered controls architecture diagram]
- Supervision layer: SCADA; users and servers on WAN/LAN; storage for configuration DB, archives, log files
- Process management layer: controllers/PLCs and VME connected over LAN (OPC, communication); links to other systems (LHC, Safety, ...)
- Field management layer: fieldbuses, PLCs, devices, experimental equipment
42Supervisory Control And Data Acquisition
- Used virtually everywhere in industry, including very large and mission critical applications
- Toolkit including
  - Development environment
  - Set of basic SCADA functionality (e.g. HMI, Trending, Alarm Handling, Access Control, Logging/Archiving, Scripting, etc.)
  - Networking/redundancy management facilities for distributed applications
- Flexible Open Architecture
  - Multiple communication protocols supported
  - Support for major Programmable Logic Controllers (PLCs) but not VME
  - Powerful Application Programming Interface (API)
  - Open Database Connectivity (ODBC)
  - OLE for Process Control (OPC)
43Benefits/Drawbacks of SCADA
- Standard framework → homogeneous system
- Support for large distributed systems
- Buffering against technology changes, operating systems, platforms, etc.
- Saving of development effort (50-100 man-years)
- Stability and maturity available immediately
- Support and maintenance, including documentation and training
- Reduction of work for the end users
- Not tailored exactly to the end application
- Risk of company going out of business
- Company's development of unwanted features
- Have to pay
44Commercial SCADA system chosen
- Major evaluation effort
  - technology survey looked at 150 products
- PVSS II chosen from an Austrian company (ETM)
  - Device oriented, Linux and NT support
- The contract foresees
  - Unlimited usage by members of all institutes participating in LHC experiments
  - 10 years maintenance commitment
  - Training provided by company - to be paid by institutes
- Licenses available from CERN from October 2000
- PVSS II will be the basis for the development of the control systems for all four LHC experiments (Joint COntrols Project)
45Controls Framework
- LHCb aims to distribute with the SCADA system a framework
  - Reduce to a minimum the work to be performed by the sub-detector teams
  - Ensure work can be easily integrated despite being performed in multiple locations
  - Ensure a consistent and homogeneous DCS
- Engineering tasks for framework
  - Definition of system architecture (distribution of functionality)
  - Model standard device behaviour
  - Development of configuration tools
  - Templates, symbols libraries, e.g. power supply, rack, etc.
  - Support for system partitioning (uses FSM)
  - Guidelines on use of colours, fonts, page layout, naming, ...
  - Guidelines for alarm priority levels, access control levels, etc.
- First prototype released end 2000
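The idea of modelling standard device behaviour with finite state machines, and using the same tree for partitioning, can be sketched as follows. This is not PVSS II or framework code. The state names, command names and the `included` flag are assumptions, and the node names are borrowed loosely from the application architecture slide.

```python
class ControlNode:
    """A node in a control tree: leaves are devices, inner nodes summarise children."""

    # Assumed standard device behaviour; names are illustrative only.
    TRANSITIONS = {
        ("NOT_READY", "configure"): "READY",
        ("READY", "start"): "RUNNING",
        ("RUNNING", "stop"): "READY",
        ("READY", "reset"): "NOT_READY",
    }

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.included = True        # False = partitioned out of the parent
        self._state = "NOT_READY"

    @property
    def state(self):
        """Leaf: own state. Inner node: common state of included children, else MIXED."""
        if not self.children:
            return self._state
        states = {child.state for child in self.children if child.included}
        if not states:
            return "EXCLUDED"
        return states.pop() if len(states) == 1 else "MIXED"

    def command(self, action):
        """Leaf: apply the transition if allowed, otherwise ignore it.
        Inner node: forward the command to included children only."""
        if not self.children:
            self._state = self.TRANSITIONS.get((self._state, action), self._state)
            return
        for child in self.children:
            if child.included:
                child.command(action)


vertex = ControlNode("Vertex", [ControlNode("Vertex-HV"), ControlNode("Vertex-FE")])
muon = ControlNode("Muon", [ControlNode("Muon-HV"), ControlNode("Muon-FE")])
ecs = ControlNode("ECS", [vertex, muon])

ecs.command("configure")
ecs.command("start")
state_all_running = ecs.state

muon.included = False            # partition Muon out for standalone work
ecs.command("stop")              # only reaches Vertex
state_after_partition = (ecs.state, vertex.state, muon.state)

muon.command("stop")             # Muon is operated on its own
print(state_all_running, state_after_partition, muon.state)
```

Commands flow down the tree and states are summarised upwards, so a sub-detector team only has to model its own leaves. Excluding a subtree lets that team run it independently without disturbing the rest of the experiment.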
46Application Architecture
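[Control hierarchy diagram: ECS at the top, connected to LHC, DCS, DAQ and SAFETY. DCS and DAQ each branch into Vertex, Tracker and Muon. Leaf nodes under DCS include GAS, HV and Temp; leaf nodes under DAQ are FE and RU.]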
47Run Control
48Summary
- Organisation has important consequences for cohesion, maintainability, manpower needed to build system
- Architecture-driven development maximises common infrastructure and results in systems more resilient to change
- Software frameworks maximise level of reuse and simplify distributed development by many application builders
- Use of industrial components (hardware and software) can reduce development effort significantly
- DAQ is designed with simplicity and maintainability in mind
- Maintain a unified approach, e.g. same basic infrastructure for detector controls and DAQ controls
49Extra Slides
51Typical Interesting Event
53LHCb Collaboration
France: Clermont-Ferrand, CPPM Marseille, LAL Orsay
Germany: Tech. Univ. Dresden, KIP Univ. Heidelberg, Phys. Inst. Univ. Heidelberg, MPI Heidelberg
Italy: Bologna, Cagliari, Ferrara, Firenze, Frascati, Genova, Milano, Univ. Roma I (La Sapienza), Univ. Roma II (Tor Vergata)
Netherlands: NIKHEF
Poland: Cracow Inst. Nucl. Phys., Warsaw Univ.
Spain: Univ. Barcelona, Univ. Santiago de Compostela
Switzerland: Univ. Lausanne, Univ. Zürich
UK: Univ. Bristol, Univ. Cambridge, Univ. Edinburgh, Univ. Glasgow, IC London, Univ. Liverpool, Univ. Oxford, RAL
CERN
Brazil: UFRJ
China: IHEP (Beijing), Tsinghua Univ. (Beijing)
Romania: IFIN-HH Bucharest
Russia: BINP (Novosibirsk), INR, ITEP, Lebedev Inst., IHEP, PNPI (Gatchina)
Ukraine: Inst. Phys. Tech. (Kharkov), Inst. Nucl. Research (Kiev)
54Requirements on Data Rates and Computing Capacities
55LHCb Technical Design Reports
- Submitted January 2000; recommended by LHCC March 2000; approved by RB April 2000
- Submitted September 2000; recommended November 2000
- Submitted September 2000; recommended November 2000
56Defining the architecture
- Issues to take into account
- Object persistency
- User interaction
- Data visualization
- Computation
- Scheduling
- Run-time type information
- Plug-and-play facilities
- Networking
- Security
57Architectural Styles
- General categorization of systems [2]
  - user-centric: focus on the direct visualization and manipulation of the objects that define a certain domain
  - data-centric: focus upon preserving the integrity of the persistent objects in a system
  - computation-centric: focus is on the transformation of objects that are interesting to the system
- Our applications have elements of all three. Which one dominates?
58Getting Started
- First crucial step was to appoint an architect - ideally skills as
  - OO mentor, domain specialist, leadership, visionary
- Started with small design team (6 people), including
  - developers, librarian, use case analyst
- Control activities through visibility and self discipline
  - meet regularly - in the beginning every day, now once per week
- Collect URs and scenarios, use to validate the design
- Establish the basic design criteria for the overall architecture
  - architectural style, flow of control, specification of interfaces
59Development Process
- Incremental approach to development
  - new release every few (~4) months
  - software workshop timed to coincide with new release
- Development cycle is user-driven
  - Users define priority of what goes in the next release
  - Ideally they use what is produced and give rapid feedback
  - Frameworks must do a lot and be easy to use
- Strategic decisions taken following thorough review (~1/year)
- Releases accompanied by complete documentation
  - presentations, tutorials
  - URD, reference documents, user guides, examples
60Possible migration strategies
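[Diagram: three possible routes from SICb (Fortran) to Gaudi (C++)]
- Strategy 1: SICb → ? → Gaudi
- Strategy 2: SICb → Gaudi by fast translation of Fortran into C++
- Strategy 3: SICb → Gaudi by wrapping Fortran
- Phases: framework development phase, transition phase, hybrid phase, consolidation phase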
61How to proceed?
- Physics Goal
  - To be able to run new tracking pattern recognition algorithms written in C++ in production with standard FORTRAN algorithms in time to produce useful results for the RICH TDR.
- Software Goal
  - To allow software developers to become familiar with GAUDI and to encourage the development of new software algorithms in C++.
- Approach
  - choose strategy 3
  - start with migration of reconstruction and analysis code
  - simulation will follow later
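The wrapping approach can be illustrated with an adapter: legacy routines are given the same interface as new algorithms, so both run in one sequence. The sketch below is in Python rather than C++ and FORTRAN. The "legacy" routine is a plain stand-in function, and every name in it is hypothetical.

```python
from typing import Callable, Dict, List

Event = Dict[str, List[float]]


def legacy_calo_reco(raw: List[float]) -> List[float]:
    """Stand-in for a legacy routine; a real one would be compiled FORTRAN."""
    return [2.0 * value for value in raw]


class Algorithm:
    """Common interface seen by the framework."""

    def execute(self, event: Event) -> None:
        raise NotImplementedError


class WrappedLegacyAlgorithm(Algorithm):
    """Adapter: copies the input out of the event, calls the legacy routine,
    and writes the result back under a new key."""

    def __init__(self, routine: Callable[[List[float]], List[float]],
                 input_key: str, output_key: str):
        self.routine = routine
        self.input_key = input_key
        self.output_key = output_key

    def execute(self, event: Event) -> None:
        event[self.output_key] = self.routine(list(event[self.input_key]))


class NewTracking(Algorithm):
    """Stand-in for a newly written algorithm using the same interface."""

    def execute(self, event: Event) -> None:
        event["tracks"] = sorted(event["raw_hits"])


# Hybrid phase: wrapped legacy code and new code run in the same sequence.
sequence = [
    WrappedLegacyAlgorithm(legacy_calo_reco, "raw_calo", "calo_clusters"),
    NewTracking(),
]

event: Event = {"raw_hits": [3.0, 1.0, 2.0], "raw_calo": [0.5, 1.5]}
for algorithm in sequence:
    algorithm.execute(event)

print(event["calo_clusters"], event["tracks"])
```

Replacing a wrapped routine later means swapping one entry in the sequence for a native algorithm that writes the same output key. This is how the legacy code can be retired gradually.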
62New Reconstruction Program - BRUNEL
- Benefits of the approach
  - A unified development and production environment
    - As soon as C++ algorithms are proven to do the right thing, they can be brought into production in the official reconstruction program
  - Early exposure of all developers to Gaudi framework
  - Increasing functionality of OO DST
    - As more and more of the event data become available in Gaudi, it will become more and more attractive to perform analysis with Gaudi
  - A smooth transition to a C++-only reconstruction
63Integrated System - databases
The power supply on that VME crate
Readout System Database
Slow Control Database
Detector description
64Frontend Electronics
- Data Buffering for Level-0 latency
- Data Buffering for Level-1 latency
- Digitization and Zero Suppression
- Front-end Multiplexing onto Front-end links
- Push of data to next higher stage of the readout
(DAQ)
65Timing and Fast Control
- Provide common and synchronous clock to all components needing it
- Provide Level-0 and Level-1 trigger decisions
- Provide commands synchronous in all components (Resets)
- Provide Trigger hold-off capabilities in case buffers are getting full
- Provide support for partitioning (Switches, ORs)
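The trigger hold-off can be pictured as a buffer that raises a throttle flag before it overflows. The sketch below is a software illustration only. The high and low thresholds (hysteresis) and all names are assumptions, since the slide does not describe the mechanism in detail.

```python
from collections import deque


class ThrottledBuffer:
    """FIFO that asserts a throttle flag when nearly full (assumed hysteresis)."""

    def __init__(self, capacity: int, high_water: int, low_water: int):
        if not 0 <= low_water < high_water <= capacity:
            raise ValueError("need 0 <= low_water < high_water <= capacity")
        self.capacity = capacity
        self.high_water = high_water
        self.low_water = low_water
        self.throttle = False
        self._queue = deque()

    def _update_throttle(self) -> None:
        occupancy = len(self._queue)
        if occupancy >= self.high_water:
            self.throttle = True      # ask the trigger to hold off
        elif occupancy <= self.low_water:
            self.throttle = False     # enough room again
        # between the two thresholds the flag keeps its previous value

    def push(self, event) -> None:
        if len(self._queue) >= self.capacity:
            raise OverflowError("buffer overflow: throttle was not respected")
        self._queue.append(event)
        self._update_throttle()

    def pop(self):
        event = self._queue.popleft()
        self._update_throttle()
        return event


buf = ThrottledBuffer(capacity=8, high_water=6, low_water=2)

# The trigger offers 20 events while nothing drains the buffer.
accepted = 0
for event_number in range(20):
    if buf.throttle:
        continue                      # trigger held off: event is not accepted
    buf.push(event_number)
    accepted += 1

throttle_when_full = buf.throttle

# Downstream drains four events; occupancy falls to the low-water mark.
drained = [buf.pop() for _ in range(4)]
print(accepted, throttle_when_full, drained, buf.throttle)
```

Setting the high-water mark below the capacity leaves headroom for events already in flight when the throttle is raised. The gap between the two thresholds stops the flag from toggling on every event.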
66IBM NP4GS3
- Features
  - 4 x 1 Gb full duplex Ethernet MACs
  - 16 special purpose RISC processors @ 133 MHz with 2 hw threads each
  - 4 processors (8 threads) share 3 co-processors for special functions
    - Tree search
    - Memory move
    - Etc.
  - Integrated 133 MHz PowerPC processor
  - Up to 64 MB external RAM
67Event Building Network Simulation
- Simulated technology: Myrinet
- Nominal 1.28 Gb/s
- Xon/Xoff flow control
- Switches
- ideal cross-bar
- 8x8 maximum size (currently)
- wormhole routing
- source routing
- No buffering inside switches
- Software used: Ptolemy discrete event framework
- Realistic traffic patterns
- variable event sizes
- event building traffic
68Event Building Activities
- Studied Myrinet
  - Tested NIC event-building
  - Simulated switching fabric of the size suitable for LHCb
  - Results show that switching network could be implemented (provided buffers are added between levels of switches)
- Currently focussing on xGb Ethernet
  - Studying smart NICs (→ Niko's talk)
  - Possible switch configuration for LHCb with today's technology (to be simulated...)
Myrinet Simulation
Multiple paths between sources and destinations!
69Network Simulation Results
Results don't depend strongly on specific technology (Myrinet), but rather on characteristics (flow control, buffering, internal speed, etc.)
FIFO buffers between switching levels allow scalability to be recovered
50% efficiency: "Law of nature" for these characteristics
70Alteon Tigon 2
- Features
  - Dual R4000-class processor running at 88 MHz
  - Up to 2 MB memory
  - GigE MAC + link-level interface
  - PCI interface
- Development environment
  - GNU C cross compiler with few special features to support the hardware
  - Source-level remote debugger
71Controls System
- Common integrated controls system
  - Detector controls
    - High voltage
    - Low voltage
    - Crates
    - Alarm generation and handling
    - etc.
  - DAQ controls
    - RUN control
    - Setup and configuration of all components (FE, Trigger, DAQ, CPU Farm, Trigger algorithms, ...)
- Consistent and rigorous separation of controls and DAQ path
Same system for both functions!
Scale: 100-200 Control PCs, many 100s of credit-card PCs
By itself a sizeable network! Most likely Ethernet