Title: (A Taste of) Data Acquisition, Triggers, and Controls
Gregory Dubois-Felsmann (Caltech), CHEP 2003
Routine disclaimer
- Much interesting material presented
- About 35 talks (13.5 hours, 100 MB as uploaded) from many experiments, covering a rich variety of issues. Many thanks to the speakers!
- Fitting this into 25 minutes requires procedures familiar to the DAQ community: feature extraction
- and, unfortunately, also triggering with a decidedly imperfect data acquisition system
- So I apologize in advance for the things I've missed or perhaps misunderstood!
Outline
- Overview of talks presented
- Some technological themes
- Trigger architectural issues
- The great challenge for the next few years: scaling
- Conclusions
Talks presented
W. Badgett CDF Run II Data Acquisition
R. Rechenmacher Run II DZERO DAQ / Level 3 Trigger System
S. Luitz The BaBar Event Building and Level-3 Trigger Farm Upgrade
R. Itoh Upgrade of Belle DAQ System
A. Polini The Architecture of the ZEUS Micro Vertex Detector DAQ and Second Level Global Track Trigger
J. Schambach STAR TOF Readout Electronics and DAQ
R. Divià Challenging the challenge: Handling data in the Gigabit/s range (ALICE)
J. Gutleber, L. Orsini XDAQ: Real Scale Application Scenarios (CMS et al.)
G. Lehmann The DataFlow of the ATLAS Trigger and Data Acquisition System
S. Stancu The use of Ethernet in the DataFlow of the ATLAS Trigger DAQ
S. Gadomski Experience with multi-threaded C++ applications in the ATLAS DataFlow
Talks presented
A. Ceseracciu A Modular Object Oriented Data Acquisition System for the Gravitational Wave AURIGA experiment
R. Mahapatra Cryogenic Dark Matter Search Remote Controlled DAQ
T. Steinbeck A Software Data Transport Framework for Trigger Applications on Clusters (ALICE)
T. Higuchi Development of PCI Bus Based DAQ Platform for Higher Luminosity Experiments (e.g., Super-Belle)
J. Mans Data Acquisition Software for CMS HCAL Testbeams
B. Lee An Impact Parameter Trigger for DØ
G. Comune The Algorithm Steering and Trigger Decision mechanism of the ATLAS High Level Trigger
V. Boisvert The Region of Interest Strategy for the ATLAS Second Level Trigger
S. Wheeler Supervision of the ATLAS High Level Triggers
Talks presented
- Thursday parallel session 5: DAQ and controls
J. Kowalkowski Understanding and Coping with Hardware and Software Failures in a Very Large Trigger Farm (BTeV)
M. Gulmini Run Control and Monitor System for the CMS Experiment
S. Kolos Online Monitoring Software Framework in the ATLAS Experiment
G. Watts DAQ Monitoring and Auto Recovery at DØ
K. Maeshima Online Monitoring for the CDF Run II Experiment
M. Gonzalez Berges The Joint COntrols Project Framework (CERN multi-expt., LHC et al.)
S. Lüders The Detector Safety System for LHC Experiments
J. Hamilton A Generic Multi-node State Monitoring System (BaBar)
V. Gyurjyan FIPA Agent Based Network Distributed Control System (JLAB)
M. Elsing Configuration of the ATLAS Trigger System
In parallel with session 5a, unfortunately
Talks presented
- Thursday parallel session 5a: first-level triggers
- Related plenary talks
G. Grastveit FPGA Co-processor for the ALICE High Level Trigger
B. Scurlock A 3-D Track-Finding Processor for the CMS Level-1 Muon Trigger
P. Chumney Level-1 Regional Calorimeter System for CMS
F. Meijers The CMS Event Builder
M. Grothe Architecture of the ATLAS High Level Trigger Event Selection Software
Projects represented
- Strong emphasis on LHC, continuing a recent trend (by number of talks)
Some technological themes
- Triumph of C++ for HEP DAQ confirmed
- Along with Java for GUIs
- Triumph of commodity computing hardware (Intel IA-32) and operating system (Linux)
- Large-farms-of-small-boxes model confirmed
- Near-triumph of commodity networking hardware
- Fast and Gigabit Ethernet, standard commercial switches
More technological themes
- Continuation of the long trend of reducing the scope of application of custom hardware
- Yet most DAQ software is still custom in present experiments
- Serious efforts to find what is generic in DAQ programming
- Not just to isolate patterns (knowledge applicable by others) but also actual programming toolkits
- Continuation of the trend of moving offline code and/or frameworks into high-level triggers
- New widespread use of XML for non-event information (configuration, monitoring)
- Many of the major new challenges relate to scaling to huge farms
- Performance
- Operability, control, monitoring
Programming languages
- C++ has to a large extent proven itself
- Some current experiments have DAQ systems written from scratch almost entirely in C++, including real-time code running in an embedded RTOS environment
- E.g., BaBar DataFlow, feature extraction, Level 3 trigger, and the rest of the online system written in serious OO C++ on VxWorks, Solaris, Linux
- Achieves virtually zero deadtime at 5.5 kHz L1 Accept rate on 1997-vintage 300 MHz Motorola SBCs and 1.4 GHz Linux P-IIIs [Luitz]
- New projects are fairly uniformly continuing to adopt it for code in the event data flow path
- Caveats remain, though usually worth the cost
- Executable size
- Dependency management seems to remain a challenge
- Non-trivial work in creating shareables
- Ease with which naïve users can write non-performant code
- Threading requires care (see below)
- Even sees use in hardware trigger FPGA coding (see below) [Scurlock, CMS]
BaBar online and DAQ system
Programming languages II
- Java has emerged as the other major player
- Especially for graphical applications
- E.g., run control GUIs
- Good points
- (Some say) ease of programming vs. C++
- Universal availability, including a rich GUI graphics library
- Simple API for remote object programming (RMI)
- Caveats
- Performance (although results vary considerably)
- JVM quality-of-implementation, platform (non-)independence
- No other real competitors on the horizon except for niche applications
- Saw appearances of Python, LISP, etc.
Computing hardware and operating systems
- Farms of Intel IA-32 / Linux 1- and 2-CPU machines are the coin of the realm today; tomorrow?
- Linux will continue to be!
- Speakers had little to say about CPU chips, except that they will buy the most cost-effective farms they can shortly before each major project goes into final commissioning
- The Intel Itanium line may get some traction by then
- Buy-late is a big Moore's Law win, but see scaling concerns below
- The Linux success is particularly striking
- In use in essentially every HEP role, HLT through laptops
- Even approaching in the embedded world (PCI DAQ component development for Super-KEK-B et al. [Higuchi])
- Still some lingering attachment to other Unix flavors for disk servers (though Linux-based IDE RAID storage is also becoming very common, in the offline world, too)
- Linux is so successful in HEP that cross-platform portability may erode
Linux in the embedded DAQ card world
Multithreading issues
- Language and library level
- Need to stay aware of serialization from locking mechanisms used in outside libraries
- Example: the C++ Standard Library containers' memory pool by default uses a single lock, found to produce a 2x penalty in ATLAS HLT tests [Gadomski]
- O/S level
- Linux is not a real-time operating system
- Still no full implementation of POSIX threads
- Implementation of the pthread yield operation interacts poorly with time slicing in the scheduler (can't reschedule immediately)
- Found to produce a 4x penalty in ATLAS tests [Gadomski]; kernel patch available
- See also under "offline code in the online world" below
Networking
- Commodity networking hardware!
- The various flavors of Ethernet (Fast, Gigabit, and beyond) have become the almost unchallenged fabric of higher-level triggering and event building (one major exception: CMS is still considering Myrinet)
- Standard protocols (TCP, UDP) also ascendant; some efforts to explore raw Ethernet
- All groups seem to be making some of the same discoveries, notably: network switches are not simple, transparent devices!
- Flow control and buffering behavior must be understood in detail
- Vendors can be cagey about the details (proprietary internal architecture)
- Need good tools to monitor traffic behavior
Networking adventures
Offline code in the online world
- Use of offline code in high-level software triggers
- the application framework
- or even offline reconstruction code
- Several current experiments and most future ones are doing this
- Problems
- Dependencies
- Performance: offline code has often not been exposed to the close scrutiny typical for online code, and may have axes of flexibility at odds with high performance (CPU cycles and memory utilization)
- Multithreading: offline code is almost never written to be thread-safe. This presents a problem when thread parallelism is needed in a high-level trigger
- Make a subset of the offline code, and its framework, thread-safe (ATLAS L2)?
- Replace threads with processes and a shared-memory data model (BaBar, D0)?
- The offline event loop model may not be directly usable (BaBar, D0, ATLAS)
- Benefits
- Greatly simplifies incorporation of trigger algorithms in simulations
- Eases development and validation of trigger algorithms
Genericity patterns and products
- We have always noticed: "The same ideas keep coming up, and the same problems have to be solved over and over again."
- We have always thought: "There must be something to be gained from applying that knowledge. Can generic problems be solved with generic tools?"
- We have tried in various ways
- Identifying patterns, learning how to think about these common problems, building up expert knowledge that can be applied to the next experiment, learning lessons
- That's what CHEP is all about
- But we also aspire to reuse: applying a software product in more than one place
Reusing concrete products
- Sounds great; has a mixed history
- In some places this has come to work well
- CERNLIB, GEANT4, ROOT are ubiquitous in HEP
- But there are real obstacles, chief among them:
- The difficulty of sharing code bases between experiments in different phases of development (example: the divergence of the BaBar and CDF versions of their originally shared application framework)
- The devil is in the details: often the high-level features of a system seem generic (the patterns), but the implementation picks up experiment-specific features
- Sometimes this is because of concern with compatibility with historical code
- Sometimes it arises when the high-level architecture turns out to need to be driven by some low-level optimization
- Perhaps in principle the high-level design could be extended to cover both users' needs, but the press of deadlines favors a quick hack that doesn't require renegotiation
Reuse in the online environment
- Reuse has been perhaps less successful on average in the online and DAQ worlds.
- Often online code is prepared later in the construction of an experiment (since simulation/reconstruction code is usually already required at the proposal stage), and thus under more time pressure.
- Online code tends to require more low-level optimizations. Often these come with serious tradeoffs that can limit the flexibility of a design, even to its in-house users.
- But perhaps the next round of experiments presents a rare opportunity to do better
- The LHC experiments are on the same time scale, and they still have a fair amount of time left.
- The use of a common language, O/S, and networking environment helps.
- There are some interesting projects under way!
Quest for generic online software
- Data acquisition
- XDAQ (arising from the CMS project) [Gutleber/Orsini]
- So far mostly being used to provide a common platform for several subdetectors' commissioning DAQ systems and to ease their integration into the main DAQ
- Exploring collaboration with other experiments
- Performance seems good.
- How CMS-free can it be kept over time, though?
Generic software for online
- Control and monitoring frameworks
- Lots of projects in this area; a couple of examples
- (see Thursday program for more)
- Inherently fault-tolerant architectures [Kowalkowski]
- Motivated by BTeV, but is at a very generic level, with CS collaborators viewing it as a general research project
- In early stages, but worth watching
- Generic monitoring frameworks
- Example: D0's XML-based distributed monitoring [Watts]
XML
- A fairly new trend: XML is cropping up all over in online configuration and monitoring applications
- Perhaps surprising? A not-very-compact, textual representation!
- But we are willing to spend some CPU and network bandwidth here (since other things in the systems require so much more)
- Benefits
- Avoids the "private toy language" problem, when combined with scripting tools
- No more hand-written "run card"-type parsers
- Easy to parse in many languages, and thus to pick an appropriate (and perhaps different) language for generating the data and for applying it
- Very easy to transmit over a network (byte stream)
- Aids in a) using existing generic tools (editors, validators) and b) allowing new tools we build to be more generic within HEP
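For illustration only, a run-configuration document in this style might look like the following (all element and attribute names are invented, not taken from any experiment):

```xml
<!-- Hypothetical run-configuration fragment: replaces a hand-parsed
     "run card" with a document any language can generate and validate. -->
<runConfig run="12345">
  <trigger level="3" prescale="1" table="physics_v7"/>
  <farm nodes="120" image="l3trig-2.4.1"/>
  <monitoring interval="10s" sink="monitor-collector"/>
</runConfig>
```

The cross-language benefit noted above is exactly this: the same document can be produced by a script, checked by a generic validator, and consumed by a C++ reader, with no private parser in sight.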
Triggering scope
Triggering
- Far too much information presented to cover in detail
- Remarkable things can now be done in commodity CPUs
- E.g., the ZEUS second-level tracking trigger / global tracking trigger [Polini]
- Silicon vertex tracking information is becoming absorbed into tracking triggers ahead of "Level 3"
- ZEUS
- Fermilab (for B physics efficiency) [Lee, D0 hardware]
ZEUS software tracking trigger
Hardware triggers
- Still indispensable at the first level
- Used in some places as adjuncts to the second level
- Good progress on testing and production for LHC experiments
- Scurlock: generating VHDL from C++ code eases production of a highly accurate board-level simulation of the trigger
B. Lee An Impact Parameter Trigger for DØ
G. Grastveit FPGA Co-processor for the ALICE High Level Trigger
B. Scurlock A 3-D Track-Finding Processor for the CMS Level-1 Muon Trigger
P. Chumney Level-1 Regional Calorimeter System for CMS
The CMS / ATLAS choice
F. Meijers The CMS Event Builder
G. Lehmann The DataFlow of the ATLAS Trigger and Data Acquisition System
V. Boisvert The Region of Interest Strategy for the ATLAS Second Level Trigger
- and many other talks on configuration and other details
- CMS baseline
- Build full events at the output of Level 1 (100 kHz, 1 MB events)
- Risk: this is a lot of data to handle. Able to fall back to a partial-readout Level 2 model
- ATLAS baseline
- L2 trigger operates on ROIs: nominally 2% of event data at the output of Level 1 (75 kHz, 1 MB events, 20 kB ROI data)
- Full event build at the L2 accept rate of 1 kHz, sent to the Event Filter (EF) farm
- Risk: not yet completely clear that small ROIs provide enough information. Able to shift the boundary between L2 and EF somewhat
- Both experiments finding present or readily foreseeable technology adequate
- at least at the level of individual subsystems; full-scale end-to-end tests beginning
Scaling
- Many issues remain to be confronted fully in building systems with many thousands of CPUs!
- Fault tolerance
- Overseeing huge, constantly-changing collections of active entities (too many for direct human oversight)
- Performance issues
- Image activation
- Calibration constant loading
- Configuration
- Global knowledge updates required to keep the system coherent
- File server and/or database contention?
An exotic tidbit
- A familiar reassurance to nervous newcomers: "Go ahead, type whatever you want; the worst that can happen is that we might have to reboot the computer."
- A cautionary tale from CDF
- Observed unexpected losses of silicon detector readout channels
- Proposed explanation: vibrational resonances due to Lorentz forces on digital power lines to the front-end chips
- Limits trigger rates
- Probably related to high deadtime setting up steady patterns
- Test stand results: simulating overloading the DAQ system, within a magnetic field
- Net result: physical damage can be caused by changing the trigger configuration!!!
(Figure: wire bonds up close, showing a good bond and a bond broken after enduring vibrational-resonance stress)
Regretfully omitted
- Developments in front-end DAQ electronics
- STAR TOF readout [Schambach]
- Importance of development tools
- Network performance monitoring (many)
- Thread debugging [Gadomski]
Conclusions
- Trigger/DAQ systems have kept up well with the demands of doing physics with the present generation of experiments
- Many new technologies and ideas have made this possible
- But we are entering an entirely new regime with experiments of the LHC scale, and must take care that we are not overwhelmed by complexity
- We should try hard to find ways to get realistic advance looks at systems integration and scaling before the last-minute bulk hardware buys
- Very interesting research and reduction-to-practice work lies ahead in the next few years; looking forward especially to CHEP 2003 + 2!