Title: BTeV-RTES Project
BTeV-RTES Project: Very Lightweight Agents (VLAs)
Daniel Mossé, Jae Oh, Madhura Tamhankar, John Gross
Computer Science Department, University of Pittsburgh
BTeV Test Station
Collider detectors are about the size of a small
apartment building. Fermilab's two detectors, CDF
and DZero, are about four stories high, weighing
some 5,000 tons (10 million pounds) each.
Particle collisions occur in the middle of the
detectors, which are crammed with electronic
instrumentation. Each detector has about 800,000
individual pathways for recording electronic data
generated by the particle collisions. Signals are
carried over nearly a thousand miles of wire and
L1/L2/L3 Trigger Overview
System Characteristics
Software Perspective
- Reconfigurable node allocation
- L1 runs one physics application, severely time constrained
- L2/L3 runs several physics applications, with loose time constraints
- Multiple operating systems and differing processors
  - TI DSP BIOS, Linux, Windows?
- Communication among system sections via fast network
- Fault tolerance is essentially absent in embedded and RT systems
L1/L2/L3 Trigger Hierarchy
[Figure: manager hierarchy, with tiers linked by Gigabit Ethernet]
- Global Manager: TimeSys RT Linux, Global Manager VLA
- Regional L1 Manager (1): TimeSys RT Linux, Regional Manager VLA
- Regional L2/L3 Manager (1): TimeSys RT Linux, Regional Manager VLA
- Crate Managers (20): TimeSys RT Linux, Crate Manager VLA
- Farmlet Managers (16): TimeSys RT Linux, Farmlet Manager VLA
- DSPs (8): TI DSP BIOS, Low-Level VLA
- Section Managers (8): RH 8.x Linux, Section Manager VLA
- Linux Nodes (320): RH 8.x Linux, Low-Level VLA
- Data Archive, External Level
Very Lightweight Agents (VLAs)
Proposed Solution: Very Lightweight Agent
- Minimize footprint
- Platform independence
- Monitor hardware
- Monitor software
- Comprehensible source code
- Communication with high-level software entity
- Error prediction
- Error logging and messaging
- Schedule and priorities of test events
VLAs on L1 and L2/3 nodes
[Figure labels: Level 1 Farm Nodes, OS Kernel (DSP BIOS), Physics Application, Network API, L1 Manager Nodes]
VLA Error Reporting
[Figure labels: Level 1/2/3 Manager Nodes, Linux Kernel, Manager Application, Network API, To Network]
VLA Error Prediction
Buffer overflow
1. VLA message or application data input buffers may overflow
2. Messages or data lost in each case
3. Detection through monitoring fill rate and overflow condition
4. High fill rate indicative of a high error rate producing messages, or of undersized data buffers
Throttled CPU
1. Throttled from high temperature
2. Throttled by erroneous power saving feature
3. Causes missed deadlines due to low CPU speed
4. Potentially critical failure if L1 data not processed fast enough
Note that the CPU may be throttled on purpose
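The two prediction cases above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the names `BufferSample`, `predict_overflow`, and `throttle_alarm`, the prediction horizon, and the 90% clock tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BufferSample:
    time_s: float   # sample timestamp, in seconds
    used: int       # entries currently held in the buffer
    capacity: int   # total buffer capacity, in entries


def fill_rate(prev, curr):
    """Entries gained per second between two samples (negative when draining)."""
    dt = curr.time_s - prev.time_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr.used - prev.used) / dt


def predict_overflow(prev, curr, horizon_s):
    """True if the buffer is full now or projected to fill within horizon_s."""
    if curr.used >= curr.capacity:
        return True  # overflow condition already reached
    rate = fill_rate(prev, curr)
    if rate <= 0:
        return False  # buffer is stable or draining
    seconds_left = (curr.capacity - curr.used) / rate
    return seconds_left <= horizon_s


def throttle_alarm(current_mhz, nominal_mhz, intentional=False, tolerance=0.9):
    """True if the CPU runs below tolerance * nominal and this was not requested.

    `intentional` models the note that the CPU may be throttled on purpose.
    """
    if intentional:
        return False
    return current_mhz < tolerance * nominal_mhz
```

A low-level agent would call these on each monitoring tick and emit a message only when one of them returns true, which keeps the reporting cost small in the common case.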
VLA Error Logging
[Figure labels: Hardware Failures, Software Failures, VLA, Message Buffer, Communication API, TCP/IP Ethernet, Communication API, Message Buffer, ARMOR, Data Archive]
VLA packages info:
1. Message time
2. Operational data
3. Environmental data
4. Sensor values
5. App/OS error codes
6. Beam crossing ID
ARMOR:
1. Reads messages
2. Stores/uses for error prediction
3. Appends appropriate info
4. Sends to archive
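The logging path above (the VLA packages a record, ARMOR appends information before archiving) might look roughly like the following. This is a sketch under assumptions: the slides do not specify a wire format, so JSON is used only for illustration, and `VlaMessage`, `pack`, `armor_append`, and the `armor_info` key are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class VlaMessage:
    """One error record carrying the six items the VLA packages."""
    message_time: float
    beam_crossing_id: int
    operational_data: dict = field(default_factory=dict)
    environmental_data: dict = field(default_factory=dict)
    sensor_values: dict = field(default_factory=dict)
    error_codes: list = field(default_factory=list)  # application and OS codes


def pack(msg):
    """Serialize a message to bytes for the message buffer (JSON is an assumption)."""
    return json.dumps(asdict(msg), sort_keys=True).encode("utf-8")


def armor_append(raw, **extra):
    """Decode a packed message and attach extra info, as the ARMOR step would."""
    record = json.loads(raw.decode("utf-8"))
    record["armor_info"] = extra
    return record
```

For example, `armor_append(pack(msg), node="hypothetical-node")` yields a dictionary ready to be forwarded to the archive. On a DSP with a tight footprint, a fixed-layout binary record would likely replace JSON.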
VLA Scheduling Issues
- L1 trigger application has highest priority
- VLA must run sufficiently to ensure efficacy of purpose
- VLA must internally prioritize error tests
- VLA must preempt the L1 trigger app on critical errors
- Task priorities must be alterable during
[Figure: Normal Scheduling; labels: Physics Application, Physics Application]
[Figure labels: External Message Source (FPGA), VLA Inhibitor, Physics Application]
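Two of the scheduling requirements listed earlier, internal prioritization of error tests and preemption of the trigger application on critical errors, can be illustrated with a small priority queue. This is a hedged sketch only: `ErrorTestQueue`, `select_task`, the microsecond budget, and the test names are hypothetical, and a real VLA would rely on the scheduler of the underlying OS rather than on Python.

```python
import heapq


class ErrorTestQueue:
    """Error tests ordered by priority; a lower number means more urgent."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def add(self, priority, cost_us, name):
        heapq.heappush(self._heap, (priority, self._counter, cost_us, name))
        self._counter += 1

    def run_within(self, budget_us):
        """Pop tests in priority order until the next one exceeds the budget."""
        ran = []
        while self._heap and self._heap[0][2] <= budget_us:
            _, _, cost_us, name = heapq.heappop(self._heap)
            budget_us -= cost_us
            ran.append(name)
        return ran


def select_task(critical_error):
    """The L1 trigger runs by default; the VLA preempts it on a critical error."""
    return "vla" if critical_error else "l1_trigger"
```

With tests of priority 0, 1, and 2 costing 50, 40, and 30 microseconds, a 100 microsecond slice runs only the first two; the third waits for the next slice. Stopping at the first test that does not fit, rather than skipping ahead to cheaper ones, keeps low-priority tests from overtaking urgent ones.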
VLA Status
- Current Status
  - VLA skeleton and timing implemented in Syracuse (poster)
  - Hardware platform from Vandy
  - Software (muon application) from Fermi and UIUC
  - Linux drivers to use GME and Vandy devkit
- Near term
  - Muon application to run on the DSP board
  - Muon application timing
  - Instantiate VLAs with Vandy hardware and Muon
VLA and Network Usage
- Network usage influences amount of data dropped by Triggers and other Filters
- Network usage typically not considered in load balancing algorithms (assume network is fast enough)
- VLAs monitor and report network usage
- Agents use this information to re-distribute loads
- Network architecture to control flows on a per-process basis (http://www.netnice.org)
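To show why reported network usage matters for load redistribution, here is a minimal sketch of node selection that weighs link usage alongside CPU load. The function `pick_node`, the equal default weights, and the normalized 0 to 1 load values are assumptions for illustration; the slides do not describe the actual balancing algorithm.

```python
def pick_node(nodes, cpu_weight=0.5, net_weight=0.5):
    """Return the name of the node with the lowest weighted load.

    nodes maps a node name to (cpu_load, net_usage), each normalized to 0..1.
    """
    def cost(item):
        _, (cpu_load, net_usage) = item
        return cpu_weight * cpu_load + net_weight * net_usage

    return min(nodes.items(), key=cost)[0]


# Hypothetical loads as a VLA might report them.
reported = {
    "a": (0.2, 0.9),  # idle CPU, nearly saturated link
    "b": (0.4, 0.3),
    "c": (0.7, 0.1),
}
```

With the default weights, `pick_node(reported)` selects node "b". Setting `net_weight=0` and `cpu_weight=1` reproduces the "network is fast enough" assumption and selects node "a" even though its link is close to saturation.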