Title: Experience with multi-threaded C++ applications in the ATLAS DataFlow
1Experience with multi-threaded C++ applications
in the ATLAS DataFlow
- Performance problems
- found and solved
- STL containers
- thread scheduling
- other
- Szymon Gadomski
- University of Bern, Switzerland
- and INP Cracow, Poland
- on behalf of the ATLAS Trigger/DAQ DataFlow,
- CHEP 2003 conference
2ATLAS DataFlow software
- Flow of data in the ATLAS DAQ system
- Data to LVL2 (part of event), to EF (whole
event), to mass storage.
- See talks by Giovanna Lehman (overview of
DataFlow) and by Stefan Stancu (networking).
- PCs, standard Linux, applications written in C++
(so far using only gcc to compile), standard
network technology (Gb ethernet).
- Soft real-time system, no guaranteed response
time. The average response time is what matters.
- Common tasks (exchanging messages, state machine,
access to configuration db, reporting errors, ...)
use a framework (well, actually two).
3ATLAS Data Flow software (2)
- State of the project
- development done mostly in 2001-2002,
- measurements for Technical Design Report:
performance,
- preparation for beam test support: stability,
robustness and deployment.
- 7 kinds of applications (3 kinds of controllers)
- Always several threads (independent processes
within one application without their own
resources).
- Roles, challenges and use of threads very
different.
- In this short talk only a few examples:
- use of threads, problems, solutions.
4Testbed at CERN
4U PCs > 2 GHz
1U PCs > 2 GHz
FPGA Traffic generators
5LVL2 processing unit (L2PU) - role
Detector data!
- gets LVL1 decision
- asks for data
- gets it
- makes LVL2 decision
- sends it
- sends detailed result
[Diagram: the L2PU and its neighbours]
Components: ROB (1600x), ROS (140x), L2SV, pROS, L2PU, MassStorage
Other multiplicity labels on the diagram: 1x, 10x, up to 500x
Data flows: L1 RoI data, data request (RoI only), data,
LVL2 decision, detailed LVL2 result
Legend: DataFlow application; Interface with control software;
Open choice.
Multiplicities are indicative only
6L2PU design
[Diagram: threads inside the L2PU, between L2SV and pROS]
Input Thread
- receives the LVL1 Result from the L2SV, adds it to the
Event Queue
- assembles incoming RoI Data; if complete, restarts the Worker
Worker Threads (several)
- get the next event from the queue
- run the LVL2 selection code
- send RoI data requests: request data, then wait
- continue the selection code
- send the LVL2 Decision to the L2SV
- if accept, send the LVL2 Result to the pROS
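The "request data, wait" step of a worker and the "if complete, restart worker" step of the input thread can be sketched with a condition variable. This is only an illustration under stated assumptions: `PendingEvent`, `wait_complete`, `add_fragment` and `run_one_event` are hypothetical names, integers stand in for RoI data, and `std::thread` is used here instead of the pthread calls of the original applications.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical per-event record shared by the input thread and one worker.
struct PendingEvent {
    std::mutex m;
    std::condition_variable cv;
    std::size_t expected = 0;     // number of fragments the worker asked for
    std::vector<int> fragments;   // stand-in for RoI data

    // Worker side: block until every requested fragment has arrived.
    std::vector<int> wait_complete() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return fragments.size() >= expected; });
        return fragments;
    }

    // Input-thread side: store one fragment and wake the worker
    // only when the set is complete.
    void add_fragment(int f) {
        bool complete;
        {
            std::lock_guard<std::mutex> lock(m);
            fragments.push_back(f);
            complete = fragments.size() >= expected;
        }
        if (complete) cv.notify_one();
    }
};

// One request/wait cycle: the worker waits while the input thread
// delivers n fragments. Returns how many fragments the worker saw.
std::size_t run_one_event(std::size_t n) {
    PendingEvent ev;
    ev.expected = n;
    std::size_t seen = 0;
    std::thread worker([&] { seen = ev.wait_complete().size(); });
    std::thread input([&] {
        for (std::size_t i = 0; i < n; ++i) ev.add_fragment(static_cast<int>(i));
    });
    input.join();
    worker.join();
    return seen;
}
```

Calling `run_one_event(4)` runs one cycle with four fragments. The worker is woken once per event rather than once per fragment, which keeps thread activations low.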
7Sub-farm Interface (SFI) - role
- gets event id (L2 accept)
- asks for all event data
- gets it
- builds complete event
- buffers it
- sends it to Event Filter
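[Diagram: the SFI and its neighbours]
Components: ROS (140x), SFI, DFM, EF, MassStorage
Other multiplicity labels on the diagram: 50x, 1x
Data flows: LVL2 accepts and rejects, assign, request, data,
EoE, clear, complete event, request
Legend: DataFlow application; Interface with control
Multiplicities are indicative only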
8SFI Design
[Diagram: threads inside the SFI, between DFM, ROS and EF]
EB Rate/SFI 50 Hz
Threads: Input Thread, Request Thread, Assembly Thread
Other blocks: Event Handler
Messages: Event Assigns, Assigns, Data Requests, Event Data,
ROSFragments, Reask Fragment IDs, Events, End of Event, Full Event
- Different threads for requesting and receiving
data
- Threads for assembly and for sending to Event
Handler
9Lesson with L2PU and SFI: STL containers
- With no apparent dependence between threads in
the code, it was observed that threads were not
running independently. Adding more threads had
no effect.
- Diagnosed with VisualThreads, using an
instrumented pthread library.
- STL containers use a memory pool, by default one
per executable. The pool has a lock, so threads
may block each other.
[Figure: thread activity vs. time; threads blocked!]
10Lesson with L2PU and SFI: STL containers (2)
- The solution is to use the pthread allocator:
independent memory pools for each thread, no
lock, no blocking.
- Use it for all containers used at event rate.
- Be careful with creating objects in one thread
and deleting them in another.
[Figure: thread activity vs. time; threads blocked less often]
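The idea of one memory pool per thread can be illustrated with a small custom allocator. This is a sketch of the technique only, not the gcc pthread allocator itself: `LocalPool`, `PerThreadAlloc`, `tl_reused` and `reused_after_churn` are hypothetical names, and the code relies on C++11 `thread_local`, which was not available at the time of the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <new>
#include <thread>

// Illustrative counter: allocations served from the calling thread's pool.
thread_local std::size_t tl_reused = 0;

// One free list of fixed-size blocks per thread and per block size.
// No mutex is needed because only the owning thread touches it.
template <std::size_t BlockSize>
class LocalPool {
    struct Node { Node* next; };
    Node* head_ = nullptr;
public:
    static LocalPool& instance() {
        thread_local LocalPool pool;   // created lazily in each thread
        return pool;
    }
    void* get() {
        if (head_ == nullptr) return ::operator new(BlockSize);
        Node* n = head_;
        head_ = n->next;
        ++tl_reused;
        return n;
    }
    void put(void* p) { head_ = new (p) Node{head_}; }
    ~LocalPool() {
        while (head_ != nullptr) {
            Node* n = head_;
            head_ = n->next;
            ::operator delete(n);
        }
    }
};

// Minimal C++11 allocator: single objects come from the thread's pool.
// Assumes T has no extended alignment requirement.
template <class T>
struct PerThreadAlloc {
    using value_type = T;
    PerThreadAlloc() = default;
    template <class U> PerThreadAlloc(const PerThreadAlloc<U>&) {}

    static constexpr std::size_t block() {
        return sizeof(T) < sizeof(void*) ? sizeof(void*) : sizeof(T);
    }
    T* allocate(std::size_t n) {
        if (n != 1) return static_cast<T*>(::operator new(n * sizeof(T)));
        return static_cast<T*>(LocalPool<block()>::instance().get());
    }
    // A block freed by another thread lands in that thread's pool,
    // which is why cross-thread create/delete needs care.
    void deallocate(T* p, std::size_t n) {
        if (n != 1) ::operator delete(p);
        else LocalPool<block()>::instance().put(p);
    }
};
template <class T, class U>
bool operator==(const PerThreadAlloc<T>&, const PerThreadAlloc<U>&) { return true; }
template <class T, class U>
bool operator!=(const PerThreadAlloc<T>&, const PerThreadAlloc<U>&) { return false; }

// Fill, clear and refill a list inside one thread; returns how many
// nodes of the second fill came from that thread's own pool.
std::size_t reused_after_churn(int n) {
    std::size_t reused = 0;
    std::thread t([&] {
        std::list<int, PerThreadAlloc<int>> events;
        for (int i = 0; i < n; ++i) events.push_back(i);
        events.clear();               // nodes go back to this thread's pool
        for (int i = 0; i < n; ++i) events.push_back(i);
        reused = tl_reused;
    });
    t.join();
    return reused;
}
```

Passing the allocator as the second template argument of a container, as in `std::list<int, PerThreadAlloc<int>>`, is all a caller has to change. A container that is filled in one thread and emptied in another slowly moves blocks from the first thread's pool to the second, which matches the caution on the slide.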
11SFI History
Date       Change                                    EB         EB + Output to EF
30 Oct 02  First integration on testbed              0.5 MB/s   -
13 Nov     Sending data requests at a regular pace   8.0 MB/s   -
14 Nov     Reduce the number of threads              15 MB/s    -
20 Nov     Switch off hyper-threading                17 MB/s    -
21 Nov     Introduce credit based traffic shaping    28 MB/s    -
13 Dec     First try on throughput                   -          14 MB/s
17 Jan     Chose pthread allocator for STL objects   53 MB/s    18 MB/s
29 Jan     DC Buffer recycling when sending          56 MB/s    19 MB/s
05 Feb     IOVec storage type in the EFormat library 58 MB/s    46 MB/s
21 Feb     Buffer pool per thread                    64 MB/s    48 MB/s
21 Feb     Grouping interthread communication        73 MB/s    51 MB/s
26 Feb     Avoiding one system call per message      80 MB/s    55 MB/s
Most improvements (and most problems) are related
to threads.
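One of the thread-related changes in the history, grouping interthread communication, can be sketched as a queue that wakes the consumer once per batch instead of once per message. `BatchQueue`, `Stats` and `run_batched` are hypothetical names, and the batch size is an assumed tuning parameter; the actual SFI code is not shown in the slides.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical queue that hands messages over in groups, so the
// consumer thread is activated once per batch.
class BatchQueue {
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<int> pending_;
    std::size_t batch_;
    bool closed_ = false;
public:
    explicit BatchQueue(std::size_t batch) : batch_(batch) {}

    void push(int msg) {
        bool wake;
        {
            std::lock_guard<std::mutex> lock(m_);
            pending_.push_back(msg);
            wake = pending_.size() >= batch_;   // notify only on a full batch
        }
        if (wake) cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        cv_.notify_one();
    }
    // Blocks until a full batch is pending or the queue is closed and
    // returns everything pending. An empty result means closed and drained.
    std::vector<int> pop_batch() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return pending_.size() >= batch_ || closed_; });
        std::vector<int> out;
        out.swap(pending_);
        return out;
    }
};

struct Stats { std::size_t messages; std::size_t batches; };

// Sends n messages through the queue and reports how many consumer
// activations (batches) were needed.
Stats run_batched(std::size_t n, std::size_t batch) {
    BatchQueue q(batch);
    Stats s{0, 0};
    std::thread consumer([&] {
        for (;;) {
            std::vector<int> b = q.pop_batch();
            if (b.empty()) break;
            s.messages += b.size();
            ++s.batches;
        }
    });
    for (std::size_t i = 0; i < n; ++i) q.push(static_cast<int>(i));
    q.close();
    consumer.join();
    return s;
}
```

With 1000 messages and a batch size of 10 the consumer is activated at most 101 times. A production version would also need a timeout so that a partial batch is not held back indefinitely at low rates.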
12Lessons from SFI
- Traffic shaping (limiting the number of
outstanding requests for data) eliminates packet
loss.
- Grouping interthread communication decreases the
frequency of thread activation.
- Some improvements in more predictable areas:
- avoiding copies and system calls,
- avoiding creations by recycling buffers,
- avoiding contention: each thread has its own
buffers.
- Optimizations driven by measurements with full
functionality.
- Effective development: the developer works on a
good testbed, tests and optimizes, in a short cycle.
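Traffic shaping by limiting the number of outstanding requests can be sketched as a credit window: a request is sent only while a credit is available, and each reply returns one credit. `CreditWindow` and `peak_outstanding` are hypothetical names and the simulation is single-threaded for clarity; the real SFI implementation is not shown in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical requester that never has more than `credits` data
// requests outstanding at once.
class CreditWindow {
    std::size_t credits_;        // requests that may still be sent
    std::size_t in_flight_ = 0;  // requests sent but not yet answered
    std::deque<int> waiting_;    // source ids not yet requested
public:
    explicit CreditWindow(std::size_t credits) : credits_(credits) {}

    void enqueue(int source) { waiting_.push_back(source); }
    std::size_t in_flight() const { return in_flight_; }
    bool idle() const { return in_flight_ == 0 && waiting_.empty(); }

    // Send as many waiting requests as the remaining credits allow.
    template <class Send>
    void pump(Send send) {
        while (credits_ > 0 && !waiting_.empty()) {
            send(waiting_.front());
            waiting_.pop_front();
            --credits_;
            ++in_flight_;
        }
    }
    // A reply arrived: return its credit and send the next request.
    template <class Send>
    void on_reply(Send send) {
        assert(in_flight_ > 0);
        --in_flight_;
        ++credits_;
        pump(send);
    }
};

// Requests data from `sources` sources with the given number of
// credits and returns the largest number of simultaneous requests.
std::size_t peak_outstanding(std::size_t sources, std::size_t credits) {
    CreditWindow w(credits);
    std::size_t sent = 0;
    auto send = [&](int) { ++sent; };
    for (std::size_t i = 0; i < sources; ++i) w.enqueue(static_cast<int>(i));
    w.pump(send);
    std::size_t peak = w.in_flight();
    while (!w.idle()) {
        w.on_reply(send);            // simulate one reply at a time
        if (w.in_flight() > peak) peak = w.in_flight();
    }
    assert(sent == sources);
    return peak;
}
```

With 140 sources and 8 credits the peak is 8, so the burst arriving at the receiving network port is bounded by the credit count rather than by the number of sources.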
13Performance of the SFI
[Plot: EB-only throughput vs. ROLs/ROS; I/O limited at 95 MB/s,
otherwise CPU limited (2.4 GHz CPU)]
- Reaching the I/O limit at 95 MB/s, otherwise CPU
limited
- 35% performance gain with at least 8 ROLs/ROS
- Will approach the I/O limit for 1 ROL/ROS with a
faster CPU
14Readout System (ROS) - role
ROI collection and partial event building. Not
exactly like SFI
ROBin
12 buffers for data
ROBin
ROBin
               ROS                     SFI
Request rate   24 kHz L2, 3 kHz EB     50 Hz
Data per req.  2 kB LVL2, 8 kB EB      1.5 MB
Data rate      72 MB/s                 75 MB/s
request
data
I/O Manager
ROS controller
LVL2 or EB Data request
data
All numbers approximate.
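The data rates quoted for the ROS and the SFI follow from the request rates and the data per request, taking 1 MB = 1000 kB:

```latex
\begin{align*}
R_{\mathrm{ROS}} &= 24\,\mathrm{kHz}\times 2\,\mathrm{kB} + 3\,\mathrm{kHz}\times 8\,\mathrm{kB}
                  = 48\,\mathrm{MB/s} + 24\,\mathrm{MB/s} = 72\,\mathrm{MB/s},\\
R_{\mathrm{SFI}} &= 50\,\mathrm{Hz}\times 1.5\,\mathrm{MB} = 75\,\mathrm{MB/s}.
\end{align*}
```

The two applications move a similar volume of data, but the ROS does so in roughly 500 times more requests (27 kHz against 50 Hz), which is why it is "not exactly like SFI".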
15IOManager in ROS
RobIns
Request Handlers
Control, error
Trigger
Requests (L2, EB, Delete)
Process
Request Queue
Linux Scheduler
The number of request handlers is configurable
Thread
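A request queue served by a configurable number of handler threads can be sketched as follows. `RequestQueue`, `RequestKind` and `serve` are hypothetical names; the three request kinds are taken from the slide. The sketch blocks on a condition variable for simplicity, which is a swapped-in technique: the ROS described in this talk polls and yields instead.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

enum class RequestKind { L2, EB, Delete };   // kinds named on the slide

// Hypothetical thread-safe queue shared by all request handlers.
class RequestQueue {
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<RequestKind> q_;
    bool finished_ = false;
public:
    void push(RequestKind r) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(r); }
        cv_.notify_one();
    }
    void finish() {
        { std::lock_guard<std::mutex> lock(m_); finished_ = true; }
        cv_.notify_all();
    }
    // Returns false once the queue is finished and empty.
    bool pop(RequestKind& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || finished_; });
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
};

// Starts `handlers` request-handler threads, feeds them `requests`
// requests and returns how many were handled.
std::size_t serve(std::size_t handlers, std::size_t requests) {
    RequestQueue queue;
    std::atomic<std::size_t> handled{0};
    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < handlers; ++i)
        pool.emplace_back([&] {
            RequestKind r;
            while (queue.pop(r)) ++handled;   // real code would read RobIns here
        });
    for (std::size_t i = 0; i < requests; ++i)
        queue.push(i % 8 == 0 ? RequestKind::EB : RequestKind::L2);
    queue.finish();
    for (std::thread& t : pool) t.join();
    return handled.load();
}
```

The number of handlers is a plain function argument here, mirroring the configurable number of request handlers. More handlers help when each request spends time waiting for I/O.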
16Thread scheduling problem
- System without interrupts: poll and yield.
- The standard Linux scheduler puts the thread away
until the next time slice, up to 10 ms.
- The solution is to change the scheduling in the
kernel:
- for 2.4.9 kernels there exists an unofficial
patch (tested on CERN RH7.2),
- for CERN RH7.3 there is a CERN-certified patch,
linux_2.4.18_18_sched.yield.patch
This is an evolving field; we need to continue
evaluating thread-related changes of Linux
kernels.
20 ms latency for getting data
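The "poll and yield" pattern can be sketched as a thread that checks a flag and gives up the CPU whenever the data is not there yet. `poll_for_data` is a hypothetical name and an atomic flag stands in for the hardware status that the ROS polls; `std::this_thread::yield()` is used as the portable equivalent of a scheduler yield call.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// A poller thread waits for data without interrupts: it checks a flag
// and yields the CPU between checks. Returns the value the poller read.
int poll_for_data(int value) {
    std::atomic<bool> ready{false};
    int buffer = 0;    // stand-in for the data being waited for
    int seen = -1;
    std::thread poller([&] {
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();   // let other threads run, then re-check
        seen = buffer;
    });
    buffer = value;                                  // data arrives
    ready.store(true, std::memory_order_release);    // and is published
    poller.join();
    return seen;
}
```

How quickly the poller runs again after a yield is decided by the kernel scheduler, not by the application, which is why the latency depended on the kernel patch.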
17Conclusions
- The DataFlow of ATLAS DAQ has a set of
applications managing the flow of data.
- All prototypes exist, have been optimized, are
used for performance measurements and are
prepared for the Beam Test.
- Standard technology (Gb ethernet, PCs, standard
Linux, C++ with gcc, multi-threaded) meets ATLAS
requirements.
- A few lessons were learned.
18Backup slides
19Data Flow Manager (DFM) - role
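[Diagram: the DFM and its neighbours]
Components: ROS, L2SV, DFM, SFI, EF, SFO, Disk files, MassStorage
Multiplicity labels on the diagram: 200x, 16x, 30x, 1x, 100x
Data flows: assign, request, data, EoE, clear
Legend: DataFlow application; I/F with OnlineSW
Multiplicities are indicative only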
20DFM Design
SFI
ROS
EventAssigns
I/O Rate 4 kHz
L2 Decisions
Clears
EndOfEvent
Cleanup Thread
I/O Thread
L2 Decisions EndOfEvent
Load Balancing Bookkeeping
Timeouts
SFI Assigns
DFM
Threads allow for independent and parallel
processing within an application
- Bulk of work done in I/O thread
- Cleanup thread identifies timed out events
- Fully embedded in the DC framework
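What a cleanup pass over the bookkeeping might look like is sketched below. `collect_timed_out` and `expired_count_demo` are hypothetical names, and the bookkeeping is assumed to be a map from event id to the time the event was assigned to an SFI. In a real application the map would be shared with the I/O thread and protected accordingly.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Removes every event assigned longer ago than `timeout` from the
// bookkeeping and returns the ids, e.g. for reassignment.
std::vector<std::uint32_t> collect_timed_out(
        std::map<std::uint32_t, Clock::time_point>& assigned,
        Clock::time_point now, Clock::duration timeout) {
    std::vector<std::uint32_t> expired;
    for (auto it = assigned.begin(); it != assigned.end();) {
        if (now - it->second > timeout) {
            expired.push_back(it->first);
            it = assigned.erase(it);   // erase returns the next iterator
        } else {
            ++it;
        }
    }
    return expired;
}

// Three events assigned at t0, t0+1s and t0+5s, checked at t0+6s with
// a 2 s timeout: the first two have timed out.
std::size_t expired_count_demo() {
    using std::chrono::seconds;
    const Clock::time_point t0 = Clock::now();
    std::map<std::uint32_t, Clock::time_point> assigned = {
        {1, t0}, {2, t0 + seconds(1)}, {3, t0 + seconds(5)}};
    return collect_timed_out(assigned, t0 + seconds(6), seconds(2)).size();
}
```

Running the scan in its own thread keeps the timeout check off the path of the I/O thread, which does the bulk of the work.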
21STL containers (3)
22SFI performance
- Input up to 95 MB/s (3/4 of the 1 Gb line)
- Input and output at 55 MB/s (1/2 line speed)
- With all the logic of EventBuilding and all the
objects involved, the performance is already
close to the network limit (on a 2.4 GHz PC).
23 Performance of Event Building
- N SFIs
- 1 DFM
- hardware emulators of ROS
max EB rate with 8 SFIs: 350 Hz (17% of ATLAS EB
rate)
24After the patch
Xeon/2GHz - Linux 2.4.18 + CERN scheduling patch
[Plot: L2 request rate (kHz), axis 0 to 200, vs. number of
request handlers, axis 0 to 40, with simulated I/O latency]
25Flow of messages
SFI
pROS
DFM
L2PU
L2SV
ROS/ROB
RoIB
[Sequence diagram: messages between the applications listed above
and the EF. Message labels as they appear on the slide:]
1a L2SV_LVL1Result
1b L2PU_LVL2Decision
2a L2PU_DataRequest
2b ROS/ROB_Fragment
3a L2PU_LVL2Result
3b pROS_Ack
4a L2SV_LVL2Decision
4b DFM_Ack
5a DFM_Decision
5a' DFM_SFIAssign
5b SFI_EoE
6a SFI_DataRequest
6b ROS/ROB_EventFragment
7 DFM_Clear
DFM_FlowControl
SFI_FlowControl
Annotations on the diagram: 1..i, 1..n, sequential processing,
wait LVL2 decision, wait EoE, receive or timeout, reassign
time-out event, or time out, Build event.
Note: 6a SFI_DataRequest associated with 5a DFM_Decision is used
for error recovery.
26Deployment view
[Deployment diagram]
Components: RODs, ROB/S, RoIB, EB Switch, LVL2 Switch,
LVL2 Supervisors, DFMs, SFIs, Local EF Farms, LVL2 Processors,
To Remote EF Farm