Experience with multi-threaded C++ applications in the ATLAS DataFlow - PowerPoint PPT Presentation

Learn more at: https://chep03.ucsd.edu
Transcript and Presenter's Notes

Title: Experience with multi-threaded C++ applications in the ATLAS DataFlow


1
Experience with multi-threaded C++ applications
in the ATLAS DataFlow
  • Performance problems found and solved:
    • STL containers
    • thread scheduling
    • other
  • Szymon Gadomski
  • University of Bern, Switzerland
  • and INP Cracow, Poland
  • on behalf of the ATLAS Trigger/DAQ DataFlow
  • CHEP 2003 conference

2
ATLAS DataFlow software
  • Flow of data in the ATLAS DAQ system
  • Data to LVL2 (part of event), to EF (whole
    event), to mass storage.
  • See talks by Giovanna Lehman (overview of
    DataFlow) and by Stefan Stancu (networking).
  • PCs, standard Linux, applications written in C++
    (so far compiled only with gcc), standard
    network technology (Gigabit Ethernet).
  • Soft real-time system, no guaranteed response
    time; the average response time is what matters.
  • Common tasks (exchanging messages, running the
    state machine, accessing the configuration
    database, reporting errors, …) are handled by a
    framework (well, actually two).

3
ATLAS Data Flow software (2)
  • State of the project:
  • development done mostly in 2001-2002,
  • measurements for the Technical Design Report:
    performance,
  • preparation for beam test: stability,
    robustness and deployment.
  • 7 kinds of applications (3 kinds of controllers).
  • Always several threads (independent flows of
    execution within one application, without
    resources of their own).
  • Roles, challenges and use of threads are very
    different.
  • In this short talk only a few examples:
    use of threads, problems, solutions.

4
Testbed at CERN
[Testbed diagram: 4U PCs > 2 GHz, 1U PCs > 2 GHz, FPGA traffic generators]
5
LVL2 processing unit (L2PU) - role
  • gets LVL1 decision
  • asks for data
  • gets it
  • makes LVL2 decision
  • sends it
  • sends detailed result

[Diagram: detector data flows into ~1600 ROBs, read out by ~140 ROSs. The L2PU (up to 500) receives L1 RoI data from an L2SV (~10, fed by 1 RoIB), sends data requests (RoI only) to the ROSs and receives the data, then sends the LVL2 decision to the L2SV and the detailed LVL2 result to the pROS. Labels: DataFlow application; interface with control software; open choice; MassStorage. Multiplicities are indicative only.]
6
L2PU design
[Diagram: L2PU design]
  • Input thread: receives the LVL1 result from the
    L2SV and RoI data from the ROSs, adds the event
    to the event queue and assembles the RoI data;
    if complete, restarts a worker.
  • Worker threads (several): get the next event from
    the queue, run the LVL2 selection code, issue RoI
    data requests and wait for the data, continue the
    selection code; if accept, send the LVL2 result
    to the pROS; send the LVL2 decision.
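The input/worker hand-off above is a producer-consumer queue. Below is a minimal sketch, using C++11 threading primitives for brevity (the applications here are built on pthreads, and Event and EventQueue are hypothetical stand-ins, not the actual L2PU classes):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for an event with assembled RoI data.
struct Event { /* LVL1 result, RoI data, ... */ };

class EventQueue {
public:
    // Input thread: once an event's RoI data is complete, queue it
    // and wake ("restart") one waiting worker.
    void push(Event ev) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(ev));
        }
        cv_.notify_one();
    }
    // Worker thread: get the next event from the queue, blocking
    // until one is available.
    Event pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        Event ev = std::move(q_.front());
        q_.pop();
        return ev;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Event> q_;
};
```

Each worker then loops: pop an event, run the selection code, request further RoI data as needed, and send the result and decision.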
7
Sub-farm Interface (SFI) - role
  • gets event id (L2 accept)
  • asks for all event data
  • gets it
  • builds complete event
  • buffers it
  • sends it to Event Filter

[Diagram: the DFM (1x) collects LVL2 accepts and rejects and assigns events to the SFIs (~50); each SFI requests and receives data from the ROSs (~140), then signals end of event (EoE) to the DFM, which sends clears to the ROSs; the complete event is sent on request to the EF and on to mass storage. Labels: DataFlow application; interface with control. Multiplicities are indicative only.]
8
SFI Design
[Diagram: SFI design, EB rate per SFI ~50 Hz. Event assigns from the DFM go to the event handler; the request thread sends data requests to the ROSs and re-asks for missing fragment IDs; the input thread receives the event data (ROS fragments) and the end-of-event; the assembly thread builds complete events, which are sent to the EF.]
  • Different threads for requesting and receiving
    data
  • Threads for assembly and for sending to Event
    Handler

9
Lesson with L2PU and SFI: STL containers
  • With no apparent dependence between threads in
    the code, the threads were observed not to run
    independently; adding more threads had no effect.
  • Diagnosed with VisualThreads, which uses an
    instrumented pthread library.
  • STL containers use a memory pool, by default one
    per executable. It is protected by a lock, so
    threads may block each other.

[VisualThreads screenshot: threads blocked, shown over time]
10
Lesson with L2PU and SFI: STL containers (2)
  • The solution is to use the pthread allocator:
    independent memory pools for each thread, no
    lock, no blocking (see the sketch below).
  • Use it for all containers used at event rate.
  • Be careful with creating objects in one thread
    and deleting them in another.

[VisualThreads screenshot: threads blocked less often]
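A minimal sketch of the per-thread-pool idea behind the pthread allocator (illustrative only, not the actual STL code; assumes C++11 for thread_local):

```cpp
#include <cstddef>
#include <list>
#include <new>
#include <vector>

// Each thread keeps its own free list of node-sized blocks, so
// event-rate allocations never take a process-wide lock.
template <class T>
struct PerThreadAlloc {
    using value_type = T;
    PerThreadAlloc() = default;
    template <class U> PerThreadAlloc(const PerThreadAlloc<U>&) {}

    static std::vector<void*>& freeList() {
        thread_local std::vector<void*> fl;  // one pool per thread (and type)
        return fl;                           // never trimmed in this sketch
    }
    T* allocate(std::size_t n) {
        if (n == 1 && !freeList().empty()) { // node allocations hit the pool
            T* p = static_cast<T*>(freeList().back());
            freeList().pop_back();
            return p;
        }
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n) {
        if (n == 1)
            freeList().push_back(p);         // recycled by the freeing thread
        else
            ::operator delete(p);            // array allocations bypass the pool
    }
};
template <class T, class U>
bool operator==(const PerThreadAlloc<T>&, const PerThreadAlloc<U>&) { return true; }
template <class T, class U>
bool operator!=(const PerThreadAlloc<T>&, const PerThreadAlloc<U>&) { return false; }

// A node-based container on the event path would then be declared as:
using FragmentList = std::list<int, PerThreadAlloc<int>>;
```

The sketch also shows why cross-thread deletion needs care: a block freed in another thread migrates into that thread's pool rather than returning to its origin.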
11
SFI History
Date      | Change                                     | EB only  | EB + output to EF
----------|--------------------------------------------|----------|------------------
30 Oct 02 | First integration on testbed               | 0.5 MB/s | -
13 Nov    | Sending data requests at a regular pace    | 8.0 MB/s | -
14 Nov    | Reduce the number of threads               | 15 MB/s  | -
20 Nov    | Switch off hyper-threading                 | 17 MB/s  | -
21 Nov    | Introduce credit based traffic shaping     | 28 MB/s  | -
13 Dec    | First try on throughput                    | -        | 14 MB/s
17 Jan    | Chose pthread allocator for STL objects    | 53 MB/s  | 18 MB/s
29 Jan    | DC buffer recycling when sending           | 56 MB/s  | 19 MB/s
05 Feb    | IOVec storage type in the EFormat library  | 58 MB/s  | 46 MB/s
21 Feb    | Buffer pool per thread                     | 64 MB/s  | 48 MB/s
21 Feb    | Grouping interthread communication         | 73 MB/s  | 51 MB/s
26 Feb    | Avoiding one system call per message       | 80 MB/s  | 55 MB/s
Most improvements (and most problems) are related
to threads.
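Two of the gains above, the IOVec storage type and avoiding one system call per message, point at scatter-gather I/O. A sketch with writev(2) follows (the function and its buffers are hypothetical, not the actual SFI code):

```cpp
#include <sys/uio.h>   // writev
#include <unistd.h>

// Send a header plus two event fragments in one system call,
// without first copying them into a contiguous staging buffer.
ssize_t sendFragments(int fd,
                      const void* header, size_t headerLen,
                      const void* frag1, size_t frag1Len,
                      const void* frag2, size_t frag2Len) {
    iovec iov[3];
    iov[0].iov_base = const_cast<void*>(header); iov[0].iov_len = headerLen;
    iov[1].iov_base = const_cast<void*>(frag1);  iov[1].iov_len = frag1Len;
    iov[2].iov_base = const_cast<void*>(frag2);  iov[2].iov_len = frag2Len;
    return writev(fd, iov, 3);  // one syscall; the kernel gathers the pieces
}
```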
12
Lessons from SFI
  • Traffic shaping (limiting the number of
    outstanding requests for data) eliminates packet
    loss (see the sketch below).
  • Grouping interthread communication decreases the
    frequency of thread activations.
  • Some improvements in more predictable areas:
  • avoiding copies and system calls,
  • avoiding object creation by recycling buffers,
  • avoiding contention: each thread has its own
    buffers.
  • Optimizations driven by measurements with full
    functionality.
  • Effective development: the developer works on a
    good testbed, tests and optimizes, in a short cycle.
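A minimal sketch of the credit-based traffic shaping mentioned above, assuming a hypothetical CreditPool class (C++11 primitives for brevity): at most N data requests are outstanding at once, so replies cannot burst and overflow switch buffers.

```cpp
#include <condition_variable>
#include <mutex>

class CreditPool {
public:
    explicit CreditPool(unsigned credits) : credits_(credits) {}

    // Take a credit before sending a data request; blocks while the
    // maximum number of requests is already outstanding.
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return credits_ > 0; });
        --credits_;
    }
    // Return the credit when the requested data has arrived.
    void release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++credits_;
        }
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned credits_;
};
```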

13
Performance of the SFI
[Plot: EB-only throughput vs ROLs/ROS; I/O limited at 95 MB/s, otherwise CPU limited (2.4 GHz CPU)]
  • Reaching the I/O limit at 95 MB/s, otherwise CPU
    limited.
  • 35% performance gain with at least 8 ROLs/ROS.
  • Will approach the I/O limit for 1 ROL/ROS with a
    faster CPU.

14
Readout System (ROS) - role
RoI collection and partial event building; not
exactly like the SFI.
[Diagram: ROBins (12 buffers for data) feed the I/O Manager, which serves LVL2 and EB data requests and returns the data; the ROS controller sits alongside.]

              | ROS                  | SFI
Request rate  | 24 kHz L2 + 3 kHz EB | 50 Hz
Data per req. | 2 kB LVL2, 8 kB EB   | 1.5 MB
Data rate     | 72 MB/s              | 75 MB/s

All numbers are approximate.
15
IOManager in ROS
[Diagram: incoming requests (L2, EB, delete) from the trigger are put on a request queue; request handler threads pop and process them, collecting data from the RobIns; the threads are dispatched by the Linux scheduler; control and error paths on the side. The number of request handler threads is configurable.]
16
Thread scheduling problem
  • The system works without interrupts: poll and yield.
  • The standard Linux scheduler puts a yielding
    thread away until the next time slice: up to
    10 ms (see the sketch below).
  • The solution is to change the scheduling in the kernel:
  • for 2.4.9 kernels there exists an unofficial
    patch (tested on CERN RH7.2),
  • for CERN RH7.3 there is a CERN-certified patch,
    linux_2.4.18_18_sched.yield.patch.

This is an evolving field; we need to keep
evaluating thread-related changes in Linux
kernels.
[Plot: 20 ms latency for getting data]
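A sketch of the poll-and-yield pattern in question (the queue and handler types are hypothetical stand-ins for the ROS request handlers):

```cpp
#include <sched.h>   // sched_yield

struct Request { /* L2, EB or delete request ... */ };
struct RequestQueue { Request* tryPop(); };  // non-blocking poll (hypothetical)
void process(Request&);                      // handle the request (hypothetical)

void requestHandlerLoop(RequestQueue& queue) {
    for (;;) {
        if (Request* req = queue.tryPop()) {
            process(*req);
        } else {
            // Give the CPU away. On an unpatched 2.4 kernel the thread may
            // not be rescheduled until the next time slice, up to 10 ms,
            // even if a request arrives immediately after the yield.
            sched_yield();
        }
    }
}
```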
17
Conclusions
  • The DataFlow of the ATLAS DAQ has a set of
    applications managing the flow of data.
  • All prototypes exist, have been optimized, are
    used for performance measurements and are
    prepared for the beam test.
  • Standard technology (Gigabit Ethernet, PCs,
    standard Linux, C++ with gcc, multi-threaded)
    meets the ATLAS requirements.
  • A few lessons were learned.

18
Backup slides
19
Data Flow Manager (DFM) - role
[Diagram: DFM role. The DFM (1x) receives LVL2 decisions from the L2SVs, assigns events to the SFIs, receives EoE messages and sends clears to the ROSs; data flows from the ROSs to the SFIs, and complete events go to the EF and to the SFOs, which write disk files for mass storage. Labels: DataFlow application; I/F with OnlineSW. Indicated multiplicities (200x, 30x, 16x, 100x, 1x) are indicative only.]
20
DFM Design
[Diagram: DFM design, I/O rate ~4 kHz. The I/O thread receives L2 decisions and EndOfEvent messages, does load balancing and bookkeeping, sends event assigns to the SFIs and clears to the ROSs; the cleanup thread handles timeouts.]
Threads allow for independent and parallel
processing within an application.
  • Bulk of the work is done in the I/O thread.
  • The cleanup thread identifies timed-out events
    (see the sketch below).
  • Fully embedded in the DC framework.
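A minimal sketch of what the cleanup thread does, assuming a hypothetical bookkeeping table (not the actual DFM code):

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <vector>

using Clock = std::chrono::steady_clock;

struct PendingEvent {
    Clock::time_point assigned;  // when the event was assigned to an SFI
    /* SFI id, bookkeeping state, ... */
};

std::mutex tableMutex;
std::map<unsigned, PendingEvent> pending;  // event id -> bookkeeping entry

// Cleanup thread: periodically scan for events whose assignment has
// exceeded the timeout; these become candidates for error recovery.
std::vector<unsigned> findTimedOut(Clock::duration limit) {
    std::vector<unsigned> late;
    std::lock_guard<std::mutex> lock(tableMutex);
    const Clock::time_point now = Clock::now();
    for (const auto& entry : pending)
        if (now - entry.second.assigned > limit)
            late.push_back(entry.first);
    return late;
}
```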

21
STL containers (3)
22
SFI performance
  • Input up to 95 MB/s (3/4 of the 1 Gbit/s line).
  • Input and output at 55 MB/s each (1/2 line speed).
  • With all the logic of event building and all the
    objects involved, the performance is already
    close to the network limit (on a 2.4 GHz PC).

23
Performance of Event Building
  • N SFIs
  • 1 DFM
  • hardware emulators of ROS

max EB rate with 8 SFIs: 350 Hz (17% of the ATLAS
EB rate)
24
After the patch
[Plot: L2 request rate (kHz) vs number of request handlers, for several simulated I/O latencies; Xeon 2 GHz, Linux 2.4.18 with the CERN scheduling patch.]
25
Flow of messages
[Message sequence chart: RoIB, L2SV, L2PU, ROS/ROB, pROS, DFM, SFI, EF.
  • 1a L2SV_LVL1Result: RoIB to L2SV to L2PU.
  • 2a L2PU_DataRequest (1..i, sequential processing)
    and 2b ROS/ROB_Fragment (1..i, receive or time out).
  • 3a L2PU_LVL2Result (or time out) to pROS; 3b pROS_Ack.
  • 1b L2PU_LVL2Decision (wait for LVL2 decision);
    4a L2SV_LVL2Decision to DFM; 4b DFM_Ack.
  • 5a DFM_Decision, associated with 6a SFI_DataRequest;
    used for error recovery.
  • 5a' DFM_SFIAssign; 6a SFI_DataRequest (1..n;
    time-out events are reassigned).
  • 6b ROS/ROB_EventFragment (1..n, receive or time out);
    build event.
  • 5b SFI_EoE (wait for EoE); full event to EF.
  • 7 DFM_Clear; also DFM_FlowControl and SFI_FlowControl.]
26
Deployment view
[Deployment diagram: many RODs feed the ROB/S units; the RoIB, LVL2 Supervisors and LVL2 Processors are connected via the LVL2 switch; the DFMs and SFIs via the EB switch; the SFIs feed local EF farms and a remote EF farm.]