Experience with multi-threaded C++ applications in the ATLAS DataFlow - PowerPoint PPT Presentation

Learn more at: https://chep03.ucsd.edu
Transcript and Presenter's Notes

Title: Experience with multi-threaded C++ applications in the ATLAS DataFlow


1
Experience with multi-threaded C++ applications
in the ATLAS DataFlow
  • Performance problems found and solved:
    • STL containers
    • thread scheduling
    • other
  • Szymon Gadomski
  • University of Bern, Switzerland
  • and INP Cracow, Poland
  • on behalf of the ATLAS Trigger/DAQ DataFlow
  • CHEP 2003 conference

2
ATLAS DataFlow software
  • Flow of data in the ATLAS DAQ system
  • Data to LVL2 (part of event), to EF (whole
    event), to mass storage.
  • See talks by Giovanna Lehman (overview of
    DataFlow) and by Stefan Stancu (networking).
  • PCs, standard Linux, applications written in C++
    (so far compiled only with gcc), standard
    network technology (Gigabit Ethernet).
  • Soft real-time system, no guaranteed response
    time; the average response time is what matters.
  • Common tasks (exchanging messages, running the
    state machine, accessing the configuration
    database, reporting errors, …) are handled by a
    framework (well, actually two).

3
ATLAS Data Flow software (2)
  • State of the project:
  • development done mostly in 2001-2002,
  • measurements for the Technical Design Report:
    performance,
  • preparation for beam test: stability,
    robustness and deployment.
  • 7 kinds of applications (3 kinds of controllers).
  • Always several threads (independent flows of
    execution within one application, without
    resources of their own).
  • Roles, challenges and use of threads are very
    different.
  • In this short talk only a few examples:
    use of threads, problems, solutions.

4
Testbed at CERN
[Testbed diagram: 4U PCs > 2 GHz, 1U PCs > 2 GHz, FPGA traffic generators]
5
LVL2 processing unit (L2PU) - role
  • gets LVL1 decision
  • asks for data
  • gets it
  • makes LVL2 decision
  • sends it
  • sends detailed result

[Diagram: detector data flows into ~1600 ROBs, read out by ~140 ROSs. The L2PU (up to 500) receives L1 RoI data from an L2SV (~10, fed by 1 RoIB), sends data requests (RoI only) to the ROSs and receives the data, then sends the LVL2 decision to the L2SV and the detailed LVL2 result to the pROS. Labels: DataFlow application; interface with control software; open choice; MassStorage. Multiplicities are indicative only.]
6
L2PU design
[Diagram: L2PU design]
  • Input thread: receives the LVL1 result from the
    L2SV and RoI data from the ROSs, adds the event
    to the event queue and assembles the RoI data;
    if complete, restarts a worker.
  • Worker threads (several): get the next event from
    the queue, run the LVL2 selection code, issue RoI
    data requests and wait for the data, continue the
    selection code; if accept, send the LVL2 result
    to the pROS; send the LVL2 decision.
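The input/worker hand-off above is a producer-consumer queue. Below is a minimal sketch, using C++11 threading primitives for brevity (the applications here are built on pthreads, and Event and EventQueue are hypothetical stand-ins, not the actual L2PU classes):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for an event with assembled RoI data.
struct Event { /* LVL1 result, RoI data, ... */ };

class EventQueue {
public:
    // Input thread: once an event's RoI data is complete, queue it
    // and wake ("restart") one waiting worker.
    void push(Event ev) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(ev));
        }
        cv_.notify_one();
    }
    // Worker thread: get the next event from the queue, blocking
    // until one is available.
    Event pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        Event ev = std::move(q_.front());
        q_.pop();
        return ev;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Event> q_;
};
```

Each worker then loops: pop an event, run the selection code, request further RoI data as needed, and send the result and decision.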
7
Sub-farm Interface (SFI) - role
  • gets event id (L2 accept)
  • asks for all event data
  • gets it
  • builds complete event
  • buffers it
  • sends it to Event Filter

[Diagram: the DFM (1x) collects LVL2 accepts and rejects and assigns events to the SFIs (~50); each SFI requests and receives data from the ROSs (~140), then signals end of event (EoE) to the DFM, which sends clears to the ROSs; the complete event is sent on request to the EF and on to mass storage. Labels: DataFlow application; interface with control. Multiplicities are indicative only.]
8
SFI Design
[Diagram: SFI design, EB rate per SFI ~50 Hz. Event assigns from the DFM go to the event handler; the request thread sends data requests to the ROSs and re-asks for missing fragment IDs; the input thread receives the event data (ROS fragments) and the end-of-event; the assembly thread builds complete events, which are sent to the EF.]
  • Different threads for requesting and receiving
    data
  • Threads for assembly and for sending to Event
    Handler

9
Lesson with L2PU and SFI: STL containers
  • With no apparent dependence between threads in
    the code, the threads were observed not to run
    independently; adding more threads had no effect.
  • Diagnosed with VisualThreads, which uses an
    instrumented pthread library.
  • STL containers use a memory pool, by default one
    per executable. It is protected by a lock, so
    threads may block each other.

[VisualThreads screenshot: threads blocked, shown over time]
10
Lesson with L2PU and SFI: STL containers (2)
  • The solution is to use the pthread allocator:
    independent memory pools for each thread, no
    lock, no blocking (see the sketch below).
  • Use it for all containers used at event rate.
  • Be careful with creating objects in one thread
    and deleting them in another.

[VisualThreads screenshot: threads blocked less often]
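A minimal sketch of the per-thread-pool idea behind the pthread allocator (illustrative only, not the actual STL code; assumes C++11 for thread_local):

```cpp
#include <cstddef>
#include <list>
#include <new>
#include <vector>

// Each thread keeps its own free list of node-sized blocks, so
// event-rate allocations never take a process-wide lock.
template <class T>
struct PerThreadAlloc {
    using value_type = T;
    PerThreadAlloc() = default;
    template <class U> PerThreadAlloc(const PerThreadAlloc<U>&) {}

    static std::vector<void*>& freeList() {
        thread_local std::vector<void*> fl;  // one pool per thread (and type)
        return fl;                           // never trimmed in this sketch
    }
    T* allocate(std::size_t n) {
        if (n == 1 && !freeList().empty()) { // node allocations hit the pool
            T* p = static_cast<T*>(freeList().back());
            freeList().pop_back();
            return p;
        }
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n) {
        if (n == 1)
            freeList().push_back(p);         // recycled by the freeing thread
        else
            ::operator delete(p);            // array allocations bypass the pool
    }
};
template <class T, class U>
bool operator==(const PerThreadAlloc<T>&, const PerThreadAlloc<U>&) { return true; }
template <class T, class U>
bool operator!=(const PerThreadAlloc<T>&, const PerThreadAlloc<U>&) { return false; }

// A node-based container on the event path would then be declared as:
using FragmentList = std::list<int, PerThreadAlloc<int>>;
```

The sketch also shows why cross-thread deletion needs care: a block freed in another thread migrates into that thread's pool rather than returning to its origin.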
11
SFI History
Date      | Change                                     | EB only  | EB + output to EF
----------|--------------------------------------------|----------|------------------
30 Oct 02 | First integration on testbed               | 0.5 MB/s | -
13 Nov    | Sending data requests at a regular pace    | 8.0 MB/s | -
14 Nov    | Reduce the number of threads               | 15 MB/s  | -
20 Nov    | Switch off hyper-threading                 | 17 MB/s  | -
21 Nov    | Introduce credit based traffic shaping     | 28 MB/s  | -
13 Dec    | First try on throughput                    | -        | 14 MB/s
17 Jan    | Chose pthread allocator for STL objects    | 53 MB/s  | 18 MB/s
29 Jan    | DC buffer recycling when sending           | 56 MB/s  | 19 MB/s
05 Feb    | IOVec storage type in the EFormat library  | 58 MB/s  | 46 MB/s
21 Feb    | Buffer pool per thread                     | 64 MB/s  | 48 MB/s
21 Feb    | Grouping interthread communication         | 73 MB/s  | 51 MB/s
26 Feb    | Avoiding one system call per message       | 80 MB/s  | 55 MB/s
Most improvements (and most problems) are related
to threads.
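Two of the gains above, the IOVec storage type and avoiding one system call per message, point at scatter-gather I/O. A sketch with writev(2) follows (the function and its buffers are hypothetical, not the actual SFI code):

```cpp
#include <sys/uio.h>   // writev
#include <unistd.h>

// Send a header plus two event fragments in one system call,
// without first copying them into a contiguous staging buffer.
ssize_t sendFragments(int fd,
                      const void* header, size_t headerLen,
                      const void* frag1, size_t frag1Len,
                      const void* frag2, size_t frag2Len) {
    iovec iov[3];
    iov[0].iov_base = const_cast<void*>(header); iov[0].iov_len = headerLen;
    iov[1].iov_base = const_cast<void*>(frag1);  iov[1].iov_len = frag1Len;
    iov[2].iov_base = const_cast<void*>(frag2);  iov[2].iov_len = frag2Len;
    return writev(fd, iov, 3);  // one syscall; the kernel gathers the pieces
}
```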
12
Lessons from SFI
  • Traffic shaping (limiting the number of
    outstanding requests for data) eliminates packet
    loss (see the sketch below).
  • Grouping interthread communication decreases the
    frequency of thread activations.
  • Some improvements in more predictable areas:
  • avoiding copies and system calls,
  • avoiding object creation by recycling buffers,
  • avoiding contention: each thread has its own
    buffers.
  • Optimizations driven by measurements with full
    functionality.
  • Effective development: the developer works on a
    good testbed, tests and optimizes, in a short cycle.
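A minimal sketch of the credit-based traffic shaping mentioned above, assuming a hypothetical CreditPool class (C++11 primitives for brevity): at most N data requests are outstanding at once, so replies cannot burst and overflow switch buffers.

```cpp
#include <condition_variable>
#include <mutex>

class CreditPool {
public:
    explicit CreditPool(unsigned credits) : credits_(credits) {}

    // Take a credit before sending a data request; blocks while the
    // maximum number of requests is already outstanding.
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return credits_ > 0; });
        --credits_;
    }
    // Return the credit when the requested data has arrived.
    void release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++credits_;
        }
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned credits_;
};
```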

13
Performance of the SFI
[Plot: EB-only throughput vs ROLs/ROS; I/O limited at 95 MB/s, otherwise CPU limited (2.4 GHz CPU)]
  • Reaching the I/O limit at 95 MB/s, otherwise CPU
    limited.
  • 35% performance gain with at least 8 ROLs/ROS.
  • Will approach the I/O limit for 1 ROL/ROS with a
    faster CPU.

14
Readout System (ROS) - role
RoI collection and partial event building; not
exactly like the SFI.
[Diagram: ROBins (12 buffers for data) feed the I/O Manager, which serves LVL2 and EB data requests and returns the data; the ROS controller sits alongside.]

              | ROS                  | SFI
Request rate  | 24 kHz L2 + 3 kHz EB | 50 Hz
Data per req. | 2 kB LVL2, 8 kB EB   | 1.5 MB
Data rate     | 72 MB/s              | 75 MB/s

All numbers are approximate.
15
IOManager in ROS
[Diagram: incoming requests (L2, EB, delete) from the trigger are put on a request queue; request handler threads pop and process them, collecting data from the RobIns; the threads are dispatched by the Linux scheduler; control and error paths on the side. The number of request handler threads is configurable.]
16
Thread scheduling problem
  • The system works without interrupts: poll and yield.
  • The standard Linux scheduler puts a yielding
    thread away until the next time slice: up to
    10 ms (see the sketch below).
  • The solution is to change the scheduling in the kernel:
  • for 2.4.9 kernels there exists an unofficial
    patch (tested on CERN RH7.2),
  • for CERN RH7.3 there is a CERN-certified patch,
    linux_2.4.18_18_sched.yield.patch.

This is an evolving field; we need to keep
evaluating thread-related changes in Linux
kernels.
[Plot: 20 ms latency for getting data]
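A sketch of the poll-and-yield pattern in question (the queue and handler types are hypothetical stand-ins for the ROS request handlers):

```cpp
#include <sched.h>   // sched_yield

struct Request { /* L2, EB or delete request ... */ };
struct RequestQueue { Request* tryPop(); };  // non-blocking poll (hypothetical)
void process(Request&);                      // handle the request (hypothetical)

void requestHandlerLoop(RequestQueue& queue) {
    for (;;) {
        if (Request* req = queue.tryPop()) {
            process(*req);
        } else {
            // Give the CPU away. On an unpatched 2.4 kernel the thread may
            // not be rescheduled until the next time slice, up to 10 ms,
            // even if a request arrives immediately after the yield.
            sched_yield();
        }
    }
}
```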
17
Conclusions
  • The DataFlow of the ATLAS DAQ has a set of
    applications managing the flow of data.
  • All prototypes exist, have been optimized, are
    used for performance measurements and are
    prepared for the beam test.
  • Standard technology (Gigabit Ethernet, PCs,
    standard Linux, C++ with gcc, multi-threaded)
    meets the ATLAS requirements.
  • A few lessons were learned.

18
Backup slides
19
Data Flow Manager (DFM) - role
[Diagram: DFM role. The DFM (1x) receives LVL2 decisions from the L2SVs, assigns events to the SFIs, receives EoE messages and sends clears to the ROSs; data flows from the ROSs to the SFIs, and complete events go to the EF and to the SFOs, which write disk files for mass storage. Labels: DataFlow application; I/F with OnlineSW. Indicated multiplicities (200x, 30x, 16x, 100x, 1x) are indicative only.]
20
DFM Design
[Diagram: DFM design, I/O rate ~4 kHz. The I/O thread receives L2 decisions and EndOfEvent messages, does load balancing and bookkeeping, sends event assigns to the SFIs and clears to the ROSs; the cleanup thread handles timeouts.]
Threads allow for independent and parallel
processing within an application.
  • Bulk of the work is done in the I/O thread.
  • The cleanup thread identifies timed-out events
    (see the sketch below).
  • Fully embedded in the DC framework.
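A minimal sketch of what the cleanup thread does, assuming a hypothetical bookkeeping table (not the actual DFM code):

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <vector>

using Clock = std::chrono::steady_clock;

struct PendingEvent {
    Clock::time_point assigned;  // when the event was assigned to an SFI
    /* SFI id, bookkeeping state, ... */
};

std::mutex tableMutex;
std::map<unsigned, PendingEvent> pending;  // event id -> bookkeeping entry

// Cleanup thread: periodically scan for events whose assignment has
// exceeded the timeout; these become candidates for error recovery.
std::vector<unsigned> findTimedOut(Clock::duration limit) {
    std::vector<unsigned> late;
    std::lock_guard<std::mutex> lock(tableMutex);
    const Clock::time_point now = Clock::now();
    for (const auto& entry : pending)
        if (now - entry.second.assigned > limit)
            late.push_back(entry.first);
    return late;
}
```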

21
STL containers (3)
22
SFI performance
  • Input up to 95 MB/s (3/4 of the 1 Gbit/s line).
  • Input and output at 55 MB/s each (1/2 line speed).
  • With all the logic of event building and all the
    objects involved, the performance is already
    close to the network limit (on a 2.4 GHz PC).

23
Performance of Event Building
  • N SFIs
  • 1 DFM
  • hardware emulators of ROS

max EB rate with 8 SFIs: 350 Hz (17% of the ATLAS
EB rate)
24
After the patch
[Plot: L2 request rate (kHz) vs number of request handlers, for several simulated I/O latencies; Xeon 2 GHz, Linux 2.4.18 with the CERN scheduling patch.]
25
Flow of messages
[Message sequence chart: RoIB, L2SV, L2PU, ROS/ROB, pROS, DFM, SFI, EF.
  • 1a L2SV_LVL1Result: RoIB to L2SV to L2PU.
  • 2a L2PU_DataRequest (1..i, sequential processing)
    and 2b ROS/ROB_Fragment (1..i, receive or time out).
  • 3a L2PU_LVL2Result (or time out) to pROS; 3b pROS_Ack.
  • 1b L2PU_LVL2Decision (wait for LVL2 decision);
    4a L2SV_LVL2Decision to DFM; 4b DFM_Ack.
  • 5a DFM_Decision, associated with 6a SFI_DataRequest;
    used for error recovery.
  • 5a' DFM_SFIAssign; 6a SFI_DataRequest (1..n;
    time-out events are reassigned).
  • 6b ROS/ROB_EventFragment (1..n, receive or time out);
    build event.
  • 5b SFI_EoE (wait for EoE); full event to EF.
  • 7 DFM_Clear; also DFM_FlowControl and SFI_FlowControl.]
26
Deployment view
[Deployment diagram: many RODs feed the ROB/S units; the RoIB, LVL2 Supervisors and LVL2 Processors are connected via the LVL2 switch; the DFMs and SFIs via the EB switch; the SFIs feed local EF farms and a remote EF farm.]