Title: QM Performance Analysis
QM Performance Analysis
John DeHart
ONL NP Router
[Block diagram of the ONL NP Router pipeline: Rx (2 ME), Mux (1 ME), Parse/Lookup/Copy (3 MEs), QM (1 ME), HdrFmt (1 ME), Tx (1 ME), Stats (1 ME), FreeList Mgr (1 ME), Plugins 0-4, and the XScale, connected by NN rings, scratch rings (512W), and small/large SRAM rings (64KW), with lookup via the TCAM and associated data in ZBT SRAM. A legend marks each block as New, Mostly Unchanged, Needs Some Mod., or Needs A Lot Of Mod.]
Performance
- What is our performance target?
  - To hit a 5 Gb/s rate:
    - Minimum Ethernet frame: 76B
      - 64B frame + 12B inter-frame spacing
    - 5 Gb/sec × 1B/8b × 1 pkt/76B ≈ 8.22 Mpkt/sec
- IXP ME processing:
  - 1.4 GHz clock rate
  - 1.4 Gcycle/sec × 1 sec/8.22 Mpkt ≈ 170.3 cycles per packet (see the arithmetic sketch below)
- Compute budget (MEs × 170):
  - 1 ME: 170 cycles
  - 2 MEs: 340 cycles
  - 3 MEs: 510 cycles
  - 4 MEs: 680 cycles
- Latency budget (threads × 170):
  - 1 ME: 8 threads, 1360 cycles
  - 2 MEs: 16 threads, 2720 cycles
  - 3 MEs: 24 threads, 4080 cycles
  - 4 MEs: 32 threads, 5440 cycles
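The budget numbers above follow directly from the line rate, the minimum frame size, and the ME clock rate. As a minimal sketch of that arithmetic (plain C written for this write-up, not part of the router code; only the 5 Gb/s, 76B, and 1.4 GHz figures come from the slide):

    #include <stdio.h>

    int main(void)
    {
        double line_rate_bps   = 5e9;      /* target line rate: 5 Gb/s          */
        double min_frame_bytes = 64 + 12;  /* 64B frame + 12B spacing = 76B     */
        double me_clock_hz     = 1.4e9;    /* IXP ME clock rate: 1.4 GHz        */

        /* Worst-case packet rate at minimum frame size (~8.22 Mpkt/sec) */
        double pkts_per_sec = line_rate_bps / 8.0 / min_frame_bytes;

        /* Cycles available per packet on one ME (~170) */
        double cycles_per_pkt = me_clock_hz / pkts_per_sec;

        printf("packet rate  : %.2f Mpkt/sec\n", pkts_per_sec / 1e6);
        printf("per-ME budget: %.1f cycles/pkt\n", cycles_per_pkt);

        /* Compute budget scales with MEs; latency budget with threads (8 per ME) */
        for (int mes = 1; mes <= 4; mes++)
            printf("%d ME: compute %.0f cycles, latency %.0f cycles (%d threads)\n",
                   mes, mes * cycles_per_pkt, mes * 8 * cycles_per_pkt, mes * 8);
        return 0;
    }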
QM Performance
- 1 ME using 7 threads
  - Threads each run once per iteration
  - Enqueue thread and Dequeue threads run in parallel
    - Their latencies can overlap
  - Freelist management thread runs in isolation from the other threads
- 1 Enqueue Thread
  - Processes a batch of 5 packets per iteration
- 5 Dequeue Threads
  - Each processes 1 packet per iteration
- 1 Freelist management Thread
  - Maintains the state of the freelist once every iteration
- Each iteration can enqueue and dequeue 5 packets
- Total latency for an iteration: 5 × 170 cycles = 850 cycles
  - Sum of:
    - Latency of the Freelist management thread
    - Combined latency of the Enqueue thread and Dequeue threads
- Compute budget (see the sketch below):
  - (FL_cpu / 5) + DQ_cpu + (ENQ_cpu / 5) < 170 cycles
- Current (June 2007) BEST CASE (all queues already loaded) estimates
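The per-packet compute budget above can be checked with a small sketch (plain C; the FL_cpu, ENQ_cpu, and DQ_cpu names come from the slide's formula, but the example cycle counts below are hypothetical, not the June 2007 estimates):

    #include <stdio.h>

    /* FL and ENQ each run once per iteration and cover 5 packets, so their
     * costs are amortized over 5; each DQ thread handles a single packet,
     * so DQ_cpu is already a per-packet cost.                              */
    static double per_packet_cycles(double fl_cpu, double enq_cpu, double dq_cpu)
    {
        return fl_cpu / 5.0 + enq_cpu / 5.0 + dq_cpu;
    }

    int main(void)
    {
        /* Hypothetical example values, NOT measured estimates from the slides */
        double fl_cpu = 100.0, enq_cpu = 300.0, dq_cpu = 80.0;

        double per_pkt = per_packet_cycles(fl_cpu, enq_cpu, dq_cpu);
        printf("per-packet compute: %.1f cycles (budget: 170)%s\n",
               per_pkt, per_pkt < 170.0 ? "" : "  ** over budget **");
        return 0;
    }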
QM Performance Improvements
- These are simple improvements that might save us 10s of cycles each.
- Change the way we read the scratch ring on input to Enqueue
  - Currently we do this (each get reads the input data for 1 pkt):

      .xfer_order rdata_a
      .xfer_order rdata_b
      .xfer_order rdata_c
      .xfer_order rdata_d
      .xfer_order rdata_e
      scratch[get, rdata_a0, 0, ring, 3], sig_done[sram_sig0]
      scratch[get, rdata_b0, 0, ring, 3], sig_done[sram_sig1]
      scratch[get, rdata_c0, 0, ring, 3], sig_done[sram_sig2]
      scratch[get, rdata_d0, 0, ring, 3], sig_done[sram_sig3]
      scratch[get, rdata_e0, 0, ring, 3], sig_done[sram_sig4]

  - The fifth scratch get always causes a stall, since the cmd FIFO on the ME is only 4 deep.
  - When it stalls, it also causes an abort of it and the following 2 instructions.
  - Total of 15 cycles consumed by the abort (3 cycles) and the stall (12 cycles). (A rough cost model is sketched below.)
  - This seems more efficient: pair the transfer registers so each scratch get can pull the data for 2 pkts, cutting the number of gets and staying within the 4-deep cmd FIFO:

      .xfer_order rdata_a, rdata_b
      .xfer_order rdata_c, rdata_d
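To put numbers on the stall argument, here is a rough back-of-the-envelope model (plain C, not router code; the 4-deep cmd FIFO, 12-cycle stall, and 3-cycle abort figures are from the slide, while the assumption that the paired reads need only 3 gets for 5 packets is my reading of the proposed change):

    #include <stdio.h>

    #define CMD_FIFO_DEPTH 4   /* ME command FIFO depth (from the slide)         */
    #define STALL_CYCLES  12   /* cycles lost when a command finds the FIFO full */
    #define ABORT_CYCLES   3   /* stalled cmd plus the following 2 instructions  */

    /* Rough cost of issuing n scratch[get] commands back to back: every
     * command beyond the FIFO depth is assumed to stall and abort.        */
    static int issue_penalty_cycles(int n_cmds)
    {
        int overflow = (n_cmds > CMD_FIFO_DEPTH) ? n_cmds - CMD_FIFO_DEPTH : 0;
        return overflow * (STALL_CYCLES + ABORT_CYCLES);
    }

    int main(void)
    {
        printf("5 separate gets: %d penalty cycles\n", issue_penalty_cycles(5)); /* 15 */
        printf("3 paired gets  : %d penalty cycles\n", issue_penalty_cycles(3)); /*  0 */
        return 0;
    }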
QM Snapshots
- Breakpoint set at the start of the maintain_fl() macro in the FL management Thread
- All queues should be already loaded
- Run for one iteration:
  - ENQ processes 5 pkts
  - Each of the 5 DQ threads processes 1 pkt
  - Rx reports 10 packets received and Tx reports 5 packets transmitted
QM Snapshots
- Same breakpoint and setup as the previous slide.
200 Byte Eth Frames
- With 200-byte Ethernet frames and 5 ports sending at full rate:
  - Dequeue cannot keep up. After about 1030 packets we start discarding in Enqueue because the queues are full.
  - Queue thresholds were set to 0xfff
  - Port rates were set to 0x1000 (greater than 1 Gb/s)
400 Byte Eth Frames
- With 400-byte Ethernet frames and 5 ports sending at full rate:
  - Queues build up eventually.
  - I suspect there is an inherent problem in the way that dequeue is working that causes it to not be able to keep up.
  - Tx is flow controlling the dequeue engines in this case.
    - This seems to be what is causing the queues to build up.
More snapshots (June 13, 2007)
More snapshots (June 13-15, 2007)
- WORST CASE: Every pkt causes a queue to be evicted by Enqueue and a new one loaded.
- BEST CASE: Queues are always already loaded; nothing gets evicted.