Title: NetThreads: Programming NetFPGA with Threaded Software
1. NetThreads: Programming NetFPGA with Threaded Software
Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.)
Martin Labrecque, Gregory Steffan (ECE Dept.)
University of Toronto
2. Real-Life Customers
- Hardware
  - NetFPGA board: 4 GigE ports, Virtex II Pro FPGA
- Collaboration with CS researchers
  - Interested in performing network experiments
  - Not in coding Verilog
  - Want to use the GigE links at maximum capacity
- Requirements
  - An easy-to-program system
  - An efficient system
What would the ideal solution look like?
3. Envisioned System (Someday)
- Many compute engines
- Delivers the expected performance
- Hardware handles communication and synchronization
[Diagram: an array of hardware accelerators exploiting both data-level and control-flow parallelism]
Processors inside an FPGA?
4Soft Processors in FPGAs
- Soft processors processors in the FPGA fabric
- FPGAs increasingly implement SoCs with CPUs
- Commercial soft processors NIOS-II and Microblaze
What is the performance requirement?
5. Performance in Packet Processing
- The application defines the required throughput:
  - Edge routing (~1 Gbps/link)
  - Home networking (100 Mbps/link)
  - Scientific instruments (< 100 Mbps/link)
- Our measure of throughput:
  - Bisection search for the minimum packet inter-arrival time (sketched below)
  - Must not drop any packet
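As a minimal illustration of this measurement, here is a sketch in C; send_burst() is a hypothetical helper (not part of NetThreads) that replays traffic at a fixed inter-arrival time and reports whether anything was dropped:

    /* Bisection for the smallest loss-free packet inter-arrival time.
       send_burst() is hypothetical: it transmits a burst with the given
       gap and returns true when no packet is dropped. */
    #include <stdbool.h>

    bool send_burst(double inter_arrival_ns);

    double min_inter_arrival(double lo_ns, double hi_ns, double tol_ns)
    {
        /* Invariant: a gap of hi_ns is loss-free, a gap of lo_ns drops packets. */
        while (hi_ns - lo_ns > tol_ns) {
            double mid = (lo_ns + hi_ns) / 2.0;
            if (send_burst(mid))
                hi_ns = mid;   /* no drops: try a smaller gap */
            else
                lo_ns = mid;   /* drops: the gap must grow */
        }
        return hi_ns;          /* smallest gap that loses no packets */
    }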
Are soft processors fast enough?
6. Realistic Goals
- A 10^9 bps stream with the normal inter-frame gap of 12 bytes
- 2 processors running at 125 MHz
- Cycle budget:
  - 152 cycles for minimally-sized 64B packets
  - 3060 cycles for maximally-sized 1518B packets
Soft processors can do non-trivial processing at line rate! (The arithmetic is sketched below.)
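The budget follows directly from the line rate: at 1 Gbps a byte occupies 8 ns on the wire, which is exactly one cycle at 125 MHz, so each packet's budget is (packet size + 12B gap) cycles, doubled across the 2 processors. A checkable sketch:

    /* Reproduces the slide's cycle budget: one byte on a 1 Gbps wire
       takes 8 ns, the same as one cycle at 125 MHz. */
    #include <stdio.h>

    int main(void)
    {
        const int gap = 12, nproc = 2;
        const int sizes[] = { 64, 1518 };
        for (int i = 0; i < 2; i++) {
            int budget = (sizes[i] + gap) * nproc; /* cycles per packet */
            printf("%4dB packets: %d cycles\n", sizes[i], budget);
        }
        return 0; /* prints 152 and 3060, matching the slide */
    }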
How can they be organized efficiently?
7. Key Design Features
8. Efficient Network Processing
Multithreaded soft processor
9. Multiprocessor System Diagram
[Diagram: two processors with instruction and data memories, a synchronization unit, a shared data cache backed by off-chip DDR, and input/output buffers and memories connected to the packet input and output ports]
- Overcomes the 2-port limitation of block RAMs
- The shared data cache is not the main bottleneck in our experiments
10. Performance of Single-Threaded Processors
- Single-issue, in-order pipeline
- Should commit 1 instruction every cycle, but:
  - stalls on instruction dependences
  - stalls on memory, I/O, and accelerator accesses
- Throughput depends on sequential execution of:
  - packet processing
  - device control
  - event monitoring
→ many concurrent threads are available
Solution to Avoid Stalls: Multithreading
11. Avoiding Processor Stall Cycles
[Pipeline diagram, BEFORE: traditional single-threaded execution through the 5 stages (F, D, E, M, W) over time, stalling on data and control hazards]
- 4 threads eliminate hazards in a 5-stage pipeline
- The 5-stage pipeline is 77% more area efficient [FPL'07]
12. Multithreading Evaluation
13. Infrastructure
- Compilation:
  - modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA
- Timing:
  - no free PLL: the processors run at the speed of the Ethernet MACs, 125 MHz
- Platform:
  - 2 processors, 4 MAC + 1 DMA ports, 64 MBytes of 200 MHz DDR2 SDRAM
  - Virtex II Pro 50 (speed grade 7 ns)
  - 16KB private instruction caches and a shared write-back data cache
  - Capacity would be increased on a more modern FPGA
- Validation:
  - Reference trace from a MIPS simulator
  - ModelSim and online instruction trace collection
- A PC server can send 0.7 Gbps of maximally-sized packets
  - A simple packet-echo application can keep up
  - Complex applications are the bottleneck, not the architecture
14. Our Benchmarks
Realistic, non-trivial applications dominated by control flow
15. What is Limiting Performance?
Packet backlog due to synchronization serializing tasks
Let's focus on the underlying problem: synchronization
16. Addressing Synchronization Overhead
17. Real Threads Synchronize
- All threads execute the same code
- Concurrent threads may access shared data
- Critical sections ensure correctness (see the sketch below)
[Diagram: Threads 1-4 each execute Lock(); shared_var = f(); Unlock()]
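A minimal sketch of such a critical section in C, assuming hypothetical NetThreads-style lock(id)/unlock(id) primitives (the real API names may differ):

    static volatile unsigned shared_var;   /* data shared by all threads */

    void lock(int id);                     /* hypothetical: block until lock is held */
    void unlock(int id);                   /* hypothetical: release the lock */
    unsigned f(unsigned v);                /* application-specific update */

    void update_shared(void)
    {
        lock(0);                           /* enter the critical section */
        shared_var = f(shared_var);        /* safe read-modify-write */
        unlock(0);                         /* leave the critical section */
    }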
Impact on round-robin scheduled threads?
18. Multithreaded Processor with Synchronization
[Pipeline diagram: lock acquire and release points marked on the 5-stage (F, D, E, M, W) pipeline over time]
19. Synchronization Wrecks Round-Robin Multithreading
[Pipeline diagram: between a lock acquire and its release, the slots of threads blocked on the lock go empty]
With round-robin thread scheduling and contention on locks:
- < 4 threads execute concurrently
- > 18 cycles are wasted while blocked on synchronization
20. Better Handling of Synchronization
[Pipeline diagram: unlike the BEFORE case, instructions from the remaining runnable threads fill every slot of the 5-stage pipeline (F, D, E, M, W) over time]
21. Thread Scheduler
- Suspend any thread waiting for a lock
- Round-robin among the remaining threads (policy sketched below)
- An unlock operation resumes threads across processors
- The multithreaded processor hides hazards across active threads
- Fewer than N active threads requires hazard detection
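The scheduler itself is hardware; purely as an illustration of the policy, here is a software sketch with hypothetical per-thread state:

    #include <stdbool.h>

    #define NTHREADS 4
    static bool blocked[NTHREADS];     /* true while a thread waits on a lock */

    /* Round-robin among runnable threads only. */
    int next_thread(int last)
    {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (!blocked[t])
                return t;
        }
        return -1;                     /* all blocked: issue a pipeline bubble */
    }

    /* An unlock wakes a waiting thread, even on the other processor. */
    void on_unlock(int tid) { blocked[tid] = false; }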
But hazard detection was on the critical path of the single-threaded processor.
Is there a low-cost solution?
22. Static Hazard Detection
- Hazards can be determined at compile time
- Hazard distances are encoded as part of the instructions (an illustrative sketch follows)
Static hazard detection allows scheduling without an extra pipeline stage: very low area overhead (5%) and no frequency penalty
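Purely illustrative: a compiler pass could compute each instruction's hazard distance along these lines; the field names and window size are assumptions, not NetThreads' actual encoding:

    /* Distance (in instructions) to the nearest later instruction that
       reads this instruction's destination register; 0 means no hazard
       within the window that matters to the 5-stage pipeline. */
    #define WINDOW 3

    typedef struct { int dest, src1, src2; } insn;   /* -1 = unused */

    int hazard_distance(const insn *code, int i, int n)
    {
        if (code[i].dest < 0)
            return 0;
        for (int d = 1; d <= WINDOW && i + d < n; d++)
            if (code[i + d].src1 == code[i].dest ||
                code[i + d].src2 == code[i].dest)
                return d;              /* encoded in spare instruction bits */
        return 0;
    }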
23. Thread Scheduler Evaluation
24. Results on 3 Benchmark Applications
- Thread scheduling improves throughput by 63%, 31%, and 41%
- Why isn't the 2nd processor always improving throughput?
25. Cycle Breakdown in Simulation
[Charts: cycle breakdown for Classifier, NAT, and UDHCP]
- Removed cycles stalled waiting for a lock
- What is the bottleneck?
26. Impact of Allowing Packet Drops
- System still under-utilized
- Throughput still dominated by serialization
27. Future Work
- Adding custom hardware accelerators
  - Same interconnect as the processors
  - Same synchronization interface
- Evaluate speculative threading
  - Alleviates the need for fine-grained synchronization
  - Reduces conservative synchronization overhead
28. Conclusions
- Efficient multithreaded design
  - Parallel threads hide stalls in any one thread
  - The thread scheduler mitigates synchronization costs
- System features
  - The system is easy to program in C
  - Performance from parallelism is easy to obtain
We are on the lookout for relevant applications suitable for benchmarking.
NetThreads is available with its compiler at http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
29. Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.); Martin Labrecque, Gregory Steffan (ECE Dept.), University of Toronto
NetThreads is available with its compiler at http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
30. Backup
31. Software Network Processing
- Not meant for:
  - Straightforward tasks accomplished at line speed in hardware
  - E.g., basic switching and routing
- Advantages compared to hardware:
  - Complex applications are best described in high-level software
  - Easier to design, with fast time-to-market
  - Can interface with custom accelerators and controllers
  - Can be easily updated
- Our focus: stateful applications
  - Data structures are modified by most packets
  - Difficult to pipeline the code into balanced stages
- Run-to-completion / pool-of-threads model for parallelism (sketched below)
  - Each thread processes a packet from beginning to end
  - No thread-specific behavior
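A minimal sketch of the model, assuming hypothetical receive_packet()/send_packet() helpers for the input and output buffers; every thread runs the identical loop:

    typedef struct packet packet;

    packet *receive_packet(void);   /* hypothetical: next packet from input buffer */
    void    send_packet(packet *p); /* hypothetical: enqueue to output buffer */
    void    process(packet *p);     /* application code; may lock shared state */

    void thread_main(void)
    {
        for (;;) {
            packet *p = receive_packet(); /* claim one packet */
            process(p);                   /* handle it from beginning to end */
            send_packet(p);               /* no pipeline stages, no hand-offs */
        }
    }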
32. Impact of Allowing Packet Drops
[Chart: throughput over time for the NAT benchmark]
33. Cycle Breakdown in Simulation
[Charts: cycle breakdown for Classifier, NAT, and UDHCP]
- Removed cycles stalled waiting for a lock
- Throughput still dominated by serialization
34. More Sophisticated Thread Scheduling
- Add a pipeline stage to pick a hazard-free instruction
- Result:
  - Increased instruction latency
  - Increased hazard window
  - Increased branch misprediction cost
[Diagram: an instruction-select MUX added ahead of the pipeline]
Can we add hazard detection without an extra pipeline stage?
35. Implementation
- Where to store the hazard distance bits?
  - Block RAMs are a multiple of 9 bits wide
  - A 36-bit word leaves 4 bits available beyond the 32-bit instruction
  - These bits also encode the lock and unlock flags
[Diagram: 36-bit word = 32-bit instruction + 4 annotation bits]
How to convert instructions from 36 bits to 32 bits?
36. Instruction Compaction: 36 → 32 bits
- R-Type instructions, example: add rd, rs, rt
- J-Type instructions, example: j label
- I-Type instructions, example: addi rt, rs, immediate
- De-compaction: 2 block RAMs + some logic between the DDR and the cache (a packing sketch follows)
- Not on a critical path of the pipeline
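As a rough sketch of the annotated 36-bit word, here is one way the fields could pack and unpack in C; the exact split of the 4 spare bits between the hazard distance and the lock/unlock flags is an assumption:

    #include <stdint.h>

    typedef struct {
        uint32_t insn;        /* the 32-bit MIPS instruction */
        uint8_t  hazard_dist; /* assumed 2 bits */
        uint8_t  lock_flag;   /* assumed 1 bit  */
        uint8_t  unlock_flag; /* assumed 1 bit  */
    } annotated;

    /* Pack into a 36-bit block-RAM word (carried in a uint64_t here). */
    uint64_t pack(annotated a)
    {
        return ((uint64_t)(a.hazard_dist & 3u) << 34) |
               ((uint64_t)(a.lock_flag   & 1u) << 33) |
               ((uint64_t)(a.unlock_flag & 1u) << 32) |
               a.insn;
    }

    /* De-compaction direction: recover the fields from the 36-bit word. */
    annotated unpack(uint64_t w)
    {
        annotated a = {
            .insn        = (uint32_t)w,
            .hazard_dist = (uint8_t)((w >> 34) & 3u),
            .lock_flag   = (uint8_t)((w >> 33) & 1u),
            .unlock_flag = (uint8_t)((w >> 32) & 1u),
        };
        return a;
    }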