Title: NetThreads: Programming NetFPGA with Threaded Software
1. NetThreads: Programming NetFPGA with Threaded Software
Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.)
Martin Labrecque, Gregory Steffan (ECE Dept.)
University of Toronto
2. Real-Life Customers
- Hardware
  - NetFPGA board: 4 GigE ports, Virtex II Pro FPGA
- Collaboration with CS researchers
  - Interested in performing network experiments
  - Not in coding Verilog
  - Want to use the GigE links at maximum capacity
- Requirements
  - An easy-to-program system
  - An efficient system
What would the ideal solution look like?
3. Envisioned System (Someday)
- Many compute engines
- Delivers the expected performance
- Hardware handles communication and synchronization
[Diagram: an array of hardware accelerators exploiting both data-level and control-flow parallelism]
Processors inside an FPGA?
4Soft Processors in FPGAs
- Soft processors processors in the FPGA fabric
- FPGAs increasingly implement SoCs with CPUs
- Commercial soft processors NIOS-II and Microblaze
What is the performance requirement?
5. Performance in Packet Processing
- The application defines the required throughput:
  - Edge routing (~1 Gbps/link)
  - Home networking (100 Mbps/link)
  - Scientific instruments (< 100 Mbps/link)
- Our measure of throughput:
  - Bisection search for the minimum packet inter-arrival time (sketched below)
  - Must not drop any packet
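As a minimal illustration of this measurement, here is a sketch in C; send_burst() is a hypothetical helper (not part of NetThreads) that replays traffic at a fixed inter-arrival time and reports whether anything was dropped:

    /* Bisection for the smallest loss-free packet inter-arrival time.
       send_burst() is hypothetical: it transmits a burst with the given
       gap and returns true when no packet is dropped. */
    #include <stdbool.h>

    bool send_burst(double inter_arrival_ns);

    double min_inter_arrival(double lo_ns, double hi_ns, double tol_ns)
    {
        /* Invariant: a gap of hi_ns is loss-free, a gap of lo_ns drops packets. */
        while (hi_ns - lo_ns > tol_ns) {
            double mid = (lo_ns + hi_ns) / 2.0;
            if (send_burst(mid))
                hi_ns = mid;   /* no drops: try a smaller gap */
            else
                lo_ns = mid;   /* drops: the gap must grow */
        }
        return hi_ns;          /* smallest gap that loses no packets */
    }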
Are soft processors fast enough?
6. Realistic Goals
- A 10^9 bps stream with the normal inter-frame gap of 12 bytes
- 2 processors running at 125 MHz
- Cycle budget:
  - 152 cycles for minimally-sized 64B packets
  - 3060 cycles for maximally-sized 1518B packets
Soft processors can do non-trivial processing at line rate! (The arithmetic is sketched below.)
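The budget follows directly from the line rate: at 1 Gbps a byte occupies 8 ns on the wire, which is exactly one cycle at 125 MHz, so each packet's budget is (packet size + 12B gap) cycles, doubled across the 2 processors. A checkable sketch:

    /* Reproduces the slide's cycle budget: one byte on a 1 Gbps wire
       takes 8 ns, the same as one cycle at 125 MHz. */
    #include <stdio.h>

    int main(void)
    {
        const int gap = 12, nproc = 2;
        const int sizes[] = { 64, 1518 };
        for (int i = 0; i < 2; i++) {
            int budget = (sizes[i] + gap) * nproc; /* cycles per packet */
            printf("%4dB packets: %d cycles\n", sizes[i], budget);
        }
        return 0; /* prints 152 and 3060, matching the slide */
    }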
How can they be organized efficiently?
7. Key Design Features
8. Efficient Network Processing
Multithreaded soft processor
9. Multiprocessor System Diagram
[Diagram: two processors with instruction and data memories, a synchronization unit, a shared data cache backed by off-chip DDR, and input/output buffers and memories connected to the packet input and output ports]
- Overcomes the 2-port limitation of block RAMs
- The shared data cache is not the main bottleneck in our experiments
10. Performance of Single-Threaded Processors
- Single-issue, in-order pipeline
- Should commit 1 instruction every cycle, but:
  - stalls on instruction dependences
  - stalls on memory, I/O, and accelerator accesses
- Throughput depends on sequential execution of:
  - packet processing
  - device control
  - event monitoring
→ many concurrent threads are available
Solution to Avoid Stalls: Multithreading
11. Avoiding Processor Stall Cycles
[Pipeline diagram, BEFORE: traditional single-threaded execution through the 5 stages (F, D, E, M, W) over time, stalling on data and control hazards]
- 4 threads eliminate hazards in a 5-stage pipeline
- The 5-stage pipeline is 77% more area efficient [FPL'07]
12. Multithreading Evaluation
13. Infrastructure
- Compilation:
  - modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA
- Timing:
  - no free PLL: the processors run at the speed of the Ethernet MACs, 125 MHz
- Platform:
  - 2 processors, 4 MAC + 1 DMA ports, 64 MBytes of 200 MHz DDR2 SDRAM
  - Virtex II Pro 50 (speed grade 7 ns)
  - 16KB private instruction caches and a shared write-back data cache
  - Capacity would be increased on a more modern FPGA
- Validation:
  - Reference trace from a MIPS simulator
  - ModelSim and online instruction trace collection
- A PC server can send 0.7 Gbps of maximally-sized packets
  - A simple packet-echo application can keep up
  - Complex applications are the bottleneck, not the architecture
14. Our Benchmarks
Realistic, non-trivial applications dominated by control flow
15. What is Limiting Performance?
Packet backlog due to synchronization serializing tasks
Let's focus on the underlying problem: synchronization
16. Addressing Synchronization Overhead
17. Real Threads Synchronize
- All threads execute the same code
- Concurrent threads may access shared data
- Critical sections ensure correctness (see the sketch below)
[Diagram: Threads 1-4 each execute Lock(); shared_var = f(); Unlock()]
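A minimal sketch of such a critical section in C, assuming hypothetical NetThreads-style lock(id)/unlock(id) primitives (the real API names may differ):

    static volatile unsigned shared_var;   /* data shared by all threads */

    void lock(int id);                     /* hypothetical: block until lock is held */
    void unlock(int id);                   /* hypothetical: release the lock */
    unsigned f(unsigned v);                /* application-specific update */

    void update_shared(void)
    {
        lock(0);                           /* enter the critical section */
        shared_var = f(shared_var);        /* safe read-modify-write */
        unlock(0);                         /* leave the critical section */
    }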
Impact on round-robin scheduled threads?
18. Multithreaded Processor with Synchronization
[Pipeline diagram: lock acquire and release points marked on the 5-stage (F, D, E, M, W) pipeline over time]
19. Synchronization Wrecks Round-Robin Multithreading
[Pipeline diagram: between a lock acquire and its release, the slots of threads blocked on the lock go empty]
With round-robin thread scheduling and contention on locks:
- < 4 threads execute concurrently
- > 18 cycles are wasted while blocked on synchronization
20. Better Handling of Synchronization
[Pipeline diagram: unlike the BEFORE case, instructions from the remaining runnable threads fill every slot of the 5-stage pipeline (F, D, E, M, W) over time]
21. Thread Scheduler
- Suspend any thread waiting for a lock
- Round-robin among the remaining threads (policy sketched below)
- An unlock operation resumes threads across processors
- The multithreaded processor hides hazards across active threads
- Fewer than N active threads requires hazard detection
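The scheduler itself is hardware; purely as an illustration of the policy, here is a software sketch with hypothetical per-thread state:

    #include <stdbool.h>

    #define NTHREADS 4
    static bool blocked[NTHREADS];     /* true while a thread waits on a lock */

    /* Round-robin among runnable threads only. */
    int next_thread(int last)
    {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (!blocked[t])
                return t;
        }
        return -1;                     /* all blocked: issue a pipeline bubble */
    }

    /* An unlock wakes a waiting thread, even on the other processor. */
    void on_unlock(int tid) { blocked[tid] = false; }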
But hazard detection was on the critical path of the single-threaded processor.
Is there a low-cost solution?
22. Static Hazard Detection
- Hazards can be determined at compile time
- Hazard distances are encoded as part of the instructions (an illustrative sketch follows)
Static hazard detection allows scheduling without an extra pipeline stage: very low area overhead (5%) and no frequency penalty
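Purely illustrative: a compiler pass could compute each instruction's hazard distance along these lines; the field names and window size are assumptions, not NetThreads' actual encoding:

    /* Distance (in instructions) to the nearest later instruction that
       reads this instruction's destination register; 0 means no hazard
       within the window that matters to the 5-stage pipeline. */
    #define WINDOW 3

    typedef struct { int dest, src1, src2; } insn;   /* -1 = unused */

    int hazard_distance(const insn *code, int i, int n)
    {
        if (code[i].dest < 0)
            return 0;
        for (int d = 1; d <= WINDOW && i + d < n; d++)
            if (code[i + d].src1 == code[i].dest ||
                code[i + d].src2 == code[i].dest)
                return d;              /* encoded in spare instruction bits */
        return 0;
    }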
23. Thread Scheduler Evaluation
24. Results on 3 Benchmark Applications
- Thread scheduling improves throughput by 63%, 31%, and 41%
- Why isn't the 2nd processor always improving throughput?
25. Cycle Breakdown in Simulation
[Charts: cycle breakdown for Classifier, NAT, and UDHCP]
- Removed cycles stalled waiting for a lock
- What is the bottleneck?
26. Impact of Allowing Packet Drops
- System still under-utilized
- Throughput still dominated by serialization
27. Future Work
- Adding custom hardware accelerators
  - Same interconnect as the processors
  - Same synchronization interface
- Evaluate speculative threading
  - Alleviates the need for fine-grained synchronization
  - Reduces conservative synchronization overhead
28. Conclusions
- Efficient multithreaded design
  - Parallel threads hide stalls in any one thread
  - The thread scheduler mitigates synchronization costs
- System features
  - The system is easy to program in C
  - Performance from parallelism is easy to obtain
We are on the lookout for relevant applications suitable for benchmarking.
NetThreads is available with its compiler at http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
29. Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.); Martin Labrecque, Gregory Steffan (ECE Dept.), University of Toronto
NetThreads is available with its compiler at http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
30. Backup
31. Software Network Processing
- Not meant for:
  - Straightforward tasks accomplished at line speed in hardware
  - E.g., basic switching and routing
- Advantages compared to hardware:
  - Complex applications are best described in high-level software
  - Easier to design, with fast time-to-market
  - Can interface with custom accelerators and controllers
  - Can be easily updated
- Our focus: stateful applications
  - Data structures are modified by most packets
  - Difficult to pipeline the code into balanced stages
- Run-to-completion / pool-of-threads model for parallelism (sketched below)
  - Each thread processes a packet from beginning to end
  - No thread-specific behavior
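A minimal sketch of the model, assuming hypothetical receive_packet()/send_packet() helpers for the input and output buffers; every thread runs the identical loop:

    typedef struct packet packet;

    packet *receive_packet(void);   /* hypothetical: next packet from input buffer */
    void    send_packet(packet *p); /* hypothetical: enqueue to output buffer */
    void    process(packet *p);     /* application code; may lock shared state */

    void thread_main(void)
    {
        for (;;) {
            packet *p = receive_packet(); /* claim one packet */
            process(p);                   /* handle it from beginning to end */
            send_packet(p);               /* no pipeline stages, no hand-offs */
        }
    }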
32. Impact of Allowing Packet Drops
[Chart: throughput over time for the NAT benchmark]
33. Cycle Breakdown in Simulation
[Charts: cycle breakdown for Classifier, NAT, and UDHCP]
- Removed cycles stalled waiting for a lock
- Throughput still dominated by serialization
34. More Sophisticated Thread Scheduling
- Add a pipeline stage to pick a hazard-free instruction
- Result:
  - Increased instruction latency
  - Increased hazard window
  - Increased branch misprediction cost
[Diagram: an instruction-select MUX added ahead of the pipeline]
Can we add hazard detection without an extra pipeline stage?
35. Implementation
- Where to store the hazard distance bits?
  - Block RAMs are a multiple of 9 bits wide
  - A 36-bit word leaves 4 bits available beyond the 32-bit instruction
  - These bits also encode the lock and unlock flags
[Diagram: 36-bit word = 32-bit instruction + 4 annotation bits]
How to convert instructions from 36 bits to 32 bits?
36. Instruction Compaction: 36 → 32 bits
- R-Type instructions, example: add rd, rs, rt
- J-Type instructions, example: j label
- I-Type instructions, example: addi rt, rs, immediate
- De-compaction: 2 block RAMs + some logic between the DDR and the cache (a packing sketch follows)
- Not on a critical path of the pipeline
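As a rough sketch of the annotated 36-bit word, here is one way the fields could pack and unpack in C; the exact split of the 4 spare bits between the hazard distance and the lock/unlock flags is an assumption:

    #include <stdint.h>

    typedef struct {
        uint32_t insn;        /* the 32-bit MIPS instruction */
        uint8_t  hazard_dist; /* assumed 2 bits */
        uint8_t  lock_flag;   /* assumed 1 bit  */
        uint8_t  unlock_flag; /* assumed 1 bit  */
    } annotated;

    /* Pack into a 36-bit block-RAM word (carried in a uint64_t here). */
    uint64_t pack(annotated a)
    {
        return ((uint64_t)(a.hazard_dist & 3u) << 34) |
               ((uint64_t)(a.lock_flag   & 1u) << 33) |
               ((uint64_t)(a.unlock_flag & 1u) << 32) |
               a.insn;
    }

    /* De-compaction direction: recover the fields from the 36-bit word. */
    annotated unpack(uint64_t w)
    {
        annotated a = {
            .insn        = (uint32_t)w,
            .hazard_dist = (uint8_t)((w >> 34) & 3u),
            .lock_flag   = (uint8_t)((w >> 33) & 1u),
            .unlock_flag = (uint8_t)((w >> 32) & 1u),
        };
        return a;
    }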