Title: Data Parallel FPGA Workloads: Software Versus Hardware
Slide 1: Data Parallel FPGA Workloads: Software Versus Hardware
- Peter Yiannacouras
- J. Gregory Steffan
- Jonathan Rose
- FPL 2009
Slide 2: FPGA Systems and Soft Processors
Figure: two ways to implement computation in an FPGA digital system. The HDL + CAD flow takes weeks to months; the software-compiler flow targets a soft processor, used in 25% of designs (source: Altera, 2009). The two flows compete: is hardware faster, smaller, and lower power? Is software easier and more configurable?
Slide 3: Vector Processing Primer
vadd example:

    // C code
    for (i = 0; i < 16; i++)
        c[i] = a[i] + b[i];

    // Vectorized code
    set    vl, 16
    vload  vr0, a
    vload  vr1, b
    vadd   vr2, vr0, vr1
    vstore vr2, c
Figure: with 1 vector lane, the 16 element-wise additions vr2[i] = vr0[i] + vr1[i] execute one element per cycle. Each vector instruction encodes many independent operations.
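To make the primer concrete, here is a minimal C sketch of the strip-mining pattern that the set vl instruction above enables when the array length exceeds the maximum vector length (MVL); vadd_chunk is an illustrative stand-in for the vload/vadd/vstore sequence, not a real VIRAM intrinsic:

    /* Strip-mining sketch: process the arrays in chunks of at most MVL
       elements.  vadd_chunk models one vload/vadd/vstore sequence; it
       is a hypothetical helper, not a VIRAM instruction. */
    #define MVL 16

    static void vadd_chunk(int *c, const int *a, const int *b, int vl)
    {
        for (int i = 0; i < vl; i++)   /* one vector instruction's worth */
            c[i] = a[i] + b[i];
    }

    void vadd(int *c, const int *a, const int *b, int n)
    {
        for (int done = 0; done < n; done += MVL) {
            int vl = (n - done < MVL) ? (n - done) : MVL;  /* "set vl" */
            vadd_chunk(c + done, a + done, b + done, vl);
        }
    }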
Slide 4: Vector Processing Primer
Figure: the same vadd example, now with 16 vector lanes: all 16 element-wise additions issue in a single cycle, one per lane.
- Previous Work (on Soft Vector Processors)
- Scalability
- Flexibility
- Portability
Slide 5: Soft Vector Processors vs HW
Figure: the soft vector processor route adds a vectorizer to the software-compiler flow and scales from 1 to 16 vector lanes; the lanes are scalable, fine-tunable, and customizable. Against the weeks-to-months HDL + CAD flow, the question is how much faster, smaller, and lower power custom hardware remains, and how much easier the software route is.
Slide 6: Measuring the Gap
Figure: evaluation flow. EEMBC benchmarks are implemented three ways: on a scalar soft processor, on a soft vector processor, and as HW circuits. Each implementation is evaluated for speed and area, the soft processors are compared against the HW circuits, and conclusions are drawn.
Slide 7: VESPA Architecture Design (Vector Extended Soft Processor Architecture)
Figure: VESPA pipelines. A 3-stage scalar pipeline (decode, register file, ALU, writeback) fetches from the instruction cache. A 3-stage vector control pipeline (decode, VC/VS register files, writeback) feeds a 6-stage vector pipeline that decodes, replicates, and hazard-checks each vector instruction before issuing it across the 32-bit lanes (lane 1: ALU and memory unit; lane 2: ALU, memory unit, and multiplier). All pipelines share the data cache. VESPA supports the integer and fixed-point operations of the VIRAM instruction set.
Slide 8: VESPA Parameters
Description                  Symbol  Values

Compute architecture:
  Number of Lanes            L       1, 2, 4, 8, ...
  Memory Crossbar Lanes      M       1, 2, ..., L
  Multiplier Lanes           X       1, 2, ..., L

Instruction set architecture:
  Maximum Vector Length      MVL     2, 4, 8, ...
  Width of Lanes (in bits)   W       1-32
  Instruction Enable (each)  -       on/off

Memory hierarchy:
  Data Cache Capacity        DD      any
  Data Cache Line Size       DW      any
  Data Prefetch Size         DPK     < DD
  Vector Data Prefetch Size  DPV     < DD/MVL
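As a minimal sketch, this parameter space could be captured in software as a plain C struct; the field names and example values below are assumptions for illustration, not taken from the paper:

    /* Hypothetical sketch of one VESPA configuration as a C struct.
       Field names and example values are illustrative only. */
    #include <stdbool.h>

    struct vespa_config {
        int  lanes;            /* L: number of vector lanes       */
        int  mem_xbar;         /* M: memory crossbar lanes (<= L) */
        int  mul_lanes;        /* X: multiplier lanes (<= L)      */
        int  mvl;              /* MVL: maximum vector length      */
        int  width_bits;       /* W: lane width, 1-32 bits        */
        bool insn_enable[64];  /* per-instruction enable flags    */
        int  dcache_bytes;     /* DD: data cache capacity         */
        int  dline_bytes;      /* DW: data cache line size        */
        int  prefetch_k;       /* DPK: data prefetch size (< DD)  */
        int  prefetch_v;       /* DPV: vector prefetch (< DD/MVL) */
    };

    /* Example: a 16-lane, 32-bit configuration with a 16KB cache. */
    static const struct vespa_config cfg16 = {
        .lanes = 16, .mem_xbar = 16, .mul_lanes = 16,
        .mvl = 64, .width_bits = 32,
        .dcache_bytes = 16 * 1024, .dline_bytes = 64,
        .prefetch_k = 0, .prefetch_v = 8,
    };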
Slide 9: VESPA Evaluation Infrastructure
Figure: software flow: EEMBC C benchmarks are compiled with GCC, vectorized assembly subroutines are assembled with GNU as, and ld links everything into an ELF binary that runs through instruction set simulation to obtain cycle counts. Hardware flow: the Verilog for the scalar µP and the vector processing unit (vpu) runs through RTL simulation and through Altera Quartus II v8.1 to obtain area and clock frequency, and is deployed on the TM4 platform. The instruction set and RTL simulations verify each other.
Slide 10: Measuring the Gap
Figure: the same evaluation flow as slide 6, with VESPA as the soft vector processor.
Slide 11: Designing HW Circuits (with simplifying assumptions)
Figure: each HW circuit is a control block plus a datapath issuing memory requests to a DDR core. Area and clock frequency are measured with Altera Quartus II v8.1; the cycle count is modelled, idealized in hardware's favour:
- assume the datapath is fed at full DDR bandwidth
- calculate execution time from the data size
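A minimal sketch of this idealized model, assuming runtime is bounded purely by streaming the data set at full DDR bandwidth (the function name and example numbers are illustrative, not from the paper):

    /* Idealized HW execution-time model: the circuit is assumed to be
       fed at full DDR bandwidth, so runtime is bounded by how fast DDR
       can stream the data set; datapath latency is ignored. */
    #include <stdio.h>

    static double idealized_hw_time(double data_bytes, double ddr_bytes_per_sec)
    {
        return data_bytes / ddr_bytes_per_sec;
    }

    int main(void)
    {
        /* e.g., 1 MB of input streamed at 1.6 GB/s */
        printf("%.6f s\n", idealized_hw_time(1 << 20, 1.6e9));
        return 0;
    }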
Slide 12: Benchmarks Converted to HW
Figure: the EEMBC benchmarks (VIRAM-vectorized for VESPA) implemented on a Stratix III 3S200C2. The HW circuits clock at 275-475 MHz; VESPA clocks at 120-140 MHz.
Slide 13: Performance/Area Space (vs HW)
Figure: slowdown vs HW plotted against area vs HW, with the optimistic HW design point at (1,1). The scalar soft processor is 432x slower and 7x larger than HW; the fastest VESPA is 17x slower and 64x larger.
Slide 14: Area-Delay Product
- Commonly used to measure efficiency in silicon
- Considers both performance and area
- Inverse of performance-per-area
- Calculated as (Area) × (Wall-Clock Execution Time)
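A minimal sketch of the arithmetic; the example numbers are illustrative, and the deck's 900x figure is a geometric mean across benchmarks, not the product of any single area/time pair:

    /* Area-delay product: area multiplied by wall-clock execution time.
       Example numbers are illustrative only. */
    #include <stdio.h>

    static double area_delay(double area, double exec_time)
    {
        return area * exec_time;
    }

    int main(void)
    {
        double hw  = area_delay(1.0, 1.0);    /* normalized HW baseline     */
        double svp = area_delay(64.0, 17.0);  /* e.g., 64x area, 17x slower */
        printf("%.0fx worse area-delay than HW\n", svp / hw);
        return 0;
    }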
Slide 15: Area-Delay Space (vs HW)
Figure: HW area-delay advantage plotted against HW area advantage. The scalar soft processor has 2900x worse area-delay than HW; VESPA narrows this to 900x.
Slide 16: Reducing the Performance Gap
- Previously, VESPA was 50x slower than HW
- Reducing loop overhead:
  - VESPA: decoupled pipelines (7% speed)
- Improving data delivery:
  - VESPA: parameterized cache (2x speed, 2x area)
  - VESPA: data prefetching (42% speed)
Slide 17: Wider Cache Line Size
Figure: a vld.w loads 16 sequential 32-bit words. With 16 lanes, the vector coprocessor's lanes reach a 4KB data cache with 16-byte lines through the vector memory crossbar, so the 16-word load needs four cache accesses (lanes 0-3, 4-7, 8-11, 12-15).
Slide 18: Wider Cache Line Size
Figure: the same vld.w against a 4x larger cache (16KB) with 4x wider lines (64 bytes): the whole 16-word load is satisfied in a single cache access, matching the width of the 16-lane vector memory crossbar.
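The benefit is easy to model: accesses per aligned vector load are the vector's footprint divided by the line size, rounded up. A minimal sketch, assuming aligned accesses (this is not VESPA's actual crossbar logic):

    /* Accesses per aligned vector load = ceil(vector bytes / line bytes). */
    #include <stdio.h>

    static unsigned accesses_per_vload(unsigned words, unsigned word_bytes,
                                       unsigned line_bytes)
    {
        unsigned vec_bytes = words * word_bytes;
        return (vec_bytes + line_bytes - 1) / line_bytes;  /* ceiling */
    }

    int main(void)
    {
        printf("16B lines: %u accesses\n", accesses_per_vload(16, 4, 16)); /* 4 */
        printf("64B lines: %u accesses\n", accesses_per_vload(16, 4, 64)); /* 1 */
        return 0;
    }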
Slide 19: Hardware Prefetching Example
Figure: successive vld.w instructions with and without prefetching. With no prefetching, every vld.w misses in the data cache and pays a 10-cycle penalty to DDR. Prefetching 3 blocks on a miss means only the first vld.w pays the 10-cycle penalty and the following vld.w instructions hit.
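A minimal cycle-count sketch of this example, assuming each miss fetches the requested block plus a fixed number of prefetched blocks; the 10-cycle penalty is from the slide, the rest is an illustrative model:

    /* On a miss, fetch the missing block plus `prefetch` extra blocks,
       so the next `prefetch` loads hit. */
    #include <stdio.h>

    static unsigned miss_cycles(unsigned loads, unsigned prefetch,
                                unsigned penalty)
    {
        unsigned cycles = 0, covered = 0;
        for (unsigned i = 0; i < loads; i++) {
            if (covered == 0) {   /* miss: fetch block + prefetches */
                cycles += penalty;
                covered = prefetch;
            } else {              /* hit on a prefetched block */
                covered--;
            }
        }
        return cycles;
    }

    int main(void)
    {
        printf("no prefetch: %u cycles\n", miss_cycles(4, 0, 10)); /* 40 */
        printf("prefetch 3 : %u cycles\n", miss_cycles(4, 3, 10)); /* 10 */
        return 0;
    }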
Slide 20: Reducing the Area Gap (by Customizing the Instruction Set)
- FPGAs can be reconfigured between applications
- Observation: not all applications
  - operate on 32-bit data types
  - use the entire vector instruction set
- So, eliminate unused hardware
Slide 21: VESPA Parameters
Description                  Symbol  Values
Number of Lanes              L       1, 2, 4, 8, ...
Maximum Vector Length        MVL     2, 4, 8, ...
Width of Lanes (in bits)     W       1-32         <- reduce width
Memory Crossbar Lanes        M       1, 2, ..., L
Multiplier Lanes             X       1, 2, ..., L
Instruction Enable (each)    -       on/off       <- subset instruction set
Data Cache Capacity          DD      any
Data Cache Line Size         DW      any
Data Prefetch Size           DPK     < DD
Vector Data Prefetch Size    DPV     < DD/MVL
Slide 22: Customized VESPA vs HW
Figure: slowdown vs HW plotted against area vs HW for customized (width-reduced, instruction-subsetted) VESPA configurations; customization reduces area by up to 45%.
Slide 23: Summary
- VESPA is more competitive with HW design:
  - Fastest VESPA only 17x slower than HW
  - Scalar soft processor was 432x slower than HW
  - Attacking loop overhead and data delivery was key: decoupled pipelines, cache tuning, data prefetching
  - Further enhancements can reduce the gap more
- VESPA improves the efficiency of silicon usage:
  - 900x worse area-delay than HW
  - Scalar soft processor: 2900x worse area-delay than HW
  - Subsetting/width reduction can further reduce this to 561x
Slide 24: Thank You!
- Stay tuned for public release
- GNU assembler ported for VIRAM (integer only)
- VESPA hardware design (DE3 ready)
Slide 25: Breaking Down Performance
- Components of performance, illustrated on a loop of the form "Loop: <work>; goto Loop":
  a) iteration-level parallelism
  b) cycles per iteration
  c) clock period
- Measure the HW advantage in each of these components; the total HW advantage is the product of the three (see the sketch below)
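A minimal sketch of composing the per-component advantages into the total; the inputs are the geomean values from the next slide:

    /* The three per-component HW advantages multiply into the total. */
    #include <stdio.h>

    int main(void)
    {
        double clock = 3.2;   /* HW clock frequency advantage      */
        double ilp   = 0.64;  /* HW iteration-level parallelism    */
        double cpi   = 8.2;   /* HW cycles-per-iteration advantage */
        /* prints ~16.8x, matching the ~17x total on slide 26 */
        printf("total HW advantage: %.1fx\n", clock * ilp * cpi);
        return 0;
    }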
Slide 26: Breakdown of Performance Loss (16-lane VESPA vs HW)
Benchmark      Clock Frequency   Iteration-Level Parallelism   Cycles per Iteration
autcor         2.6x              1x                            9.1x
conven         3.9x              1x                            6.1x
rgbcmyk        3.7x              0.375x                        13.8x
rgbyiq         2.2x              0.375x                        19.0x
ip_checksum    3.7x              0.5x                          4.8x
imgblend       3.6x              1x                            4.4x
GEOMEAN        3.2x              0.64x                         8.2x

Total HW advantage: 17x. Cycles per iteration is the largest factor.
Slide 27: 1-Lane VESPA vs Scalar
- Efficient pipeline execution
- Large vector register file for storage
- Amortization of loop control instructions
- More powerful ISA (VIRAM vs MIPS):
  - support for fixed-point operations
  - predication
  - built-in min/max/absolute instructions
- Execution in both the scalar core and the vector coprocessor
- Manual vectorization in assembly versus scalar GCC
Slide 28: Measuring the Gap
Figure: the three implementations of the EEMBC C benchmarks that are compared:
- Scalar MIPS soft processor, running C (complete, real)
- VESPA VIRAM soft vector processor, running vectorized assembly (complete, real)
- HW: a custom Verilog circuit for each benchmark (simplified, idealized)
Slide 29: Reporting Comparison Results
Comparisons, both against HW (Verilog):
1. Scalar (C) vs HW (Verilog)
2. VESPA (vector assembly) vs HW (Verilog)

Metrics: performance (wall-clock time) and area (actual silicon area), reported as:

    HW Speed Advantage = Execution Time of Processor / Execution Time of Hardware
    HW Area Advantage  = Area of Processor / Area of Hardware
Slide 30: Cache Design Space Performance (Wall-Clock Time)
Figure: wall-clock performance across the data cache design space; clock frequency varies only slightly across configurations (122, 123, 126, and 129 MHz).
Slide 31: Vector Length Prefetching: Performance
Figure: speedup from vector length prefetching: 29% at its peak, 21% overall, and up to 2.2x on the most receptive benchmark; some benchmarks are not receptive; no cache pollution is observed.
Slide 32: Overall Memory System Performance
Figure: overall memory system performance with 16 lanes, comparing the 4KB and 16KB data cache configurations (plot annotations: 67%, 48%, 31%, 15%, 4%).