Data Parallel FPGA Workloads: Software Versus Hardware

1
Data Parallel FPGA Workloads: Software Versus Hardware

Peter Yiannacouras
J. Gregory Steffan
Jonathan Rose
FPL 2009

2
FPGA Systems and Soft Processors
Digital System: computation
HDL + CAD (months): faster, smaller, less power
Software + Compiler (weeks): easier, configurable
COMPETE
Soft processors used in 25% of designs (source: Altera, 2009)
3
Vector Processing Primer
vadd

// C code
for (i = 0; i < 16; i++)
    c[i] = a[i] + b[i];

// Vectorized code
set    vl, 16
vload  vr0, a
vload  vr1, b
vadd   vr2, vr0, vr1
vstore vr2, c

vr2[15] = vr0[15] + vr1[15]
vr2[14] = vr0[14] + vr1[14]
vr2[13] = vr0[13] + vr1[13]
vr2[12] = vr0[12] + vr1[12]
vr2[11] = vr0[11] + vr1[11]
vr2[10] = vr0[10] + vr1[10]
vr2[9] = vr0[9] + vr1[9]
vr2[8] = vr0[8] + vr1[8]
vr2[7] = vr0[7] + vr1[7]
vr2[6] = vr0[6] + vr1[6]
vr2[5] = vr0[5] + vr1[5]
vr2[4] = vr0[4] + vr1[4]
vr2[3] = vr0[3] + vr1[3]
vr2[2] = vr0[2] + vr1[2]
vr2[1] = vr0[1] + vr1[1]
vr2[0] = vr0[0] + vr1[0]

Each vector instruction holds many units of independent operations
1 Vector Lane
4
Vector Processing Primer
vadd

// C code
for (i = 0; i < 16; i++)
    c[i] = a[i] + b[i];

// Vectorized code
set    vl, 16
vload  vr0, a
vload  vr1, b
vadd   vr2, vr0, vr1
vstore vr2, c

16 Vector Lanes
vr2[i] = vr0[i] + vr1[i], for i = 15 down to 0
Each vector instruction holds many units of independent operations

Previous Work (on Soft Vector Processors): Scalability, Flexibility, Portability
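
The vadd primer above can be mimicked in a few lines of Python to show what the lane count changes. This is only a sketch under a simplifying assumption that is not taken from the slides: each lane handles one element per pass, so a vector of length vl needs ceil(vl / lanes) passes. The function name and the notion of a "pass" are illustrative, not VESPA terminology.

```python
def vadd(vr0, vr1, vl, lanes):
    """Element-wise add of the first `vl` elements, `lanes` elements per pass.

    Returns (result, passes). Assumption: one element per lane per pass.
    """
    result = [0] * vl
    passes = 0
    for base in range(0, vl, lanes):
        # Every lane works on its own independent element of this group.
        for lane in range(min(lanes, vl - base)):
            i = base + lane
            result[i] = vr0[i] + vr1[i]
        passes += 1
    return result, passes


a = list(range(16))
b = [10 * x for x in range(16)]

c1, passes_1_lane = vadd(a, b, vl=16, lanes=1)
c16, passes_16_lanes = vadd(a, b, vl=16, lanes=16)
print(passes_1_lane, passes_16_lanes)  # prints "16 1"
```

The results are identical in both cases; only the number of passes differs, which is why the same vectorized code can run on any number of lanes.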
5
Soft Vector Processors vs HW
Soft Vector Processor
Weeks
Months
HDL CAD
Software Compiler
Vectorizer
Lane 1
Lane 2
Lane 3
Lane 4
Lane 5
Lane 6
Lane 7
Lane 8 16
Scalable Fine-tunable Customizable
Vector Lanes
How much?
HW: faster, smaller, less power
Soft vector processor: easier
6
Measuring the Gap
EEMBC Benchmarks
Soft Vector Processor, Scalar Soft Processor, HW Circuits
Evaluation of each: Speed, Area
Compare
Conclusions
7
VESPA Architecture Design (Vector Extended Soft Processor Architecture)
Icache
Dcache
Legend: Pipe stage, Logic, Storage
MUX
WB
Decode
RF
Scalar Pipeline 3-stage
ALU
VC RF
VC WB
Supports integer and fixed-point operations (VIRAM)
Vector Control Pipeline 3-stage
Logic
Shared Dcache
Decode
VS RF
VS WB
Decode
Replicate
Hazard check
VR RF
VR WB
Vector Pipeline 6-stage
VR RF
Lane 1 ALU,Mem Unit
VR WB
Lane 2 ALU, Mem, Mul
32-bit Lanes
8
VESPA Parameters
Description                   Symbol  Values

Compute Architecture
Number of Lanes               L       1, 2, 4, 8, ...
Memory Crossbar Lanes         M       1, 2, ..., L
Multiplier Lanes              X       1, 2, ..., L

Instruction Set Architecture
Maximum Vector Length         MVL     2, 4, 8, ...
Width of Lanes (in bits)      W       1-32
Instruction Enable (each)     -       on/off

Memory Hierarchy
Data Cache Capacity           DD      any
Data Cache Line Size          DW      any
Data Prefetch Size            DPK     < DD
Vector Data Prefetch Size     DPV     < DD/MVL
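
The parameter table reads naturally as a configuration record with range checks. The sketch below is hypothetical: the class and function names are invented for illustration, the field names simply reuse the table's symbols, "1, 2, 4, 8, ..." is assumed to mean powers of two, and DD, DPK and DPV are assumed to be expressed in the same units. It is not the VESPA configuration interface.

```python
from dataclasses import dataclass


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n >= 1 and (n & (n - 1)) == 0


@dataclass
class VespaParams:
    # Hypothetical record; field names follow the symbols in the table.
    L: int    # number of lanes
    M: int    # memory crossbar lanes
    X: int    # multiplier lanes
    MVL: int  # maximum vector length
    W: int    # width of lanes in bits
    DD: int   # data cache capacity
    DW: int   # data cache line size
    DPK: int  # data prefetch size
    DPV: int  # vector data prefetch size

    def problems(self):
        """Return a list of violated constraints (empty if none)."""
        found = []
        if not is_power_of_two(self.L):
            found.append("L must be 1, 2, 4, 8, ...")
        if not 1 <= self.M <= self.L:
            found.append("M must be in 1..L")
        if not 1 <= self.X <= self.L:
            found.append("X must be in 1..L")
        if self.MVL < 2 or not is_power_of_two(self.MVL):
            found.append("MVL must be 2, 4, 8, ...")
        if not 1 <= self.W <= 32:
            found.append("W must be in 1..32")
        if not self.DPK < self.DD:
            found.append("DPK must be < DD")
        if not self.DPV < self.DD / self.MVL:
            found.append("DPV must be < DD/MVL")
        return found


ok = VespaParams(L=16, M=16, X=16, MVL=64, W=32, DD=1024, DW=64, DPK=8, DPV=8)
bad = VespaParams(L=3, M=4, X=1, MVL=64, W=40, DD=1024, DW=64, DPK=8, DPV=16)
print(ok.problems())
print(bad.problems())
```

Returning a list of problems, instead of raising on the first one, makes it easier to sweep a design space and report every reason a point was rejected.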
9
VESPA Evaluation Infrastructure
SOFTWARE
HARDWARE
Verilog
EEMBC C Benchmarks
GCC
ld
scalar µP
ELF Binary

Vectorized assembly subroutines
GNU as
vpu
TM4
Instruction Set Simulation
RTL Simulation
Altera Quartus II v 8.1
area, clock frequency
cycles
verification
verification
10
Measuring the Gap
EEMBC Benchmarks
Soft Vector Processor (VESPA), Scalar Soft Processor, HW Circuits
Evaluation of each: Speed, Area
Compare
Conclusions
11
Designing HW Circuits (with simplifying assumptions)
HW
Memory Request
Idealized
Altera Quartus II v 8.1
Control
area, clock frequency
DDR Core
Datapath

cycle count (modelled)
Assume fed at full DDR bandwidth
Calculate execution time from data size
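
One plausible reading of the modelled cycle count, sketched in Python: if memory never stalls the datapath, the cycle count follows from the data size alone, and wall clock time is cycles divided by clock frequency. The function names, the bytes_per_cycle parameter and every input value are assumptions made up for illustration; the slides do not give the actual model or its numbers.

```python
def modelled_hw_cycles(data_bytes, bytes_per_cycle):
    """Cycles to stream the data through a datapath that is never stalled."""
    return -(-data_bytes // bytes_per_cycle)  # ceiling division


def wall_clock_s(cycles, clock_hz):
    """Wall clock time: cycle count divided by clock frequency."""
    return cycles / clock_hz


# Made-up illustration values, not measurements from the talk.
data_bytes = 1_100_000
bytes_per_cycle = 8            # hypothetical datapath consumption rate
hw_clock_hz = 275_000_000      # low end of the 275-475 MHz HW range
cpu_cycles = 600_000           # hypothetical processor cycle count
cpu_clock_hz = 120_000_000     # low end of the 120-140 MHz VESPA range

hw_cycles = modelled_hw_cycles(data_bytes, bytes_per_cycle)
hw_time = wall_clock_s(hw_cycles, hw_clock_hz)
cpu_time = wall_clock_s(cpu_cycles, cpu_clock_hz)

# HW speed advantage = execution time of processor / execution time of hardware
hw_speed_advantage = cpu_time / hw_time
print(f"{hw_cycles} HW cycles, HW speed advantage {hw_speed_advantage:.1f}x")
```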

12
Benchmarks Converted to HW
Stratix III 3S200C2
EEMBC
VIRAM
HW Clock 275-475 MHz
VESPA Clock 120-140 MHz
13
Performance/Area Space (vs HW)
[Plot axes: HW Speed Advantage (Slowdown vs HW), HW Area Advantage (Area vs HW)]
Scalar: 432x slower, 7x larger
Fastest VESPA: 17x slower, 64x larger
HW at (1,1): optimistic
14
Area-Delay Product

Commonly used to measure efficiency in silicon
Considers both performance and area
Inverse of performance-per-area
Calculated using: (Area) × (Wall Clock Execution Time)
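
As a small worked example of the formula, the sketch below multiplies the area and slowdown ratios quoted on the Performance/Area slide. The function names are illustrative. Note that the products are not the area-delay figures reported in the talk (2900x for scalar, 900x for VESPA): the scalar product lands close to 2900x, while the 900x figure presumably belongs to a different VESPA configuration than the fastest one. That last point is an interpretation, not something the slides state.

```python
def area_delay(area, exec_time):
    """Area-delay product: lower is better (inverse of performance-per-area)."""
    return area * exec_time


def area_delay_vs_hw(area_vs_hw, slowdown_vs_hw):
    """Ratio form: how many times worse than HW a design is in area-delay."""
    return area_vs_hw * slowdown_vs_hw


scalar = area_delay_vs_hw(area_vs_hw=7, slowdown_vs_hw=432)
fastest_vespa = area_delay_vs_hw(area_vs_hw=64, slowdown_vs_hw=17)
print(scalar, fastest_vespa)  # prints "3024 1088"
```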
15
Area-Delay Space (vs HW)
[Plot axes: HW Area-Delay Advantage (Area-Delay vs HW), HW Area Advantage]
Scalar: 2900x
VESPA: 900x
16
Reducing the Performance Gap

Previously VESPA was 50x slower than HW
Reducing loop overhead
VESPA: Decoupled pipelines (+7% speed)
Improving data delivery
VESPA: Parameterized cache (2x speed, 2x area)
VESPA: Data Prefetching (+42% speed)

17
Wider Cache Line Size
vld.w (load 16 sequential 32-bit words)
VESPA 16 lanes
[Diagram: Scalar and Vector Coproc lanes connected through the Vector Memory Crossbar to the Dcache]
Dcache 4KB, 16B line
18
Wider Cache Line Size
vld.w (load 16 sequential 32-bit words)
VESPA 16 lanes
[Diagram: Scalar and Vector Coproc lanes connected through the Vector Memory Crossbar to the Dcache]
Dcache 16KB, 64B line (4x the capacity and 4x the line size of the 4KB, 16B-line Dcache)
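
Why the wider line helps vld.w can be seen by counting how many cache lines the 16 sequential 32-bit words span. The sketch below assumes, as a simplification not taken from the slides, that each cache access delivers exactly one line; the function name and the start_addr parameter are illustrative.

```python
def lines_touched(num_words, word_bytes, line_bytes, start_addr=0):
    """Number of distinct cache lines covered by a sequential load."""
    first_line = start_addr // line_bytes
    last_line = (start_addr + num_words * word_bytes - 1) // line_bytes
    return last_line - first_line + 1


# vld.w: 16 sequential 32-bit (4-byte) words = 64 bytes
narrow = lines_touched(16, 4, line_bytes=16)
wide = lines_touched(16, 4, line_bytes=64)
print(narrow, wide)  # prints "4 1"
```

With a 16B line the load spans four lines, with a 64B line just one, provided the data starts on a line boundary; an unaligned start (for example start_addr=32) spans two 64B lines.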
19
Hardware Prefetching Example
No Prefetching: vld.w MISS, vld.w MISS
Prefetching 3 blocks: vld.w MISS, vld.w HIT
Each Dcache miss: 10 cycle penalty to DDR
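
The example can be turned into a toy miss counter for a purely sequential access stream. This sketch makes assumptions beyond the slide: the cache never evicts, a miss fetches the missing block plus the next prefetch_blocks blocks, and prefetched blocks add no extra penalty. Only the 10-cycle miss penalty and the prefetch depth of 3 come from the slide.

```python
MISS_PENALTY_CYCLES = 10  # from the slide


def count_misses(num_blocks, prefetch_blocks):
    """Misses when blocks 0..num_blocks-1 are accessed in order."""
    cached = set()
    misses = 0
    for block in range(num_blocks):
        if block not in cached:
            misses += 1
            # Fetch the missing block and prefetch the following ones.
            cached.update(range(block, block + prefetch_blocks + 1))
    return misses


no_prefetch = count_misses(num_blocks=8, prefetch_blocks=0)
with_prefetch = count_misses(num_blocks=8, prefetch_blocks=3)
print(no_prefetch * MISS_PENALTY_CYCLES, with_prefetch * MISS_PENALTY_CYCLES)
# prints "80 20"
```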
20
Reducing the Area Gap (by Customizing the Instruction Set)

FPGAs can be reconfigured between applications
Observations: not all applications operate on 32-bit data types or use the entire vector instruction set
Eliminate unused hardware

21
VESPA Parameters
Description                   Symbol  Values
Number of Lanes               L       1, 2, 4, 8, ...
Maximum Vector Length         MVL     2, 4, 8, ...
Width of Lanes (in bits)      W       1-32             (reduce width)
Memory Crossbar Lanes         M       1, 2, ..., L
Multiplier Lanes              X       1, 2, ..., L
Instruction Enable (each)     -       on/off           (subset instruction set)
Data Cache Capacity           DD      any
Data Cache Line Size          DW      any
Data Prefetch Size            DPK     < DD
Vector Data Prefetch Size     DPV     < DD/MVL
22
Customized VESPA vs HW
HW Speed Advantage
Slowdown vs HW
Area vs HW
HW Area Advantage
45
23
Summary

VESPA more competitive with HW design
Fastest VESPA only 17x slower than HW
Scalar soft processor was 432x slower than HW
Attacking loop overhead and data delivery was key
Decoupled pipelines, cache tuning, data prefetching
Further enhancements can reduce the gap more
VESPA improves efficiency of silicon usage
900x worse area-delay than HW
Scalar soft processor 2900x worse area-delay than HW
Subsetting/width reduction can further reduce to 561x

24
Thank You!

Stay tuned for public release
GNU assembler ported for VIRAM (integer only)
VESPA hardware design (DE3 ready)

25
Breaking Down Performance

Components of performance:
Iteration-level parallelism
Cycles per iteration
Clock period

[Diagram: three copies of "Loop: <work> goto Loop", with labels a), b), c)]
Measure the HW advantage in each of these components
26
Breakdown of Performance Loss (16 lane VESPA vs HW)
Benchmark     Clock Frequency  Iteration Level Parallelism  Cycles Per Iteration
autcor        2.6x             1x                           9.1x
conven        3.9x             1x                           6.1x
rgbcmyk       3.7x             0.375x                       13.8x
rgbyiq        2.2x             0.375x                       19.0x
ip_checksum   3.7x             0.5x                         4.8x
imgblend      3.6x             1x                           4.4x
GEOMEAN       3.2x             0.64x                        8.2x
Total: 17x
Largest factor: Cycles Per Iteration
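
The GEOMEAN row and the 17x total can be re-derived from the per-benchmark entries. The sketch below assumes that the total is the product of the three component factors, which is consistent with the numbers in the table but is not spelled out on the slide; the variable and function names are illustrative.

```python
import math

# (clock frequency, iteration-level parallelism, cycles per iteration), HW advantage
breakdown = {
    "autcor":      (2.6, 1.0,   9.1),
    "conven":      (3.9, 1.0,   6.1),
    "rgbcmyk":     (3.7, 0.375, 13.8),
    "rgbyiq":      (2.2, 0.375, 19.0),
    "ip_checksum": (3.7, 0.5,   4.8),
    "imgblend":    (3.6, 1.0,   4.4),
}


def geomean(values):
    """Geometric mean: n-th root of the product of n values."""
    values = list(values)
    return math.prod(values) ** (1.0 / len(values))


clock = geomean(row[0] for row in breakdown.values())
ilp = geomean(row[1] for row in breakdown.values())
cpi = geomean(row[2] for row in breakdown.values())
total = clock * ilp * cpi

print(f"{clock:.1f}x {ilp:.2f}x {cpi:.1f}x total {total:.0f}x")
# prints "3.2x 0.64x 8.2x total 17x"
```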
27
1-Lane VESPA vs Scalar

Efficient pipeline execution
Large vector register file for storage
Amortization of loop control instructions.
More powerful ISA (VIRAM vs MIPS)
Support for fixed-point operations
Predication
Built-in min/max/absolute instructions
Execution in both scalar and vector co-processor
Manual vectorization in assembly versus scalar GCC

28
Measuring the Gap
EEMBC C Benchmarks
Scalar: MIPS soft processor, C (complete, real)
COMPARE
VESPA: VIRAM soft vector processor, assembly (complete, real)
COMPARE
HW: Custom circuit for each benchmark, Verilog (simplified, idealized)
29
Reporting Comparison Results
1. Scalar (C) vs HW (Verilog)
2. VESPA (Vector assembly) vs HW (Verilog)
3. HW (Verilog)

Performance (wall clock time)
Area (actual silicon area)

HW Speed Advantage = Execution Time of Processor / Execution Time of Hardware
HW Area Advantage = Area of Processor / Area of Hardware
30
Cache Design Space Performance (Wall Clock Time)
[Chart: cache configurations labelled with clock frequencies of 122 MHz, 123 MHz, 126 MHz and 129 MHz]
31
Vector Length Prefetching - Performance
Peak 29
Not receptive
21
2.2x
no cache pollution
32
Overall Memory System Performance
16 lanes
67
48
31
4
(4KB)
(16KB)
15
