The Microarchitecture of FPGA-Based Soft Processors

Peter Yiannacouras, Jonathan Rose, Greg Steffan
University of Toronto, Electrical and Computer Engineering

1
The Microarchitecture of FPGA-Based Soft Processors

Peter Yiannacouras
Jonathan Rose
Greg Steffan
University of Toronto
Electrical and Computer Engineering

2
Processors and FPGAs

Processors present in many digital systems

Processor
Custom Logic

Soft processors - implemented in FPGA fabric

3
Motivation for understanding soft processor architecture

Soft processors are popular
16% of FPGA designs use a soft processor
(FPGA Journal, November 2003)
This number has increased and will continue to increase
Soft processors are end-user customizable
Application-specific architectural tradeoffs
Can be tuned by designers

4
Don't we already understand processor architecture?

Not accurately/completely
Accurate cycle-to-cycle behaviour
Estimated area/power
No clock frequency impact
Not in FPGA domain
Lookup tables vs transistors
Dedicated RAMs and multipliers are fast

5
Research Goals

Generate soft processor implementations
System for generating RTL
Develop measurement methodology
Metrics for comparing soft processors
Develop understanding of architectural tradeoffs
Analyze area/performance/power space

6
Soft Processor Rapid Exploration Environment (SPREE)

7
Input: Instruction Set Architecture (ISA) Description

Graph of Generic Operations (GENOPs)
Edges indicate flow of data

[Figure: the ISA and datapath descriptions are the inputs to SPREE. Example GENOP graph for MIPS ADD (add rd, rs, rt): FETCH, RFREAD, RFREAD, ADD, RFWRITE]
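
The GENOP graph for the MIPS ADD example can be sketched in a few lines of Python. This is only an illustration of the idea of "a graph of generic operations whose edges indicate flow of data"; the class name, node ids, and the exact edge wiring (FETCH feeding both register reads) are assumptions and not SPREE's actual input format.

```python
# Hypothetical sketch: GenopGraph and the node ids are illustrative names,
# not part of SPREE. Only the GENOP names come from the slide.
from dataclasses import dataclass, field


@dataclass
class GenopGraph:
    name: str
    nodes: dict = field(default_factory=dict)  # node id -> GENOP type
    edges: list = field(default_factory=list)  # (src, dst): data flows src -> dst

    def add_node(self, node_id, genop):
        self.nodes[node_id] = genop

    def add_edge(self, src, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be declared before connecting them")
        self.edges.append((src, dst))

    def topological_order(self):
        """Return node ids so that every producer precedes its consumers."""
        indegree = {n: 0 for n in self.nodes}
        for _, dst in self.edges:
            indegree[dst] += 1
        ready = [n for n in self.nodes if indegree[n] == 0]
        order = []
        while ready:
            node = ready.pop(0)
            order.append(node)
            for src, dst in self.edges:
                if src == node:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        if len(order) != len(self.nodes):
            raise ValueError("graph contains a cycle")
        return order


def mips_add():
    """Build the graph for 'add rd, rs, rt' (edge wiring is an assumption)."""
    g = GenopGraph("add rd, rs, rt")
    g.add_node("fetch", "FETCH")
    g.add_node("read_rs", "RFREAD")
    g.add_node("read_rt", "RFREAD")
    g.add_node("add", "ADD")
    g.add_node("write_rd", "RFWRITE")
    g.add_edge("fetch", "read_rs")
    g.add_edge("fetch", "read_rt")
    g.add_edge("read_rs", "add")
    g.add_edge("read_rt", "add")
    g.add_edge("add", "write_rd")
    return g
```

Walking the graph in topological order gives one valid order in which the operations could be scheduled on a datapath.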
8
Input: Datapath Description

Interconnection of hand-coded components
Allows efficient synthesis
Described using C

[Figure: the ISA and datapath descriptions are the inputs to SPREE. The example datapath is built from the SPREE Component Library: Ifetch, Reg File, Mul, Data Mem, ALU, Shifter, Write Back]
9
Step 1. ISA vs Datapath Verification

Components described using GENOPs

[Figure: SPREE verifies the datapath against the ISA. Example GENOP graph: FETCH, RFREAD, RFREAD, ADD, RFWRITE]
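
Because both the instructions and the components are described in terms of GENOPs, one simple consistency check is that every GENOP an instruction needs is implemented by some component. The sketch below shows only that coverage check; it ignores the data-flow edges, and the function name, the dictionary layout, and the "MUL" GENOP are assumptions made for illustration rather than SPREE's actual verification step.

```python
# Hypothetical sketch of a GENOP coverage check (not SPREE's real algorithm).
def missing_genops(isa, datapath):
    """Report GENOPs that no datapath component implements.

    isa:      instruction name -> list of GENOPs the instruction needs
    datapath: component name   -> set of GENOPs the component implements
    Returns a dict: instruction -> sorted list of unsupported GENOPs.
    An empty dict means every instruction is covered.
    """
    provided = set()
    for genops in datapath.values():
        provided |= set(genops)

    missing = {}
    for instr, needed in isa.items():
        gaps = sorted(set(needed) - provided)
        if gaps:
            missing[instr] = gaps
    return missing
```

For example, a datapath with no multiplier component would be reported as unable to support a multiply instruction, while the ADD example would pass.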
10
Step 2. Datapath Instantiation

Multiplexer insertion
Unused connection/component removal

[Figure: ISA and datapath inputs to SPREE]
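
The two instantiation tasks can be illustrated with a small sketch: drop components (and their connections) that no instruction uses, then place a multiplexer in front of any input port that is still driven by more than one source. The function name, the tuple layout for connections, and the port names are assumptions for illustration, not SPREE's internal representation.

```python
# Hypothetical sketch of datapath instantiation (names are illustrative).
def instantiate(components, connections, used_components):
    """Prune an over-specified datapath and find where muxes are needed.

    components:      list of component names
    connections:     list of (src_component, dst_component, dst_port)
    used_components: set of components needed by at least one instruction
    Returns (kept_components, kept_connections, muxes) where muxes maps
    (dst_component, dst_port) -> list of sources that must be multiplexed.
    """
    # Unused component removal
    kept = [c for c in components if c in used_components]
    kept_set = set(kept)

    # Unused connection removal: both endpoints must survive
    kept_conns = [(s, d, p) for (s, d, p) in connections
                  if s in kept_set and d in kept_set]

    # Multiplexer insertion: any port with more than one driver needs a mux
    sources = {}
    for s, d, p in kept_conns:
        sources.setdefault((d, p), []).append(s)
    muxes = {port: srcs for port, srcs in sources.items() if len(srcs) > 1}

    return kept, kept_conns, muxes
```

In this toy model, a register-file write port fed by both an ALU and a multiplier would get a two-input mux, and a shifter that no instruction uses would disappear along with its wires.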
11
Step 3. Control Generation

[Figure: from the ISA and datapath inputs, SPREE generates control logic for the datapath components: Ifetch, Reg File, Mul, ALU, Data Mem, Write Back]
12
Output: Verilog RTL Description

[Figure: from the ISA and datapath inputs, SPREE emits Verilog RTL for the complete processor: control logic plus Ifetch, Reg File, Mul, ALU, Data Mem, Write Back]
13
Back-end Infrastructure
Benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
Modelsim RTL Simulator: 1. Cycle Count
Quartus II 4.2 CAD Software (Stratix 1S40): 2. Resource Usage, 3. Clock Frequency, 4. Power

14
Metrics for Measurement

Area: Equivalent Stratix Logic Elements (LEs)
Relative silicon areas used for RAMs/Multipliers
Performance: Wall clock time
Cycle count / clock frequency
Arithmetic mean across benchmark set
Energy: Dynamic energy (e.g. nJ/instr)
Excluding I/O
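
As a worked illustration of these metrics, the sketch below computes wall clock time as cycle count divided by clock frequency, averages it over a benchmark set, folds RAMs and multipliers into an equivalent-LE area, and converts dynamic power into energy per instruction. All function and parameter names are assumptions, and the per-RAM and per-multiplier LE weights are caller-supplied placeholders; the slides do not give the actual Stratix weights.

```python
# Hypothetical sketch of the three metrics; names and weights are assumptions.
from statistics import mean


def wall_clock_seconds(cycle_count, clock_hz):
    """Wall clock time = cycle count / clock frequency."""
    return cycle_count / clock_hz


def mean_wall_clock(cycle_counts, clock_hz):
    """Arithmetic mean of wall clock time across a benchmark set.

    cycle_counts: benchmark name -> measured cycle count
    """
    return mean(wall_clock_seconds(c, clock_hz) for c in cycle_counts.values())


def equivalent_les(logic_les, num_rams, num_multipliers,
                   les_per_ram, les_per_multiplier):
    """Area in equivalent LEs: hard blocks are weighted by relative silicon area.

    les_per_ram and les_per_multiplier must be supplied by the caller.
    """
    return (logic_les
            + num_rams * les_per_ram
            + num_multipliers * les_per_multiplier)


def energy_per_instr_nj(dynamic_power_w, seconds, instr_count):
    """Dynamic energy per instruction in nanojoules."""
    return dynamic_power_w * seconds / instr_count * 1e9
```

Note that a change that lowers cycle count can still lose on wall clock time if it also lowers the achievable clock frequency, which is why both terms are measured.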

15
Trace-Based Verification

Ensure SPREE generates functional processors

[Figure: the benchmark applications are run on both Modelsim (RTL simulator, executing the generated RTL) and MINT (instruction-set simulator). Each produces a trace (e.g. 110100 101011 111101), and the two traces are compared]
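
The comparison step amounts to walking the two traces in lockstep and reporting the first point where they disagree. The sketch below assumes each trace has already been parsed into a list of entries (for example one bit string per entry); the function name and the trace representation are assumptions, since the slides do not specify what is recorded in a trace.

```python
# Hypothetical sketch of trace comparison; the trace format is an assumption.
from itertools import zip_longest


def first_divergence(rtl_trace, iss_trace):
    """Return the index of the first mismatching entry, or None if identical.

    A trace that ends early counts as a mismatch at the index where it ends.
    """
    for i, (rtl, iss) in enumerate(zip_longest(rtl_trace, iss_trace)):
        if rtl != iss:
            return i
    return None
```

A result of None for every benchmark is evidence that the generated processor is functionally correct on that workload; any other result points at the first entry to debug.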
16
Architectural Exploration Results
17
Architectural Features Explored

Hardware vs software multiplication
Shifter implementation
Pipelining
Depth
Organization
Forwarding

18
Validation of SPREE Through Comparison to Altera's Nios II

Has three variations
Nios II/e: unpipelined, no HW multiplier
Nios II/s: 5-stage, with HW multiplier
Nios II/f: 6-stage, dynamic branch prediction
Caveats: not a completely fair comparison
Very similar but tweaked ISA
Nios II supports exceptions, OS, and caches
We do not, and so save on the hardware costs

19
SPREE vs Nios II
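[Figure: performance vs area plot with directions labelled "faster" and "smaller". Annotated SPREE design point: 3-stage pipe, HW multiply, multiply-based shifter]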
20
Architectural Features Explored

Hardware vs software multiplication
Shifter implementation
Pipelining
Depth
Organization
Forwarding

21
Hardware vs Software Multiplication

Hardware multiply is fast but not always needed
Wastes area (220 LEs) and can waste energy

22
Shifter Implementation

Shifters are expensive in FPGAs
We explore three implementations
Serial shifter (shift register)
Multiplier-based barrel shifter (hard multiplier)
LUT-based barrel shifter (multiplexer tree)
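
The three implementations compute the same result by different means, which a short behavioural model makes concrete. The sketch below models 32-bit left shifts only; the function names are assumptions, and the one-bit-per-cycle cost of the serial shifter is a modelling assumption rather than a figure taken from the slides.

```python
# Hypothetical behavioural models of the three shifter styles (left shift only).
MASK = 0xFFFFFFFF  # 32-bit datapath


def serial_shift_left(value, amount):
    """Shift register: move one bit position per cycle.

    Returns (result, cycles); cycles grows with the shift amount.
    """
    cycles = 0
    for _ in range(amount):
        value = (value << 1) & MASK
        cycles += 1
    return value, cycles


def multiplier_shift_left(value, amount):
    """Multiplier-based barrel shifter: a left shift by n is a multiply by 2**n."""
    return (value * (1 << amount)) & MASK


def mux_tree_shift_left(value, amount):
    """Multiplexer-tree barrel shifter: five levels for a 5-bit shift amount.

    Level k either passes the value through or shifts it by 2**k,
    selected by bit k of the shift amount.
    """
    for k in range(5):
        if (amount >> k) & 1:
            value = (value << (1 << k)) & MASK
    return value
```

The serial version needs little logic but takes a variable number of cycles, while the other two finish in a fixed time and differ in whether they consume a hard multiplier or lookup tables.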

23
Performance-Area of Different Shifter Implementations

[Figure: performance vs area plot with directions labelled "faster" and "smaller"]
24
Pipeline Depth

Explored between 2 and 7 stages
1-stage and 6-stage pipelines not interesting

2-stage: F/D/R/EX/M | WB
3-stage: F/D | R/EX/M | WB
4-stage: F | D | R/EX/M | WB
5-stage: F | D | R/EX | EX/M | WB
7-stage (new): F | D | R | EX | EX | EX/M | WB
25
Pipeline Depth and Performance
26
Pipeline Organization Tradeoff
4-stage (A): F | D | R/EX/M | WB
4-stage (B): F/D | R/EX | EX/M | WB
27
Pipeline Forwarding
F | D/R | EX | M | WB