Title: The Microarchitecture of FPGA-Based Soft Processors
1. The Microarchitecture of FPGA-Based Soft Processors
- Peter Yiannacouras
- Jonathan Rose
- Greg Steffan
- University of Toronto
- Electrical and Computer Engineering
2. Processors and FPGAs
- Processors are present in many digital systems
[Figure: a digital system containing a Processor alongside Custom Logic]
- Soft processors are implemented in the FPGA fabric
3. Motivation for understanding soft processor architecture
- Soft processors are popular
- 16% of FPGA designs use a soft processor
- FPGA Journal, November 2003
- This number has increased and will continue to increase
- Soft processors are end-user customizable
- Application-specific architectural tradeoffs
- Can be tuned by designers
4. Don't we already understand processor architecture?
- Not accurately/completely
- Cycle-to-cycle behaviour is modelled accurately
- Area/power are only estimated
- Clock frequency impact is not considered
- Not in the FPGA domain
- Lookup tables vs transistors
- Dedicated RAMs and multipliers are fast
5. Research Goals
- Generate soft processor implementations
- System for generating RTL
- Develop measurement methodology
- Metrics for comparing soft processors
- Develop understanding of architectural tradeoffs
- Analyze area/performance/power space
6. Soft Processor Rapid Exploration Environment (SPREE)
7. Input: Instruction Set Architecture (ISA) Description
- Graph of Generic Operations (GENOPs)
- Edges indicate flow of data
- Example, MIPS ADD (add rd, rs, rt): FETCH → RFREAD (×2) → ADD → RFWRITE
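To make the GENOP graph concrete, here is a minimal sketch of how the MIPS ADD example might be encoded and traversed. The dictionary layout, the `_rs`/`_rt` node suffixes, and the exact edge set are assumptions made for illustration; they are not SPREE's actual input format.

```python
# Hypothetical encoding of one instruction as a graph of GENOPs.
# Node names follow the slide; the edge list is an assumption.
from collections import defaultdict

# Each edge is (producer, consumer): data flows from producer to consumer.
MIPS_ADD = {
    "nodes": ["FETCH", "RFREAD_rs", "RFREAD_rt", "ADD", "RFWRITE"],
    "edges": [
        ("FETCH", "RFREAD_rs"),   # instruction bits select rs
        ("FETCH", "RFREAD_rt"),   # instruction bits select rt
        ("RFREAD_rs", "ADD"),
        ("RFREAD_rt", "ADD"),
        ("ADD", "RFWRITE"),       # the sum is written back to rd
    ],
}


def topological_order(graph):
    """Return the GENOPs in an order that respects every data-flow edge."""
    indegree = {node: 0 for node in graph["nodes"]}
    successors = defaultdict(list)
    for src, dst in graph["edges"]:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = [node for node in graph["nodes"] if indegree[node] == 0]
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(graph["nodes"]):
        raise ValueError("GENOP graph contains a cycle")
    return order


print(topological_order(MIPS_ADD))
```

Keeping the instruction as a data-flow graph rather than a fixed stage sequence is what lets the same ISA description be checked against differently organized datapaths.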
8. Input: Datapath Description
- Interconnection of hand-coded components
- Allows efficient synthesis
- Described using C
[Figure: example datapaths assembled from the SPREE Component Library: Ifetch, Reg File, Mul, ALU, Shifter, Data Mem, Write Back]
9. Step 1: ISA vs Datapath Verification
- Components described using GENOPs
[Figure: the ISA's GENOP graph (FETCH, RFREAD, RFREAD, ADD, RFWRITE) is verified against the datapath]
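One way to picture this verification step is as a coverage check: every GENOP an instruction needs must be provided by some component in the datapath. The sketch below makes that simplifying assumption and ignores the connectivity between components, which a real check would also need to confirm. The component names, GENOP sets, and `missing_genops` helper are all hypothetical.

```python
# Simplified assumption: verification is reduced to set coverage of GENOPs.
# These library entries and ISA entries are hypothetical examples.
COMPONENT_GENOPS = {
    "ifetch": {"FETCH"},
    "reg_file": {"RFREAD", "RFWRITE"},
    "alu": {"ADD", "SUB", "AND", "OR"},
}

ISA = {
    "add": {"FETCH", "RFREAD", "ADD", "RFWRITE"},
    "mul": {"FETCH", "RFREAD", "MUL", "RFWRITE"},
}


def missing_genops(isa, datapath_components):
    """Map each unsupported instruction to the GENOPs the datapath lacks."""
    provided = set()
    for name in datapath_components:
        provided |= COMPONENT_GENOPS[name]
    report = {}
    for instr, ops in isa.items():
        lacking = ops - provided
        if lacking:
            report[instr] = sorted(lacking)
    return report


# A datapath without a multiplier cannot implement "mul".
print(missing_genops(ISA, ["ifetch", "reg_file", "alu"]))
```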
10. Step 2: Datapath Instantiation
- Multiplexer insertion
- Unused connection/component removal
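The two instantiation tasks can be sketched as follows, assuming the datapath is given as a list of (source, destination port) connections. A port driven by more than one source gets a multiplexer, and components no instruction uses are dropped. The signal names, the `mux_` naming scheme, and both helper functions are assumptions for illustration only.

```python
from collections import defaultdict


def insert_muxes(connections):
    """Insert a mux in front of every destination port with several sources.

    connections: list of (source, dest_port) pairs.
    Returns (new_connections, muxes), where muxes maps a mux name to its inputs.
    """
    sources_by_port = defaultdict(list)
    for src, port in connections:
        sources_by_port[port].append(src)

    new_connections, muxes = [], {}
    for port, sources in sources_by_port.items():
        if len(sources) == 1:
            new_connections.append((sources[0], port))
            continue
        mux = f"mux_{port}"
        muxes[mux] = sources
        for i, src in enumerate(sources):
            new_connections.append((src, f"{mux}.in{i}"))
        new_connections.append((mux, port))
    return new_connections, muxes


def prune_unused(components, used_by_isa):
    """Keep only the components that some instruction actually exercises."""
    return [c for c in components if c in used_by_isa]


conns, muxes = insert_muxes([
    ("alu.out", "regfile.wdata"),
    ("mem.out", "regfile.wdata"),   # second driver forces a mux
    ("ifetch.out", "regfile.raddr"),
])
print(muxes)
```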
11. Step 3: Control Generation
[Figure: generated Control blocks attached to the datapath components: Ifetch, Reg File, ALU, Mul, Data Mem, Write Back]
12. Output: Verilog RTL Description
[Figure: SPREE emits the datapath (Ifetch, Reg File, ALU, Mul, Data Mem, Write Back) and its Control blocks as Verilog RTL]
13. Back-end Infrastructure
- Benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
- Quartus II 4.2 CAD Software
- Modelsim RTL Simulator
- Stratix 1S40
- Measurements: 1. Cycle Count 2. Resource Usage 3. Clock Frequency 4. Power
14. Metrics for Measurement
- Area: equivalent Stratix Logic Elements (LEs)
- Relative silicon areas used for RAMs/multipliers
- Performance: wall clock time
- Cycle count ÷ clock frequency
- Arithmetic mean across benchmark set
- Energy: dynamic energy (e.g. nJ/instr)
- Excluding I/O
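The performance and energy metrics reduce to simple arithmetic, sketched below. The function names, units (MHz, joules), and example numbers are assumptions chosen for illustration and are not measurements from the study.

```python
from statistics import mean


def wall_clock_time_s(cycle_count, clock_mhz):
    """Wall clock time in seconds = cycle count / clock frequency."""
    return cycle_count / (clock_mhz * 1e6)


def mean_wall_clock_time_s(cycle_counts, clock_mhz):
    """Arithmetic mean of wall clock time across a benchmark set.

    cycle_counts: dict mapping benchmark name -> cycle count.
    """
    return mean(wall_clock_time_s(c, clock_mhz) for c in cycle_counts.values())


def energy_per_instruction_nj(dynamic_energy_j, instructions):
    """Dynamic energy per instruction, in nanojoules."""
    return dynamic_energy_j * 1e9 / instructions


# Made-up numbers, purely to exercise the formulas.
example_cycles = {"bench_a": 1_000_000, "bench_b": 3_000_000}
print(mean_wall_clock_time_s(example_cycles, clock_mhz=100))
print(energy_per_instruction_nj(2e-3, 1_000_000))
```

Using wall clock time rather than cycle count alone is what lets a design with a worse cycle count but a higher clock frequency be credited correctly.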
15. Trace-Based Verification
- Ensure SPREE generates functional processors
[Figure: Benchmark Applications run on both the generated RTL in Modelsim (RTL Simulator) and on MINT (Instruction-set Simulator); the two traces are compared]
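The comparison step can be sketched as finding the first point where the two traces diverge. The trace format here, a list of (destination register, value) tuples, is an assumption; the actual contents of the Modelsim and MINT traces are not given in the slides.

```python
def first_mismatch(rtl_trace, iss_trace):
    """Return the index of the first differing entry, or None if traces match.

    A trace that is a strict prefix of the other counts as a mismatch at the
    index where the shorter one ends.
    """
    for i, (rtl_entry, iss_entry) in enumerate(zip(rtl_trace, iss_trace)):
        if rtl_entry != iss_entry:
            return i
    if len(rtl_trace) != len(iss_trace):
        return min(len(rtl_trace), len(iss_trace))
    return None


# Hypothetical traces: each entry is (destination register, value written).
golden = [("r1", 5), ("r2", 7), ("r3", 12)]
buggy = [("r1", 5), ("r2", 7), ("r3", 13)]
print(first_mismatch(golden, golden))
print(first_mismatch(buggy, golden))
```

Reporting the first divergence, rather than a pass/fail flag, points directly at the instruction where a generated processor first misbehaves.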
16. Architectural Exploration Results
17. Architectural Features Explored
- Hardware vs software multiplication
- Shifter implementation
- Pipelining
- Depth
- Organization
- Forwarding
18. Validation of SPREE Through Comparison to Altera's Nios II
- Nios II has three variations
- Nios II/e: unpipelined, no HW multiplier
- Nios II/s: 5-stage, with HW multiplier
- Nios II/f: 6-stage, dynamic branch prediction
- Caveats: not a completely fair comparison
- Very similar but tweaked ISA
- Nios II supports exceptions, OS, and caches
- SPREE processors do not, and so save on the hardware costs
19. SPREE vs Nios II
[Figure: performance-area plot with axis annotations "faster" and "smaller"; highlighted design point: 3-stage pipe, HW multiply, multiply-based shifter]
20. Architectural Features Explored
- Hardware vs software multiplication
- Shifter implementation
- Pipelining
- Depth
- Organization
- Forwarding
21. Hardware vs Software Multiplication
- Hardware multiply is fast but not always needed
- Wastes area (220 LEs) and can waste energy
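To illustrate what software multiplication costs in place of a hardware multiplier, here is a generic shift-and-add routine modelled in Python. It is a textbook technique shown as an assumption; the slides do not say which software multiply routine the benchmarks actually used.

```python
MASK32 = 0xFFFFFFFF


def software_multiply(a, b):
    """32-bit shift-and-add multiply.

    Returns (low 32 bits of a*b, loop iterations). Each iteration stands in
    for several instructions on a processor without a hardware multiplier.
    """
    a &= MASK32
    b &= MASK32
    result = 0
    iterations = 0
    while b:
        if b & 1:
            result = (result + a) & MASK32
        a = (a << 1) & MASK32
        b >>= 1
        iterations += 1
    return result, iterations


print(software_multiply(7, 6))
```

The loop runs once per significant bit of the multiplier, so a program that rarely multiplies pays little for omitting the hardware unit, while a multiply-heavy one pays on every call.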
22. Shifter Implementation
- Shifters are expensive in FPGAs
- We explore three implementations
- Serial shifter (shift register)
- Multiplier-based barrel shifter (hard multiplier)
- LUT-based barrel shifter (multiplexer tree)
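The three shifter styles compute the same result in different ways, which the behavioural models below try to show for 32-bit left shifts. These are software models written for illustration, not the hand-coded RTL components; right shifts and arithmetic shifts are left out for brevity.

```python
MASK32 = 0xFFFFFFFF


def serial_shift_left(x, amount):
    """Shift register model: one bit position per cycle.

    Returns (result, cycles), so the latency grows with the shift amount.
    """
    cycles = 0
    for _ in range(amount):
        x = (x << 1) & MASK32
        cycles += 1
    return x, cycles


def multiplier_shift_left(x, amount):
    """Multiplier-based model: a left shift by n is a multiply by 2**n."""
    return (x * (1 << amount)) & MASK32


def barrel_shift_left(x, amount):
    """Mux-tree model: level i shifts by 2**i when bit i of amount is set."""
    for level in range(5):          # 5 levels cover shift amounts 0..31
        if (amount >> level) & 1:
            x = (x << (1 << level)) & MASK32
    return x


value = 0x80000001
print(hex(multiplier_shift_left(value, 4)), hex(barrel_shift_left(value, 4)))
print(serial_shift_left(value, 4))
```

The multiplier-based version reuses a hard multiplier that may already be present for multiply instructions, which is why it can cost little extra logic.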
23. Performance-Area of Different Shifter Implementations
[Figure: performance-area plot of the shifter implementations with axis annotations "faster" and "smaller"]
24. Pipeline Depth
- Explored between 2 and 7 stages
- 1-stage and 6-stage pipelines were not interesting
- 2-stage: F/D/R/EX/M | WB
- 3-stage: F/D | R/EX/M | WB
- 4-stage: F | D | R/EX/M | WB
- 5-stage: F | D | R/EX | EX/M | WB
- 7-stage (new): F | D | R | EX | EX | EX/M | WB
25. Pipeline Depth and Performance
26. Pipeline Organization Tradeoff
- 4-stage (A): F | D | R/EX/M | WB
- 4-stage (B): F/D | R/EX | EX/M | WB
27. Pipeline Forwarding
- Pipeline: F | D/R | EX | M | WB
- Prevent stalls when data hazards occur
- MIPS has two source operands (rs and rt)
- Four forwarding configurations are possible
- No forwarding
- Forward rs
- Forward rt
- Forward both rs and rt
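The four configurations can be compared with a toy stall counter. The model below is a deliberate simplification and an assumption: it only looks at back-to-back dependences and charges a fixed `penalty` per hazard, and it is not the cycle-accurate behaviour of the generated processors. The instruction tuples and register names are hypothetical.

```python
def count_stalls(program, forward_rs, forward_rt, penalty=1):
    """Count stall cycles caused by back-to-back data hazards.

    program: list of (dest, rs, rt) register names; dest may be None.
    A hazard on an operand stalls only if that operand is not forwarded.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        dest = previous[0]
        _, rs, rt = current
        if dest is None:
            continue
        hazard_rs = rs == dest and not forward_rs
        hazard_rt = rt == dest and not forward_rt
        if hazard_rs or hazard_rt:
            stalls += penalty
    return stalls


# Hypothetical three-instruction sequence with one rs and one rt dependence.
program = [
    ("r1", "r2", "r3"),
    ("r4", "r1", "r5"),   # rs depends on the previous instruction
    ("r6", "r7", "r4"),   # rt depends on the previous instruction
]

for fwd_rs, fwd_rt in [(False, False), (True, False), (False, True), (True, True)]:
    print(fwd_rs, fwd_rt, count_stalls(program, fwd_rs, fwd_rt))
```

How much each forwarding line helps depends on which operand the benchmarks' dependences tend to fall on, which this toy sequence splits evenly on purpose.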
28. Pipeline Forwarding
29. Summary of Presented Architectural Conclusions
- Hardware multiplication can be wasteful
- Multiplier-based shifter is a sweet spot
- 3-stage pipelines are attractive
- Tradeoffs exist within pipeline organization
- Forwarding
- Improves performance by 20%
- Favours the rs operand
30. Future Work
- Explore other exciting architectural axes
- Branch prediction, aggressive forwarding
- ISA changes
- VLIW datapaths
- Caches and memory hierarchy
- Compiler optimizations
- Port to other devices
- Explore aggressive customization
- Add exceptions and OS support