Title: CPRE 583 Reconfigurable Computing Lecture 3: Wed 9/2/2009 (Reconfigurable Computing Architectures, VHDL)
Slide 1: CPRE 583 Reconfigurable Computing, Lecture 3, Wed 9/2/2009
(Reconfigurable Computing Architectures, VHDL Overview 3)
Instructor: Dr. Phillip Jones (phjones_at_iastate.edu)
Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA
http://class.ece.iastate.edu/cpre583/
Slide 2: Overview
- Reinforce some common questions
- Finish Chapter 1 Lecture
- Continue Chapter 2
- VHDL review
Slide 3: Common Questions
- How does an FPGA work?
- How does VHDL execute on an FPGA?
- How many LUTs are on the class's FPGA? 44,000
- State machines will be covered more next lecture
- Final project group selection: choose your own groups
- Class machine resources
  - Coover 2048, 1212; Coover 2041: ML507 (will be 2)
  - Distance students: xilinx.ece.iastate.edu (other servers on the way)
Slide 4: What you should learn
- Basic trade-offs associated with different aspects of a Reconfigurable Architecture (Chapter 2)
- Practice with timing diagrams; start state machines
Slide 5: Reconfigurable Architectures
- Main idea Chapter 2's author wants to convey:
  - Applications often have one or more small, computationally intense regions of code (kernels)
  - Can these kernels be sped up using dedicated hardware?
  - Different kernels have different needs. How do a kernel's requirements guide design decisions when implementing a Reconfigurable Architecture?
Slide 6: Reconfigurable Architectures
- Forces that drive a Reconfigurable Architecture:
  - Price
    - Mass production: 100K to millions of units
    - Experimental: 1 to 10s of units
  - Granularity of reconfiguration
    - Fine grain
    - Coarse grain
  - Degree of system integration/coupling
    - Tightly coupled
    - Loosely coupled
- All are a function of the application that will run on the Architecture
Slide 7: Example Points in (Price, Granularity, Coupling) Space
[Figure: a design space with three axes -- Price ($100s to $1Ms), Granularity (Fine to Coarse), and Coupling (Tight to Loose) -- showing example points: an Intel/AMD processor (Decode/Exec/Store pipeline with Int, float, and RFU units; tightly coupled) and a PC connected over Ethernet to an ML507 board (loosely coupled).]
Slide 8: What's the point of a Reconfigurable Architecture?
- Performance metrics
  - Computational
    - Throughput
    - Latency
  - Power
    - Total power dissipation
    - Thermal
  - Reliability
    - Recovery from faults
- Increase application performance!
Slide 9: Typical Approach for Increasing Performance
- Application/algorithm implemented in software
  - Often easier to write an application in software
- Profile the application (e.g. with gprof)
  - Determine where the application is spending its time
- Identify kernels of interest
  - e.g. the application spends 90% of its time in function matrix_multiply()
- Design custom hardware/instructions to accelerate kernel(s)
  - Analyze the kernel to determine how to extract fine/coarse grain parallelism (does any parallelism even exist?)
- Amdahl's Law!
Slide 10: Amdahl's Law Example
- Application My_app
  - Running time: 100 seconds
  - Spends 90 seconds in matrix_mul()
- What is the maximum possible speedup of My_app if I place matrix_mul() in hardware?
  - 10 seconds remain in software: at most 10x faster
- What if the original My_app spends 99 seconds in matrix_mul()?
  - 1 second remains in software: at most 100x faster

A good recent FPGA paper that illustrates increasing an algorithm's performance with hardware:
"NOVEL FPGA BASED HAAR CLASSIFIER FACE DETECTION ALGORITHM ACCELERATION", FPL 2008
http://class.ece.iastate.edu/cpre583/papers/Shih-Lien_Lu_FPL2008.pdf
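The two answers above follow directly from Amdahl's Law. A minimal sketch (the function name is ours; the numbers mirror the slide's My_app example):

```python
def max_speedup(total_s, kernel_s):
    """Amdahl's Law upper bound: even if the kernel takes zero time
    in hardware, the remaining software time still has to run."""
    remaining = total_s - kernel_s
    return total_s / remaining

# My_app spends 90 of its 100 seconds in matrix_mul()
print(max_speedup(100, 90))   # 10.0  -> at most 10x faster
# If matrix_mul() instead takes 99 of the 100 seconds
print(max_speedup(100, 99))   # 100.0 -> at most 100x faster
```

Note how the last 1% of software time caps the speedup at 100x no matter how fast the hardware kernel is.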
Slide 11: Reconfigurable Architectures
- RPF -> VIC (short slide)
Slide 12: Granularity
Slides 13-15: Granularity: Coarse Grain
- rDPA: reconfigurable Data Path Array
- Function Units with programmable interconnects
[Figure, built up over three slides: an example 3x3 array of ALUs joined by programmable interconnect.]
Slides 16-18: Granularity: Fine Grain
- FPGA: Field Programmable Gate Array
- Sea of general-purpose logic gates
[Figure, built up over three slides: a grid of CLBs (Configurable Logic Blocks).]
Slides 19-27: Granularity: Trade-offs
Trade-offs associated with LUT size. Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits).
[Figure, built up over these slides: a 1024-bit configuration budget implemented either as many 2-LUTs or as a single 10-LUT.
- Slides 20-23: a microprocessor-style function with 10 input bits (op[4], A[3], B[3]) fits naturally into one 10-LUT, while implementing the same function out of 2-LUTs spreads it across a large area.
- Slides 24-27: bit-level logic with constants, e.g. (A and 1100) or (B or 1000), maps compactly onto a few 2-LUTs (small AND/OR gates) but wastes the area that a 10-LUT occupies. "It's much worse: each 10-LUT only has one output."]
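The core of the trade-off is that a k-input LUT stores 2^k configuration bits but produces a single output bit. A small sketch of the slide's numbers (function name is ours):

```python
def lut_bits(k):
    """A k-input LUT is a 2^k-entry truth table, one bit per entry."""
    return 2 ** k

print(lut_bits(2))    # 4    bits in a 2-LUT
print(lut_bits(10))   # 1024 bits in a 10-LUT

# One 10-LUT holds as many configuration bits as 256 2-LUTs, yet
# still yields only one output bit -- a good fit for a wide 10-input
# function, wasteful for 1-bit constant logic.
print(lut_bits(10) // lut_bits(2))  # 256
```

This is why fine-grain fabrics favor bit-level logic and coarse-grain fabrics favor wide word-level operations.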
Slide 28: Granularity: Example Architectures
- Fine grain: GARP
- Coarse grain: PipeRench
Slide 29: Granularity: GARP
[Figure: the Garp chip contains a CPU (with I-cache and D-cache), an RFU with its config cache, and a connection to external Memory.]
Slides 30-31: Granularity: GARP
[Figure: the Garp chip as in slide 29, expanding the RFU into N rows of PEs (Processing Elements), each row holding 16 2-bit execution PEs and 1 control PE.]
- Example computations in one cycle: A<<10; (bc); (A-2bc)
Slide 32: Granularity: GARP
- Impact of configuration size
  - 1 GHz bus frequency
  - 128-bit memory bus
  - 512 Kbits of configuration data
- On an RFU context switch, how long does it take to load a new full configuration?
  - About 4 microseconds
- An estimate of the time for the CPU to perform a context switch is 5 microseconds
- Roughly a 2x increase in context-switch latency!!
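The 4-microsecond figure above can be checked with back-of-the-envelope arithmetic (assuming one full-width transfer per bus cycle with no stalls, which is an idealization):

```python
# Reload-time estimate for the slide's numbers.
config_bits = 512 * 1024      # 512 Kbits of configuration
bus_width   = 128             # bits moved per bus cycle
bus_hz      = 1e9             # 1 GHz bus frequency

cycles  = config_bits / bus_width     # 4096 transfers
load_us = cycles / bus_hz * 1e6       # ~4.1 microseconds
print(cycles, load_us)
```

Against the slide's ~5 microsecond CPU context switch, adding a full RFU reload roughly doubles the total switch latency.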
Slide 33: Granularity: GARP
[Figure: repeat of the slide 30-31 Garp chip / RFU PE-array diagram.]
- "The Garp Architecture and C Compiler": http://www.cs.cmu.edu/tcal/IEEE-Computer-Garp.pdf
Slide 34: Granularity: PipeRench
- Coarse granularity
- Higher (higher) level programming
- Reference papers
  - "PipeRench: A Coprocessor for Streaming Multimedia Acceleration" (ISCA 1999): http://www.cs.cmu.edu/mihaib/research/isca99.pdf
  - "PipeRench: Implementation of the Instruction Path Coprocessor" (Micro 2000): http://class.ee.iastate.edu/cpre583/papers/piperench_Micro_2000.pdf
Slide 35: Granularity: PipeRench
[Figure: PipeRench stripe structure -- rows of PEs separated by interconnect layers, with a global bus alongside.]
Slides 36-49: Granularity: PipeRench
[Figure, animated across these slides: a physical array of 3 stripes (4 PEs each) executing a 5-stage virtual pipeline (stages 0-4) over cycles 1-6. Each cycle one stripe is reconfigured to the next virtual stage, so the virtual stages rotate through the physical stripes; a second timeline (physical pipeline stages 0-2) shows the same schedule from the hardware's point of view.]
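The animation in slides 36-49 can be approximated in a few lines. Each cycle, one physical stripe is reconfigured to the next virtual stage, letting 5 virtual stages time-share 3 physical stripes (the numbers are the slides'; the round-robin policy here is a simplified sketch, not PipeRench's exact controller):

```python
def schedule(virtual_stages, physical_stripes, cycles):
    """For each cycle, report which virtual stage occupies each
    physical stripe (None = stripe not yet configured). Stripes are
    reconfigured round-robin, one per cycle, to the next virtual
    stage in sequence (wrapping modulo virtual_stages)."""
    occupant = [None] * physical_stripes
    next_stage = 0
    timeline = []
    for cycle in range(1, cycles + 1):
        stripe = (cycle - 1) % physical_stripes
        occupant[stripe] = next_stage % virtual_stages
        next_stage += 1
        timeline.append((cycle, list(occupant)))
    return timeline

# 5 virtual stages on 3 physical stripes over 6 cycles,
# as in the slides' animation
for cycle, stripes in schedule(5, 3, 6):
    print(cycle, stripes)
```

By cycle 4 the array is full and stage 0's stripe is recycled for stage 3; by cycle 6, stage 0 reappears, beginning the next pass of the virtual pipeline.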
Slide 50: Degree of Integration/Coupling
- Independent Reconfigurable Coprocessor
  - Reconfigurable fabric does not have direct communication with the CPU
- Processor + Reconfigurable Processing Fabric
  - Loosely coupled, on the same chip
  - Tightly coupled, on the same chip
Slides 51-57: Degree of Integration/Coupling
[Figure, varied across these slides: a baseline system diagram -- a CPU (Fetch, Decode, Execute, Memory, Write Back stages; ALU and FPU; L1 cache), L2 cache, memory controller, main memory, DMA controller, and an I/O controller with USB, PCI, PCI-Express, and SATA ports connecting a hard drive and NIC. Successive slides place the reconfigurable fabric at different attachment points: an RPF on the I/O bus (loosely coupled), an RPF inside the CPU datapath, an RPF with its own Config I/F off the memory/I/O subsystem, and finally an RFU beside the ALU and FPU in the CPU pipeline (tightly coupled).]
Slides 58-60: MP2
[Figure, repeated across three slides: an FPGA containing a PowerPC with a User Defined Instruction, driving a VGA monitor, and connected over Ethernet (UDP/IP) to a PC running Display.c.]
Slide 61: MP2 Notes
- MUCH less VHDL coding than MP1
  - But you will be writing most of the VHDL from scratch
- The focus will be more on learning to read a specification (the PowerPC coprocessor interface protocol), and designing hardware that follows that protocol
- You will be dealing with some pointer-intensive C code. It's a small amount of C code, but somewhat challenging to get the pointer math right
Slide 62: Lecture 3 notes / slides in progress
Slide 63: Granularity: PipeRench
- Scheduling virtual stages onto physical stages
- Partial/dynamic reconfiguration (each cycle)
Slide 64: Granularity: GARP
- Impact of configuration size on performance
  - Context switching
- Garp feature
  - Dynamically reconfigurable
    - Stores multiple configurations in an on-chip cache (4)
    - One configuration active at a time
- Example app mapping to GARP (loop)
- Amdahl's Law
- "The Garp Architecture and C Compiler"
  - http://www.cs.cmu.edu/tcal/IEEE-Computer-Garp.pdf
Slide 65: Overview
- Dimensions
  - Price
  - Granularity
  - Coupling
- To optimize app performance (compute (throughput, latency), power, reliability)
  - RPF to efficiently implement VICs
- Main picture the author wants to convey
  - What's the point of having a reconfigurable architecture?
  - Example (increase app performance)
    - App -> SW/CPU
    - Profile
    - ID kernels of intense compute
    - Design custom hardware/instructions (Amdahl's Law)
- Intel FPL paper, a great example; for reading by Friday