Title: CPRE 583 Reconfigurable Computing Lecture 3: Wed 9/2/2009 (Reconfigurable Computing Architectures, VHDL)
Slide 1: CPRE 583 Reconfigurable Computing, Lecture 3, Wed 9/2/2009
(Reconfigurable Computing Architectures, VHDL Overview 3)
Instructor: Dr. Phillip Jones (phjones_at_iastate.edu)
Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA
http://class.ece.iastate.edu/cpre583/
Slide 2: Overview
- Reinforce some common questions
- Finish Chapter 1 Lecture
- Continue Chapter 2
- VHDL review
Slide 3: Common Questions
- How does an FPGA work?
- How does VHDL execute on an FPGA?
- How many LUTs are on the class's FPGA? 44,000
- State machines will be covered more next lecture
- Final project group selection: choose your own groups
- Class machine resources
  - Coover 2048, 1212; Coover 2041: ML507 (will be 2)
  - Distance students: xilinx.ece.iastate.edu (other servers on the way)
Slide 4: What you should learn
- Basic trade-offs associated with different aspects of a Reconfigurable Architecture (Chapter 2)
- Practice with timing diagrams; start state machines
Slide 5: Reconfigurable Architectures
- Main idea Chapter 2's author wants to convey:
  - Applications often have one or more small, computationally intense regions of code (kernels)
  - Can these kernels be sped up using dedicated hardware?
  - Different kernels have different needs. How do a kernel's requirements guide design decisions when implementing a Reconfigurable Architecture?
Slide 6: Reconfigurable Architectures
- Forces that drive a Reconfigurable Architecture:
  - Price
    - Mass production: 100K to millions of units
    - Experimental: 1 to 10s of units
  - Granularity of reconfiguration
    - Fine grain
    - Coarse grain
  - Degree of system integration/coupling
    - Tightly coupled
    - Loosely coupled
- All are a function of the application that will run on the Architecture
Slide 7: Example Points in (Price, Granularity, Coupling) Space
[Figure: a design space with three axes -- Price ($100s to $1Ms), Granularity (Fine to Coarse), and Coupling (Tight to Loose) -- showing example points: an Intel/AMD processor (Decode/Exec/Store pipeline with Int, float, and RFU units; tightly coupled) and a PC connected over Ethernet to an ML507 board (loosely coupled).]
Slide 8: What's the point of a Reconfigurable Architecture?
- Performance metrics
  - Computational
    - Throughput
    - Latency
  - Power
    - Total power dissipation
    - Thermal
  - Reliability
    - Recovery from faults
- Increase application performance!
Slide 9: Typical Approach for Increasing Performance
- Application/algorithm implemented in software
  - Often easier to write an application in software
- Profile the application (e.g. with gprof)
  - Determine where the application is spending its time
- Identify kernels of interest
  - e.g. the application spends 90% of its time in function matrix_multiply()
- Design custom hardware/instructions to accelerate kernel(s)
  - Analyze the kernel to determine how to extract fine/coarse grain parallelism (does any parallelism even exist?)
- Amdahl's Law!
Slide 10: Amdahl's Law Example
- Application My_app
  - Running time: 100 seconds
  - Spends 90 seconds in matrix_mul()
- What is the maximum possible speedup of My_app if I place matrix_mul() in hardware?
  - 10 seconds remain in software: at most 10x faster
- What if the original My_app spends 99 seconds in matrix_mul()?
  - 1 second remains in software: at most 100x faster

A good recent FPGA paper that illustrates increasing an algorithm's performance with hardware:
"NOVEL FPGA BASED HAAR CLASSIFIER FACE DETECTION ALGORITHM ACCELERATION", FPL 2008
http://class.ece.iastate.edu/cpre583/papers/Shih-Lien_Lu_FPL2008.pdf
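The two answers above follow directly from Amdahl's Law. A minimal sketch (the function name is ours; the numbers mirror the slide's My_app example):

```python
def max_speedup(total_s, kernel_s):
    """Amdahl's Law upper bound: even if the kernel takes zero time
    in hardware, the remaining software time still has to run."""
    remaining = total_s - kernel_s
    return total_s / remaining

# My_app spends 90 of its 100 seconds in matrix_mul()
print(max_speedup(100, 90))   # 10.0  -> at most 10x faster
# If matrix_mul() instead takes 99 of the 100 seconds
print(max_speedup(100, 99))   # 100.0 -> at most 100x faster
```

Note how the last 1% of software time caps the speedup at 100x no matter how fast the hardware kernel is.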
Slide 11: Reconfigurable Architectures
- RPF -> VIC (short slide)
Slide 12: Granularity
Slides 13-15: Granularity: Coarse Grain
- rDPA: reconfigurable Data Path Array
- Function Units with programmable interconnects
[Figure, built up over three slides: an example 3x3 array of ALUs joined by programmable interconnect.]
Slides 16-18: Granularity: Fine Grain
- FPGA: Field Programmable Gate Array
- Sea of general-purpose logic gates
[Figure, built up over three slides: a grid of CLBs (Configurable Logic Blocks).]
Slides 19-27: Granularity: Trade-offs
Trade-offs associated with LUT size. Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits).
[Figure, built up over these slides: a 1024-bit configuration budget implemented either as many 2-LUTs or as a single 10-LUT.
- Slides 20-23: a microprocessor-style function with 10 input bits (op[4], A[3], B[3]) fits naturally into one 10-LUT, while implementing the same function out of 2-LUTs spreads it across a large area.
- Slides 24-27: bit-level logic with constants, e.g. (A and 1100) or (B or 1000), maps compactly onto a few 2-LUTs (small AND/OR gates) but wastes the area that a 10-LUT occupies. "It's much worse: each 10-LUT only has one output."]
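The core of the trade-off is that a k-input LUT stores 2^k configuration bits but produces a single output bit. A small sketch of the slide's numbers (function name is ours):

```python
def lut_bits(k):
    """A k-input LUT is a 2^k-entry truth table, one bit per entry."""
    return 2 ** k

print(lut_bits(2))    # 4    bits in a 2-LUT
print(lut_bits(10))   # 1024 bits in a 10-LUT

# One 10-LUT holds as many configuration bits as 256 2-LUTs, yet
# still yields only one output bit -- a good fit for a wide 10-input
# function, wasteful for 1-bit constant logic.
print(lut_bits(10) // lut_bits(2))  # 256
```

This is why fine-grain fabrics favor bit-level logic and coarse-grain fabrics favor wide word-level operations.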
Slide 28: Granularity: Example Architectures
- Fine grain: GARP
- Coarse grain: PipeRench
Slide 29: Granularity: GARP
[Figure: the Garp chip contains a CPU (with I-cache and D-cache), an RFU with its config cache, and a connection to external Memory.]
Slides 30-31: Granularity: GARP
[Figure: the Garp chip as in slide 29, expanding the RFU into N rows of PEs (Processing Elements), each row holding 16 2-bit execution PEs and 1 control PE.]
- Example computations in one cycle: A<<10; (bc); (A-2bc)
Slide 32: Granularity: GARP
- Impact of configuration size
  - 1 GHz bus frequency
  - 128-bit memory bus
  - 512 Kbits of configuration data
- On an RFU context switch, how long does it take to load a new full configuration?
  - About 4 microseconds
- An estimate of the time for the CPU to perform a context switch is 5 microseconds
- Roughly a 2x increase in context-switch latency!!
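The 4-microsecond figure above can be checked with back-of-the-envelope arithmetic (assuming one full-width transfer per bus cycle with no stalls, which is an idealization):

```python
# Reload-time estimate for the slide's numbers.
config_bits = 512 * 1024      # 512 Kbits of configuration
bus_width   = 128             # bits moved per bus cycle
bus_hz      = 1e9             # 1 GHz bus frequency

cycles  = config_bits / bus_width     # 4096 transfers
load_us = cycles / bus_hz * 1e6       # ~4.1 microseconds
print(cycles, load_us)
```

Against the slide's ~5 microsecond CPU context switch, adding a full RFU reload roughly doubles the total switch latency.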
Slide 33: Granularity: GARP
[Figure: repeat of the slide 30-31 Garp chip / RFU PE-array diagram.]
- "The Garp Architecture and C Compiler": http://www.cs.cmu.edu/tcal/IEEE-Computer-Garp.pdf
Slide 34: Granularity: PipeRench
- Coarse granularity
- Higher (higher) level programming
- Reference papers
  - "PipeRench: A Coprocessor for Streaming Multimedia Acceleration" (ISCA 1999): http://www.cs.cmu.edu/mihaib/research/isca99.pdf
  - "PipeRench: Implementation of the Instruction Path Coprocessor" (Micro 2000): http://class.ee.iastate.edu/cpre583/papers/piperench_Micro_2000.pdf
Slide 35: Granularity: PipeRench
[Figure: PipeRench stripe structure -- rows of PEs separated by interconnect layers, with a global bus alongside.]
Slides 36-49: Granularity: PipeRench
[Figure, animated across these slides: a physical array of 3 stripes (4 PEs each) executing a 5-stage virtual pipeline (stages 0-4) over cycles 1-6. Each cycle one stripe is reconfigured to the next virtual stage, so the virtual stages rotate through the physical stripes; a second timeline (physical pipeline stages 0-2) shows the same schedule from the hardware's point of view.]
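The animation in slides 36-49 can be approximated in a few lines. Each cycle, one physical stripe is reconfigured to the next virtual stage, letting 5 virtual stages time-share 3 physical stripes (the numbers are the slides'; the round-robin policy here is a simplified sketch, not PipeRench's exact controller):

```python
def schedule(virtual_stages, physical_stripes, cycles):
    """For each cycle, report which virtual stage occupies each
    physical stripe (None = stripe not yet configured). Stripes are
    reconfigured round-robin, one per cycle, to the next virtual
    stage in sequence (wrapping modulo virtual_stages)."""
    occupant = [None] * physical_stripes
    next_stage = 0
    timeline = []
    for cycle in range(1, cycles + 1):
        stripe = (cycle - 1) % physical_stripes
        occupant[stripe] = next_stage % virtual_stages
        next_stage += 1
        timeline.append((cycle, list(occupant)))
    return timeline

# 5 virtual stages on 3 physical stripes over 6 cycles,
# as in the slides' animation
for cycle, stripes in schedule(5, 3, 6):
    print(cycle, stripes)
```

By cycle 4 the array is full and stage 0's stripe is recycled for stage 3; by cycle 6, stage 0 reappears, beginning the next pass of the virtual pipeline.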
Slide 50: Degree of Integration/Coupling
- Independent Reconfigurable Coprocessor
  - Reconfigurable fabric does not have direct communication with the CPU
- Processor + Reconfigurable Processing Fabric
  - Loosely coupled, on the same chip
  - Tightly coupled, on the same chip
Slides 51-57: Degree of Integration/Coupling
[Figure, varied across these slides: a baseline system diagram -- a CPU (Fetch, Decode, Execute, Memory, Write Back stages; ALU and FPU; L1 cache), L2 cache, memory controller, main memory, DMA controller, and an I/O controller with USB, PCI, PCI-Express, and SATA ports connecting a hard drive and NIC. Successive slides place the reconfigurable fabric at different attachment points: an RPF on the I/O bus (loosely coupled), an RPF inside the CPU datapath, an RPF with its own Config I/F off the memory/I/O subsystem, and finally an RFU beside the ALU and FPU in the CPU pipeline (tightly coupled).]
Slides 58-60: MP2
[Figure, repeated across three slides: an FPGA containing a PowerPC with a User Defined Instruction, driving a VGA monitor, and connected over Ethernet (UDP/IP) to a PC running Display.c.]
Slide 61: MP2 Notes
- MUCH less VHDL coding than MP1
  - But you will be writing most of the VHDL from scratch
- The focus will be more on learning to read a specification (the PowerPC coprocessor interface protocol), and designing hardware that follows that protocol
- You will be dealing with some pointer-intensive C code. It's a small amount of C code, but somewhat challenging to get the pointer math right
Slide 62: Lecture 3 notes / slides in progress
Slide 63: Granularity: PipeRench
- Scheduling virtual stages onto physical stages
- Partial/dynamic reconfiguration (each cycle)
Slide 64: Granularity: GARP
- Impact of configuration size on performance
  - Context switching
- Garp feature
  - Dynamically reconfigurable
    - Stores multiple configurations in an on-chip cache (4)
    - One configuration active at a time
- Example app mapping to GARP (loop)
- Amdahl's Law
- "The Garp Architecture and C Compiler"
  - http://www.cs.cmu.edu/tcal/IEEE-Computer-Garp.pdf
Slide 65: Overview
- Dimensions
  - Price
  - Granularity
  - Coupling
- To optimize app performance (compute (throughput, latency), power, reliability)
  - RPF to efficiently implement VICs
- Main picture the author wants to convey
  - What's the point of having a reconfigurable architecture?
  - Example (increase app performance)
    - App -> SW/CPU
    - Profile
    - ID kernels of intense compute
    - Design custom hardware/instructions (Amdahl's Law)
- Intel FPL paper, a great example; for reading by Friday