CPRE 583 Reconfigurable Computing, Lecture 3: Wed 9/2/2009, Reconfigurable Computing Architectures, VHDL

1
CPRE 583 Reconfigurable Computing
Lecture 3: Wed 9/2/2009
(Reconfigurable Computing Architectures, VHDL Overview 3)
Instructor: Dr. Phillip Jones (phjones@iastate.edu)
Reconfigurable Computing Laboratory
Iowa State University, Ames, Iowa, USA
http://class.ece.iastate.edu/cpre583/
2
Overview
  • Reinforce some common questions
  • Finish Chapter 1 Lecture
  • Continue Chapter 2
  • VHDL review

3
Common Questions
  • How does an FPGA work?
  • How does VHDL execute on an FPGA?
  • How many LUTs are on the class's FPGA? 44,000
  • State machines will be covered more next lecture
  • Final project group selection: choose your own groups
  • Class machine resources
  • Coover 2048, 1212; Coover 2041: ML507 (will be 2)
  • Distance students: xilinx.ece.iastate.edu (other servers on the way)

4
What you should learn
  • Basic trade-offs associated with different
    aspects of a Reconfigurable Architecture.
    (Chapter 2)
  • Practice with timing diagrams; start state machines

5
Reconfigurable Architectures
  • Main idea Chapter 2's author wants to convey:
  • Applications often have one or more small, computationally intense regions of code (kernels)
  • Can these kernels be sped up using dedicated hardware?
  • Different kernels have different needs. How do a kernel's requirements guide design decisions when implementing a Reconfigurable Architecture?

6
Reconfigurable Architectures
  • Forces that drive a Reconfigurable Architecture:
  • Price
  • Mass production: 100K to millions of units
  • Experimental: 1 to 10s of units
  • Granularity of reconfiguration
  • Fine grain
  • Coarse grain
  • Degree of system integration/coupling
  • Tightly coupled
  • Loosely coupled

All are a function of the application that will run on the architecture.
7
Example Points in (Price, Granularity, Coupling) Space

Figure: example points in the (price, granularity, coupling) space: an Intel/AMD processor with an RFU integrated into its pipeline (fetch/decode/execute stages), tightly coupled and priced on the order of $1Ms, versus a PC connected over Ethernet to an ML507 FPGA board, loosely coupled and priced on the order of $100s.
8
What's the point of a Reconfigurable Architecture?
  • Performance metrics
  • Computational
  • Throughput
  • Latency
  • Power
  • Total power dissipation
  • Thermal
  • Reliability
  • Recovery from faults

Increase application performance!
9
Typical Approach for Increasing Performance
  • Application/algorithm implemented in software
  • Often easier to write an application in software
  • Profile the application (e.g. gprof); see the sketch after this slide
  • Determine where the application is spending its time
  • Identify kernels of interest
  • e.g. the application spends 90% of its time in function matrix_multiply()
  • Design custom hardware/instructions to accelerate the kernel(s)
  • Analyze the kernel to determine how to extract fine/coarse grain parallelism (does any parallelism even exist?)

Amdahl's Law!
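As a concrete illustration of this flow (a hypothetical toy program, not course code; the matrix size and function names are invented), the kernel below dominates the runtime, and the standard GNU profiling flow of compiling with gcc -pg and inspecting gmon.out with gprof would point straight at it:

/* toy_app.c: a made-up application whose profile is dominated by one kernel.
 * Build with profiling instrumentation:  gcc -pg -O2 toy_app.c -o toy_app
 * Run ./toy_app, then view the flat profile:  gprof ./toy_app gmon.out
 */
#include <stdlib.h>
#include <stdio.h>

#define N 512

static double A[N][N], B[N][N], C[N][N];

/* The compute kernel: O(N^3) work, the obvious acceleration candidate. */
static void matrix_multiply(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Setup: cheap compared to the kernel. */
static void init(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)rand() / RAND_MAX;
            B[i][j] = (double)rand() / RAND_MAX;
        }
}

int main(void)
{
    init();
    matrix_multiply();
    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}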
10
Amdahl's Law Example
  • Application: My_app
  • Running time: 100 seconds
  • Spends 90 seconds in matrix_mul()
  • What is the maximum possible speedup of My_app if I place matrix_mul() in hardware?
  • What if the original My_app spends 99 seconds in matrix_mul()?

10 seconds remain: 10x faster
1 second remains: 100x faster
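A minimal sketch of the arithmetic behind these answers (in C, assuming the hardware version makes the kernel time negligible, so the non-kernel time bounds the speedup):

/* amdahl.c: best-case overall speedup when only the kernel is accelerated.
 * 100 s total with 90 s in the kernel: the remaining 10 s bound the result,
 * 100/10 = 10x.  With 99 s in the kernel: 100/1 = 100x.
 */
#include <stdio.h>

static double max_speedup(double total_s, double kernel_s)
{
    return total_s / (total_s - kernel_s);  /* kernel time -> ~0 in hardware */
}

int main(void)
{
    printf("90 s kernel: %.0fx\n", max_speedup(100.0, 90.0));  /* 10x  */
    printf("99 s kernel: %.0fx\n", max_speedup(100.0, 99.0));  /* 100x */
    return 0;
}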
A good recent FPGA paper that illustrates increasing an algorithm's performance with hardware:
NOVEL FPGA BASED HAAR CLASSIFIER FACE DETECTION ALGORITHM ACCELERATION, FPL 2008
http://class.ece.iastate.edu/cpre583/papers/Shih-Lien_Lu_FPL2008.pdf
11
Reconfigurable Architectures
  • RPF -> VIC (short slide)

12
Granularity
13
Granularity: Coarse Grain
  • rDPA: reconfigurable Data Path Array
  • Function units with programmable interconnects

Figure (slides 13-15): an example 3x3 array of ALUs connected by programmable interconnect.
16
Granularity: Fine Grain
  • FPGA: Field Programmable Gate Array
  • Sea of general purpose logic gates

Figure (slides 16-18): a sea of CLBs (configurable logic blocks) tiled across the chip.
19
Granularity: Trade-offs (slides 19-27)
Trade-offs associated with LUT size. Example: 2-LUT (4 = 2x2 bits of configuration) vs. 10-LUT (1024 = 32x32 bits of configuration).

Figure: a small function unit with a 4-bit opcode and two 3-bit operands A and B, built either from a handful of 2-LUTs or from 10-LUTs acting as tiny microprocessor-like units, each carrying 1024 configuration bits.

Figure: bit-level logic with constants, e.g. (A and 1100) or (B or 1000). This maps naturally onto a few 2-LUTs. A 10-LUT covers the area that the 2-LUTs required, but it is actually much worse: each 10-LUT has only one output, so most of its 1024 configuration bits buy nothing here (see the sketch below).
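A rough sketch of the same point in C (illustrative only; the helper names are made up): a k-input LUT is just a 2^k-entry truth table, so a 2-LUT costs 4 configuration bits while a 10-LUT costs 1024, yet either one produces a single output bit. Evaluating (A and 1100) or (B or 1000) bit by bit takes a dozen 2-LUTs (48 configuration bits total), while one-output 10-LUTs would burn 1024 bits per output bit.

/* lut_cost.c: model a k-input LUT as a truth table of 2^k one-bit entries. */
#include <stdio.h>

/* Configuration bits needed for one k-input, 1-output LUT. */
static unsigned lut_config_bits(unsigned k) { return 1u << k; }

/* A 2-LUT is a 4-entry truth table indexed by its two input bits. */
static unsigned eval_2lut(const unsigned char table[4], unsigned a, unsigned b)
{
    return table[(a << 1) | b] & 1u;
}

int main(void)
{
    /* Truth tables for 2-input AND and OR. */
    const unsigned char AND2[4] = {0, 0, 0, 1};
    const unsigned char OR2[4]  = {0, 1, 1, 1};

    /* Compute (A and 1100) or (B or 1000) one bit at a time with 2-LUTs.
     * A and B are arbitrary 4-bit example values. */
    unsigned A = 0x9 /* 1001 */, B = 0x3 /* 0011 */;
    unsigned K1 = 0xC /* 1100 */, K2 = 0x8 /* 1000 */;
    unsigned result = 0;
    for (int bit = 3; bit >= 0; bit--) {
        unsigned t1 = eval_2lut(AND2, (A >> bit) & 1, (K1 >> bit) & 1);
        unsigned t2 = eval_2lut(OR2,  (B >> bit) & 1, (K2 >> bit) & 1);
        result = (result << 1) | eval_2lut(OR2, t1, t2);
    }
    printf("result = %X\n", result);               /* 4 output bits */
    printf("2-LUT:  %u config bits each; 12 LUTs used -> %u bits total\n",
           lut_config_bits(2), 12 * lut_config_bits(2));
    printf("10-LUT: %u config bits each, but only 1 output bit per LUT\n",
           lut_config_bits(10));
    return 0;
}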
28
Granularity: Example Architectures
  • Fine grain: GARP
  • Coarse grain: PipeRench

29
Granularity: GARP

Figure (slides 29-31): the Garp chip: a CPU with I-cache and D-cache alongside an RFU with its own configuration cache, sharing one memory interface. The RFU is an array of rows of PEs (processing elements), each row holding 16 2-bit execution PEs plus 1 control PE.

Example computations in one cycle: A<<10, (bc), (A-2bc)
32
Granularity: GARP
  • Impact of configuration size
  • 1 GHz bus frequency
  • 128-bit memory bus
  • 512 Kbits of configuration

On an RFU context switch, how long does it take to load a new full configuration?
About 4 microseconds (512 Kbits / 128 bits per bus transfer = 4096 transfers at 1 GHz; the arithmetic is sketched below).
An estimate of the time for the CPU to perform a context switch is 5 microseconds,
so the RFU reload is roughly a 2x increase in context-switch latency!
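The reload time quoted above is just the slide's numbers divided out; a trivial runnable version of the arithmetic:

/* config_reload.c: time to stream a full RFU configuration over the bus. */
#include <stdio.h>

int main(void)
{
    const double bus_hz      = 1e9;             /* 1 GHz bus                  */
    const double bus_bits    = 128.0;           /* 128-bit memory bus         */
    const double config_bits = 512.0 * 1024.0;  /* 512 Kbits of configuration */

    double cycles  = config_bits / bus_bits;    /* 4096 transfers             */
    double seconds = cycles / bus_hz;           /* ~4.1e-6 s                  */
    printf("%.0f bus cycles -> %.2f microseconds\n", cycles, seconds * 1e6);
    return 0;
}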
33
Granularity: GARP

Figure: Garp chip block diagram (as on the previous slides).

The Garp Architecture and C Compiler: http://www.cs.cmu.edu/~tcal/IEEE-Computer-Garp.pdf
34
Granularity: PipeRench
  • Coarse granularity
  • Higher-level programming
  • Reference papers:
  • PipeRench: A Coprocessor for Streaming Multimedia Acceleration (ISCA 1999)
    http://www.cs.cmu.edu/~mihaib/research/isca99.pdf
  • PipeRench Implementation of the Instruction Path Coprocessor (Micro 2000)
    http://class.ee.iastate.edu/cpre583/papers/piperench_Micro_2000.pdf

35
Granularity: PipeRench

Figure (slide 35): a PipeRench stripe: rows of PEs linked by local interconnect and a global bus.

Figures (slides 36-49, animation): scheduling virtual pipeline stages onto the physical fabric. The first sequence shows a 5-stage virtual pipeline (stages 0-4) being configured and filled one stage per cycle over cycles 1-6; the second shows the same computation time-multiplexed onto only 3 physical stages, with one stage reconfigured each cycle in round-robin order (see the sketch below).
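A toy model (not PipeRench's real configuration mechanism; stage counts chosen to match the figure) of what the animation conveys: a 5-stage virtual pipeline time-multiplexed onto 3 physical stripes, with exactly one stripe reconfigured per cycle in round-robin order:

/* virtual_pipe.c: time-multiplex V virtual pipeline stages onto P physical
 * stripes, reconfiguring one stripe per cycle (illustrative model only). */
#include <stdio.h>

#define V 5   /* virtual pipeline stages (the application's depth) */
#define P 3   /* physical stripes available on the fabric          */

int main(void)
{
    int stripe[P];
    for (int s = 0; s < P; s++) stripe[s] = -1;   /* -1 = not yet configured */

    for (int cycle = 0; cycle < 10; cycle++) {
        /* Each cycle, (re)configure exactly one stripe with the next
         * virtual stage, cycling through virtual stages in order. */
        int phys = cycle % P;
        int virt = cycle % V;
        stripe[phys] = virt;

        printf("cycle %2d: ", cycle + 1);
        for (int s = 0; s < P; s++) {
            if (stripe[s] < 0) printf("[  -  ]");
            else               printf("[ v%-2d ]", stripe[s]);
        }
        printf("\n");
    }
    return 0;
}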
50
Degree of Integration/Coupling
  • Independent reconfigurable coprocessor
  • The reconfigurable fabric does not have direct communication with the CPU
  • Processor + Reconfigurable Processing Fabric
  • Loosely coupled, on the same chip
  • Tightly coupled, on the same chip

51
Degree of Integration/Coupling

Figures (slides 51-57): the same system block diagram (CPU pipeline with ALU and FPU, L1/L2 caches, memory controller, main memory, DMA controller, I/O controller with USB, PCI, PCI-Express, and SATA, hard drive, NIC), redrawn with the reconfigurable fabric attached at different coupling points: as an independent RPF hanging off the I/O controller, as an RPF reached through a configuration interface, and finally as an RFU inside the CPU datapath next to the ALU and FPU (a toy cost model of why this matters follows below).
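A toy latency model (all numbers invented for illustration, not measurements) of why coupling matters: every invocation of the fabric pays a fixed overhead, so a loosely coupled attachment only wins when each call carries a large amount of work:

/* coupling.c: toy model of invocation overhead vs. work size. */
#include <stdio.h>

/* Per-invocation overhead in cycles (illustrative guesses, not measured). */
#define OVERHEAD_RFU      2.0   /* tightly coupled, in the pipeline */
#define OVERHEAD_BUS   5000.0   /* loosely coupled, over an I/O bus */

static double total_cycles(double overhead, double work_cycles, int calls)
{
    return calls * (overhead + work_cycles);
}

int main(void)
{
    /* Small kernel called many times vs. one big batched call. */
    printf("1e6 calls x 10 cycles each:\n");
    printf("  RFU: %.2e cycles   I/O bus: %.2e cycles\n",
           total_cycles(OVERHEAD_RFU, 10.0, 1000000),
           total_cycles(OVERHEAD_BUS, 10.0, 1000000));

    printf("1 call x 1e7 cycles of work:\n");
    printf("  RFU: %.2e cycles   I/O bus: %.2e cycles\n",
           total_cycles(OVERHEAD_RFU, 1e7, 1),
           total_cycles(OVERHEAD_BUS, 1e7, 1));
    return 0;
}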
58
MP2

Figure (slides 58-60): the MP2 setup: an FPGA whose PowerPC executes a user-defined instruction, connected over Ethernet (UDP/IP) to a PC running Display.c, and a VGA monitor.
61
MP2 Notes
  • MUCH less VHDL coding than MP1
  • But you will be writing most of the VHDL from scratch
  • The focus will be more on learning to read a specification (the PowerPC coprocessor interface protocol) and designing hardware that follows that protocol.
  • You will be dealing with some pointer-intensive C code. It's a small amount of C code, but somewhat challenging to get the pointer math right (an illustrative example follows below).
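Not MP2 code, just a reminder (with invented buffer dimensions) of the kind of pointer arithmetic involved: indexing a flattened 2-D buffer through a pointer, where mixing up the row stride and the column offset is the classic mistake:

/* pointer_math.c: walking a flattened WxH buffer with pointer arithmetic. */
#include <stdio.h>
#include <stdlib.h>

#define W 8
#define H 4

int main(void)
{
    unsigned char *buf = malloc(W * H);
    if (!buf) return 1;

    /* buf + y*W + x addresses element (x, y); y*W is the row stride. */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            *(buf + y * W + x) = (unsigned char)(x + y);

    /* Read one element back the same way. */
    unsigned char *p = buf + 2 * W + 5;    /* element (5, 2) */
    printf("element(5,2) = %d\n", *p);     /* prints 7       */

    free(buf);
    return 0;
}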

62
Lecture 3 notes / slides in progress
63
Granularity: PipeRench
  • Scheduling virtual stages onto physical stages
  • Partial/dynamic reconfiguration (each cycle)

64
Granularity: GARP
  • Impact of configuration size on performance
  • Context switching
  • Garp features:
  • Dynamically reconfigurable
  • Stores multiple configurations in an on-chip cache (4)
  • One configuration active at a time
  • Example app mapping to GARP (loop)
  • Amdahl's Law
  • The Garp Architecture and C Compiler
  • http://www.cs.cmu.edu/~tcal/IEEE-Computer-Garp.pdf

65
Overview
  • Dimensions
  • Price
  • Granularity
  • Coupling
  • To optimize application performance (compute (throughput, latency), power, reliability)
  • RPF to efficiently implement VICs
  • Main picture the author wants to convey
  • What's the point of having a reconfigurable architecture?
  • Example (increase application performance)
  • App -> SW/CPU
  • Profile
  • Identify kernels of intense computation
  • Design custom hardware/instructions (Amdahl's law)
  • Intel FPL paper, a great example; read by Friday