Title: ECE 697F Reconfigurable Computing Lecture 24 Course Wrapup
Slide 1: ECE 697F Reconfigurable Computing, Lecture 24: Course Wrap-up
Slide 2: What is Reconfigurable Computing?
- Computation using hardware that can adapt at the logic level to solve specific problems.
- Why is this interesting?
- Some applications are poorly suited to microprocessors.
- VLSI explosion provides increasing resources.
- Hardware/Software
- Relatively new research area.
- Acknowledgement: Wolf text
Slide 3: Design abstractions
Slide 4: Processor + FPGA
Three possibilities:
1. FPGA serves as coprocessor for data-intensive applications (possible project).
[Figure labels: daughtercard, Proc, FPGA chip, backplane bus (e.g. PCI)]
2. FPGA serves as embedded computer for low-latency transfer.
[Figure labels: FPGA chip, Proc]
3. Reconfigurable functional unit.
Slide 5: Xilinx XC4000 Cell
- Two 4-input look-up tables
- One 3-input look-up table
- Two D flip-flops
Slide 6: Xilinx XC4000 Routing
Slide 7: Actel Programmable Gate Arrays
[Figure: rows of programmable logic building blocks and rows of interconnect, with I/O buffers, programming and test logic around the periphery; labels: logic module, wiring tracks]
- Anti-fuse technology: program once.
- Use anti-fuses to build up long wiring runs from short segments.
- 8-input, single-output combinational logic blocks.
- FFs constructed from discrete cross-coupled gates.
Slide 8: Altera Max 7000 Macrocell
Slide 9: Example DPGA Prototype
Slide 10: FPGA vs. DPGA Compare
Slide 11: Min-cut bisecting partitioning
[Figure: nodes A, B, C and D divided between partition 1 and partition 2]
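A minimal sketch of min-cut bisection on the four nodes named in the figure. The edge list is hypothetical, since the figure's connectivity did not survive, and exhaustive search is used only because the example is tiny; practical partitioners rely on heuristics such as Kernighan-Lin or Fiduccia-Mattheyses.

```python
from itertools import combinations

def cut_size(edges, part1):
    """Count edges with exactly one endpoint in part1."""
    return sum((u in part1) != (v in part1) for u, v in edges)

def min_cut_bisection(nodes, edges):
    """Try every balanced split and return (cut, partition 1, partition 2) with the smallest cut."""
    half = len(nodes) // 2
    best = None
    for combo in combinations(nodes, half):
        part1 = set(combo)
        size = cut_size(edges, part1)
        if best is None or size < best[0]:
            best = (size, part1, set(nodes) - part1)
    return best

# Hypothetical netlist: A-B (two nets) and C-D are tightly coupled, one net joins B and C.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "B"), ("C", "D"), ("B", "C")]
size, p1, p2 = min_cut_bisection(nodes, edges)
print(size, sorted(p1), sorted(p2))
```

With this assumed netlist, the only net crossing the cut is the one between B and C.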
Slide 12: Hill Climbing Algorithms
- To avoid getting trapped in local minima, consider a hill-climbing approach.
- Need to accept worse solutions, or make bad moves, to reach the global minimum.
- Acceptance is probabilistic: only accept cost-increasing moves some of the time.
[Figure: cost plotted over the solution space]
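A minimal sketch of probabilistic acceptance. The slide does not name an acceptance rule, so the Metropolis criterion from simulated annealing, exp(-delta/T), is assumed here; the cost function, neighbor move and cooling schedule are all hypothetical.

```python
import math
import random

def accept(delta_cost, temperature, rng=random):
    """Always accept improving moves; accept worsening moves with probability exp(-delta/T)."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)

def anneal(cost, neighbor, start, temperature=10.0, cooling=0.99, steps=2000, seed=0):
    """Random walk that sometimes climbs uphill; returns the best solution seen."""
    rng = random.Random(seed)
    current = best = start
    for _ in range(steps):
        candidate = neighbor(current, rng)
        if accept(cost(candidate) - cost(current), temperature, rng):
            current = candidate
            if cost(current) < cost(best):
                best = current
        # Cool down so that bad moves become rarer over time.
        temperature = max(temperature * cooling, 1e-3)
    return best

# Hypothetical 1-D cost: local minimum at x = 2 (cost 5), global minimum at x = 8 (cost 0).
def cost(x):
    return min((x - 2) ** 2 + 5, (x - 8) ** 2)

def neighbor(x, rng):
    return x + rng.choice([-1, 1])

result = anneal(cost, neighbor, start=0)
print(result, cost(result))
```

A purely greedy search started at x = 0 would stop at x = 2; the occasional uphill move is what gives the search a chance to cross the barrier toward x = 8.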
Slide 13: Routing Tradeoffs
- Bias router to find first, best route.
- Vary number of node expansions using a:
- pcost_i = (1 - a) x pcost_{i-1} + ncost_i + a x dist_i
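A small sketch of how the weight a trades path cost against distance to the target. It reads the slide's cost expression as pcost_i = (1 - a) x pcost_{i-1} + ncost_i + a x dist_i; the operators were lost in extraction, so their placement is an assumption, and the candidate nodes and their numbers are hypothetical.

```python
def path_cost(prev_pcost, ncost, dist, a):
    """Cost of expanding node i (operator placement is assumed, see lead-in)."""
    return (1 - a) * prev_pcost + ncost + a * dist

def next_expansion(prev_pcost, candidates, a):
    """Return the name of the candidate node the router would expand first."""
    return min(candidates,
               key=lambda name: path_cost(prev_pcost, candidates[name]["ncost"],
                                          candidates[name]["dist"], a))

# Hypothetical frontier: one node is cheap to use but far from the target,
# the other is costly to use but close to the target.
candidates = {
    "cheap_but_far": {"ncost": 1, "dist": 6},
    "costly_but_near": {"ncost": 3, "dist": 1},
}

for a in (0.0, 0.5, 1.0):
    print(a, next_expansion(4, candidates, a))
```

With a = 0 the distance term vanishes and the search favors the cheapest node; as a grows, nodes closer to the target are expanded first, which reduces the number of node expansions at the risk of a costlier route.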
Slide 14: Architectural Limitation
- Routing architecture necessitates domain selection.
- Bigger effect for multi-fanout nets.
Slide 15: Two-dimensional Layout
- Control network supports distributed signals.
- Data routed as four-bit values.
Slide 16: RaPiD Datapath
- Segmented linear architecture
- All RAMs and ALUs are pipelined
- Bus connectors also contain registers
Slide 17: Basic Functional Unit
- Two inputs from adjacent blocks.
- Local memory for instructions, data.
Slide 18: Chess Interconnect
- More like an FPGA
- Takes advantage of near-neighbor connectivity
Slide 19: FPICs
- High internal connectivity
- Not always cost effective
Slide 20: Hierarchical Crossbar
- Full connectivity occurs at top level.
- Routing between FPGAs requires determining the level at which source and destination share an ancestor.
- Simplifies routing.
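A minimal sketch of the level lookup, assuming the FPGAs are numbered 0, 1, 2, ... along the bottom of the hierarchy and every crossbar has the same fanout. Both assumptions, and the function name, are illustrative rather than taken from the slide.

```python
def shared_ancestor_level(src, dst, fanout):
    """Lowest level at which FPGAs src and dst share an ancestor.

    Level 0 means src and dst are the same FPGA; level 1 is the first
    crossbar above the FPGAs, level 2 the crossbar above that, and so on.
    """
    level = 0
    while src != dst:
        # Move both endpoints up to their parent crossbar.
        src //= fanout
        dst //= fanout
        level += 1
    return level

# Hypothetical system: 16 FPGAs, 4 per first-level crossbar.
print(shared_ancestor_level(0, 3, fanout=4))   # same first-level crossbar
print(shared_ancestor_level(0, 4, fanout=4))   # must go one level higher
```

Routing is simple because the path is fully determined once the level is known: climb from the source to that level, then descend to the destination.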
Slide 21: Linear Array
- Current hardware
- Programs implemented as systolic array
- Input key
- Search each RAM bank for sequence
Slide 22: Emulation Software Steps
Many of these are dependent on device interconnect topology.
1. Netlist translation
2. Technology mapping
3. Partitioner: divide netlist into fixed-sized chunks
4. Global placer: locate an FPGA for a chunk
5. Global router: make connections between devices
6. FPGA-specific place and route (Xilinx place and route)
Output: FPGA bitstreams
Slide 23: Simulation Acceleration
- FPGA system takes the place of one portion of the simulated design.
- Inputs transported to FPGA system.
- Outputs returned from FPGA system.
Slide 24: Network Routing
- FPGAs popular in network hardware
- New protocols implemented directly in silicon
- Easy to upgrade in the field
- Washington University Gigabit Switch (WUGS)
- Switch provides up to 160 Gbps of bandwidth.
Slide 25: Pyramid Operations
- Gaussian Pyramid
- Down-sample image to compress image size for communication.
- Average over a set of points to create a new point.
- Laplacian Pyramid
- Determine error found from Gaussian Pyramid.
- Expand contracted picture and compare with original.
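A minimal sketch of one pyramid level. It assumes a plain 2x2 average for the down-sampling step (a true Gaussian pyramid weights the points with a Gaussian kernel), pixel replication for the expansion step, and an image with even dimensions; the sample image is hypothetical.

```python
def reduce_level(image):
    """Average each 2x2 block into one point (box filter assumed)."""
    rows, cols = len(image), len(image[0])
    return [[(image[r][c] + image[r][c + 1] +
              image[r + 1][c] + image[r + 1][c + 1]) / 4.0
             for c in range(0, cols, 2)]
            for r in range(0, rows, 2)]

def expand_level(image):
    """Expand by replicating each point into a 2x2 block."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def laplacian_level(image):
    """Error between the original and the expanded, down-sampled copy."""
    approx = expand_level(reduce_level(image))
    return [[o - a for o, a in zip(orig_row, approx_row)]
            for orig_row, approx_row in zip(image, approx)]

image = [[1, 3, 5, 7],
         [1, 3, 5, 7],
         [2, 2, 6, 6],
         [4, 4, 8, 8]]
small = reduce_level(image)      # one quarter of the points
error = laplacian_level(image)   # what must be added back to recover the original
print(small)
print(error)
```

Adding the error level to the expanded small image reproduces the original exactly, which is why the small image plus the error is enough to transmit.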
Slide 26: Gaussian Pyramid Implementation
- Systolic array in which each device performs a separate function.
- Limited by clock rate of slowest device.
Slide 27: Proposed Data Acquisition System
[Figure: block diagram of the data acquisition board. Labeled blocks:]
- Radar unit: analog H channel and V channel, each sampled by an AD6645 (105 MSPS); radar positioner data channel sampled by an AD974 (200 KSPS)
- FPGA1: Stratix EP1S40 (data processing), with SRAM data processing memory (1 x 512K x 36 and 3 x 512K x 36)
- FPGA2: Stratix EP1S40 (storage control), with hard disk interface: ATA66 IDE channel 0 and channel 1, 3.3 to 5 V buffers
- Gigabit Ethernet interface: Gigabit Ethernet core, 64K x 16 dual-port RAM, Gigabit Ethernet PHY, RJ45
- Radar control interface, 3.3 V buffer
- 10/100 Mbps Ethernet interface: Ethernet controller, Ethernet PHY, RJ45
- MAX 7000A PLD
- Atmel AT91RM9200 microcontroller (ARM RISC core, 209 MHz, 32-bit)
- Software flash 1 x 4M x 16 (configuration memory), boot flash 2 x 1M x 16 (program memory), SDRAM 2 x 8M x 16 (data memory)
- USB interface and USB block, JTAG port, RS232 driver, serial port
- Bus widths shown in the figure: 14, 16, 30, 32, 36, 62, 64
Slide 28: Detailed View of Dharma
Slide 29: Chimaera Architecture
- Live copy of register file values feeds into array.
- Each row of array may compute from registers or intermediates.
- Tag on array to indicate RFUOP.
Slide 30: Chimaera Architecture
- Array can operate on values as soon as they are placed in the register file.
- Logic is combinational.
- When RFUOP matches:
- Stall until result ready.
- Drive result from matching row.
Slide 31: Chimaera Results
- Three SPEC92 benchmarks
- Compress: 1.11 speedup
- Eqntott: 1.8
- Life: 2.06
- Small arrays with limited state
- Small speedup
- Perhaps focus on global router rather than local optimization.
Slide 32: Garp
- Integrate as coprocessor
- Similar bandwidth to processor as functional unit
- Own access to memory
- Support multi-cycle operation
- Allow state
- Cycle counter to track operation
- Configuration cache, path to memory
Slide 33: Garp Array
- Row-oriented logic
- Dedicated path for processor/memory
- Processor does not have to be involved in
array-memory path
Slide 34: System Model: Adaptive Viterbi Decoder
Slide 35: Compression Techniques
- Effectively we can consider an FPGA device as a collection of cells, each with an (x, y) location.
- Instead of using a serial bit stream, could consider loading data cell-by-cell like a standard memory.
- Specify location of cell through use of two registers.
[Figure label: Row]
Slide 36: Hardware Support for Runlength
- Initially latch in base
- Down counter indicates number of strides to take.
- Offset used to augment initial base
- Fairly simple to implement.
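A behavioral sketch of the address generator described above: a latched base, a down counter for the number of strides, and an offset added on each stride. Whether the base address itself is written, and whether the count includes it, is not stated on the slide, so both choices here are assumptions.

```python
def runlength_addresses(base, offset, count):
    """Return the cell addresses produced by one run-length command."""
    address = base           # initially latch in the base
    addresses = [address]    # assumption: the base cell itself is written
    counter = count          # down counter: strides still to take
    while counter > 0:
        address += offset    # offset augments the current address
        addresses.append(address)
        counter -= 1
    return addresses

# Hypothetical command: start at cell 16 and take 3 strides of 4 cells each.
print(runlength_addresses(base=16, offset=4, count=3))
```

One command replaces a list of explicit addresses, which is where the compression of the configuration stream comes from; in hardware this amounts to a register, an adder and a counter.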
Slide 37: Determining Communication Level
[Figure labels: send, receive, wait; application hardware (custom); register reads/writes; I/O driver; interrupt service; bus transactions; I/O bus; interrupts]
- Easier to program at application level (send, receive, wait) but difficult to predict.
- More difficult to specify at low level.
- Difficult to extract from program but timing and resources easier to predict.
Slide 38: Interface Models
- Synchronization through a FIFO.
- FIFO can be implemented either in hardware or in software.
- Effectively reconfigure hardware (FPGA) to allocate buffer space as needed.
- Interrupts used for software version of FIFO.
[Figure: FPGA control/data FIFO; labels: p1, p2, p3, r2, r3, d1, d2, d3]
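A minimal software model of FIFO synchronization between a sender and a receiver. The class and method names are hypothetical, and `resize` only stands in for the idea of reallocating buffer space by reconfiguring the FPGA; it is not a model of any particular device.

```python
from collections import deque

class BoundedFifo:
    """Fixed-capacity FIFO: a full queue stalls the sender, an empty one stalls the receiver."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def send(self, item):
        """Return False when the FIFO is full, meaning the sender must wait."""
        if len(self.items) >= self.capacity:
            return False
        self.items.append(item)
        return True

    def receive(self):
        """Return the oldest item, or None when the FIFO is empty and the receiver must wait."""
        return self.items.popleft() if self.items else None

    def resize(self, capacity):
        """Stand-in for allocating more (or less) buffer space; never drops queued items."""
        self.capacity = max(capacity, len(self.items))

fifo = BoundedFifo(capacity=2)
print(fifo.send("d1"), fifo.send("d2"), fifo.send("d3"))  # third send must wait
fifo.resize(3)                                            # allocate more buffer space
print(fifo.send("d3"), fifo.receive())
```

In a software implementation the False and None results would correspond to the points where an interrupt wakes the waiting side once space or data becomes available.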
Slide 39: Summary
- Reconfigurable computing relies heavily on new VLSI technology.
- Device architectures maturing.
- Application development progressing at rapid pace.
- Integration of hardware and software a difficult challenge.
- Active area of research at UMass.