Title: Reconfigurable Computing Systems: An Overview
1 Reconfigurable Computing Systems: An Overview
- Presented by
- Gurwant Kaur Koonar
- Vijay Pandya
- 14th March 2003
2 Introduction
- Reconfigurable Computing (RC) is an emerging paradigm for digital systems design. Its key feature is the ability to perform computations in hardware, aiming for the performance of ASICs with the flexibility of general-purpose (GP) processors.
- Technology improvements have made possible new programmable logic devices (FPGAs, CPLDs).
- Objective of the talk: give an overview of reconfigurable computing, its hardware architectures, and the software that targets these machines, such as compilation tools.
3 Definition
- Reconfigurable Computing (RC) is a computing paradigm in which algorithms are implemented as a temporally and spatially ordered set of very complex tasks. These tasks are executed on a large set of interconnected programmable hardware elements.
4 Definition (contd)
- Computing paradigm: defines the basic RC computing model without reference to implementation.
- Very complex tasks: commonly referred to as configurations. RC tasks require more time than general-purpose computing instructions and more area than the typical general-purpose execution unit.
- Spatial and temporal partitioning: algorithms are decomposed into tasks in both the space and time domains.
- Hardware elements: at their core, RC devices consist of a very large set of simple programmable elements, collectively called the Reconfigurable Execution Unit (REU).
5 General Characteristics of RC
- Stored configuration algorithms
- No software
- Pipeline architectures are common
- Real-time applications
- Advantages
- Flexible
- Configurable
- Cost comparable to GPP
- Hardware is readily available
- Shorter development cycle than ASICs
- Parallelism
- Algorithm parallelism exploited in custom architecture
- Problem-specific operators and control
- High performance
- Reduced memory dependence and exploitation of fine-grained algorithm parallelism
- Timesharing
- Hardware can be time-multiplexed by multiple applications
6 Disadvantages
- Additional area requirements
- Configuration memory (internal/external), internal switches, and other hardware overhead
- Time overhead
- Device configuration and internal switches
7 Traditional Computing
- Using Application-Specific Integrated Circuits (ASICs) to hard-wire an algorithm in hardware.
- Extremely fast
- Requires less silicon area
- Less power hungry than GP architectures
- Extremely inflexible
- Expensive both in design and fabrication
- Errors are difficult to correct
- Examples: consumer electronics, telecommunications, automotive industry
8 Traditional Computing (cont'd)
- General-purpose hardware, combined with application-specific software
- Extremely flexible due to a versatile instruction set.
- Much less expensive to develop.
- Poor performance compared to ASICs.
- Errors can be dynamically patched.
- Examples: commodity PC hardware running commercial software.
9 Reasons for Poor Software Performance
- Fetching of instructions
- Interpretation of instructions
- Scheduling of instructions
- Wrong mix of hardware resources to suit a particular application's needs
- Therefore, reconfigurable computing is intended to fill the gap between HW and SW.
10 Flexibility and Efficiency Tradeoffs
11 Can we call FPGAs Reconfigurable Processing Units?
- Traditional FPGAs are configurable, but not run-time reconfigurable.
- Traditional FPGAs expect to read their configuration out of a serial EEPROM, one bit at a time.
- Therefore, the FPGA must be reprogrammed in its entirety, and its previous internal state cannot be captured beforehand.
12 Features for Reconfigurable Hardware
- On-the-Fly Reprogrammability
- Partial Reprogrammability
- Externally-Visible Internal State
13 Kress ALU Array-III (KrAA-III)
- instruction-level parallelism
- transparently scalable
- fast routing and placement (seconds only)
- dynamically and partially reconfigurable (microseconds)
- suitable for full custom design
- on microprocessor chip: much higher acceleration than by caches
- on microprocessor chip: fast and low power by full custom design
- acceleration by massive run-time to compile-time migration
14 Kress ALU Array-III (KrAA-III)
- KrAA-III consists of PEs called rDPU-III (reconfigurable DataPath Unit III) arranged in a NEWS network.
- Figure shows the KrAA-III chip containing 9 rDPUs.
15 Basic Architecture of today's commercial reconfigurable processor
16 Devices which combine an FPGA with a standard processor core
- Triscend's E5 and A7
- Altera's two Excalibur families
- Atmel's FPSLIC
- Chameleon Systems' CS2000
17 Zippy Architecture
- It is used to develop reconfigurable processor technology for the domain of handheld and wearable computing.
- To investigate new trade-offs between performance, power consumption and system cost
- It is an international research effort led by the Swiss Federal Institute of Technology
18 Reconfigurable Computing: Merging Efficiency and Versatility
19 Hardware Design steps
20 Examples: SPLASH II, a multi-FPGA parallel computer with orchestrated systolic communications to perform inter-FPGA data transfer
21 Garp: for general-purpose loop acceleration
22 CMC Rapid Prototyping Platform
23 RC Applications
- RC has demonstrated a >10x performance density advantage over microprocessors and DSPs
- Pattern matching
- Data encryption
- Data compression
- Video and image processing
- Commercial Push
- Handheld devices: PDAs, mobile phones, specialized tools
- Networks: telecom switches, network routers, network bridges
- High-performance computing: supercomputers, medical appliances, robot navigation and planning
- Defense: ballistic missiles, KV navigation, spacecraft processing
24 RC Implementations
- Hardware
- Catalina Research Incorporated - http://www.catalinaresearch.com/Chameleon
- Annapolis Microsystems - http://www.annapmicro.com/Wildstar
- Alpha Data Parallel Systems - http://www.alpha-data.com
- Tools
- Celoxica - http://www.celoxica.com
- Star Bridge Systems - http://www.starbridgesystems.com
- Annapolis Microsystems - http://www.annapmicro.com/CoreFire
25 Content
- Coupling Approaches (Reconfigurable Hardware with General Processor)
- Granularity of the FPGA as an RCS
- Implementation Approaches
- Compile Time Reconfiguration
- Run Time Reconfiguration
- Some more advantages
- Challenges
- Software-like design environment
26 Coupling Approaches for Reconfigurable Hardware (RH)
- RH can be coupled to a GP as:
- A functional unit (tight coupling)
- A co-processor
- An attached processing unit
- A standalone processing unit (loosely coupled)
27 Coupling Approaches (contd)
- As a Functional Unit
- Within a host processor (General purpose GP)
- Uses data-path of a host machine
- As a Coprocessor
- Without constant supervision of the GP
- GP initializes the RH
- Independent parallel computation
- Less communication overhead
28 Coupling Approaches (contd)
- As an attached processing unit
- Behaves as an additional processor
- Memory cache not visible
- Independent computation, but high communication overhead
- As a standalone unit
- The most loosely coupled to the GP
- Infrequent communication with the GP
- Independent computation for very long periods of time
29 Different levels of coupling
[Figure: block diagram of a workstation with the labels CPU, FU, Coprocessor, Memory Caches, I/O Interface, Attached Processing Unit, and Standalone Processing Unit]
30 Pros and Cons of different coupling approaches
- Tight integration
- Very little communication overhead
- RH cannot operate alone for long periods of time
- Amount of reconfigurable logic is limited
- Loose integration
- Greater parallelism
- Higher communication overhead
31 Logic Block Granularity
- Refers to the size and complexity of the CLB
- Fine-grained logic block
- Less complex: the Altera FLEX 10K logic block consists of a single 4-input LUT with a flip-flop
- Useful for bit-level manipulation
- Can exceed the performance of a GP processor for operations on variable bit-width data
- Smaller area, high amount of computation (compact)
- Encryption and image processing applications
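To make the fine-grained logic block concrete, here is a minimal Python sketch of a 4-input LUT followed by a flip-flop. The class name `LogicElement`, its methods, and the two example configurations are illustrative assumptions; this is not a model of any particular vendor's cell.

```python
class LogicElement:
    """Hypothetical model of a fine-grained logic block: 4-input LUT + flip-flop."""

    def __init__(self, truth_table):
        # The "configuration" is 16 bits: one output bit per input combination.
        if len(truth_table) != 16 or any(b not in (0, 1) for b in truth_table):
            raise ValueError("a 4-input LUT needs exactly 16 configuration bits")
        self.truth_table = list(truth_table)
        self.q = 0  # registered output, updated only on a clock edge

    def lut(self, a, b, c, d):
        # The four input bits form an address into the configuration bits.
        index = (a << 3) | (b << 2) | (c << 1) | d
        return self.truth_table[index]

    def clock(self, a, b, c, d):
        # On a clock edge the flip-flop captures the combinational LUT output.
        self.q = self.lut(a, b, c, d)
        return self.q


def parity_config():
    # Configuration bits implementing a 4-input XOR (odd parity).
    return [bin(i).count("1") % 2 for i in range(16)]


def and_config():
    # Configuration bits implementing a 4-input AND: only address 15 outputs 1.
    return [1 if i == 15 else 0 for i in range(16)]


if __name__ == "__main__":
    le = LogicElement(parity_config())
    print(le.lut(1, 0, 1, 1))        # parity of three set bits
    le.truth_table = and_config()    # "reconfigure": same element, new function
    print(le.lut(1, 1, 1, 1))
```

Swapping the 16 configuration bits changes the function without changing the element, which is the sense in which such a block is configurable.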
32 Logic Block Granularity (contd)
- Coarse-grained logic block
- Larger granularity of the CLB
- Helps perform more complex operations
- Four 2-bit inputs (GARP) and a multiplier in each logic block for 4 x 4 multiplication
- Finite State Machine
- Word-width (16-bit) data path circuits implemented in very coarse-grained structures
- Logic block closer to a small processor
33 Implementation Approaches
- Compile Time Reconfiguration (CTR)
- Static implementation strategy
- Single system-wide configuration
- Configuration doesn't change during computation
- Similar to using an ASIC for application acceleration
- Run Time Reconfiguration (RTR)
- Dynamic implementation strategy
- Multiple time-exclusive configurations
- Dynamic hardware allocation (run-time)
34 RTR
- Main task: dividing the algorithm into time-exclusive segments
- Global RTR
- Allocates whole FPGA resources for each configuration
- Single system-wide configuration for each phase
- Local RTR
- Locally reconfigures subsets of logic at run-time
- Partial reconfiguration, flexibility
- Functional division of labor
35 RTR (contd)
[Figure: Global RTR shown as the sequence EXE. A, LOAD B, EXE. B, LOAD C, EXE. C, LOAD A; Local RTR shown with blocks labelled A, B, C, D and EXE.]
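The difference between the two schemes can be sketched with a toy timing model in Python. The function names, the `(load_time, exec_time)` task tuples, and the numbers are all hypothetical. The local model also makes two idealised assumptions: the device has room for two segments at once, and loading one region never disturbs the region that is executing.

```python
def global_rtr_time(tasks):
    """Global RTR: the whole device is reloaded before each segment runs,
    so every load and every execution happens strictly one after another."""
    return sum(load + exe for load, exe in tasks)


def local_rtr_time(tasks):
    """Idealised local RTR: while one segment executes, the next segment is
    loaded into a different region, so load and execution overlap."""
    if not tasks:
        return 0
    first_load, prev_exec = tasks[0]
    total = first_load                    # nothing can run before the first load
    for load, exe in tasks[1:]:
        # The previous segment runs while the next one loads; the slower of
        # the two determines when the next segment can start.
        total += max(prev_exec, load)
        prev_exec = exe
    return total + prev_exec              # the last segment still has to execute


if __name__ == "__main__":
    # Hypothetical segments A, B, C: 4 time units to load, 10 to execute.
    segments = [(4, 10), (4, 10), (4, 10)]
    print("global RTR:", global_rtr_time(segments))
    print("local RTR: ", local_rtr_time(segments))
```

In this model local RTR hides every load except the first whenever execution takes longer than loading. When loading dominates, the saving shrinks.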
36 Implementation Issues
- Temporal partitioning is an iterative process
- Possibly inefficient usage of FPGA resources in global RTR
- Simulation
- Efficient usage of hardware in local RTR
- Current CAD tools are a poor match for local RTR
- (Examples of local RTR: RRANN-2 and DISC)
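As a rough illustration of temporal partitioning and of why global RTR can waste area, the following Python sketch greedily packs tasks, in order, into successive whole-device configurations. The greedy strategy, the function names, the task names, and the area figures are assumptions made for illustration. They are not taken from RRANN-2, DISC, or any CAD tool.

```python
def partition_in_order(task_areas, capacity):
    """Greedy temporal partitioning: pack (name, area) tasks, in the given
    order, into successive configurations that each fit within `capacity`."""
    partitions, current, used = [], [], 0
    for name, area in task_areas:
        if area > capacity:
            raise ValueError(f"task {name!r} does not fit on the device")
        if used + area > capacity:
            # The device is full: close this configuration and start a new one.
            partitions.append(current)
            current, used = [], 0
        current.append(name)
        used += area
    if current:
        partitions.append(current)
    return partitions


def utilisation(task_areas, partitions, capacity):
    """Fraction of the device that each configuration actually occupies."""
    area_of = dict(task_areas)
    return [sum(area_of[t] for t in part) / capacity for part in partitions]


if __name__ == "__main__":
    # Hypothetical pipeline stages with areas in arbitrary logic-block units.
    tasks = [("fir", 40), ("fft", 70), ("quant", 20), ("huffman", 50)]
    parts = partition_in_order(tasks, capacity=100)
    print(parts)
    print(utilisation(tasks, parts, capacity=100))
```

A real flow would iterate on such a partition, for example by resizing or reordering tasks, because a single greedy pass can leave configurations that use only a fraction of the device.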
37 Power Savings in RC system
- Exploitation of numerical properties of an application
- Higher number of operations per clock due to deep pipelines
- Sensor/actuator pre-conditioning and glue logic functions on chip
38 Some Challenges
- Access to the development of RCS is restricted to hardware developers
- Run-time environment, RTR scheduling
- Difficulties in routing for RC hardware with a large number of CLBs
- Connection scheme in multi-FPGA systems
39 Software Aspect
- Software-like design environment
- SystemC (Synopsys), Handel-C (Celoxica)
- Hardware-software co-design (ARM Rapid Prototyping Platform (RPP))
- Generation of a detailed gate-level description (netlist) from an HLL (high-level language)
- Technology mapping, placement and routing
- Generation of .bit files (language of the FPGA)
40 Software Aspect (contd)
- Programming language/HDL
- An SoC consists of 50% to 90% software
- Wide acceptability of C/C++
- Simulation timing
- Simulation takes a long time in current CAD tools
- C/C++ debuggers are very efficient
41 RC1000 Celoxica platform
- DK1 design suite (Handel-C)
- RC1000 plug-in card, PCI bus interfacing
- Xilinx Virtex-1000 FPGA (1 million gates)
- Design Flow
[Figure: design flow diagram with the boxes Handel C Source Files, Generate VHDL/Verilog, Simulate netlist, Compile, Generate EDIF (netlist), Place & Route Tools, and Generation BitStream]
42 Hardware-Software Co-design
- Amdahl's Law
- $T = \dfrac{1}{(1 - a) + a/s}$
- $T$: overall speedup
- $a$: fraction of the original program that could be enhanced by transferring to h/w
- $s$: speedup obtained for that particular fraction of the program
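A short Python sketch shows how the formula behaves when part of a program is moved to hardware. The function name `amdahl_speedup` and the example values of a and s are illustrative assumptions, not measurements.

```python
def amdahl_speedup(a, s):
    """Overall speedup T = 1 / ((1 - a) + a / s).

    a: fraction of the original run time moved to hardware (0 <= a <= 1)
    s: speedup achieved on that fraction (s > 0)
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - a) + a / s)


if __name__ == "__main__":
    # Hypothetical case: 90% of the run time is accelerated 10x in hardware.
    print(round(amdahl_speedup(0.9, 10), 3))
    # Even with an enormous hardware speedup, T cannot exceed 1 / (1 - a).
    print(round(amdahl_speedup(0.9, 1e9), 3))
```

The second call illustrates the practical point for co-design: the part left in software bounds the overall gain, so choosing which fraction to move matters as much as how fast the hardware runs it.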
43 Summary
- RCS aims to bridge the gap between software and hardware (flexibility and performance)
- FPGA is an ideal candidate for an RH
- Spatial Execution
- Reprogrammability
- Design time
- Design and synthesis flow for CAD tools
- Hybrid Architecture
- Recent advancement in CAD tools
44 Questions?