Title: Concurrent Engineering
1Hardware/Software Codesign of Embedded Systems
Reconfigurable Computing
Voicu Groza, SITE Hall, Room 5017, 562 5800 ext. 2159, Groza_at_SITE.uOttawa.ca
2Outline
- Introduction
- Enabling Technologies
- Fix, configurable, reconfigurable ...
- Reconfigurable Architectures
- Run-Time-Reconfigurable System-on-Chip
- Conclusion and Future Work
- References
31. Introduction
- Reconfigurable computing Definition
- Why reconfigurable computing ?
4Reconfigurable Computing - Definition
- Reconfigurable Computing (RC): presence of hardware (HW) that can be reconfigured (reconfigware - RW)
- 1960: Gerald Estrin, The UCLA Fixed-Plus-Variable (F+V) Structure Computer
- DeHon and Wawrzynek: computing via a postfabrication and spatially programmed connection of processing elements.
- The architecture used in the computation is determined postfabrication and can therefore adapt to the characteristics of the executed algorithms.
- The computation is spatial, in contrast to the more temporal style associated with microprocessors.
5Re-inventing the wheel...
6Why reconfigurable computing ?
- Is your belt long enough?
- Embedded hand-held devices need to reduce
- the power consumption targets,
- the acceptable packaging and manufacturing costs,
- the time-to-market
- High-performance computing
- Today's computationally intensive applications require more processing power
- streaming video,
- image recognition and processing,
- highly interactive services
- telecommunications
- genes
- Cray revived its latest entry-level XD1
supercomputer by combining AMD Opteron processors
with FPGAs for compute acceleration in a Linux
environment.
7Why reconfigurable computing cont.
High-performance microprocessors
- PRO: versatile (SW); off-the-shelf solution
- CON: for some applications might not be fast enough; power consumption (>100 W/gigaFLOP); cost (k$)
Reconfigurable Computing Systems
- PRO: versatile (SW + HW); computing structure matches the application; a given fabric can implement numerous functional units; built out of off-the-shelf components, which reduces design time
- CON: wires are slow; big bit-slices are costly to interconnect -> large silicon area; performance overhead; devices must store the configuration on the chip
Application-Specific Integrated Circuits (ASIC)
- PRO: does not suffer from the serial (and often slow and power-hungry) instruction fetch, decode and execute cycle that is at the heart of all microprocessors; consumes less power
- CON: fixed structure; the cost of producing an ASIC (the masks cost 1 M$); the time to develop a custom integrated circuit
82. Enabling Technologies
- Programmable ICs: CPLD and FPGA (Xilinx, 1984)
- HW abstractions
- Fine-grained: reconfiguration is at the gate and register level.
- By reconfiguration of registers, gates, and their interconnections, the internal structure of functional units is changed.
- 2 major technologies
- Complex Programmable Logic Devices (CPLD): EEPROM based
- Field-Programmable Gate Arrays (FPGA): SRAM based
- Coarse-grained: reconfiguration is based on a set of fixed blocks, like functional units, processor cores, and memory tiles.
- The reconfiguration is merely the reprogramming of the interconnections between the fixed blocks.
9Complex Programmable Logic Devices (CPLD)
- Supplied with no predetermined logic function.
- Programmed by the user to implement any digital logic function.
- Requires specialized computer software for design and programming.
- Complex PLD (CPLD): a PLD that has several programmable sections with internal interconnections between the sections.
- The basic building block of a CPLD is a macrocell, which implements a logic function that is synthesized into sum-of-products equations, followed by a D-type register.
- Macrocells are grouped into logic blocks, which are connected via a centralized interconnect array.
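The macrocell described above can be sketched as a small behavioural model. This is only an illustration, not a model of any particular device: the class name `Macrocell`, the product-term encoding, and the XOR example are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

# A product term is a list of (signal_name, required_value) literals.
ProductTerm = List[Tuple[str, int]]


class Macrocell:
    """Toy macrocell: sum-of-products logic feeding a D-type register."""

    def __init__(self, product_terms: List[ProductTerm]) -> None:
        self.product_terms = product_terms
        self.q = 0  # registered output, cleared at start

    def combinational(self, inputs: Dict[str, int]) -> int:
        # OR together the AND of the literals of each product term.
        return int(any(
            all(inputs[name] == value for name, value in term)
            for term in self.product_terms
        ))

    def clock(self, inputs: Dict[str, int]) -> int:
        # On a clock edge the D register captures the sum-of-products value.
        self.q = self.combinational(inputs)
        return self.q


# XOR written as a sum of products: (a AND NOT b) OR (NOT a AND b).
xor_cell = Macrocell([[("a", 1), ("b", 0)], [("a", 0), ("b", 1)]])
```

Calling `xor_cell.combinational(...)` evaluates the logic without changing the stored value; only `xor_cell.clock(...)` updates the register, which mirrors the "logic followed by a D-type register" structure.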
10Altera MAX 7000 macrocell
11Field-Programmable Gate Array (FPGA)
- Reconfigurable functional units
- coarse grained - ALUs and storage
- fine-grained - small lookup tables
Interconnection network
Universal gates and/or storage elements
Switches
12Basic ingredient Look Up Table (LUT)
[Figure: a universal gate implemented as a look-up table memory; stored contents 0 0 0 1, address inputs a0, a1 (a2), output data]
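A look-up table can be modelled as a tiny memory addressed by the gate inputs. The sketch below is a minimal illustration; the class name `LookUpTable` and the convention that a0 is the least significant address bit are assumptions, not taken from the slide.

```python
class LookUpTable:
    """Toy LUT: the inputs form an address into a small 1-bit-wide memory."""

    def __init__(self, contents):
        n = len(contents)
        if n == 0 or n & (n - 1):
            raise ValueError("contents length must be a power of two")
        self.contents = list(contents)
        self.num_inputs = n.bit_length() - 1

    def read(self, *address_bits):
        # address_bits are given as (a0, a1, ...), a0 being least significant.
        if len(address_bits) != self.num_inputs:
            raise ValueError("wrong number of address bits")
        index = sum(bit << i for i, bit in enumerate(address_bits))
        return self.contents[index]


# Contents 0 0 0 1: only address (1, 1) returns 1, i.e. a 2-input AND.
and_gate = LookUpTable([0, 0, 0, 1])

# Reconfiguring means rewriting the memory: 0 1 1 0 gives a 2-input XOR.
xor_gate = LookUpTable([0, 1, 1, 0])
```

The same hardware structure implements either gate; only the stored bits differ, which is what makes the LUT a universal gate.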
13Configurable Logic Blocks (CLB - Xilinx) / Logic Array Block (LAB - Altera)
XILINX Spartan II CLB
HW abstractions
- 2 logic cells = 1 slice (Xilinx) or 1 Adaptive Logic Module (ALM - Altera)
- 2 slices = 1 Configurable Logic Block (CLB - Xilinx)
14Xilinx - Spartan II Architecture
- IOBs provide the interface between the package pins and the internal logic
- CLBs provide the functional elements for constructing most logic
- Dedicated block RAM memories (4096 bits each)
- Clock DLLs for clock distribution delay compensation and clock domain control
- Versatile multi-level interconnect structure
15Xilinx Virtex FPGA Model
Logic block
CLB
IO Mux
Switch Matrix
Switch Matrix
Line Segments
Programmable Interconnect Point (PIP)
16Virtex-II Architecture Overview
- 1 CLB = 4 slices
- 1 slice contains 2 function generators (F and G), which are configurable as
- 4-input look-up tables (LUTs), or
- 16-bit shift registers, or
- 16-bit distributed SelectRAM memory.
- DCM: Digital Clock Manager
- Block SelectRAM: 18 Kbit (2k x 9 bit) of dual-port RAM
- Multiplier blocks: 18-bit x 18-bit
Device: XC4VLX200
- CLB array (rows x columns): 192 x 116
- Logic cells: 200,448
- Slices: 89,088
- Distributed RAM: 1,392 Kb
- DSP: 96
- Block RAM: 336
- SelectRAM: 6,048 Kb
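The figures in the XC4VLX200 row can be cross-checked with a little arithmetic. The sketch assumes four slices per CLB (the value given later in this deck for the CLB) and reads the 336 entry as a count of 18 Kbit block RAMs; both are assumptions used only for the check.

```python
# Values read from the XC4VLX200 row.
clb_rows, clb_cols = 192, 116
listed_slices = 89_088
listed_block_rams = 336
listed_block_ram_kbits = 6_048

# Assumptions for the cross-check (see the lead-in).
SLICES_PER_CLB = 4
KBITS_PER_BLOCK_RAM = 18

clbs = clb_rows * clb_cols                                # total CLBs in the array
slices = clbs * SLICES_PER_CLB                            # slices implied by the array size
block_ram_kbits = listed_block_rams * KBITS_PER_BLOCK_RAM # total block RAM capacity

print(clbs, slices, block_ram_kbits)
```

Under these assumptions the computed slice count and block RAM capacity agree with the listed values, which supports reading the row as described.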
173. Fix, configurable, reconfigurable ...
- A simple classification
- Non-configurable computing
- Configurable computing
- Reconfigurable computing
- Each has its own characteristics, (dis)advantages
and applications
183.1. Non-Configurable Computing
- Uses fixed hardware such as ASICs or custom VLSI circuits (e.g. microprocessors like x86, Sparc, DEC, PowerPC, etc.)
- Long product turnaround time, usually around 3-6 months
- Optimized for performance
- Can be quite costly
- Hardwired, thus no room for error, re-work, or improvement
193.2. Configurable Computing
[Figure: the configuring host downloads a bitstream to the FPGA]
- The configuring host supervises the reconfiguration of the FPGA with a new bitstream
- A bitstream is a sequence of bits which represents the burn-in configuration of the Hardware Block (HB), e.g. a synthesized, placed and routed design
203.2. Configurable Computing (Contd)
- Advantages
- Uses configurable hardware such as FPGA or CPLD
- PLDs are soft wired for re-use of static hardware resources
- Cost effective
- Quick turnaround time
- Flexibility and ease in the design process
- Disadvantages
- Inefficient use of hardware resources: cannot use idle FPGA area during run-time
- Slow reconfiguration time, because the entire FPGA is reconfigured for a single Hardware Block (HB)
- Thus, execution must stop while reconfiguring a new Hardware Block
213.3. Reconfigurable Computing
[Figure: the configuring host sends several bitstreams to the FPGA]
We could also use a placement algorithm to possibly fit all requested HBs into the FPGA.
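One very simple placement strategy is first-fit. The sketch below is not the placement algorithm used by the authors; it assumes, for illustration only, that the FPGA is a row of columns and that each HB occupies a contiguous range of columns. The function name and the HB names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def first_fit_place(total_columns: int,
                    placed: Dict[str, Tuple[int, int]],
                    name: str,
                    width: int) -> Optional[int]:
    """Place HB `name` in the first free gap of `width` columns.

    `placed` maps HB name -> (start_column, width) and is updated in place.
    Returns the start column, or None if no gap is large enough.
    """
    cursor = 0  # first column not yet known to be occupied
    for start, occupied_width in sorted(placed.values()):
        if start - cursor >= width:
            break  # the gap in front of this HB is large enough
        cursor = max(cursor, start + occupied_width)
    if cursor + width > total_columns:
        return None
    placed[name] = (cursor, width)
    return cursor


fpga: Dict[str, Tuple[int, int]] = {}
a = first_fit_place(10, fpga, "A", 4)        # placed at column 0
b = first_fit_place(10, fpga, "B", 3)        # placed at column 4
c_first = first_fit_place(10, fpga, "C", 4)  # does not fit: only 3 columns left
del fpga["A"]                                # swap HB "A" out
c_second = first_fit_place(10, fpga, "C", 4) # now fits in the freed gap
```

Swapping out an idle HB frees a gap that a later request can reuse, which is the behaviour the slide alludes to; a real placer would also have to handle fragmentation and routing.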
223.3. Reconfigurable Computing (Contd)
- Advantages
- Same as Configurable Computing
- No need to completely stop the execution while reconfiguring the FPGA with a new HB
- Efficient use of static hardware resources: can swap out or move HBs around to fit new HBs on the FPGA, no need for a larger FPGA or a second one
- Fast reconfiguration times
- Run-time reconfiguration on the fly
- Less power consumption, as we can swap out HBs
- Disadvantages
- Routing HBs can be a heavy overhead for the configuring host, especially if HBs are too large or when defragmentation is necessary
23What is Run-Time Reconfiguration (RTR) ?
- On-the-fly flexibility
- Combines characteristics of co-processors with those of reconfigurable computing
- Introduces overhead to reconfigure the co-processor, but offsets it by increasing execution speed (faster in H/W!)
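The trade-off between reconfiguration overhead and faster execution can be made concrete with a break-even estimate: hardware wins once t_reconfig + N * t_hw < N * t_sw, i.e. N > t_reconfig / (t_sw - t_hw). The function name and all timing values below are made up for illustration.

```python
from typing import Optional


def break_even_calls(t_reconfig: float, t_sw: float, t_hw: float) -> Optional[int]:
    """Smallest number of calls N with t_reconfig + N * t_hw < N * t_sw.

    Returns None when the hardware version is not faster per call,
    in which case the reconfiguration overhead is never recovered.
    """
    if t_hw >= t_sw:
        return None
    # Smallest integer strictly greater than t_reconfig / (t_sw - t_hw).
    return int(t_reconfig // (t_sw - t_hw)) + 1


# Hypothetical timings in microseconds: 10 ms to reconfigure,
# 1000 us per call in software, 200 us per call in hardware.
n = break_even_calls(10_000, 1_000, 200)
```

With these illustrative numbers the hardware block pays off from the 13th call onward: 13 calls cost 12,600 us including reconfiguration, against 13,000 us in software.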
244. Reconfigurable Architectures
- External stand-alone processing unit
- Attached processing unit
- Reconfigurable functional unit
- Co-processor
- Processor embedded in a reconfigurable fabric
- (Compton & Hauck)
25External stand-alone processing unit
RPU coupled to the I/O system bus
- The RECON System
- John Reid Hauser, John Wawrzynek, Randy H. Katz (University of California, Berkeley)
- Consists of a SUN SparcStation host and a reconfigurable coprocessor board (the board uses an XC4010 FPGA as the reconfigurable processor unit).
26Attached processing unit
RPU coupled to the local bus
- TKDM
- Marco Platzner
- ETH Zurich
- FPGA module that uses the DIMM (dual inline memory module) bus for high-bandwidth communication with the host CPU.
- It is integrated with the Linux host OS
- Offers functions for data communication and FPGA reconfiguration.
27Attached processing unit (Cont.)
- MorphoSys
- Nader Bagherzadeh
- University of California, Irvine
- Coarse grain: MorphoSys operates on 8 / 16-bit data.
- Configuration: the RC array is configured by context words, which specify an instruction opcode for the RC.
- Depth of programmability: the Context Memory can store up to 32 planes of configuration.
- Dynamic reconfiguration: contexts are loaded into the Context Memory without interrupting RC operation.
- Local/Host Processor: the control processor (Tiny RISC) and the RC Array are resident on the same chip.
- Fast Memory Interface: through a DMA controller.
- Consists of a combination of a RISC processor core with an array of coarse-grain reconfigurable cells
- It utilizes a DMA controller in order to load the configuration data (context) into the Context Memory
28Reconfigurable functional unit
RPU integrated in the CPU
- Chimaera
- S. Hauck
- University of Washington, Seattle
- The system treats the reconfigurable logic as a cache for RPU instructions.
- Those instructions that have recently been executed, or that we can otherwise predict might be needed soon, are kept in the reconfigurable logic.
- If another instruction is required, it is brought into the RPU by overwriting one or more of the currently loaded instructions.
Chimaera
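The "reconfigurable logic as a cache" idea can be sketched as follows. The slide does not state which loaded instruction is overwritten, so this sketch assumes a least-recently-used policy and one slot per instruction; the class name and instruction names are hypothetical.

```python
from collections import OrderedDict


class RpuInstructionCache:
    """Toy model: RPU instructions are cached in the reconfigurable logic."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.loaded = OrderedDict()  # instruction id -> configuration, oldest first
        self.reconfigurations = 0

    def execute(self, instr_id: str) -> bool:
        """Run an RPU instruction; return True if it was already loaded."""
        if instr_id in self.loaded:
            self.loaded.move_to_end(instr_id)  # mark as most recently used
            return True
        if len(self.loaded) >= self.capacity:
            self.loaded.popitem(last=False)    # overwrite the least recently used
        self.loaded[instr_id] = f"config-for-{instr_id}"
        self.reconfigurations += 1             # a miss costs one reconfiguration
        return False


cache = RpuInstructionCache(capacity=2)
hits = [cache.execute(i) for i in ["add3", "popcnt", "add3", "crc", "popcnt"]]
```

Every miss stands for a reconfiguration of part of the RPU, so the hit rate determines how often the reconfiguration overhead is paid.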
29Co-processor
RPU coupled to the CPU
- GARP
- Hauser & Wawrzynek
- University of California, Berkeley
- A reconfigurable architecture that combines reconfigurable hardware with a standard MIPS processor on the same die to retain better feature performance.
- Two configurations can never be active at the same time on its reconfigurable array, which can significantly reduce the overall performance of the system.
305. RTR-SoC System Architecture
Execution unit of HBs
Allows dedicated OMA-RPU access
Stores program and data code
IBM OPB
Runs software instructions
Stores HB bitstreams
RTR-SoC System Architecture
31Application and Reconfiguration Flows
- While the application flow runs on AE, RE sends RTR_PREP_HB to the ICAP controller, to start the loading of the first HB bitstream onto the RPU.
- Once this HB is ready in the RPU, the ICAP sends back an RTR_ACK to the RE.
- The newly implemented HB on the RPU starts to work as soon as it is ENABLEd by the reconfiguration flow on RE.
- Upon completion, HB sets the flag RTR_DONE to make the application flow aware that it is ready for use.
- Once the application flow on AE has prepared the data that HB needs, AE asserts the flag DATA_READY.
- HB asserts EXE_DONE when it finishes its task and has prepared the results to be read by the application flow on AE.
- When the application flow needs these results, it checks the flag EXE_DONE, and waits if it is not yet set.
- The application flow gets the results and then asserts DATA_ACK to acknowledge to HB that it got the data.
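The order of the handshake flags above can be traced with a small single-threaded simulation. This is only a sketch of the sequence: the flag names come from the slide, while the `Handshake` class, the `run_once` function, and the doubling operation standing in for the HB's task are hypothetical. Where the real flows would wait for a flag, the sketch raises an error instead.

```python
class Handshake:
    """Records which flags have been asserted, and by whom, in order."""

    def __init__(self) -> None:
        self.flags = set()
        self.trace = []

    def assert_flag(self, who: str, flag: str) -> None:
        self.flags.add(flag)
        self.trace.append((who, flag))

    def require(self, flag: str) -> None:
        # A real flow would wait here; the sketch fails fast instead.
        if flag not in self.flags:
            raise RuntimeError(f"{flag} has not been asserted yet")


def run_once(operand: int):
    hs = Handshake()
    # Reconfiguration flow
    hs.assert_flag("RE", "RTR_PREP_HB")   # RE asks ICAP to load the HB bitstream
    hs.require("RTR_PREP_HB")
    hs.assert_flag("ICAP", "RTR_ACK")     # HB is ready in the RPU
    hs.require("RTR_ACK")
    hs.assert_flag("RE", "ENABLE")        # RE enables the new HB
    hs.require("ENABLE")
    hs.assert_flag("HB", "RTR_DONE")      # HB is ready for use
    # Application flow
    hs.require("RTR_DONE")
    hs.assert_flag("AE", "DATA_READY")    # input data prepared by AE
    hs.require("DATA_READY")
    result = operand * 2                  # placeholder for the HB's task
    hs.assert_flag("HB", "EXE_DONE")      # results can be read
    hs.require("EXE_DONE")
    hs.assert_flag("AE", "DATA_ACK")      # AE confirms it got the data
    return result, [flag for _, flag in hs.trace]


result, order = run_once(21)
```

The returned `order` list reproduces the flag sequence of the bullets, which makes it easy to see which side drives each step.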
32Final system architecture
RE
AE
33Tasks running on AE and RE
34Physical Layer Overview
- A physical layer has already been developed in JBits in order to evaluate RTR on a Xilinx Virtex device
- The physical layer has 3 main functions
- modeling the FPGA resources,
- running a placement algorithm for the different Hardware Blocks, and
- managing the physical resources of the FPGA and any on-board peripherals.
- RTR Execution Model
- Bitstream(s) are read by the JBits application
- The JBits application configures the Virtex RC HW located in the PCI slot using the XHWIF API.
- XHWIF (Xilinx HardWare InterFace standard)
- Java interface for communicating with FPGA-based boards.
- This enables run-time reconfiguration of the Virtex device.
JBits is a set of Java APIs and classes that provide a high-level language approach to developing reconfigurable systems, including run-time reconfiguration.
35Hardware Block (HB) Architecture
- An HB is a functional hardware module that contains its own configuration (i.e. the bitstream), and state information (e.g. status and control registers) that define its current state.
- It is divided into two major components
- The HB Dependent Unit (HBDU)
- Encompasses several components that vary in functionality and magnitude depending on the functions supported by a particular HB.
- The HB Independent Unit (HBIU)
- Designed as a core and hence follows a standardized implementation scheme for all HBs.
36Hardware Block Reconfiguration
- The HBs are partially reconfigured by the aforementioned Reconfigurable Processing Unit (RPU).
- The reconfiguration process is enabled by means of a Self-Reconfiguration Platform (SRP).
- It enables the FPGA to be dynamically reconfigured under the control of an embedded microprocessor.
- It is divided into a H/W component and a S/W component.
- The H/W component consists of four primary components: the Internal Configuration Access Port (ICAP), some control logic, a small configuration cache in Block RAM (BRAM), and an embedded processor.
- The S/W component implements an API that defines methods for accessing configuration logic through the ICAP port.
37PR Methodology Xilinx Virtex II Architecture
- The Virtex-II FPGA fabric is composed of
- an array of Configurable Logic Blocks (CLBs),
- Block RAMs (BRAM),
- Input/Output Blocks (IOBs),
- special function blocks such as multipliers, PLLs etc.
- Each CLB contains four slices.
- Each slice contains two 4-input look-up tables and 2 D-type flip-flops to implement combinational and sequential circuits.
38PR Methodology
- Bus Macros (BMs) are required between active and static modules of the design.
- The size and location of the reconfigurable module (active) is always fixed.
- The reconfigurable module is always the full height of the device
- All logic resources located within the width of the module are considered part of the reconfigurable module's bitstream frame. This includes slices, tri-state buffers (TBUFs), block RAMs (BRAMs), multipliers, input/output blocks (IOBs), and all routing resources.
39PR Methodology
Bus Macro block Diagram
- Bus Macros (BMs) are predefined physical routing bridges that connect the active module to the static one.
- Any connection from active to static logic should always go through a bus macro
- We chose the slice-based bus macros (over the TBUF-based ones) as they give a higher concentration of communication bits per CLB
- A bus macro allows data to move in only one direction, either left-to-right or right-to-left.
40PR Methodology
Final Design Layout
Design contains only one active module. All other
logic components are on the static module.
41Xilinx Internal Configuration Access Port (ICAP)
PR Methodology
- Provides configuration interface to FPGA fabric.
- Cache BRAM to hold at least one frame.
- Control logic for the OPB bus interface.
- API calls to allow SW to read/Write configuration
memory.
42PR Methodology
- A partial bitstream is generated for the active (dynamic) part of the FPGA
- The device remains in full operation while the new partial bitstream is downloaded
- The full bitstream configuration must already be programmed into the device before downloading the partial bitstream.
- Multiple bitstreams can be generated for every partially reconfigurable module variation
- The -g ActiveReconfig:Yes option must be used: failing to use it will assert the global set/reset (GSR) during configuration, resetting the entire design
43PR Methodology
- Virtex-II configuration memory is arranged in vertical frames that are one bit wide and stretch from the top edge of the device to the bottom.
- These frames are the smallest addressable segments of the Virtex-II configuration memory space; therefore, all operations must act on whole configuration frames.
- The length of a Virtex-II frame is not fixed and depends on the size of the device.
- The number of frames per column type is constant for all devices.
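Because a frame is the smallest addressable unit, changing a single configuration bit means reading the whole frame, modifying it, and writing the whole frame back. The sketch below illustrates that read-modify-write pattern on a toy memory model; the class and function names are hypothetical and do not correspond to the Xilinx ICAP API.

```python
class ConfigMemoryModel:
    """Toy configuration memory that can only be accessed frame by frame."""

    def __init__(self, num_frames: int, frame_length_bits: int) -> None:
        self.frame_length_bits = frame_length_bits
        self.frames = [[0] * frame_length_bits for _ in range(num_frames)]

    def read_frame(self, index: int) -> list:
        # Return a copy, standing in for a frame held in a BRAM cache.
        return list(self.frames[index])

    def write_frame(self, index: int, frame: list) -> None:
        if len(frame) != self.frame_length_bits:
            raise ValueError("only whole frames can be written")
        self.frames[index] = list(frame)


def set_config_bit(memory: ConfigMemoryModel,
                   frame_index: int, bit_index: int, value: int) -> None:
    """Change one bit by read-modify-write of the frame that contains it."""
    frame = memory.read_frame(frame_index)   # 1. read the whole frame
    frame[bit_index] = value                 # 2. modify the cached copy
    memory.write_frame(frame_index, frame)   # 3. write the whole frame back


memory = ConfigMemoryModel(num_frames=4, frame_length_bits=16)
set_config_bit(memory, frame_index=2, bit_index=5, value=1)
```

The need to buffer at least one complete frame during this sequence is why a frame-sized cache is useful on the configuration path.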
44Reconfigurable Processing Unit
The RPU high-level block diagram
45Preliminary Results
- Xilinx Virtex-II Platform FPGAs were used to implement this system.
- Preliminary results were generated using ModelSim SE 5.7f.
Simulation results for the HB I/F interface. They illustrate how the I/F is used in order to enable proper synchronization between the reconfiguration flow and the application flow.
466. Conclusion and Future Work
- A novel architecture of an RTR SoC is introduced
- RPU and HBs are designed
- This design targets adaptive embedded systems, DSP-related and low-power applications
- These functions are implemented as HBs and can be exploited in a multi-purpose environment. For example, the RTR SoC may execute various tasks to perform DSP-related functions, and subsequently be reconfigured into a high-performance measurement processing system
- Future designs would allow the user more flexibility by auto-reconfiguring the RPU depending on the computational and functional needs of its respective applications
- Real-time applications are our future target, as idle HBs are swapped out of the RPU to save power or to allow for updates to the HBs
47References
- Marco Platzner. Reconfigurable Computer Architectures, e&i Elektrotechnik und Informationstechnik, 115(3):143-148, 1998. Springer.
- Y. Li, T. Callahan, E. Darnel, R. Harr, U. Kurkure and J. Stockwood. Hardware-Software Co-Design of Embedded Reconfigurable Architectures, 37th Design Automation Conference, Proceedings DAC 2000, pp. 507-512, June 5-9, 2000.
- J. P. Heron, R. Woods, S. Sezer, and R. H. Turner. Development of a run-time reconfiguration system with low reconfiguration overhead, Journal of VLSI Signal Processing, 28(1/2):97-113, May 2001.
- Xilinx Microblaze Soft Processor Core, http://www.xilinx.com/ise/embedded/edk6_2docs/mbref_guide.pdf, last accessed on October 19, 2004.
- G. Aggarwal, N. Thaper, K. Aggarwal, M. Balakrishnan, and S. Kumar. A Novel Reconfigurable Co-Processor Architecture, In Proceedings of Tenth International Conference on VLSI Design, pages 370-375, January 1997.
- G. Haug and W. Rosenstiel. Reconfigurable Hardware as Shared Resource in Multipurpose Computers, In Reiner W. Hartenstein and Andres Keevallik, editors, Field-Programmable Logic: From FPGAs to Computing Paradigm, Springer-Verlag, pages 149-158, Berlin, August/September 1998.
- Xilinx Virtex-II Platform FPGAs: Complete Data Sheet, DS031 (14 Oct. 2003).
- D. Wo and K. Forward. Compiling to the Gate Level for a Reconfigurable Co-Processor, In Proceedings of FPGAs for Custom Computing Machines (1994), pages 147-154.
- V. Groza, R. Abielmona, M. El-Kadri, N. Sakr, and M. Elbadri. A Reconfigurable Co-Processor for Adaptive Embedded Systems, Workshop on Intelligent Solutions in Embedded Systems, Graz, Austria, June 2004.
- IBM On-Chip Peripheral Bus, http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9A7AFA74DAD200D087256AB30005F0C8/file/OpbBus.pdf, last accessed on October 19, 2004.
- R. Abielmona, V. Groza, N. Sakr, and J. Ho. Low-Level Run-Time Reconfiguration of FPGAs for Dynamic Environments, IEEE Canadian Conference on Electrical and Computer Engineering, CCECE 2003, Niagara Falls, May 2004.
- B. Blodget, P. James-Roxby, E. Keller, S. McMillian, and P. Sundararajan. A Self-reconfiguring Platform, Proceedings of the International Conference on Field Programmable Logic, Lisbon, Portugal, Sept. 2003.