Concurrent Engineering

Slides: 48

Provided by: voicu1

1
Hardware/Software Codesign of Embedded Systems
Reconfigurable Computing
Voicu Groza, SITE Hall, Room 5017, 562 5800 ext. 2159
Groza_at_SITE.uOttawa.ca
2
Outline

Introduction
Enabling Technologies
Fixed, configurable, reconfigurable ...
Reconfigurable Architectures
Run-Time-Reconfigurable System-on-Chip
Conclusion and Future Work
References

3
1. Introduction

Reconfigurable computing: definition
Why reconfigurable computing?

4
Reconfigurable Computing - Definition

Reconfigurable Computing (RC): presence of
hardware (HW) that can be reconfigured
(reconfigware - RW)
1960: Gerald Estrin, The UCLA Fixed-Plus-Variable
(F+V) Structure Computer
DeHon and Wawrzynek: computing via a
postfabrication and spatially programmed
connection of processing elements.
The architecture used in the computation is
determined postfabrication and can therefore
adapt to the characteristics of the executed
algorithms.
The computation is spatial, in contrast to the
more temporal style associated with
microprocessors.

5
Re-inventing the wheel...

wire your own computer

6
Why reconfigurable computing?

Is your belt long enough?
Embedded hand-held devices need to reduce:
the power consumption targets,
the acceptable packaging and manufacturing costs,
the time-to-market
High-performance computing
Today's computationally intensive applications
require more processing power:
streaming video,
image recognition and processing,
highly interactive services,
telecommunications,
genes
Cray revived its latest entry-level XD1
supercomputer by combining AMD Opteron processors
with FPGAs for compute acceleration in a Linux
environment.

7
Why reconfigurable computing (cont.)

High-performance microprocessors
PRO: Versatile (SW). Off-the-shelf solution.
CON: For some applications might not be fast enough. Power consumption (>100 W/gigaFLOP). Cost (ks).

Reconfigurable Computing Systems
PRO: Versatile (SW and HW). Computing structure matches application. Given fabric can implement numerous functional units. Built out of off-the-shelf components, reducing design time.
CON: Wires are slow. Big bit-slices are costly to interconnect -> large silicon area. Performance overhead. Devices must store configuration on the chip.

Application-Specific Integrated Circuits (ASIC)
PRO: Does not suffer from the serial (and often slow and power-hungry) instruction fetch, decode and execute cycle that is at the heart of all microprocessors. Consumes less power.
CON: Fixed structure. The cost of producing an ASIC (the masks cost 1 M). The time to develop a custom integrated circuit.
8
2. Enabling Technologies

Programmable ICs: CPLD and FPGA (Xilinx, 1984)
HW abstractions
Fine-grained: reconfiguration is at the gate and
register level.
By reconfiguration of registers, gates, and their
interconnections, the internal structure of
functional units is changed.
2 major technologies:
Complex Programmable Logic Devices (CPLD):
EEPROM based
Field-Programmable Gate Arrays (FPGA): SRAM
based
Coarse-grained: reconfiguration is based on a set
of fixed blocks, like functional units, processor
cores, and memory tiles.
The reconfiguration is merely the reprogramming
of the interconnections between the fixed blocks.

9
Complex Programmable Logic Devices (CPLD)

Supplied with no predetermined logic function.
Programmed by user to implement any digital logic
function.
Requires specialized computer software for design
and programming.
Complex PLD (CPLD): a PLD that has several
programmable sections with internal
interconnections between the sections.
The basic building block of a CPLD is a macrocell,
which implements a logic function that is
synthesized into sum-of-products equations,
followed by a D-type register.
Macrocells are grouped into logic blocks which
are connected via a centralized interconnect
array.
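
As a rough illustration of the macrocell described above, the sketch below models a sum-of-products stage followed by a D-type register. The class name `Macrocell`, the product-term encoding and the example function are assumptions made for illustration; they do not describe the internals of any particular device.

```python
from typing import Dict, List

# A product term maps an input name to the level it requires:
# True = input must be 1, False = input must be 0 (complemented literal).
ProductTerm = Dict[str, bool]


class Macrocell:
    """Toy macrocell: an OR of AND terms feeding a D-type flip-flop."""

    def __init__(self, product_terms: List[ProductTerm]) -> None:
        self.product_terms = product_terms
        self.q = False  # registered output, updated only on a clock edge

    def combinational(self, inputs: Dict[str, bool]) -> bool:
        # Sum of products: any fully satisfied AND term makes the OR true.
        return any(
            all(inputs[name] == level for name, level in term.items())
            for term in self.product_terms
        )

    def clock(self, inputs: Dict[str, bool]) -> bool:
        # Clock edge: the register captures the sum-of-products value.
        self.q = self.combinational(inputs)
        return self.q


# Hypothetical example function: f = (a AND b) OR ((NOT a) AND c)
cell = Macrocell([{"a": True, "b": True}, {"a": False, "c": True}])
```

Calling `cell.combinational(...)` evaluates the logic without touching the register, while `cell.clock(...)` stores the result in `cell.q`, which mirrors the split between the combinational array and the flip-flop.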

10
Altera MAX 7000 macrocell
11
Field-Programmable Gate Array (FPGA)

Reconfigurable functional units
coarse grained - ALUs and storage
fine-grained - small lookup tables

Interconnection network
Universal gates and/or storage elements
Switches
12
Basic ingredient: Look-Up Table (LUT)
Universal gate: look-up table memory
Memory elements: SRAM
[Figure: logic cell built around a look-up table with stored data 0 0 0 1 and address inputs a0, a1, a2]
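
The idea of a look-up table as a universal gate can be sketched in a few lines. The example assumes that the four values 0 0 0 1 on the slide are the stored table contents and that a0 is the least significant address bit; the class name `LookUpTable` is made up for this sketch.

```python
class LookUpTable:
    """k-input LUT: 2**k stored bits, with the inputs forming the read address."""

    def __init__(self, contents):
        n = len(contents)
        if n < 2 or n & (n - 1):
            raise ValueError("a LUT needs 2**k entries")
        self.contents = list(contents)        # models the SRAM cells
        self.num_inputs = n.bit_length() - 1

    def read(self, *inputs):
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        address = 0
        # Inputs are passed as a0, a1, ...; a0 is the least significant bit.
        for position, bit in enumerate(inputs):
            address |= (bit & 1) << position
        return self.contents[address]


and_gate = LookUpTable([0, 0, 0, 1])  # contents assumed from the slide
xor_gate = LookUpTable([0, 1, 1, 0])  # same structure, different contents
```

Only the stored bits differ between `and_gate` and `xor_gate`, which is the sense in which the LUT acts as a universal gate: reconfiguring the logic means rewriting the memory.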
13
Configurable Logic Blocks (CLB - Xilinx) / Logic
Array Block (LAB - Altera)
Xilinx Spartan II CLB
HW abstractions:
2 logic cells = 1 slice (Xilinx) or 1 Adaptive
Logic Module (ALM - Altera)
2 slices = 1 Configurable Logic Block (CLB -
Xilinx)
14
Xilinx - Spartan II Architecture

IOBs provide the interface between the package
pins and the internal logic
CLBs provide the functional elements for
constructing most logic
Dedicated block RAM memories (4096 bits each)
Clock DLLs for clock distribution delay
compensation and clock domain control
Versatile multi-level interconnect structure

15
Xilinx Virtex FPGA Model
[Figure labels: logic block (CLB), IO mux, switch matrices, line segments, Programmable Interconnect Point (PIP)]
16
Virtex-II Architecture Overview

1 CLB = 4 slices
1 slice contains 2 function generators (F and G),
which are configurable as
4-input look-up tables (LUTs), or
16-bit shift registers, or
16-bit distributed SelectRAM memory.

DCM: Digital Clock Manager
Block SelectRAM: 18 Kbit (2k x 9 bit) of dual-port RAM
Multiplier blocks: 18-bit x 18-bit

Device | CLBs (Row x Col) | Logic Cells | Slices | DistribRAM (Kb) | DSP | BlockRAM (Kb) | SelRam
XC4VLX200 | 192 x 116 | 200,448 | 89,088 | 1392 | 96 | 336 | 6,048
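
The figures listed for the XC4VLX200 can be cross-checked with a little arithmetic. The sketch below uses only numbers from that table row; the variable names are chosen for this example.

```python
# Values copied from the XC4VLX200 row of the table.
clb_rows, clb_cols = 192, 116
logic_cells = 200_448
slices = 89_088

clbs = clb_rows * clb_cols                 # size of the CLB array
slices_per_clb = slices / clbs             # how many slices each CLB holds
logic_cells_per_slice = logic_cells / slices

print(f"CLBs: {clbs}")
print(f"slices per CLB: {slices_per_clb}")
print(f"logic cells per slice: {logic_cells_per_slice}")
```

For this device the division comes out even: the array holds 22,272 CLBs, with 4 slices per CLB and 2.25 logic cells counted per slice.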
17
3. Fixed, configurable, reconfigurable ...

A simple classification
Non-configurable computing
Configurable computing
Reconfigurable computing
Each has its own characteristics, (dis)advantages
and applications

18
3.1. Non-Configurable Computing

Uses fixed hardware such as ASICs or custom VLSI
circuits (e.g. microprocessors like x86, SPARC,
DEC, PowerPC, etc.)
Long product turnaround time, usually around 3-6
months
Optimized for performance
Can be quite costly
Hardwired, thus no room for error, re-work, or
improvement

19
3.2. Configurable Computing
[Figure: a configuring host and the bitstream it loads into the FPGA]

The configuring host supervises the
reconfiguration of the FPGA with a new bitstream
A bitstream is a sequence of bits which
represents the burn-in configuration of the
Hardware Block (HB), e.g. a synthesized, placed
and routed design

20
3.2. Configurable Computing (Contd)

Advantages:
Uses configurable hardware such as FPGA or CPLD
PLDs are soft wired for re-use of static hardware
resources
Cost effective
Quick turnaround time
Flexible and easy design process
Disadvantages:
Inefficient use of hardware resources: cannot use
idle FPGA area during run-time
Slow reconfiguration, because the entire FPGA is
reconfigured for a single Hardware Block (HB)
Thus, execution must stop while a new Hardware
Block is being configured

21
3.3. Reconfigurable Computing
[Figure: a configuring host and the HB bitstreams it loads into the FPGA]
We could also use a placement algorithm to
possibly fit all requested HBs into the FPGA
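
One simple way to picture such a placement algorithm is a first-fit search over FPGA columns. The sketch below is only an illustration under stated assumptions: each HB occupies a contiguous range of full-height columns, the function names `first_fit` and `place_all` are hypothetical, and the device size and HB widths are invented. It is not the placement algorithm used in the system described here.

```python
from typing import Dict, List, Optional, Tuple


def first_fit(num_columns: int,
              placed: Dict[str, Tuple[int, int]],
              width: int) -> Optional[int]:
    """Return the leftmost start column of a free gap of `width` columns.

    `placed` maps an HB name to (start_column, width). Returns None when
    no gap is wide enough.
    """
    cursor = 0
    for start, w in sorted(placed.values()):
        if start - cursor >= width:
            return cursor
        cursor = max(cursor, start + w)
    return cursor if num_columns - cursor >= width else None


def place_all(num_columns: int,
              requests: List[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Place HBs in request order; HBs that do not fit are left out."""
    placed: Dict[str, Tuple[int, int]] = {}
    for name, width in requests:
        start = first_fit(num_columns, placed, width)
        if start is not None:
            placed[name] = (start, width)
    return placed


# Hypothetical device with 16 columns and three requested HBs.
layout = place_all(16, [("fir", 6), ("fft", 8), ("crc", 4)])
```

With these made-up widths, `fir` and `fft` fit side by side and `crc` is rejected because only two columns remain, which is the situation where swapping out or moving an HB becomes necessary.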
22
3.3. Reconfigurable Computing (Contd)

Advantages:
Same as Configurable Computing
No need to completely stop the execution while
reconfiguring the FPGA with a new HB
Efficient use of static hardware resources: can
swap out or move HBs around to fit new HBs on the
FPGA, no need for a larger FPGA or a second one
Fast reconfiguration times
Run-time reconfiguration on the fly
Lower power consumption, as we can swap out HBs
Disadvantages:
Routing HBs can be a heavy overhead for the
configuring host, especially if HBs are too large
or when defragmentation is necessary

23
What is Run-Time Reconfiguration (RTR)?

On-the-fly flexibility
Combines characteristics of co-processors with
those of reconfigurable computing
Introduces overhead to reconfigure the
co-processor, but offsets it by increasing
execution speed (faster in H/W!)
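
The trade-off between reconfiguration overhead and faster execution can be made concrete with a break-even estimate. The sketch assumes a single reconfiguration followed by n calls, so hardware wins once n * t_sw > t_reconfig + n * t_hw. The function name and the timing values (in microseconds) are hypothetical.

```python
def break_even_calls(t_reconfig: int, t_sw: int, t_hw: int) -> int:
    """Smallest number of calls for which hardware execution, including
    one reconfiguration, is strictly faster than software execution.

    All times are integers in the same unit (e.g. microseconds).
    """
    if t_hw >= t_sw:
        raise ValueError("hardware must be faster per call to ever pay off")
    # n * t_sw > t_reconfig + n * t_hw  <=>  n > t_reconfig / (t_sw - t_hw)
    return t_reconfig // (t_sw - t_hw) + 1


# Hypothetical timings: 4000 us to reconfigure, 1000 us per call in
# software, 200 us per call in hardware.
calls_needed = break_even_calls(4000, 1000, 200)
```

With these assumed numbers the HB has to be called 6 times before the reconfiguration overhead is recovered; at 5 calls both alternatives take 5000 us.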

24
4. Reconfigurable Architectures

External stand-alone processing unit
Attached processing unit
Reconfigurable functional unit
Co-processor
Processor embedded in a reconfigurable fabric
(Compton and Hauck)

25
External stand-alone processing unit
RPU coupled to the I/O system bus
The RECON System
John Reid Hauser, John Wawrzynek, Randy H. Katz
(University of California, Berkeley)
Consists of a SUN SparcStation host and a
reconfigurable coprocessor board (the board uses
an XC4010 FPGA as the reconfigurable processor
unit).
26
Attached processing unit
RPU coupled to the local bus

TKDM
Marco Platzner
ETH Zurich
FPGA module that uses the DIMM (dual inline
memory module) bus for high-bandwidth
communication with the host CPU.
It is integrated with the Linux host OS and
offers functions for data communication and FPGA
reconfiguration.

27
Attached processing unit (Cont.)

MorphoSys
Nader Bagherzadeh
University of California, Irvine
Coarse grain: MorphoSys operates on 8/16-bit
data.
Configuration: the RC array is configured by
context words, which specify an instruction
opcode for RC.
Depth of programmability: the Context Memory can
store up to 32 planes of configuration.
Dynamic reconfiguration: contexts are loaded into
Context Memory without interrupting RC operation.
Local/host processor: the control processor (Tiny
RISC) and RC Array are resident on the same chip.
Fast memory interface: through DMA controller.

Consists of a combination of a RISC processor
core with an array of coarse-grain reconfigurable
cells
It utilizes a DMA controller in order to load the
configuration data (context) into the Context
Memory

28
Reconfigurable functional unit
RPU integrated in the CPU

Chimaera
S. Hauck
University of Washington, Seattle
System treats the reconfigurable logic as a cache
for RPU instructions.
Those instructions that have recently been
executed, or that we can otherwise predict might
be needed soon, are kept in the reconfigurable
logic.
If another instruction is required, it is brought
into the RPU by overwriting one or more of the
currently loaded instructions.

Chimaera
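
The cache-like behaviour described for Chimaera can be sketched as follows. This is a simplified model, not Chimaera's actual mechanism: it assumes every RPU instruction occupies exactly one slot and that the least recently used instruction is the one overwritten, and the class and instruction names are invented.

```python
from collections import OrderedDict


class RpuInstructionCache:
    """Toy model: reconfigurable logic holding a few loaded instructions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.loaded = OrderedDict()   # instruction name -> configuration
        self.reconfigurations = 0

    def execute(self, name, load_config):
        if name in self.loaded:
            # Hit: the instruction is already configured; mark it as recent.
            self.loaded.move_to_end(name)
        else:
            # Miss: overwrite the least recently used instruction if full.
            if len(self.loaded) >= self.capacity:
                self.loaded.popitem(last=False)
            self.loaded[name] = load_config(name)
            self.reconfigurations += 1
        return self.loaded[name]


cache = RpuInstructionCache(capacity=2)
for op in ["mac", "sad", "mac", "crc", "sad"]:   # hypothetical instructions
    cache.execute(op, lambda n: f"<configuration for {n}>")
```

In this run only the second `mac` is a hit, so four of the five executions trigger a reconfiguration; a larger array or a better prediction policy would reduce that count.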
29
Co-processor
RPU coupled to the CPU

GARP
Hauser and Wawrzynek
University of California, Berkeley
A reconfigurable architecture that combines
reconfigurable hardware with a standard MIPS
processor on the same die to retain better
feature performance.
Two configurations can never be active at the
same time on its reconfigurable array, which can
significantly reduce the overall performance of
the system.

30
5. RTR-SoC System Architecture
[Figure: RTR-SoC system architecture. Block annotations: execution unit of HBs; allows dedicated OMA-RPU access; stores program and data code; IBM OPB; runs software instructions; stores HB bitstreams]
31
Application and Reconfiguration Flows

While the application flow runs on AE, RE sends
RTR_PREP_HB to the ICAP controller, to start the
loading of the first HB bitstream onto the RPU.
Once this HB is ready in the RPU, the ICAP sends
back an RTR_ACK to the RE.
The newly implemented HB on the RPU starts to
work as soon as it is ENABLEd by the
reconfiguration flow on RE.
Upon completion, HB sets flag RTR_DONE to make
the application flow aware that it is ready for
use.
Once the application flow on AE has prepared data
that HB needs, AE asserts the flag DATA_READY.
HB asserts EXE_DONE when it finishes its task and
has prepared the results to be read by the
application flow on AE.
When the application flow needs these results, it
checks the flag EXE_DONE, and waits if it is not
yet set.
The application flow gets the results and then
asserts DATA_ACK to acknowledge to HB that it got
data.
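
The handshake above can be simulated in software to see the order in which the flags are raised. In this sketch Python threads and `threading.Event` objects stand in for the AE, RE, ICAP controller and HB; the flag names come from the slide, while the helper names, the shared dictionary and the placeholder task (summing three numbers) are assumptions made for illustration.

```python
import threading

FLAGS = ("RTR_PREP_HB", "RTR_ACK", "ENABLE", "RTR_DONE",
         "DATA_READY", "EXE_DONE", "DATA_ACK")
flag = {name: threading.Event() for name in FLAGS}
shared = {}                  # stands in for the registers between AE and HB
trace = []                   # order in which the flags were raised
trace_lock = threading.Lock()


def raise_flag(name):
    with trace_lock:
        trace.append(name)
    flag[name].set()


def icap_controller():
    flag["RTR_PREP_HB"].wait()
    # ... the HB bitstream would be loaded onto the RPU here ...
    raise_flag("RTR_ACK")


def reconfiguration_flow():          # runs on RE
    raise_flag("RTR_PREP_HB")
    flag["RTR_ACK"].wait()
    raise_flag("ENABLE")


def hardware_block():
    flag["ENABLE"].wait()
    raise_flag("RTR_DONE")           # HB is ready for use
    flag["DATA_READY"].wait()
    shared["result"] = sum(shared["operands"])   # placeholder task
    raise_flag("EXE_DONE")
    flag["DATA_ACK"].wait()


def application_flow():              # runs on AE
    flag["RTR_DONE"].wait()
    shared["operands"] = [1, 2, 3]
    raise_flag("DATA_READY")
    flag["EXE_DONE"].wait()          # waits if the HB has not finished yet
    value = shared["result"]
    raise_flag("DATA_ACK")
    return value


workers = [threading.Thread(target=f)
           for f in (icap_controller, reconfiguration_flow, hardware_block)]
for w in workers:
    w.start()
result = application_flow()
for w in workers:
    w.join()
```

Because every flag is raised only after the previous one has been observed, `trace` ends up in the same order as `FLAGS` regardless of thread scheduling.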

32
Final system architecture
RE
AE
33
Tasks running on AE and RE
34
Physical Layer Overview

A physical layer has already been developed in
JBits in order to evaluate RTR on a Xilinx Virtex
device
The physical layer has 3 main functions:
modeling the FPGA resources,
running a placement algorithm for the different
Hardware Blocks, and
managing the physical resources of the FPGA and
any on-board peripherals.

RTR Execution Model
Bitstream(s) are read by the JBits App
The JBits App configures the Virtex RC HW located
in the PCI slot using the XHWIF API.
XHWIF (Xilinx HardWare InterFace Standard):
Java interface for communicating with FPGA-based
boards.
This enables run-time reconfiguration of the
Virtex device.

JBits is a set of Java APIs and classes that
provide a high-level language approach to
developing reconfigurable systems, including
run-time reconfiguration.
35
Hardware Block (HB) Architecture

An HB is a functional hardware module that
contains its own configuration (i.e. the
bitstream), and state information (e.g. status
and control registers) that define its current
state.
It is divided into two major components:
The HB Dependent Unit (HBDU):
encompasses several components that vary in
functionality and magnitude depending on the
functions supported by a particular HB.
The HB Independent Unit (HBIU):
designed as a core and hence follows a
standardized implementation scheme for all HBs.

36
Hardware Block Reconfiguration

The HBs are partially reconfigured by the
aforementioned Reconfigurable Processing Unit
(RPU).
The reconfiguration process is enabled by means
of a Self-Reconfiguration Platform (SRP).
It enables the FPGA to be dynamically
reconfigured under the control of an embedded
microprocessor.
It is divided into a H/W component and a S/W
component.

The H/W component consists of four primary
components: the Internal Configuration Access
Port (ICAP), some control logic, a small
configuration cache - Block RAM (BRAM), and an
embedded processor.
The S/W component implements an API that defines
methods for accessing configuration logic through
the ICAP port.

37
PR Methodology: Xilinx Virtex II Architecture

The Virtex II FPGA fabric is composed of an array
of:
Configurable Logic Blocks (CLBs).
Block RAMs (BRAM).
Input/Output Blocks (IOBs).
Special function blocks such as multipliers,
PLLs etc.

Each CLB contains four slices.
Each slice contains two 4-input look-up tables
and two D-type flip-flops to implement
combinational and sequential circuits.

38
PR Methodology

Bus Macros (BMs) are required between active and
static modules of the design.
The size and location of the reconfigurable
module (active) is always fixed.
The reconfigurable module is always the full
height of the device
All logic resources located within the width of
the module are considered part of the
reconfigurable module's bitstream frame. This
includes slices, tri-state buffers (TBUFs), block
RAMs (BRAMs), multipliers, input/output blocks
(IOBs), and all routing resources.

39
PR Methodology
Bus Macro block Diagram

Bus Macros (BMs) are predefined physical routing
bridges that connect the active module to the
static one.
Any connection from active to static logic should
always go through a bus macro
We chose the slice-based bus macros (over the
TBUF-based ones) as they give a higher
concentration of communication bits per CLB
A bus macro allows data to move in only one
direction, either left-to-right or right-to-left.

40
PR Methodology
Final Design Layout
Design contains only one active module. All other
logic components are on the static module.
41
Xilinx Internal Configuration Access Port (ICAP)
PR Methodology

Provides configuration interface to FPGA fabric.
Cache BRAM to hold at least one frame.
Control logic for the OPB bus interface.
API calls to allow SW to read/write configuration
memory.
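
How software might use a one-frame cache to change configuration memory can be sketched as a read-modify-write cycle. Everything in this example is a hypothetical in-memory model: `ToyIcap`, its methods and `modify_bit` are invented names and do not mirror the actual Xilinx API or the ICAP hardware interface.

```python
class ToyIcap:
    """In-memory stand-in for configuration memory accessed frame by frame.

    Hypothetical model for illustration; it does not mirror the Xilinx API.
    """

    def __init__(self, num_frames, frame_bits):
        self.frame_bits = frame_bits
        self._memory = [[0] * frame_bits for _ in range(num_frames)]
        self._cache = None           # models the BRAM that holds one frame
        self._cached_index = None

    def read_frame(self, index):
        # Copy one whole frame from configuration memory into the cache.
        self._cache = list(self._memory[index])
        self._cached_index = index

    def set_bit(self, offset, value):
        # Software edits the cached copy, not the live configuration.
        self._cache[offset] = value & 1

    def write_frame(self):
        # Write the whole cached frame back to configuration memory.
        self._memory[self._cached_index] = list(self._cache)

    def get_bit(self, index, offset):
        return self._memory[index][offset]


def modify_bit(icap, frame, offset, value):
    """Read-modify-write of a single configuration bit."""
    icap.read_frame(frame)
    icap.set_bit(offset, value)
    icap.write_frame()


icap = ToyIcap(num_frames=4, frame_bits=32)
modify_bit(icap, frame=2, offset=5, value=1)
```

The point of the design is that even a one-bit change moves a complete frame through the cache, which is why the cache needs to hold at least one frame.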

42
PR Methodology

A partial bitstream is generated for the active
(dynamic) part of the FPGA
The device remains in full operation while the
new partial bitstream is downloaded
The full bitstream configuration must already be
programmed into the device before downloading the
partial bitstream.
Multiple bitstreams can be generated for every
partially reconfigurable module variation
Partial bitstreams are generated with the
-g ActiveReconfig:Yes option
Failing to use this option will assert the
global set/reset (GSR) during configuration,
resetting the entire design

43
PR Methodology

Virtex-II configuration memory is arranged in
vertical frames that are one bit wide and stretch
from the top edge of the device to the bottom.
These frames are the smallest addressable
segments of the Virtex-II configuration memory
space; therefore, all operations must act on
whole configuration frames.
The length of a Virtex-II frame is not fixed and
depends on the size of the device.
The number of frames per column type is constant
for all devices.
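
Because all operations act on whole frames, the size of a partial bitstream scales with the number of columns a module covers. The sketch below estimates that size and the resulting load time. It is a back-of-the-envelope model: it assumes a single column type, one configuration word per clock cycle and no header or command overhead, and every number in it is a hypothetical placeholder rather than a data-sheet value.

```python
def partial_bitstream_bits(columns: int, frames_per_column: int,
                           frame_length_bits: int) -> int:
    """Configuration bits for a full-height module `columns` wide."""
    return columns * frames_per_column * frame_length_bits


def load_time_seconds(bits: int, port_width_bits: int, clock_hz: int) -> float:
    """Time to push `bits` through a configuration port, one word per cycle."""
    cycles = -(-bits // port_width_bits)   # ceiling division
    return cycles / clock_hz


# All numbers below are hypothetical placeholders, not data-sheet values.
FRAMES_PER_COLUMN = 20
FRAME_LENGTH_BITS = 1_000
DEVICE_COLUMNS = 40

module_bits = partial_bitstream_bits(4, FRAMES_PER_COLUMN, FRAME_LENGTH_BITS)
device_bits = partial_bitstream_bits(DEVICE_COLUMNS, FRAMES_PER_COLUMN,
                                     FRAME_LENGTH_BITS)
module_time = load_time_seconds(module_bits, port_width_bits=8,
                                clock_hz=50_000_000)
device_time = load_time_seconds(device_bits, port_width_bits=8,
                                clock_hz=50_000_000)
```

Under these assumptions a module spanning 4 of 40 columns needs one tenth of the bits, and therefore one tenth of the load time, of a full-device configuration.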

44
Reconfigurable Processing Unit
The RPU high-level block diagram
45
Preliminary Results

Xilinx Virtex-II Platform FPGAs were used to
implement this system.
Preliminary results were generated using ModelSim
SE 5.7f.

Simulation results for the HB interface (I/F). They
illustrate how the I/F is used in order to enable
proper synchronization between the reconfiguration
flow and the application flow.
46
6. Conclusion and Future Work

A novel architecture of an RTR SoC is introduced
RPU and HBs are designed
This design targets adaptive embedded systems,
DSP-related and low-power applications
These functions are implemented as HBs and can be
exploited in a multi-purpose environment. For
example, the RTR SoC may execute various tasks to
perform DSP-related functions, and subsequently
be reconfigured into a high-performance
measurement processing system
Future designs would allow the user more
flexibility by auto-reconfiguring the RPU
depending on the computational and functional
needs of its respective applications
Real-time applications are our future target, as
idle HBs are swapped out of the RPU to save
power or to allow for updates to the HBs

47
References

Marco Platzner. Reconfigurable Computer
Architectures, e&i Elektrotechnik und
Informationstechnik, 115(3):143-148, 1998.
Springer.
Y. Li, T. Callahan, E. Darnel, R. Harr, U.
Kurkure and J. Stockwood. Hardware-Software
Co-Design of Embedded Reconfigurable
Architectures, 37th Design Automation
Conference (DAC), pp. 507-512, June 5-9, 2000.
J. P. Heron, R. Woods, S. Sezer, and R. H.
Turner. Development of a run-time
reconfiguration system with low reconfiguration
overhead, Journal of VLSI Signal Processing,
28(1/2):97-113, May 2001.
Xilinx Microblaze Soft Processor Core,
http://www.xilinx.com/ise/embedded/edk6_2docs/mb
ref_guide.pdf, last accessed on October 19, 2004.
G. Aggarwal, N. Thaper, K. Aggarwal, M.
Balakrishnan, and S. Kumar. A Novel
Reconfigurable Co-Processor Architecture, In
Proceedings of Tenth International Conference on
VLSI Design, pages 370-375, January 1997.
G. Haug and W. Rosenstiel. Reconfigurable
Hardware as Shared Resource in Multipurpose
Computers, In Reiner W. Hartenstein and Andres
Keevallik, editors, Field-Programmable Logic:
From FPGAs to Computing Paradigm,
Springer-Verlag, pages 149-158, Berlin,
August/September 1998.
Xilinx Virtex-II Platform FPGAs: Complete Data
Sheet, DS031 (14 Oct. 2003).
D. Wo and K. Forward. Compiling to the Gate
Level for a Reconfigurable Co-Processor, In
Proceedings of FPGAs for Custom Computing
Machines (1994), pages 147-154.
V. Groza, R. Abielmona, M. El-Kadri, N. Sakr, and
M. Elbadri. A Reconfigurable Co-Processor for
Adaptive Embedded Systems, Workshop on
Intelligent Solutions in Embedded Systems, Graz,
Austria, June 2004.
IBM On-Chip Peripheral Bus,
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9A7AFA74DAD200D087256AB30005F0C8/file/OpbBus.pdf,
last accessed on October 19, 2004.
R. Abielmona, V. Groza, N. Sakr, and J. Ho.
Low-Level Run-Time Reconfiguration of FPGAs for
Dynamic Environments, IEEE Canadian Conference
on Electrical and Computer Engineering, CCECE
2003, Niagara Falls, May 2004.
B. Blodget, P. James-Roxby, E. Keller, S.
McMillian, and P. Sundararajan. A
Self-reconfiguring Platform, Proceedings of the
International Conference on Field Programmable
Logic, Lisbon, Portugal, Sept. 2003.