Diapositiva 1 - PowerPoint PPT Presentation

Slides: 51

Provided by: toma178

Transcript and Presenter's Notes

1
Reconfigurable Architectures
Andrea Lodi
2
SoC trends

Increasing mask cost ( 3M)
Increasing design complexity
Increasing design time ( 3M)
Rapidly changing communication standards
Low-power design in wireless environment
Increasing algorithmic complexity requirements

3
Product life cycle
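[Figure: sales versus time across the Growth, Maturity and Decrease phases, comparing the curve when time-to-market is met with the curve when time-to-market is failed; the gap between the two is marked LOSS]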
4
Trends in wireless systems

Increased on-chip transistor density
Increased design complexity
Increased algorithmic complexity
Low battery capacity growth

[Figure: algorithm complexity, Moore's law (millions of transistors per chip, technology in nm) and battery capacity plotted against year, 1997 to 2009]

Demand for reusability and flexibility
Demand for high performance and energy efficiency

5
Digital architecture design space
6
Parallelism in computation

Thread level parallelism
Instruction level parallelism (ILP)
Pipeline (loop level)
Fine-grain parallelism (bit/byte-level)

7
Instruction level parallelism
[Figure: ASIC implementation of an expression over inputs a, b, c, d and the constant 3, producing output e]

8
Spatial vs. Temporal Computing
$Ax^2 + Bx + C = (Ax + B)x + C$
Temporal (Processor)
Spatial (ASIC)
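The temporal side of this comparison can be sketched in C, assuming the expression on the slide is the quadratic A*x^2 + B*x + C in its nested form (A*x + B)*x + C. The function name `poly_temporal` and the `steps` counter are illustrative and not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Temporal evaluation of A*x^2 + B*x + C as (A*x + B)*x + C.
 * Each statement models one step on a single, reused ALU. A spatial
 * (ASIC) design would instead instantiate two multipliers and two
 * adders and wire them together. */
int32_t poly_temporal(int32_t a, int32_t b, int32_t c, int32_t x, int *steps)
{
    assert(steps != NULL);
    int32_t t;
    t = a * x;      /* step 1: multiply */
    t = t + b;      /* step 2: add      */
    t = t * x;      /* step 3: multiply */
    t = t + c;      /* step 4: add      */
    *steps = 4;     /* operations issued one after another */
    return t;
}
```

The processor spends four sequential operations on one shared unit, while the spatial version spends four operators of area and no instruction issue.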
9
Superscalar/VLIW processors

FU limitations
Register file size limitation
Crossbar inefficiency

10
Byte-level parallelism in processors

MMX technology: 57 new instructions
Byte and half-word parallel computation
SIMD execution model
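As a rough illustration of byte-level parallelism, the sketch below adds four packed bytes with one 32-bit addition in portable C. It uses the SWAR (SIMD within a register) masking trick rather than actual MMX instructions, and the names `add_bytes_x4` and `add_bytes_packed` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add four unsigned bytes packed in a 32-bit word, lane by lane, with
 * wrap-around and no carry leaking from one lane into the next. */
uint32_t add_bytes_x4(uint32_t a, uint32_t b)
{
    const uint32_t low7 = 0x7F7F7F7Fu;      /* low 7 bits of every lane */
    const uint32_t msb  = 0x80808080u;      /* top bit of every lane    */
    uint32_t sum = (a & low7) + (b & low7); /* carries stop at bit 7    */
    return sum ^ ((a ^ b) & msb);           /* add the top bits mod 2   */
}

/* Apply the packed add to byte arrays, four elements per iteration. */
void add_bytes_packed(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n % 4 == 0);                     /* sketch handles whole words only */
    for (size_t i = 0; i < n; i += 4) {
        uint32_t wa, wb, wr;
        memcpy(&wa, a + i, 4);
        memcpy(&wb, b + i, 4);
        wr = add_bytes_x4(wa, wb);
        memcpy(dst + i, &wr, 4);
    }
}
```

Because every lane is treated identically, the result does not depend on the byte order of the host.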

11
Bit-level parallelism

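/* WIDTH is the operand width in bits, defined elsewhere */
unsigned reverse(unsigned v)
{
    unsigned x, r = 0;
    for (x = 0; x < WIDTH; x++) {
        r = (r << 1) | (v & 1);  /* append the lowest bit of v to r */
        v = v >> 1;
    }
    return r;
}

[Figure: bits of input v mapped to output r]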
12
Pipeline parallelism
for (j = 0; j < MAX; j++) b[j] = popcount(a[j]);
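The popcount loop on this slide can be written out as a minimal C sketch. `popcount32` and `popcount_all` are illustrative names, and the bit-clearing loop is just one common software implementation. Each iteration depends only on its own a[j], which is the property that lets hardware overlap successive iterations in a pipeline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits of v by clearing the lowest set bit until none remain. */
unsigned popcount32(uint32_t v)
{
    unsigned count = 0;
    while (v != 0) {
        v &= v - 1;   /* clears the lowest set bit */
        count++;
    }
    return count;
}

/* b[j] = popcount(a[j]) for every j; iterations are independent. */
void popcount_all(unsigned *b, const uint32_t *a, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t j = 0; j < n; j++)
        b[j] = popcount32(a[j]);
}
```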
13
FPGA

An FPGA (Field-Programmable Gate Array) is composed of 2 elements
Array of CLBs (configurable logic blocks), composed of
  One or a few small LUTs (4:1 or 3:1, that is 4 or 3 inputs and 1 output)
  Control logic: muxes controlled by configuration bits
  Dedicated computational logic (carry chain)
Configurable routing network connecting the CLBs, composed of
  Wires of different lengths
  Connection blocks connecting CLBs to the routing network
  Switch blocks connecting routing wires
The LUT contents and the configuration bits that program the CLBs and the routing network represent the FPGA configuration, which determines the function implemented
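To make the role of configuration bits concrete, here is a minimal sketch that models a 4-input, 1-output LUT as a 16-bit truth table. The name `lut4_eval` and the bit ordering (input pattern used as the index into the table) are assumptions made for this example, not a description of any particular device.

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate a 4-input LUT. 'config' holds the 16 configuration bits:
 * bit k is the output for input pattern k. 'inputs' packs the four
 * input signals into bits 0..3. */
int lut4_eval(uint16_t config, unsigned inputs)
{
    assert(inputs < 16u);            /* only four input bits exist */
    return (config >> inputs) & 1;
}
```

With this convention, config 0x8000 implements a 4-input AND and config 0x6996 implements a 4-input XOR; changing the 16 bits changes the function without changing the hardware.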

14
Configurable logic block
15
Xilinx CLB

Xilinx 4000-series CLB
11 input bits, 4 output bits
3 LUTs
Carry logic
2 output registers

16
Configurable routing network
17
Example
18
Density Comparison
19
FPGA vs. Processor

FPGA
(computing in space)
Parallel execution
Configurable in 10^2-10^3 cycles
Fine-grained data
Application specific operators
Large area (switches, SRAM)
Entire applications don't fit
Slow synthesis and place-and-route (P&R) tools

Processor
(computing in time)
Sequential execution
Programmable every cycle
Fixed-size operands
Basic operators (ALU)
Compact
Handles complex control flow
Fast compilers

20
Reconfigurable processors

But
90% of execution time is spent in computational kernels
FPGAs: 10-100x speed-up over processors
FPGAs: 10-100x denser than processors (bit-ops/λ²·s)
Reconfigurable processor = RISC + FPGA
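A back-of-the-envelope way to combine these figures is Amdahl's law. The slides do not state it; the sketch assumes the 90% kernel share is the only part accelerated and that the remaining 10% runs at processor speed. `overall_speedup` is an illustrative name.

```c
#include <assert.h>

/* Amdahl's law: overall speed-up when a fraction 'kernel_fraction' of the
 * run time is accelerated by 'kernel_speedup' and the rest is unchanged. */
double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    assert(kernel_fraction >= 0.0 && kernel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

Under these assumptions a 10x kernel speed-up gives about 5.3x overall and a 100x kernel speed-up about 9.2x, so the code left on the RISC core quickly becomes the limit.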

21
Reconfigurable processor architecture

Hybrid architectures
RISC processor
FPGA

22
Computational models

RC Array
IO Processor/Interface logic
Attached processor: Piperench, T-Recs
ISA Extension
  Function unit: PRISC, OneChip, Chimaera
  Coprocessor: Garp, NAPA, Molen

23
IO Processor/Interface Logic

Case for
  Always have some system adaptation to do
  Modern chips have capacity to hold processor + glue logic
  Reduce part count
  Glue logic varies
    Many protocols, services
    Only need a few at a time

Logic used in place of
  ASIC environment customization
  External FPGA/PLD devices
Looks like an IO peripheral to the processor
Example
  Protocol handling
  Stream computation
  Compression, encryption
  Peripherals
  Sensors, actuators

24
Example Interface/Peripherals

Triscend E5

25
Instruction Set Extension

Instruction Bandwidth
Processor can only describe a small number of basic computations in a cycle
$I$ bits $\to 2^I$ operations
This is a small fraction of the operations one could do, even in terms of $w \times w \to w$ ops
$w2^{2^{2w}}$ operations
Processor could have to issue $w2^{(2^{2w} - I)}$ operations just to describe some computations
An a priori selected base set of functions could be very bad for some applications

26
Instruction Set Extension

Idea
Provide a way to augment the processor's instruction set with operations needed by a particular application

27
Architectural Models for I.S.A extension
XTENSA (Tensilica Inc., 2002)
RISC CPU featuring application-specific function units optionally inserted in the processor pipeline
Good performance
Easy to program
Configured at mask level

PLEIADES (Zhang et al., 2000)
CPU surrounded by a collection of application-specific custom computing devices
High performance
Overdesigned for most applications
Difficult to program
28
Dynamic ISA Extension models
Standard processor coupled with embedded programmable logic, where application-specific functions are dynamically re-mapped depending on the algorithm being performed
1 Coprocessor model
2 Function unit model
29
Coprocessor model: Garp

Explicit instructions move data to and from the array
High communication overhead (long-latency array operations)
Processor is stalled each time the array is active
Array performs at TASK level (very coarse grain)
10-20x speed-up on stream, feed-forward operations
2-3x when data dependencies limit pipelining

Callahan, Hauser, Wawrzynek, 2000
30
Function unit model: PRISC

Array fits in the RISC pipeline
No communication overhead
Some degree of parallelism between function units
Gate array performs combinational instructions ONLY (very fine grain)
Low speed-up figures (2x-3x)

Razdan, Smith 1994
31
Function Unit Model pros

No communication overhead
Tight synergy between FPGA and other function units
FPGA can be used frequently even for small functions
Small reconfigurable array area
Flow control handled by the core
Memory access handled by the core
Easy instruction set extension
Configuration streams compiled from C

32
EXTENDIBLE INSTRUCTION SET RISC ARCHITECTURE

32-bit load/store RISC architecture (5-stage pipeline)

Set of specialized functional units

Multiply/MAC unit
Branch/Decrement unit
ALU featuring MMX byte-wide concurrent operations

VLIW execution

Concurrent fetch and execution of two 32-bit instructions per cycle
Fully bypassed, to minimize pipeline stalls

(Average of 10/20 for most computational cores)

Embedded reconfigurable device for dynamic ISA extension

DSP-oriented reconfigurable functional unit (PiCoGA)
Fully configurable at execution time
Execution and configuration controlled by asm instructions inserted in the C source code
PiCoGA used as a programmable data-path with an independent pipeline structure

33
XiRisc Architecture
34
Dynamic Instruction Set Extension
35
Dynamic Instruction Set Extension
[Figure: code sequence "... pgaload ... pgaop 3,4,5 ... Add 8, 3" shown alongside the Register File and the Configuration Memory]
36
PiCoGA Architecture

PiCoGA (Pipelined Configurable Gate Array)
Embedded datapath for dynamic ISA extension
Dynamically reconfigurable
Structured in rows activated in data-flow fashion by the PiCoGA control unit
Can hold a state
pGA-op latency depends on the specific mapped function
Functionality is determined from the DFG extracted from C code

[Figure: PiCoGA array with its Processor Interface and PiCoGA Control Unit]
37
Pico-cell Description
38
Computing on PiCoGA
39
Multi-context Array
[Figure: PiCoGA and Configuration Cache]
Four configuration planes are available, one of them executing
While a plane is executing another may be reconfigured, so there is no reconfiguration time overhead
Plane switch takes just 1 clock cycle
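The multi-context behaviour can be modelled with a small bookkeeping sketch. Everything here is hypothetical: the `plane_model` type, the function names and the integer configuration ids are invented for illustration and are not the XiRisc or PiCoGA programming interface. Loading an idle plane is modelled as costing no processor cycles, which is a simplification of "no reconfiguration time overhead".

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PLANES 4               /* four configuration planes, per the slide */

typedef struct {
    int config_id[NUM_PLANES];     /* operation held by each plane, -1 = empty */
    int active;                    /* plane currently executing */
    unsigned long cycles;          /* cycles charged for plane management */
} plane_model;

void plane_model_init(plane_model *m)
{
    assert(m != NULL);
    for (int i = 0; i < NUM_PLANES; i++)
        m->config_id[i] = -1;
    m->active = 0;
    m->cycles = 0;
}

/* Reconfigure a plane that is not executing; overlaps with execution,
 * so no cycles are charged (assumption of this model). */
void plane_model_load(plane_model *m, int plane, int config_id)
{
    assert(plane >= 0 && plane < NUM_PLANES);
    assert(plane != m->active);    /* only an idle plane is rewritten */
    m->config_id[plane] = config_id;
}

/* Make another, already loaded plane the executing one. */
void plane_model_switch(plane_model *m, int plane)
{
    assert(plane >= 0 && plane < NUM_PLANES);
    assert(m->config_id[plane] >= 0);
    m->active = plane;
    m->cycles += 1;                /* plane switch takes 1 clock cycle */
}
```

In this model the only visible cost of changing the mapped function is the single-cycle switch, provided the next configuration was loaded into an idle plane ahead of time.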
40
Architecture Flexibility
[Flowchart: a chain of yes/no questions that selects the source of speed-up]
Questions
  Parallelism to exploit? (e.g. turbo decoding, motion estimation)
  Bit-level operations? (e.g. DES, Reed-Solomon)
  MAC intensive? (e.g. FFT, scalar product)
  Memory intensive? (e.g. DCT, motion estimation)
Outcomes
  Speed-up from pGA (5x to 100x)
  Speed-up from DSP instructions and VLIW (1.5x to 2x)
Improvements for a large number of digital signal processing algorithms
41
Programming XiRisc Restrictions

Fixed-point algorithms
Variable size specification at the bit level
Not supported yet
  Dynamic memory allocation
  Math library
  Operating system

42
XiRisc Compilation Flow
[Figure: compilation flow with blocks C Compiler, Profiler and PiCoGA Configurator, and artifacts PiCoGA op, Configuration Bit stream and Configuration Library]
43
Example Motion Estimation
Sum of Absolute Differences (SAD): high instruction-level and inter-iteration parallelism
44
Data Flow Graph

Pixel-by-pixel absolute difference
Abs(p1_i - p2_i)
p1_i, p2_i: pixels

...
Absolute difference sum tree
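For reference, the computation described by this data flow graph can be sketched in plain C as a sequential loop. The function name `sad` and the 8-bit pixel type are assumptions made for the example. A spatial mapping would compute the absolute differences in parallel and combine them with an adder tree instead of a running sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between two blocks of n pixels. */
unsigned sad(const uint8_t *p1, const uint8_t *p2, size_t n)
{
    assert(p1 != NULL && p2 != NULL);
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)p1[i] - (int)p2[i];     /* pixel-by-pixel difference */
        sum += (unsigned)(d < 0 ? -d : d);   /* absolute value, accumulated */
    }
    return sum;
}
```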
45
Sum of Absolute Difference
[Figure: SAD built from two SAD8 blocks]
46
Place & Route
[Figure: tool flow with stages High-Level C Compiler, DFG-based description, Mapping, Place & Route and Configuration Bits, plus an Emulation Function with Latency and Issue Delay]
47
Performance evaluation