Title: Slide 1
1 Reconfigurable Architectures Andrea Lodi
2SoC trends
- Increasing mask cost ( 3M)
- Increasing design complexity
- Increasing design time ( 3M)
- Rapidly changing communication standards
- Low-power design in wireless environment
- Increasing algorithmic complexity requirements
3Product life cycle
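[Figure: sales versus time through the Growth, Maturity, and Decrease phases, comparing a product that met its time-to-market with one that failed it; the gap between the two curves is labeled LOSS.]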
4Trends in wireless systems
- Increased on-chip transistor density
- Increased design complexity
- Increased algorithmic complexity
- Low battery capacity growth
- Demand for reusability and flexibility
- Demand for high performance and energy efficiency
[Figure: algorithm complexity, Moore's law (millions of transistors per chip), and battery capacity plotted against year and technology node (nm), 1997 to 2009.]
5Digital architecture design space
6Parallelism in computation
- Thread level parallelism
- Instruction level parallelism (ILP)
- Pipeline (loop level)
- Fine-grain parallelism (bit/byte-level)
7Instruction level parallelism
[Figure: ASIC implementation of a dataflow graph with nodes a, b, c, d, e and the constant 3.]
8Spatial vs. Temporal Computing
(Ax + B)x + C
Ax^2 + Bx + C
Temporal (Processor)
Spatial (ASIC)
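The contrast between the two styles can be sketched in C for the polynomial Ax^2 + Bx + C. This is only an illustration: the function names are invented here, and C itself still executes both versions sequentially. The first function reuses one temporary over four steps, as a processor reuses its ALU; the second writes every operator out once, mirroring a datapath with dedicated multipliers and adders.

```c
#include <stdint.h>

/* Temporal evaluation: one multiplier and one adder are reused over
 * successive steps, as a processor would do with a single ALU. */
int32_t poly_temporal(int32_t a, int32_t b, int32_t c, int32_t x)
{
    int32_t t;
    t = a * x;      /* step 1: A*x             */
    t = t + b;      /* step 2: A*x + B         */
    t = t * x;      /* step 3: (A*x + B)*x     */
    t = t + c;      /* step 4: (A*x + B)*x + C */
    return t;
}

/* Spatial form: every operator appears once in the expression, mirroring
 * a datapath with dedicated multipliers and adders. In C this is still
 * executed sequentially; it only illustrates the structure. */
int32_t poly_spatial(int32_t a, int32_t b, int32_t c, int32_t x)
{
    return a * x * x + b * x + c;
}
```

Both functions return the same value for inputs that do not overflow; only the way the operators are laid out differs.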
9Superscalar/VLIW processors
- FU limitations
- Register file size limitation
- Crossbar inefficiency
10Byte-level parallelism in processors
- MMX technology: 57 new instructions
- Byte and half-word parallel computation
- SIMD execution model
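Byte-parallel computation can be imitated in portable C by packing four bytes into one 32-bit word. The sketch below is not MMX code and uses no intrinsics; the function name is made up for illustration. It adds the four byte lanes independently, with wrap-around inside each lane and no carry between lanes.

```c
#include <stdint.h>

/* Add four unsigned bytes packed in each 32-bit word, lane by lane,
 * with wrap-around inside each byte and no carry between lanes. */
uint32_t add_bytes_x4(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; carries stop at bit 7 of the lane. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Top bit of every lane, combined without producing a carry out. */
    uint32_t msb = (a ^ b) & 0x80808080u;
    return low7 ^ msb;
}
```

For example, add_bytes_x4(0x01FF7F10, 0x01018001) yields 0x0200FF11: the lane holding 0xFF + 0x01 wraps to 0x00 without disturbing its neighbour.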
11Bit-level parallelism
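unsigned reverse(unsigned v)
{
    unsigned x, r = 0;
    for (x = 0; x < WIDTH; x++) {
        r = (r << 1) | (v & 1);   /* make room, then append the lowest bit of v */
        v = v >> 1;               /* expose the next bit of v */
    }
    return r;
}
[Figure: bit reversal from v to r.]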
12Pipeline parallelism
for (j = 0; j < MAX; j++) b[j] = popcount(a[j]);
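A software version of the loop on this slide might look like the sketch below. The names popcount32 and popcount_array are chosen here for illustration. The bit count is written as a tree of partial sums; in a pipelined hardware mapping each of those steps could become one stage, so that a new a[j] enters every cycle while earlier elements are still in flight.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits of a 32-bit word with a tree of partial sums. */
unsigned popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                  /* sums of 2-bit groups */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  /* sums of 4-bit groups */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                  /* sums of 8-bit groups */
    return (v * 0x01010101u) >> 24;                    /* add the four byte sums */
}

/* The loop from the slide: one independent popcount per array element. */
void popcount_array(const uint32_t *a, unsigned *b, size_t n)
{
    for (size_t j = 0; j < n; j++)
        b[j] = popcount32(a[j]);
}
```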
13FPGA
- FPGA (Field-Programmable Gate Array): composed of 2 elements
  - Array of CLBs (configurable logic blocks), composed of
    - 1 or a few small LUTs (4:1 or 3:1)
    - Control logic: muxes controlled by configuration bits
    - Dedicated computational logic (carry chain)
  - Configurable routing network connecting the CLBs, composed of
    - Wires of different lengths
    - Connection blocks connecting the CLBs to the routing network
    - Switch blocks connecting routing wires
- The LUT contents and the configuration bits that program the CLBs and the routing network represent the FPGA configuration, which determines the implemented function
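How a LUT turns configuration bits into a logic function can be shown with a small software model. This is a sketch of a generic 4-input LUT, not of any specific device; the type and function names are invented here. The 16 configuration bits are the truth table, and the four inputs select one of them.

```c
#include <stdint.h>

/* A 4-input LUT holds 16 configuration bits, one per input combination. */
typedef struct {
    uint16_t config;   /* bit i is the output for input pattern i */
} lut4_t;

/* Evaluate the LUT: the four inputs form an index into the truth table. */
unsigned lut4_eval(lut4_t lut, unsigned in0, unsigned in1,
                   unsigned in2, unsigned in3)
{
    unsigned index = (in0 & 1u) | ((in1 & 1u) << 1) |
                     ((in2 & 1u) << 2) | ((in3 & 1u) << 3);
    return (lut.config >> index) & 1u;
}
```

Loading config = 0x8000 makes the LUT a 4-input AND, while 0x6996 makes it a 4-input XOR; the evaluation code is identical, only the stored bits change.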
14Configurable logic block
15Xilinx Clb
- Xilinx clb 4000 series
- 11 input 4 output bits
- 3 LUTs
- Carry logic
- 2 output registers
16Configurable routing network
17Example
18Density Comparison
19FPGA vs. Processor
- FPGA
- (computing in space)
- Parallel execution
- Configurable in 10^2-10^3 cycles
- Fine-grained data
- Application specific operators
- Large area (switches, SRAM)
- Entire applications don't fit
- Slow synthesis, P&R tools
- Processor
- (computing in time)
- Sequential execution
- Programmable every cycle
- Fixed-size operands
- Basic operators (ALU)
- Compact
- Handles complex control flow
- Fast compilers
20Reconfigurable processors
- But
- 90% of execution time is spent in computational kernels
- FPGAs: 10-100x speed-up over processors
- FPGAs: 10-100x denser than processors (bit-ops/λ²s)
- Reconfigurable processor = RISC + FPGA
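One way to read these figures together is through Amdahl's law, which the slide does not name; applying it here is an interpretation, and the function below is a hypothetical helper. If roughly 90% of the run time sits in kernels and only the kernels are accelerated, the remaining 10% bounds the overall gain.

```c
/* Overall speed-up when a fraction `kernel_fraction` of the run time
 * is accelerated by `kernel_speedup` and the rest is left unchanged
 * (Amdahl's law). */
double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

With a kernel fraction of 0.9, a 10x kernel speed-up gives about 5.3x overall and a 100x kernel speed-up gives about 9.2x, which is why the sequential part still needs an efficient processor next to the array.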
21Reconfigurable processor architecture
- Hybrid architectures
- RISC processor
- FPGA
22Computational models
- RC Array IO Processor/Interface logic
- Attached processor
- Piperench, T-Recs
- ISA Extension
- Function unit
- PRISC, OneChip, Chimaera
- Coprocessor
- Garp, NAPA, Molen
23IO Processor/Interface Logic
- Case for
- Always have some system adaptation to do
- Modern chips have capacity to hold processor + glue logic
- reduce part count
- Glue logic varies
- many protocols, services
- only need few at a time
- Logic used in place of
- ASIC environment customization
- external FPGA/PLD devices
- Looks like IO peripheral to processor
- Example
- protocol handling
- stream computation
- compression, encryption
- peripherals
- sensors, actuators
24Example Interface/Peripherals
25Instruction Set Extension
- Instruction Bandwidth
- Processor can only describe a small number of basic computations in a cycle
  - I bits → 2^I operations
- This is a small fraction of the operations one could do even in terms of w×w→w ops
  - w2^(2^(2w)) operations
- Processor could have to issue w2^(2^(2w) - I) operations just to describe some computations
- An a priori selected base set of functions could be very bad for some applications
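The size of the gap is easiest to see for single-bit operands (w = 1). The sketch below uses helper names invented for illustration and counts only single-output Boolean functions, which is a standard result: n one-bit inputs admit 2^(2^n) distinct functions, while an instruction field of I bits can name only 2^I of them.

```c
#include <stdint.h>

/* Number of distinct Boolean functions of `inputs` one-bit inputs and a
 * single one-bit output: 2^(2^inputs). Valid here for inputs <= 5. */
uint64_t boolean_function_count(unsigned inputs)
{
    return (uint64_t)1 << ((uint64_t)1 << inputs);
}

/* Number of operations an instruction field of `bits` bits can name: 2^bits. */
uint64_t nameable_operations(unsigned bits)
{
    return (uint64_t)1 << bits;
}
```

Two inputs already give 16 possible functions and four inputs give 65536, so a 6-bit opcode (64 operations) covers only a small, fixed subset chosen in advance.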
26Instruction Set Extension
- Idea
- provide a way to augment the processor's instruction set with operations needed by a particular application
27Architectural Models for I.S.A extension
XTENSA (Tensilica Inc., 2002)
RISC CPU featuring application-specific function units optionally inserted in the processor pipeline
- Good performance
- Easy to program
- Configured at mask level
PLEIADES (Zhang et al., 2000)
CPU surrounded by a collection of application-specific custom computing devices
- High performance
- Overdesigned for most applications
- Difficult to program
28Dynamic ISA Extension models
Standard processor coupled with embedded programmable logic, where application-specific functions are dynamically re-mapped depending on the performed algorithm
1 Coprocessor model
2 Function unit model
29Coprocessor model Garp
- Explicit instructions moving data to and from the array
- High communication overhead (long-latency array operations)
- Processor stalled each time the array is active
- Array performs at TASK level (very coarse grain)
- 10-20x on stream, feed-forward operations
- 2-3x when data dependencies limit pipelining
Callahan, Hauser, Wawrzynek, 2000
30Function unit model Prisc
- Array fits in the RISC pipeline
- No communication overhead
- Some degree of parallelism between function units
- Gate array performs combinatorial instructions ONLY (very fine grain)
- Low speed-up figures (2x/3x)
Razdan, Smith 1994
31Function Unit Model pros
- No communication overhead
- Strict synergy between FPGA and other function units
- FPGA can be used frequently even for small functions
- Small reconfigurable array area
- Flow control handled by the core
- Memory access handled by the core
- Easy instruction set extension
- Configuration streams compiled from C
32EXTENDIBLE INSTRUCTION SET RISC ARCHITECTURE
- 32-bit load/store RISC architecture (5-stage pipeline)
- Set of specialized functional units
  - Multiply/MAC unit
  - Branch/Decrement unit
  - ALU featuring MMX byte-wide concurrent operations
- Concurrent fetch and execution of two 32-bit instructions per cycle
- Fully bypassed, to minimize pipeline stalls (average of 10/20% for most computational cores)
- Embedded reconfigurable device for dynamic ISA extension
  - DSP-oriented reconfigurable functional unit (PiCoGA)
  - Fully configurable at execution time
  - Elaboration and configuration controlled by asm instructions inserted in C source code
  - PiCoGA used as a programmable data-path with independent pipeline structure
33XiRisc Architecture
34Dynamic Instruction Set Extension
35Dynamic Instruction Set Extension
[Figure: Register File and Configuration Memory driven by an instruction stream of the form: ... pgaload ... pgaop 3,4,5 ... Add 8, 3]
36PiCoGA Architecture
- PiCoGA (Pipelined Configurable Gate Array)
- Embedded datapath for dynamic ISA extension
- Dynamically reconfigurable
- Structured in rows activated in data-flow fashion by the PiCoGA control unit
- Can hold a state
- pGA-op latency depends on the specific mapped function
- Functionality is determined from the DFG extracted from C code
[Figure labels: Processor Interface, PiCoGA Control Unit]
37Pico-cell Description
38Computing on PiCoGA
39Multi-context Array
PiCoGA
Configuration Cache
While a plane is executing, another may be reconfigured → no reconfiguration time overhead
Four configuration planes are available, one of them executing
Plane switch takes just 1 clock cycle
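The behaviour described on this slide can be modelled with a few lines of C. Everything in this sketch is hypothetical: the type, the function names, and the rule that the executing plane cannot be reloaded are assumptions made for illustration, not the actual PiCoGA interface. Only the number of planes and the one-cycle switch cost come from the slide.

```c
#include <stdbool.h>

#define NUM_PLANES 4   /* the slide states four configuration planes */

typedef struct {
    int  active;                 /* plane currently executing */
    bool loaded[NUM_PLANES];     /* true once a plane holds a configuration */
} context_array_t;

/* Load a configuration into a plane that is not executing.
 * ASSUMPTION: the active plane cannot be reloaded while it runs.
 * Returns false if the plane index is invalid or the plane is active. */
bool load_plane(context_array_t *ca, int plane)
{
    if (plane < 0 || plane >= NUM_PLANES || plane == ca->active)
        return false;
    ca->loaded[plane] = true;
    return true;
}

/* Switch execution to an already loaded plane.
 * Returns the cycle cost (1, per the slide) or -1 if the plane is not ready. */
int switch_plane(context_array_t *ca, int plane)
{
    if (plane < 0 || plane >= NUM_PLANES || !ca->loaded[plane])
        return -1;
    ca->active = plane;
    return 1;
}
```

Because loading happens on an idle plane while another one executes, the only cost visible to the running program in this model is the single-cycle switch.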
40Architecture Flexibility
[Flowchart: a sequence of yes/no questions about the target algorithm]
- Parallelism to exploit? (e.g. Turbo decoding, motion estimation)
- Bit-level operations? (e.g. DES, Reed-Solomon)
- MAC intensive? (e.g. FFT, scalar product)
- Memory intensive? (e.g. DCT, motion estimation)
Outcomes shown: speed-up from pGA (5x to 100x); speed-up from DSP instructions and VLIW (1.5x to 2x)
Improvements for a large number of Data Signal
Processing algorithms
41Programming XiRisc Restrictions
- Fixed-point algorithms
- Variable size specification at the bit level
- Not supported yet
  - Dynamic memory allocation
  - Math library
  - Operating system
42XiRisc Compilation Flow
[Figure: compilation flow with blocks C Compiler, Profiler, PiCoGA Configurator, PiCoGAop, Configuration Bit stream, Configuration Library.]
43Example Motion Estimation
Sum of Absolute Differences (SAD): high instruction-level and inter-iteration parallelism
44Data Flow Graph
- Pixel-by-pixel absolute difference
- abs(p1[i] - p2[i])
- p1[i], p2[i]: pixels
[Figure: absolute differences feeding a sum tree.]
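A plain C reference for the computation in this data flow graph could look like the sketch below; the function name and the flat pixel-array layout are assumptions made for illustration. Each iteration is independent apart from the final accumulation, which is what lets the absolute differences run in parallel and the additions be arranged as a tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between two blocks of n pixels. */
uint32_t sad(const uint8_t *p1, const uint8_t *p2, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int diff = (int)p1[i] - (int)p2[i];          /* pixel difference */
        sum += (uint32_t)(diff < 0 ? -diff : diff);  /* absolute value, accumulated */
    }
    return sum;
}
```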
45Sum of Absolute Difference
[Figure: SAD computed from SAD8 blocks.]
46Place & Route
[Figure: tool flow with blocks High-Level C Compiler, DFG-based description, Mapping, Place & Route, Configuration Bits, and Emulation Function with Latency and Issue Delay.]
47Performance evaluation
- Emulation function
- Latency and Issue-Delay back-annotation
- Profiling
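How back-annotated latency and issue delay might feed a profile can be sketched as follows. The structure, the function name, and the cycle formula are assumptions made for illustration, using the usual pipeline model in which the first result appears after the latency and each further operation starts one issue delay later; the slides do not specify how the XiRisc tools combine the two figures.

```c
#include <stdint.h>

/* Hypothetical back-annotated timing of one mapped operation. */
typedef struct {
    unsigned latency;      /* cycles from issue to result */
    unsigned issue_delay;  /* minimum cycles between consecutive issues */
} pga_timing_t;

/* Estimated cycles for n back-to-back invocations of the operation.
 * ASSUMPTION: invocations are independent and fully pipelined. */
uint64_t estimate_cycles(pga_timing_t t, uint64_t n)
{
    if (n == 0)
        return 0;
    return t.latency + (n - 1) * t.issue_delay;
}
```

Under this model an operation with latency 5 and issue delay 1 needs 20 cycles for 16 invocations, so the issue delay rather than the latency dominates long loops.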
48Motion Estimation Results
- Motion estimation
- 16 SAD operations in parallel
- PiCoGA occupation: 100%
- Speed-up: 7x (with respect to standard XiRisc)
- MPEG preliminary result
- H.261 standard, QCIF (176x144): 10 frames/sec
49Reed-Solomon Encoder Results
- Encoder RS(15,9), 4-bit symbols
- PiCoGA occupation: 25%
- Speed-up: 37x
- Throughput: 70.6 Mb/sec
- Encoder RS(255,239), widely used, 8-bit symbols
- PiCoGA occupation: 60%
- Speed-up: 135x
- Throughput: 187.1 Mb/sec
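Reed-Solomon encoding is built on finite-field arithmetic, which a processor has to emulate with shifts, tests, and XORs. The sketch below multiplies two 4-bit symbols as used by RS(15,9). The function name is made up, and the field polynomial x^4 + x + 1 is an assumption: the slides do not say which polynomial the encoder uses.

```c
#include <stdint.h>

/* Multiply two 4-bit symbols in GF(2^4).
 * ASSUMPTION: field polynomial x^4 + x + 1 (0x13); the slides do not
 * state which polynomial the encoder uses. */
uint8_t gf16_mul(uint8_t a, uint8_t b)
{
    uint8_t product = 0;
    a &= 0x0F;
    b &= 0x0F;
    for (int i = 0; i < 4; i++) {
        if (b & 1)
            product ^= a;   /* add (XOR) the current multiple of a */
        b >>= 1;
        a <<= 1;            /* multiply a by x */
        if (a & 0x10)
            a ^= 0x13;      /* reduce modulo x^4 + x + 1 */
    }
    return product;
}
```

Each multiplication costs a loop of conditional bit operations in software, whereas the same function is a small fixed network of XOR gates when mapped onto LUTs, which is consistent with the large speed-ups reported for bit-level kernels.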
50Speed-up and Power Consumption