Diapositiva 1 - PowerPoint PPT Presentation

About This Presentation
Title:

Diapositiva 1

Description:

Always have some system adaptation to do ... An a priori selected base set of functions could be very bad for some applications ... – PowerPoint PPT presentation

Slides: 51
Provided by: toma178

1
Reconfigurable Architectures
Andrea Lodi
2
SoC trends
  • Increasing mask cost ( 3M)
  • Increasing design complexity
  • Increasing design time ( 3M)
  • Rapidly changing communication standards
  • Low-power design in wireless environment
  • Increasing algorithmic complexity requirements

3
Product life cycle
[Figure: sales versus time across the product life cycle (growth, maturity, decrease), with curves for time-to-market met and time-to-market failed; one region is labeled LOSS.]
4
Trends in wireless systems
  • Increased on-chip transistor density
  • Increased design complexity
  • Increased algorithmic complexity
  • Low battery capacity growth
  • Demand for reusability and flexibility
  • Demand for high performance and energy efficiency

[Figure: algorithm complexity, Moore's law and battery capacity, 1997 to 2009; axes: millions of transistors per chip, technology (nm), year.]

5
Digital architecture design space
6
Parallelism in computation
  • Thread level parallelism
  • Instruction level parallelism (ILP)
  • Pipeline (loop level)
  • Fine-grain parallelism (bit/byte-level)

7
Instruction level parallelism
[Figure: ASIC implementation of an expression over inputs a, b, c, d and the constant 3, producing e.]

8
Spatial vs. Temporal Computing
$(Ax + B)x + C$
$Ax^2 + Bx + C$
Temporal (Processor)
Spatial (ASIC)
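
The two expressions on this slide compute the same polynomial, and a short C sketch can make the contrast concrete. This is an illustration only: the function names `poly_direct` and `poly_horner` are assumptions, not taken from the slides, and which form the slide pairs with which style is one plausible reading. The direct form exposes independent products that dedicated hardware could lay out side by side (computing in space). The Horner form reuses a single multiply-add step, which suits a processor stepping through instructions (computing in time).

```c
#include <stdint.h>

/* Direct form: A*x^2 + B*x + C.
   x*x and b*x do not depend on each other, so a spatial
   implementation could evaluate them in parallel multipliers. */
int32_t poly_direct(int32_t a, int32_t b, int32_t c, int32_t x)
{
    int32_t xx = x * x;      /* product 1 */
    int32_t bx = b * x;      /* product 2, independent of product 1 */
    return a * xx + bx + c;  /* product 3 and two additions */
}

/* Horner form: (A*x + B)*x + C.
   One multiply-add is applied twice in sequence, so a single
   ALU can be reused over successive steps. */
int32_t poly_horner(int32_t a, int32_t b, int32_t c, int32_t x)
{
    int32_t acc = a;
    acc = acc * x + b;       /* step 1 */
    acc = acc * x + c;       /* step 2 depends on step 1 */
    return acc;
}
```

Both functions return the same value for the same arguments; only the shape of the computation differs.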
9
Superscalar/VLIW processors
  • FU limitations
  • Register file size limitation
  • Crossbar inefficiency

10
Byte-level parallelism in processors
  • MMX technology: 57 new instructions
  • Byte and half-word parallel computation
  • SIMD execution model
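
To illustrate what byte-level parallelism means, here is a minimal sketch in portable C that adds four packed bytes held in one 32-bit word, with each byte wrapping independently. It is not MMX code and uses no intrinsics; the function name `add_bytes_packed` is hypothetical. It only emulates, in software, the kind of operation a packed byte-add instruction performs in a single step.

```c
#include <stdint.h>

/* Add four 8-bit lanes packed in a 32-bit word, lane by lane,
   with wrap-around inside each lane and no carry between lanes. */
uint32_t add_bytes_packed(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; the cleared top bit of each
       lane absorbs the carry so it cannot spill into the next lane. */
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);

    /* Restore the top bit of each lane: it is the XOR of the two
       operand top bits and the carry already sitting in 'low'. */
    return low ^ ((a ^ b) & 0x80808080u);
}
```

For example, adding `0x01010101` to `0x01FF7F10` yields `0x02008011`: the `0xFF` lane wraps to `0x00` without disturbing its neighbour.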

11
Bit-level parallelism
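    unsigned reverse(unsigned v)
    {
        unsigned x, r = 0;
        for (x = 0; x < WIDTH; x++) {
            r = (r << 1) | (v & 1);  /* make room, then append the low bit of v */
            v = v >> 1;
        }
        return r;
    }

[Figure: bit-reversal wiring from v to r.]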
12
Pipeline parallelism
    for (j = 0; j < MAX; j++) b[j] = popcount(a[j]);

[Figure: pipeline stages from input v to result r.]
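
The loop on this slide can be written out as runnable C. The sketch below is an assumption about how `popcount` might be implemented, since the slides do not say: it uses the well-known tree-style bit count, where each statement is one stage that hardware could pipeline so that a new `a[j]` enters every cycle. The names `popcount32` and `popcount_array` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Tree-style population count: each line is one stage of the tree. */
unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* sums of 2 bits */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* sums of 4 bits */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* sums of 8 bits */
    return (x * 0x01010101u) >> 24;                   /* add the 4 bytes */
}

/* b[j] = popcount(a[j]) for every element, as in the slide's loop. */
void popcount_array(const uint32_t *a, unsigned *b, size_t n)
{
    for (size_t j = 0; j < n; j++)
        b[j] = popcount32(a[j]);
}
```

On a processor the stages run one after another for each element; in a pipelined datapath different elements occupy different stages at the same time.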
13
FPGA
  • An FPGA (Field-Programmable Gate Array) is composed of
    two elements
  • Array of CLBs (configurable logic blocks), each
    composed of
  • One or a few small LUTs (4:1 or 3:1)
  • Control logic: muxes controlled by configuration
    bits
  • Dedicated computational logic (carry chain)
  • Configurable routing network connecting the CLBs,
    composed of
  • Wires of different lengths
  • Connection blocks connecting CLBs to the routing
    network
  • Switch blocks connecting routing wires
  • The LUT contents and the configuration bits that
    program the CLBs and the routing network together
    represent the FPGA configuration, which determines
    the implemented function
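
A small software model may help show why a LUT's contents are themselves configuration. The sketch below treats a 4-input, 1-output LUT as a 16-bit truth table: the inputs form an index and the output is the stored bit at that index. All names (`lut4_t`, `lut4_eval`, `lut4_program`, `xor4`) are hypothetical and do not correspond to any vendor's tools or bitstream format.

```c
#include <stdint.h>

/* A 4:1 LUT is 16 configuration bits, one per input combination. */
typedef struct {
    uint16_t config;
} lut4_t;

/* Evaluate the LUT: the four inputs select one stored bit. */
unsigned lut4_eval(lut4_t lut, unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned index = (a & 1u) | ((b & 1u) << 1) | ((c & 1u) << 2) | ((d & 1u) << 3);
    return (lut.config >> index) & 1u;
}

/* "Program" the LUT by tabulating any 4-input Boolean function. */
lut4_t lut4_program(unsigned (*f)(unsigned, unsigned, unsigned, unsigned))
{
    lut4_t lut = { 0 };
    for (unsigned i = 0; i < 16; i++) {
        unsigned out = f(i & 1u, (i >> 1) & 1u, (i >> 2) & 1u, (i >> 3) & 1u);
        if (out & 1u)
            lut.config |= (uint16_t)(1u << i);
    }
    return lut;
}

/* Example function to map: 4-input parity. */
unsigned xor4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return a ^ b ^ c ^ d;
}
```

Calling `lut4_program(xor4)` produces the 16-bit pattern `0x6996`; loading a different pattern into the same structure makes it compute a different function, which is the sense in which configuration determines the implemented function.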

14
Configurable logic block
15
Xilinx CLB
  • Xilinx 4000-series CLB
  • 11 input bits, 4 output bits
  • 3 LUTs
  • Carry logic
  • 2 output registers

16
Configurable routing network
17
Example
18
Density Comparison
19
FPGA vs. Processor
  • FPGA (computing in space)
  • Parallel execution
  • Configurable in $10^2$ to $10^3$ cycles
  • Fine-grained data
  • Application-specific operators
  • Large area (switches, SRAM)
  • Entire applications don't fit
  • Slow synthesis and P&R tools
  • Processor (computing in time)
  • Sequential execution
  • Programmable every cycle
  • Fixed-size operands
  • Basic operators (ALU)
  • Compact
  • Handles complex control flow
  • Fast compilers

20
Reconfigurable processors
  • But
  • 90% of execution time is spent in computational
    kernels
  • FPGAs: 10-100x speed-up over processors
  • FPGAs: 10-100x denser than processors
    (bit-ops/$\lambda^2$s)
  • Reconfigurable processor = RISC + FPGA

21
Reconfigurable processor architecture
  • Hybrid architectures
  • RISC processor
  • FPGA

22
Computational models
  • RC Array IO Processor/Interface logic
  • Attached processor: PipeRench, T-Recs
  • ISA extension
  • Function unit: PRISC, OneChip, Chimaera
  • Coprocessor: Garp, NAPA, Molen

23
IO Processor/Interface Logic
  • Case for
  • There is always some system adaptation to do
  • Modern chips have the capacity to hold processor
    and glue logic
  • Reduces part count
  • Glue logic varies: many protocols and services,
    but only a few are needed at a time
  • Logic used in place of ASIC environment
    customization or external FPGA/PLD devices
  • Looks like an IO peripheral to the processor
  • Examples
  • Protocol handling
  • Stream computation: compression, encryption
  • Peripherals: sensors, actuators

24
Example Interface/Peripherals
  • Triscend E5

25
Instruction Set Extension
  • Instruction Bandwidth
  • Processor can only describe a small number of
    basic computations in a cycle
  • $I$ bits → $2^I$ operations
  • This is a small fraction of the operations one
    could do even in terms of $w \times w \to w$ ops
  • $w\,2^{2^{2w}}$ operations
  • Processor could have to issue $w\,2^{(2^{2w} - I)}$
    operations just to describe some computations
  • An a priori selected base set of functions could
    be very bad for some applications

26
Instruction Set Extension
  • Idea
  • Provide a way to augment the processor's
    instruction set with operations needed by a
    particular application

27
Architectural Models for I.S.A extension
XTENSA (Tensilica Inc., 2002)
RISC CPU featuring application-specific function
units optionally inserted in the processor
pipeline
  • Good performance
  • Easy to program
  • Configured at mask level

PLEIADES (Zhang et al., 2000)
CPU surrounded by a collection of
application-specific custom computing devices
  • High performance
  • Overdesigned for most applications
  • Difficult to program

28
Dynamic ISA Extension models
Standard processor coupled with embedded
programmable logic, where application-specific
functions are dynamically re-mapped depending
on the algorithm being performed
1. Coprocessor model
2. Function unit model
29
Coprocessor model: Garp
  • Explicit instructions move data to and from
    the array
  • High communication overhead (long-latency array
    operations)
  • Processor is stalled whenever the array is active
  • Array operates at the TASK level (very coarse
    grain)
  • 10-20x speed-up on stream, feed-forward operations
  • 2-3x speed-up when data dependencies limit
    pipelining

Callahan, Hauser, Wawrzynek, 2000
30
Function unit model: PRISC
  • Array fits in the RISC pipeline
  • No communication overhead
  • Some degree of parallelism between function units
  • Gate array performs combinatorial instructions
    ONLY (very fine grain)
  • Low speed-up figures (2x-3x)

Razdan, Smith 1994
31
Function Unit Model pros
  • No communication overhead
  • Strict synergy between FPGA and other function
    units
  • FPGA can be used frequently even for small
    functions
  • Small reconfigurable array area
  • Flow control handled by the core
  • Memory access handled by the core
  • Easy instruction set extension
  • Configuration streams compiled from C

32
EXTENDIBLE INSTRUCTION SET RISC ARCHITECTURE
  • 32-bit load/store RISC architecture (5-stage
    pipeline)
  • Set of specialized functional units
  • Multiply/MAC unit
  • Branch/decrement unit
  • ALU featuring MMX byte-wide concurrent
    operations
  • VLIW elaboration
  • Concurrent fetch and execution of two 32-bit
    instructions per cycle
  • Fully bypassed, to minimize pipeline stalls
    (Average of 10/20 for most computational cores)
  • Embedded reconfigurable device for dynamic ISA
    extension
  • DSP-oriented reconfigurable functional unit
    (PiCoGA)
  • Fully configurable at execution time
  • Elaboration and configuration controlled by asm
    instructions inserted in the C source code
  • PiCoGA used as a programmable datapath with an
    independent pipeline structure

33
XiRisc Architecture
34
Dynamic Instruction Set Extension
35
Dynamic Instruction Set Extension
[Figure: register file, configuration memory, and an instruction stream containing pgaload, pgaop 3,4,5 and Add 8, 3.]
36
PiCoGA Architecture
  • PiCoGA (Pipelined Configurable Gate Array)
  • Embedded datapath for dynamic ISA extension
  • Dynamically reconfigurable
  • Structured in rows activated in dataflow
    fashion by the PiCoGA control unit
  • Can hold state
  • pGA-op latency depends on the specific mapped
    function
  • Functionality is determined from the DFG extracted
    from C code

Processor Interface
PiCoGA Control Unit
37
Pico-cell Description
38
Computing on PiCoGA
39
Multi-context Array
PiCoGA
Configuration Cache
Four configuration planes are available, one of
them executing
While one plane is executing, another may be
reconfigured → no reconfiguration time overhead
Plane switch takes just 1 clock cycle
40
Architecture Flexibility
[Figure: decision flowchart with Yes/No branches over the following questions.]
  • Parallelism to exploit? (e.g. turbo decoding, motion estimation)
  • Bit-level operations? (e.g. DES, Reed-Solomon)
  • MAC intensive? (e.g. FFT, scalar product)
  • Memory intensive? (e.g. DCT, motion estimation)
Outcomes shown: speed-up from pGA (5x to 100x); speed-up from DSP instructions and VLIW (1.5x to 2x)
Improvements for a large number of digital signal
processing algorithms
41
Programming XiRisc Restrictions
  • Fixed-point algorithms
  • Variable size specification at the bit level
  • Not supported yet: dynamic memory allocation,
    math library, operating system

42
XiRisc Compilation Flow
C COMPILER
PROFILER
PiCoGA Configurator
PiCoGAop
Configuration Bit stream
Configuration Library
43
Example: Motion Estimation
Sum of Absolute Differences (SAD): high
instruction-level and inter-iteration parallelism
44
Data Flow Graph
  • Pixel-by-pixel absolute difference
  • Abs($p1_i - p2_i$)
  • $p1_i$, $p2_i$: pixels

[Figure: data flow graph with an absolute-difference stage followed by a sum tree.]
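
The data flow graph described here, absolute differences feeding a sum tree, can be sketched in C as follows. This is a minimal illustration under stated assumptions: 8-bit pixels, a block of 8 pixels, and the hypothetical function name `sad8` (chosen to echo the SAD8 label on a later slide, not taken from any XiRisc library). The additions are written as an explicit balanced tree to mirror the graph rather than as a running accumulator.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over 8 pixel pairs. */
unsigned sad8(const uint8_t p1[8], const uint8_t p2[8])
{
    unsigned d[8];

    /* Stage 1: eight independent absolute differences. */
    for (int i = 0; i < 8; i++)
        d[i] = (unsigned)abs((int)p1[i] - (int)p2[i]);

    /* Stage 2: four independent pairwise sums. */
    unsigned s0 = d[0] + d[1];
    unsigned s1 = d[2] + d[3];
    unsigned s2 = d[4] + d[5];
    unsigned s3 = d[6] + d[7];

    /* Stages 3 and 4: finish the tree. */
    unsigned t0 = s0 + s1;
    unsigned t1 = s2 + s3;
    return t0 + t1;
}
```

Every operation within a stage is independent of the others in that stage, which is the instruction-level parallelism the example slide refers to; successive blocks can also overlap across stages.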
45
Sum of Absolute Difference
SAD
SAD8
SAD8
46
Place & Route
High-Level C Compiler
Mapping
Place & Route
Configuration Bits
DFG-based description
Emulation Function with Latency and Issue Delay
47
Performance evaluation
  • Emulation function
  • Latency and Issue-Delay back-annotation
  • Profiling

48
Motion Estimation Results
  • Motion estimation
  • 16 SAD operations in parallel
  • PiCoGA occupation: 100%
  • Speed-up: 7x (with respect to standard XiRisc)
  • MPEG preliminary result
  • H.261 standard, QCIF (176x144): 10 frames/sec

49
Reed-Solomon Encoder Results
  • Encoder RS(15,9), 4-bit symbols
  • PiCoGA occupation: 25%
  • Speed-up: 37x
  • Throughput: 70.6 Mb/sec
  • Encoder RS(255,239), widely used, 8-bit symbols
  • PiCoGA occupation: 60%
  • Speed-up: 135x
  • Throughput: 187.1 Mb/sec
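
Reed-Solomon encoding is built on Galois-field arithmetic, which is a typical bit-level workload. As an illustration of why such kernels are awkward on a fixed-width ALU, the sketch below multiplies two 4-bit symbols in GF(16), the symbol size of RS(15,9). The slides do not state the field polynomial, so $x^4 + x + 1$ is an assumption, and the function name `gf16_mul` is hypothetical.

```c
#include <stdint.h>

/* Multiply two GF(16) elements (values 0..15) by shift-and-XOR,
   reducing modulo the assumed polynomial x^4 + x + 1 (0x13). */
uint8_t gf16_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 4; i++) {
        if (b & 1u)
            p ^= a;          /* add (XOR) the current multiple of a */
        b >>= 1;
        a <<= 1;             /* multiply a by x */
        if (a & 0x10u)
            a ^= 0x13u;      /* reduce when degree reaches 4 */
    }
    return (uint8_t)(p & 0x0Fu);
}
```

In software each multiplication costs a loop of shifts, tests and XORs; in reconfigurable logic the same function is a small network of XOR gates.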

50
Speed-up and Power Consumption