Wormhole RTR FPGAs with Distributed Configuration Decompression

Transcript and Presenter's Notes
1
Wormhole RTR FPGAs with Distributed Configuration
Decompression
  • CSE-670 Final Project Presentation
  • A Joint Project Presentation by
  • Ali Mustafa Zaidi
  • Mustafa Imran Ali

2
Introduction
  • An FPGA configuration architecture supporting
    distributed control and fast context switching.
  • Aim of Joint Project:
  • Explore the potential for dynamically
    reconfiguring FPGAs by adapting the WRTR Approach
  • Enable High-Speed Reconfiguration using:
  • Optimized Logic-Blocks and
  • Distributed Configuration Decompression
    techniques
  • Study focuses on Datapath-oriented FPGAs

3
Project Methodology
  • In-depth study of issues.
  • Definition of basic Architecture models.
  • Definition of Area models (for estimation of
    relative overhead w.r.t. other RTR schemes).
  • Design of Reconfigurable systems around designed
    Architecture models.
  • Identification of FPGA resource allocation
    methods
  • For distributing FPGA area between multiple
    applications/hosts (at runtime, compile-time, or
    system design-time).
  • Selection of Benchmarks for testing and
    simulation of various approaches.
  • Experimentation with WRTR systems and a
    corresponding PRTR system without distributed
    configuration (i.e. a single host/application) for
    comparison of baseline performance.
  • Evaluate resource utilization, normalized
    reconfiguration overhead, etc.
  • Experimentation with all systems with distributed
    configurations (i.e. multiple hosts/applications).
    The PRTR system's configuration port will be
    time-multiplexed between the various
    applications.
  • Evaluate resource utilization, normalized
    reconfiguration overhead, etc.

4
Configuration Issues with Ultra-high Density FPGAs
  • FPGA densities rising dramatically with process
    technology improvements.
  • Configuration time for the serial method is
    becoming prohibitively large.
  • FPGAs are increasingly used as compute engines,
    implementing data-intensive portions of
    applications directly in hardware.
  • Lack of an efficient method for dynamic
    reconfiguration of large FPGAs will lead to
    inefficient utilization of the available
    resources.

5
Scalability Issues with Multi-context RTR
  • Concept: While one plane is operating, configure
    the other planes serially.
  • Latency is hidden by overlapping configuration
    with computation.
  • As FPGA size, and thus configuration time, grows,
    multi-context becomes less effective in hiding
    latency.
  • Configuring more bits in parallel for each
    context is only a stop-gap solution.
  • Only so many pins can be dedicated to
    configuration.
  • Overheads in the Multi-context approach:
  • Number of SRAM cells used for configuration grows
    linearly with the number of contexts
  • Multiplexing circuitry associated with each
    configurable unit
  • Global, low-skew context-select wires.

6
Scalability Issues with Partial RTR
  • Concept: The configuration memory is addressable
    like standard Random Access Memory.
  • Overheads in the PRTR Approach:
  • Long global cell-select and data busses required
  • area overhead,
  • issues with single-cycle signal transmission as
    wires grow relative to logic.
  • Vertical and horizontal decoding circuitry
    represents a centralized control resource.
  • Can be accessed only sequentially by different
    user applications (one app at a time)
  • Potential for underutilization of hardware as
    FPGA density increases.
  • One solution could be to design the RAM as a
    multi-ported memory
  • But RAM area increases quadratically with the
    number of ports.
  • Only so many dedicated configuration ports can be
    provided
  • Not a long-term scalable solution.

7
What is Wormhole RTR?
  • WRTR is a method for reconfiguring a configurable
    device in an entirely distributed fashion.
  • Routing and configuration are handled at the
    local instead of the global level.

8
Advertised Benefits of WRTR
  • WRTR is a distributed paradigm
  • Allows different parts of the same resource to be
    independently configured simultaneously.
  • Dramatically increases the configuration
    bandwidth of a device.
  • Lack of a centralized controller means:
  • Fewer single-point failures that can lead to
    total system failure (e.g. a broken configuration
    pin)
  • Increased resilience: routing around faults,
    improving chip yields?
  • Distributed control provides scalability
  • Eliminates the configuration bottleneck.

9
Origins of Wormhole RTR
  • Concept: Developed in the late '90s at Virginia
    Tech.
  • Intended as a method of rapidly creating and
    modifying custom computational pathways using a
    distributed control scheme.
  • Essence of the WRTR concept:
  • Independent self-steering streams.
  • Streams carried both programming information and
    operand data
  • Streams interact with the architecture to perform
    computation. (see DIAGRAM)

10
Origins of Wormhole RTR
11
Origins of Wormhole RTR
  • Programming information configures both the
    pathway of the stream through the system and the
    operations performed by computational elements
    along the path.
  • Heterogeneity of architectures is supported by
    these streams.
  • Runtime determination of the stream's path is
    possible, allowing allocation of resources as
    they become available.

12
Adapting WRTR for conventional FPGAs
  • Our aims:
  • To achieve fast, parallel reconfiguration
  • With minimum area overhead
  • And minimum constraints imposed on the underlying
    FPGA Architecture.
  • The configuration architecture is completely
    decoupled from the FPGA architecture.
  • The WRTR model is used as inspiration for
    developing a new paradigm for dynamic
    reconfiguration
  • It is not necessary that the WRTR method be
    followed to the letter

13
Issues Associated with using WRTR for
Conventional FPGAs
  • Original WRTR was intended for coarse-grained
    dataflow architectures with localized
    communications
  • Thus operand data was appended to the streams
    immediately after the programming header.
  • In conventional FPGAs, dataflow patterns are
    unrelated to configuration flow, and there is no
    restriction that communications be localized.
  • Therefore wormhole routing is used only for
    configuration (it cannot be used for data).

14
Issues Associated with using WRTR for
Conventional FPGAs
  • The original model was intended to establish
    linear pipelines through the system.
  • This makes run-time direction determination
    feasible.
  • However, for conventional FPGAs, the functions
    implemented have arbitrary structures.
  • The configuration stream cannot change direction
    arbitrarily (i.e. its path is fixed at
    compile-time).

15
Issues Associated with using WRTR for
Conventional FPGAs
  • Due to the need for a large number of
    configuration ports, I/O ports must be
    shared/multiplexed
  • thus active circuits may need to be stalled to
    load configurations.
  • Should not be a severe issue for high-performance
    computing oriented tasks.
  • Should impose minimum constraints on the
    underlying FPGA architecture
  • Constraints applicable in an FPGA with WRTR are
    the same as those for any PRTR architecture.

16
A Possible System Architecture for a WRTR FPGA
  • Many configuration/IO ports, divided between
    multiple host processors. (See Diagram)
  • Internally, the FPGA is divided into partitions,
    usable by each of the hosts.
  • Partition boundaries may be determined at system
    design time, or at runtime, based on the
    requirements of each host at any given time.

17
The various WRTR Models derived
  • Our aim was to devise a high-speed, distributed
    configuration model with all the benefits of the
    original WRTR concept, but with minimum overhead.
  • To this end, 3 models have been devised:
  • Basic: with Full Internal Routing
  • Second: with Perimeter-only Routing.
  • Third: Packetized, or parallel configuration
    streams, with no Internal Routing.

18
Basic WRTR Model with Internal Routing
  • Each configurable block or tile is accompanied
    by a simple configuration Stream Router. (See
    Diagram)
  • Overhead scales linearly with FPGA size.
  • Expected issues with this model:
  • Complicated router, arbitration overhead and
    prioritization, potential for deadlock conditions,
    etc.
  • May be restricted to coarser grained designs.
  • Without data routing, do we really need internal
    routing?

19
Second: WRTR with Perimeter-only Routing
  • The primary requirement for achieving parallel
    configuration is multiple input ports.
  • Internal routing is not a mandatory requirement.
  • So why not restrict routing to the chip boundary?
    (See Diagram)
  • Overhead scaling improved (similar to the PRTR
    model)
  • Highlights:
  • Finer granularity for configuration achievable
  • Significantly lower overheads as FPGA sizes grow
    (ratio of perimeter to area)
  • Issues:
  • Longer time required to reach parts of the FPGA
    as FPGA size grows.
  • Reduced configuration parallelism because of
    potentially greater arbitration delays at
    boundary routers.

20
Third: Packet-based Distribution of Configuration
  • One solution to the increased boundary
    arbitration issue: use packets instead of
    streams. (See Diagram)
  • A single configuration from each application is
    generated as a stream (a worm), similar to the
    previous models.
  • Before entering the device, configuration packets
    from different streams are grouped according to
    their target rows (see the sketch below).
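The row-grouping step can be pictured with a small software sketch. This is purely illustrative: the packet fields (target row, payload) and the per-application tagging are assumptions, not details taken from the slides.

```python
from collections import defaultdict

def group_packets_by_row(streams):
    """Merge per-application configuration streams into per-row packet
    groups before they are injected into the device."""
    rows = defaultdict(list)
    for app_id, worm in enumerate(streams):
        for target_row, payload in worm:
            # Tag each packet with its source application so the row port
            # can keep the interleaved streams apart inside the fabric.
            rows[target_row].append((app_id, payload))
    return dict(rows)

# Two hypothetical applications targeting overlapping rows.
app0 = [(0, "cfgA0"), (1, "cfgA1")]
app1 = [(1, "cfgB0"), (2, "cfgB1")]
print(group_packets_by_row([app0, app1]))
# {0: [(0, 'cfgA0')], 1: [(0, 'cfgA1'), (1, 'cfgB0')], 2: [(1, 'cfgB1')]}
```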

21
Third: Packet-based Distribution of Configuration
  • Benefit: No need at all for routers in the fabric
    itself.
  • Drawbacks:
  • Increases overhead on the host system
  • Implies a centralized external controller
  • Or a limited crossbar interconnect within the
    FPGA
  • Parallel reconfiguration is still possible, but
    with limited multitasking.
  • This model may be considered for embedded
    systems, with low configuration parallelism but
    high resource requirements.

22
Basic Area Model
  • A baseline model is required to identify the
    overheads associated with each RTR model.
  • Basic building block (See Diagram):
  • A basic Array of SRAM Cells
  • Configured by an arbitrary number of scan chains.
  • Assumptions for a fair comparison of overhead:
  • Each RTR model studied has exactly the same
    amount of Logic resources to configure (rows and
    columns).
  • Each model can be configured at exactly the same
    granularity.
  • The given array of logic resources (see Diagram)
    has an area equivalent to a serially
    configurable FPGA.
  • AREA of Basic Model = (A × B) × (x × y)

23
The PRTR FPGA Area Model
  • Please See Diagram
  • Configuration Granularity decided by A and B
  • AREA = Area of Basic Model + Overheads
  • Overheads:
  • Area of log2(x)-to-x Row-select decoder
  • Area of log2(y)-to-y n-bit Column
    De-multiplexer
  • Area of 1 n-bit bus × y

24
The basic WRTR FPGA Area Model
  • Please See Diagram
  • AREA = Area of Basic Model + Overheads
  • Overheads:
  • Area of 1 n-bit bus × 2x
  • Area of 1 n-bit bus × 2y
  • Area of 1 4-D Router block × x × y.

25
The Perimeter Routing WRTR FPGA Area Model
  • Please See Diagram
  • AREA = Area of Basic Model + Overheads
  • Overheads:
  • Area of 1 n-bit bus × 2x
  • Area of 1 n-bit bus × 2y
  • Area of 1 3-D Router block × (2(x + y) - 1)
  • This model can also be made one-dimensional to
    further reduce overheads. (other constraints will
    apply)

26
The Packet based WRTR FPGA Area Model
  • Please See Diagram
  • AREA = Area of Basic Model + Overheads
  • Overheads:
  • Area of 1 n-bit bus × 2x
  • Area of 1 n-bit bus × 2y
  • Additional overheads may appear in the host
    system.
  • This model can also be made one-dimensional to
    further reduce overheads. (other constraints will
    apply)
  • A combined area sketch for these models follows
    below.
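The area expressions above can be collected into one small sketch. The per-unit areas (decoder, demultiplexer, bus, router blocks) are left as free parameters to be filled in from a layout or technology model; only the formulas follow the slides, and the perimeter router count uses the slide's 2(x + y) - 1 expression.

```python
def basic_area(A, B, x, y):
    # Basic model: an A x B array of SRAM cells per block, x * y blocks.
    return (A * B) * (x * y)

def prtr_area(A, B, x, y, row_decoder, col_demux, bus):
    # PRTR: row-select decoder + n-bit column demultiplexer + one n-bit
    # bus replicated over the y columns.
    return basic_area(A, B, x, y) + row_decoder + col_demux + bus * y

def wrtr_internal_area(A, B, x, y, bus, router_4d):
    # Basic WRTR: n-bit buses over 2x rows and 2y columns, plus one
    # 4-direction router per block.
    return (basic_area(A, B, x, y) + bus * (2 * x) + bus * (2 * y)
            + router_4d * (x * y))

def wrtr_perimeter_area(A, B, x, y, bus, router_3d):
    # Perimeter-only WRTR: same buses, 3-direction routers on the
    # boundary only.
    return (basic_area(A, B, x, y) + bus * (2 * x) + bus * (2 * y)
            + router_3d * (2 * (x + y) - 1))

def wrtr_packet_area(A, B, x, y, bus):
    # Packet-based WRTR: buses only; routing cost moves to the host side.
    return basic_area(A, B, x, y) + bus * (2 * x) + bus * (2 * y)
```

Plugging in technology-specific values for the bus and router terms would let the relative overheads of the three WRTR variants be compared against the PRTR baseline, which is the purpose of the area model.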

27
Parameters defined and their Impact
  • The number of Busses (x and y)
  • The number of Busses varies with reconfiguration
    granularity.
  • For fixed logic capacity, A and B increase with
    decreasing x and y, i.e. coarser granularity.
  • Impact of coarser granularity:
  • Reduced overhead
  • Reduced reconfiguration flexibility
  • Increased reconfiguration time per block
  • Thus it is better to have finer granularity
  • The Width of the busses (n bits)
  • The smaller the width, the smaller the overhead
    (for a fixed number of busses)
  • Longer Reconfiguration times.

28
Parameters defined and their Impact
  • It is possible to achieve finer granularity
    without increasing overhead, at the cost of bus
    width (and hence reconfiguration time per block)
  • Impact of coarse-grained vs. fine-grained
    configurability: methods of handling hazards in
    the underlying FPGA fabric.
  • Coarse-grained configuration places minimum
    constraints on the FPGA architecture
  • Fine-grained reconfiguration is subject to all
    issues associated with Partial RTR systems.

29
Approaches to Router Design
  • Active Routing Mechanism
  • Similar to conventional networks
  • Routing of streams depends on the
    stream-specified destination, as well as network
    metrics (e.g. congestion, deadlock)
  • Hazards and conflicts may be dealt with at
    run-time.
  • Significantly more complicated routing logic is
    required.
  • Most likely will be restricted to very
    coarse-grained systems.
  • Passive Routing Mechanism
  • Routing of streams depends only on the
    stream-specified direction
  • Hazards and conflicts are avoided by compile-time
    optimization.
  • We have selected the Passive Routing Mechanism
    for our WRTR models

30
Passive Router Details
  • Must be able to handle streams from 4 different
    directions.
  • Streams from different directions are stalled
    only if there is a conflict in the outgoing
    direction.
  • Includes mechanisms for stalling streams in case
    of conflict, etc.
  • Detecting and applying back-pressure
  • Routing circuitry for one port is defined (see
    Diagram; a behavioral sketch follows below)
  • For a 4-D router, this design is replicated 4
    times
  • For a 3-D router, this design is replicated 3
    times
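A behavioral sketch of one passive port is given below. It only illustrates the stall-on-conflict and back-pressure behaviour described above; the flit format and field names are assumptions, and a real implementation would be combinational hardware rather than software.

```python
class PassivePort:
    """One port of a passive stream router: it forwards a worm in the
    direction fixed in the worm's header and stalls on output conflicts."""

    def __init__(self, router):
        self.router = router
        self.locked_dir = None              # output this port currently owns

    def step(self, flit):
        out_dir = flit["dir"]               # direction fixed at compile time
        if self.locked_dir is None:
            if self.router.busy.get(out_dir, False):
                return "stall"              # conflict: apply back-pressure
            self.router.busy[out_dir] = True
            self.locked_dir = out_dir       # head flit claims the output
        if flit.get("tail", False):         # tail flit releases the output
            self.router.busy[self.locked_dir] = False
            self.locked_dir = None
        return "forward"

class PassiveRouter:
    # A 4-D router replicates the port logic four times; the 3-D
    # perimeter router replicates it three times.
    def __init__(self, n_ports=4):
        self.busy = {}
        self.ports = [PassivePort(self) for _ in range(n_ports)]
```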

31
Utilizing Variable Length Configuration Mechanisms
  • Support Hardware and Logic Block Issues

32
Configuration Overhead
  • Configuration data is huge for large FPGAs
  • Has to be reduced in order to have fast context
    switching

33
Configuration Overhead Minimization
  • Initial pointers in this direction
  • Variable Length Configurations
  • Default Configuration on Power-up

34
Variable Length Configurations
  • Ideally
  • Change only minimum number of bits for a new
    configuration
  • Utilize the idea of short configurations for
    frequently used configurations
  • Start from a default configuration and change
    minimum bits to reconfigure to a new state

35
Hurdles
  • Logic blocks always require full configurations
    to be specified; configuration sizes cannot be
    varied
  • Knowing only what to change requires keeping
    track of what was configured before, a difficult
    issue with multiple dynamic applications
    switching
  • A default power-up configuration can hardly be
    useful for all application cases

36
Configuration Overhead Minimization
  • How to do it?
  • Remove redundancy in the configuration data, or
    compact the contents of the configuration stream
  • Result?
  • This will minimize the information required to
    be conveyed during configuration or
    reconfiguration
  • Configuration Compression
  • Applying some sort of compression to the
    configuration data stream

37
Configuration Decompression Approaches
  • Centralized Approach
  • Decompress the configuration stream at the
    boundary of the FPGA
  • Distributed Approach (New Paradigm)
  • Decompress the stream at the boundary of the
    Logic Blocks or Logic Clusters

38
Centralized Approach
  • Advantages:
  • Requires hardware only at the boundary of the
    device from where the configuration data enters
    the device
  • Significant reduction in configuration size can
    be achieved
  • Run-length coding and Lempel-Ziv based
    compression have been used (a run-length sketch
    follows below)
  • Examples:
  • Atmel 6000 Series
  • Zhiyuan Li, Scott Hauck, "Configuration
    Compression for Virtex FPGAs," IEEE Symposium on
    FPGAs for Custom Computing Machines, 2001
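As a toy illustration of the kind of boundary compression mentioned above, a run-length coder for a configuration bitstream might look like the following (this is not the Atmel or Li/Hauck scheme, just a minimal sketch):

```python
def rle_encode(bits):
    """Run-length encode a configuration bitstream as (value, run) pairs."""
    runs = []
    run_val, run_len = bits[0], 1
    for b in bits[1:]:
        if b == run_val:
            run_len += 1
        else:
            runs.append((run_val, run_len))
            run_val, run_len = b, 1
    runs.append((run_val, run_len))
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

# Long runs of identical bits (common in configuration data) compress well.
stream = [0] * 12 + [1] * 3 + [0] * 9
assert rle_decode(rle_encode(stream)) == stream
print(rle_encode(stream))   # [(0, 12), (1, 3), (0, 9)]
```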

39
Centralized Approach
  • Limitations:
  • More efficient variable-length coding is not easy
    to use because of the large number of symbol
    possibilities
  • It is difficult to quantify symbols in the
    configuration stream of heterogeneous devices,
    which can have different types of blocks

40
Decentralized Approach
  • Advantages:
  • Decompressing at the logic block boundary enables
    configurations to be easily symbolized and hence
    VLC to be used
  • In other words, we know exactly what we are
    coding, so Huffman-like codes can be used based
    on the frequency of configuration occurrences
  • Also has advantages specific to Wormhole RTR
    (discussed next)

41
Decentralized Approach
  • Limitations:
  • The decompression hardware has to be replicated
  • Optimality issue: over how much programmable
    logic area should the decompression hardware be
    amortized?
  • In other words, the granularity of the logic area
    should be determined for an optimal cost/benefit
    ratio

42
Suitability of Decentralized Approach to WRTR
  • If worms are decompressed at the device boundary,
    large internal worm lengths will result
  • This leads to greater arbitration issues and worm
    blockages
  • The decentralized approach thus favors shorter
    worm lengths and lets parallel worms traverse
    with fewer blockages

43
Variable Length Configuration
  • Overall idea:
  • Frequently used configurations of a logic block
    should have small codes
  • Variable-length coding such as Huffman coding can
    be adapted (see the sketch below)
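A minimal sketch of such a code table is shown below. The configuration symbols and their frequencies are assumed values for illustration, with a reserved RAW_BITS symbol standing in for the uncompressed random-logic escape mentioned later.

```python
import heapq

def huffman_codes(freqs):
    """Build variable-length codes from configuration usage frequencies."""
    # Heap entries: (weight, tie-breaker, {symbol: code-so-far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Hypothetical per-application profile: frequent datapath configurations
# get short codes; RAW_BITS is the escape prefix for uncoded random logic.
profile = {"ADD": 40, "SUB": 25, "MUX2": 20, "AND": 10, "RAW_BITS": 5}
print(huffman_codes(profile))
# e.g. {'ADD': '0', 'SUB': '10', 'MUX2': '111', 'AND': '1101', 'RAW_BITS': '1100'}
```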

44
Configuration Frequency Analysis
  • How to decide upon the frequencies?
  • Hardwired? By the designer, through benchmark
    analysis?
  • Generic? Done by the software generating the
    configuration stream?

45
Continued
  • Hardwired determination will be inferior: no
    large benefit is gained, due to variations in
    applications
  • Software that generates the configurations can
    optimally identify a given number of frequently
    used configurations according to the set of
    applications to be executed
  • Code determination should therefore be done by
    the software generating the configurations, for
    optimal codes

46
Decoding Hardware Approaches
  • Huffman coding the configurations
  • A Hardwired Huffman Decoder
  • Adaptive Decoder (code table can be changed)
  • Using a Table of Frequently used configurations
    and an address Decoder
  • Huffman Coding the Table Addresses
  • Static Coding
  • Adaptive Coding

47
Decoding Hardware Features
  • Static Huffman Decoders
  • Lower compression
  • Coding Configurations
  • Requires a very wide decoder
  • Using an Address Decoder only
  • Reduced hardware but less compression (fixed-size
    codes)
  • Coding the Decoder Inputs
  • Requires a relatively smaller Huffman decoder

48
Some points to Note
  • The decompression approach is decoupled from any
    specific logic block architecture
  • Though certain logic blocks will favor more
    compression (discussed later)
  • Not every possible configuration will be coded.
    In particular, random logic portions will require
    all the bits to be transmitted
  • A special code will prefix a random logic
    configuration to identify it, so that it can be
    handled separately

49
Logic Block Selection
  • High-Level Issues:
  • Should be Datapath Oriented
  • Efficient support for random logic implementation
  • High functionality with minimum configuration
    bits, to support dense implementation while
    reducing reconfiguration overhead
  • Well-defined datapath functionality
    (configuration) to aid in quantifying the
    frequently-used-configurations idea

50
Chosen High-Functionality Logic Block
51
A Low Functionality Logic Block
52
Logic Block Considerations
  • High-functionality blocks → good for datapath
    implementations
  • Low-functionality blocks → less dense datapath
    implementations

53
Logic Block Considerations
  • Low-functionality blocks have fewer configuration
    bits per block, and vice versa
  • The frequently-used-configurations memory cost
    depends on the size of the configurations stored
  • So the decoder hardware overhead will be lower
    for low-functionality blocks

54
Logic Block Considerations
  • What about the configuration time overhead?
  • Less dense functionality means more blocks to
    configure
  • This leads to longer configuration streams

55
Logic Block Considerations
  • Assumption: Random logic does not benefit from
    one block type or the other
  • Consequence: With high-functionality blocks,
    datapath-oriented designs will require fewer
    blocks to configure and achieve even higher
    compression, but incur a larger overhead for
    random logic implementations than with
    low-functionality blocks

56
Logic Block Issue Conclusions
  • Proper logic block selection for a particular
    application affects:
  • Decoder hardware size
  • Configuration compression ratio
  • Since random logic is not compressed, using
    high-functionality blocks for less
    datapath-oriented applications will result in:
  • A high decoder overhead left unutilized
  • Less compression and longer configuration streams

57
Huffman decoder hardware
  • Basic Huffman hardware is sequential in nature
    and has a variable input rate and variable output
    rate

58
Huffman Hardware
  • Sequential decoding is not suitable for WRTR
  • The worm will have to be stalled
  • This negates the benefit of fast reconfiguration
  • The hardware should be able to process N bits at
    a time, where N = bus width
  • This requires a constant-input-rate architecture
    with a variable number of codes processed per
    cycle

59
Constant Input Rate PLA based Architecture
  • Input rate: K bits/cycle
  • The PLA undertakes the table-lookup process
  • Input bits:
  • Determine one unique path along the Huffman tree
  • Next state:
  • Fed back to the input of the PLA to indicate the
    final residing state
  • Indicator:
  • The number of symbols decoded in the cycle
  • (A behavioral sketch of this lookup table follows
    below.)

The constant-input rate PLA-based architecture
for the VLC decoder
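The table the PLA implements can be sketched in software as follows. This is a behavioral model only, assuming a complete prefix (Huffman) code: the state is the partially consumed code word, each cycle consumes exactly K input bits, and each entry returns the next state together with the zero or more symbols completed in that cycle.

```python
def build_decoder_table(codes, K):
    """(state, K-bit chunk) -> (next state, symbols emitted this cycle)."""
    decode = {c: s for s, c in codes.items()}
    # States are the proper prefixes of the code words ("" = tree root).
    states = {""} | {c[:i] for c in codes.values() for i in range(1, len(c))}
    table = {}
    for state in states:
        for chunk in range(2 ** K):
            bits = format(chunk, f"0{K}b")
            cur, out = state, []
            for b in bits:
                cur += b
                if cur in decode:        # a full code word completes
                    out.append(decode[cur])
                    cur = ""             # back to the root of the tree
            table[(state, bits)] = (cur, out)
    return table

def decode_stream(table, bitstream, K):
    # The bitstream length is assumed to be padded to a multiple of K.
    state, symbols = "", []
    for i in range(0, len(bitstream), K):   # constant K bits per cycle
        state, out = table[(state, bitstream[i:i + K])]
        symbols.extend(out)                 # 0..K symbols per cycle
    return symbols

# Using the hypothetical codes from the earlier Huffman sketch, with K = 4:
codes = {"ADD": "0", "SUB": "10", "MUX2": "111",
         "AND": "1101", "RAW_BITS": "1100"}
table = build_decoder_table(codes, 4)
print(decode_stream(table, "01011010", 4))  # ['ADD', 'SUB', 'AND', 'ADD']
```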
60
PLA Based Architecture
  • Ref: "VLSI Designs for High Speed Huffman
    Decoders," Shih-Fu Chang and David G.
    Messerschmitt
  • Decoder model: FSM
  • FSM Implementation:
  • ROM
  • PLA
  • Lower complexity
  • High speed

61
Implementation Results
  • The area is a function of the number of inputs
    and outputs, along with the input rate

62
Hardware Area Estimates
  • The number of inputs and outputs depends upon the
    maximum code length and the symbol sizes
  • Typically, for a 16-entry table:
  • Code sizes range from 1 to 8 bits
  • Symbol size will equal the decoder input width,
    i.e. 4

63
Handling Multiple Symbol Outputs
  • More than one code will be decoded, with a
    maximum of N codes per cycle
  • To take full advantage of parallel decoding,
    multiple configuration chains can be employed.
  • A counter can be used to cycle between the chains
    and output one configuration per chain, with a
    maximum of N chains (see the sketch below).
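A minimal sketch of that round-robin dispatch (symbol names reused from the hypothetical Huffman example above):

```python
def dispatch_to_chains(decoded_symbols, n_chains):
    """Cycle a counter over the configuration chains, pushing one decoded
    configuration to each chain in turn (up to N chains)."""
    chains = [[] for _ in range(n_chains)]
    counter = 0
    for sym in decoded_symbols:
        chains[counter].append(sym)          # one configuration per chain
        counter = (counter + 1) % n_chains   # advance to the next chain
    return chains

print(dispatch_to_chains(["ADD", "SUB", "MUX2", "ADD", "AND"], 2))
# [['ADD', 'MUX2', 'AND'], ['SUB', 'ADD']]
```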

64
Decoder Hardware
  • For parallel decoding of N codes and M-bit
    decoded symbols:
  • Huffman Decoder (discussed before)
  • N M-bit decoders
  • N port ROM table
  • N parallel configuration chains

65
Concluding Points
  • The hardware overhead of the Huffman decoding
    mechanism discussed has to be incorporated into
    the area model presented earlier.
  • Empirical determination of reconfiguration
    speed-up versus area overhead for a sampling of
    benchmarks