Title: A Survey of Logic Block Architectures
1A Survey of Logic Block Architectures
- For Digital Signal Processing Applications
2Presentation Outline
- Considerations in Logic Block Design
- Computation Requirements
- Why Inefficiencies?
- Representative Logic Block Architectures
- Proposed
- Commercial
- Conclusions: What is suitable where?
3Why DSP? The Context
- Representative of the computationally intensive class
of applications: datapath oriented and
arithmetic oriented
- Increasingly large use of FPGAs for DSP:
multimedia signal processing, communications, and
much more
- To study the issues in reconfigurable fabric
design for compute intensive applications: what
is involved in making a fabric to accelerate
multimedia reconfigurable computing possible?
4Elements of a Reconfigurable Architecture
- Logic Block/Processing Element
- Differing Grains: Fine >> Coarse >> ALUs
- Routing
- Dynamic Reconfiguration
5So what's wrong with the typical FPGA?
- Meant to be general purpose: lower risks
- Too flexible! Result: Efficiency Gap
- Higher Implementation Cost, Larger Delay, Larger
Power Consumption than ASICs
- Performance vs. Flexibility Tradeoff: Postponing
Mapping and Silicon Re-use
6Solution? See how FPGAs are Used
- FPGAs are being used for classes of
applications: Encryption, DSP, Multimedia etc.
- Here lies the Key: Design FPGAs for a class of
applications
- Application Domain Characterization -> Application
Domain Tuning
7Domain Specialization
- COMPUTATION -> defines -> ARCHITECTURE
- Target Application Characteristics known
beforehand? Yes
- Characterize the application domain
- Determine a balance between flexibility and efficiency
- Tune the architecture accordingly
8Categorizing the Computation
- Control: Random Logic Implementation
- Datapath: Processing of Multi-bit Data
- Conflicting Requirements?
9Datapath Element Requirements
- Operates on Word Slices or Bit Slices
- Produces multi-bit outputs
- Requires many smaller elements to produce each
bit output, i.e. multiple small LUTs
10Control Logic Requirements
- Produces a single output from many single-bit
inputs
- Benefits from large-grain LUTs, as the number of logic
levels is reduced
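The effect of LUT size on logic depth can be sketched for the simplest case. The function below is an illustration only, not something taken from the slides: it assumes a wide function that decomposes into a tree (for example a wide AND, OR or XOR) and counts how many levels of k-input LUTs such a tree needs. The name `lut_levels` is hypothetical.

```python
def lut_levels(num_inputs, lut_size):
    """Levels of lut_size-input LUTs in a tree that reduces num_inputs signals to one.

    Assumes a tree-decomposable function such as a wide AND, and lut_size >= 2.
    """
    levels = 0
    signals = num_inputs
    while signals > 1:
        # Each level groups the remaining signals into LUTs of lut_size inputs.
        signals = -(-signals // lut_size)  # ceiling division
        levels += 1
    return levels


if __name__ == "__main__":
    for k in (2, 4, 6):
        print(f"16-input AND with {k}-input LUTs: {lut_levels(16, k)} levels")
```

Under these assumptions a 16-input AND needs four levels of 2-input LUTs but only two levels of 4-input LUTs, which is the kind of reduction the slide refers to.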
11Logic Block Design Considerations
- How much of what kinds of computations to
support?
- Tradeoff: Generality vs. Specialization
12How much of What? Applications benchmarking
13So what do we have to support?
- Datapath functionality, in particular arithmetic,
is dominant in DSP.
- The datapath functions have different bit-widths.
- DSP designs heavily use multiplexers of various
sizes. Thus, an efficient mapping of multiplexers
should be supported.
- DSP functions do contain random logic. The amount
of random logic varies per design.
- Some DSP designs use wide Boolean functions.
14DSP Building Blocks
- Some techniques widely used to achieve area-speed
efficient DSP implementations
- Bit-Serial Computation
- Routing efficient
- Bit-level pipelining increases throughput even
more
- Digit-Serial Computation
- Combines the area efficiency of bit-serial
with the time efficiency of bit-parallel
15Classes of DSP-optimized FPGA Architectures
- Architectures with Dedicated DSP Logic
- Homogeneous
- Heterogeneous
- Globally Homogeneous, Locally Heterogeneous
- Architectures of Coarser Granularity
- With DSP Specific Improvements (e.g. Carry
Chains, Input Sharing, CBS)
16Some Representative Architectures
17Bit-Serial FPGA with SR LUT
- The bit-serial paradigm suits the existing FPGA, so
why not optimize the FPGA for it!
- Logic block to support efficient implementation
of bit-serial datapaths and bit-level pipelining
- LUTs can be used for combinational logic as well
as for Shift Registers
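The dual use of a LUT can be illustrated with a small behavioural model. This is a sketch under assumptions, not the circuit from the slides: it treats a 4-input LUT as 16 configuration bits that can either be addressed by the inputs (logic mode) or shifted by one cell per clock (shift-register mode). The class and method names are hypothetical.

```python
class ShiftRegisterLUT:
    """Behavioural model of a 4-input LUT whose 16 configuration bits can also shift."""

    def __init__(self, truth_table=0):
        # bits[i] is the output for input address i.
        self.bits = [(truth_table >> i) & 1 for i in range(16)]

    def lookup(self, a3, a2, a1, a0):
        """Logic mode: the four inputs address one configuration bit."""
        return self.bits[(a3 << 3) | (a2 << 2) | (a1 << 1) | a0]

    def clock(self, data_bit, depth):
        """Shift-register mode: return the bit delayed by `depth` cycles (1..16),
        then shift `data_bit` into the first cell."""
        out = self.bits[depth - 1]
        self.bits = [data_bit] + self.bits[:-1]
        return out


if __name__ == "__main__":
    and4 = ShiftRegisterLUT(0x8000)  # only address 15 holds a 1
    print(and4.lookup(1, 1, 1, 1), and4.lookup(1, 0, 1, 1))

    delay = ShiftRegisterLUT()
    print([delay.clock(bit, 3) for bit in [1, 0, 1, 1, 0, 0, 1, 0]])
```

In the second use the same storage acts as a three-cycle delay line, which is the kind of shift register that bit-serial operators need between pipeline stages.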
18A Bit-Serial Adder
A Bit-Serial Adder which processes two bits at a
time
Interface Block Diagram
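As a behavioural illustration of bit-serial addition, the sketch below consumes one bit from each operand per clock cycle, least significant bit first, and keeps the carry in a single state bit that stands in for the carry flip-flop. It is a minimal model assuming that reading of the slide, not the circuit from the block diagram; the helper names are hypothetical.

```python
def bit_serial_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit streams, one bit pair per clock cycle."""
    carry = 0  # models the carry flip-flop, cleared before the first bit
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


def to_bits(value, width):
    """Unsigned integer to an LSB-first list of bits."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """LSB-first list of bits back to an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    total, carry_out = bit_serial_add(to_bits(11, 4), to_bits(6, 4))
    print(from_bits(total), carry_out)
```

Only one full adder and one flip-flop are active regardless of word length, which is why the approach is cheap in logic and routing at the price of one clock cycle per bit.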
19A Bit-Serial Multiplier Cell
20The Proposed Bit Serial Logic Block Architecture
- Four 4-input LUTs and 6 flip-flops.
- The two multiplexers in front of the LUTs are
targeted mainly at carry-save operations, which
are frequently used in bit-serial computations.
- There are 18 signal inputs and 6 signal outputs,
plus a clock input.
- Feedback inputs c2, c3, c4, c5 can be connected
to either GND or VDD or to one of the 4 outputs
d0, d1, d2, d3. Therefore, each LUT can implement
any 4-input function controlled by inputs a0,
a1, a2, a3 or b0, b1, b2, b3.
- Programmable switches connected to inputs a4 and
b4 control the functionality of the four
multiplexers at the output of the LUTs. As a result,
2 LUTs can implement any 5-input function.
- The final outputs d0, d1, d2, d3 can either be
the direct outputs from the multiplexers or the
outputs from flip-flops. All bit-serial operators
use the outputs from flip-flops, so the
attached programmable switches are not needed
for them. They are present only so that logic
functions other than bit-serial datapath
circuits can be implemented.
- Two flip-flops are added (inputs c0 and c1) to
implement shift registers, which are frequently
used in bit-serial operations.
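The claim that two 4-input LUTs plus an output multiplexer cover any 5-input function follows from Shannon expansion on the fifth input. The sketch below is a functional illustration of that idea only; it does not model the a4/b4 switches or the feedback inputs of the proposed block, and all function names are hypothetical.

```python
from itertools import product


def make_lut4(func):
    """Build the 16-entry truth table of a 4-input function."""
    return [func(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1) for i in range(16)]


def lut4(table, x0, x1, x2, x3):
    """Evaluate a 4-input LUT by addressing its truth table."""
    return table[x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)]


def map_five_input(func5):
    """Split a 5-input function into two LUT4 tables, one for each value of x4."""
    low = make_lut4(lambda x0, x1, x2, x3: func5(x0, x1, x2, x3, 0))
    high = make_lut4(lambda x0, x1, x2, x3: func5(x0, x1, x2, x3, 1))

    def evaluate(x0, x1, x2, x3, x4):
        # The fifth input only drives the select of the 2:1 multiplexer.
        return lut4(high if x4 else low, x0, x1, x2, x3)

    return evaluate


if __name__ == "__main__":
    majority5 = lambda *xs: int(sum(xs) >= 3)
    mapped = map_five_input(majority5)
    print(all(mapped(*xs) == majority5(*xs) for xs in product((0, 1), repeat=5)))
```

Because the split works for every truth table, no restriction on the 5-input function is needed; the cost is that both LUTs see the same four inputs.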
21The Modified LUT Implementing a Shift Register
22Performance Results
23Digit-Serial Logic Block Architecture
- Digit-serial architectures process one digit (N=4
bits) at a time
- They offer area efficiency similar to bit-serial
architectures and time efficiency close to
bit-parallel architectures
- N=4 bits can serve as an optimal granularity for
processing larger digit sizes (N=8, 16 etc.)
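The area/time compromise of digit-serial processing can be sketched behaviourally. The function below is an illustration under assumptions, not the adder from the slides: it adds two unsigned words one N-bit digit per clock cycle, least significant digit first, and reports how many cycles were used. It assumes the word width is a multiple of the digit size; the name `digit_serial_add` is hypothetical.

```python
def digit_serial_add(a, b, width, digit_bits=4):
    """Add two `width`-bit unsigned integers one digit per clock cycle, LSD first.

    Assumes `width` is a multiple of `digit_bits`.
    Returns (sum modulo 2**width, carry out, cycles used).
    """
    mask = (1 << digit_bits) - 1
    carry = 0  # carry register between digit cycles
    result = 0
    cycles = 0
    for shift in range(0, width, digit_bits):
        digit_sum = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (digit_sum & mask) << shift
        carry = digit_sum >> digit_bits
        cycles += 1
    return result, carry, cycles


if __name__ == "__main__":
    for n in (1, 4, 16):
        total, carry_out, cycles = digit_serial_add(0xBEEF, 0x1234, 16, digit_bits=n)
        print(f"N={n}: sum={total:#06x} carry={carry_out} cycles={cycles}")
```

With N=1 the model degenerates to bit-serial (16 cycles for a 16-bit word), with N=16 to bit-parallel (1 cycle), and N=4 sits between the two at 4 cycles using a 4-bit adder.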
24Digit-Serial Building Blocks
A Digit-Serial Adder
A Digit-Serial Unsigned Multiplier
25Digit-Serial Building Blocks
A Pipelined Digit-Serial Unsigned Multiplier for
Y=8 bits
26Digit-Serial Signed Multiplier Blocks
Middle Stages Module
First Stage Module
Last Stage Module
27Signed Digit-Serial Multiplier
A Digit-Serial Signed Booth's Pipelined
Multiplier with Y=8
28Proposed Digit-Serial Logic Block
29Detailed Structure of Digit-Serial Logic Block
30The Basic Logic Module (LM)
Table of Functions Implemented
The Structure of the LM
31Examples of Implementations
N=4 Unsigned Multiplier
N=4 Signed Multiplier
Two N=2 Multipliers
Bit-Level Pipelined
32Area Comparison with Xilinx 4000 Series
33Mixed-Grain Logic Block Architecture
- Exploits the adder inverting property
- Efficiently implements both datapath and random
logic in the same logic block design
34Adder Inverting Property
Full Adder and Equations Showing The Inverting
Property
An optimal structure derived from the property
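The equations on this slide are not reproduced in the text, so the sketch below assumes the property meant is the usual self-duality of the full adder: inverting all three inputs inverts both the sum and the carry output. The check is exhaustive over the eight input combinations; the function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, cin):
    """Return (sum, carry_out) of a one-bit full adder."""
    return a ^ b ^ cin, (a & b) | (b & cin) | (a & cin)


def holds_inverting_property():
    """Check that FA(not a, not b, not cin) == (not sum, not carry_out) for all inputs."""
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        s_inv, cout_inv = full_adder(1 - a, 1 - b, 1 - cin)
        if (s_inv, cout_inv) != (1 - s, 1 - cout):
            return False
    return True


if __name__ == "__main__":
    print(holds_inverting_property())
```

If this is the property the slide relies on, it means an adder stage can work on inverted signals without extra logic, which leaves room to share LUT bits between datapath and random-logic modes.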
35LUT Bits Utilization in Datapath and Logic Modes
36Structure of a Single Slice
37Complete Logic Block
38Modified ALU Like Functionality
39Comparison Results
40Comparison Results (cont.)
41Comparison Results (cont.)
42Coarser ALU Like Architectures
43CHESS Architecture
44CHESS ALU Based Logic Block
45Structure of a Switch Box
46Comparison Results
47Computation Field Programmable Architecture
- A heterogeneous architecture with clusters of
datapath logic blocks
- Separate LUT-based logic blocks for supporting
random logic mapping
- Basic logic block called a Partial Adder
Subtraction Multiplier (PASM) Module
48PASM Logic Block of CFPA
49Cluster of PASM Logic Blocks
50Comparison Results
51Some Industry Architecture Designs
52Altera APEX II Logic Element
53Altera MAX II Logic Element
54LE Configuration in Arithmetic Mode
55LE in Random Logic Implementation
56Altera Stratix Logic Element
57Altera Stratix II Architecture
58Stratix II Adaptive Logic Module
59Stratix II ALM in Arithmetic Mode
60Various Configurations in an ALM of Stratix II
61Multiplier Resources in Stratix II
62Structure of a DSP Block in Stratix II
63Xilinx Virtex II Pro Architecture
64Basic Logic Element of Virtex II Pro
65Dedicated Multipliers in Virtex II Pro
66Processor-Programmable Logic Coupled Architecture
67PiCoGA Architecture Coupled with a VLIW processor
68PiCoGA Logic Block
69Conclusions
- Traditional general purpose FPGAs are inefficient for
datapath mapping
- Logic blocks with DSP specific enhancements seem
a promising solution
- Coarse-grained logic can achieve better
application mapping for datapaths but sacrifices
flexibility
- Dedicated blocks (multipliers) increase
performance but also increase cost significantly
70Conclusions
- PDSPs with embedded FPGA can achieve a good
balance between performance and power consumption
- So which approach is the best? No single best
exists
71Suitability of Approaches
- Highly computationally intensive applications
with large amounts of parallelism can use
platform FPGAs where often large resources are
required and power consumption is not an issue.
- Here cost/function will be lowest
72Suitability of Approaches
- Field Programmable Logic based coprocessors can
benefit from coarse grained blocks where most
control functions are implemented by the PDSP
itself
73Suitability of Approaches
- Higher flexibility and lower cost can be achieved
with logic blocks with DSP specific enhancements
but with the flexibility to implement control logic in an
efficient manner.