1
Design of Power-Efficient Floating-Point Adder
Blocks
  • Xiao Yan Yu
  • ACSEL Lab
  • University of California at Davis
  • May 29, 2007

2
Presentation Outline
  • Motivation
  • Research objective
  • Background
  • Newly developed high-performance floating-point
    adder: POWER6 floating-point adder
  • Newly developed medium-performance floating-point
    adder
  • Directions in future adder design
  • Conclusion

3
Motivation
  • Adders designed using dynamic logic can consume
    significantly more power in 65nm technology. The
    current design trend favors static circuits to
    save power and uses dynamic circuits only when
    necessary.
  • For high-performance applications, static
    circuits do not operate as fast as dynamic
    circuits in the power-performance space. Special
    techniques are needed.
  • For medium-performance applications, sparse tree
    implementations provide significant power savings.
    However, current state-of-the-art sparse tree
    designs do not meet our stringent power and area
    constraints. A new sparse tree design is needed.

4
Research Objective
  • This research provides guidelines on how to
    design power-efficient floating-point adders for
    use in high performance and medium performance
    multiply-add fused dataflow.
  • It targets one type of floating-point addition,
    the end-around carry addition.

5
Background of Floating Point Add
  • Multiply-add fused dataflow performs
    $T = B + A \times C$, where B is the addend, A is
    the multiplicand, and C is the multiplier.
  • Input operands of the adder come from the outputs
    of the last 3:2 compressor, which compresses the
    sum and carry from the multiplier tree and the
    addend.
  • The magnitude of the operands is not known prior
    to addition.
  • Floating-point operation is a sign-magnitude
    operation. The adder needs to produce a magnitude
    result during 2's complement subtraction.
  • Case 1: If operand A > B,
    $|A - B| = A - B = (A + \overline{B} + 1)$
  • Case 2: If operand B > A,
    $|A - B| = B - A = -(A - B) \Rightarrow -(A + \overline{B}) - 1 = \overline{(A + \overline{B} + 0)}$
  • During 2's complement subtraction of A - B, the
    final carry-out, Cout, is 1 when A > B and 0 when
    B > A, so Cout determines whether it is case 1 or
    2.
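
As a rough illustration of the two cases above, the sketch below computes the magnitude of A - B for unsigned operands and uses the carry-out to pick the case. The function name `magnitude_subtract` and the `width` parameter are assumptions made for this example, not names from the presentation.

```python
def magnitude_subtract(a: int, b: int, width: int) -> tuple[int, int]:
    """Return (|a - b|, cout) for unsigned `width`-bit operands."""
    mask = (1 << width) - 1
    not_b = ~b & mask               # bitwise complement of B
    total = a + not_b + 1           # A + ~B + 1, the 2's complement A - B
    cout = total >> width           # final carry-out
    if cout:
        # Case 1: the sum is already the magnitude.
        # (In this sketch A == B also lands here and yields 0.)
        return total & mask, cout
    # Case 2: B > A, so invert A + ~B + 0 to get B - A.
    return ~(a + not_b) & mask, cout
```

For example, `magnitude_subtract(3, 5, 4)` takes the case 2 path and `magnitude_subtract(5, 3, 4)` takes the case 1 path; both return a magnitude of 2.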

6
End-Around Carry Computation cont.
The figure shows an abstraction of end-around carry
computation during subtraction.
7
End-Around Carry Addition
  • Assume the adder is divided into four groups.
    During subtraction, the carry for each group can
    be expressed as
  • During addition P3 is set to 0 and conventional
    carries are computed.
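
A minimal sketch of the subtraction case follows. It assumes the usual wrap-around expansion, in which the carry into each group is built from the generate (G) and propagate (P) signals of the groups before it, continuing past group 0 back to the most significant group. The function name, the 16-bit default width, and the group numbering (group 0 least significant) are assumptions for illustration; the addition case is not modeled.

```python
def end_around_carries(a: int, b: int, width: int = 16, groups: int = 4) -> list[int]:
    """Carry into each group when computing a + ~b with cin tied to cout."""
    size = width // groups
    not_b = ~b & ((1 << width) - 1)
    gmask = (1 << size) - 1
    G, P = [], []
    for i in range(groups):
        s = ((a >> (i * size)) & gmask) + ((not_b >> (i * size)) & gmask)
        G.append(s >> size)          # group generate: carry out with cin = 0
        P.append(int(s == gmask))    # group propagate: every bit propagates
    carries = []
    for i in range(groups):
        carry, prop = 0, 1
        for k in range(1, groups + 1):
            j = (i - k) % groups     # walk backwards, wrapping past group 0
            carry |= prop & G[j]
            prop &= P[j]
        carries.append(carry)
    return carries
```

With four groups, the first entry expands to G3 + P3·G2 + P3·P2·G1 + P3·P2·P1·G0, which is also the end-around carry itself.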

8
  • High Performance Adder

9
High Performance Adder Design Issues
  • The adder is required to operate at the higher
    end of the multi-gigahertz range inside a unit.
    Hence, the overall performance matters, not the
    stand-alone performance.
  • Currently, only dynamic adders can achieve this
    performance. However, the power consumed by
    dynamic adders cannot be tolerated in current
    high-performance applications, so only static
    circuits can be used for implementation.
  • Conventional static adders cannot achieve such
    high performance.

10
High Performance Adder Design Solution
  • Adder partitioning can be used to boost overall
    performance. The adder can be partitioned to fit
    a particular floating-point pipeline design.
  • Placement optimization is performed on these
    partitions to ensure the lowest communication
    overhead.
  • Cell stacking can be used to shorten wires on the
    critical paths.
  • An adder tree that is balanced according to its
    critical path provides higher performance than
    conventional ones.

11
  • High Performance Adder Design Example
  • 128-bit Binary Floating-Point Adder for POWER6
    Processor

12
Key Features of the POWER6 BFU adder
  • Fabricated in IBM's 65nm SOI technology
  • It is realized in a 7-cycle multiply-add pipeline
  • Implementation uses all static circuits with
    nominal Vt devices
  • Adder is physically implemented as part of an
    O-shaped BFU floorplan
  • A non-uniformly sparse adder scheme was used
    based on the given wire resource to optimize
    performance.

13
Organization of POWER6 BFU Adder as Part of the
BFU Dataflow
14
Organization of POWER6 BFU Adder cont.
[Figure labels: final sum selection; sum and carries
from the last 3:2 compressor; p, g generation;
floating-point addition; end-around carry
computation]
15
POWER6 BFU Adder Block Diagram
16
Diagram of the 32-b block
Since $Carry1_i = Carry0_i \lor P_i$, where $Carry0_i$
is the carry when cin is 0 and $Carry1_i$ is the
carry when cin is 1, it follows that
$P_i \Rightarrow Carry1_i$, and $Carry1_i$ can be used
instead of $P_i$ on the non-critical paths.
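
The identity on this slide can be checked with a small sketch. The function `block_carries` and the use of unsigned integers to stand in for one adder block are assumptions for illustration; the block width is a free parameter rather than the 32 bits of the actual design.

```python
def block_carries(a: int, b: int, width: int) -> tuple[int, int, int]:
    """Return (carry0, carry1, p) for one `width`-bit block of a + b."""
    mask = (1 << width) - 1
    carry0 = (a + b) >> width        # carry out when cin = 0
    carry1 = (a + b + 1) >> width    # carry out when cin = 1
    p = int((a ^ b) == mask)         # block propagate: every bit propagates
    return carry0, carry1, p


def carry_out(carry0: int, select: int, cin: int) -> int:
    """Carry out of the block; `select` may be p or carry1."""
    return carry0 | (select & cin)
```

Because `carry1` equals `carry0 | p`, calling `carry_out(carry0, carry1, cin)` gives the same result as `carry_out(carry0, p, cin)`, which is why the substitution is safe.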
17
Cell Stacking Technique
18
Cell Stacking Technique cont.
19
Comparing with other designs
  • We have compared our design against the
    Ladner-Fischer adder (LFA) design and a prefix-2
    Kogge-Stone adder with sparseness of 8 (Sparse
    8).
  • All designs use only nominal Vt transistors.
  • The optimization points of each design are
    obtained by varying the power-performance
    tradeoff factor using EinsTuner with constrained
    input size.
  • The performance of each point is simulated using
    a transistor-level static timer, EinsTLT.
  • The average power dissipation of a design at each
    performance point is simulated using a power
    simulator, CPAM.
  • Each output is loaded with an equivalent
    capacitive load calculated at the unit level of
    the POWER6 BFU.

20
Power-Performance Result: Average Power vs.
Performance
21
Power-Performance Result: Leakage Power vs.
Performance
22
POWER6 BFU Adder Layout
[Figure labels: final sum selection; bitwise g, p
generation; complete end-around carry; partial
end-around carry and conditional sums]
23
  • Medium Performance Adder

24
Medium Performance Adder Design Issues
  • The adder operates at the lower end of the
    multi-gigahertz range with stringent power and
    area constraints. A sparse tree implementation
    provides a power-efficient solution for this
    performance region.
  • Contemporary sparse tree designs implement sparse
    trees with sparseness not exceeding 4. In designs
    with sparseness beyond 4, the ripple carry chain
    becomes critical. There is a need to investigate
    designs with sparseness beyond 4 that provide
    enough performance in this region.
  • To reduce power, high Vt optimization is
    traditionally performed on a design that is well
    tuned in nominal Vt. This does not provide enough
    power savings in this region. An alternative
    method is needed.

25
Medium Performance Adder Design Solution
  • To realize a sparse tree with high sparseness, a
    new structure can be used that employs local
    carry look-ahead blocks instead of ripple chains.
    With this approach, the conditional sum
    generation does not become critical when the
    sparseness is increased.
  • As an alternative to high Vt optimization for
    reducing power, a mixture of cell images in
    nominal Vt can be used. This will be
    demonstrated in our design example. This approach
    provides both area and power savings.
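
To make the sparse tree idea concrete, here is a behavioral sketch, not the circuit from the presentation. It assumes that carries are needed only at block boundaries (every 9th bit for sparseness 9) and that each block prepares two conditional sums, one per possible carry-in. The function name `sparse_add` is hypothetical, and the software loop passes the boundary carry from block to block, whereas a hardware tree would produce those carries in parallel. The local look-ahead logic inside each block is not modeled.

```python
def sparse_add(a: int, b: int, cin: int = 0,
               width: int = 108, sparseness: int = 9) -> int:
    """Add two `width`-bit values block by block using conditional sums."""
    mask = (1 << sparseness) - 1
    result, carry = 0, cin
    for lo in range(0, width, sparseness):
        x, y = (a >> lo) & mask, (b >> lo) & mask
        sum0 = x + y                 # conditional sum assuming carry-in 0
        sum1 = x + y + 1             # conditional sum assuming carry-in 1
        chosen = sum1 if carry else sum0   # select with the boundary carry
        result |= (chosen & mask) << lo
        carry = chosen >> sparseness       # carry into the next block
    return result & ((1 << width) - 1)
```

With the default arguments the 108-bit operands are split into 12 blocks of 9 bits, so only 12 boundary carries are needed instead of one per bit.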

26
  • Medium Performance Adder Design Example
  • 270ps 20mW 108-bit Floating Point Adder

27
Key Features of our 108-bit BFU adder
  • Implemented in IBM's 65nm SOI technology
  • It is part of a multiply-add fused dataflow and
    uses end-around carry technique
  • It implements sparse trees with sparseness of 9
  • A mixture of two different cell images is used
    in this design
  • Implementation uses all static circuits with
    nominal Vt devices

28
Sparse 9 BFU Adder Block Diagram
29
Diagram of 36-bit Lookahead Block
30
Cell Images Used in Sparse 9 BFU Adder
XOR cell that spans 9 tracks
XOR cell that spans 18 tracks
31
Comparison of different design approaches
  • The two-cell-image approach is compared with
    high Vt optimization. The adder is first
    implemented with only 18-track cells. For each
    optimization point of this 18-track-only adder,
    high Vt optimization is applied on its
    non-critical paths. The design with two cell
    images is created by replacing all the 18-track
    cells on the non-critical paths with 9-track
    cells.
  • The percentage of high Vt cells in the high Vt
    optimized design ranges from 34% at the highest
    performance point to 57% at the lowest
    performance point.

32
Comparison of different design approaches
33
Comparison of Sparse 9 with other designs
  • The sparse 9 design is compared against sparse 4
    and sparse 6 designs.
  • The organization of these adders is similar to
    that of our implementation. The difference lies
    in the schemes used inside the 36-bit CLA blocks
    and conditional sum blocks of each adder. Both
    the Sparse4 and Sparse6 designs use ripple-carry
    adders in their conditional sum blocks. Our
    design uses local CLA adders instead.
  • All designs use only nominal Vt transistors with
    the two-cell-image approach and without the
    result latch bank.
  • The assignments of cell images used in each block
    are the same for each adder. The numbers of
    9-track and 18-track cells used in each adder
    differ, however.
  • All critical wires are assumed to have good wire
    width and space.

34
Comparison of Sparse 9 with other designs
Final Design
35
Power Distributions of the Final Design
36
Sparse 9 BFU Adder Floorplan
37
Sparse 9 BFU Adder Layout
  • All components are manually placed and routed for
    minimal area.
  • All loads at the outputs of clock pulse
    generators are well balanced to minimize clock
    skew.
  • Total transistor count: 21,306
  • 9-track cell percentage: 70%
  • Number of metal layers used: 4

38
  • Directions in Future Adder Design

39
Directions in Future Adder Design
  • High-Performance Adders
  • The concept of the adder as a module fades away
    at very high frequency. A well-partitioned adder
    provides significant improvement in overall
    system performance. Future adders will continue
    to follow this trend.
  • Floorplan-aware adder optimization is needed to
    obtain optimal adder partitions. This requires
    optimizations at both the micro-architecture and
    floorplan levels.

40
Directions in Future Adder Design
  • Medium-Performance Adders
  • Sparse tree adders have been shown to have
    sufficient power efficiency. By adaptively
    adjusting the sparseness, a design can meet its
    stringent performance and power constraints.
  • We have also observed the effectiveness of
    designing an adder with two different cell
    images. Currently, the assignment of cell images
    has to be done manually. This could be
    implemented in tools that assign cell images
    automatically.

41
Conclusions
  • This research provided new ways to design
    performance-specific adders.
  • For high-performance applications, we have
    designed a fast 128-bit floating-point adder,
    which is implemented and fabricated as part of
    the POWER6 processor in IBM's 65nm SOI
    technology.
  • For medium-performance applications, we have
    designed a sparse tree adder with sparseness of 9
    that uses local CLA adders inside its conditional
    sum blocks. A two-cell-image design methodology
    that uses regular Vt transistors ensures low
    power and a compact layout.

42
Thank you for listening to my talk. Questions?