Provided by: sonali3

Learn more at: https://course.ece.cmu.edu

1
Presentation 12 MAD MAC 525
Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4)
W2
Design Manager: Zack Menegakis
26th April, 2006, Short Final Presentation
Project Objective: Design a crucial part of a GPU called the Multiply Accumulate Unit (MAC), which will revolutionize graphics.
2
Agenda

Marketing (Jigar)
Project Description (Farhan)
Algorithmic Description (Farhan)
Design Process (Sonali)
Floorplan Evolution (Sonali)
Layout (Avni)
Design Specifications (Avni)
Conclusion (Jigar)

3
MARKETING

Application of product: HDR rendering in gaming graphics
Why HDR? Used in games like Far Cry
Optimized for speed (chosen because of the market)
Competition: possible barriers to entry if we enter the market

4
MAD MAC and HDR

What is HDR?
Show animation explaining concept

5
MAD MAC and HDR

MAD MAC accelerates FP16 blending to enable true HDR graphics
What is HDR?
HDR = High Dynamic Range
Dynamic range is defined as the ratio of the largest value of a signal to the lowest measurable value
Dynamic range of luminance in real-world scenes can be 100,000:1
With HDR rendering, pixel intensities are allowed to extend beyond the 0..1 range of traditional graphics
Nature isn't clamped to 0..1 and neither should CG be
In lay terms:
Bright things can be really bright
Dark things can be really dark
And the details can be seen in both
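
To put the dynamic-range definition above into numbers, here is a minimal sketch that compares a 16-bit floating-point channel with a traditional 8-bit channel. It assumes the FP16 values follow the IEEE 754 half-precision layout (1 sign, 5 exponent, 10 fraction bits); the function and constant names are illustrative and not taken from the presentation.

```python
import struct

def fp16_from_bits(bits):
    """Interpret a 16-bit pattern as an IEEE 754 half-precision value (assumed layout)."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

FP16_MAX = fp16_from_bits(0x7BFF)         # largest finite half-precision value
FP16_MIN_NORMAL = fp16_from_bits(0x0400)  # smallest normal half-precision value

# Dynamic range = largest value / smallest measurable value.
fp16_range = FP16_MAX / FP16_MIN_NORMAL
ldr_range = 255 / 1  # 8-bit channel: largest code / smallest nonzero code

def blend_clamped(x, y):
    """Additive blend in a traditional pipeline: the result is clamped to 0..1."""
    return min(1.0, x + y)

def blend_hdr(x, y):
    """Additive blend in an HDR pipeline: intensities may exceed 1."""
    return x + y

print(f"FP16 dynamic range is about {fp16_range:.3g}:1")
print(f"8-bit dynamic range is {ldr_range:.0f}:1")
print("clamped blend:", blend_clamped(0.9, 0.8), "HDR blend:", blend_hdr(0.9, 0.8))
```

Under these assumptions the half-precision range comfortably covers the 100,000:1 figure quoted for real-world luminance, while the clamped blend discards everything above 1.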

6
(No Transcript)
7
PROJECT DESCRIPTION

Multiply Accumulate unit (MAC)
Executes the function A×B+C on 16-bit floating-point inputs. Inputs will be in OpenEXR format.
Multiplies and adds in parallel to greatly speed up operation
Rounding is performed only once, so accuracy is greater than with individual multiply and add functions.
Also known as:
Fused Multiply Add (FMA)
Multiply Add (MAD/MADD) in graphics shader programs
Many applications benefit from a fast FMA:
Graphics: HDR rendering, blending and shader ops
DSPs: computing vector dot-products in digital filters
Fast division and square root: eliminates extra hardware
Available in many newer CPUs and DSPs because it's so cool
One ring (circuit) to rule them all!
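
The accuracy benefit of rounding only once can be illustrated with a small software model. This is a sketch, not the team's design: it assumes IEEE 754 half-precision rounding (round to nearest even, as done by Python's `struct` format `'e'`), and the helper names are hypothetical. For the inputs chosen here the double-precision intermediate is exact, so the only rounding steps are the ones shown.

```python
import struct

def round_to_fp16(x):
    """Round a Python float to the nearest half-precision value (assumed IEEE 754 binary16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def mul_then_add(a, b, c):
    """Separate multiply and add: the product is rounded, then the sum is rounded again."""
    return round_to_fp16(round_to_fp16(a * b) + c)

def fused_mul_add(a, b, c):
    """Fused multiply-add: the exact product plus c is rounded a single time."""
    return round_to_fp16(a * b + c)

# Inputs that are exactly representable in half precision.
a = 1 + 1 / 1024
b = 1 + 3 / 1024
c = -1.0

exact = a * b + c                 # exact in double precision for these inputs
separate = mul_then_add(a, b, c)
fused = fused_mul_add(a, b, c)

print("exact    :", exact)
print("separate :", separate, "error", abs(separate - exact))
print("fused    :", fused, "error", abs(fused - exact))
```

In this case the separately rounded product loses its low-order bits before the subtraction exposes them, so the fused result lands closer to the exact value.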

8
ALGORITHMIC DESCRIPTION

Step through entire process
Multiply and align occur concurrently: C is always aligned to A×B
Outputs go to the adder, normalize, round, overflow checker and output register
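
The multiply and align steps described above can be sketched in software as follows. This is an illustrative model only: it assumes half-precision operands with 1 sign, 5 exponent and 10 fraction bits, handles normal numbers only, and uses unbounded Python integers where the hardware would use a fixed-width alignment window. All function names are hypothetical.

```python
import math
import struct

def fp16_bits(x):
    """Return the 16-bit pattern of x rounded to half precision."""
    return struct.unpack('<H', struct.pack('<e', x))[0]

def decode(bits):
    """Split a pattern into (sign, exponent of the significand LSB, 11-bit significand)."""
    sign = bits >> 15
    biased_exp = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if not 0 < biased_exp < 31:
        raise ValueError("sketch handles normal numbers only")
    sig = (1 << 10) | frac               # restore the hidden leading 1
    return sign, biased_exp - 15 - 10, sig   # value = (-1)**sign * sig * 2**exp

def multiply_align_add(a_bits, b_bits, c_bits):
    """Return (total, scale) with the exact result equal to total * 2**scale."""
    sa, ea, ma = decode(a_bits)
    sb, eb, mb = decode(b_bits)
    sc, ec, mc = decode(c_bits)
    prod_sig = ma * mb                   # 11 x 11 bits gives up to 22 bits
    prod_exp = ea + eb                   # exponent calculation for the product
    prod_sign = sa ^ sb
    shift = ec - prod_exp                # alignment of C relative to the product
    if shift >= 0:
        c_sig, scale = mc << shift, prod_exp
    else:
        prod_sig, c_sig, scale = prod_sig << -shift, mc, ec
    total = (-prod_sig if prod_sign else prod_sig) + (-c_sig if sc else c_sig)
    return total, scale

total, scale = multiply_align_add(fp16_bits(1.5), fp16_bits(2.5), fp16_bits(-0.75))
print(math.ldexp(total, scale))  # unrounded result of 1.5 * 2.5 - 0.75
```

The integer `total` is what the normalize and round stages would then reduce back to an 11-bit significand; those stages are left out of this sketch.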

9
Block Diagram
[Figure: block diagram. Three 16-bit inputs feed RegArray A, RegArray B and RegArray C. Blocks shown: Multiplier, Exp Calc, Align, Control Logic / Sign Dtrmin, Leading 0 Anticipator, Adder/Subtractor, Normalize, Round, Ovf Checker and Reg Y, producing a 16-bit output. Bus-width labels in the figure: 1, 4, 5, 10, 14, 15, 16, 22, 35, 36.]
10
IMPLEMENTATION

Implementation of each module: how and why we chose a particular method, keeping in mind the goal of speed (multiplier, adder)

11
Design Decisions (contd.)

Multiplier Implementation
11 x 11 Carry-Save Multiplier
Reasons
Fast because it avoids having ripple carry in
every stage
Enables Compact Layout
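
The reason a carry-save multiplier avoids a ripple carry in every stage can be shown with a short bit-level model. This is a sketch of the general carry-save technique, not the team's circuit; the function names and the row-by-row accumulation order are assumptions made for illustration.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: reduce three numbers to a sum word and a carry word.

    Every bit position is computed independently, so no carry ripples
    across the word; x + y + z == partial_sum + carry always holds.
    """
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carry

def carry_save_multiply(a, b, width=11):
    """Multiply two unsigned width-bit numbers by carry-save accumulation."""
    sum_word, carry_word = 0, 0
    for i in range(width):
        # Partial product for bit i of b (a shifted left, or zero).
        partial_product = (a << i) if (b >> i) & 1 else 0
        sum_word, carry_word = carry_save_add(sum_word, carry_word, partial_product)
    # Only this final step needs a carry-propagate adder.
    return sum_word + carry_word

print(carry_save_multiply(1025, 1027))  # two 11-bit significands
```

Only the last addition propagates carries, which is where a fast adder pays off.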

12
Design Process

Verilog -> Schematic -> Layout
Behavioral -> Structural Verilog
Transistors/gates -> Full Schematic
Gate/Component Layout -> Top Level
Transistor count fluctuated from 20,200 to 12,800
Major design decisions:
Decided against implementing denormal arithmetic because it would increase the complexity of the project beyond the scope of the class
Rounding performed only once, at the end
Picked nPass over Tgate in the normalize shifter
Adder: variable-length carry select -> Han-Carlson binary tree adder
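
For readers unfamiliar with the Han-Carlson structure named above, here is a behavioral sketch of a Han-Carlson parallel-prefix adder. It models the textbook arrangement (one Brent-Kung-style stage pairing each odd bit with its even neighbor, Kogge-Stone stages on the odd bits, and a final stage that fills in the even bits); it is not derived from the team's schematic, and the function name and default width are assumptions.

```python
def han_carlson_add(a, b, width=16):
    """Add two unsigned width-bit numbers; return (sum mod 2**width, carry_out)."""
    p0 = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # bit propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]   # bit generate
    p = p0[:]

    def combine(hi, lo):
        """Prefix operator: merge a (generate, propagate) pair with the span below it."""
        return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]

    # Stage 1: each odd bit absorbs its even neighbor.
    for i in range(1, width, 2):
        g[i], p[i] = combine((g[i], p[i]), (g[i - 1], p[i - 1]))

    # Kogge-Stone stages restricted to the odd bit positions.
    d = 2
    while d < width:
        old_g, old_p = g[:], p[:]
        for i in range(1, width, 2):
            if i - d >= 1:
                g[i], p[i] = combine((old_g[i], old_p[i]), (old_g[i - d], old_p[i - d]))
        d *= 2

    # Final stage: even bits take the completed prefix from the odd bit below.
    for i in range(2, width, 2):
        g[i], p[i] = combine((g[i], p[i]), (g[i - 1], p[i - 1]))

    # g[i] is now the carry out of bit i; the carry into bit 0 is 0.
    carries = [0] + g[:-1]
    total = sum((p0[i] ^ carries[i]) << i for i in range(width))
    return total, g[width - 1]

print(han_carlson_add(40000, 30000))  # 16-bit add that overflows
```

Compared with a full Kogge-Stone tree, this arrangement roughly halves the number of prefix cells at the cost of one extra logic level.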

13
VERIFICATION OF DESIGN

Verilog simulations (show outputs)
Overview
How/why it works
Behavioral/Structural
Explain why we couldn't get a high-level simulator and how we tested our Verilog design.

14
SCHEMATICS

Show schematics of major blocks: adder, multiplier, and top-level
How we verified: analog simulation

15
Top Level Schematic
16
Multiplier Schematic
17
Adder Schematic
18
FLOORPLAN EVOLUTION

Initial floorplan
How it evolved (with animation): why and how we changed it

19
Main Floorplan
[Figure: main floorplan. Labeled blocks: Reg A, Reg B, Reg C, Multiplier, Exp Calc, Align C, Adder, Ld Zero, Normalize, Round and Reg Y, with three pipeline registers.]
20
Floorplan
21
Full Chip Layout
[Figure: full chip layout. Labeled regions: Exponent, Multiplier, Zero, Align, Adder, Ovf, Normalize, Round.]
22
Pipelining

Initially planned 5-6 pipeline stages
Reduced to 4 pipeline stages, made possible by implementing fast carry-lookahead adders in the critical-path modules (adder and multiplier)

23
Pipelining Stages
[Figure: pipeline stage diagram. Labeled blocks: Reg A, Reg B, Reg C, Multiplier, Exp Calc, Align C, Adder, Ld Zero, Normalize, Round, Overflow checker and Reg Y, separated by five pipeline registers.]
24
LAYOUT

Final Layout
Layout of large blocks such as multiplier, adder
and normalize

25
Layout Decisions

3 standard cell heights
Uniform-width Vdd and ground rails
Wider Vdd and ground rails in power-hungry modules
Max of 8 flip-flops per clock pulse generator
Metal directionality

26
Multiplier Layout with pipelining
27
Adder Layout
28
Normalize Layout
29
FINAL LAYOUT
30
Design Specifications

Worst-case delay: 2.25 ns
Long buses are all buffered (not tested yet)
Estimated clock speed: 400 MHz
Height × width: 193.86 µm × 301.545 µm
Area: 58,458 µm²
Aspect ratio: 1:1.55
Total transistor density: 0.22 transistors/µm²
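
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the presentation; the total transistor count of 12,730 is taken from the per-module table on a later slide, and the variable names are illustrative.

```python
height_um = 193.86
width_um = 301.545
worst_case_delay_ns = 2.25
transistor_count = 12730  # total from the per-module table later in the deck

area_um2 = height_um * width_um
aspect_ratio = width_um / height_um           # width relative to height
density = transistor_count / area_um2         # transistors per square micron
max_clock_mhz = 1000 / worst_case_delay_ns    # upper bound set by the worst-case delay

print(f"area          = {area_um2:,.0f} um^2")
print(f"aspect ratio  = 1:{aspect_ratio:.3f}")
print(f"density       = {density:.2f} transistors/um^2")
print(f"max clock     = {max_clock_mhz:.0f} MHz")
```

The 2.25 ns worst-case delay bounds the clock at roughly 444 MHz, so the quoted 400 MHz estimate leaves some margin.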

31
Layout densities

Active: 14.05%
Poly: 9.25%
Metal 1: 33.89%
Metal 2: 18.00%
Metal 3: 14.99%
Metal 4: 6.29%

32
Layer Masks - Poly
33
Layer Masks - Metal 1
34
Layer Masks - Metal 2
35
Layer Masks - Metal 3
36
Layer Masks - Metal 4
37
Module                     Schematic Power, mW (350 MHz)   Layout Power, mW   Schematic Delay     Layout Delay
Multiplier / w/ pipeline   2.97 / ??                       N/A / ??           3.38 ns / 1.9 ns    N/A / 2.25 ns
Exponents                  1.608                           2.21               1.01 ns             1.2 ns
Align                      0.094                           0.113              480 ps              637 ps
Adder                      8.48                            9.73               1.34 ns             1.7 ns
Leading 0                  0.232                           0.857              506 ps              551 ps
Normalize                  1.458                           1.546              407 ps              437 ps
Round                      0.631                           1.21               864 ps              986 ps
OvfCheck                   0.13                            0.19               453 ps              475 ps
Registers                  ??                              ??                 179 ps              193 ps
Total                      ??                              ??                 -                   -
38
Module                    Area (µm²)   Transistor Count   Transistor Density (per µm²)
Multiplier w/ pipeline    20,388       4,496              0.22
Exponents                 5,163        738                0.14
Align                     3,995        500                0.13
Adder                     13,202       3,174              0.24
Leading 0                 1,253        364                0.29
Normalize                 3,190        942                0.3
Round                     1,802        494                0.28
OvfCheck                  200          70                 0.35
Registers, etc            N/A          1,948              N/A
Total                     58,458       12,730             0.22
39
Conclusion