Title: Intro
1Presentation 12: MAD MAC 525
Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4)
W2
Design Manager: Zack Menegakis
26th April, 2006: Short Final Presentation
Project Objective: Design a crucial part of a GPU called the Multiply Accumulate Unit (MAC), which will revolutionize graphics.
2Agenda
- Marketing (Jigar)
- Project Description (Farhan)
- Algorithmic Description (Farhan)
- Design Process (Sonali)
- Floorplan Evolution (Sonali)
- Layout (Avni)
- Design Specifications (Avni)
- Conclusion (Jigar)
3MARKETING
- Application of product: HDR rendering in gaming graphics
- Why HDR? Used in games like Far Cry
- Optimization for speed (chose this because of market)
- Competition: if we enter the market, possible barriers to entry
4MAD MAC and HDR
- What is HDR?
- Show animation explaining concept
5MAD MAC and HDR
- MAD MAC accelerates FP16 blending to enable true HDR graphics
- What is HDR?
- HDR: High Dynamic Range
- Dynamic range is defined as the ratio of the largest value of a signal to the lowest measurable value
- Dynamic range of luminance in real-world scenes can be 100,000:1
- With HDR rendering, pixel intensities are allowed to extend beyond the 0..1 range of traditional graphics
- Nature isn't clamped to 0..1 and neither should CG be
- In lay terms
- Bright things can be really bright
- Dark things can be really dark
- And the details can be seen in both
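The dynamic-range definition above can be worked through numerically. The sketch below compares an 8-bit integer channel with the FP16 normal range against the 100,000:1 scene figure from the slide. Treating an 8-bit channel as 255:1 and ignoring FP16 denormals are assumptions made for illustration, and `dynamic_range` is a hypothetical helper name.

```python
import math

def dynamic_range(largest: float, smallest: float) -> float:
    """Ratio of the largest value to the lowest measurable value."""
    return largest / smallest

# Assumption: an 8-bit channel resolves codes 1..255.
ldr_8bit = dynamic_range(255.0, 1.0)

# FP16 normal numbers: largest finite value 65504, smallest normal 2**-14.
fp16_normal = dynamic_range(65504.0, 2.0 ** -14)

# Real-world luminance figure quoted on the slide.
scene = 100_000.0

for name, ratio in [("8-bit", ldr_8bit), ("scene", scene), ("FP16", fp16_normal)]:
    # log2 of the ratio gives the range in photographic "stops".
    print(f"{name:>6}: {ratio:,.0f}:1  ({math.log2(ratio):.1f} stops)")
```

Under these assumptions the 8-bit range falls short of the scene range, while the FP16 normal range exceeds it.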
7PROJECT DESCRIPTION
- Multiply Accumulate unit (MAC)
- Executes the function A×B+C on 16-bit floating point inputs. Inputs will be in OpenEXR format.
- Multiply and add in parallel to greatly speed up operation
- Rounding is performed only once, so accuracy is greater than with individual multiply and add functions.
- Also known as
- Fused Multiply Add (FMA)
- Multiply Add (MAD/MADD) in graphics shader programs
- Many applications benefit from a fast FMA
- Graphics: HDR rendering, blending and shader ops
- DSPs: computing vector dot-products in digital filters
- Fast division, square root: eliminates extra hardware
- Available in many newer CPUs and DSPs because it's so cool
- One ring (circuit) to rule them all!
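The single-rounding claim can be illustrated in software. The sketch below emulates FP16 rounding with the half-precision format of Python's `struct` module and compares a separate multiply and add (two roundings) with a fused one (one rounding). The function names are hypothetical. The double-precision intermediate is exact for the operands chosen here, which is an assumption of the demo and does not hold for every input.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def multiply_then_add(a: float, b: float, c: float) -> float:
    """Separate units: the product is rounded, then the sum is rounded."""
    return to_fp16(to_fp16(a * b) + c)

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Fused unit: a*b + c is rounded to FP16 only once."""
    # Assumption: the double-precision intermediate is exact for these operands.
    return to_fp16(a * b + c)

a = to_fp16(1 + 2 ** -5)   # 1.03125
b = to_fp16(1 + 2 ** -6)   # 1.015625
c = to_fp16(-1.0)

print("separate:", multiply_then_add(a, b, c))
print("fused:   ", fused_multiply_add(a, b, c))
```

With these operands the exact product needs one more fraction bit than FP16 holds, so the separate version loses it before the subtraction while the fused version keeps it.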
8ALGORITHMIC DESCRIPTION
- Step through entire process
- Multiply and align occur concurrently: always align C to A×B
- Outputs go to adder, normalize, round, overflow checker and output register
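The flow above (multiply, align C to the product, add, normalize, round, overflow check) can be sketched as a behavioral model. This is not the team's circuit: it uses unbounded Python integers where hardware would use a fixed-width aligner, and `bit_length()` stands in for the leading-zero logic. It accepts only normal FP16 inputs, in line with the decision elsewhere in the deck to skip denormal arithmetic. All function names are hypothetical.

```python
import struct

BIAS, FRAC_BITS = 15, 10

def float_to_bits(x: float) -> int:
    return struct.unpack("<H", struct.pack("<e", x))[0]

def bits_to_float(bits: int) -> float:
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def unpack_fp16(bits: int):
    """Split 16 bits into sign, biased exponent and 11-bit significand."""
    sign, exp, frac = bits >> 15, (bits >> FRAC_BITS) & 0x1F, bits & 0x3FF
    if exp in (0, 31):
        raise ValueError("zero, denormal, infinity and NaN are not modeled")
    return sign, exp, frac | (1 << FRAC_BITS)  # restore the hidden 1

def mac_fp16(a_bits: int, b_bits: int, c_bits: int) -> int:
    sa, ea, ma = unpack_fp16(a_bits)
    sb, eb, mb = unpack_fp16(b_bits)
    sc, ec, mc = unpack_fp16(c_bits)

    # Multiply significands; each value is m * 2**q.
    prod = ma * mb
    q_prod = (ea - BIAS - FRAC_BITS) + (eb - BIAS - FRAC_BITS)
    q_c = ec - BIAS - FRAC_BITS

    # Align both operands to the smaller scale (exact, no bits dropped).
    q = min(q_prod, q_c)
    prod <<= q_prod - q
    mc <<= q_c - q

    # Signed add/subtract.
    total = (-prod if sa ^ sb else prod) + (-mc if sc else mc)
    sign, mag = (1 if total < 0 else 0), abs(total)
    if mag == 0:
        return 0  # exact cancellation gives +0

    # Normalize to 11 bits, rounding to nearest even exactly once.
    shift = mag.bit_length() - (FRAC_BITS + 1)
    if shift > 0:
        dropped, half = mag & ((1 << shift) - 1), 1 << (shift - 1)
        mag >>= shift
        if dropped > half or (dropped == half and mag & 1):
            mag += 1
            if mag >> (FRAC_BITS + 1):  # rounding carried out of the top bit
                mag >>= 1
                shift += 1
    else:
        mag <<= -shift

    # Overflow / underflow check on the biased exponent.
    exp = q + shift + BIAS + FRAC_BITS
    if exp >= 31:
        raise OverflowError("result exceeds the FP16 range")
    if exp <= 0:
        raise ArithmeticError("result is denormal, which is not modeled")
    return (sign << 15) | (exp << FRAC_BITS) | (mag & 0x3FF)

result = mac_fp16(float_to_bits(2.0), float_to_bits(3.0), float_to_bits(1.0))
print(bits_to_float(result))
```

Keeping the aligned sum exact until the final step is what lets the model round only once; a hardware aligner would achieve the same effect with a bounded shifter and extra guard bits.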
9Block Diagram
[Figure: block diagram. Labeled blocks: three 16-bit Inputs, RegArray A, RegArray B, RegArray C, Multiplier, Exp Calc, Align, Control Logic / Sign Dtrmin, Leading 0 Anticipator, Adder/Subtractor, Normalize, Round, Ovf Checker, Reg Y, 16-bit Output. Bus-width labels in the figure: 1, 4, 5, 10, 14, 15, 16, 22, 35, 36.]
10IMPLEMENTATION
- Implementation of each module: how and why we chose a particular method, keeping in mind the goal of speed (multiplier, adder)
11Design Decisions (contd.)
- Multiplier Implementation
- 11 x 11 Carry-Save Multiplier
- Reasons
- Fast because it avoids having ripple carry in every stage
- Enables compact layout
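The carry-save idea can be shown with word-level bit operations. In this sketch each partial product is folded into a (sum, carry) pair by a 3:2 compressor, so no carry ripples between rows, and a single carry-propagate addition happens at the end. It models the arithmetic only, not the team's array layout, and the function names are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compressor on whole words: x + y + z == s + c, with no carry ripple."""
    s = x ^ y ^ z                             # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1    # per-bit carry, moved up one place
    return s, c

def carry_save_multiply(a: int, b: int, width: int = 11) -> int:
    """Multiply two unsigned `width`-bit significands using carry-save rows."""
    assert 0 <= a < (1 << width) and 0 <= b < (1 << width)
    s, c = 0, 0
    for i in range(width):
        partial = (a << i) if (b >> i) & 1 else 0   # one partial-product row
        s, c = carry_save_add(s, c, partial)
    return s + c   # the only carry-propagate addition

print(carry_save_multiply(2047, 2047))
```

The final `s + c` is where a fast adder pays off, since it is the only step in which carries travel across the full word.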
12Design Process
- Verilog -> Schematic -> Layout
- Behavioral -> Structural Verilog
- Transistors/gates -> Full Schematic
- Gate/Component Layout -> Top Level
- Transistor count fluctuated from 20,200 to 12,800
- Major design decisions
- Decided against implementing denormal arithmetic because it would increase the complexity of the project beyond the scope of the class
- Round performed only once at the end.
- Picked nPass over Tgate in the normalize shifter
- Adder: variable length carry select -> Han-Carlson binary tree adder
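A Han-Carlson adder is a parallel-prefix adder that runs Kogge-Stone style prefix stages on the odd bit positions only and then fixes up the even positions in one extra stage. The sketch below is a behavioral model of that structure, not the team's schematic; the function name and the default 16-bit width are assumptions made for illustration.

```python
def han_carlson_add(a: int, b: int, width: int = 16) -> int:
    """Add two unsigned `width`-bit numbers with a Han-Carlson prefix network."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]     # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]   # propagate
    G, P = g[:], p[:]

    # Stage 1: each odd bit merges with the even bit just below it.
    for i in range(1, width, 2):
        G[i] = g[i] | (p[i] & g[i - 1])
        P[i] = p[i] & p[i - 1]

    # Prefix stages on odd bits only, at distances 2, 4, 8, ...
    d = 2
    while d < width:
        prev_g, prev_p = G[:], P[:]
        for i in range(1, width, 2):
            if i - d >= 1:
                G[i] = prev_g[i] | (prev_p[i] & prev_g[i - d])
                P[i] = prev_p[i] & prev_p[i - d]
        d *= 2

    # Final stage: even bits take the carry from the odd bit below.
    for i in range(2, width, 2):
        G[i] = g[i] | (p[i] & G[i - 1])

    # G[i] is the carry out of bit i; sum bit i is p[i] XOR carry into bit i.
    carry_in = [0] + G[:-1]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carry_in[i]) << i
    return total | (G[width - 1] << width)   # carry out becomes the top bit

print(han_carlson_add(40000, 30000))
```

Restricting the prefix tree to odd bits roughly halves the prefix cells and wiring compared with a full Kogge-Stone tree, at the cost of one extra logic stage.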
13VERIFICATION OF DESIGN
- Verilog Simulations ( show outputs)
- Overview
- How/Why it works
- Behavioral/Structural
- Explain why we couldn't get a high-level simulator and how we tested our Verilog design.
14SCHEMATICS
- Show schematics of major blocks: adder, multiplier, and top-level
- HOW WE VERIFIED: analog simulation
15Top Level Schematic
16Multiplier Schematic
17Adder Schematic
18FLOORPLAN EVOLUTION
- Initial floorplan
- How it evolved (with animation)- why and how we
changed it
19Main Floorplan
[Figure: main floorplan. Labeled blocks: Reg A, Reg B, Reg C, Multiplier, Exp Calc, Align C, Adder, Ld Zero, Normalize, Round, Reg Y, and three Pipeline Reg blocks.]
20Floorplan
21Full Chip Layout
[Figure: full chip layout. Labeled regions: Exponent, Multiplier, Zero, Align, Adder, Ovf, Normalize, Round.]
22Pipelining
- Initially planned 5-6 pipeline stages
- Reduced to 4 pipeline stages, made possible by implementing fast carry-lookahead adders in the critical-path modules (adder and multiplier)
23Pipelining Stages
[Figure: pipelining stages. Labeled blocks: Reg A, Reg B, Reg C, Multiplier, Exp Calc, Align C, Adder, Ld Zero, Normalize, Round, Overflow checker, Reg Y, and five Pipeline Reg blocks.]
24LAYOUT
- Final Layout
- Layout of large blocks such as multiplier, adder
and normalize
25Layout Decisions
- 3 standard cell heights
- Uniform width vdd and ground rails
- Wider vdd and ground rails in power hungry modules
- Max of 8 flip flops per clock pulse generator
- Metal directionality
26Multiplier Layout with pipelining
27Adder Layout
28Normalize Layout
29FINAL LAYOUT
30Design Specifications
- Worst case delay: 2.25 ns
- Long buses are all buffered (not tested yet)
- Estimated clocking speed: 400 MHz
- Height by width: 193.86 µm × 301.545 µm
- Area: 58,458 µm²
- Aspect ratio: 1:1.55
- Total transistor density: 0.22 transistors/µm²
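These figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes area, aspect ratio and transistor density from the quoted dimensions, taking the transistor total of 12,730 from the per-block table later in the deck, and derives the clock ceiling implied by the worst-case delay. Reading the 400 MHz estimate as that ceiling with some margin is an assumption.

```python
height_um, width_um = 193.86, 301.545   # quoted dimensions
worst_case_delay_ns = 2.25              # quoted worst-case delay
transistors = 12_730                    # total from the per-block table

area_um2 = height_um * width_um
aspect_ratio = width_um / height_um     # expressed as 1 : aspect_ratio
density = transistors / area_um2        # transistors per square micron
f_max_mhz = 1e3 / worst_case_delay_ns   # 1 / delay, converted to MHz

print(f"area          {area_um2:,.0f} um^2")
print(f"aspect ratio  1:{aspect_ratio:.3f}")
print(f"density       {density:.2f} transistors/um^2")
print(f"clock ceiling {f_max_mhz:.0f} MHz")
```

The recomputed area and density agree with the quoted values, and the ceiling set by the worst-case delay lies above the 400 MHz estimate.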
31Layout densities
- Active: 14.05%
- Poly: 9.25%
- Metal 1: 33.89%
- Metal 2: 18.00%
- Metal 3: 14.99%
- Metal 4: 6.29%
32Layer Masks - Poly
33Layer Masks Metal 1
34Layer Masks Metal 2
35Layer Masks Metal 3
36Layer Masks Metal 4
37
Block                    Schematic Power mW (350 MHz)   Layout Power mW   Schematic Delay   Layout Delay
Multiplier               2.97                           N/A               3.38 ns           N/A
Multiplier w/ pipeline   ??                             ??                1.9 ns            2.25 ns
Exponents                1.608                          2.21              1.01 ns           1.2 ns
Align                    0.094                          0.113             480 ps            637 ps
Adder                    8.48                           9.73              1.34 ns           1.7 ns
Leading 0                0.232                          0.857             506 ps            551 ps
Normalize                1.458                          1.546             407 ps            437 ps
Round                    0.631                          1.21              864 ps            986 ps
OvfCheck                 0.13                           0.19              453 ps            475 ps
Registers                ??                             ??                179 ps            193 ps
Total                    ??                             ??                -                 -
38
Block                    Area (µm²)   Transistor Count   Transistor Density
Multiplier w/ pipeline   20,388       4,496              0.22
Exponents                5,163        738                0.14
Align                    3,995        500                0.13
Adder                    13,202       3,174              0.24
Leading 0                1,253        364                0.29
Normalize                3,190        942                0.3
Round                    1,802        494                0.28
OvfCheck                 200          70                 0.35
Registers, etc           N/A          1,948              N/A
Total                    58,458       12,730             0.22
39Conclusion
- More marketing
- Summarize chip functionality
- Extending applications of chip
40Comments?