Title: NVIDIA GeForce
1NVIDIA GeForce
- Ryan Hendrixson
- Ryan Schubert
- Allison Walthall
2What Does a GPU Actually Do?
- Historically, GPUs have progressed from
- Acting simply as a frame buffer
- To doing vertex transformations and pixel color calculations
- Now even programmable
- In the simplest sense, a modern GPU implements a 3D rendering pipeline
33D Rendering Pipeline (direct illumination)
This is a pipelined sequence of operations to draw a 3D primitive into a 2D image
3D Geometric Primitives
Modeling Transformation
Lighting
Viewing Transformation
Projection Transformation
Clipping
Scan Conversion
Image
93D Rendering Pipeline (direct illumination)
3D Geometric Primitives
Modeling Transformation
- Transform into 3D world coordinate system
Lighting
- Illuminate according to lighting and reflectance
Viewing Transformation
- Transform into 3D camera coordinate system
Projection Transformation
- Transform into 2D screen coordinate system
Clipping
- Clip primitives outside the camera's view
Scan Conversion
- Draw pixels
Image
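The stages above can be sketched for a single vertex. This is only an illustration under stated assumptions, not how any particular GPU or API implements the pipeline: the function names, the translation-only modeling and viewing matrices, the perspective divide with focal length `d`, and the 640x480 viewport are all hypothetical choices made for brevity. Lighting is left out.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def translation(tx, ty, tz):
    """Build a 4x4 translation matrix (stand-in for a full model/view matrix)."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]


def project(v_cam, d=1.0):
    """Projection: perspective divide onto the plane z = -d (camera looks down -z)."""
    x, y, z, _ = v_cam
    return (d * x / -z, d * y / -z)


def inside_view(p):
    """Clipping test for a point: keep it if both coordinates lie in [-1, 1]."""
    return -1.0 <= p[0] <= 1.0 and -1.0 <= p[1] <= 1.0


def to_pixel(p, width=640, height=480):
    """Scan conversion for a point: map [-1, 1] to pixel indices (y flipped)."""
    px = int((p[0] + 1.0) * 0.5 * (width - 1))
    py = int((1.0 - p[1]) * 0.5 * (height - 1))
    return px, py


def render_vertex(v_obj, model, view):
    """Run one object-space vertex through the stages; None means clipped."""
    v_world = mat_vec(model, v_obj)   # modeling transformation
    v_cam = mat_vec(view, v_world)    # viewing transformation
    if v_cam[2] >= 0:                 # at or behind the camera
        return None
    p = project(v_cam)                # projection transformation
    if not inside_view(p):            # clipping
        return None
    return to_pixel(p)                # scan conversion
```

With the camera placed 4 units back (`view = translation(0, 0, -4)`), a vertex at the origin lands near the center of the image, while a vertex far off to the side is clipped.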
10Modern OpenGL Pipeline
[Diagram: the Application (CPU) sends Vertices (3D) to the GPU's Vertex Processor; Xformed, Lit Vertices (2D) go through Assembly & Rasterization to become Fragments (pre-pixels); the Pixel Processor outputs Final pixels (Color, Depth); Video Memory (Textures) supports Render-to-texture; Graphics State spans the pipeline]
- Programmable Vertex Processor
- Programmable Fragment (Pixel) Processor
11OpenGL vs. DirectX
OpenGL
- Just graphics
- Standard C interfaces
- State machine
- Multiple platforms
- Academic use
DirectX
- Graphics, multimedia, etc.
- C++ interfaces
- Object oriented
- Windows
- PC games
12Possible GPU Performance Bottlenecks
- CPU/Bus Bound
- Simply not able to send enough vertices to the card to keep it busy
- Vertex Bound
- Vertex processing engine is fully loaded, while the fragment engine is just waiting and grabbing data as soon as it's ready
- Pixel Bound
- The fragment engine is fully loaded, causing the vertex engine to have to wait before sending more data
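One way to read the three cases is that the pipeline runs at the rate of its slowest stage. The sketch below assumes each stage can be summarized by a single sustainable rate; the stage names and numbers are hypothetical and would have to be measured in practice.

```python
def find_bottleneck(rates):
    """Return (stage, rate) for the slowest stage.

    rates maps a stage name to the frame rate that stage could sustain
    if it were the only limit (hypothetical, measured externally).
    """
    stage = min(rates, key=rates.get)
    return stage, rates[stage]


# Example: the fragment engine is the limit, so this workload is pixel bound.
example = {"cpu/bus": 120.0, "vertex": 90.0, "pixel": 45.0}
stage, fps = find_bottleneck(example)
```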
13Early History
- NVIDIA founded in 1993
- 1997 RIVA 128
- 1998 RIVA TNT
- 1999 GeForce 256 (NV10)
14GeForce 256 (NV10)
- Hardware transform and lighting (T&L)
- DDR and SDR
- HDTV compliant
- Hardware alpha-blending
- 4 pixel pipelines at 120 MHz
- Fill Rate 480 Megapixels/second
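The fill rate figure follows from the two numbers above it, assuming each pixel pipeline writes one pixel per clock cycle (an assumption; the slide does not state it). The function name is hypothetical.

```python
def fill_rate_megapixels(pixel_pipelines, clock_mhz):
    """Peak fill rate in megapixels/second, assuming 1 pixel per pipeline per clock."""
    return pixel_pipelines * clock_mhz


geforce_256 = fill_rate_megapixels(4, 120)  # 4 pipelines at 120 MHz
```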
15GeForce2
- 2000 GeForce 2 GTS
- Doubled the pixel fill rate
- Quadrupled the texel fill rate
- Increased clock speed
- Multi-texturing
- S3TC, MPEG-2, FSAA
16Anti-Aliasing
17GeForce2
- 2000 GeForce2 MX
- Halved the pixel pipelines (from 4 to 2), making it cost-effective
- TwinView
- Compatible with Macs
18GeForce2
- Jan 2001 Apple selected GeForce2 MX as default high-end graphics solution for Power Mac G4
- August 2000 GeForce2 Ultra
- November 2000 GeForce2 Go
- December 2000 NVIDIA buys 3DFX
19GeForce3
- 2001 GeForce3 (NV20)
- 240 MHz Core/500 MHz Memory
- 57 million transistors
- 46-76 Gigaflops
- Vertex shader technology
- Pixel shader technology
- LightSpeed Memory architecture
21LightSpeed Memory Architecture
22GeForce4
- 2002 GeForce4 Ti (NV25) and MX (NV17)
- Ti
- 4200, 4400, 4600, and 4800 versions
- 63 million transistors
- Chip clock 225-300 MHz
- Memory Clock 500-650 MHz
- 75-100 million vertices/second
23GeForce FX
- November 2002 GeForce FX (NV30)
- 16 variations for different price ranges
- 125 million transistors
- 8 pixels/clock
- 1 TMU/pipe (16 textures/unit)
- 128-bit memory interface
- 128 MB/256 MB memory size support
24GeForce 6 series
- GeForce 6 series (NV40)
- 6200; 6600 GT and Ultra; 6800 GT, Ultra, and Ultra Extreme
- Core clock speed 450 MHz
- Memory clock speed 600 MHz
- Vertex shader units: 6 4-wide fp32 vector MADDs per clock cycle
- Pixel shader units: 16 4-wide fp32 vector MADDs per clock cycle
25GeForce 6 series
- Superscalar 16-pipe architecture
- CineFX 3.0 engine
- All operations done in FP32 precision per component
- 200 Gigaflops (compare this to the Itanium's 6.4 Gigaflops)
26General Diagram (6800/NV40)
27TurboCache
- Uses PCI-Express bandwidth to render directly to system memory
- Card needs less memory
- Performance boost while lowering cost
- TurboCache Manager dynamically allocates from main memory
- Local memory used to cache data and to deliver peak performance when needed
28TurboCache
29NV40 Vertex Processor
An NV40 vertex processor is able to execute one
vector operation (up to four FP32 components),
one scalar FP32 operation, and make one access to
the texture per clock cycle
30NV40 Fragment Processors
Early termination from mini z-buffer and z-buffer checks; the resulting sets of 4 pixels (quads) are passed on to the fragment units
31Programmable 2D and Video Processor
- Can be used for video decoding and coding (IDCT,
deinterlacing, color model transformations, etc.)
32Why NV40 series was better
- Massive parallelism
- Scalability
- Lower-end products have fewer pixel pipes and fewer vertex shader units
- Computation power
- 222 million transistors
- First to comply with Microsoft's DirectX 9.0c spec (Shader Model 3.0)
- Dynamic branching in pixel shaders
33Dynamic Branching
- Helps detect if pixel needs shading
- Instruction flow handled in groups of pixels
- Specify branch granularity (the number of consecutive pixels that take the same branch)
- Better distribution of blocks of pixels between the different quad engines
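Why granularity matters can be shown with a toy cost model. It assumes that a block of pixels runs only one side of the branch when every pixel in the block agrees, and runs both sides otherwise. The function, the cost figures, and the model itself are hypothetical simplifications, not measurements of NV40 hardware.

```python
def branch_cost(takes_branch, granularity, cost_if, cost_else):
    """Count pixel-instructions executed under a simple lockstep model.

    takes_branch: one boolean per pixel, in the order pixels are grouped.
    granularity:  number of consecutive pixels handled as one block.
    A block in which pixels disagree pays for both sides of the branch.
    """
    total = 0
    for start in range(0, len(takes_branch), granularity):
        block = takes_branch[start:start + granularity]
        if all(block):
            per_pixel = cost_if
        elif not any(block):
            per_pixel = cost_else
        else:
            per_pixel = cost_if + cost_else  # divergent block runs both paths
        total += len(block) * per_pixel
    return total


# 4 pixels need expensive shading (10 instructions), 4 take the cheap path (1).
pixels = [True] * 4 + [False] * 4
fine = branch_cost(pixels, 4, 10, 1)    # blocks agree internally
coarse = branch_cost(pixels, 8, 10, 1)  # one divergent block
```

In this model the coarse grouping does twice the work of the fine one, which is the motivation for keeping pixels that take the same branch together.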
34Dynamic Branching
35GeForce 7 series
- 7800 GT
- Price: $449
- 7 vertex units
- 20 pixel pipelines
- Clock speed 400 MHz
- Memory clock speed 500 MHz
- 7800 GTX
- Price: $600
- 8 vertex units
- 24 pixel pipelines
- Clock speed 430 MHz
- Memory clock speed 600 MHz
36GeForce 7800
- 302 million transistors
- 200 Gigaflops of multiply/add calculations
- 128-bit floating point precision through the entire rendering pipeline
- Fill rate 10.3 Gigatexels/second
- 860 million vertices/sec
37GeForce 7800
40ALU Units in Pixel Processor
- Sub-unit 1
- NV40: textures data and can issue a MUL vector instruction or use its mini-ALU to issue a non-vector instruction
- G70: same, but can also issue a multiply/add
- Sub-unit 2
- NV40: can issue a multiply/add vector instruction or use its own mini-ALU to issue a non-vector instruction
- G70: same
41GeForce 6 vs. GeForce 7
- ALU Units
- G70: 24 ALU units
- NV40: 16 ALU units
- Register file same size
- Texture samplers the same but when fetching large
textures in preparation for filtering, G70's
samplers have less latency pulling those textures
out of memory
42GeForce 6 vs. GeForce 7 (speculative)
- Increased L2 texture cache (to around 12KB)
- Better cache re-use with larger textures, decompressing those larger textures into L1 faster
- Possibly offering more granularity in cache access by the GPU, to reduce texture bandwidth, speeding up rendering.
43GeForce 6 vs. GeForce 7
- 33% more vertex units, each with more performance
- Improved vertex fetch unit (unconfirmed by NVIDIA)
- Triangle setup and rasteriser optimized via the use of a new raster pattern (again unconfirmed by NVIDIA)
44General Diagram (7800/G70)
4532-bit IEEE floating-point throughout pipeline (NV40)
- Framebuffer
- Textures
- Fragment processor
- Vertex processor
- Interpolants
- GeForce 7800 (G70) supports 128-bit precision through the entire pipeline!
46Hardware supports several other data types
- Fragment processor also supports
- 16-bit half floating point
- 12-bit fixed point
- These may be faster than 32-bit on some HW
- Framebuffer/textures also support
- Large variety of fixed-point formats
- E.g., classical 8-bit per component
- These formats use less memory bandwidth than FP32
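The bandwidth point can be made concrete by counting bytes per pixel, and the precision trade-off by round-tripping a value through a 16-bit half float. This is a sketch using Python's standard `struct` module (format code `e` is IEEE half precision); the four-component pixel and the helper names are assumptions for illustration.

```python
import struct


def bytes_per_pixel(bits_per_component, components=4):
    """Storage for one pixel, e.g. 4 components (RGBA) of the given width."""
    return bits_per_component * components // 8


def to_half_and_back(x):
    """Round a Python float to 16-bit half precision and widen it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


fixed8 = bytes_per_pixel(8)    # classical 8-bit per component
half16 = bytes_per_pixel(16)   # 16-bit half floating point
full32 = bytes_per_pixel(32)   # 32-bit floating point
error = abs(to_half_and_back(0.1) - 0.1)  # precision lost by the half format
```

An FP32 pixel takes four times the memory traffic of an 8-bit one, and the half format sits in between at the cost of some precision.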
47How are current GPUs different from CPU?
- GPU is a stream processor
- Multiple programmable processing units
- Connected by data flows
[Diagram: Application → Vertex Processor → Assembly & Rasterization → Fragment Processor → Framebuffer Operations → Framebuffer; Textures]
48How are current GPUs different from CPU?
- Optimized for 4-vector arithmetic
- Useful for graphics: colors, vectors, texcoords
- Easy way to get high performance/cost
- SIMD/MIMD
49GPU Memory Model vs CPUs
- Much more restricted memory access
- Allocate/free memory only before computation
- Limited memory access during computation (kernel)
- Registers
- Read/write
- Local memory
- Does not exist
- Global memory
- Read-only during computation
- Write-only at end of computation (pre-computed address)
- Disk access
- Does not exist
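These restrictions amount to a "gather" style of programming: a kernel may read any input location but writes exactly one result to an address fixed in advance. The sketch below imitates that on the CPU; `run_kernel` and `blur` are hypothetical names, and the 1D three-tap blur is just an example workload.

```python
def run_kernel(kernel, inputs, n):
    """Invoke kernel once per output index and collect the results.

    The input is frozen into a tuple to mimic read-only global memory, and
    each invocation produces only the value for its own output index.
    """
    frozen = tuple(inputs)
    return [kernel(i, frozen) for i in range(n)]


def blur(i, src):
    """Gather: read neighbours from anywhere in src, return one value."""
    left = src[max(i - 1, 0)]
    right = src[min(i + 1, len(src) - 1)]
    return (left + src[i] + right) / 3.0


result = run_kernel(blur, [0.0, 3.0, 6.0], 3)
```

A kernel that wanted to write to a computed address (a "scatter") has no way to express that here, which mirrors the limitation listed above.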
50GPU Memory Model
- Where is GPU Data Stored?
- Vertex buffer
- Frame buffer
- Texture
[Diagram (VS 3.0 GPUs): Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture]
51GPGPU and Motivation
- GPUs are fast
- Itanium: 6.4 GFLOPS
- GeForce 7800: 200 GFLOPS
- GPUs are getting faster, faster
- CPUs: annual growth ≈ 1.5× → decade growth ≈ 60×
- GPUs: annual growth > 2.0× → decade growth > 1000×
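The decade figures are the annual factors compounded over ten years. A quick check, with a hypothetical helper name:

```python
def decade_growth(annual_factor, years=10):
    """Compound an annual growth factor over a number of years."""
    return annual_factor ** years


cpu = decade_growth(1.5)  # about 57.7, which the slide rounds to 60
gpu = decade_growth(2.0)  # 1024, hence "greater than 1000"
```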
52Motivation: Computational Power
[Graph comparing GPU and CPU computational power]
Courtesy Naga Govindaraju
53GPGPU
- Good for inherently parallel applications
- Rapidly evolving ISA and HW architecture
- Largely secret
- Can't simply port code written for the CPU!
54Programs are Shaders
- Bound by the specific hardware profile
- E.g. different cards have different supported hardware, OpenGL has different restrictions than DirectX, etc.
- Hardware profiles change relatively drastically as new GPUs are developed
- But typically new profiles only add features, so there is generally still backwards compatibility (but not always)
55Vertex processor
- 256 instructions per program originally (effectively higher with branching)
- Now up to 65535 instructions
- Executes on all vertices
- Outputs new vertices or texture coordinates, etc.
56Fragment Processor Flow Chart
57Fragment processor has flexible texture mapping
- Memory is accessible through texture reads
- Texture reads are just another instruction
- Allows computed texture coordinates, nested to arbitrary depth
- Allows multiple uses of a single texture unit
58Additional fragment processor capabilities
- Read access to window-space position
- Read/write access to fragment Z
- Built-in derivative instructions
- Partial derivatives w.r.t. screen-space x or y
- Useful for anti-aliasing
- Conditional fragment-kill instruction
- Multiple FP formats supported
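The derivative instructions can be pictured as finite differences between neighbouring fragments. The sketch below assumes the differences are taken inside the 2x2 quad that contains the fragment, so all four fragments of a quad see the same result; that layout and the function name are assumptions for illustration, not a description of a specific chip.

```python
def quad_derivatives(f, x, y):
    """Approximate df/dx and df/dy at fragment (x, y).

    f is any per-fragment quantity as a function of window-space position.
    Differences are taken from the top-left fragment of the enclosing quad.
    """
    x0 = x - (x % 2)  # left column of the 2x2 quad
    y0 = y - (y % 2)  # top row of the 2x2 quad
    ddx = f(x0 + 1, y0) - f(x0, y0)
    ddy = f(x0, y0 + 1) - f(x0, y0)
    return ddx, ddy


# A texture coordinate that changes by 3 per pixel in x and 5 per pixel in y.
ddx, ddy = quad_derivatives(lambda x, y: 3 * x + 5 * y, 5, 8)
```

Large derivatives mean the quantity changes quickly from pixel to pixel, which is the signal an anti-aliasing shader can use to filter more aggressively.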
59Fragment processor limitations
- Originally: no branching
- Now supports dynamic branching (but it's still costly)
- No indexed reads from registers
- Use texture reads instead
- No memory writes
60Branching Instruction Costs (GeForce 6800)
61Fragment shaders
- Originally very limited in size (only 96 instructions), now expanded to 65535 instructions
- New cards support dynamic branching (but it still incurs some performance penalty)
- Now have the ability to output to multiple render targets
62CineFX 4.0 Engine
- A redesigned vertex shader unit reduces the time to set up and perform geometry processing.
- A new pixel shader unit design can carry out twice as many floating-point operations and greatly accelerates other mathematical operations to increase throughput.
- An advanced texture unit incorporates new hardware algorithms and better caching to speed filtering and blending operations.
63Vertex Shaders
- The 7800 has 8 vertex shaders
- The Triangle Setup stage turns the vertex points into a triangle
- It also determines mathematically the rasterization for each triangle
- Accelerating triangle setup increases the total throughput of the 3D pipeline
64Theoretical Rasterization Pattern of a Triangle
65New Pixel Shader MADD
- Multiply and accumulate are commonly used math functions in 3D graphics
- MADD stands for Multiply-ADD operations
- The 7800 can do twice as many MADD operations as previous GPUs could
- This allows developers to create much more complex visual effects
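A MADD computes `a * b + c` in one instruction, and on a 4-wide unit it does so for four components at once. The sketch below emulates one such instruction on tuples; the function name and the color-scaling example are hypothetical.

```python
def madd4(a, b, c):
    """Component-wise multiply-add on 4-vectors: a * b + c."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


# Scale an RGBA color by a light intensity and add an ambient term.
color = (1.0, 2.0, 3.0, 4.0)
intensity = (2.0, 2.0, 2.0, 2.0)
ambient = (1.0, 1.0, 1.0, 1.0)
lit = madd4(color, intensity, ambient)
```

Each call stands for 8 floating-point operations (4 multiplies and 4 adds), which is why MADD throughput is the usual basis for peak FLOPS figures.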
66Transparency Adaptive Supersampling
- Takes extra passes over thin-lined objects such as chain-link fences or trees to enhance quality
- Pixels inside of a polygon are usually not touched by anti-aliasing methods
- With this, a key set is devised, and those pixels are anti-aliased, creating a smoother image.
67Transparency Adaptive Supersampling
68Transparency Adaptive Multisampling
- Higher levels of performance, because it uses one texel to determine other subpixel values
- Not as high quality
70Supporting the Future
- The 7800 is already set up to support the new Microsoft Longhorn OS with some of the following advancements
- Video post-processing
- Real-time desktop compositing
- Seamless multiple 3D applications
- Accelerated antialiased text rendering
- Special effects and animation
71Accelerated Graphics Port (AGP)
- AGP is superior to PCI because it provides a dedicated pathway between the slot and the processor
- Uses sideband addressing
- PCI must load a texture from the hard drive into the system's RAM, then from the RAM into the GPU framebuffer
- AGP can read textures directly from system RAM by tricking the CPU into believing the textures are in the framebuffer, when they are really in system memory
72PCI Express
- Based on the PCI system, allowing for backwards compatibility
- Uses 1-bit, bidirectional lanes (PCI used a bus)
- Each lane can support 250 MB/s in each direction (4 GB/s total for 16 lanes)
- AGP is only 2 GB/s
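The 4 GB/s total is consistent with a 16-lane slot at 250 MB/s per lane. The lane count is an assumption here, as is the helper name.

```python
def pcie_bandwidth_mb_s(lanes, per_lane_mb_s=250):
    """Aggregate bandwidth in MB/s for a link with the given number of lanes."""
    return lanes * per_lane_mb_s


x16 = pcie_bandwidth_mb_s(16)  # 4000 MB/s, i.e. 4 GB/s
x1 = pcie_bandwidth_mb_s(1)    # a single lane
```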
73Scalable Link Interface (SLI)
- Takes advantage of the PCI Express bus, which will allow more than one discrete graphics device on the same PCI host
- Allows two of the same GeForce GPUs to run on one machine, thus sharing load.
- There are two modes for this
- Split-frame Rendering (SFR)
- Alternate-frame Rendering (AFR)
75Split-frame Rendering
- Has each GPU render a portion of the screen, split horizontally
- No extra latency
- Not necessarily evenly split
- SFR is load shared, so it splits up the frame by the amount of work, not the size
- A large amount of overhead is involved, causing a max speedup of around 1.8 times
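Splitting by work rather than by size can be sketched as choosing the row at which the two halves of the frame cost about the same. The per-row cost estimates and the function below are hypothetical; how the driver actually estimates load is not described in the slides.

```python
def balanced_split(row_costs):
    """Return how many rows (from the top) go to the first GPU.

    row_costs holds an estimated rendering cost per screen row. The split is
    placed where the difference in total cost between the halves is smallest.
    """
    total = sum(row_costs)
    running = 0
    best_rows, best_diff = 0, total
    for row, cost in enumerate(row_costs):
        running += cost
        diff = abs(total - 2 * running)  # |bottom work - top work|
        if diff < best_diff:
            best_rows, best_diff = row + 1, diff
    return best_rows


# The bottom of the frame is more expensive, so the top GPU gets more rows.
rows_for_gpu0 = balanced_split([1, 1, 1, 1, 2, 2, 4, 4])
```

Here the first GPU renders 6 of the 8 rows, yet both GPUs do the same amount of work.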
76Alternate-frame Rendering
- Avoids all the overhead problems of SFR
- Many buffer swaps
- Reliant on the speed of the processor
- Can cause latency issues
- Recommended mode by NVIDIA
77GeForce Go 7800 GTX
- The mobile version of the 7800 GTX
- Everything from the desktop release has been carried over to this
- Can switch between x1 and x16 lanes of PCI Express
- Uses PowerMizer 6.0, which allows this chip to operate in the same envelope as its predecessor, the 6800
80GeForce Go 7800 Power Issues
- Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
- Dynamic clock scaling can run as slow as 16 MHz
- This is true for the engine, memory, and pixel clocks
- Heavier use of clock gating than the desktop version
- Runs at voltages lower than any other mobile performance part
- Regardless, you won't get much battery-based runtime for a 3D game
81Questions?