Title: Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware
1Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware
- Tim Foley
- Mike Houston
- Pat Hanrahan
- Computer Graphics Lab
- Stanford University
2Motivation
- GPU Programming
- Interactive shading
- Offline rendering
- Computation
- physical simulations
- numerical methods
- BrookGPU [Buck et al. 2004]
- Shouldn't be constrained by hardware limits
- but demand high runtime performance
3Motivation Multipass Partitioning
- Divide GPU program (shader) into a partition
- set of rendering passes
- each pass satisfies all resource constraints
- save/restore intermediate values in textures
- Many possible partitions exist
- The problem
- given a program, find the best partition
4Related Work
- SGI's ISL [Peercy et al. 2000]
- treat OpenGL machine as SIMD processor
- Recursive Dominator Split (RDS) [Chan et al. 2002]
- graph partitioning of shader DAG
- Data-Dependent Multipass Control Flow on GPU [Popa and McCool 2004]
- partition around flow control and schedule passes
- Mio [Riffel et al. 2004]
- instruction scheduling with backtracking
5Contribution
- Merging Recursive Dominator Split (MRDS)
- MRDS Extends RDS
- support shaders with multiple outputs
- support hardware with multiple render targets
- generate better-performing partitions
- same running time as RDS
6Outline
- Motivation
- Related Work
- RDS Algorithm
- MRDS Algorithm
- Results
- Future Work
7RDS - Overview
- Input: DAG of n nodes
- shader ops
- inputs
- interpolants
- constants
- textures
- Goal: mark a subset of nodes as splits
- split nodes define pass boundaries
- 2^n possible subsets
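The relationship between split nodes and passes can be sketched as follows. This is a minimal illustration, not the RDS implementation: the tiny `DAG`, the node names, and the helpers `pass_ops`, `partition`, and `all_split_sets` are all hypothetical. It assumes each split node (plus the final output) becomes the output of one pass, and that a pass stops descending when it reaches another split node, whose value it would read back from a texture.

```python
from itertools import combinations

# Hypothetical shader DAG: each op maps to the operands it reads.
# Leaves (textures, constants, interpolants) do not appear as keys.
DAG = {
    "out": ["mul", "add"],
    "mul": ["add", "tex0"],
    "add": ["tex0", "const0"],
}

def pass_ops(dag, root, splits):
    """Ops executed by the pass that outputs `root`, stopping at split nodes."""
    ops, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in ops or node not in dag:
            continue  # already visited, or a leaf input
        ops.add(node)
        for operand in dag[node]:
            if operand not in splits:  # split values are read from a texture
                stack.append(operand)
    return ops

def partition(dag, final_output, splits):
    """One pass per split node, plus one pass for the final output."""
    roots = set(splits) | {final_output}
    return {r: pass_ops(dag, r, splits) for r in roots}

def all_split_sets(dag, final_output):
    """Enumerate every subset of the candidate split nodes."""
    candidates = [n for n in dag if n != final_output]
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            yield set(subset)
```

With no splits, `add` is computed inside the single pass; marking `add` as a split moves it into its own pass. The number of subsets doubles with every candidate node, which is why exhaustive search is impractical.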
10RDS - Overview
- Combination of approaches to limit search space
- Save/recompute decisions
- primary performance tradeoff
- Dominator tree
- used to avoid save/recompute tradeoffs
11RDS Save / Recompute
- M: multiply referenced node
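The save/recompute tradeoff for a multiply referenced node M can be illustrated with a toy cost model. The cost terms below (`pass_overhead`, `texture_read`, and counting cost in instructions) are assumptions made for this sketch and are not the cost function used by RDS.

```python
def save_cost(num_refs, pass_overhead, texture_read):
    """Save M once in its own pass; each reference reads it back from a texture."""
    return pass_overhead + num_refs * texture_read

def recompute_cost(num_refs, subgraph_ops):
    """Recompute M's subgraph at every reference beyond the first."""
    return (num_refs - 1) * subgraph_ops

def choose(num_refs, subgraph_ops, pass_overhead=10, texture_read=1):
    """Pick the cheaper alternative under the assumed cost model."""
    save = save_cost(num_refs, pass_overhead, texture_read)
    recompute = recompute_cost(num_refs, subgraph_ops)
    return "save" if save < recompute else "recompute"
```

Under these assumed numbers, a small subgraph referenced twice is cheaper to recompute, while a large subgraph referenced three times is cheaper to save.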
15Dominator
- B dom G
- all paths to G go through B
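The dominator relation can be computed with the standard iterative set-intersection method. This is a sketch only: the graph, the node names, and the `dominators` helper are hypothetical, and paths are assumed to start at the shader's final output and follow edges from each op to its operands.

```python
def dominators(dag, root):
    """Return {node: set of nodes that dominate it}, for paths starting at root."""
    nodes = set(dag) | {c for children in dag.values() for c in children}
    parents = {n: set() for n in nodes}
    for n, children in dag.items():
        for c in children:
            parents[c].add(n)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes - {root}:
            if not parents[n]:
                continue  # not reachable from root; left untouched
            new = {n} | set.intersection(*(dom[p] for p in parents[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# Hypothetical graph: G is referenced by both C and D, which both hang off B.
EXAMPLE = {"out": ["B"], "B": ["C", "D"], "C": ["G"], "D": ["G"]}
DOM = dominators(EXAMPLE, "out")
```

Here B dominates G because both paths to G pass through B, while neither C nor D dominates G on its own.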
16Dominator Tree
17Key Insight
- if B, G in same pass
- and B dom G
- then no save/recompute costs for G
18MRDS Multiple-Output Shaders
20MRDS Multiple-Output Hardware
float4 x, y;
...
for( i=0; i<N; i++ ) {
  x' = x*x - y*y;
  y' = 2*x*y;
  x = x';
  y = y';
}
...
21MRDS Multiple-Output Hardware
float4 x, y;
...
for( i=0; i<N; i++ ) {
  x' = f( x, y );
  y' = g( x, y );
  x = x';
  y = y';
}
...
23MRDS Multiple-Output Hardware
- State cannot fit in single output
float4 x, y;
...
for( i=0; i<N; i++ ) {
  x' = f( x, y );
  y' = g( x, y );
  x = x';
  y = y';
}
...
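The effect of the output limit on this loop can be sketched by counting passes. This is an illustrative model, not GPU code: scalars stand in for the float4 values, `f` and `g` are taken from the x*x - y*y and 2*x*y example, and the function names are hypothetical. It assumes a single-output pass can write only one of x' or y', so each iteration needs two passes that both read the old x and y.

```python
def f(x, y):
    return x * x - y * y

def g(x, y):
    return 2 * x * y

def iterate_single_output(x, y, n):
    """Each pass writes one value, so x' and y' need separate passes."""
    passes = 0
    for _ in range(n):
        new_x = f(x, y)  # pass 1 writes only x'
        passes += 1
        new_y = g(x, y)  # pass 2 writes only y'
        passes += 1
        x, y = new_x, new_y
    return x, y, passes

def iterate_multi_output(x, y, n):
    """With two render targets, one pass writes both x' and y'."""
    passes = 0
    for _ in range(n):
        x, y = f(x, y), g(x, y)
        passes += 1
    return x, y, passes
```

Both versions compute the same state, but the single-output version uses twice as many passes. Any subexpression shared by `f` and `g` would also be evaluated twice in that version.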
25MRDS Dominating Sets
- Dominating set S = {A, D}
- S dom G
- all paths to G go through an element of S
- S, G in same pass
- avoid save/recompute for G
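One way to test whether a set S dominates a node G is to remove the elements of S and check whether G is still reachable. The sketch below assumes a hypothetical virtual root `out` that references both shader outputs A and D; the graph and the `set_dominates` helper are illustrative names, not part of MRDS.

```python
def set_dominates(dag, root, dom_set, target):
    """True if every path from root to target passes through a node in dom_set."""
    if target in dom_set:
        return True
    stack, seen = [root], set()
    while stack:
        node = stack.pop()
        if node in seen or node in dom_set:
            continue  # already explored, or blocked by the dominating set
        if node == target:
            return False  # reached target without touching dom_set
        seen.add(node)
        stack.extend(dag.get(node, []))
    return True

# Hypothetical two-output shader: outputs A and D both use G.
SHADER = {"out": ["A", "D"], "A": ["G"], "D": ["G"]}
```

Neither A nor D alone dominates G, but the set {A, D} does, so placing A, D, and G in one pass avoids the save/recompute decision for G.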
26MRDS Pass Merging
- Generate initial passes with RDS
- Find potential merges
- check if valid
- evaluate change in cost
- Execute from best to worst
- revalidate
- Stop when no more beneficial merges
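The merging loop above can be sketched as a greedy procedure. This is a simplified model under stated assumptions: a pass is a pair (set of ops, set of outputs), validity means fitting within hypothetical `max_ops` and `max_outputs` limits, and the score is the number of duplicated ops eliminated plus an assumed fixed `pass_overhead`. A real implementation would check validity with the shader compiler and would also have to rule out merges that create cyclic dependencies between passes, which this sketch omits.

```python
from itertools import combinations

def merge_passes(passes, max_ops, max_outputs, pass_overhead):
    """Greedily merge passes; each pass is (set of ops, set of outputs)."""
    current = {i: (set(ops), set(outs)) for i, (ops, outs) in enumerate(passes)}
    owner = list(range(len(passes)))  # owner[i]: pass now holding original pass i

    def merged(a, b):
        return a[0] | b[0], a[1] | b[1]

    def valid(p):
        return len(p[0]) <= max_ops and len(p[1]) <= max_outputs

    def saving(a, b):
        # Shared ops run once instead of twice, and one pass overhead disappears.
        return len(a[0] & b[0]) + pass_overhead

    # Initial search: find valid, beneficial merges over all pairs of passes.
    candidates = []
    for i, j in combinations(range(len(passes)), 2):
        a, b = current[i], current[j]
        if valid(merged(a, b)) and saving(a, b) > 0:
            candidates.append((saving(a, b), i, j))

    # Execute from best to worst, revalidating against the current passes.
    for _, i, j in sorted(candidates, reverse=True):
        a_id, b_id = owner[i], owner[j]
        if a_id == b_id:
            continue  # already merged together by an earlier step
        combined = merged(current[a_id], current[b_id])
        if not valid(combined):
            continue  # earlier merges made this one exceed the limits
        current[a_id] = combined
        del current[b_id]
        owner = [a_id if o == b_id else o for o in owner]
    return list(current.values())

# Passes A and D both recompute G; a third pass is too large to merge.
RESULT = merge_passes(
    [({"g", "a"}, {"a"}), ({"g", "d"}, {"d"}), ({"e1", "e2", "e3"}, {"e"})],
    max_ops=3, max_outputs=2, pass_overhead=1,
)
```

In the example, the two passes that both contain `g` merge into one two-output pass, removing the duplicated op, while the third pass stays separate because any merge with it would exceed the assumed instruction limit.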
31MRDS Pass Merging
- What if RDS chose to recompute G?
- Merge between passes A and D
- eliminates duplicate instructions
- gets high score
33MRDS Time Complexity
- Cost of merging dominated by initial search
- iterates over s^2 pairs of splits
- each pair requires size-s set operations and 1 compiler call
- O(s^2 (s + n))
- s = O(n) in worst case
- MRDS = O(n^3) in worst case
- in practice we expect s << n
- Assumes compiler calls are linear
- not true for fxc
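The worst-case bound follows by substituting the bound on the number of splits into the per-pair cost. The symbol $T_{\text{merge}}$ is introduced here only for illustration, and the derivation assumes, as stated above, that a compiler call on a candidate pass costs O(n).

```latex
\begin{align*}
T_{\text{merge}} &= O\bigl(s^2\,(s + n)\bigr)
  && \text{$s^2$ pairs, each with $O(s)$ set operations and one $O(n)$ compiler call} \\
 &= O\bigl(n^2\,(n + n)\bigr)
  && \text{since } s = O(n) \text{ in the worst case} \\
 &= O(n^3)
\end{align*}
```

If s stays much smaller than n, the s^2 n term dominates and the merging step is effectively linear in n for a fixed number of splits.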
34MRDS'
- RDS uses linear search for save/recompute
- evaluates cost of both alternatives with RDSh
- RDS = O(n * RDSh) = O(n^3)
- MRDS merges after RDS has made these decisions
- MRDS = O(RDS + n^3) = O(n^3)
- MRDS' merges during cost evaluation
- adds linear factor in worst case
- MRDS' = O(n * (RDSh + n^3)) = O(n^4)
35Results
- 3 Brook Programs
- Procedural Fire
- Mandelbrot Fractal
- Matrix Multiply
- Compiled for ATI Radeon 9800 XT with
- RDS
- MRDS
- MRDS'
36Results Procedural Fire
- MRDS' better than MRDS and RDS
- better save/recompute decisions
- results in less bandwidth used
37Results Compile Times
38Results Mandelbrot Fractal
- MRDS', MRDS better than RDS
- iterative computation with state in 2 variables
- RDS duplicates computation
39Results Matrix Multiply
- Matrix-matrix multiply benefits from blocking
- blocking cuts computation by a factor of 2
- Blocking requires multiple outputs
- performance limited by MRT performance
40Summary
- Modified RDS algorithm, MRDS
- supports multiple-output shaders
- generates code for multiple-render-targets
- easy to implement, same running time
- generates better-performing partitions
41Future Work
- Implementations
- Ashli
- combine with Mio
- Exploit new hardware
- data-dependent flow control
- large numbers of outputs
42Acknowledgements
- Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot
- RDS implementation, design discussions
- Kayvon Fatahalian, Ian Buck
- GPUBench results
- ATI
- hardware
- DARPA, ATI, IBM, NVIDIA, SONY
- funding