Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware

1
Efficient Partitioning of Fragment Shaders for
Multiple-Output Hardware
  • Tim Foley
  • Mike Houston
  • Pat Hanrahan
  • Computer Graphics Lab
  • Stanford University

2
Motivation
  • GPU Programming
  • Interactive shading
  • Offline rendering
  • Computation
  • physical simulations
  • numerical methods
  • BrookGPU [Buck et al. 2004]
  • Shouldn't be constrained by hardware limits
  • but demand high runtime performance

3
Motivation - Multipass Partitioning
  • Divide GPU program (shader) into a partition
  • set of rendering passes
  • each pass satisfies all resource constraints
  • save/restore intermediate values in textures
  • Many possible partitions exist
  • The problem
  • given a program, find the best partition

4
Related Work
  • SGI's ISL [Peercy et al. 2000]
  • treat OpenGL machine as SIMD processor
  • Recursive Dominator Split (RDS) [Chan et al. 2002]
  • graph partitioning of shader DAG
  • Data-Dependent Multipass Control Flow on GPU [Popa and McCool 2004]
  • partition around flow control and schedule passes
  • Mio [Riffel et al. 2004]
  • instruction scheduling with backtracking

5
Contribution
  • Merging Recursive Dominator Split (MRDS)
  • MRDS extends RDS
  • support shaders with multiple outputs
  • support hardware with multiple render targets
  • generate better partitions
  • same running time as RDS

6
Outline
  • Motivation
  • Related Work
  • RDS Algorithm
  • MRDS Algorithm
  • Results
  • Future Work

7
RDS - Overview
  • Input: DAG of n nodes
  • shader ops
  • inputs
  • interpolants
  • constants
  • textures
  • Goal: mark a subset of nodes as splits
  • split nodes define pass boundaries
  • 2^n possible subsets
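
The idea that split nodes define pass boundaries can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RDS implementation: the example DAG, the node names, and the helpers `pass_nodes`, `partition`, and `all_split_sets` are all hypothetical. A split node is assumed to be saved to a texture, so later passes fetch it instead of evaluating it again.

```python
from itertools import combinations

# Hypothetical shader DAG: each node maps to the nodes it reads.
# "out" is the shader output; "tex" and "c" stand in for inputs.
dag = {
    "out": ["add"],
    "add": ["mul", "m"],
    "mul": ["m", "c"],
    "m": ["tex"],
    "tex": [],
    "c": [],
}

def pass_nodes(dag, root, splits):
    """Nodes evaluated in the pass rooted at `root`.

    Traversal stops at split nodes: their values are assumed to have been
    saved to a texture by another pass, so they are fetched, not recomputed.
    """
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for child in dag[node]:
            if child not in splits:
                stack.append(child)
    return seen

def partition(dag, output, splits):
    """One pass per split node, plus the pass that produces the output."""
    roots = sorted(splits) + [output]
    return {root: pass_nodes(dag, root, splits) for root in roots}

def all_split_sets(nodes):
    """Every subset of `nodes`: 2^n candidates for n nodes."""
    nodes = sorted(nodes)
    for k in range(len(nodes) + 1):
        for subset in combinations(nodes, k):
            yield set(subset)
```

Marking the multiply referenced node `m` as a split gives two passes, one that evaluates `m` and one that evaluates the rest. Enumerating every candidate with `all_split_sets` grows as 2^n, which is why the search space has to be limited.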

10
RDS - Overview
  • Combination of approaches to limit search space
  • Save/recompute decisions
  • primary performance tradeoff
  • Dominator tree
  • used to avoid save/recompute tradeoffs

11
RDS - Save / Recompute
  • M: multiply referenced node

15
Dominator
  • B dom G
  • all paths to G go through B
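
A minimal sketch of how dominators could be computed, reading "B dom G" as: every path from the shader output down to G passes through B. The function name `dominators`, the example DAG, and the choice of the simple iterative set-intersection method are assumptions made for illustration; the slides do not say which dominator algorithm RDS uses.

```python
def dominators(dag, root):
    """Map each node reachable from `root` to the set of nodes dominating it.

    `dag` maps a node to the nodes it reads. B dominates G when every path
    from `root` down to G passes through B (a node dominates itself).
    """
    # Collect the reachable nodes and, for each, the nodes that read it.
    parents = {root: []}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in dag[node]:
            if child not in parents:
                parents[child] = []
                stack.append(child)
            parents[child].append(node)

    nodes = set(parents)
    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates"
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes - {root}:
            # A dominator of n must dominate every node that reads n.
            new = set.intersection(*(dom[p] for p in parents[n])) | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# Hypothetical DAG: G is read by both C and D, which are both read by B.
example = {
    "out": ["B", "t"],
    "B": ["C", "D"],
    "C": ["G"],
    "D": ["G"],
    "G": [],
    "t": [],
}
```

In this example `B` dominates `G` because both users of `G` are reached only through `B`, while neither `C` nor `D` dominates `G` on its own.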

16
Dominator Tree
17
Key Insight
  • if B, G in same pass
  • and B dom G
  • then no save/recompute costs for G

18
MRDS - Multiple-Output Shaders
20
MRDS - Multiple-Output Hardware
float4 x, y;
...
for( i=0; i<N; i++ ) {
    x' = x*x - y*y;
    y' = 2*x*y;
    x = x';
    y = y';
}
...

21
MRDS - Multiple-Output Hardware
float4 x, y;
...
for( i=0; i<N; i++ ) {
    x' = f( x, y );
    y' = g( x, y );
    x = x';
    y = y';
}
...

23
MRDS - Multiple-Output Hardware
  • State cannot fit in single output

float4 x, y;
...
for( i=0; i<N; i++ ) {
    x' = f( x, y );
    y' = g( x, y );
    x = x';
    y = y';
}
...
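
To make the constraint concrete, here is a small Python model of the loop above. It is a sketch under assumptions: a "pass" is modeled as one function call that may write either one value (single output) or both values (multiple outputs), and the names `step`, `run_single_output`, and `run_multi_output` are hypothetical. The update rule is the one from the earlier slide, x' = x*x - y*y and y' = 2*x*y, applied to plain floats rather than float4 values.

```python
def step(x, y):
    """One loop iteration: returns (x', y')."""
    return x * x - y * y, 2 * x * y

def run_single_output(x, y, n):
    """Each pass writes one value, so x' and y' need separate passes."""
    passes = 0
    for _ in range(n):
        new_x = step(x, y)[0]  # pass that writes x'
        passes += 1
        new_y = step(x, y)[1]  # pass that writes y'
        passes += 1
        x, y = new_x, new_y
    return x, y, passes

def run_multi_output(x, y, n):
    """One pass per iteration writes both x' and y'."""
    passes = 0
    for _ in range(n):
        x, y = step(x, y)
        passes += 1
    return x, y, passes
```

Both versions produce the same final state, but the single-output version needs twice as many passes, and each of those passes reads both `x` and `y` again.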
25
MRDS - Dominating Sets
  • Dominating set S = {A, D}
  • S dom G
  • all paths to G go through an element of S
  • S, G in same pass
  • avoid save/recompute for G
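
One way to test whether a set S dominates G is to block every node in S and check whether G can still be reached from the output. The sketch below does that with a depth-first search; the function name `set_dominates` and the example DAG are hypothetical and chosen only to mirror the S = {A, D} example.

```python
def set_dominates(dag, root, S, target):
    """True if every path from `root` to `target` hits a node in S.

    `dag` maps a node to the nodes it reads. Nodes in S are treated as
    blocked; if `target` is still reachable, some path avoids S.
    """
    if root in S or target in S:
        return True
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        if node == target:
            return False  # found a path that avoids S
        for child in dag[node]:
            if child not in S and child not in seen:
                seen.add(child)
                stack.append(child)
    return True

# Hypothetical DAG: G is read by both A and D.
example = {
    "out": ["A", "D"],
    "A": ["G", "t"],
    "D": ["G"],
    "G": [],
    "t": [],
}
```

Here neither `A` nor `D` dominates `G` alone, since `G` can be reached through the other one, but the set {A, D} does.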

26
MRDS - Pass Merging
  • Generate initial passes with RDS
  • Find potential merges
  • check if valid
  • evaluate change in cost
  • Execute from best to worst
  • revalidate
  • Stop when no more beneficial merges
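
The merging loop above can be sketched as a greedy procedure. Everything specific here is an assumption made for illustration: the cost model (a fixed per-pass overhead plus instruction count), the resource limits, the pass representation, and the function names. A real validity check would also have to call the compiler and respect dependencies between passes, which this sketch leaves out.

```python
from itertools import combinations

PASS_OVERHEAD = 10    # assumed fixed cost per rendering pass
MAX_INSTRUCTIONS = 8  # assumed per-pass instruction limit
MAX_OUTPUTS = 4       # assumed number of render targets

def cost(p):
    return PASS_OVERHEAD + len(p["ops"])

def merge(a, b):
    # Set union drops instructions that both passes computed.
    return {"ops": a["ops"] | b["ops"], "outputs": a["outputs"] | b["outputs"]}

def valid(p):
    return len(p["ops"]) <= MAX_INSTRUCTIONS and len(p["outputs"]) <= MAX_OUTPUTS

def merge_passes(passes):
    """Greedily merge passes until no valid merge lowers the total cost."""
    passes = dict(passes)
    while True:
        # Find potential merges, check validity, evaluate change in cost.
        candidates = []
        for a, b in combinations(sorted(passes), 2):
            merged = merge(passes[a], passes[b])
            if valid(merged):
                gain = cost(passes[a]) + cost(passes[b]) - cost(merged)
                if gain > 0:
                    candidates.append((gain, a, b))
        if not candidates:
            return passes
        # Execute from best to worst, revalidating each candidate.
        candidates.sort(reverse=True)
        for gain, a, b in candidates:
            if a in passes and b in passes:  # skip if already merged away
                passes[a + "+" + b] = merge(passes.pop(a), passes.pop(b))

# Hypothetical passes: A and D both recompute the shared instruction "g".
example = {
    "A": {"ops": {"g", "a1", "a2"}, "outputs": {"A"}},
    "D": {"ops": {"g", "d1"}, "outputs": {"D"}},
    "Z": {"ops": {"z1", "z2", "z3", "z4", "z5", "z6"}, "outputs": {"Z"}},
}
```

With these numbers, merging `A` and `D` scores highest because it removes the duplicated instruction `g` as well as one pass. The resulting pass is then too large to merge with `Z`, so the loop stops with two passes.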

31
MRDS - Pass Merging
  • What if RDS chose to recompute G?
  • Merge between passes A and D
  • eliminates duplicate instructions
  • gets high score

33
MRDS - Time Complexity
  • Cost of merging dominated by initial search
  • iterates over s^2 pairs of splits
  • each pair requires size-s set operations and 1 compiler call
  • O(s^2 (s + n))
  • s = O(n) in worst case
  • MRDS = O(n^3) in worst case
  • in practice we expect s << n
  • Assumes compiler calls are linear
  • not true for fxc
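
Written out as a short derivation, and assuming the per-pair work is O(s) for the set operations plus O(n) for one linear-time compiler call, the bound follows directly. The symbol T_merge is introduced here only as a label for the merging cost.

```latex
\begin{align*}
T_{\text{merge}} &= O\big(s^2 \cdot (s + n)\big)
  && \text{$s^2$ pairs, $O(s + n)$ work per pair} \\
 &= O\big(n^2 \cdot (n + n)\big)
  && \text{worst case, } s = O(n) \\
 &= O(n^3)
\end{align*}
```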

34
MRDS'
  • RDS uses linear search for save/recompute
  • evaluates cost of both alternatives with RDSh
  • RDS = O(n * RDSh) = O(n^3)
  • MRDS merges after RDS has made these decisions
  • MRDS = O(RDS + n^3) = O(n^3)
  • MRDS' merges during cost evaluation
  • adds linear factor in worst case
  • MRDS' = O(n * (RDSh + n^3)) = O(n^4)
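
The three bounds can be lined up as follows. This takes RDSh to be O(n^2), which is what the stated RDS bound of O(n^3) implies for n calls; that value is inferred from the slide rather than quoted from it.

```latex
\begin{align*}
\text{RDS}   &= O(n \cdot \text{RDS}_h) = O(n \cdot n^2) = O(n^3) \\
\text{MRDS}  &= O(\text{RDS} + n^3) = O(n^3 + n^3) = O(n^3) \\
\text{MRDS}' &= O\big(n \cdot (\text{RDS}_h + n^3)\big)
              = O\big(n \cdot (n^2 + n^3)\big) = O(n^4)
\end{align*}
```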

35
Results
  • 3 Brook Programs
  • Procedural Fire
  • Mandelbrot Fractal
  • Matrix Multiply
  • Compiled for ATI Radeon 9800 XT with:
  • RDS
  • MRDS
  • MRDS'

36
Results - Procedural Fire
  • MRDS' better than MRDS and RDS
  • better save/recompute decisions
  • results in less bandwidth used

37
Results - Compile Times
38
Results - Mandelbrot Fractal
  • MRDS', MRDS better than RDS
  • iterative computation, state in 2 variables
  • RDS duplicates computation

39
Results - Matrix Multiply
  • Matrix-matrix multiply benefits from blocking
  • blocking cuts computation by 2
  • Blocking requires multiple outputs
  • performance limited by MRT performance

40
Summary
  • Modified RDS algorithm, MRDS
  • supports multiple-output shaders
  • generates code for multiple-render-targets
  • easy to implement, same running time
  • generates better-performing partitions

41
Future Work
  • Implementations
  • Ashli
  • combine with Mio
  • Exploit new hardware
  • data-dependent flow control
  • large numbers of outputs

42
Acknowledgements
  • Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot
  • RDS implementation, design discussions
  • Kayvon Fatahalian, Ian Buck
  • GPUBench results
  • ATI
  • hardware
  • DARPA, ATI, IBM, NVIDIA, SONY
  • funding
