Title: Graphics on GRAMPS
1. Graphics on GRAMPS
- Jeremy Sugerman
- Kayvon Fatahalian
2. Background
- Context: broader research investigation generalizing GPU/Cell/compute cores and combining them with CPUs
- Fundamental beliefs:
  - Real data parallel apps still have performance-critical non-data-parallel pieces
  - Existing parallel programming models are too constrained (GPUs) or too hard/vague (CPUs)
  - Queues are an excellent idiom to capture producer-consumer parallelism, both thread and data
  - Fixed-function execution units are not a problem, but fixed control paths are
3. Compute Cores
- CPUs designed for single threads per core
  - Minimal FLOPS per core
- Compute cores designed for lots of math per core
  - Many threads per core
  - Sometimes wider SIMD per thread
  - SIMD width × hardware threads = ops / core
- And, more compute cores than CPU cores fit per chip
- Many examples: GPU, Cell, Niagara, Larrabee
4. Simplified Direct3D Pipeline
- Application launches some drawing
- Vertex Assembly (Fixed, Non-Data Parallel)
- Vertex Processing (Programmable, Data Parallel)
- Primitive Assembly (Fixed, Non-Data Parallel)
- Primitive Processing (Programmable, Data Parallel)
- Fragment Assembly (Fixed, Non-Data Parallel)
- Fragment Processing (Programmable, Data Parallel)
- Pixel / Image Assembly (Fixed, Non-Data Parallel)
- Only Data Parallel stages are programmable!
5. Direct3D Pipeline Properties
- There is a reason only data parallel stages are programmable.
- Shader stages are inherently per-element (e.g. vertex / primitive / fragment) and stateless between them.
- Assembly stages also run on many elements, but they have inter-element dependencies:
  - State can be remembered (vertex caching)
  - Inputs can be used by multiple outputs (strips)
- Programmable Assembly requires heavier (more serial) threads than Shaders.
6. Question
- Can fixed-function control be decoupled from efficient graphics performance on a compute-heavy architecture?
- This does not necessarily exclude fixed-function execution blocks (e.g. rasterizer, texture units)
7. This Talk
- GRAMPS: our current model for programming compute cores
- Implementing Direct3D 10 in software with GRAMPS
- (Potentially) thoughts about how REYES and ray tracers map to GRAMPS
- No explicit discussion of heterogeneous cores
- No fancy scheduling algorithms (yet?)
8. Example: Simple 3D Pipeline
[Diagram: Input Vertices → Vertex Shading → Transformed Vertices → Primitive Assembly → Primitives → Rasterize (Assemble) → Fragments → Fragment Shading → Shaded Fragments → Image Assembly → Framebuffer Pixels]
9. GRAMPS
- General Runtime/Architecture for Multicore Parallel Systems
- Models execution as a graph of queues connected by threads
- Graph specified by the host program
- Simulator for exploring compute cores
  - Currently conflates hardware and runtime
  - # of cores, thread contexts, and SIMD width are all parameters
10. Simple GRAMPS core
- T - threads/core
- S - SIMD ALUs/core
- R - registers/thread
- 1 thread runs in each clock
- Threads issue vector instructions (think S-wide SSE)
[Diagram: threads 0 .. T-1, each with R registers, sharing ALUs 0 .. S-1 and an L1 data cache (or scratchpad)]
11. D3D10 Setup
- App defines 3 shading environments
  - Vertex, geometry, fragment
  - Attach programs and resources
- App configures fixed function units
  - Fixed number of modes
  - Attach resources
- App submits work (vertices) to pipeline
- Graphics runtime executes until completion
12. GRAMPS Setup
- App defines a set of queues
- App defines a set of thread environments
- App attaches queues as thread inputs and outputs
- App bootstraps computation by inserting data into a queue
- Runtime executes threads until completion (see the sketch below)
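
A minimal host-side sketch of these five steps, in C-like pseudocode. Every API name here (grampsCreateQueue, grampsCreateThreadEnv, grampsSetInput, grampsSetOutput, grampsPush, grampsRun) is hypothetical; the talk does not name the actual entry points.

    /* Hypothetical host program; all gramps* calls are illustrative only. */
    void setup(BYTE* firstChunk) {
        /* App defines a set of queues. */
        queue* inQ  = grampsCreateQueue(/* num_chunks */ 64, /* chunk_bytes */ 1024);
        queue* outQ = grampsCreateQueue(64, 1024);

        /* App defines a thread environment and attaches its queues. */
        threadEnv* shade = grampsCreateThreadEnv(THREAD_SHADER, myShaderFn);
        grampsSetInput(shade, inQ);
        grampsSetOutput(shade, outQ);

        /* App bootstraps computation by inserting data into a queue. */
        grampsPush(inQ, firstChunk);

        /* Runtime executes threads until completion. */
        grampsRun();
    }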
13. GRAMPS Entities: Execution
- Threads: Assemble, Shader, Fixed
  - Assemble: stateful, akin to a regular thread
  - Fixed: special purpose hardware wrapped to appear as an Assemble thread
  - Shader: stateless and data parallel
14. GRAMPS Entities: Data
- Queues for producer-consumer parallelism
- Queues for aggregating coherent work
- Queues support push and reserve/commit for in-place Assembly
- Chunks are the units / granularity at which Queues are manipulated
15. GRAMPS Scheduling
- GRAMPS assigns Threads to hw contexts
  - Based on the graph and current Queue contents
- Tiered scheduling model
  - Tier-0: trivially puts threads onto hw threads
  - Tier-1: builds schedules for Tier-0
  - Tier-N: arbitrarily clever. Doesn't exist.
16. System (how it works today)
17. D3D10 on GRAMPS
[Diagram: the D3D10 pipeline expressed as a GRAMPS graph. Assemble threads (idxVtxAssemble, primAssemble, rastAssemble), shader threads (vtxShade, primShade, fragShade), and fixed function in GPU (tri setup / clip / cull, blend / ztest) alternate with queues: index queue → idxVtxAssemble → preVtxShade queue → vtxShade → postVtxShade queue → prePrimAssemble queue → primAssemble → prePrimShade queue → primShade → postPrimShade queue → rastAssemble → preRast queue → tri setup / clip / cull → tri queues 0..N, each feeding a parallel chain of rasterize → preFragShade queue → fragShade → postFragShade queue → blend / ztest]
18. Internal Queues
- Queues are just memory + state (struct below)
- For now, Queues are finite
- Queues are contiguous arrays of chunks
- Chunks: granularity of manipulation

    struct queue {
        BYTE ptr[num_chunks * chunk_byte_width];
        int  num_chunks;
        int  chunk_byte_width;
        int  head;
        int  tail;
        int  reclaim;
        bool done[num_chunks];
    };
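
A sketch of how head and tail might index this array, assuming the finite queue behaves as a ring of num_chunks chunks; the talk does not spell out this arithmetic.

    /* Assumed ring-buffer semantics for the queue struct above. */
    BYTE* chunk_at(struct queue* q, int i) {
        return q->ptr + (i % q->num_chunks) * q->chunk_byte_width;
    }
    bool queue_empty(struct queue* q) { return q->head == q->tail; }
    bool queue_full(struct queue* q)  { return q->tail - q->head == q->num_chunks; }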
19. Ex: GRAMPS has chunks
[Diagram: index queue → idxVtxAssemble → preVtxShade queue → vtxShade → postVtxShade queue]
- index_queue chunks contain vertex indices
- preVtxShade_queue chunks contain 16 pre-transformed vertices
- postVtxShade_queue chunks contain 16 transformed vertices
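
For concreteness, a postVtxShade_queue chunk might be laid out as below; the talk fixes only the 16-vertex count, so every field here is an assumption.

    #define VERTS_PER_CHUNK 16

    /* Hypothetical layout of one postVtxShade_queue chunk. */
    typedef struct {
        float pos[VERTS_PER_CHUNK][4];    /* clip-space xyzw per vertex */
        float attr[VERTS_PER_CHUNK][8];   /* interpolants; count assumed */
        int   count;                      /* valid vertices in this chunk */
    } postVtxShadeChunk;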
20. Ex: GRAMPS has chunks
[Diagram: rasterize → preFragShade queue → fragShade]
- preFragShade_queue chunks contain interpolated inputs for 16 fragments:
  - liveness mask per fragment
  - x,y position per quad
  - uniform data shared across all fragments
21. Queue API
- Window: a view into a contiguous range of chunks, for assemble threads
- Symmetric for producing/consuming access

    struct qwin {
        BYTE* ptr;
        int   num;
        int   id;
    };

- Shader threads just have push
22. Queue manipulation
(All threads)

    void produce();    /* push */

(Assemble threads only)

    qwin reserve(qwin q, int num_chunks);
    qwin commit(qwin q, int num_chunks);
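
Putting the calls together, a sketch of the producing side as an assemble thread might drive it; build_chunk is a hypothetical helper, and the reserve-then-commit pattern follows the signatures above.

    /* Sketch: produce n chunks, one reserve/commit pair per chunk. */
    void produce_n(qwin out, int n) {
        for (int i = 0; i < n; i++) {
            out = reserve(out, 1);   /* grow the window; may block for space */
            build_chunk(out.ptr);    /* fill the newly reserved chunk in place */
            out = commit(out, 1);    /* retire it so consumers can see it */
        }
    }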
23. Internal threads
ThreadEnv:
- type: shader, assemble, or fixed-func
- program code
- uniforms / constant data
- sampler / texture / resource id bindings
- list of input queues
- list of output queues
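
Read as a record, the environment might look like the sketch below; the field names and types are assumptions, not the simulator's definitions.

    typedef enum { SHADER, ASSEMBLE, FIXED_FUNC } threadType;

    /* Hypothetical threadEnv record mirroring the list above. */
    typedef struct {
        threadType  type;
        void*       program;     /* program code */
        const BYTE* uniforms;    /* uniforms / constant data */
        const int*  resources;   /* sampler / texture / resource id bindings */
        queue**     inputs;      /* list of input queues */
        queue**     outputs;     /* list of output queues */
    } threadEnv;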
24. Shader threads
- Shading language unchanged (HLSL)
- Still write shaders in terms of single elements
- Compilation produces code to operate on chunks

    void hlsl_likefn(const element* inputEl,
                     element* outputEl,
                     const sampler* foo,
                     const tex3d* tex);
25. Internal shader threads
- Shader thread code processes chunks
- Input:
  - GRAMPS pre-reserved chunks from in/out queues
  - Environment info (uniforms, consts, etc.)

    void shaderFn(const chunk* in_chunks,
                  chunk* out_chunks,
                  const env* env);

- Dispatched shader threads run to completion
- Completion implies:
  - in_chunks are released
  - out_chunks are committed
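
Tying this to the previous slide, the compiled chunk-level code could simply wrap the per-element function in a loop; the el, sampler0, and tex0 fields and the 16-element count (from the earlier chunk slides) are assumptions.

    /* Sketch: wrapper the compiler might emit around hlsl_likefn. */
    void shaderFn(const chunk* in_chunks, chunk* out_chunks, const env* env) {
        for (int i = 0; i < 16; i++) {
            hlsl_likefn(&in_chunks->el[i], &out_chunks->el[i],
                        &env->sampler0, &env->tex0);
        }
        /* Returning = completion: in_chunks released, out_chunks committed. */
    }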
26. Assemble threads
- Assemble threads build chunks
- Access queue data via windows
- Commit/reserve/consume may block the thread

    void assembleFn(qwin in_win,
                    qwin out_win,
                    const env* env);
27. Ex: primitive assembly
- Input chunks: 16 verts
- Output chunks: 16 prims
- Prim structure depends on the type of prim
  - Points, lines, triangles, triangles w/ adjacency, etc.
- Creating prims from verts depends on topology
  - Strips or lists
- Triangle strip data for an output chunk comes from multiple input chunks (see the sketch below)
[Diagram: prePrimAssemble queue → primAssemble → prePrimShade queue]
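
A sketch of the strip case: after the first two vertices, every vertex completes one triangle, and the carried pair lets an output chunk straddle input chunks. more_input, get_vertex, and emit_prim are hypothetical helpers.

    /* Sketch: assembling triangle-strip prims from 16-vertex chunks. */
    void primAssembleFn(qwin in_win, qwin out_win, const env* env) {
        vertex prev[2];                  /* carried across chunk boundaries */
        int seen = 0;
        while (more_input(env)) {
            in_win = reserve(in_win, 1); /* next input chunk of 16 verts */
            for (int i = 0; i < 16; i++) {
                vertex v = get_vertex(in_win.ptr, i);
                if (seen >= 2)           /* strip: one new tri per vertex */
                    emit_prim(&out_win, prev[0], prev[1], v);
                prev[0] = prev[1];
                prev[1] = v;
                seen++;
            }
            in_win = commit(in_win, 1);  /* done consuming this chunk */
        }
    }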
28. Ex: frag assembly (rast)

    for (each input triangle) {
        add triangle uniform data to chunk
        while (chunk not full && triangle not done) {
            rasterize next tile of quads
            for (each nonempty quad) {
                add 4 fragments to chunk
                add quad description to chunk
            }
            if (chunk is full) {
                qwin_out = commit(qwin_out, 1)
                /* grow window with reserve() if necessary */
            }
        }
    }

Building chunks:
1. Compact valid quads
2. Data at various frequencies
29. Execution: Tier 1
[Diagram: Tier-1 scheduler view. The graph's queues connect shader and assemble threadEnvs; Tier 1 pushes ShaderThr dispatch and AssembleThr resume entries into the Tier-1-to-Tier-0 FIFO, reacting to Thread_Done() (implicit commit), Produce(), Reserve(), and Commit() events]
30. Execution: Tier 0
- Each cycle, round robin the runnable threads
- A thread that stalls is placed on the wait list
- When a thread completes:
  - Pull the next thread from the FIFO, assign it to the empty thread slot
  - Send a completion message to Tier 1
[Diagram: the simple GRAMPS core from slide 10 (threads 0 .. T-1 with R registers each, ALUs 0 .. S-1, L1 data cache or scratchpad), extended with a Tier-0 scheduler fed by the Tier-1-to-Tier-0 FIFO]
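
A minimal sketch of the Tier-0 behavior in these bullets; all names here are assumptions, not the simulator's code.

    /* Sketch: one Tier-0 scheduling step on one core. */
    void tier0_cycle(core* c) {
        hwthread* t = next_runnable_round_robin(c); /* each cycle: round robin */
        if (!t) return;
        issue_vector_instruction(t);
        if (stalled(t))
            wait_list_add(c, t);         /* park until the stall clears */
        if (completed(t)) {
            assign_slot(c, fifo_pop(c)); /* refill from the Tier-1 FIFO */
            notify_tier1(c, t);          /* completion message to Tier 1 */
        }
    }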
31. Validation
- Cores fat enough to run assemble threads can still deliver sufficient FLOPS
- Assemble threads can keep compute cores and fixed-function units busy
- We can give up domain-specific heuristics in the scheduling