Title: ManyCore Programming with GRAMPS
1Many-Core Programming with GRAMPS and Real-Time REYES
Jeremy Sugerman, Kayvon Fatahalian
Stanford University
June 12, 2008
2Background, Outline
- Stanford Graphics / Architecture Research
- CPU, GPU trends
- And collision?
- Two research areas
- HW/SW Interface, Programming Model
- Future Graphics API
3Problem Statement
- Drive efficient development and execution in many-/multi-core systems.
- Support homogeneous, heterogeneous cores.
- Inform future hardware
- Status Quo
- GPU Pipeline (Good for GL, otherwise hard)
- CPU (No guidance, fast is hard)
4GRAMPS
[Figure: Rasterization Pipeline. Stages: Rasterize, Shade, FB Blend. Queues: Input Fragment Queue, Output Fragment Queue. Output: Frame Buffer.]
[Figure: Ray Tracing Pipeline. Stages: Camera, Intersect, Shade, FB Blend. Queues: Ray Queue, Ray Hit Queue, Fragment Queue. Output: Frame Buffer.]
- Software defined graphs
- Producer-consumer, data-parallelism
- Initial focus on rendering
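A software-defined graph of stages connected by queues can be sketched roughly as follows. This is a toy illustration, not the GRAMPS API: the `Graph` class, its methods, the stage ordering (Camera, Intersect, Shade, FB Blend), and the placeholder intersection test are all assumptions made for the example, and the serial `run` loop merely stands in for a real scheduler.

```python
from collections import deque


class Graph:
    """Toy execution graph: stages connected by FIFO queues (hypothetical names)."""

    def __init__(self):
        self.queues = {}   # queue name -> deque of pending items
        self.stages = []   # (stage name, input queue name, function)

    def add_queue(self, name):
        self.queues[name] = deque()

    def add_stage(self, name, in_queue, fn):
        # fn(item, push) consumes one item and may push results to any queue.
        self.stages.append((name, in_queue, fn))

    def push(self, queue, item):
        self.queues[queue].append(item)

    def run(self):
        # Serial stand-in for a scheduler: run stages until every queue is drained.
        progress = True
        while progress:
            progress = False
            for _name, in_queue, fn in self.stages:
                q = self.queues[in_queue]
                while q:
                    fn(q.popleft(), self.push)
                    progress = True


def build_ray_tracer(width, height):
    g = Graph()
    for name in ("ray", "ray_hit", "fragment"):
        g.add_queue(name)
    frame_buffer = {}

    def intersect(ray, push):
        x, y = ray
        hit = (x + y) % 2 == 0          # placeholder for a real intersection test
        push("ray_hit", (x, y, hit))

    def shade(ray_hit, push):
        x, y, hit = ray_hit
        push("fragment", (x, y, 1.0 if hit else 0.0))

    def fb_blend(fragment, push):
        x, y, color = fragment
        frame_buffer[(x, y)] = color

    g.add_stage("Intersect", "ray", intersect)
    g.add_stage("Shade", "ray_hit", shade)
    g.add_stage("FB Blend", "fragment", fb_blend)

    # Camera acts as the source stage: one ray per pixel.
    for y in range(height):
        for x in range(width):
            g.push("ray", (x, y))
    return g, frame_buffer


graph, fb = build_ray_tracer(4, 2)
graph.run()
```

The application, not the runtime, decides which stages and queues exist; swapping the Camera and Intersect stages for a rasterizer would change the pipeline without changing the `Graph` machinery.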
5As a GPU Evolution
- Not (too) radical for graphics
- Like fixed → programmable shading
- Pipeline undergoing massive shake up
- Diversity of new parameters and use cases
- Bigger picture than graphics
- Rendering is more than GL/D3D
- Compute is more than rendering
- Larrabee has no innate pipeline
6As a Compute Evolution
- Sounds like streaming
- Execution graphs, kernels, data-parallelism
- Streaming: squeeze out every FLOP
- Goals: bulk transfer, arithmetic intensity
- Intensive static analysis, custom chips (mostly)
- Bounded space, data access, execution time
- GRAMPS: interesting apps are irregular
- Goals: dynamic, data-dependent code
- Aggregate work at run-time
- Heterogeneous commodity platforms
- Naturally supports streaming when applicable
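To illustrate what "aggregate work at run-time" might look like, here is a minimal sketch in which a stage with data-dependent fan-out feeds an aggregator that groups outputs into fixed-size packets for data-parallel consumption. The `PacketAggregator` class, the packet size of 4, and the `secondary_rays` fan-out rule are hypothetical and chosen only for the example.

```python
class PacketAggregator:
    """Collects individually produced items into fixed-size packets (hypothetical helper)."""

    def __init__(self, packet_size):
        self.packet_size = packet_size
        self.pending = []   # items waiting for a full packet
        self.packets = []   # completed packets, ready for a data-parallel consumer

    def push(self, item):
        self.pending.append(item)
        if len(self.pending) == self.packet_size:
            self.flush()

    def flush(self):
        # Emit whatever is pending, even if the packet is only partially full.
        if self.pending:
            self.packets.append(self.pending)
            self.pending = []


def secondary_rays(hit_id):
    # Data-dependent fan-out: some hits spawn no rays, others spawn several.
    return [(hit_id, k) for k in range(hit_id % 3)]


agg = PacketAggregator(packet_size=4)
for hit_id in range(10):
    for ray in secondary_rays(hit_id):
        agg.push(ray)
agg.flush()   # end of input: release the final, partial packet

packet_sizes = [len(p) for p in agg.packets]
```

The amount of output per input is not known statically, so packets are formed as work arrives rather than planned ahead of time, which is the contrast with static streaming analysis drawn above.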
7GRAMPS Role
- A graphics pipeline is now an app!
- GRAMPS models parallel state machines.
- Compared to status quo
- More flexible than a GPU pipeline
- More guidance than bare metal
- Portability in between
- Not domain specific
8GRAMPS Interfaces
- Host/Setup: Create execution graph
- Thread: Stateful, singleton
- Shader: Data-parallel, auto-instanced
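The difference between the Thread and Shader stage types can be sketched as below. None of these names come from GRAMPS: `TilerThread`, `shade_kernel`, and `run_shader` are hypothetical, and a standard-library thread pool stands in for the runtime that instances shader work automatically.

```python
from concurrent.futures import ThreadPoolExecutor


class TilerThread:
    """Thread-style stage: one instance that keeps state across its lifetime."""

    def __init__(self, tile_size):
        self.tile_size = tile_size
        self.tiles_emitted = 0   # persistent state owned by the single instance

    def run(self, width, height):
        tiles = []
        for y in range(0, height, self.tile_size):
            for x in range(0, width, self.tile_size):
                tiles.append((x, y))
                self.tiles_emitted += 1
        return tiles


def shade_kernel(tile):
    """Shader-style stage: stateless, one logical instance per input element."""
    x, y = tile
    return (x, y, (x + y) % 256)


def run_shader(kernel, items, workers=4):
    # The runtime, not the programmer, decides how many instances run at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, items))   # map preserves input order


tiler = TilerThread(tile_size=8)
tiles = tiler.run(32, 16)
shaded = run_shader(shade_kernel, tiles)
```

Because `shade_kernel` carries no state between elements, the runtime is free to launch as many or as few instances as the hardware favors, whereas the tiler's counter only makes sense with a single instance.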
9What We've Built (System)
10GRAMPS Scheduler
- Tiered Scheduler
- Fat cores: per-thread, per-core
- Micro cores: shared hw scheduler
- Top level: tier N
11What We've Built (Apps)
[Figure: Direct3D Pipeline (with Ray-tracing Extension) and Ray-tracing Pipeline. Surviving labels: stages Camera, Tiler, Sampler, Intersect, Shade, FB Blend; queues Tile Queue, Sample Queue, Ray Queue, Ray Hit Queue, Fragment Queue; output Frame Buffer.]
12Initial Results
- Queues are small, utilization is good
13GRAMPS Visualization
14GRAMPS Visualization
15GRAMPS Portability
- Portability really means performance.
- Less portable than GL/D3D
- GRAMPS graph is hardware sensitive
- More portable than bare metal
- Enforces modularity
- Best case, just works
- Worst case, saves boilerplate
16High-level Challenges
- Is GRAMPS a suitable GPU evolution?
- Enable pipeline competitive with bare metal?
- Enable innovation: advanced / alternative methods?
- Is GRAMPS a good parallel compute model?
- Map well to hardware, hardware trends?
- Support important apps?
- Concepts influence developers?
17What's Next for GRAMPS?
- Implementation: scheduling, simulation details
- Model
- Graph modification (state change)
- Blocking calls (join)
- Intra/inter-stage synchronization primitives
- Data sharing / ref-counting
- Workloads: REYES, physics, others?
- Develop new graphics pipelines
18Real-Time REYES
19Just Build It
- Build a real-time REYES pipeline...
- that is tightly integrated with ray tracing for global effects.
20What does real-time REYES mean? (to us)
- Smooth surfaces via adaptive tessellation
- Everything is a displaced subdivision surface
- Shade on surface, prior to rasterization
- Stochastic rasterization for motion blur and DOF
- Order-independent transparency
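21OpenGL/Direct3D and REYES pipelines
[Figure: the two pipelines side by side, stage by stage]
OpenGL/Direct3D: Tessellate (xbox), Vertex Shade, Rasterize, Early Z, Frag Shade, Z Test, Blend/Resolve
REYES: Split, Dice, Displace, Early Z, Shade, Rasterize, Z Test, Blend/Resolve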
22REYES Tessellation
Split primitive into smaller primitives until a
GOOD grid can be created.
26Grids
Regular parametric sampling of primitive surface (like XBox360). Compact representation for many adjacent polygons. Grids provide SIMD efficiency and bulk processing benefits.
GOOD GRID
- Max polygon area < 1 pixel
- All polys about the same size
- Bounded polys per grid
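The split-until-GOOD-grid loop can be sketched as follows under strong simplifying assumptions: each primitive is treated as a flat screen-space rectangle `(x, y, w, h)` measured in pixels, dicing is uniform (so all polygons in a grid are the same size by construction), and the bound of 256 polygons per grid is an arbitrary value picked for the example. A real REYES implementation splits in parametric space and bounds the projected surface.

```python
import math

MAX_POLYS_PER_GRID = 256   # assumed bound; the slides do not give a number


def grid_dims(w, h):
    # Pick enough samples per axis that each quad covers less than one pixel.
    return math.floor(w) + 1, math.floor(h) + 1


def is_good_grid(w, h):
    nu, nv = grid_dims(w, h)
    return nu * nv <= MAX_POLYS_PER_GRID


def split_and_dice(patch, out):
    """Recursively split a patch until it can be diced into a good grid."""
    x, y, w, h = patch
    if is_good_grid(w, h):
        nu, nv = grid_dims(w, h)
        out.append((patch, nu, nv))   # dice: record the grid resolution
        return
    # Split along the longer screen-space axis.
    if w >= h:
        split_and_dice((x, y, w / 2, h), out)
        split_and_dice((x + w / 2, y, w / 2, h), out)
    else:
        split_and_dice((x, y, w, h / 2), out)
        split_and_dice((x, y + h / 2, w, h / 2), out)


grids = []
split_and_dice((0.0, 0.0, 100.0, 40.0), grids)
```

The number of splits depends on each primitive's projected size, so the amount of work per primitive is not known in advance.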
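27REYES and OpenGL/Direct3D pipelines
[Figure: the two pipelines side by side, stage by stage]
REYES: Split, Dice, Displace, Early Z, Shade, Rast/Crack Fix, Z Test, Blend/Resolve
OpenGL/Direct3D: Tessellate (xbox), Vertex Shade, Rast, Early Z, Frag Shade, Z Test, Blend/Resolve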
28What does real-time REYES mean? (to us)
- Smooth surfaces via adaptive tessellation
- Splitting is irregular (and serial)
- Crack fixing
- Shade on surface, prior to rasterization
- We feel confident about this
- But most work is done before moving to raster space (hmm)
- Stochastic rasterization for motion blur and DOF
- Many tiny polygons → parallel rasterization
- SIMD tricky
- Order-independent transparency
- Not unique to REYES
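One common way to obtain order-independent transparency is to collect every fragment that lands on a pixel, sort by depth, and blend back to front. The sketch below shows that generic approach for a single pixel; it is not taken from the slides, and the `composite` function and its fragment layout `(depth, rgb, alpha)` are assumptions for illustration.

```python
def composite(fragments, background=(0.0, 0.0, 0.0)):
    """Blend one pixel's fragments; input order does not matter.

    fragments: list of (depth, (r, g, b), alpha); smaller depth is nearer.
    """
    color = background
    # Sort far to near, then apply the 'over' operator fragment by fragment.
    for _depth, rgb, alpha in sorted(fragments, key=lambda f: f[0], reverse=True):
        color = tuple(alpha * c + (1.0 - alpha) * d for c, d in zip(rgb, color))
    return color


red_near = (1.0, (1.0, 0.0, 0.0), 0.5)    # half-transparent red, in front
blue_far = (2.0, (0.0, 0.0, 1.0), 1.0)    # opaque blue, behind

result_a = composite([red_near, blue_far])
result_b = composite([blue_far, red_near])   # same fragments, submitted in reverse
```

Both submission orders yield the same pixel color, which is the property the renderer needs when fragments arrive from parallel stages in no particular order.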
29Shading in a Hybrid System
- Evaluate displacement (due to REYES or on demand for ray tracing)
- Shade grids
- Shade ray hits
- Looking forward: shade quads too?
- One shading system or two or three?
30This Project is Really About
- Re-architecting REYES pipeline for real-time performance (for throughput architectures like LRB)
- Hybrid rendering: study interoperability of advanced techniques (REYES, ray tracing, maybe Direct3D)
- Hybrid shading system
- Understand workload balance
- Hybrid pipeline interface: real-time, retained mode
- Pursuit of more flexible, advanced graphics pipelines
31Questions?