Afrigraph Tutorial B: Interactive RayTracing

1
Afrigraph Tutorial B: Interactive Ray-Tracing

Ingo Wald
Philipp Slusallek
Saarland University
Computer Graphics Group
http://graphics.cs.uni-sb.de

For almost 20 years, researchers have argued that
eventually, Ray-Tracing will become faster than
rasterization

And nothing happened...
Well, almost ...

4
UNC Powerplant (12.5 Mtris, > 10 fps)
5
Four Power Plants (50 Mtris)
6
Tutorial Overview

Introduction
Introduction to Ray-Tracing
Discussion: Ray-Tracing versus Rasterization
Previous Work
Approximating Ray-Tracing
Accelerated Ray-Tracing
Interactive Ray-Tracing on PCs
Coherent Ray-Tracing Implementation
Comparisons (SW / HW)
Distributed RT of Massive Models
Outlook: Hardware Architectures for Ray-Tracing
Future Research and Conclusions

8
Introduction to Ray-Tracing

In principle: a very simple algorithm
For each pixel
Create a ray through that pixel
Cast the ray into the scene and find the closest intersection
Shade the ray at the intersection point
Can also shoot new rays during shading
Determine visibility of point lights with shadow rays
Compute reflected/refracted light by recursively tracing reflection/refraction rays
Basically, that's all
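
The per-pixel loop described on this slide can be sketched in a few lines of Python. This is only an illustrative toy, not the tutorial's implementation: the scene consists of spheres, the camera sits at the origin looking down -z, shading is plain diffuse with one point light, and all function names (`hit_sphere`, `closest_hit`, `shade`, `render`) are made up for this sketch.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def normalize(v):
    n = math.sqrt(dot(v, v))
    return (v[0] / n, v[1] / n, v[2] / n)

def hit_sphere(origin, direction, center, radius):
    """Nearest ray parameter t in front of the origin, or None.
    Assumes a unit-length direction; rays starting inside a sphere count as a miss."""
    oc = sub(origin, center)
    b = dot(oc, direction)
    c = dot(oc, oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 1e-4 else None

def closest_hit(origin, direction, spheres):
    """Exhaustive search over all spheres (no acceleration structure)."""
    best = None
    for center, radius in spheres:
        t = hit_sphere(origin, direction, center, radius)
        if t is not None and (best is None or t < best[0]):
            best = (t, center)
    return best

def shade(origin, direction, hit, spheres, light):
    """Diffuse shading; a shadow ray decides whether the light is visible."""
    t, center = hit
    p = tuple(origin[i] + t * direction[i] for i in range(3))
    n = normalize(sub(p, center))
    to_light = sub(light, p)
    dist = math.sqrt(dot(to_light, to_light))
    l = normalize(to_light)
    blocker = closest_hit(p, l, spheres)          # shadow ray
    if blocker is not None and blocker[0] < dist:
        return 0.0
    return max(0.0, dot(n, l))

def render(width, height, spheres, light):
    eye = (0.0, 0.0, 0.0)
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            # Ray through the pixel center on an image plane at z = -1.
            px = (x + 0.5) / width * 2.0 - 1.0
            py = 1.0 - (y + 0.5) / height * 2.0
            d = normalize((px, py, -1.0))
            hit = closest_hit(eye, d, spheres)
            row.append(shade(eye, d, hit, spheres, light) if hit else 0.0)
        image.append(row)
    return image
```

Reflection and refraction would be added by having `shade` call `closest_hit` and itself recursively for the secondary rays.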

9
Ray-Tracing Algorithm
10
Introduction to Ray-Tracing

Only three main components
Generating rays
Finding the closest intersection of a ray
Ray traversal
Ray-object intersection
Shading

11
Ray-Generation

Generate initial ray for each pixel
Other camera models are trivial
Fisheye lens
Non-linear distortions/Lens effects
Motion blur, depth of field
Options
More samples for anti-aliasing
Adaptive Sampling
Combine with image-based rendering (IBR)
E.g. RenderCache: reuse samples by reprojection
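
As a rough illustration of ray generation, the following sketch maps a pixel (plus a sub-pixel sample position) to a primary ray of a pinhole camera. The camera convention (eye at the origin, looking down -z, vertical field of view in degrees) and the names `primary_ray` and `supersample_offsets` are assumptions made for this example only.

```python
import math

def primary_ray(x, y, width, height, fov_deg=60.0, sx=0.5, sy=0.5):
    """Return (origin, unit direction) for pixel (x, y).
    (sx, sy) in [0, 1) selects the sample position inside the pixel."""
    aspect = width / height
    half = math.tan(math.radians(fov_deg) / 2.0)
    u = ((x + sx) / width * 2.0 - 1.0) * half * aspect
    v = (1.0 - (y + sy) / height * 2.0) * half
    n = math.sqrt(u * u + v * v + 1.0)
    return (0.0, 0.0, 0.0), (u / n, v / n, -1.0 / n)

def supersample_offsets(n):
    """Regular n x n sub-pixel sample positions for anti-aliasing."""
    return [((i + 0.5) / n, (j + 0.5) / n) for j in range(n) for i in range(n)]
```

Other camera models only change how `(u, v)` is turned into a direction; for example, a fisheye lens would map the pixel position to angles instead of a point on a plane.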

12
Ray-Traversal
[Figure: Grid (2D)]

Need to find objects quickly
Exhaustive search infeasible
Build spatial index structure
Grid, octree, BSP-tree, BVH, ...
Advantages
Logarithmic complexity
Occlusion culling
Early ray termination
Problems
Multiple intersection computations
(objects often in multiple voxels)
Dynamic scenes?

[Figure: Octree (2D)]
13
Ray-Object-Intersection

Need to compute intersections fast
Requires many floating point operations
But typically dominated by traversal (2:1)
Plenty of algorithms
Plenty of primitives
Even for triangles
Optimizations
Use SIMD CPU extensions (SSE, AltiVec, 3DNow!)
Data parallel execution
Proper caching of data
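
To make the cost of a ray-triangle test concrete, here is a sketch of one well-known algorithm, the Möller-Trumbore test. It is shown only as an example of the "plenty of algorithms" mentioned above and is not necessarily the test used in the presenters' system; the helper names are chosen for this sketch.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect_triangle(orig, d, a, b, c, eps=1e-9):
    """Return (t, u, v) for a hit in front of the origin, else None.
    (u, v) are barycentric coordinates of the hit point."""
    e1 = sub(b, a)
    e2 = sub(c, a)
    p = cross(d, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(orig, a)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(d, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return (t, u, v) if t > eps else None
```

The early exits on `u` and `v` are the kind of branches that later slides suggest restructuring when four rays are tested at once.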

14
Shading

Lots of reflection models possible
Phong, Cook-Torrance, Ward, ...
Direct use of shading languages (RenderMan)
Shading after visibility has been computed
No overhead due to overdraw
Every ray is shaded exactly once
Can generate new rays
Shadow, reflection, transmission, ...
Need to deal with recursion
Rendering cost linear in rays traced

15
Introduction to Ray-Tracing

Only three main components
Generating rays
Finding the closest intersection of a ray
Ray traversal
Ray-object intersection
Shading
Problem
Finding the closest intersection is very expensive
And there are lots of rays per image

16
Rasterization Pipeline

In contrast: Rasterization
Efficient HW implementation
Use of object coherence
Many new features
Rendering is driven by the application
Application submits geometry
Visibility determined at end
Z-buffer fragment test

Pipeline stages: Application, T&L / Vertex Ops, Rasterization, Texturing, Fragment Ops, Fragment Tests, Framebuffer
17
Rasterization: Drawbacks

Drawbacks of this approach
Use of object coherence
Pays off only if triangles are large
Rendering is driven by the application
Application has to know what is visible
Efficient occlusion culling is hard
Visibility determined at end
Overdraw: all but one fragment per pixel are discarded
High depth complexity is very inefficient
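
The overdraw argument can be illustrated with a toy z-buffer. The sketch below counts how many fragments a rasterizer generates for screen-aligned rectangles versus how many pixels end up visible; the rectangle format `(x0, y0, x1, y1, z)` and the function name are assumptions of this example, and real hardware may reject some fragments earlier than this model does.

```python
def rasterize_fragment_count(width, height, quads):
    """quads: list of (x0, y0, x1, y1, z) screen rectangles, x1/y1 exclusive.
    Returns (fragments generated, pixels covered by at least one quad)."""
    zbuf = [[float("inf")] * width for _ in range(height)]
    fragments = 0
    for x0, y0, x1, y1, z in quads:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                fragments += 1              # fragment is produced ...
                if z < zbuf[y][x]:          # ... and only then depth-tested
                    zbuf[y][x] = z
    visible = sum(1 for row in zbuf for z in row if z != float("inf"))
    return fragments, visible
```

With three full-screen layers on a 4x4 image this yields 48 fragments for 16 visible pixels, so two thirds of the fragment work is discarded. A ray caster with early ray termination would shade each pixel once.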

18
Ray-Tracing versus Rasterization

Flexibility
Handling unstructured groups of rays
Image-based rendering, reflections, shadows
Generality
Ray-Tracing is the basis for many algorithms
Global illumination, visibility, ...
Used in many disciplines
Physics, Biology, Chemistry, Telecom, ...

19
Ray-Tracing versus Rasterization

Simple and Efficient Shading
Shading happens after visibility computation
Direct use of Shading Languages
Correctness & Image Quality
Rasterization inherently relies on approximations
Environment maps, shadow maps, ...
Ray-traced images are correct by default
True reflections and shadows
Use of approximations is optional

20
Ray-Tracing versus Rasterization

Parallel Scalability
Ray-Tracing is embarrassingly parallel
(e.g. each pixel independent of all others)
Scales well with the available hardware
Needs fast access to the scene database

21
Ray-Tracing versus Rasterization

Scalability with Scene Size
Occlusion Culling & Logarithmic Complexity
RT never even looks at invisible geometry
RT traversal allows for efficient searching: O(log N)
Rasterization shows linear behavior: O(N)
=> RT wins for complex scenes
But rasterization is improving
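
A quick back-of-the-envelope calculation shows what logarithmic scaling means for the power plant models shown earlier. The sketch assumes an idealized, perfectly balanced binary tree with one triangle per leaf, which real spatial index structures only approximate; `balanced_tree_depth` is a name invented here.

```python
import math

def balanced_tree_depth(num_primitives):
    """Depth of an idealized balanced binary tree with one primitive per leaf."""
    return math.ceil(math.log2(num_primitives))

one_plant = balanced_tree_depth(12_500_000)    # single power plant, 12.5 Mtris
four_plants = balanced_tree_depth(50_000_000)  # four power plants, 50 Mtris
```

Under these assumptions a root-to-leaf descent takes about 24 steps for one power plant and about 26 for four, while a renderer that touches every triangle has four times the work.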

22
Ray-Tracing versus Rasterization

Coherence
Key to efficient rendering
Rasterization: object coherence
Allows for efficient HW implementation
But only really efficient for large triangles
Ray-Tracing: ray coherence
Improved caching, reduced bandwidth
Allows for data parallel computation
RT has much more coherence than assumed
But harder to exploit

23
Ray-Tracing versus Rasterization

Conclusion of that Comparison
Ray Tracing has many advantages
These advantages become ever more pronounced
Not only quality, also efficiency
But Ray-Tracing is (still) costly
Have to make it faster!

25
Previous and Related Work

Two ways to achieve ray-tracing-like quality interactively
Trace fewer rays per frame: approximate ray-tracing
Rasterization hardware
Image-based techniques
Interpolation of ray-traced results
Trace more rays/sec: accelerated ray-tracing
Better data structures
Better algorithms
Better implementations
Parallel processing

27
Approximated Ray-Tracing: Rasterization Hardware

HW-accelerated vista/shadow buffers
Compute visible geometry in HW
Lookup of geometry in frame buffer
Only works for primary rays and point lights
Creates artifacts (e.g. shadow buffer resolution)
Augmenting hardware with RT effects
Selective ray-tracing
Integrate ray-tracing with OpenGL rendering
Rasterization for diffuse objects
Textures or splatting [Stamminger/Haber 00/01] for ray-traced samples

28
Approximated Ray-Tracing: Corrective Textures
29
Approximated Ray-Tracing: Image-Based Techniques

RenderCache [Walter et al. 99]
Store ray samples per pixel (color, depth, ...)
Reproject samples for next frame
Detect and fill holes by sending a few new rays
Heuristic algorithms based on neighborhood
Locate and correct errors (shadows, etc.)
Pseudo-randomly sample a few other pixels
Adaptively sample near error regions
But: reprojection and heuristics are expensive
Pays off (only) when pixels are very expensive to compute directly (e.g. global illumination)
Scales badly with the number of CPUs

30
Approximated Ray-Tracing: Image-Based Techniques

Holodeck [Ward 98]
Similar to RenderCache, but
Long term storage of ray samples on disk
Fast access to samples based on grid structure
Builds light-field-like data representation

31
Approximated Ray-Tracing: Image-Based Techniques

Interpolation in the image plane
Pixel-selected ray-tracing [Akimoto 89]
Coarse sampling grid
Adaptive refinement based on error criteria
Linear interpolation between samples
General ray interpolation [Bala 99]
Object-/ray-/image-space
Time
Error bounded
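
The idea of a coarse sampling grid with adaptive refinement and linear interpolation can be sketched in one dimension (a single scanline). This is a generic illustration under simple assumptions, not the algorithm of any of the cited papers: `trace(x)` stands for an expensive ray-traced pixel value, the error criterion is just the difference between two neighbouring samples, and the scanline must be at least 2 pixels wide.

```python
def adaptive_scanline(trace, width, step=8, tol=0.05):
    """Return (values, number of pixels actually traced) for one scanline."""
    values = [None] * width
    traced = 0

    def sample(x):
        nonlocal traced
        if values[x] is None:
            values[x] = trace(x)
            traced += 1
        return values[x]

    def fill(x0, x1):
        a, b = sample(x0), sample(x1)
        if x1 - x0 <= 1:
            return
        if abs(a - b) > tol:
            # Samples differ too much: refine both halves.
            mid = (x0 + x1) // 2
            fill(x0, mid)
            fill(mid, x1)
        else:
            # Samples agree: interpolate linearly in between.
            for x in range(x0 + 1, x1):
                w = (x - x0) / (x1 - x0)
                values[x] = a + w * (b - a)

    xs = list(range(0, width - 1, step)) + [width - 1]
    for x0, x1 in zip(xs, xs[1:]):
        fill(x0, x1)
    return values, traced
```

For a scanline with a single edge, only the pixels around the edge are traced. The usual caveat applies: features smaller than the coarse grid can be missed entirely, since two agreeing samples say nothing about what lies between them.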

33
Accelerated Ray-Tracing: Better Data Structures/Algorithms

Best data structure (grid vs. BSP vs. ...)?
Always scene and implementation dependent
In practice, most do about equally well
Well-researched topic => new data structures are unlikely to be found
But: potential for better algorithms
Can we better exploit coherence?
Can we build data structures faster?
Can we build data structures fully automatically?
Also: need for dynamic data structures

34
Accelerated Ray-Tracing: Parallelization on Supercomputers

RT of large CSG models [Muuss 95]
Motivation: interactively render complex data sets
Idea: use ray-tracing
Flexibility: avoid tessellation of CSG models
Take advantage of logarithmic complexity of RT
Exploit parallelism
Implementation
Optimized, general RT algorithm
96 CPUs, SGI PowerChallenge, shared memory
Results
1-2 frames per second at video resolution (in 1995!)

35
Accelerated Ray-Tracing: Parallelization on Supercomputers

Utah Parallel RT System [Parker 99]
Similar approach to Muuss
Parallelization on shared-memory machine
Supports general primitives and volume data sets
Results
Has shown scalability up to 128 CPUs
Importance of caching analysis
New goal: interactive visual cues for visualization (same information at less cost)

37
IRT on PCs: What to keep in mind

PC hardware has changed dramatically
Processors have become much faster
But the increase in ray-tracing speed is gradual
Increasing gap between speed of CPU and memory
But the ray-tracing algorithm did not change
SIMD extensions
Flops become increasingly cheap
But difficult to take advantage of in ray-tracing
Fast (and cheap) networking: networks of PCs
But good performance on non-shared memory is hard
Small clusters are around everywhere

38
IRT on PCs: What to keep in mind

PC hardware has changed dramatically
Have to adapt our algorithms!
Special emphasis on
Keeping the CPU busy
Memory & caching (1 cache miss can cost several triangle intersections)
SIMD
Not so important any more
Instruction count, avoiding float ops

39
General Optimizations: Cache

Main memory is too slow for the CPU (1:10) (bandwidth and latency)
Keep relevant data in caches
Design algorithms for cache reuse => coherence
Align data to cache lines (32 bytes)
Separate data according to usage
Separate volatile from non-volatile data
Store intersection data separate from shading data (e.g. shading normals are not needed for intersection)
Prefetch data
Design algorithms to enable data access prediction

40
General Optimizations: Cache

Cache reuse example: triangle data structure
Variant 1: struct Triangle { Vec3f *a, *b, *c; }
Intersect() routine works on this structure
Prefetching hard (2 levels of indirection)
Data stored in 4 different memory regions (1 struct + 3 vectors)
Worst case: 8 cache misses (if each of the 4 data items straddles a cache-line border)

41
General Optimizations: Cache

Cache reuse example: triangle data structure
Variant 2: with preprocessed intersection data
All necessary data packed into 48 aligned bytes (see paper)
Con: additional data to store (48 bytes/triangle)
But several advantages
At most 2 cache misses
1 contiguous memory region => trivial to prefetch
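
The cache-miss counts for the two triangle layouts can be checked with a little arithmetic. The sketch below counts how many 32-byte cache lines a block of memory touches. It assumes 32-bit pointers (so the pointer struct and each `Vec3f` occupy 12 bytes) and 16-byte alignment for the packed record; the concrete addresses and the 12-float record contents are hypothetical and do not reproduce the layout from the paper.

```python
import struct

LINE = 32  # cache-line size in bytes, as stated on the slide

def cache_lines_touched(address, size, line_size=LINE):
    """Number of cache lines overlapped by the byte range [address, address + size)."""
    first = address // line_size
    last = (address + size - 1) // line_size
    return last - first + 1

# Variant 1: one 12-byte pointer struct plus three 12-byte vectors, each at
# an unlucky (hypothetical) address that straddles a cache-line border.
variant1_addresses = (26, 58, 90, 122)
variant1_worst = sum(cache_lines_touched(a, 12) for a in variant1_addresses)

# Variant 2: one contiguous 48-byte record (here simply 12 packed floats),
# placed at any 16-byte aligned address.
record = struct.pack("<12f", *[float(i) for i in range(12)])
variant2_worst = max(cache_lines_touched(a, len(record)) for a in range(0, 256, 16))
```

Under these assumptions the scattered layout touches up to 8 lines and the packed record never more than 2, matching the numbers on the slides.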

42
General Optimizations: Cache

This was only one example. Similarly for
BSP Nodes (even more important)
Triangle lists
Materials
Shading Data

43
General Optimizations: Simplification

Today's CPUs have very long pipelines
Simplify the code to avoid pipeline stalls
Choose simple algorithms
KISS wins (KISS: keep it simple and stupid)
E.g. BSP-tree traversal is simpler than grid traversal
Easier to maintain and optimize (e.g. prefetching)
Write tight inner loops
E.g. better caching and handling of branches
Avoid conditionals/relative jumps in inner loops
E.g. support only triangles
Avoid memory-access stalls
=> Caching, caching, caching!

44
Optimization: SIMD Extensions

Most CPUs provide SIMD extensions
Intel SSE (others: 3DNow!, AltiVec, ...)
Using SIMD: higher speed, lower bandwidth
Up to four parallel floating-point operations
=> For the cost of 1!
Fetch data once to reduce bandwidth to cache
Amortize loading cost over 4 operations
=> Factor 4 in bandwidth reduction
Overhead due to restricted instruction set
E.g. no SSE dot product
Con: programming in assembly language

45
Optimization: SIMD Extensions

How to use SIMD extensions?
Either: instruction-parallel
Combine 4 computations in the normal algorithm
E.g. the 4 mults in a dot product
Or: data-parallel
Run the algorithm on 4 different data items in parallel
E.g. 4 independent dot products
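
The difference between the two styles can be made visible by emulating 4-wide SIMD registers with Python lists. This is purely an illustration of the data flow, not real SSE code; `simd_mul` and `simd_add` stand in for one 4-wide multiply or add instruction each.

```python
def simd_mul(a, b):
    """Stand-in for one 4-wide multiply."""
    return [x * y for x, y in zip(a, b)]

def simd_add(a, b):
    """Stand-in for one 4-wide add."""
    return [x + y for x, y in zip(a, b)]

def dot_instruction_parallel(a, b):
    """One 4-component dot product: the 4 multiplies run in one SIMD op,
    but the sum across lanes needs extra (non-parallel) work."""
    prod = simd_mul(a, b)
    return (prod[0] + prod[1]) + (prod[2] + prod[3])

def dot_data_parallel(ax, ay, az, bx, by, bz):
    """Four independent 3-component dot products at once.
    Inputs are stored component-wise: ax holds the x of all four vectors, etc."""
    return simd_add(simd_add(simd_mul(ax, bx), simd_mul(ay, by)), simd_mul(az, bz))
```

The data-parallel version needs no cross-lane operations at all, which is one reason it maps better onto instruction sets that lack a horizontal add or dot product.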

46
SIMD Intersection

SIMD is best used in a data-parallel fashion
Little instruction-level parallelism (in RT)
=> Just doesn't work
Data parallel: 1 ray against 4 triangles
Hard to always have four triangles ready
Data-parallel traversal for 1 ray?
Data parallel: 4 rays against 1 triangle
Must traverse rays in parallel => ray packets
Standard intersection code
Overhead for terminated rays (e.g. 1 ray hits, 3 rays miss)

47
SIMD Intersection

Performance Results
Comparison against already optimized C code
Amortized cost for SSE code
=> 20-36 million intersections/sec! (P-III, 800 MHz)

48
SIMD BSP-Traversal

Recursive Traversal Algorithm

49
SIMD BSP-Traversal

SIMD traversal
Traverse four rays in parallel
Intersection with split plane, then traversal decision
Combine decision flags
All rays must perform the same traversal
Make sure the order is consistent
Easy to guarantee: same ray origin or same signs of direction vector
Avoid recursion and function calls
Maintain stack manually
Worst case as bad as before
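
The packet traversal outlined above can be sketched as follows. This is a simplified model, not the presenters' SSE code: it assumes axis-aligned split planes, nodes encoded as tuples `("inner", axis, split, left, right)` or `("leaf", name)`, rays whose direction components share the same sign per axis, and it only records the order in which leaves are visited instead of intersecting triangles or terminating early.

```python
INF = float("inf")

def traverse_packet(root, origins, dirs, t_near, t_far):
    """Visit leaves front to back for a whole packet, using an explicit stack.
    t_near/t_far give each ray's parametric interval inside the scene."""
    stack, visited = [], []
    node, tn, tf = root, list(t_near), list(t_far)
    while True:
        while node is not None and node[0] == "inner":
            _, axis, split, left, right = node
            # Precondition: all rays agree on the sign along this axis,
            # so one near/far order is valid for the whole packet.
            positive = dirs[0][axis] >= 0.0
            near, far = (left, right) if positive else (right, left)
            d = []
            for o, dr in zip(origins, dirs):
                if dr[axis] != 0.0:
                    d.append((split - o[axis]) / dr[axis])
                else:  # ray parallel to the plane: it stays on one side
                    on_near_side = (o[axis] <= split) == positive
                    d.append(INF if on_near_side else -INF)
            active = [a < b for a, b in zip(tn, tf)]
            # Combine the per-ray decisions into one decision per packet.
            go_near = any(act and di > a for act, di, a in zip(active, d, tn))
            go_far = any(act and di < b for act, di, b in zip(active, d, tf))
            near_tf = [min(b, di) for b, di in zip(tf, d)]
            far_tn = [max(a, di) for a, di in zip(tn, d)]
            if go_near:
                if go_far:
                    stack.append((far, far_tn, list(tf)))
                node, tf = near, near_tf
            elif go_far:
                node, tn = far, far_tn
            else:
                node = None
        if node is not None:
            visited.append(node[1])  # a real tracer intersects active rays here
        if not stack:
            return visited
        node, tn, tf = stack.pop()
```

Rays that do not overlap a child simply carry an empty interval (`tn >= tf`) into it and are treated as inactive there, which models the overhead of dragging all four rays through the same nodes.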

50
SIMD BSP-Traversal

Overhead of SIMD traversal (in %)
Fixed resolution at 1024^2 (left), fixed 2x2 packet (right)
Traversal still dominates rendering cost
Overall speedup: factor 2 to 2.3

51
Coherent Algorithm: Tracing Ray Packets

Many rays are very similar
E.g. primary and shadow rays, but others too
Handle rays together in packets of 4 rays
Process them in lock-step (=> SIMD)
Reorder computations to be partly breadth-first
Load data once and use it for all rays
Reduces memory bandwidth (e.g. SSE: factor 4!)
Increases cache utilization
Coherence increases with image resolution: more rays in the same view frustum

52
Coherent Algorithms: Shading

SIMD Phong shading
Fixed cost per image
Rearrange data from ray packets
Different depths => non-coherent shadow rays
Different materials => different shaders
Algorithm
Parallel shadow rays to light sources
SIMD shading using shadow flags
Constant shading & texturing cost (< 10%)
Procedural shading is easy (noise)

53
Coherent Ray-Tracing: Summary

Speedup
Prerequisite: expose coherence in the ray-tracing algorithm
Factor > 5: general optimizations
Factor > 2: SIMD computations
Further optimizations are possible
Better prefetching, more efficient shading
Performance
200K to 1.5M primary rays/s (800 MHz P-III)
Almost linear in the number of reflection & shadow rays

54
Comparison: Test Scenes
55
Comparison: Software Ray-Tracers

Time per primary ray (1 CPU, 512^2, in µs)
Main memory: RTRT 256 MB, others up to 1 GB
Rayshade: best grid resolution

56
Comparison: OpenGL Hardware

Frame rate with SGI Performer (512^2, fps)
HW: Octane V8, Onyx3/IR3, GeForce II GTS
CPUs: Onyx 8, nVidia 2, RTRT 1

57
Comparison: Scaling with Scene Size

Render time of subsampled terrain (seconds per frame)
Typical linear scaling of rasterization HW
Worst case for RT: no occlusion
Only 1 CPU!

Demo / Video

59
Distributed RT of Massive Models
60
Reference Model (12.5 Mtris)
61
Previous Work

Rendering of Massive Models [Aliaga 99]
Frame rate: 5 to 15 fps for a single power plant
Needs shared-memory supercomputer (SGI)
Framework of algorithms
Textured depth meshes (96% reduction in tris)
View-frustum culling & LOD (50% each)
Hierarchical occlusion maps (10%)
Extensive preprocessing required
Entire model: 3 weeks (estimated)
Only semi-automatic

62
Distributed RT of Massive Models

Ray-Tracing and massive models just match
Logarithmic scaling in primitives
Ideal for big models
Preprocessing
Simple and fast spatial sorting, fully automatic
Distributed computing
Parallel scalability to many networked computers
No scene replication
=> Our approach: use coherent ray-tracing
Caching of scene data in network
Deal with network issues by reordering

63
Ray-Tracing Issues

Distributed scene management
Several GB of scene data
File size and virtual address space (32 bit)
Cannot use OS caching (demand paging)
A cache miss will stall the entire process
1 ms network latency is roughly the time to trace several hundred rays
Reordering would need non-blocking memory reads
Need to handle cache manually
No longer limited by address space
Allows reordering of computations
Do not wait for missing data
Continue with other rays while data is being fetched
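
The manual-cache idea (suspend rays that miss, keep tracing others, resume when the data arrives) can be modelled with a small simulation. Everything here is hypothetical scaffolding: `VoxelCache`, `deliver_one`, the dictionary acting as the model server, and the simplification that a pending voxel arrives only when no ray is ready to run.

```python
from collections import deque

class VoxelCache:
    """Client-side cache that never blocks: a miss only queues a fetch."""
    def __init__(self):
        self.resident = {}
        self.pending = deque()

    def request(self, voxel_id):
        if voxel_id in self.resident:
            return self.resident[voxel_id]
        if voxel_id not in self.pending:
            self.pending.append(voxel_id)   # ask the server once
        return None

    def deliver_one(self, server):
        """Simulate the arrival of the oldest outstanding voxel."""
        if self.pending:
            vid = self.pending.popleft()
            self.resident[vid] = server[vid]

def trace_all(rays, cache, server):
    """rays: list of (ray_id, voxel_id). Returns ray ids in completion order."""
    ready = deque(rays)
    suspended = {}                           # voxel_id -> rays waiting for it
    done = []
    while ready or suspended:
        if ready:
            ray_id, vid = ready.popleft()
            if cache.request(vid) is None:
                suspended.setdefault(vid, []).append((ray_id, vid))
            else:
                done.append(ray_id)          # data is local: trace the ray now
        else:
            cache.deliver_one(server)        # only now do we actually wait
            for vid in [v for v in suspended if v in cache.resident]:
                ready.extend(suspended.pop(vid))
    return done
```

Rays whose voxel is already cached finish first, and the rays that missed are completed afterwards, so the completion order differs from the submission order. That reordering is exactly what demand paging by the operating system cannot offer.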

64
Massive Models: Caching

2-Level BSP-Trees
Caching based on voxels
Voxels are completely self-contained

65
Structure of the BSP-Tree
66
Distribution Issues

Preprocessing
Simple spatial sorting
Need out-of-core algorithm due to model size
Simplistic implementation: 2.5 hours
Estimated with optimizations: < 30 min
Model Server
Single server provides all model data
Potential bottleneck
Should be distributed as well
At least for more than 10 clients
Trivial to implement

67
Distribution Issues

Load Balancing
Tile based (32x32 pixels)
Demand driven
Avoid idle times
Prefetching tiles
Asynchronous communication
Frame-to-frame coherence
Keep rays on the same client
Simple: keep tiles on the same client
Better: assign tiles based on reprojected pixels
Larger effective cache size
Increases with number of clients
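
Tile-based, demand-driven load balancing can be illustrated with a small simulation. The sketch splits the image into 32x32 tiles and hands the next tile to whichever client becomes idle first; per-tile costs are made-up numbers, and the function names are chosen for this example. Tile prefetching and frame-to-frame tile affinity are left out.

```python
import heapq

def make_tiles(width, height, tile=32):
    """Split the image into (x, y, w, h) tiles; edge tiles may be smaller."""
    return [(x, y, min(tile, width - x), min(tile, height - y))
            for y in range(0, height, tile)
            for x in range(0, width, tile)]

def demand_driven(costs, n_clients):
    """Assign tiles in order to the client that is idle first.
    Returns (owner of each tile, finish time of each client)."""
    idle = [(0, c) for c in range(n_clients)]
    heapq.heapify(idle)
    owner, finish = [], [0] * n_clients
    for cost in costs:
        t, c = heapq.heappop(idle)       # earliest idle client requests work
        owner.append(c)
        finish[c] = t + cost
        heapq.heappush(idle, (finish[c], c))
    return owner, finish
```

With tile costs `[4, 1, 1, 1, 1]` and two clients, the first client spends its time on the expensive tile while the second picks up all the cheap ones, and both finish at time 4. A static round-robin assignment of the same tiles would leave one client busy until time 6.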

68
Results

Setup
Seven dual Pentium-III 800-866 MHz
Fast Ethernet (100 Mbit) for normal clients
Gigabit Ethernet only for display & model server
Performance for one Power Plant
4-5 fps without SSE optimization
Factor 2 speedup with SSE
Almost perfect scaling from 1 to 14 CPUs
Never tried any more than that

69
Animation: Framerate vs. Bandwidth
70
Speedup
71

Demo / Video

73
Ray-Tracing Hardware

Summary so far
RT has many technical advantages
Better performance for large scenes (log N vs. N)
Better image quality, more features
But: high initial cost on main CPU
=> Hardware support would help

74
Ray-Tracing Hardware: Why today?

The setting has changed
Real scenes aren't suited for rasterization any more
High depth complexity
Large scenes, small triangles
Shading becomes more expensive
Demand for more features (shading, programmability)
Advantages of ray-tracing finally come into play
Also: flops aren't that expensive any more
Number of gigaflops per GeForce?
Neither is memory

75
Ray-Tracing Hardware: Previous Work

Over the last decade: several research systems
Often suffered from lack of resources
Memory and flops too expensive 10 years ago
Offline ray-tracing: AR250 (ART)
Accelerated offline rendering, bandwidth limited
Volume ray-casting systems
Full volume ray casting on a chip
Many, some already commercially successful

76
Ray-Tracing Hardware: The SHARP Architecture

SHARP architecture (Tim Purcell, Stanford)
Mixed SW/HW approach
Based on SmartMemories [Mai 00]
Multiprocessor on a chip
Roughly 64 R10k, with 8 GB/s (!) memory bandwidth

77
Ray-Tracing Hardware: The SHARP Architecture

Conclusions from SHARP (also see Siggraph 2001, Course 13)
Simple caching works very well
Good ray coherence
Off-chip bandwidth is minimal
Simple memory access design
Performance (512x512)
Conference scene: 50 fps
Reconfigurability allows adapting to demands
Adapt number of shading/traversal units to scene

78
Ray-Tracing Hardware: Other Architectures

RAYA (MERL, Siggraph 2001, Course 13)
Based on Memory Coherent Ray-Tracing [Pharr]
CORA (Saarbrücken)
Hardware version of Coherent RT Algorithm
Custom-designed chip
Est. performance: 30/25 fps at 1024x768
Cruiser: 3.5 Mtris, 2 lights
BunnyQuake: 110 Ktris, 2 lights, 3 reflection levels

80
What you should take home with you

Interactive ray tracing IS feasible
If attention is paid to the underlying hardware
It's not only feasible, it's already there
Not only a theoretical fantasy any more
And even on cheap PCs
Not only better, it can even be faster
At least for certain applications

81
The Future

IRT enables completely new applications
Just think of what has been done with OpenGL
Large-scale visualization, engineering, ...
Handling of huge models
Interactive global illumination (?)
Need to adapt algorithms to the new situation
Flexible rendering
Gaze tracking and non-uniform sampling density
Image-based or frameless rendering
Question: What can IRT do for you?

82
Open Research Problems

Can we make it even faster ?
Hardware
What is the best HW architecture?
Dynamic Scenes
Optimized rebuild or transformation of index?
API
Better alternative to OpenGL's push model?
OpenGL not suited for Ray-Tracing
Global Illumination
Efficient new algorithms

83
Acknowledgements

AMD
Generous support, sponsoring and collaboration
Soon: 24-node dual-Athlon IV, 1.5 GHz cluster
Presenters of the Siggraph 2001 Course 13
Images, material, and information
Tim Purcell & Pat Hanrahan (Stanford)
Many discussions and ideas
The Max-Planck-Institute at Saarbruecken
Collaboration and use of their graphics hardware
C. Benthin, M. Wagner & others
Work on the RT implementation and discussions
