Title: Practical Parallel Processing for Today's Rendering Challenges
1Practical Parallel Processing for Today's Rendering Challenges
- SIGGRAPH 2001 Course 40
- Los Angeles, CA
2Speakers
- Alan Chalmers, University of Bristol
- Tim Davis, Clemson University
- Erik Reinhard, University of Utah
- Toshi Kato, SquareUSA
3Schedule
- Introduction
- Parallel / Distributed Rendering Issues
- Classification of Parallel Rendering Systems
- Practical Applications
- Summary / Discussion
4Schedule
- Introduction (Davis)
- Parallel / Distributed Rendering Issues
- Classification of Parallel Rendering Systems
- Practical Applications
- Summary / Discussion
5The Need for Speed
- Graphics rendering is time-consuming
- large amount of data in a single image
- animations much worse
- Demand continues to rise for high-quality graphics
6Rendering and Parallel Processing
- A holy union
- Many graphics rendering tasks can be performed in parallel
- often embarrassingly parallel
73-D Graphics Boards
- Getting better
- Perform tricks with texture mapping
- Steve Jobs' remark on constant frame rendering time
8Parallel / Distributed Rendering
- Fundamental Issues
- Task Management
- Task subdivision, Migration, Load balancing
- Data Management
- Data distributed across system
- Communication
9Schedule
- Introduction
- Parallel / Distributed Rendering Issues (Chalmers)
- Classification of Parallel Rendering Systems
- Practical Applications
- Summary / Discussion
10Introduction
- "Parallel processing is like a dog's walking on its hind legs. It is not done well; but you are surprised to find it done at all." - Steve Fiddes (with apologies to Samuel Johnson)
- Co-operation
- Dependencies
- Scalability
- Control
11Co-operation
- Solution of a single problem
- One person takes a certain time to solve the problem
- Divide problem into a number of sub-problems
- Each sub-problem solved by a single worker
- Reduced problem solution time
- BUT
- co-operation → overheads
12Working Together
- Overheads
- access to pool
- collision avoidance
13Dependencies
- Divide a problem into a number of distinct stages
- Parallel solution of one stage before the next can start
- May be too severe → no parallel solution
- each sub-problem dependent on previous stage
- Dependency-free problems
- order of task completion unimportant
- BUT co-operation still required
14Building with Blocks
Strictly sequential
Dependency-free
15Scalability
- Upper bound on the number of workers
- Additional workers will NOT improve solution time
- Shows how suitable a problem is for parallel processing
- Given problem → finite number of sub-problems
- more workers than tasks
- Upper bound may be (a lot) less than the number of tasks
- bottlenecks
16Bottleneck at Doorway
More workers may result in LONGER solution time
17Control
- Required by all parallel implementations
- What constitutes a task
- When has the problem been solved
- How to deal with multiple stages
- Forms of control
- centralised
- distributed
18Control Required
Sequential
Parallel
19Inherent Difficulties
- Failure to successfully complete
- Sequential solution
- deficiencies in algorithm or data
- Parallel solution
- deficiencies in algorithm or data
- deadlock
- data consistency
20Novel Difficulties
- Factors arising from implementation
- Deadlock
- processor waiting indefinitely for an event
- Data consistency
- data is distributed amongst processors
- Communication overheads
- latency in message transfer
21Evaluating Parallel Implementations
- Realisation penalties
- Algorithmic penalty
- nature of the algorithm chosen
- Implementation penalty
- need to communicate
- concurrent computation & communication activities
- idle time
22Solution Times
23Task Management
- Providing tasks to the processors
- Problem decomposition
- algorithmic decomposition
- domain decomposition
- Definition of a task
- Computational Model
24Problem Decomposition
- Exploit parallelism
- Inherent in algorithm
- algorithmic decomposition
- parallelising compilers
- Applying same algorithm to different data items
- domain decomposition
- need for explicit system software support
25Abstract Definition of a Task
- Principal Data Item (PDI): the data to which the algorithm is applied
- Additional Data Items (ADIs): needed to complete the computation
26Computational Models
- Determines the manner in which tasks are allocated to PEs (processing elements)
- Maximise PE computation time
- Minimise idle time
- load balancing
- Evenly allocate tasks amongst the processors
27Data Driven Models
- All PDIs allocated to specific PEs before computation starts
- Each PE knows a priori which PDIs it is responsible for
- Balanced (geometric decomposition)
- evenly allocate tasks amongst the processors
- if the number of PDIs is not an exact multiple of the number of PEs, some PEs do one extra task
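The balanced allocation above can be sketched in a few lines of Python (an illustrative sketch, not part of the course software; the function name is our own):

```python
def balanced_allocation(num_pdis, num_pes):
    """Pre-assign PDI indices to PEs before computation starts.

    Each PE receives floor(num_pdis / num_pes) tasks; when the PDIs are
    not an exact multiple of the PEs, the first `extra` PEs take one
    extra task, exactly as described on the slide.
    """
    base, extra = divmod(num_pdis, num_pes)
    allocation, start = [], 0
    for pe in range(num_pes):
        count = base + (1 if pe < extra else 0)
        allocation.append(list(range(start, start + count)))
        start += count
    return allocation
```

For example, `balanced_allocation(10, 4)` yields `[[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]`: the first two PEs take one extra task each.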
28Balanced Data Driven
(diagram: PDIs pre-distributed to PEs, computed, and results collated; the solution time spans the whole sequence)
29Demand Driven Model
- Task computation time unknown
- Work is allocated dynamically as PEs become idle
- PEs no longer bound to particular PDIs
- PEs explicitly demand new tasks
- Task supplier process must satisfy these demands
30Dynamic Allocation of Tasks
(diagram; approximately: solution time = total comp time for all PDIs / number of PEs + 2 x total comms time)
31Task Supplier Process
PROCESS Task_Supplier()
Begin
    remaining_tasks := total_number_of_tasks
    ( initialise all processors with one task )
    FOR p := 1 TO number_of_PEs
        SEND task TO PE[p]
        remaining_tasks := remaining_tasks - 1
    WHILE results_outstanding DO
        RECEIVE result FROM PE[i]
        IF remaining_tasks > 0 THEN
            SEND task TO PE[i]
            remaining_tasks := remaining_tasks - 1
        ENDIF
End ( Task_Supplier )
Simple demand driven task supplier
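The same supplier logic can be simulated in Python (a sketch with our own naming; a random choice stands in for whichever PE happens to finish its task first):

```python
from collections import deque
import random

def demand_driven_supply(tasks, num_pes, compute):
    """Simulate a demand-driven task supplier: each PE is initialised
    with one task, then receives a new one whenever it returns a
    result, until no tasks remain."""
    pending = deque(tasks)
    in_flight = {}
    for pe in range(num_pes):              # initialise all PEs with one task
        if pending:
            in_flight[pe] = pending.popleft()
    results = []
    while in_flight:                       # results outstanding
        pe = random.choice(list(in_flight))  # the PE that finishes next
        results.append(compute(in_flight.pop(pe)))
        if pending:                        # remaining_tasks > 0
            in_flight[pe] = pending.popleft()
    return results
```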
32Load Balancing
- All PEs should complete at the same time
- Some PEs busy with complex tasks
- Other PEs available for easier tasks
- Computation effort of each task unknown
- hot spot at end of processing → unbalanced solution
- Any knowledge about hot spots should be used
33Task Definition Granularity
- Computational elements
- Atomic element (e.g. ray-object intersection)
- the sequential problem's lowest computational element
- Task (e.g. trace the complete path of one ray)
- the parallel problem's smallest computational element
- Task granularity
- the number of atomic elements in one task
34Task Packet
- Unit of task distribution
- Informs a PE of which task(s) to perform
- Task packet may include
- indication of which task(s) to compute
- data items (the PDI and (possibly) ADIs)
- Task packet for a ray tracer → one or more rays to be traced
35Algorithmic Dependencies
- Algorithm adopted for parallelisation
- May specify order of task completion
- Dependencies MUST be preserved
- Algorithmic dependencies introduce
- synchronisation points → distinct problem stages
- data dependencies → careful data management
36Distributed Task Management
- Centralised task supply
- All requests for new tasks go to the System Controller → bottleneck
- Significant delay in fetching new tasks
- Distributed task supply
- task requests handled remotely from the System Controller
- spread of communication load across the system
- reduced time to satisfy a task request
37Preferred Bias Allocation
- Combining data driven & demand driven approaches
- Balanced data driven
- tasks allocated in a predetermined manner
- Demand driven
- tasks allocated dynamically on demand
- Preferred Bias Regions are purely conceptual
- enables the exploitation of any coherence
38Conceptual Regions
- task allocation no longer arbitrary
39Data Management
- Providing data to the processors
- World model
- Virtual shared memory
- Data manager process
- local data cache
- requesting & locating data
- Consistency
40Remote Data Fetches
- Advanced data management
- Minimising communication latencies
- Prefetching
- Multi-threading
- Profiling
- Multi-stage problems
41Data Requirements
- Requirements may be large
- Fit in the local memory of each processor
- world model
- Too large for each local memory
- distributed data
- provide virtual world model/virtual shared memory
42Virtual Shared Memory (VSM)
- Providing a conceptual single memory space
- Memory is in fact distributed
- Request is the same for both local & remote data
- Speed of access may be (very) different
43Consistency
- Read/write can result in inconsistencies
- Distributed memory
- multiple copies of the same data item
- Updating such a data item
- update all copies of this data item
- invalidate all other copies of this data item
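The invalidation approach can be sketched with per-PE caches modelled as plain dictionaries (a toy model of our own, not the course software):

```python
def write_invalidate(caches, key, value, writer):
    """Update `key` in the writing PE's local cache and invalidate
    (drop) every other PE's copy, so no stale copies survive."""
    for pe, cache in enumerate(caches):
        if pe == writer:
            cache[key] = value
        else:
            cache.pop(key, None)   # invalidate the copy if one exists
```

After `write_invalidate(caches, "obj", 2, writer=0)`, only PE 0 holds a (current) copy of `"obj"`; any other PE reading it must fetch the new value.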
44Minimising Impact of Remote Data
- Failure to find a data item locally → remote fetch
- Time to find a data item can be significant
- Processor idle during this time
- Latency difficult to predict
- e.g. depends on current message densities
- Data management must minimise this idle time
45Data Management Techniques
- Hiding the Latency
- Overlapping the communication with computation
- prefetching
- multi-threading
- Minimising the Latency
- Reducing the time of a remote fetch
- profiling
- caching
46Prefetching
- Exploiting knowledge of data requests
- A priori knowledge of data requirements
- nature of the problem
- choice of computational model
- DM can prefetch them (up to some specified horizon)
- available locally when required
- overlapping communication with computation
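In outline, the prefetch step looks like this (a hypothetical sketch; `adis`, `cache`, and `fetch_async` are illustrative names, not from the course software):

```python
def prefetch(cache, upcoming_tasks, horizon, fetch_async):
    """Issue asynchronous fetches for the ADIs of the next `horizon`
    tasks, so the data is already local when each task starts and the
    communication overlaps with ongoing computation."""
    for task in upcoming_tasks[:horizon]:
        for item in task["adis"]:
            if item not in cache:
                fetch_async(item)
```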
47Multi-Threading
- Keeping PE busy with useful computation
- Remote data fetch → current task stalled
- Start another task (Processor kept busy)
- separate threads of computation (BSP)
- Disadvantages: overheads
- Context switches between threads
- Increased message densities
- Reduced local cache for each thread
48Results for Multi-Threading
- More than the optimal number of threads reduces performance
- Catch-22 situation
- less local cache → more data misses → need for more threads
49Profiling
- Reducing the remote fetch time
- At the end of the computation all data requests are known
- if known in advance, they can be prefetched
- Monitor data requests for each task
- build up a picture of possible requirements
- Exploit spatial coherence (with preferred bias allocation)
- prefetch those data items likely to be required
50Spatial Coherence
51Schedule
- Introduction
- Parallel / Distributed Rendering Issues
- Classification of Parallel Rendering Systems (Davis)
- Practical Applications
- Summary / Discussion
52Classification of Parallel Rendering Systems
- Parallel rendering performed in many ways
- Classification by
- task subdivision
- polygon rendering
- ray tracing
- hardware
- parallel hardware
- distributed computing
53Classification by Task Subdivision
- Original rendering task broken into smaller pieces to be processed in parallel
- Depends on the type of rendering
- Goals
- maximize parallelism
- minimize overhead, including communication
54Task Subdivision in Polygon Rendering
- Rendering many primitives
- Polygon rendering pipeline
- geometry processing (transformation, clipping, lighting)
- rasterization (scan conversion, visibility, shading)
55Polygon Rendering Pipeline
56Primitive Processing and Sorting
- View processing of primitives as sorting problem
- primitives can fall anywhere on or off the screen
- Sorting can be done in either software or
hardware, but mostly done in hardware
57Primitive Processing and Sorting
- Sorting can occur at various places in the rendering pipeline
- during geometry processing (sort-first)
- between geometry processing and rasterization (sort-middle)
- during rasterization (sort-last)
58Sort-first
59Sort-first Method
- Each processor (renderer) assigned a portion of the screen
- Primitives arbitrarily assigned to processors
- Processors perform enough calculations to send primitives to the correct renderers
- Processors then perform geometry processing and rasterization for their primitives in parallel
60Screen Subdivision
61Sort-first Discussion
- (+) Communication costs can be kept low
- (-) Duplication of effort if primitives fall into more than one screen area
- (-) Load imbalance if primitives are concentrated
- (-) Very few, if any, sort-first renderers built
62Sort-middle
63Sort-middle Method
- Primitives arbitrarily assigned to renderers
- Each renderer performs geometry processing on its primitives
- Primitives then redistributed to rasterizers according to screen region
64Sort-middle Discussion
- (+) Natural breaking point in the graphics pipeline
- (-) Load imbalance if primitives are concentrated in particular screen regions
- Several successful hardware implementations
- PixelPlanes 5
- SGI Reality Engine
65Sort-last
66Sort-last Method
- Primitives arbitrarily distributed to renderers
- Each renderer computes pixel values for its primitives
- Pixel values are then sent to processors according to screen location
- Rasterizers perform visibility and compositing
67Sort-last Discussion
- (+) Less prone to load imbalance
- (-) Pixel traffic can be high
- Some working systems
- Denali
68Task Subdivision in Ray Tracing
- Ray tracing often prohibitively expensive on a single processor
- Prime candidate for parallelization
- each pixel can be rendered independently
- Processing easily subdivided
- image space subdivision
- object space subdivision
- object subdivision
69Image Space Subdivision
70Image Space Subdivision Discussion
- (+) Straightforward
- (+) High parallelism possible
- (-) Entire scene database must reside on each processor
- need adequate storage
- (+) Low processor communication
71Image Space Subdivision Discussion
- (-) Load imbalance possible
- screen space may be further subdivided
- Used in many parallel ray tracers
- works better with MIMD machines
- distributed computing environments
72Object Space Subdivision
- 3-D object space divided into voxels
- Each voxel assigned to a processor
- Rays are passed from processor to processor as
voxel space is traversed
73Object Space Subdivision Discussion
- (+) Each processor needs only the scene information associated with its voxel(s)
- (-) Rays must be tracked through voxel space
- (+) Load balance good
- (-) Communication can be high
- Some successful systems
74Object Partitioning
- Each object in the scene is assigned to a processor
- Rays passed as messages between processors
- Processors check for intersection
75Object Partitioning Discussion
- (+) Load balancing good
- (-) Communication high due to ray message traffic
- (-) Fewer implementations
76Schedule
- Introduction
- Parallel / Distributed Rendering Issues
- Classification of Parallel Rendering Systems
- Practical Applications
- Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence (Davis)
- Interactive Ray Tracing
- Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer
- Summary / Discussion
77Practical Experiences at Clemson
- Problems with Rendering
- Current Resources
- Deciding on a Solution
- A New Render Farm
78A Demand for Rendering
- Computer Animation course
- 3 SIGGRAPH animation submissions
- render over semester break
79Current Resources
- dedicated lab
- 8 SGI O2s (R12000, 384 MB)
- general-purpose lab
- 4 SGI O2s
- shared lab
- dual-pipe Onyx2 (8 R12000, 8 GB)
- 10 SGI O2s (R12000, 256 MB)
- offices
- 5 SGI O2s
80Resource Problems
- Rendering prohibits interactive sessions
- Little organized control over resources
- users must be self-monitoring
- m renders on n machines → 1 render on n/m machines
- Disk space
- Cross-platform distributed rendering to PCs problematic
- security (rsh)
- distributed rendering software
- directory paths
81Short-term Solutions
- Distributed rendering restricted to late night
- Resources partitioned
82Problems with Maya
- video
- Traditional distributed computing problems
- dropped frames
- incomplete frames
- tools developed
83Problems with Maya
84Problems with Maya
85Problems with Maya
- Animation inconsistencies
- next slide
- Some frames would not render
- Particle system inconsistencies
86Problems with Maya
87Rendering Tips
88Rendering Tips
89Deciding on a Solution - RenderDrive
- RenderDrive by ART (Advanced Rendering Technology)
- network appliance for ray tracing
- 16-48 specialized processors
- claims speedups of 15-40x over a Pentium III
- 768MB to 1.5GB memory
- 4GB hard disk cache
90Deciding on a Solution - RenderDrive
- plug-in interface to Maya
- RenderMan ray tracer
- 15K - 25K
91Deciding on a Solution - PCs
- Network of PCs as a render farm
- 10 PCs, each with a 1.4GHz CPU, 1GB memory, and a 40GB hard drive
- Maya will run under Windows 2000 or Linux (Maya 4.0)
- Distributed rendering software not included for Windows 2000
92Deciding on a Solution - PCs Win
- RenderDrive had some unusual anomalies
- Interactive capabilities
- Scan-line or ray tracing
- Distributed rendering software may be included
- Problems with security still exist
- shared file system
93Schedule
- Introduction
- Parallel / Distributed Rendering Issues
- Classification of Parallel Rendering Systems
- Practical Applications
- Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence (Davis)
- Interactive Ray Tracing
- Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer
- Summary / Discussion
94Agenda
- Background
- Temporal Depth-Buffer
- Frame Coherence Algorithm
- Parallel Frame Coherence Algorithm
95Background - Ray Tracing
- Closest to physical model of light
- High cost in terms of time / complexity
96Background - Frame Coherence
- Frame coherence
- those pixels that do not change from one frame to the next
- derived from object and temporal coherence
- We should not have to re-compute those pixels whose values will not change
- writing pixels to frame files
97Background - Test Animation
- Glass Bounce (60 frames at 320x240, 5 objects)
98Background - Frame Coherence
99Previous Work
- Frame coherence
- moving camera / static world [Hubschman and Zucker 81]
- estimated frames [Badt 88]
- stereoscopic pairs [Adelson and Hodges 93/95]
- 4D bounding volumes [Glassner 88]
- voxels and ray tracking [Jevans 92]
- incremental ray tracing [Murakami 90]
100Previous Work (cont.)
- Distributed computing
- Alias and 3D Studio
- most major productions starting with Toy Story [Henne 96]
101Goals
- Render exactly the same set of frames in much less time
- Work in conjunction with other optimization techniques
- Run on a variety of platforms
- Extend a currently popular ray tracer (POV-Ray) to allow for general use
102Temporal Depth-Buffer
- Similar to traditional z-buffer
- For each pixel, store a temporal depth in frame units
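As a sketch (with hypothetical helper names of our own):

```python
def make_t_buffer(width, height):
    """Temporal depth-buffer: one entry per pixel, holding the earliest
    frame (in frame units) at which the pixel may change and must be
    recomputed.  Initialising to frame 0 forces every pixel to be
    computed once at the start of the animation."""
    return [[0] * width for _ in range(height)]

def needs_recompute(t_buffer, x, y, frame):
    """A pixel is retraced only once the animation reaches the frame
    number stored in its temporal depth entry."""
    return frame >= t_buffer[y][x]
```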
103Frame Coherence Algorithm
104Frame Coherence Algorithm
105Frame Coherence Algorithm
- Identify the volume within 3D object space where movement occurs
- Divide the volume uniformly into voxels
- For each voxel, create a list of the frame numbers in which changing objects inhabit this voxel
106Frame Coherence Algorithm
- In each frame, track rays through voxels for each pixel
- From the voxels traversed, find the one with the lowest frame number
- Record that number in the temporal depth-buffer
107Frame Coherence Algorithm
for each frame of the animation
    for each pixel that needs to be computed for this frame
        trace the rays for this pixel
        for each voxel that any of these rays intersect
            get the next frame number to compute
        set the t-buffer entry to the lowest frame number found
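The pseudocode above can be put together as runnable Python (a sketch with hypothetical callbacks: `trace_pixel` is assumed to return a colour plus the voxels its rays crossed, and `next_change` the next frame in which a moving object enters a given voxel):

```python
def render_animation(num_frames, width, height, trace_pixel, next_change):
    """Frame coherence main loop; returns the number of pixels actually
    traced, for comparison against num_frames * width * height."""
    image = [[None] * width for _ in range(height)]
    t_buffer = [[0] * width for _ in range(height)]  # frame 0: trace all
    traced = 0
    for frame in range(num_frames):
        for y in range(height):
            for x in range(width):
                if frame < t_buffer[y][x]:
                    continue          # pixel cannot have changed: reuse it
                color, voxels = trace_pixel(x, y, frame)
                image[y][x] = color
                traced += 1
                # valid until a changing object enters a traversed voxel
                t_buffer[y][x] = min(next_change(v, frame) for v in voxels)
        # (write `image` out to the frame file here)
    return traced
```

For a fully static scene, `next_change` always returns `num_frames`, so only the first frame's pixels are ever traced.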
108Frame Coherence Algorithm
(figure: the temporal depth-buffer as a grid of per-pixel frame numbers: mostly 5s, with regions of 3s and 2s and a single 1 where moving objects force earlier recomputation)
109Voxel Volume
- Uniform voxel spatial subdivision
- Voxel can be non-cubical
- Ways to determine voxel volume
- user-supplied
- pre-processing phase
- active voxel marking
- in a distributed environment, done by the master, the slaves, or both
110Frame Coherence Example
111Test Animation
- Pool Shark (620 frames at 640x480, 174 objects)
112Test Animations - Problem
113Results
114Frame Coherence Discussion
- Localized movement can have global effects
- Performance depends on both the number and complexity of recomputed pixels
- Issues
- overhead
- antialiasing
- motion blur
115Temporal Depth-Buffer Discussion
- Uses less memory than other methods
- Simple
- Can be used with other algorithms
116Parallel Frame Coherence Algorithm
- Distributed computing environment
- 1-8 Sun Sparc Ultra 5 processors running at 270 MHz
- Coarse-grain parallelism
- Load balancing
- divide work among processors
- keep data together for frame coherence
117Load Balancing
- Image space subdivision
- each processor computes a subregion for the entire length of the run
- Recursively subdivide subsequences to keep processors busy
118Screen Subdivision
119Load Balancing
- Coarse bin packing: find the block with the smallest number of computed frames
- Keep statistics on average first-frame time and average coherent-frame time
- Find a hole in the sequence
- Leave some free frames before the new start
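The bin-packing step can be sketched as follows (illustrative field names of our own):

```python
def pick_block(blocks):
    """Coarse bin packing: hand an idle processor the screen block with
    the fewest frames computed so far, so all blocks tend to finish at
    the same time."""
    return min(blocks, key=lambda b: b["computed_frames"])
```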
120Load Balancing Example
121Results - Parallel Frame Coherence
122Results
123Another Test Animation
- Soda Worship (60 frames at 160x120, 839 objects)
124Another Test Animation
125Results
126Results Discussion
- Good speedup
- Multiplicative speedup when the two are combined
- Speedup limitations
- voxel approximation
- writing pixels to frame files (communication)
127Conclusions
- Frame coherence algorithm combined with distributed computing provides good speedup
- Algorithm scales well
- Techniques are useful and accessible to a wide variety of users
- Benefits depend on inherent properties of the animation
128Shameless Advertisement
- Masters of Fine Arts in Computing (MFAC)
- special effects and animation courses
- two year program
- Clemson Computer Animation Festival in Fall 2002
129Schedule
- Introduction
- Parallel / Distributed Rendering Issues
- Classification of Parallel Rendering Systems
- Practical Applications
- Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence
- Interactive Ray Tracing (Reinhard)
- Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer
- Summary / Discussion
130Overview
- Introduction
- Interactive ray tracer
- Animation and interactive ray tracing
- Sample reuse techniques
131Introduction
132Interactive Ray Tracing
- Renders effects not available using other rendering algorithms
- Feasible on high-end supercomputers provided suitable hardware is chosen
- Scales sub-linearly in scene complexity
- Scales almost linearly in the number of processors
133Hardware Choices
- Shared memory vs. distributed memory
- Latency and throughput for pixel communication
- Choice: shared memory
- This section of the course focuses on SGI Origin series supercomputers
134Shared Memory
- Shared address space
- Physically distributed memory
- ccNUMA architecture
135SGI Origin 2000 Architecture
136Implications
- ccNUMA machines are easy to program
- but it is more difficult to generate efficient code
- Memory mapping and processor placement may be important for certain applications
- This topic returns later in the course
137Overview
- Introduction
- Interactive ray tracer
- Animation and interactive ray tracing
- Sample re-use techniques
138Interactive Ray Tracing
139Basic Algorithm
- Master-slave configuration
- Master (display thread) displays results and farms out ray tasks
- Slaves produce new rays
- Task size reduced towards end of each frame
- Load balancing
- Cache coherence
140Tracing a Single Ray
- Use spatial subdivisions for ray acceleration (familiarity assumed)
- Use a grid or bounding volume hierarchy
- Could be optimized further, but good results have been obtained with these acceleration structures
- Efficiency mainly due to low-level optimization
141Low Level Optimization
- Ray tracing in general
- Ray coherence: neighboring rays tend to intersect the same objects
- Cache coherence: objects previously intersected are likely to still reside in cache for the current ray
- Memory access patterns are important (next slide)
142Memory Access
- On SGI Origin series computers
- Memory allocated for a specific process may be located elsewhere in the machine → reading memory may be expensive
- Processes may migrate to other processors when executing a system call → the whole cache becomes invalidated; previously local memory may now be remote and more expensive to access
143Memory Access (2)
- Pin down processes to processors
- Allocate memory close to where the processes that will use it run
- Use sysmp and sproc for processor placement
- Use mmap or dplace for memory placement
144Further Low Level Optimizations
- Know the architecture you work on (Appendix III.A in the course notes)
- Use profiling to find expensive bits of code and cache misses (Appendix III.B in the course notes)
- Use padding to fit important data structures on a single cache line
145Frameless Rendering
- Display pixel as soon as it is computed
- No concept of frames
- Perceptually preferable
- Equivalent of a full frame takes longer to compute
- Less efficient exploitation of cache coherence
- This alternative will return later in this course
146Overview
- Introduction
- Interactive ray tracer
- Animation and interactive ray tracing
- Sample re-use techniques
147Animation and Interactive Ray Tracing
148Why Animation?
- Once interactive rendering is feasible, walk-through alone is not enough
- Desire to manipulate the scene interactively
- Render preprogrammed animation paths
149Issues to Be Addressed
- What stops us from animating objects?
- Answer: spatial subdivisions
- Acceleration structures are normally built during pre-processing
- They assume objects are stationary
150Possible Solutions
- Target applications that require a small number of objects to be manipulated/animated
- Render these objects separately
- Traversal cost will be linear in the number of animated objects
- Only feasible for an extremely small number of objects
151Possible Solutions (2)
- Target a small number of manipulated or animated objects
- Modify existing spatial subdivisions
- For each frame: delete the object from the data structure
- Update the object's coordinates
- Re-insert the object into the data structure
- This is our preferred approach
152Spatial Subdivision
- Should be able to deal with
- Basic operations such as insertion and deletion of objects should be rapid
- User manipulation can cause the extent of the scene to grow
153Subdivisions Investigated
- Regular grid
- Hierarchical grid
- Borrows from octree spatial subdivision
- In our case this is a full tree: all leaf nodes are at the same depth
- Both acceleration structures are investigated in the next few slides
154Regular Grid Data Structure
- We assume familiarity with spatial subdivisions!
155Object Insertion Into Grid
- Compute bounding box of object
- Compute overlap of bounding box with grid voxels
- Object is inserted into overlapping voxels
- Object deletion works similarly
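The insertion and deletion steps above can be sketched as follows (our own helper names; the grid is modelled as a dictionary mapping voxel coordinates to object lists):

```python
import math

def overlapped_voxels(bbox_min, bbox_max, voxel_size):
    """Yield the voxel coordinates overlapped by an axis-aligned
    bounding box in a uniform grid with cubic voxels."""
    lo = [math.floor(bbox_min[a] / voxel_size) for a in range(3)]
    hi = [math.floor(bbox_max[a] / voxel_size) for a in range(3)]
    for i in range(lo[0], hi[0] + 1):
        for j in range(lo[1], hi[1] + 1):
            for k in range(lo[2], hi[2] + 1):
                yield (i, j, k)

def insert_object(grid, obj_id, bbox_min, bbox_max, voxel_size):
    """Insert the object into every voxel its bounding box overlaps."""
    for v in overlapped_voxels(bbox_min, bbox_max, voxel_size):
        grid.setdefault(v, []).append(obj_id)

def delete_object(grid, obj_id, bbox_min, bbox_max, voxel_size):
    """Deletion mirrors insertion: visit the same voxels and remove."""
    for v in overlapped_voxels(bbox_min, bbox_max, voxel_size):
        grid[v].remove(obj_id)
```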
156Extensions to Regular Grid
- Dealing with expanding scenes requires
- Modifications to object insertion/deletion
- Ray traversal
157Extensions to Regular Grid (2)
158Features of New Grid Data Structure
- We call this an Interactive Grid
- Straightforward object insertion/deletion
- Deals with expanding scenes
- Insertion cost depends on relative object size
- Traversal cost somewhat higher than for a regular grid
159Hierarchical Grid
- Objectives
- Reduce insertion/deletion cost for larger objects
- Retain advantages of interactive grid
160Hierarchical Grid (2)
161Hierarchical Grid (3)
- Build a full octree with all leaf nodes at the same level
- Allow objects to reside in leaf nodes as well as in nodes higher up in the hierarchy
- Each object can be inserted into one or more voxels of at most one level in the hierarchy
- Small objects reside in leaf nodes; large objects reside elsewhere in the hierarchy
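One plausible way to pick the level for an object (a heuristic sketch of our own, assuming the object's size relative to the voxel size decides the level, not necessarily the authors' exact rule):

```python
def choose_level(object_size, root_size, num_levels):
    """Return the hierarchy level (0 = root, num_levels-1 = leaves)
    whose voxels are just large enough to hold the object: small
    objects sink to the leaf nodes, large objects stay higher up."""
    level = num_levels - 1
    voxel = root_size / (2 ** level)   # leaf voxel size
    while level > 0 and object_size > voxel:
        level -= 1
        voxel *= 2
    return level
```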
162Hierarchical Grid (4)
- Features
- Deals with expanding scenes, similar to the interactive grid
- Reduced insertion/deletion cost
- Traversal cost somewhat higher than the interactive grid
163Test Scenes
164Video
165Measurements
- We measure
- Traversal cost of
- Interactive grid
- Hierarchical grid
- Regular grid
- Object update rates of
- Interactive grid
- Hierarchical grid
166Framerate vs. Grid Size (Sphereflake)
167Framerate vs. Grid Size (Triangles)
168Framerate Over Time (Sphereflake)
169Framerate Over Time (Triangles)
170Conclusions
- Interactive manipulation of ray traced scenes is both desirable and feasible using these modifications to grid and hierarchical grids
- Slight impact on traversal cost
- (More results available in the course notes)
171Overview
- Introduction
- Interactive ray tracer
- Animation and interactive ray tracing
- Sample re-use techniques
172Sample Re-use Techniques
173Brute Force Ray Tracing
- Enables interactive ray tracing
- Does not allow large image sizes
- Does not scale to scenes with high depth complexity
174Solution
- Exploit temporal coherence
- Re-use results from previous frames
175Practical Solutions
- Tapestry (Simmons et al. 2000)
- Focuses on complex lighting simulation
- Render cache (Walter et al. 1999)
- Addresses scene complexity issues
- Explained next
- Parallel render cache (Reinhard et al. 2000)
- Builds on Walter's render cache
- Explained next
176Render Cache Algorithm
- Basic setup
- One front-end for
- Displaying pixels
- Managing previous results
- Parallel back-end for
- Producing new pixels
177Render Cache Front-end
- Frame based rendering
- For each frame do
- Project existing points
- Smooth image and display
- Select new rays using heuristics
- Request samples from back-end
- Insert new points into point cloud
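The projection step in the loop above can be sketched as a z-buffer over the cached point cloud (a sketch with a hypothetical `project` callback returning pixel, depth, and colour, or None when the point falls off-screen):

```python
def reproject(points, project):
    """Reproject cached 3D points into the new view, keeping only the
    nearest point per pixel; returns a sparse pixel -> colour image."""
    depth, image = {}, {}
    for p in points:
        hit = project(p)
        if hit is None:
            continue                    # point falls outside the view
        pix, z, color = hit
        if pix not in depth or z < depth[pix]:
            depth[pix] = z
            image[pix] = color
    return image
```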
178Render Cache
179Render Cache (2)
- Point reprojection is relatively cheap
- Smooth camera movement for small images
- Does not scale to large images or large numbers of renderers → the front-end becomes a bottleneck
180Parallel Render Cache
- Aim: remove the front-end bottleneck
- Distribute point reprojection functionality
- Integrate point reprojection with renderers
- Front-end only displays results
181Parallel Render Cache (2)
182Parallel Render Cache (3)
- Features
- Scalable behavior for scene complexity
- Scalable in number of processors
- Allows larger images to be rendered
- Retains artifacts from render cache
- Introduces new artifacts
183Artifacts
- Render cache artifacts at tile boundaries
- Image deteriorates during camera movement
- These artifacts are deemed more acceptable than
loss of smooth camera movement!
184Video
185Test Scenes
186Results
- Sub-parts of the algorithm measured individually
- Measure time per call to each subroutine
- Sum over all processors and all invocations
- Afterwards divide by the number of processors and the number of invocations
- Results are measured in events per second per processor
187Scalability (Teapot Model)
188Scalability (Room Model)
189Samples Per Second
190Reprojections Per Second
191Conclusions
- Exploitation of temporal coherence gives significantly smoother results than brute force ray tracing alone
- This comes at the cost of some artifacts, which require further investigation
- (More results available in the course notes)
192Acknowledgements
- Thanks to
- Steven Parker for writing the interactive ray tracer in the first place
- Brian Smits, Peter Shirley and Charles Hansen for involvement in the animation and parallel point reprojection projects
- Bruce Walter and George Drettakis for the render cache source code
193Schedule
- Introduction
- Parallel / Distributed Rendering Issues
- Classification of Parallel Rendering Systems
- Practical Applications
- Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence
- Interactive Ray Tracing
- Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer (Kato)
- Summary / Discussion
194Outline
- What is Kilauea?
- Parallel ray tracing & photon mapping
- Kilauea architecture
- Shading logic
- Rendering results
195Outline
- What is Kilauea?
- Parallel ray tracing & photon mapping
- Kilauea architecture
- Shading logic
- Rendering results
196Objective
- Global illumination
- Extremely complex scenes
197Parallel Processing
- Hardware
- Multi-CPU machine
- Linux PC cluster
- Software
- Threading (Pthread)
- Message passing (MPI)
198Our Render Farm
199Global Illumination
200Ray Tracing Renderer
201Ray Tracing Renderer
202Ray Tracing Renderer
203Outline
- What is Kilauea?
- Parallel ray tracing & photon mapping
- Kilauea architecture
- Shading logic
- Rendering results
204Parallel Ray Tracing
205Parallel Ray Tracing
206Accel Grid
207Simple Case (scene distribution)
208Simple Case (ray tracing)
209Parallel Ray Tracing
210Complex Case (scene distribution)
211Complex Case (accel grid construction)
212Complex Case (ray tracing)
213Outline
- What is Kilauea?
- Parallel ray tracing & photon mapping
- Kilauea architecture
- Shading logic
- Rendering results
214Parallel Photon Mapping
- Photon trace
- Photon lookup
215Parallel Photon Mapping
- Photon trace
- Photon lookup
216Photon Tracing (simple case)
217Photon Tracing (complex case)
218Parallel Photon Mapping
- Photon trace
- Photon lookup
219Photon Lookup (simple case)
220Photon Lookup (complex case)
221Outline
- What is Kilauea?
- Parallel ray tracing & photon mapping
- Kilauea architecture
- Shading logic
- Rendering results
222Task
- Mtask
- Wtask
- Btask
- Stask
- Rtask
- Atask
- Etask
- Ltask
- Ptask
- Otask
223Task Assignment
224Roles of Tasks
225Task Configuration
226Task Configuration
227Task Configuration
228Task Interaction
229Task Interaction
230Task Interaction
231Task Interaction
232Task Interaction
233Task Interaction
234Task Interaction (simple case)
235Roles of Tasks (photon map)
236Task Configuration (photon map)
237Task Configuration (photon map)
238Task Interaction (photon map)
239Task Interaction (photon map)
240Task Interaction (photon map)
241Task Interaction (photon map)
242Task Configuration (simple photon)
243Task Priority
244Outline
- What is Kilauea?
- Parallel ray tracing & photon mapping
- Kilauea architecture
- Shading logic
- Rendering results
245Parallel Shading Problem
246Parallel Shading Problem
247Parallel Shading Problem (solution)
248Parallel Shading Problem (solution)
249Parallel Shading Problem (solution)
250Parallel Shading Problem (solution)
251Parallel Shading Problem (solution)
252Parallel Shading Problem (solution)
253Parallel Shading Problem (solution)
254Parallel Shading Problem (solution)
255Decomposing Shading Computation
256Decomposing Shading Computation
257Decomposing Shading Computation
258SPOT
259SPOT Condition
260Parallel Shading Solution using SPOT
261Parallel Shading Solution using SPOT
262Shader SPOT Network Example
263Outline
- What is Kilauea?
- Parallel ray tracing & photon mapping
- Kilauea architecture
- Shading logic
- Rendering results
264Rendering Results
- Test machine specification
- 1GHz dual Pentium III
- 512MB memory
- 100BaseT Ethernet
- 18 machines connected via a 100BaseT switch
265Quatro
- 700,223 triangles, 1 area point sky light, 1280 x 692
- 18 machines: 7min 19sec
266Quatro single Atask test
267Jeep
- 715,059 triangles, 1 directional sky light, 1280 x 692
- 18 machines: 8min 27sec
268Jeep4
- 2,859,636 triangles, 1 directional sky light, 1280 x 692
- 18 machines: 12min 38sec (2 Atasks x 1)
269Jeep4 2 Atasks test
270Jeep8
- 5,719,072 triangles, 1 directional sky light, 1280 x 692
- 16 machines: 18min 43sec (4 Atasks x 4)
271Escape POD
- 468,321 triangles, 1 directional sky light, 1280 x 692
- 18 machines: 14min 55sec
272ansGun
- 20,279 triangles, 1 spot sky light, 1280 x 960
- 18 machines: 16min 38sec
273SCN101
- 787,255 triangles, 1 area light, 1280 x 692
- 18 machines: 9min 10sec
274Video
275Conclusion / Future Work
- We achieved
- Close to linear parallel performance
- Highly extensible architecture
- We will achieve even more
- Speed
- Stability
- Usability (user interface)
- Etc.
276Additional Information
- Kilauea live rendering demo
- BOOTH 1927 SquareUSA
- http://www.squareusa.com/kilauea/
277Schedule
- Introduction
- Parallel / Distributed Rendering Issues
- Classification of Parallel Rendering Systems
- Practical Applications
- Summary / Discussion (Chalmers)
278Summary
279Contact Information
- Alan Chalmers
- alan@cs.bris.ac.uk
- Tim Davis
- tadavis@cs.clemson.edu
- Toshi Kato
- http://www.squareusa.com/kilauea/
- Erik Reinhard
- reinhard@cs.utah.edu
- Slides
- http://www.cs.clemson.edu/tadavis