Title: Interactive Distributed Ray Tracing of Highly Complex Models
1 Interactive Distributed Ray Tracing of Highly Complex Models
- Ingo Wald
- University of Saarbrücken
- http://graphics.cs.uni-sb.de/wald
- http://graphics.cs.uni-sb.de/rtrt
2 Reference Model (12.5 million tris)
3 Power Plant: Detail Views
4 Previous Work
- Interactive Rendering of Massive Models (UNC)
- Framework of algorithms
- Textured depth meshes (96% reduction in tris)
- View-frustum culling and LOD (50% each)
- Hierarchical occlusion maps (10%)
- Extensive preprocessing required
- Entire model: 3 weeks (estimated)
- Framerate (Onyx): 5 to 15 fps
- Needs shared-memory supercomputer
5 Previous Work II
- Memory-Coherent Ray Tracing, Pharr et al. (Stanford)
- Explicit cache management for rays and geometry
- Extensive reordering and scheduling
- Too slow for interactive rendering
- Provides global illumination
- Parallel Ray Tracing, Parker et al. (Utah), Muuss (ARL)
- Needs shared-memory supercomputer
- Interactive Rendering with Coherent Ray Tracing (Saarbrücken, EG 2001)
- IRT on (cheap) PC systems
- Avoiding CPU stalls is crucial
6Previous Work Lessons Learned
- Rasterization possible for massive models but
not straightforward (UNC) - Interactive Ray Tracing is possible
(Utah,Saarbrücken) - Easy to parallelize
- Cost is only logarithmic in scene size
- Conclusion Parallel, Interactive Ray Tracing
should work great for Massive Models
7 Parallel IRT
- Parallel Interactive Ray Tracing
- Supercomputer: more threads
- PCs: distributed IRT on a cluster of workstations (CoW)
- Distributed CoW: need fast access to scene data
- Simplistic access to scene data
- mmap: caching all done automatically by the OS
- Either: replicate the scene
- Extremely inflexible
- Or: access a single copy of the scene over NFS (mmap)
- Network issues: latencies / bandwidth
8 Simplistic Approach
- Caching via OS support won't work
- The OS can't even address more than 2 GB of data
- Massive models >> 2 GB!
- Also an issue when replicating the scene
- Process stalls due to demand paging
- Stalls are very expensive!
- Dual 1 GHz P-III: 1 ms stall = 1 million cycles, about 1000 rays! (see the arithmetic below)
- The OS automatically stalls the process → reordering impossible
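Written out, the stall arithmetic behind that bullet (the ~1000 cycles per ray is the figure implied by the slide's own numbers, not a measurement made here):

$$
1\,\mathrm{ms} \times 1\,\mathrm{GHz} = 10^{6}\ \text{cycles},
\qquad
\frac{10^{6}\ \text{cycles}}{\sim 10^{3}\ \text{cycles per ray}} \approx 10^{3}\ \text{rays lost per stall}.
$$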
9 Distributed Scene Access
- The simplistic approach doesn't work
- Need manual caching and memory management
10 Caching Scene Data
- 2-level hierarchy of BSP trees
- Caching based on self-contained voxels
- Clients need only the top-level BSP tree (a few KB)
- Straightforward implementation (see the sketch below)
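A minimal C++ sketch of how the two-level scheme fits together; the names (Voxel, VoxelCache, TopLevelBSP, suspendRay) are illustrative assumptions, not the actual RTRT classes. The small top-level BSP tree is replicated on every client and only selects voxels; each voxel is a self-contained caching grain with its own BSP subtree and triangles.

```cpp
// Sketch only: illustrative names, not the actual RTRT implementation.
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

struct Ray;   // origin, direction, t-interval (omitted)
struct Hit;   // intersection record (omitted)

// A voxel is the caching grain: a self-contained block holding its own
// BSP subtree plus the triangles inside it, loadable as one unit.
struct Voxel {
    bool intersect(Ray &ray, Hit &hit) const;
};

// The top-level BSP tree is only a few KB and is replicated on every
// client; it maps a ray to the voxels it pierces, front to back.
struct TopLevelBSP {
    std::vector<uint32_t> voxelsAlongRay(const Ray &ray) const;
};

class VoxelCache {
    std::unordered_map<uint32_t, std::shared_ptr<Voxel>> resident;
public:
    std::shared_ptr<Voxel> lookup(uint32_t voxelId);   // nullptr if not resident
    void requestAsync(uint32_t voxelId);               // non-blocking fetch from the server
};

// Provided by the scheduler (later slides): park a ray instead of stalling.
void suspendRay(Ray &ray, uint32_t missingVoxelId);

bool traceRay(Ray &ray, Hit &hit, const TopLevelBSP &top, VoxelCache &cache) {
    for (uint32_t id : top.voxelsAlongRay(ray)) {       // front-to-back order
        std::shared_ptr<Voxel> v = cache.lookup(id);
        if (!v) {                                       // cache miss
            cache.requestAsync(id);                     // fetch in the background
            suspendRay(ray, id);                        // do not stall the CPU
            return false;
        }
        if (v->intersect(ray, hit))
            return true;                                // closest hit found
    }
    return false;                                       // ray leaves the scene
}
```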
11 BSP-Tree Structure and Caching Grain
12 Caching Scene Data
- Preprocessing: splitting into voxels
- Simple spatial sorting (BSP-tree construction)
- Out-of-core algorithm due to model size
- File-size limit and address space (2 GB)
- Simplistic implementation: 2.5 hours
- Model server (see the sketch below)
- One machine serves the entire model
- → Single server: potential bottleneck!
- Could easily be distributed
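As a rough illustration of the model server's role, a sketch of its request loop; the Connection wrapper, loadVoxelFromDisk and compressBlock are assumed placeholders, not the actual server code, and a real server would handle many clients concurrently.

```cpp
// Hypothetical model-server loop: a client sends a voxel ID, the server
// replies with that voxel's (compressed) data block.
#include <cstdint>
#include <vector>

struct Connection {                      // thin wrapper over a TCP socket (assumed)
    bool receive(uint32_t &voxelId);
    void send(const std::vector<uint8_t> &block);
};

std::vector<uint8_t> loadVoxelFromDisk(uint32_t voxelId);             // out-of-core file access
std::vector<uint8_t> compressBlock(const std::vector<uint8_t> &raw);  // e.g. LZO (see later slide)

void serveClient(Connection &client) {
    uint32_t voxelId;
    while (client.receive(voxelId)) {
        // Voxels are stored pre-split on disk, so serving one is a single
        // read plus compression (or blocks could be stored pre-compressed).
        std::vector<uint8_t> raw = loadVoxelFromDisk(voxelId);
        client.send(compressBlock(raw));
    }
}
```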
13 Hiding CPU Stalls
- Caching alone does not prevent stalls!
- Avoiding stalls → reordering (see the sketch below)
- Suspend rays that would stall on missing data
- Fetch the missing data asynchronously!
- Immediately continue with another ray
- Potentially no CPU stall at all!
- Resume stalled rays once the data is available
- Can only hide some latency
- → Minimize voxel-fetching latencies
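A sketch of the suspend/resume idea under assumed interfaces (traceUntilMissOrDone, requestVoxelAsync, newlyArrivedVoxels are illustrative, not the actual RTRT API): rays that would touch a non-resident voxel are parked per voxel, the fetch runs in the background, and the CPU keeps tracing other rays.

```cpp
// Hypothetical reordering loop: a cache miss suspends the ray instead of
// stalling the CPU; suspended rays resume once their voxel has arrived.
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

struct Ray { /* origin, direction, pixel id, ... */ };

std::deque<Ray> readyRays;                               // rays that can make progress
std::unordered_map<uint32_t, std::vector<Ray>> waiting;  // rays parked per missing voxel

// Provided elsewhere (assumed interfaces):
bool traceUntilMissOrDone(Ray &ray, uint32_t &missingVoxel); // false => hit a missing voxel
void requestVoxelAsync(uint32_t voxelId);                    // non-blocking fetch
std::vector<uint32_t> newlyArrivedVoxels();                  // fetches completed since last call

void renderLoop() {
    while (!readyRays.empty() || !waiting.empty()) {
        // 1. Trace as long as there are rays that can run.
        if (!readyRays.empty()) {
            Ray ray = readyRays.front();
            readyRays.pop_front();
            uint32_t missing;
            if (!traceUntilMissOrDone(ray, missing)) {
                waiting[missing].push_back(ray);   // park the ray
                requestVoxelAsync(missing);        // fetch in the background
            }
        }
        // 2. Wake up rays whose voxels have arrived in the meantime.
        for (uint32_t id : newlyArrivedVoxels()) {
            for (Ray &r : waiting[id]) readyRays.push_back(r);
            waiting.erase(id);
        }
    }
}
```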
14 Reducing Latencies
- Reduce network latencies
- Prefetching?
- Hard to predict data accesses several ms in advance!
- Latency is dominated by transmission time
- (100 Mbit/s → 1 MB takes 80 ms ≈ 160 million cycles; see the calculation below!)
- Reduce the transmitted data volume
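The transmission-time estimate written out (the 160 million cycles presumably counts both CPUs of a dual 1 GHz node):

$$
t_{\text{transmit}} = \frac{1\ \mathrm{MB}}{100\ \mathrm{Mbit/s}}
                    = \frac{8\ \mathrm{Mbit}}{100\ \mathrm{Mbit/s}}
                    = 80\ \mathrm{ms},
\qquad
80\ \mathrm{ms} \times 2\ \mathrm{GHz} = 1.6 \times 10^{8}\ \text{cycles}.
$$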
15 Reducing Bandwidth
- Compression of voxel data
- The LZO library provides roughly 3:1 compression
- Compared to the original transmission time, the decompression cost is negligible!
- Dual-CPU system: sharing of the voxel cache (see the sketch below)
- Amortize bandwidth, storage, and decompression effort over both CPUs
- → Even better with more CPUs
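A sketch of a node-local shared voxel cache under assumptions: decompressLZO() and fetchCompressedVoxel() stand in for the LZO call and the network fetch, and a real implementation would not hold the lock across the fetch.

```cpp
// Hypothetical shared voxel cache for a dual-CPU node: both rendering
// threads use one cache, so each voxel is transmitted, stored and
// decompressed only once per node.
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

struct Voxel;  // self-contained BSP subtree + triangles

std::shared_ptr<Voxel> decompressLZO(const std::vector<uint8_t> &compressed); // placeholder for the LZO call
std::vector<uint8_t>   fetchCompressedVoxel(uint32_t voxelId);                // from the model server

class SharedVoxelCache {
    std::mutex mtx;
    std::unordered_map<uint32_t, std::shared_ptr<Voxel>> resident;
public:
    std::shared_ptr<Voxel> get(uint32_t voxelId) {
        std::lock_guard<std::mutex> lock(mtx);        // sketch: a real cache would not
        auto it = resident.find(voxelId);             // hold the lock across the fetch
        if (it != resident.end())
            return it->second;                        // second CPU reuses the cached voxel
        std::vector<uint8_t> data = fetchCompressedVoxel(voxelId); // ~1/3 of the raw size (3:1)
        std::shared_ptr<Voxel> v  = decompressLZO(data);           // cheap vs. transmission time
        resident.emplace(voxelId, v);
        return v;
    }
};
```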
16 Load Balancing
- Load balancing
- Demand-driven distribution of tiles (32x32)
- Buffering of work tiles on the client
- Avoids communication latency
- Frame-to-frame coherence
- → Improves caching
- Keep rays on the same client
- Simple: keep tiles on the same client (implemented; see the sketch below)
- Better: assign tiles based on reprojected pixels (future)
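A sketch of the implemented scheme as it might look on the master, assuming illustrative names (TileScheduler, requestTile): tiles are handed out on demand, preferring the client that rendered the same tile in the previous frame so its voxel cache stays warm.

```cpp
// Sketch only: illustrative master-side tile scheduler, not the actual RTRT code.
#include <deque>
#include <utility>
#include <vector>

struct Tile { int x, y; };                      // 32x32 pixel block at (x, y)

class TileScheduler {
    std::vector<std::deque<Tile>> preferred;    // per client: tiles it rendered last frame
public:
    explicit TileScheduler(int numClients) : preferred(numClients) {}

    // Once per frame: re-queue every tile for the client that rendered it
    // in the previous frame (frame-to-frame coherence).
    void startFrame(const std::vector<std::pair<Tile, int>> &tileAndLastOwner) {
        for (const auto &p : tileAndLastOwner)
            preferred[p.second].push_back(p.first);
    }

    // Demand-driven: clients keep a small local buffer of tiles and ask
    // for more before it runs empty, hiding the request latency.
    bool requestTile(int client, Tile &out) {
        if (!preferred[client].empty()) {                    // reuse last frame's tile
            out = preferred[client].front();
            preferred[client].pop_front();
            return true;
        }
        for (auto &q : preferred)                            // otherwise steal any remaining tile
            if (!q.empty()) { out = q.front(); q.pop_front(); return true; }
        return false;                                        // frame is finished
    }
};
```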
17 Results
- Setup
- Seven dual Pentium-III 800-866 MHz machines as rendering clients
- 100 Mbit FastEthernet
- One display / model server (same machine)
- Gigabit Ethernet (already necessary for the pixel data)
- Power plant performance
- 3-6 fps in a pure C implementation
- 6-12 fps with SSE support
18 Animation: Framerate vs. Bandwidth
19 Scalability
- The server becomes a bottleneck beyond 12 CPUs
- → Distribute the model server!
20 Performance: Detail Views
Framerate (640x480): 3.9-4.7 fps (seven dual P-III 800-866 MHz CPUs, NO SSE)
21 Shadows and Reflections
Framerate: 1.4-2.2 fps (NO SSE)
22 Demo
23 Conclusions
- IRT works great for highly complex models!
- Distribution issues can be solved
- At least as fast as sophisticated hardware techniques
- Less preprocessing
- Cheap
- Simple and easy to extend (shadows, reflections, shading, ...)
24 Future Work
- Smaller cache granularity
- Distributed scene server
- Cache-coherent load balancing
- Dynamic scenes, instances
- Hardware support for ray-tracing
25 Acknowledgments
- Anselmo Lastra, UNC
- Power plant reference model
- Other complex models are welcome
26 Questions?
For further information visit http://graphics.cs.uni-sb.de/rtrt
27 Four Power Plants (50 million tris)
28 Detailed View of Power Plant
Framerate: 4.7 fps (seven dual P-III 800-866 MHz CPUs, NO SSE)
29 Detail View: Furnace
Framerate: 3.9 fps, NO SSE
30 Overview
- Reference Model
- Previous Work
- Distribution Issues
- Massive Model Issues
- Images and Demo