Title: RealTime Parallel Radiosity
1Real-Time Parallel Radiosity
- Matt Craighead
- May 8, 2002
- 6.338J/18.337J, Course 6 AUP
2What is Radiosity?
- A computer graphics technique for lighting
- Two types of lighting algorithms
- Local easy, fast, but not realistic
- Global slow, difficult, but highest quality
- Radiosity is a global algorithm
- Global algorithms try to take into account
interreflections in scenes.
3Radiosity in Real Time?
- Local algorithms can easily run in real time,
either in software or hardware. - Computational demands grow with number of
surfaces and number of lights. 106 surfaces is
not unreasonable! - Most global algorithms take quadratic time in
number of surfaces (all interactions!).
4Local Doesnt Mean Bad
- Id Softwares Doom 3 (video capture)
5but Global is Better
- State-of-the-art radiosity rendering from 1988 (5
hours render time!)
6Local Lighting Math
- Consider a light source at some point in
three-dimensional space, and a surface at some
other location. - How much does the light directly contribute to
the brightness of the surface? - Note that global algorithms still need to do
local lighting as a first step.
7Local Lighting Math
- Set up a 3-dimensional coordinate system centered
on the surface. Define unit vectors L (light), N
(normal), E (eye) - L points towards the light.
- N is perpendicular to the surface.
- E points towards the viewer.
N
E
L
8Local Lighting Math
- If an object sits between the light and surface,
the lighting contribution is zero. - Otherwise
- Surfaces generally reflect light in two ways
diffuse (dull) and specular (shiny). - We can see three colors red, green, blue.
- So let Md, Ms be 3-vectors indicating how much of
R,G,B are reflected in each way.
9Local Lighting Math
- Diffuse lighting is independent of view angle.
It is brightest when N and L are most closely
aligned, and falls off with the cosine of the
angle between them. - All lighting also falls off with the square of
the distance from the light source. - So, the diffuse term is d-2Md(N L).
10Local Lighting Math
- Specular lighting is view-dependent. In one
simple formulation (Blinn shading), we determine
the vector H halfway between E and L and evaluate
d-2Ms(N H)s, where s indicates the shininess of
the surface.
H
N
E
L
11Local Lighting Math
- The easiest way to think of H, the half-angle
vector, is that if H and N were aligned, and the
surface were a mirror, then the light would
reflect straight to the eye. - (N H)s can be thought of as representing a
probability distribution of microfacets, whose
normals are clustered around N but do vary.
Smoother surfaces have higher s.
12Local Lighting Math
- So, our full contribution from a light source is
d-2(Md(N L) Ms(N H)s). - This may also be multiplied by a light color.
- We may also add in an emissive term Me for
glowing objects. - If the lack of interreflection makes things too
dark, we may add an ambient term LaMd.
13Approximations
- We may not compute this formula at every pixel on
the screen, but only at the vertices of the
object instead. - The specular exponent may be evaluated using a
power approximation rather than a real power
function. - The specular formula is already a cheesy hack
14More Approximations
- Shadows are hard. At the cost of much realism,
they can be omitted or faked. - Draw a little dark spot under an object
- Ambient is itself an approximation of real global
illumination. - N L and (N H)s are idealizationsin reality,
this can be an arbitrary 4-dimensional function
called a BRDF.
15Radiosity
- Radiosity is a global algorithm that handles
diffuse lighting only. - The term global illumination refers to global
algorithms that handle specular lighting also. - Specular makes things much more difficult.
- I will only discuss plain radiosity.
16Radiosity in a Nutshell
- Suppose there are n surfaces in the scene.
- Let Ai be the area of surface i.
- Let Ei be the amount of light energy emitted from
surface i per unit time and area. - Let Bi be the amount of light energy emitted
and/or reflected from surface i per unit time and
area. - Let ?i be the diffuse albedo of surface i.
17Radiosity in a Nutshell
- Now, let Fij (called a form factor) be the
fraction of light from surface i that reaches
surface j. - Then, for all i, we must have
- AiBi AiEi ?i ?j ? 1,n AjBjFji
- This is just a linear system with n equations and
n unknowns. Solve and you get the Bs.
18That Easy?
- Well, not quite that easy.
- Solving the system of equations is O(n3).
- So well iterate instead.
- We still have to do local lighting.
- These become the Eis.
- We have to compute the Fij terms somehow.
- This turns out to be expensive.
- If the scene is static, precompute!
19Computing Form Factors
- Fij turns out to be a big ugly integral over the
area of both i and j. - Worse, one of the terms in the integral is
whether the two dAs can see one another! - So, no closed-form solution.
- Standard numerical integration is no good. A
raycast per sample takes too long.
20Computing Form Factors
- The usual solution is called the hemicube
algorithm. - Render the scene from the point of view of the
surface, in all directions. In effect, you are
projecting onto a hemicube. - Count up the number of times you can see each
surface (weighted appropriately). - Takes advantage of 3D acceleration!
21Hemicube Algorithm in Action
22Simplified Radiosity Equation
- It so happens that FjiAj FijAi.
- This is a simple property of the integral for F.
- So we can simplify the radiosity equation
- AiBi AiEi ?i ?j ? 1,n AjBjFji
- AiBi AiEi ?i ?j ? 1,n AiBjFij
- Bi Ei ?i ?j ? 1,n BjFij
- B E RB (where Rij ?iFij)
23Solving the Radiosity Equation
- B E RB is just a matrix/vector equation.
- Direct solution B (I ? R)-1E
- Iterative solution
- If E is a local lighting solution, then call it
B0. - Now let Bi1 B0 RBi.
- Then, Bi is simply the lighting solution after i
bounces! Since ?i lt 1 for all i (conservation of
energy), the Bis converge to B.
24Iterative vs. Direct Radiosity
- If F is a dense matrix, then direct solution
takes time O(n3), while k steps of iteration
takes time O(kn2). - Realistically, k is just a constant say, 5.
- Iterative solution is practical with n ranging up
to the hundreds of thousands! - Iteration time is proportional to the number of
nonzero entries of F.
25Sparsity of Form Factors
- In a simple cube-shaped room, all form factors
will be nonzero except for pairs on the same
wall. - So F is 5/6 nonzero. Not very encouraging
- As more objects are added, F becomes sparser.
- As the scene expands beyond one room, F becomes
much sparser. - So iterative radiosity scales extremely well!
26Storage and Precision
- Radiosity hogs memory.
- If you grid a cube-shaped room 100x100 on each
wall, and store F as a dense matrix of floats,
thats 14.4 GB. (!!!) - Storing F as sparse helps a lot.
- Good for iteration speed too.
- Using smaller values than floats helps too.
- 16-bit fixed-point is good enough.
27RLE Encoding of Form Factors
- F tends to have runs of zeros and nonzeros.
- Smart traversal order of grids makes the runs
longer.
Bad
Good
28RLE Encoding of Form Factors
- My disk storage format
- 1 byte run length, run type
- Length up to 84, type is zero, ?255, or ?65535
- Then, variable bytes with run data
- Compression ratio for my scene 5.971
- My memory storage format
- 2 bytes run length up to 65535
- 2 bytes run type (zero or ?65535)
- 2N bytes run data
- Compression ratio for my scene 2.491
29Parallelization
- Split up the surfaces among the CPUs.
- Each CPU owns those rows of the form factor
matrix. - Each CPU computes its surfaces local lighting.
- Every iteration requires an all-to-all
communication, so that each CPU has the full B
vector. - At present, my storage of F is unbalanced
compression ranges from 4.31 to 1.61.
30Radiosity Iteration Kernel
- Hand-written MMX assembly code (only main inner
loop shown) -
inner_loop prefetchnta ebx128
prefetchnta eax128 movq mm4, eax
pshufw mm7, mm4, 0xFF pshufw mm6, mm4, 0xAA
pshufw mm5, mm4, 0x55 pshufw mm4, mm4,
0x00 pmulhuw mm4, ebx pmulhuw mm5,
ebx6 pmulhuw mm6, ebx12 pmulhuw
mm7, ebx18
paddw mm0, mm4 paddw mm1, mm5 paddw
mm2, mm6 paddw mm3, mm7 add eax, 8
add ebx, 24 dec esi jnz inner_loop
31Radiosity Kernel Performance
- Timed 8 CPUs doing 500 iterations.
- Portable C fixed-point kernel 38.04 s
- Optimized MMX kernel 21.64 s
- On most loaded CPU, works out to 528 million
multiply-adds/s for the C version, 1.24 billion
multiply-adds/s for MMX. - But MMX code wastes 25 of them, so real rate is
928 million.
32Local Lighting Implementation
- One raycast is required for each quad, for each
light source! This can be expensive. - To accelerate raycasts, I made a simplified
version of my scene that was virtually
indistinguishable for raycasting purposes. - 13028 quads reduced to 120 polys, 110 cylinders
- Some cylinders used as geometry, some as bounding
volumes
33Overall Performance
- Again, 8 CPUs on 500 iterations
- Iteration only 21.64 s
- Communication only 5.96 s
- Iteration plus communication 26.84 s
- All computation 65.7 s
- All computation plus communication 64.38 s
34Remarks on Performance
- The communication overlaps very well with the
computation, to the point that it is actually a
speedup. (!) - MPI_Isend, MPI_Irecv are essential to achieving
this. - The O(n) local lighting computation is actually
taking much longer than the O(n2) radiosity
computation. - Local lighting is only pseudo-O(n), because of
the raycast costalthough for large scenes,
raycast cost should still be much less than
O(n2), due to other optimizations.
35Radiosity Frontend
- Separate client application that runs on a
Windows PC with OpenGL acceleration. - Radiosity solver running on cluster is server.
- Original plan was that the frontend would send
the scene to the server, and the server would use
the scene provided. - Since the cluster has no OpenGL acceleration, I
was reluctantly forced to precompute form
factors. - All aspects of scene except form factors still
sent to the server by the client form factors
are read from disk.
36Client/Server Architecture
- User precomputes form factors and FTPs them to
the Beowulf. - Server listens on beowulf.lcs.mit.edu5353.
- Client connects and sends scene information.
- Server reads form factors off of disk.
- Both open a network thread.
- Server streams radiosity to client via TCP/IP.
- Computation, rendering, and communication are
completely decoupled.
37Frontend Features
- Per-vertex or per-pixel local lighting, with
local viewer. Optional specular. - Shadows implemented using stencil buffer.
- Display radiosity input/output (E and B).
- Bilinear filtering of radiosity solutions on
grid-shaped surfaces. - Ultra-high-quality mode where radiosity is used
for indirect lighting only.
38Demonstration of Full System
39Demonstration of Full System
40Demonstration of Full System
41Work Yet to be Done
- More complex scene would be nice. This one is
13K quads, and I should be able to do 50K. - More optimization work on raycasting.
- Better load balancing.
- Optimization of some modes on frontend, so they
run reasonably on my laptops GeForce2 Go, not
just on a GeForce4 Ti - Alleviation of ugly banding on certain lighting
modes caused by 8-bit-per-component precision.
42Conclusions
- Real-time radiosity is feasible.
- Not tomorrow, but today.
- If todays cluster is tomorrows desktop,
real-time radiosity could start showing up in
real applications not too many years from now. - Biggest limitation may be the ability to compute
form factors efficiently. - Faster graphics hardware will make this happen.