Title: Iosif Antochi
1 Optimizations and Trade-offs for Low-Power 3D Graphics Tile-Based Rendering Architectures
Computer Engineering Laboratory, Delft University of Technology, The Netherlands
2 Summary
- Introduction
- The GRAAL environment
- Tile-based rendering
- Memory Requirements
- Scene Management
- State Management
3 GRAAL Environment Overview
Applications (Benchmarks)
Tracer (Player)
SIMULATOR
HUMAN
INTERACTION
New Architecture
Simulator Front-end (Augmented Mesa)
TCP/IP or files
Performance Evaluation
Fast Back-end (Qt)
RTL Back-end (SystemC)
4 Introduction: Graphics Application Data Structures
World
Object 1
Object 2
Object n
Texture mapping
Position
Shape
Textures
Texture
5 A Rendered Scene
Scene from the Quake 3 FPS game developed by id Software
6 The Structure of the Scene
7 3D Graphics Pipeline
8 Tile-Based Rendering
- Part 1: Memory Requirements
9 Overview
- Motivation
- Background
- Traditional and tile-based rendering methods
- Tile size vs. external data traffic
- Tile-based vs. conventional rendering
- Conclusions
10 Motivation
- External memory accesses consume a lot of power
- Tile-based rendering might be used to reduce
external memory traffic.
11 Traditional Rendering
12 Tile-Based Rendering
13 Rendering Models
Traditional rendering
Tile-Based rendering
14 Tile Size vs. External Data Traffic
15 Triangle Size Histogram
- First indication of required tile size
- Very few triangles (7) are larger than 1024
pixels
16 Number of Kilotriangles Transferred per Frame for Various Tile Sizes
- As expected, external traffic decreases as the tile size increases.
- A tile size of 32 × 32 yields a good trade-off between external traffic and on-chip memory.
17 Tile-Based vs. Conventional Rendering
18 Data Traffic Front
- The front data traffic increases for tile-based rendering.
- State change traffic increases more than geometry traffic for tile-based rendering.
19 Data Traffic Back
- Usually, there is no z data traffic for tile-based rendering.
- The color data traffic decreases for tile-based rendering.
20 Total Data Traffic
- Tile-based rendering reduces the total amount of
data traffic by a factor of 1.96
21 TBR Data Traffic - Conclusions
- A tile size of 32 × 32 pixels yields a good trade-off between the amount of on-chip memory and the amount of external data traffic.
- Tile-based rendering reduces the total amount of external traffic by a factor of 1.96.
- For workloads with a high overlap and low overdraw, traditional rendering can outperform tile-based rendering. For workloads with a low overlap and high overdraw, tile-based rendering performs better than traditional rendering.
22 Tile-Based Rendering
- Part 2: Scene Management Details
23 Scene Management Overview
- Motivation
- Background
- The Two-Stage Model
- Overlap Tests
- Scene Management Algorithms
- Results
- Conclusions
24 Motivation
- Tile-based rendering requires that primitives are
sorted into bins corresponding to tiles.
25 Background: Scene Management for Tile-Based Rendering
26 Two-Stage Model
- Tile-based rendering model (based on retained mode execution)
    procedure call_driver_for_instr(i)
    {
        if (!buffer_is_full && IS_Bufferable(i))
            Buffer_Instr(i)
        else
        {
            // we have to render all buffered instructions,
            // since either we have a Swap_Buffers instruction
            // or we ran out of buffer space
            inictx = Save_Context()
            for (tile = 0; tile < maxtile; tile++)
            {
                if (tile != 0) Set_Context(inictx)
                RestoreCurrentTile(tile)
                Render_All_Instr_For_Tile(tile)
                SwapTile(tile)
            }
        }
    }
- Stage 1: Buffering (and initial sorting)
- Stage 2: (Second sorting and) primitive sending
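The two-stage model above can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not the GRAAL driver: the `Driver` struct, the buffer capacity, the tile count, the `OP_SWAP_BUFFERS` opcode, and the counters that stand in for real rendering are all hypothetical. The slide does not say what happens to a bufferable instruction that arrives when the buffer is full; the sketch assumes it is kept for the next pass.

```c
#include <stdbool.h>
#include <stddef.h>

#define BUF_CAPACITY 4   /* hypothetical instruction buffer size */
#define MAX_TILE     3   /* hypothetical number of tiles per frame */

typedef enum { OP_TRIANGLE, OP_STATE, OP_SWAP_BUFFERS } Opcode;
typedef struct { Opcode op; int arg; } Instr;

typedef struct {
    Instr  buf[BUF_CAPACITY];
    size_t count;          /* number of buffered instructions */
    int    tile_passes;    /* stand-in for rendering: tile passes executed */
    int    instr_replayed; /* stand-in for rendering: instructions replayed */
} Driver;

/* Swap_Buffers ends the frame, so it cannot be deferred. */
static bool is_bufferable(Instr i) { return i.op != OP_SWAP_BUFFERS; }

/* Stage 2: replay the whole buffer once per tile, then empty it. */
static void flush_all_tiles(Driver *d)
{
    for (int tile = 0; tile < MAX_TILE; tile++) {
        /* A real driver would restore the saved context and the tile
           contents here, render, and then write the tile back. */
        d->tile_passes++;
        d->instr_replayed += (int)d->count;
    }
    d->count = 0;
}

/* Stage 1: buffer while possible, otherwise render everything buffered. */
void call_driver_for_instr(Driver *d, Instr i)
{
    if (d->count < BUF_CAPACITY && is_bufferable(i)) {
        d->buf[d->count++] = i;
        return;
    }
    flush_all_tiles(d);
    if (is_bufferable(i))   /* assumption: keep the instruction that overflowed */
        d->buf[d->count++] = i;
}
```

Feeding two triangles and then a swap instruction into a zero-initialized `Driver` triggers one flush that visits every tile and replays both buffered instructions per tile.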
27 BBOX Overlap Test
Determines if the bounding box of a triangle
overlaps with a tile
28 LET Overlap Test
Consider a 2D vector defined by two points A (X, Y) and B (X + dX, Y + dY), and a line L_AB that passes through the two points. The edge function for a point P (x, y) is defined as
$E_{AB}(x, y) = (x - X)\,dY - (y - Y)\,dX$
Triangle T(A,B,C) overlaps tile T(xc,yc,l) if
?
?
?
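Since the overlap conditions did not survive in this transcript, the following C sketch shows one common way to build a triangle/tile test from edge functions. It is an assumption, not necessarily the exact LET formulation of the slide. It uses the usual edge function (x - X) * dY - (y - Y) * dX, interprets the tile T(xc, yc, l) as a square with center (xc, yc) and side length l, and rejects a tile when all four of its corners lie strictly outside one triangle edge. All type and function names are hypothetical.

```c
#include <stdbool.h>

typedef struct { float x, y; } Point;

static float min3f(float a, float b, float c)
{
    float m = a < b ? a : b;
    return m < c ? m : c;
}

static float max3f(float a, float b, float c)
{
    float m = a > b ? a : b;
    return m > c ? m : c;
}

/* Edge function of the directed line A->B at (x, y):
   zero on the line, opposite signs on its two sides. */
static float edge_fn(Point a, Point b, float x, float y)
{
    float dX = b.x - a.x, dY = b.y - a.y;
    return (x - a.x) * dY - (y - a.y) * dX;
}

/* True if all four tile corners lie strictly outside edge A->B.
   inside_sign is +1 or -1, depending on the triangle winding. */
static bool tile_outside_edge(Point a, Point b, float inside_sign,
                              float x0, float y0, float x1, float y1)
{
    return inside_sign * edge_fn(a, b, x0, y0) < 0.0f &&
           inside_sign * edge_fn(a, b, x1, y0) < 0.0f &&
           inside_sign * edge_fn(a, b, x0, y1) < 0.0f &&
           inside_sign * edge_fn(a, b, x1, y1) < 0.0f;
}

bool triangle_overlaps_tile(Point a, Point b, Point c,
                            float xc, float yc, float l)
{
    float h = 0.5f * l;
    float x0 = xc - h, x1 = xc + h, y0 = yc - h, y1 = yc + h;

    /* BBOX part: cheap rejection first. */
    if (max3f(a.x, b.x, c.x) < x0 || min3f(a.x, b.x, c.x) > x1 ||
        max3f(a.y, b.y, c.y) < y0 || min3f(a.y, b.y, c.y) > y1)
        return false;

    /* The sign of the edge function at the third vertex tells which
       side of each edge is the inside of the triangle. */
    float area = edge_fn(a, b, c.x, c.y);
    if (area == 0.0f)
        return true;            /* degenerate triangle: keep the BBOX answer */
    float s = area > 0.0f ? 1.0f : -1.0f;

    if (tile_outside_edge(a, b, s, x0, y0, x1, y1)) return false;
    if (tile_outside_edge(b, c, s, x0, y0, x1, y1)) return false;
    if (tile_outside_edge(c, a, s, x0, y0, x1, y1)) return false;
    return true;
}
```

For the triangle (0,0), (10,0), (0,10), a tile of side 2 centered at (9,9) passes the bounding-box check but is rejected by the edge test, which is the kind of case where an edge-based test sends fewer triangles than BBOX alone.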
29 Algorithm DIRECT
- For each tile, scan the whole list of primitives and send the primitives that (potentially) overlap the tile to the rasterizer.
- Pseudocode
    for each triangle Tr
        buffer Tr

    for each tile T
        for each triangle Tr
            compute bbox of Tr
            if bbox of Tr and T overlap
                send Tr
30 Algorithm TWO_STEP
- Compute and store the bounding box of each triangle during the buffering stage. This avoids recomputing the bounding box for each triangle/tile pair, but it requires more memory.
- Pseudocode
    for each triangle Tr
        buffer Tr
        compute and store bbox of Tr

    for each tile T
        for each triangle Tr
            if bbox of Tr and T overlap
                send Tr
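A minimal C sketch of TWO_STEP, assuming integer screen coordinates and inclusive box bounds. The `Triangle` and `Box` layouts and the function names are hypothetical; "sending" is modeled by writing the indices of the selected triangles into an output array.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { int min_x, min_y, max_x, max_y; } Box;
typedef struct { int ax, ay, bx, by, cx, cy; Box bbox; } Triangle;

static int min3(int a, int b, int c) { int m = a < b ? a : b; return m < c ? m : c; }
static int max3(int a, int b, int c) { int m = a > b ? a : b; return m > c ? m : c; }

/* Stage 1: when a triangle is buffered, compute its bbox once and store it. */
void buffer_triangle(Triangle *tr)
{
    tr->bbox.min_x = min3(tr->ax, tr->bx, tr->cx);
    tr->bbox.min_y = min3(tr->ay, tr->by, tr->cy);
    tr->bbox.max_x = max3(tr->ax, tr->bx, tr->cx);
    tr->bbox.max_y = max3(tr->ay, tr->by, tr->cy);
}

static bool boxes_overlap(const Box *b, const Box *t)
{
    return !(b->max_x < t->min_x || b->min_x > t->max_x ||
             b->max_y < t->min_y || b->min_y > t->max_y);
}

/* Stage 2 for one tile: scan every buffered triangle and reuse the stored
   bbox. Writes the indices of the triangles to send into out_idx (which
   must hold n entries) and returns how many were written. */
size_t scan_tile(const Triangle *tris, size_t n, Box tile, size_t *out_idx)
{
    size_t sent = 0;
    for (size_t i = 0; i < n; i++)
        if (boxes_overlap(&tris[i].bbox, &tile))
            out_idx[sent++] = i;
    return sent;
}
```

The extra memory cost mentioned above is visible in the `bbox` field stored with every buffered triangle; DIRECT would recompute those four values inside the per-tile loop instead.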
31 Algorithm TWO_STEP_L
- In the second stage, the LET overlap test is used instead of BBOX. Since the LET test contains the BBOX test, the main LET is applied only to triangles that have passed the BBOX test.
- Pseudocode
    for each triangle Tr
        buffer Tr
        compute and store bbox of Tr

    for each tile T
        for each triangle Tr
            if Tr and T overlap (using LET)
                send Tr
32 Algorithm SORT
- For each tile, a buffer with pointers to the primitives that overlap the tile is created. During the second step these buffers are scanned.
- Pseudocode
    for each triangle Tr
        buffer Tr
        compute bbox of Tr
        for each tile T that overlaps bbox of Tr
            insert pointer to Tr in the buffer of T

    for each tile T
        for each triangle Tr in the buffer of T
            send Tr
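The binning step of SORT can be sketched in C as below. This is a simplified illustration under assumptions: a hypothetical 128 × 128 pixel screen split into 32 × 32 tiles, a fixed per-tile buffer capacity, and bounding boxes standing in for the buffered triangles. A real driver would have to flush when a tile buffer fills up; the sketch simply stops inserting.

```c
#include <stddef.h>

#define TILE_SIZE    32  /* tile edge in pixels */
#define TILES_X      4   /* hypothetical 128 x 128 pixel screen */
#define TILES_Y      4
#define MAX_PER_TILE 64  /* hypothetical per-tile buffer capacity */

typedef struct { int min_x, min_y, max_x, max_y; } Box;

/* Per-tile buffer of pointers into the global primitive list. */
typedef struct {
    const Box *tri[MAX_PER_TILE];
    size_t     count;
} TileBin;

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Stage 1: insert a pointer to the primitive into the buffer of every
   tile its bbox overlaps. Returns the number of buffers it was added to. */
size_t bin_triangle(TileBin bins[TILES_Y][TILES_X], const Box *bbox)
{
    /* Completely off-screen primitives are not binned at all. */
    if (bbox->max_x < 0 || bbox->max_y < 0 ||
        bbox->min_x >= TILES_X * TILE_SIZE ||
        bbox->min_y >= TILES_Y * TILE_SIZE)
        return 0;

    int tx0 = clampi(bbox->min_x / TILE_SIZE, 0, TILES_X - 1);
    int tx1 = clampi(bbox->max_x / TILE_SIZE, 0, TILES_X - 1);
    int ty0 = clampi(bbox->min_y / TILE_SIZE, 0, TILES_Y - 1);
    int ty1 = clampi(bbox->max_y / TILE_SIZE, 0, TILES_Y - 1);

    size_t inserted = 0;
    for (int ty = ty0; ty <= ty1; ty++) {
        for (int tx = tx0; tx <= tx1; tx++) {
            TileBin *bin = &bins[ty][tx];
            if (bin->count < MAX_PER_TILE) {
                bin->tri[bin->count++] = bbox;
                inserted++;
            }
        }
    }
    return inserted;
}
```

With the bins filled this way, the second stage only walks `bins[ty][tx].tri[0 .. count-1]` for each tile and performs no overlap tests, which is where SORT saves time at the price of the pointer buffers.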
33 Algorithm SORT_L
- Identical to SORT except that the LET is used to determine if a triangle and a tile overlap.
- Pseudocode
    for each triangle Tr
        buffer Tr
        compute bbox of Tr
        for each tile T that overlaps bbox of Tr
            if LET indicates Tr and T overlap
                insert pointer to Tr in the buffer of T

    for each tile T
        for each triangle Tr in the buffer of T
            send Tr
34 Experimental Setup
- OpenGL Benchmarks
- Quake III (Q3) (low and high resolution)
- Tux Racer (Tux)
- AWadvs-04 (AW), part of Viewperf
- VRML Scenes: Austrian National Library (ANL), Graz 3D (GRA), Dino (DIN)
- Estimations and Assumptions
- Some parameters, such as the average number of operations needed to compute the bbox or to perform the BBOX or LET test, were determined by running the benchmarks.
- Other parameters, such as the average number of operations needed to buffer a primitive or to insert a primitive into a tile buffer, were statistically estimated.
35 Estimated Running Time
[Figure: estimated running time relative to DIRECT]
36 Amount of Additional Memory Required
[Figure: additional memory required, in Kbytes]
37 TBR Scene Management - Conclusions
- Which algorithm is preferable depends on the available additional memory and computational power.
- The DIRECT algorithm performs poorly: on average it is 44 times slower than SORT.
- The TWO_STEP algorithm is slower than SORT by a factor of 6 but reduces the amount of additional memory by a factor of 3.2.
- SORT_L is slower than SORT by a factor of 1.6.
38 Efficient Bounding-Box Computation
For a tile T given by the tuple (T.MinX, T.MinY, T.MaxX, T.MaxY), a possible implementation of the BBOX test in C is
    if (BBOX.MaxX < T.MinX) /* Test 1 */ return NoOverlap;
    if (BBOX.MinX > T.MaxX) /* Test 2 */ return NoOverlap;
    if (BBOX.MaxY < T.MinY) /* Test 3 */ return NoOverlap;
    if (BBOX.MinY > T.MaxY) /* Test 4 */ return NoOverlap;
    return MightOverlap;
Let a triangle Tr be defined by three points A(x, y), B(x, y), C(x, y). The bounding box (BBOX) of Tr is defined by the tuple (BBOX.MinX, BBOX.MinY, BBOX.MaxX, BBOX.MaxY), where
    BBOX.MinX = MIN(A.x, B.x, C.x)
    BBOX.MinY = MIN(A.y, B.y, C.y)
    BBOX.MaxX = MAX(A.x, B.x, C.x)
    BBOX.MaxY = MAX(A.y, B.y, C.y)
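Put together as a compilable unit, the bounding-box computation and the four-comparison test might look like this. The field and result names follow the slide; the `Vertex` and `Box` types, the integer coordinates, and the function names are assumptions made for the example.

```c
typedef struct { int x, y; } Vertex;
typedef struct { int MinX, MinY, MaxX, MaxY; } Box;
typedef enum { NoOverlap, MightOverlap } OverlapResult;

static int min3(int a, int b, int c) { int m = a < b ? a : b; return m < c ? m : c; }
static int max3(int a, int b, int c) { int m = a > b ? a : b; return m > c ? m : c; }

/* Bounding box of the triangle A, B, C. */
Box compute_bbox(Vertex a, Vertex b, Vertex c)
{
    Box bbox;
    bbox.MinX = min3(a.x, b.x, c.x);
    bbox.MinY = min3(a.y, b.y, c.y);
    bbox.MaxX = max3(a.x, b.x, c.x);
    bbox.MaxY = max3(a.y, b.y, c.y);
    return bbox;
}

/* The four comparisons of the BBOX test, in the order given on the slide. */
OverlapResult bbox_test(Box bbox, Box t)
{
    if (bbox.MaxX < t.MinX) return NoOverlap;   /* Test 1 */
    if (bbox.MinX > t.MaxX) return NoOverlap;   /* Test 2 */
    if (bbox.MaxY < t.MinY) return NoOverlap;   /* Test 3 */
    if (bbox.MinY > t.MaxY) return NoOverlap;   /* Test 4 */
    return MightOverlap;
}
```

The result is named `MightOverlap` rather than `Overlap` because the bounding box can intersect a tile that the triangle itself misses.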
39 Bounding-Box Tests Order
- The four comparisons required to determine if a BBOX and a tile overlap can be performed in an arbitrary order.
- This gives a total of 24 possible arrangements. However, not every order produces the same number of comparisons on average.
- A tile divides the screen into five, possibly intersecting regions: the tile itself, the region to the east of the tile (x > T.MaxX), the region to the west of the tile (x < T.MinX), the region to the north (y > T.MaxY), and the region to the south (y < T.MinY).
40 Static Bounding-Box Tests
STATIC1: If a certain test (comparison) fails,
then there is a high probability that the test in
the opposite direction along the same dimension
succeeds. This is because after these two tests
there is only a small region left where the BBOX
of a primitive can be situated.
STATIC2: The first and second (and, hence, the
third and fourth) comparison check different
dimensions. For example, one possible order is
west, south, east, north.
41 Dynamic Bounding-Box Tests
The probability that a primitive is completely
located in the largest region is the highest.
This observation is the basis of our dynamic
versions of the bounding box test.
DYNAMIC1: The largest region is checked first.
Thereafter, the opposite direction along the same
dimension is tested. The third test examines the
largest region in the other dimension, and the
fourth test checks the remaining region.
DYNAMIC2: The comparison corresponding to the
largest region is applied first, then the
comparison corresponding to the second largest
region, etc. The region to the east of the tile
is the largest and checked first, then the region
to the south, then the one to the north, and,
finally, the region to the west.
We remark that although these schemes are called
dynamic, the order in which the comparisons are
applied depends only on the tile position and can
be determined statically off-line. For example,
for all tiles in the upper left sub-scene under
the main diagonal, the order is east, south,
north, west.
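A sketch of how the DYNAMIC2 order could be derived from the tile position, and how an ordered test could count its comparisons. It assumes inclusive pixel coordinates, a y axis that grows towards the north (matching the region definitions y > T.MaxY for north and y < T.MinY for south), and region size measured as the area of the screen strip beyond the tile. The names, the `Region` enum, and the tie-breaking order are hypothetical.

```c
typedef struct { int MinX, MinY, MaxX, MaxY; } Box;
typedef enum { WEST, EAST, SOUTH, NORTH } Region;

/* Fill order[] with the four regions sorted from largest to smallest. */
void dynamic2_order(Box t, int screen_w, int screen_h, Region order[4])
{
    long area[4];
    area[WEST]  = (long)t.MinX * screen_h;
    area[EAST]  = (long)(screen_w - 1 - t.MaxX) * screen_h;
    area[SOUTH] = (long)t.MinY * screen_w;
    area[NORTH] = (long)(screen_h - 1 - t.MaxY) * screen_w;

    for (int i = 0; i < 4; i++)
        order[i] = (Region)i;

    /* Insertion sort on four elements, descending by area. */
    for (int i = 1; i < 4; i++) {
        Region r = order[i];
        int j = i;
        while (j > 0 && area[order[j - 1]] < area[r]) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = r;
    }
}

/* Apply the four comparisons in the given order. Returns 0 for NoOverlap
   and 1 for MightOverlap; *comparisons receives the number of tests used. */
int bbox_test_ordered(Box bbox, Box t, const Region order[4], int *comparisons)
{
    *comparisons = 0;
    for (int i = 0; i < 4; i++) {
        int outside = 0;
        (*comparisons)++;
        switch (order[i]) {
        case WEST:  outside = bbox.MaxX < t.MinX; break;
        case EAST:  outside = bbox.MinX > t.MaxX; break;
        case SOUTH: outside = bbox.MaxY < t.MinY; break;
        case NORTH: outside = bbox.MinY > t.MaxY; break;
        }
        if (outside)
            return 0;
    }
    return 1;
}
```

Because the order depends only on the tile, it can be computed once per tile (or tabulated off-line) and reused for every primitive tested against that tile.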
42 Bounding-Box Tests: Experimental Results
The average number of comparisons per primitive
for each workload
43 Tile-Based Rendering
- Part 3: State Management Details
44 SW Driver Block Diagram
Main Memory
Applications
Global Primitive List (GPL)
Mesa Core
Texture Images List (TIL)
Texture Objects List (TOL)
- Global Instr. Buffering
- Initial primitive sorting
- Small triangle filter
- Texture Preprocessing
Graal Device Driver
Per Tile Sorted Primitive List (Pointers to GPL)
- Tile Instr. Iterator
- Sends instructions to the Graal Accelerator in a
tile based order until all tiles are completed
GPP (ARM)
SoC Bus
Rasterizer instructions: LDTRI, DEPTHEN, SETDPTFCT, ..., SWPBUFF
Graal 3D Graphics Accelerator
45 State Information in Detail
- Unit enable/disable
- Enable blending
- Disable depth
- Unit functionality change
- Change blending mode
- Change depth test function
- Texturing state (much larger than the rest of the
state information)
46 Lazy State Update
- Initial stream
  - Bindtex 1
  - Endepth
  - Tri 1
  - Disdepth
  - Tri 2
  - Endepth
  - Bindtex 2
  - Tri 4
  - Bindtex 3
  - Tri 5
- Current tile stream
  - Bindtex 1
  - Endepth
  - Tri 1
  - Disdepth
  - Endepth
  - Bindtex 2
  - Bindtex 3
  - Tri 5
- Current tile stream using lazy update
  - Bindtex 1
  - Endepth
  - Tri 1
  - Bindtex 3
  - Tri 5
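The lazy update in the example above can be sketched in C as a filter over the global instruction stream: state changes are only remembered, and they are emitted just before a triangle of the current tile, and only if they differ from what the accelerator last received. This is a minimal sketch assuming just two pieces of state (bound texture and depth enable); the `Instr` encoding and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { BINDTEX, ENDEPTH, DISDEPTH, TRI } Op;
typedef struct { Op op; int arg; } Instr;  /* arg: texture id or triangle id */

static bool tri_in_tile(int tri_id, const int *tile_tris, size_t n_tile)
{
    for (size_t i = 0; i < n_tile; i++)
        if (tile_tris[i] == tri_id)
            return true;
    return false;
}

/* Build the stream for one tile. out must have room for n instructions.
   Returns the number of instructions written. */
size_t lazy_tile_stream(const Instr *in, size_t n,
                        const int *tile_tris, size_t n_tile,
                        Instr *out)
{
    int pending_tex = -1, sent_tex = -1;      /* -1: nothing bound yet  */
    int pending_depth = -1, sent_depth = -1;  /* -1 unknown, 0 off, 1 on */
    size_t m = 0;

    for (size_t i = 0; i < n; i++) {
        switch (in[i].op) {
        case BINDTEX:  pending_tex = in[i].arg; break;
        case ENDEPTH:  pending_depth = 1;       break;
        case DISDEPTH: pending_depth = 0;       break;
        case TRI:
            if (!tri_in_tile(in[i].arg, tile_tris, n_tile))
                break;                        /* triangle not in this tile */
            if (pending_tex != sent_tex) {
                out[m].op = BINDTEX;
                out[m].arg = pending_tex;
                m++;
                sent_tex = pending_tex;
            }
            if (pending_depth != sent_depth) {
                out[m].op = pending_depth ? ENDEPTH : DISDEPTH;
                out[m].arg = 0;
                m++;
                sent_depth = pending_depth;
            }
            out[m++] = in[i];
            break;
        }
    }
    return m;
}
```

Run on the initial stream of the slide with triangles 1 and 5 in the current tile, the filter produces the five-instruction stream shown under "using lazy update": the Disdepth/Endepth pair and Bindtex 2 never reach the accelerator.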
47 Texture State Handling (I)
48 Texture State Handling (II)
- Late commit of texture images
- We use a global list of texture images but a separate list of (bindable) texture objects for each context. The texture images are thus shared, which saves space and memory transfers.
- Deleting texture objects can be handled either by partial rendering or by postponing the deletion when possible.
49 State Management Algorithms
- What can we do whenever an instruction that has side effects (e.g., DeleteTexture) is encountered in the input stream?
- PARTIAL RENDERING (DIRECT)
- Render all previously buffered instructions.
- Execute the instruction.
- This algorithm might also introduce significant rendering overhead.
- DELAYED EXECUTION
- The driver postpones the execution of the current instruction until all the primitives depending on it are rendered or the end of the current frame is reached.
- Execute the instruction.
50 Example
Tiled instruction stream using delayed commit:
Start Frame
Tile 1: c1 = SaveCurrentContext, RestoreTileFromGlobalBuffer, CreateTexture(i), MakeCurrentTexture(i), Triangle(2), MarkDeleteTexture(i), RenameTexture(i,j), MakeCurrentTexture(j), Triangle(3), SaveTileToGlobalBuffer
Tile 2: RestoreContext(c1), RestoreTileFromGlobalBuffer, MakeCurrentTexture(i), Triangle(1), MakeCurrentTexture(j), Triangle(3), SaveTileToGlobalBuffer
After Last Tile: DeleteTexture(i), MoveTextureLinks(i,j)
End Frame
Tiled instruction stream using partial rendering:
Start Frame
Tile 1: c1 = SaveCurrentContext, RestoreTileFromGlobalBuffer, CreateTexture(i), MakeCurrentTexture(i), Triangle(2), SaveTileToGlobalBuffer, c2 = SaveCurrentContext
Tile 2: RestoreContext(c1), RestoreTileFromGlobalBuffer, MakeCurrentTexture(i), Triangle(1), SaveTileToGlobalBuffer
DeleteTexture(i)
Tile 1: c1 = SaveCurrentContext, RestoreTileFromGlobalBuffer, CreateTexture(i), MakeCurrentTexture(i), Triangle(3), SaveTileToGlobalBuffer
Tile 2: RestoreContext(c1), RestoreTileFromGlobalBuffer, MakeCurrentTexture(i), Triangle(3), SaveTileToGlobalBuffer
End Frame
Initial instruction stream:
Start Frame, CreateTexture(i), MakeCurrentTexture(i), Triangle(1), Triangle(2), DeleteTexture(i), CreateTexture(i), MakeCurrentTexture(i), Triangle(3), End Frame
Assumptions: triangle 1 overlaps tile 2, triangle 2 overlaps tile 1, and triangle 3 overlaps tiles 1 and 2.
51 State Management Experimental Results (I)
Percentage of state information and triangles sent to the accelerator per frame.
52 State Management Experimental Results (II)
Average number of state information writes to the accelerator per frame.
53 State Management Conclusions
- In traditional (non-tile-based) rendering, the state information traffic can be negligible compared to the traffic generated by the primitives. In tile-based rendering architectures, the state information might need to be duplicated in multiple streams, so the required processing power and the generated traffic can increase significantly.
- To remove a state change instruction from the instruction stream of a tile, information about the previous or the following state change instructions and/or primitives is required. Thus, in order to send an optimal state change stream to the accelerator, i.e., one that uses minimal bandwidth, additional processing power and more processor bandwidth are required.
- By sending an optimized state change stream to the accelerator, the state change traffic to the accelerator was decreased by up to 58%.
54 Questions?