Iosif Antochi

About This Presentation

Title:

Iosif Antochi

Description:

Optimizations and Trade-offs for Low-Power 3D Graphics Tile-Based Rendering Architectures ... Tux Racer (Tux) AWadvs-04 (AW). Part of Viewperf ... – PowerPoint PPT presentation

Slides: 55

Provided by: Stam57

Transcript and Presenter's Notes

1
Optimizations and Trade-offs for Low-Power 3D
Graphics Tile-Based Rendering Architectures

Iosif Antochi

Computer Engineering Laboratory Delft University
of Technology The Netherlands
2
Summary

Introduction
The GRAAL environment
Tile-based rendering
Memory Requirements
Scene Management
State Management

3
GRAAL Environment Overview
Applications (Benchmarks)
Tracer (Player)
SIMULATOR
HUMAN
INTERACTION
New Architecture
Simulator Front-end (Augmented Mesa)
TCP/IP or files
Performance Evaluation
Fast Back-end (Qt)
RTL Back-end (SystemC)
4
Introduction: Graphics Application Data Structures
World
Object 1
Object 2
Object n

Texture mapping
Position
Shape

Textures
Texture
5
A Rendered Scene
Scene from the Quake 3 FPS game developed by id
Software
6
The Structure of the Scene
7
3D Graphics Pipeline
8
Tile Based Rendering

Part 1 Memory Requirements

9
Overview

Motivation
Background
Traditional and tile-based rendering methods
Tile size vs. external data traffic
Tile-based vs. conventional rendering
Conclusions

10
Motivation

External memory accesses consume a lot of power
Tile-based rendering might be used to reduce
external memory traffic.

11
Traditional Rendering
12
Tile-Based Rendering
13
Rendering Models
Traditional rendering
Tile-Based rendering
14
Tile Size Vs. External Data Traffic
15
Triangle Size Histogram

First indication of required tile size

Very few triangles (7%) are larger than 1024
pixels

16
Number of Kilotriangles Transferred per Frame for
Various Tile Sizes

As expected, external traffic decreases as the
tile size increases
A tile size of 32 × 32 yields a good trade-off
between external traffic and on-chip memory

17
Tile-Based vs. Conventional Rendering
18
Data Traffic Front

The front data traffic increases for tile-based
rendering
State change increases more than geometry for
tile-based rendering

19
Data Traffic Back

Usually, there is no z data traffic for
tile-based rendering
The color data traffic decreases for tile-based
rendering

20
Total Data Traffic

Tile-based rendering reduces the total amount of
data traffic by a factor of 1.96

21
TBR Data Traffic - Conclusions

A tile size of 32 × 32 pixels yields a good
trade-off between the amount of on-chip memory
and the amount of external data traffic.
Tile-based rendering reduces the total amount of
external traffic by a factor of 1.96.
For workloads with a high overlap and low
overdraw, traditional rendering can outperform
tile-based rendering. For workloads with a low
overlap and high overdraw, tile-based rendering
performs better than traditional rendering.

22
Tile Based Rendering

Part 2 Scene Management Details

23
Scene Management Overview

Motivation
Background
The Two-Stage Model
Overlap Tests
Scene Management Algorithms
Results
Conclusions

24
Motivation

Tile-based rendering requires that primitives are
sorted into bins corresponding to tiles.

25
Background Scene Management for Tile-Based
Rendering
26
Two Stage Model

Tile-based rendering model (based on retained-mode execution)
procedure call_driver_for_instr(i)
  if (!buffer_is_full && IS_Bufferable(i))
    Buffer_Instr(i)
  else
    // we have to render all buffered instructions
    // since either we have a Swap_Buffers instruction
    // or we ran out of buffer space
    inictx = Save_Context()
    for (tile = 0; tile < maxtile; tile++)
      if (tile != 0) Set_Context(inictx)
      RestoreCurrentTile(tile)
      Render_All_Instr_For_Tile(tile)
      SwapTile(tile)

Stage 1: Buffering (and initial sorting)
Stage 2: (Second sorting and) primitive sending

27
BBOX Overlap Test
Determines if the bounding box of a triangle
overlaps with a tile
28
LET Overlap Test
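Consider a 2D vector defined by two points A
(X, Y) and B (X + dX, Y + dY), and a line L_AB
that passes through the two points. The edge
function for a certain point P (x, y) is defined
as
E_AB(x, y) = (x - X) * dY - (y - Y) * dX
Triangle T(A,B,C) overlaps tile T(xc,yc,l) if
[three overlap conditions, shown as formulas on the slide, not preserved in this transcript]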
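
The slide's exact overlap conditions are not available in this transcript, so the following is only a sketch of one way an edge-function-based overlap test can be written. It assumes the common Pineda-style edge function (x - X)·dY - (y - Y)·dX, and it assumes that the tile T(xc, yc, l) is a square with centre (xc, yc) and side length l. The names `struct point`, `edge`, and `let_overlap` are hypothetical. The test first applies the bounding-box check and then rejects the tile if all four of its corners lie on the outer side of any one triangle edge.

```c
#include <assert.h>

struct point { double x, y; };

/* Edge function of the directed line A->B evaluated at P:
 * (P.x - A.x) * dY - (P.y - A.y) * dX, with (dX, dY) = B - A.
 * Its sign tells on which side of the line P lies. */
static double edge(struct point a, struct point b, struct point p)
{
    return (p.x - a.x) * (b.y - a.y) - (p.y - a.y) * (b.x - a.x);
}

static double min3(double a, double b, double c)
{
    double m = a < b ? a : b;
    return m < c ? m : c;
}

static double max3(double a, double b, double c)
{
    double m = a > b ? a : b;
    return m > c ? m : c;
}

/* Returns 1 if triangle (a, b, c) overlaps the square tile with centre
 * (xc, yc) and side length l (assumed meaning of the tuple), else 0. */
int let_overlap(struct point a, struct point b, struct point c,
                double xc, double yc, double l)
{
    double h = l / 2.0;
    struct point corner[4];
    struct point v[3];
    int k, j;

    assert(l > 0.0);
    corner[0].x = xc - h; corner[0].y = yc - h;
    corner[1].x = xc + h; corner[1].y = yc - h;
    corner[2].x = xc + h; corner[2].y = yc + h;
    corner[3].x = xc - h; corner[3].y = yc + h;
    v[0] = a; v[1] = b; v[2] = c;

    /* Bounding-box test first: cheap rejection. */
    if (max3(a.x, b.x, c.x) < xc - h || min3(a.x, b.x, c.x) > xc + h ||
        max3(a.y, b.y, c.y) < yc - h || min3(a.y, b.y, c.y) > yc + h)
        return 0;

    /* Edge tests: the tile is rejected if all four corners lie strictly
     * on the side of an edge opposite to the third vertex. */
    for (k = 0; k < 3; k++) {
        struct point p = v[k], q = v[(k + 1) % 3], r = v[(k + 2) % 3];
        double inside = edge(p, q, r);   /* sign of the inner side */
        int outside = 0;

        if (inside == 0.0)
            continue;                    /* degenerate triangle: keep BBOX result */
        for (j = 0; j < 4; j++) {
            double e = edge(p, q, corner[j]);
            if ((inside > 0.0 && e < 0.0) || (inside < 0.0 && e > 0.0))
                outside++;
        }
        if (outside == 4)
            return 0;
    }
    return 1;
}
```

For the right triangle (0,0), (10,0), (0,10), a tile centred at (8, 8) with side 2 passes the bounding-box check but is rejected by the edge test on the hypotenuse, which is the kind of case where an edge-based test sends fewer primitives than BBOX alone.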
29
Algorithm DIRECT

For each tile scan the whole list of primitives
and send the primitives that (potentially)
overlap the tile to the rasterizer.
Pseudocode
  for each triangle Tr
    buffer Tr
  for each tile T
    for each triangle Tr
      compute bbox of Tr
      if bbox of Tr and T overlap
        send Tr

30
Algorithm TWO_STEP

Compute and store the bounding box of each
triangle during the buffering stage. This avoids
having to recompute the bounding box for each
triangle/tile tuple, but it requires more memory.
Pseudocode
  for each triangle Tr
    buffer Tr
    compute and store bbox of Tr
  for each tile T
    for each triangle Tr
      if bbox of Tr and T overlap
        send Tr
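
As a rough illustration of the second stage of TWO_STEP, the sketch below scans the stored bounding boxes once per tile. It assumes the buffering stage has already filled an array of boxes, that tiles are square and laid out row by row starting at pixel (0, 0), and that primitives are "sent" through a caller-supplied callback. `struct box`, `send_fn`, and `two_step_send` are hypothetical names, not taken from the presentation.

```c
#include <assert.h>
#include <stddef.h>

struct box { int MinX, MinY, MaxX, MaxY; };

/* Hypothetical callback that forwards triangle `tri` for tile `tile`. */
typedef void (*send_fn)(size_t tile, size_t tri, void *ctx);

/* Stage 2 of TWO_STEP: for every tile, scan all stored bounding boxes
 * and send the triangles whose box overlaps the tile.
 * Returns the total number of (tile, triangle) pairs sent. */
size_t two_step_send(const struct box *bbox, size_t ntri,
                     int tiles_x, int tiles_y, int tile_size,
                     send_fn send, void *ctx)
{
    size_t sent = 0;
    int tx, ty;
    size_t i;

    assert(tile_size > 0 && tiles_x > 0 && tiles_y > 0);
    for (ty = 0; ty < tiles_y; ty++) {
        for (tx = 0; tx < tiles_x; tx++) {
            struct box T;
            size_t tile = (size_t)ty * (size_t)tiles_x + (size_t)tx;

            T.MinX = tx * tile_size;
            T.MinY = ty * tile_size;
            T.MaxX = T.MinX + tile_size - 1;
            T.MaxY = T.MinY + tile_size - 1;
            for (i = 0; i < ntri; i++) {
                /* Stored bbox is reused; nothing is recomputed per tile. */
                if (bbox[i].MaxX < T.MinX || bbox[i].MinX > T.MaxX ||
                    bbox[i].MaxY < T.MinY || bbox[i].MinY > T.MaxY)
                    continue;
                if (send != NULL)
                    send(tile, i, ctx);
                sent++;
            }
        }
    }
    return sent;
}
```

The inner loop still visits every triangle for every tile, which is the cost that the SORT variants avoid at the price of per-tile pointer buffers.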

31
Algorithm TWO_STEP_L

In the second stage the LET overlap test is used
instead of BBOX. Since the LET test contains the
BBOX test, the main LET is applied only to
triangles that have passed the BBOX test.
Pseudocode
  for each triangle Tr
    buffer Tr
    compute and store bbox of Tr
  for each tile T
    for each triangle Tr
      if Tr and T overlap (using LET)
        send Tr

32
Algorithm SORT

For each tile a buffer with pointers to the
primitives that overlap the tile is created.
During the second step these buffers are scanned.
Pseudocode
  for each triangle Tr
    buffer Tr
    compute bbox of Tr
    for each tile T that overlaps bbox of Tr
      insert pointer to Tr in the buffer of T
  for each tile T
    for each triangle Tr in the buffer of T
      send Tr
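
The first step of SORT can be sketched as a binning pass. This is a minimal illustration, assuming square tiles laid out row by row from pixel (0, 0), triangle indices standing in for pointers, and a fixed per-tile capacity `cap` chosen by the caller. `struct box` and `sort_into_bins` are hypothetical names, and the flat `item` array is a simplification rather than the buffer layout used in the presentation.

```c
#include <assert.h>
#include <stddef.h>

struct box { int MinX, MinY, MaxX, MaxY; };

/* Step 1 of SORT: for each triangle, visit only the tiles covered by its
 * bounding box and append the triangle index to each tile's buffer.
 * count[t] is the number of entries for tile t; the entries themselves
 * are item[t * cap + 0 .. count[t] - 1].
 * Returns 0 on success, -1 if a tile buffer would overflow. */
int sort_into_bins(const struct box *bbox, size_t ntri,
                   int tiles_x, int tiles_y, int tile_size,
                   size_t cap, size_t *count, size_t *item)
{
    int w, h, tx, ty;
    size_t i, t, ntiles;

    assert(tile_size > 0 && tiles_x > 0 && tiles_y > 0 && cap > 0);
    w = tiles_x * tile_size;
    h = tiles_y * tile_size;
    ntiles = (size_t)tiles_x * (size_t)tiles_y;
    for (t = 0; t < ntiles; t++)
        count[t] = 0;

    for (i = 0; i < ntri; i++) {
        struct box b = bbox[i];
        int tx0, tx1, ty0, ty1;

        if (b.MaxX < 0 || b.MaxY < 0 || b.MinX >= w || b.MinY >= h)
            continue;                        /* entirely off screen */
        /* Clamp to the screen, then convert pixels to tile indices. */
        tx0 = (b.MinX < 0 ? 0 : b.MinX) / tile_size;
        ty0 = (b.MinY < 0 ? 0 : b.MinY) / tile_size;
        tx1 = (b.MaxX >= w ? w - 1 : b.MaxX) / tile_size;
        ty1 = (b.MaxY >= h ? h - 1 : b.MaxY) / tile_size;

        for (ty = ty0; ty <= ty1; ty++) {
            for (tx = tx0; tx <= tx1; tx++) {
                t = (size_t)ty * (size_t)tiles_x + (size_t)tx;
                if (count[t] == cap)
                    return -1;
                item[t * cap + count[t]] = i;
                count[t]++;
            }
        }
    }
    return 0;
}
```

The second step then only walks `item[t * cap ...]` for each tile `t`, so no overlap test is repeated per tile.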

33
Algorithm SORT_L

Identical to SORT except that the LET is used to
determine if a triangle and a tile overlap.
Pseudocode
  for each triangle Tr
    buffer Tr
    compute bbox of Tr
    for each tile T that overlaps bbox of Tr
      if LET indicates Tr and T overlap
        insert pointer to Tr in the buffer of T
  for each tile T
    for each triangle Tr in the buffer of T
      send Tr

34
Experimental Setup

OpenGL Benchmarks
Quake III (Q3) (low and high resolution)
Tux Racer (Tux)
AWadvs-04 (AW), part of Viewperf
VRML Scenes - Austrian National Library (ANL),
Graz 3D (GRA), Dino (DIN)
Estimations and Assumptions
Some parameters, such as the average number of
operations needed to compute the bbox or to
perform the BBOX or LET test, were determined by
running the benchmarks.
Other parameters, such as the average number of
operations needed to buffer a primitive or to
insert a primitive into a tile buffer, were
statistically estimated.

35
Estimated Running Time
Estimated Running Time Relative to DIRECT
36
Amount of Additional Memory Required
Kbytes
37
TBR Scene Management - Conclusions

Which algorithm is preferable depends on the
available additional memory and computational
power.
The DIRECT algorithm has poor performance. On
average the DIRECT algorithm is 44 times slower
than SORT.
The TWO_STEP algorithm is slower than SORT by a
factor of 6 while reducing the amount of
additional memory by a factor of 3.2.
SORT_L is slower than SORT by a factor of 1.6.

38
Efficient Bounding-Box Computation
Let a triangle Tr be defined by three points
A(x,y), B(x,y), C(x,y). The bounding box (BBOX) of
Tr is defined by the tuple (BBOX.MinX, BBOX.MinY,
BBOX.MaxX, BBOX.MaxY), where
  BBOX.MinX = MIN(A.x, B.x, C.x)
  BBOX.MinY = MIN(A.y, B.y, C.y)
  BBOX.MaxX = MAX(A.x, B.x, C.x)
  BBOX.MaxY = MAX(A.y, B.y, C.y)
For a tile T given by the tuple (T.MinX,
T.MinY, T.MaxX, T.MaxY), a possible
implementation of the BBOX test in C is
  if (BBOX.MaxX < T.MinX) /* Test 1 */ return NoOverlap;
  if (BBOX.MinX > T.MaxX) /* Test 2 */ return NoOverlap;
  if (BBOX.MaxY < T.MinY) /* Test 3 */ return NoOverlap;
  if (BBOX.MinY > T.MaxY) /* Test 4 */ return NoOverlap;
  return MightOverlap;
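
A compilable version of the bounding-box computation and the four-comparison test might look as follows. This is a sketch that keeps the field names and the NoOverlap / MightOverlap results from the slide; the integer coordinate type and the names `struct point`, `struct box`, `triangle_bbox`, and `bbox_test` are assumptions.

```c
#include <assert.h>

struct point { int x, y; };
struct box { int MinX, MinY, MaxX, MaxY; };
enum overlap { NoOverlap = 0, MightOverlap = 1 };

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Bounding box of triangle (A, B, C). */
struct box triangle_bbox(struct point A, struct point B, struct point C)
{
    struct box bb;

    bb.MinX = min3(A.x, B.x, C.x);
    bb.MinY = min3(A.y, B.y, C.y);
    bb.MaxX = max3(A.x, B.x, C.x);
    bb.MaxY = max3(A.y, B.y, C.y);
    return bb;
}

/* BBOX test: at most four comparisons, returning early on the first
 * comparison that proves the box lies entirely outside the tile. */
enum overlap bbox_test(struct box BBOX, struct box T)
{
    assert(T.MinX <= T.MaxX && T.MinY <= T.MaxY);
    if (BBOX.MaxX < T.MinX) return NoOverlap;   /* Test 1: box is west  */
    if (BBOX.MinX > T.MaxX) return NoOverlap;   /* Test 2: box is east  */
    if (BBOX.MaxY < T.MinY) return NoOverlap;   /* Test 3: box is south */
    if (BBOX.MinY > T.MaxY) return NoOverlap;   /* Test 4: box is north */
    return MightOverlap;
}
```

Because the function returns on the first successful rejection, the average number of comparisons depends on the order of the four tests.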
39
Bounding-Box Tests Order

The four comparisons required to determine if a
BBOX and a tile overlap can be performed in an
arbitrary order.
This gives a total of 24 possible arrangements.
However, not every order produces the same number
of comparisons on average.
A tile divides the screen into five, possibly
intersecting regions: the tile itself, the region
to the east of the tile (x > T.MaxX), the region
to the west of the tile (x < T.MinX), the region
to the north (y > T.MaxY), and the region to the
south (y < T.MinY).

40
Static Bounding-Box Tests
STATIC1: If a certain test (comparison) fails,
then there is a high probability that the test in
the opposite direction along the same dimension
succeeds. This is because after these two tests
there is only a small region left where the BBOX
of a primitive can be situated.
STATIC2: The first and second (and, hence, the
third and fourth) comparisons check different
dimensions. For example, one possible order is
west, south, east, north.
41
Dynamic Bounding-Box Tests
The probability that a primitive is completely
located in the largest region is the highest.
This observation is the basis of our dynamic
versions of the bounding box test.
DYNAMIC1: First checks the largest region.
Thereafter, the opposite direction along the same
dimension is tested. The third test examines the
largest region in the other dimension, and the
fourth test checks the remaining region.
DYNAMIC2: The comparison corresponding to the
largest region is applied first, then the
comparison corresponding to the second largest
region, etc. For example, if the region to the
east of the tile is the largest, it is checked
first, then the region to the south, then the one
to the north, and, finally, the region to the west.
We remark that although these schemes are called
dynamic, the order in which the comparisons are
applied depends only on the tile position and can
be determined statically off-line. For example,
for all tiles in the upper left sub-scene under
the main diagonal, the order is east, south,
north, west.
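
The DYNAMIC2 ordering can be precomputed per tile, as in the sketch below. It assumes that the "size" of a region is the pixel area of the full screen strip on that side of the tile, and that north means y > T.MaxY, following the region definitions given earlier. `dynamic2_order` and `bbox_test_ordered` are hypothetical names.

```c
#include <assert.h>

struct box { int MinX, MinY, MaxX, MaxY; };
enum region { EAST, WEST, NORTH, SOUTH };

/* Fills order[] with the four regions sorted from largest to smallest
 * area for tile T on a screen_w x screen_h screen (assumed size metric). */
void dynamic2_order(struct box T, int screen_w, int screen_h,
                    enum region order[4])
{
    long area[4];
    int i, j;

    assert(T.MinX >= 0 && T.MinY >= 0);
    assert(T.MaxX < screen_w && T.MaxY < screen_h);
    area[EAST]  = (long)(screen_w - 1 - T.MaxX) * screen_h;
    area[WEST]  = (long)T.MinX * screen_h;
    area[NORTH] = (long)(screen_h - 1 - T.MaxY) * screen_w;
    area[SOUTH] = (long)T.MinY * screen_w;

    order[0] = EAST; order[1] = WEST; order[2] = NORTH; order[3] = SOUTH;
    /* Insertion sort, descending by area; ties keep their initial order. */
    for (i = 1; i < 4; i++) {
        enum region key = order[i];
        for (j = i - 1; j >= 0 && area[order[j]] < area[key]; j--)
            order[j + 1] = order[j];
        order[j + 1] = key;
    }
}

/* Applies the four BBOX comparisons in the given order.
 * Returns 0 (no overlap) on the first rejection, 1 otherwise. */
int bbox_test_ordered(struct box BBOX, struct box T,
                      const enum region order[4])
{
    int k;

    for (k = 0; k < 4; k++) {
        switch (order[k]) {
        case EAST:  if (BBOX.MinX > T.MaxX) return 0; break;
        case WEST:  if (BBOX.MaxX < T.MinX) return 0; break;
        case NORTH: if (BBOX.MinY > T.MaxY) return 0; break;
        case SOUTH: if (BBOX.MaxY < T.MinY) return 0; break;
        }
    }
    return 1;
}
```

Since the order depends only on the tile position and the screen size, it can be computed once per tile (or tabulated off-line) rather than per primitive.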
42
Bounding-Box Tests Experimental Results
The average number of comparisons per primitive
for each workload
43
Tile Based Rendering

Part 3 State Management Details

44
SW Driver Block Diagram
Main Memory
Applications
Global Primitive List (GPL)
Mesa Core
Texture Images List (TIL)
Texture Objects List (TOL)

Global Instr. Buffering
Initial primitive sorting
Small triangle filter
Texture Preprocessing

Graal Device Driver
Per Tile Sorted Primitive List (Pointers to GPL)

Tile Instr. Iterator
Sends instructions to the Graal Accelerator in a
tile based order until all tiles are completed

GPP (ARM)
SoC Bus
Rasterizer instructions: LDTRI, DEPTHEN,
SETDPTFCT, ..., SWPBUFF
Graal 3D Graphics Accelerator
45
State Information In Detail

Unit enable/disable
Enable blending
Disable depth
Unit functionality change
Change blending mode
Change depth test function
Texturing state (much larger than the rest of the
state information)

46
Lazy State Update

Initial stream
Bindtex 1
Endepth
Tri 1
Disdepth
Tri 2
Endepth
Bindtex 2
Tri 4
Bindtex 3
Tri 5

Current tile stream
Bindtex 1
Endepth
Tri 1
Disdepth
Endepth
Bindtex 2
Bindtex 3
Tri 5

Current tile stream using lazy update
Bindtex 1
Endepth
Tri 1
Bindtex 3
Tri 5
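
One way to obtain the lazy stream is to remember the latest requested value of each state unit and to send it only when a triangle of the current tile actually needs it. The sketch below models just two units (bound texture and depth enable) and marks each triangle with a flag saying whether it overlaps the current tile. `struct instr`, `enum op`, and `lazy_tile_stream` are hypothetical names, and the two-unit model is a simplification of the real state set.

```c
#include <assert.h>
#include <stddef.h>

/* OP_BINDTEX and OP_DEPTH double as indices into the state arrays. */
enum op { OP_BINDTEX = 0, OP_DEPTH = 1, OP_TRI = 2 };

struct instr {
    enum op op;
    int arg;      /* texture id, depth flag (1 = enable), or triangle id */
    int in_tile;  /* triangles only: non-zero if it overlaps the tile    */
};

#define STATE_UNKNOWN (-1)   /* state args are assumed to be >= 0 */

/* Writes the lazily updated stream for the current tile to out[] and
 * returns its length. out[] needs room for n entries: every emitted
 * state change corresponds to at least one state change in the input. */
size_t lazy_tile_stream(const struct instr *in, size_t n, struct instr *out)
{
    int pending[2] = { STATE_UNKNOWN, STATE_UNKNOWN }; /* last requested */
    int sent[2]    = { STATE_UNKNOWN, STATE_UNKNOWN }; /* last sent      */
    size_t count = 0;
    size_t i;
    int u;

    assert(n == 0 || (in != NULL && out != NULL));
    for (i = 0; i < n; i++) {
        if (in[i].op != OP_TRI) {
            pending[in[i].op] = in[i].arg;   /* remember, do not send yet */
            continue;
        }
        if (!in[i].in_tile)
            continue;                        /* triangle of another tile */
        /* Flush only the units whose value differs from what was sent. */
        for (u = 0; u < 2; u++) {
            if (pending[u] != sent[u]) {
                out[count].op = (enum op)u;
                out[count].arg = pending[u];
                out[count].in_tile = 0;
                count++;
                sent[u] = pending[u];
            }
        }
        out[count++] = in[i];
    }
    return count;
}
```

Applied to the initial stream of this slide with only Tri 1 and Tri 5 marked as overlapping the tile, the function emits Bindtex 1, Endepth, Tri 1, Bindtex 3, Tri 5.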

47
Texture State Handling (I)
48
Texture State Handling (II)

Late commit of texture images
We use a global list of texture images, but a
separate list of (bindable) texture objects for
each context. Sharing the texture images saves
space and memory transfers.
Deleting texture objects can be handled either by
partial rendering or by postponing the deletion
when possible.

49
State Management Algorithms

What can we do whenever an instruction that has
side effects (e.g., DeleteTexture) is encountered
in the input stream?
PARTIAL RENDERING (DIRECT)
Render all previously buffered instructions.
Execute the instruction.
This algorithm might introduce significant
rendering overhead.

DELAYED EXECUTION
The driver postpones the execution of the
current instruction until all the primitives
depending on the current instruction are rendered
or the end of the current frame is reached.
Execute the instruction.

50
Example
Assumptions: triangle 1 overlaps tile 2,
triangle 2 overlaps tile 1, and triangle 3
overlaps tiles 1 and 2.

Initial instruction stream
Start Frame
  CreateTexture(i)
  MakeCurrentTexture(i)
  Triangle(1)
  Triangle(2)
  DeleteTexture(i)
  CreateTexture(i)
  MakeCurrentTexture(i)
  Triangle(3)
End Frame

Tiled instruction stream using partial rendering
Start Frame
Tile 1
  c1 = SaveCurrentContext
  RestoreTileFromGlobalBuffer
  CreateTexture(i)
  MakeCurrentTexture(i)
  Triangle(2)
  SaveTileToGlobalBuffer
  c2 = SaveCurrentContext
Tile 2
  RestoreContext(c1)
  RestoreTileFromGlobalBuffer
  MakeCurrentTexture(i)
  Triangle(1)
  SaveTileToGlobalBuffer
  DeleteTexture(i)
Tile 1
  c1 = SaveCurrentContext
  RestoreTileFromGlobalBuffer
  CreateTexture(i)
  MakeCurrentTexture(i)
  Triangle(3)
  SaveTileToGlobalBuffer
Tile 2
  RestoreContext(c1)
  RestoreTileFromGlobalBuffer
  MakeCurrentTexture(i)
  Triangle(3)
  SaveTileToGlobalBuffer
End Frame

Tiled instruction stream using delayed commit
Start Frame
Tile 1
  c1 = SaveCurrentContext
  RestoreTileFromGlobalBuffer
  CreateTexture(i)
  MakeCurrentTexture(i)
  Triangle(2)
  MarkDeleteTexture(i)
  RenameTexture(i,j)
  MakeCurrentTexture(j)
  Triangle(3)
  SaveTileToGlobalBuffer
Tile 2
  RestoreContext(c1)
  RestoreTileFromGlobalBuffer
  MakeCurrentTexture(i)
  Triangle(1)
  MakeCurrentTexture(j)
  Triangle(3)
  SaveTileToGlobalBuffer
After Last Tile
  DeleteTexture(i)
  MoveTextureLinks(i,j)
End Frame
51
State Management Experimental Results (I)
Percentage of state information and triangles
sent to the accelerator per frame.
52
State Management Experimental Results (II)
Average number of state information writes to the
accelerator per frame.
53
State Management Conclusions

In traditional (non-tile-based) rendering, the
state information traffic can be negligible
compared to the traffic generated by the
primitives. In tile-based rendering
architectures, the state information may need to
be duplicated in multiple streams, so the
required processing power and generated traffic
can increase significantly.
To remove a state change instruction from the
instruction stream of a tile, information about
the previous or the following state change
instructions and/or primitives is required. Thus,
in order to send an optimal state change stream
to the accelerator, i.e., one that uses minimal
bandwidth, additional processing power and more
processor bandwidth are required.
By sending an optimized state change stream to
the accelerator, the state change traffic to the
accelerator was decreased by up to 58%.