Fast Computation of Database Operations using Graphics Processors - PowerPoint PPT Presentation

1 / 90

About This Presentation

Title:

Fast Computation of Database Operations using Graphics Processors

Description:

Fast Computation of Database Operations using Graphics Processors. Naga K. Govindaraju ... Have multiple vertex and pixel processing engines running parallel ... – PowerPoint PPT presentation

Number of Views:59

Avg rating:3.0/5.0

Slides: 91

Provided by: Naga6

Category:

more less

Transcript and Presenter's Notes

Title: Fast Computation of Database Operations using Graphics Processors

1
Fast Computation of Database Operations using
Graphics Processors

Naga K. Govindaraju
Univ. of North Carolina
Modified By,
Mahendra Chavan for CS632

2
Goal

Utilize graphics processors for fast computation
of common database operations

3
Motivation Fast operations

Increasing database sizes
Faster processor speeds but low improvement in
query execution time
Memory stalls
Branch mispredictions
Resource stalls Eg. Instruction dependency
Utilize the available architectural features and
exploit parallel execution possibilities

4
Graphics Processors

Present in almost every PC
Have multiple vertex and pixel processing engines
running parallel
Can process tens of millions of geometric
primitives per second
Peak Perf. Of GPU is increasing at the rate of
2.5-3 times a year!
Programmable- fragment programs executed on
pixel processing engines

5
Main Contributions

Algorithms for predicates, boolean combinations
and aggregations
Utilize SIMD capabilities of pixel processing
engines
They have used these algorithms for selection
queries on one or more attributes and aggregate
queries

6
Related Work

Hardware Acceleration for DB operations
Vector processors for relational DB operations
Meki and Kambayashi 2000
SIMD instructions for relational DB operations
Zhou and Ross 2002
GPUs for spatial selections and joins Sun et al.
2003

7
Graphics Processors Design Issues

Programming model is limited due to lack of
random access writes
Design algorithms avoiding data rearrangements
Programmable pipeline has poor branching
Design algorithms without branching in
programmable pipeline - evaluate branches using
fixed function tests

8
Frame Buffer

Pixels stored on graphics card in a frame buffer.
Frame buffer conceptually divided into
Color Buffer
Stores color component of each pixel in the frame
buffer
Depth Buffer
Stores depth value associated with each pixel.
The depth is used to determine surface visibility
Stencil Buffer
Stores stencil value for each pixel . Called
Stencil because, it is typically used for
enabling/disabling writes to frame buffer

9
Graphics Pipeline
Vertices
Vertex Processing Engine
Alpha Test
Stencil Test
Depth Test
10
Graphics Pipeline

Vertex Processing Engine
Transforms vertices to points on screen
Setup Engine
Generates Info. For color, depth etc. associated
with primitive vertices
Pixel processing Engines
Fragment processors, performs a series of tests
before writing the fragments to frame buffer

11
Pixel processing Engines

Alpha Test
Compares fragments alpha value to user-specified
reference value
Stencil Test
Compares fragments pixels stencil value to
user-specified reference value
Depth Test
Compares depth value of the fragment to the
reference depth value.

12
Operators

lt
gt
lt
gt
Never
Always

13
Occlusion Query
Fragment Programs

Users can supply custom fragment programs on each
fragment

Gives no. of fragments that pass different no. of
tests

14
Radeon R770 GPU by AMD Graphics Product Group
15
Data Representation on GPUs

Textures 2 D arrays- may have multiple channels
We store data in textures in floating point
formats
To perform computations on the values, render the
quadrilateral, generate fragments, run fragment
programs and perform tests!

16
Stencil Tests

Fragments failing Stencil test are rejected from
the rasterization pipeline
Stencil Operations
KEEP keep the stencil value in the stencil
buffer
INCR stencil value
DECR stencil value
ZERO stencil value 0
REPLACE stencil value reference value
INVERT bitwise invert (stencil value)

17
Stencil and Depth Tests

We can setup the stencilOP routine as below
For each fragment , three possible outcomes,
based on the outcome, corresponding stencil op.
is executed
Op1 when a fragment fails stencil test
Op2 when a fragment passes stencil test but
fails depth test
Op3 when a fragment passes stencil and depth
test

18
Outline

Database Operations on GPUs
Implementation Results
Analysis
Conclusions

19
Outline

Database Operations on GPUs
Implementation Results
Analysis
Conclusions

20
Overview

Database operations require comparisons
Utilize depth test functionality of GPUs for
performing comparisons
Implements all possible comparisons lt, lt, gt, gt,
, !, ALWAYS, NEVER
Utilize stencil test for data validation and
storing results of comparison operations

21
Basic Operations

Basic SQL query
Select A
From T
Where C
A attributes or aggregations (SUM, COUNT, MAX
etc)
Trelational table
C Boolean Combination of Predicates (using
operators AND, OR, NOT)

22
Outline Database Operations

Predicate Evaluation
(a op constant) depth test and stencil test
(a op b) (a-b op 0 ) can be executed on
GPUs
Boolean Combinations of Predicates
Express as CNF and repetitively use stencil tests
Aggregations
Occlusion queries

23
Outline Database Operations

Predicate Evaluation
Boolean Combinations of Predicates
Aggregations

24
Basic Operations

Predicates ai op constant or ai op aj
Op is one of lt,gt,lt,gt,!, , TRUE, FALSE
Boolean combinations Conjunctive Normal Form
(CNF) expression evaluation
Aggregations COUNT, SUM, MAX, MEDIAN, AVG

25
Predicate Evaluation

ai op constant (d)
Copy the attribute values ai into depth buffer
Define the comparison operation using depth test
Draw a screen filling quad at depth d

26
ai op d
If ( ai op d ) pass fragment Else reject
fragment
Screen
d
27
Predicate Evaluation

ai op aj
Treat as (ai aj) op 0
Semi-linear queries
Defined as linear combination of attribute values
compared against a constant
Linear combination is computed as a dot product
of two vectors
Utilize the vector processing capabilities of GPUs

28
Data Validation

Performed using stencil test
Valid stencil values are set to a given value s
Data values that fail predicate evaluation are
set to zero

29
Outline Database Operations

Predicate Evaluation
Boolean Combinations of Predicates
Aggregations

30
Boolean Combinations

Expression provided as a CNF
CNF is of form (A1 AND A2 AND AND Ak)
where Ai (Bi1 OR Bi2 OR OR Bimi )
CNF does not have NOT operator
If CNF has a NOT operator, invert comparison
operation to eliminate NOT
Eg. NOT (ai lt d) gt (ai gt d)

31
Boolean Combination

We will focus on (A1 AND A2)
All cases are considered
A1 (TRUE AND A1)
If Ei (A1 AND A2 AND AND Ai-1 AND Ai),
Ei (Ei-1 AND Ai)

Clear stencil value to 1
For each Ai , i1,.,k
do
if (mod(I,2)) / Valid stencil value is 1 /
Stencil test to pass if stencil value is equal to
1
StencilOp (KEEP,KEPP, INCR)
Else
Stencil test to pass if stencil value is equal to
2
StencilOp (KEEP,KEPP, DECR)
Endif
For each Bij, j1,..,mi
Do
Perform Bij using COMPARE / depth test /
End for
If (mod(I,2)) / valid stencil value is 2 /
If stencil value on screen is 1 , REPLACE with 0
Else / valid stencil value is 1 /

33
A1 AND A2
A1
B23
B22
B21
34
A1 AND A2
35
A1 AND A2
A1
36
A1 AND A2
Stencil value 0
A1
Stencil value 2
37
A1 AND A2
St 0
A1
B22
St0
St2
B23
St1
St1
B21
St1
38
A1 AND A2
Stencil 0
A1
B22
St 0
B23
St1
St1
B21
St1
39
A1 AND A2
St 0
St 1A1 AND B22
St1 A1 AND B23
St1 A1 AND B21
40
Range Query

Compute ai within low, high
Evaluated as ( ai gt low ) AND ( ai lt high )

41
Outline Database Operations

Predicate Evaluation
Boolean Combinations of Predicates
Aggregations

42
Aggregations

COUNT, MAX, MIN, SUM, AVG
No data rearrangements

43
COUNT

Use occlusion queries to get pixel pass count
Syntax
Begin occlusion query
Perform database operation
End occlusion query
Get count of number of attributes that passed
database operation
Involves no additional overhead!

44
MAX, MIN, MEDIAN

We compute Kth-largest number
Traditional algorithms require data
rearrangements
We perform no data rearrangements, no frame
buffer readbacks

45
K-th Largest Number

Say vk is the k-th largest number
How do we generate a number m equal to vk?
Without knowing vks bit-representation and using
comparisons

46
Our algorithm

b_max max. no. of bits in the values in tex
x0
For i b_max-1 down to 0
Count Compare (text gt x 2i)
If Count gt k-1
xx2i
Return x

47
K-th Largest Number

Lemma Let vk be the k-th largest number. Let
count be the number of values gt m
If count gt (k-1) mlt vk
If count lt (k-1) mgtvk
Apply the earlier algorithm ensuring that count
gt(k-1)

48
Example

Vk 11101001
M 00000000

49
Example

Vk 11101001
M 10000000
M lt Vk

50
Example

Vk 11101001
M 11000000
M lt Vk

51
Example

Vk 11101001
M 11100000
M lt Vk

52
Example

Vk 11101001
M 11110000
M gt Vk
Make the bit 0
M 11100000

53
Example

Vk 11101001
M 11101000
M lt Vk

54
Example

Vk 11101001
M 11101100
M gt Vk
Make this bit 0
M 11101000

55
Example

Vk 11101001
M 11101010
M gt Vk
M 11101000

56
Example

Vk 11101001
M 11101001
M lt Vk

57
Example

Integers ranging from 0 to 255
Represent them in depth buffer
Idea Use depth functions to perform comparisons
Use NV_occlusion_query to determine maximum

58
Example Parallel Max

S10,24,37,99,192,200,200,232
Step 1 Draw Quad at 128
S 10,24,37,99,192,200,200,232
Step 2 Draw Quad at 192
S 10,24,37,192,200,200,232
Step 3 Draw Quad at 224
S 10,24,37,192,200,200,232
Step 4 Draw Quad at 240 No values pass
Step 5 Draw Quad at 232
S 10,24,37,192,200,200,232
Step 6,7,8 Draw Quads at 236,234,233 No values
pass
Max is 232

59
SUM and AVG

Mipmaps multi resolution textures consisting of
multiple levels
Highest level contains average of all values at
lowest level
SUM AVG COUNT
Problems with mipmaps
If we want sum of a subset of values then we have
to introduce conditions in the fragment programs
Floating point representations may have problems

60
Accumulator

Data representation is of form
ak 2k ak-1 2k-1 a0
Sum sum(ak) 2k sum(ak-1) 2k-1sum(a0)
Current GPUs support no bit-masking operations
AVG SUM/COUNT

61
TestBit

Read the data value from texture, say ai
F frac(ai/2k1)
If Fgt0.5, then k-th bit of ai is 1
Set F to alpha value. Alpha test passes a
fragment if alpha valuegt0.5

62
Outline

Database Operations on GPUs
Implementation Results
Analysis
Conclusions

63
Implementation

Dell Precision Workstation with Dual 2.8GHz Xeon
Processor
NVIDIA GeForce FX 5900 Ultra GPU
2GB RAM

64
Implementation

CPU Intel compiler 7.1 with hyperthreading,
multi-threading, SIMD optimizations
GPU NVIDIA Cg Compiler

65
Benchmarks

TCP/IP database with 1 million records and four
attributes
Census database with 360K records

66
Copy Time
67
Predicate Evaluation (3 times faster)
68
Range Query(5.5 times faster)
69
Multi-Attribute Query (2 times)
70
Semi-linear Query (9 times faster)
71
COUNT

Same timings for GPU implementation

72
Kth-Largest for median(2.5 times)
73
Kth-Largest
74
Kth-Largest conditional
75
Accumulator(20 times slower!)
76
Outline

Database Operations on GPUs
Implementation Results
Analysis
Conclusions

77
Analysis Issues

Precision
Currently depth buffer has only 24 bit precision
, inadequate
Copy time
Copy from texture to depth buffer no mechanism
in GPU
Integer arithmetic
Not enough arithmetic inst. In pixel processing
engines
Depth compare masking
Useful to have comparison mask for depth function

78
Analysis Issues

Memory management
Current GPUS have 512 MB video memory, we may use
the out-ofcore techniques and swap
No random writes
No data re-arrangements possible

79
Analysis Performance

Relative Performance Gain
High Performance Predicate evaluation,
multi-attribute queries, semi-linear queries,
count
Medium Performance Kth-largest number
Low Performance - Accumulator

80
High Performance

Parallel pixel processing engines
Pipelining
Multi-attribute queries get advantage
Early Depth culling
Before passing through the pixel processing
engine
Eliminate branch mispredictions

81
Medium Performance

Parallelism
FX 5900 has clock speed 450MHz, 8 pixel
processing engines
Rendering single 1000x1000 quad takes 0.278ms
Rendering 19 such quads take 5.28ms. Observed
time is 6.6ms
80 efficiency in parallelism!!

82
Low Performance

No gain over SIMD based CPU implementation
Two main reasons
Lack of integer-arithmetic
Clock rate

83
Outline

Database Operations on GPUs
Implementation Results
Analysis
Conclusions

84
Conclusions

Novel algorithms to perform database operations
on GPUs
Evaluation of predicates, boolean combinations of
predicates, aggregations
Algorithms take into account GPU limitations
No data rearrangements
No frame buffer readbacks

85
Conclusions

Preliminary comparisons with optimized CPU
implementations is promising
Discussed possible improvements on GPUs
GPU as a useful co-processor

86
Relational Joins

Modern GPUs have thread groups
Each thread group have several threads
Data Parallel primitives
Map
Scatter scatters the Data of a relation with
respect to an array L
Gather reverse of scatter
Split Divides the relation into a number of
disjoint partitions with a given partitioning
function

87
NINLJ
R
Thread Group 1
Thread Group i
Thread Group Bp
Thread Group j
S
88
INLJ

Used Cache Optimized Search Trees (CSS trees) for
index structure
Inner relation as the CSS tree
Multiple keys are searched in parallel on the tree

89
Sort Merge join

Merge step is done in parallel
3 steps
Divide relation S into Q chunks Q S / M
Find the corresponding matching chunks from R by
using the start and end of each chunk of S
Merge each pair of S and R chunk in parallel. 1
thread group per pair.

90
Hash join

Partitioning
Use the Split primitive to partition both the
relations
Matching
Read the inner relation in memory relation
Each tuple from the outer relation uses
sequential/binary search on the inner relation
For binary search, the inner relation will be
sorted using bitonic sort.

Write a Comment

User Comments (0)