Title: Fast Computation of Database Operations using Graphics Processors
1Fast Computation of Database Operations using
Graphics Processors
- Naga K. Govindaraju
- Univ. of North Carolina
- Modified By,
- Mahendra Chavan for CS632
2Goal
- Utilize graphics processors for fast computation
of common database operations
3Motivation Fast operations
- Increasing database sizes
- Faster processor speeds but low improvement in
query execution time - Memory stalls
- Branch mispredictions
- Resource stalls Eg. Instruction dependency
- Utilize the available architectural features and
exploit parallel execution possibilities
4Graphics Processors
- Present in almost every PC
- Have multiple vertex and pixel processing engines
running parallel - Can process tens of millions of geometric
primitives per second - Peak Perf. Of GPU is increasing at the rate of
2.5-3 times a year! - Programmable- fragment programs executed on
pixel processing engines
5Main Contributions
- Algorithms for predicates, boolean combinations
and aggregations - Utilize SIMD capabilities of pixel processing
engines - They have used these algorithms for selection
queries on one or more attributes and aggregate
queries
6Related Work
- Hardware Acceleration for DB operations
- Vector processors for relational DB operations
Meki and Kambayashi 2000 - SIMD instructions for relational DB operations
Zhou and Ross 2002 - GPUs for spatial selections and joins Sun et al.
2003
7Graphics Processors Design Issues
- Programming model is limited due to lack of
random access writes - Design algorithms avoiding data rearrangements
- Programmable pipeline has poor branching
- Design algorithms without branching in
programmable pipeline - evaluate branches using
fixed function tests
8Frame Buffer
- Pixels stored on graphics card in a frame buffer.
- Frame buffer conceptually divided into
- Color Buffer
- Stores color component of each pixel in the frame
buffer - Depth Buffer
- Stores depth value associated with each pixel.
The depth is used to determine surface visibility - Stencil Buffer
- Stores stencil value for each pixel . Called
Stencil because, it is typically used for
enabling/disabling writes to frame buffer
9Graphics Pipeline
Vertices
Vertex Processing Engine
Alpha Test
Stencil Test
Depth Test
10Graphics Pipeline
- Vertex Processing Engine
- Transforms vertices to points on screen
- Setup Engine
- Generates Info. For color, depth etc. associated
with primitive vertices - Pixel processing Engines
- Fragment processors, performs a series of tests
before writing the fragments to frame buffer
11Pixel processing Engines
- Alpha Test
- Compares fragments alpha value to user-specified
reference value - Stencil Test
- Compares fragments pixels stencil value to
user-specified reference value - Depth Test
- Compares depth value of the fragment to the
reference depth value.
12Operators
13 Occlusion Query
Fragment Programs
- Users can supply custom fragment programs on each
fragment
- Gives no. of fragments that pass different no. of
tests
14Radeon R770 GPU by AMD Graphics Product Group
15Data Representation on GPUs
- Textures 2 D arrays- may have multiple channels
- We store data in textures in floating point
formats - To perform computations on the values, render the
quadrilateral, generate fragments, run fragment
programs and perform tests!
16Stencil Tests
- Fragments failing Stencil test are rejected from
the rasterization pipeline - Stencil Operations
- KEEP keep the stencil value in the stencil
buffer - INCR stencil value
- DECR stencil value
- ZERO stencil value 0
- REPLACE stencil value reference value
- INVERT bitwise invert (stencil value)
17Stencil and Depth Tests
- We can setup the stencilOP routine as below
- For each fragment , three possible outcomes,
based on the outcome, corresponding stencil op.
is executed - Op1 when a fragment fails stencil test
- Op2 when a fragment passes stencil test but
fails depth test - Op3 when a fragment passes stencil and depth
test
18Outline
- Database Operations on GPUs
- Implementation Results
- Analysis
- Conclusions
19Outline
- Database Operations on GPUs
- Implementation Results
- Analysis
- Conclusions
20Overview
- Database operations require comparisons
- Utilize depth test functionality of GPUs for
performing comparisons - Implements all possible comparisons lt, lt, gt, gt,
, !, ALWAYS, NEVER - Utilize stencil test for data validation and
storing results of comparison operations
21Basic Operations
- Basic SQL query
- Select A
- From T
- Where C
- A attributes or aggregations (SUM, COUNT, MAX
etc) - Trelational table
- C Boolean Combination of Predicates (using
operators AND, OR, NOT)
22Outline Database Operations
- Predicate Evaluation
- (a op constant) depth test and stencil test
- (a op b) (a-b op 0 ) can be executed on
GPUs - Boolean Combinations of Predicates
- Express as CNF and repetitively use stencil tests
- Aggregations
- Occlusion queries
23Outline Database Operations
- Predicate Evaluation
- Boolean Combinations of Predicates
- Aggregations
24Basic Operations
- Predicates ai op constant or ai op aj
- Op is one of lt,gt,lt,gt,!, , TRUE, FALSE
- Boolean combinations Conjunctive Normal Form
(CNF) expression evaluation - Aggregations COUNT, SUM, MAX, MEDIAN, AVG
25Predicate Evaluation
- ai op constant (d)
- Copy the attribute values ai into depth buffer
- Define the comparison operation using depth test
- Draw a screen filling quad at depth d
26ai op d
If ( ai op d ) pass fragment Else reject
fragment
Screen
d
27Predicate Evaluation
- ai op aj
- Treat as (ai aj) op 0
- Semi-linear queries
- Defined as linear combination of attribute values
compared against a constant - Linear combination is computed as a dot product
of two vectors - Utilize the vector processing capabilities of GPUs
28Data Validation
- Performed using stencil test
- Valid stencil values are set to a given value s
- Data values that fail predicate evaluation are
set to zero
29Outline Database Operations
- Predicate Evaluation
- Boolean Combinations of Predicates
- Aggregations
30Boolean Combinations
- Expression provided as a CNF
- CNF is of form (A1 AND A2 AND AND Ak)
- where Ai (Bi1 OR Bi2 OR OR Bimi )
- CNF does not have NOT operator
- If CNF has a NOT operator, invert comparison
operation to eliminate NOT - Eg. NOT (ai lt d) gt (ai gt d)
31Boolean Combination
- We will focus on (A1 AND A2)
- All cases are considered
- A1 (TRUE AND A1)
- If Ei (A1 AND A2 AND AND Ai-1 AND Ai),
- Ei (Ei-1 AND Ai)
32- Clear stencil value to 1
- For each Ai , i1,.,k
- do
- if (mod(I,2)) / Valid stencil value is 1 /
- Stencil test to pass if stencil value is equal to
1 - StencilOp (KEEP,KEPP, INCR)
- Else
- Stencil test to pass if stencil value is equal to
2 - StencilOp (KEEP,KEPP, DECR)
- Endif
- For each Bij, j1,..,mi
- Do
- Perform Bij using COMPARE / depth test /
- End for
- If (mod(I,2)) / valid stencil value is 2 /
- If stencil value on screen is 1 , REPLACE with 0
- Else / valid stencil value is 1 /
33A1 AND A2
A1
B23
B22
B21
34A1 AND A2
35A1 AND A2
A1
36A1 AND A2
Stencil value 0
A1
Stencil value 2
37A1 AND A2
St 0
A1
B22
St0
St2
B23
St1
St1
B21
St1
38A1 AND A2
Stencil 0
A1
B22
St 0
B23
St1
St1
B21
St1
39A1 AND A2
St 0
St 1A1 AND B22
St1 A1 AND B23
St1 A1 AND B21
40Range Query
- Compute ai within low, high
- Evaluated as ( ai gt low ) AND ( ai lt high )
41Outline Database Operations
- Predicate Evaluation
- Boolean Combinations of Predicates
- Aggregations
42Aggregations
- COUNT, MAX, MIN, SUM, AVG
- No data rearrangements
43COUNT
- Use occlusion queries to get pixel pass count
- Syntax
- Begin occlusion query
- Perform database operation
- End occlusion query
- Get count of number of attributes that passed
database operation - Involves no additional overhead!
44MAX, MIN, MEDIAN
- We compute Kth-largest number
- Traditional algorithms require data
rearrangements - We perform no data rearrangements, no frame
buffer readbacks
45K-th Largest Number
- Say vk is the k-th largest number
- How do we generate a number m equal to vk?
- Without knowing vks bit-representation and using
comparisons
46Our algorithm
- b_max max. no. of bits in the values in tex
- x0
- For i b_max-1 down to 0
- Count Compare (text gt x 2i)
- If Count gt k-1
- xx2i
- Return x
47K-th Largest Number
- Lemma Let vk be the k-th largest number. Let
count be the number of values gt m - If count gt (k-1) mlt vk
- If count lt (k-1) mgtvk
- Apply the earlier algorithm ensuring that count
gt(k-1)
48Example
49Example
- Vk 11101001
- M 10000000
- M lt Vk
50Example
- Vk 11101001
- M 11000000
- M lt Vk
51Example
- Vk 11101001
- M 11100000
- M lt Vk
52Example
- Vk 11101001
- M 11110000
- M gt Vk
- Make the bit 0
- M 11100000
53Example
- Vk 11101001
- M 11101000
- M lt Vk
54Example
- Vk 11101001
- M 11101100
- M gt Vk
- Make this bit 0
- M 11101000
55Example
- Vk 11101001
- M 11101010
- M gt Vk
- M 11101000
56Example
- Vk 11101001
- M 11101001
- M lt Vk
57Example
- Integers ranging from 0 to 255
- Represent them in depth buffer
- Idea Use depth functions to perform comparisons
- Use NV_occlusion_query to determine maximum
58Example Parallel Max
- S10,24,37,99,192,200,200,232
- Step 1 Draw Quad at 128
- S 10,24,37,99,192,200,200,232
- Step 2 Draw Quad at 192
- S 10,24,37,192,200,200,232
- Step 3 Draw Quad at 224
- S 10,24,37,192,200,200,232
- Step 4 Draw Quad at 240 No values pass
- Step 5 Draw Quad at 232
- S 10,24,37,192,200,200,232
- Step 6,7,8 Draw Quads at 236,234,233 No values
pass - Max is 232
59SUM and AVG
- Mipmaps multi resolution textures consisting of
multiple levels - Highest level contains average of all values at
lowest level - SUM AVG COUNT
- Problems with mipmaps
- If we want sum of a subset of values then we have
to introduce conditions in the fragment programs - Floating point representations may have problems
60Accumulator
- Data representation is of form
- ak 2k ak-1 2k-1 a0
- Sum sum(ak) 2k sum(ak-1) 2k-1sum(a0)
- Current GPUs support no bit-masking operations
- AVG SUM/COUNT
61TestBit
- Read the data value from texture, say ai
- F frac(ai/2k1)
- If Fgt0.5, then k-th bit of ai is 1
- Set F to alpha value. Alpha test passes a
fragment if alpha valuegt0.5
62Outline
- Database Operations on GPUs
- Implementation Results
- Analysis
- Conclusions
63Implementation
- Dell Precision Workstation with Dual 2.8GHz Xeon
Processor - NVIDIA GeForce FX 5900 Ultra GPU
- 2GB RAM
64Implementation
- CPU Intel compiler 7.1 with hyperthreading,
multi-threading, SIMD optimizations - GPU NVIDIA Cg Compiler
65Benchmarks
- TCP/IP database with 1 million records and four
attributes - Census database with 360K records
66Copy Time
67Predicate Evaluation (3 times faster)
68Range Query(5.5 times faster)
69Multi-Attribute Query (2 times)
70Semi-linear Query (9 times faster)
71COUNT
- Same timings for GPU implementation
72Kth-Largest for median(2.5 times)
73Kth-Largest
74Kth-Largest conditional
75Accumulator(20 times slower!)
76Outline
- Database Operations on GPUs
- Implementation Results
- Analysis
- Conclusions
77Analysis Issues
- Precision
- Currently depth buffer has only 24 bit precision
, inadequate - Copy time
- Copy from texture to depth buffer no mechanism
in GPU - Integer arithmetic
- Not enough arithmetic inst. In pixel processing
engines - Depth compare masking
- Useful to have comparison mask for depth function
78Analysis Issues
- Memory management
- Current GPUS have 512 MB video memory, we may use
the out-ofcore techniques and swap - No random writes
- No data re-arrangements possible
79Analysis Performance
- Relative Performance Gain
- High Performance Predicate evaluation,
multi-attribute queries, semi-linear queries,
count - Medium Performance Kth-largest number
- Low Performance - Accumulator
80High Performance
- Parallel pixel processing engines
- Pipelining
- Multi-attribute queries get advantage
- Early Depth culling
- Before passing through the pixel processing
engine - Eliminate branch mispredictions
81Medium Performance
- Parallelism
- FX 5900 has clock speed 450MHz, 8 pixel
processing engines - Rendering single 1000x1000 quad takes 0.278ms
- Rendering 19 such quads take 5.28ms. Observed
time is 6.6ms - 80 efficiency in parallelism!!
82Low Performance
- No gain over SIMD based CPU implementation
- Two main reasons
- Lack of integer-arithmetic
- Clock rate
83Outline
- Database Operations on GPUs
- Implementation Results
- Analysis
- Conclusions
84Conclusions
- Novel algorithms to perform database operations
on GPUs - Evaluation of predicates, boolean combinations of
predicates, aggregations - Algorithms take into account GPU limitations
- No data rearrangements
- No frame buffer readbacks
85Conclusions
- Preliminary comparisons with optimized CPU
implementations is promising - Discussed possible improvements on GPUs
- GPU as a useful co-processor
86Relational Joins
- Modern GPUs have thread groups
- Each thread group have several threads
- Data Parallel primitives
- Map
- Scatter scatters the Data of a relation with
respect to an array L - Gather reverse of scatter
- Split Divides the relation into a number of
disjoint partitions with a given partitioning
function
87NINLJ
R
Thread Group 1
Thread Group i
Thread Group Bp
Thread Group j
S
88INLJ
- Used Cache Optimized Search Trees (CSS trees) for
index structure - Inner relation as the CSS tree
- Multiple keys are searched in parallel on the tree
89Sort Merge join
- Merge step is done in parallel
- 3 steps
- Divide relation S into Q chunks Q S / M
- Find the corresponding matching chunks from R by
using the start and end of each chunk of S - Merge each pair of S and R chunk in parallel. 1
thread group per pair.
90Hash join
- Partitioning
- Use the Split primitive to partition both the
relations - Matching
- Read the inner relation in memory relation
- Each tuple from the outer relation uses
sequential/binary search on the inner relation - For binary search, the inner relation will be
sorted using bitonic sort.