In-Memory Grid Files on Graphics Processors - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: In-Memory Grid Files on Graphics Processors


1
In-Memory Grid Files on Graphics Processors
  • Ke Yang, Bingsheng He, Rui Fang, Mian Lu, Naga K.
    Govindaraju, Qiong Luo, Pedro Sander, Jiaoying
    Shi†
  • HKUST: {keyang, saven, rayfang, mianlu, luo,
    psander}@cse.ust.hk
  • Microsoft Corporation, USA: nagag@microsoft.com
  • †Zhejiang University: jyshi@cad.zju.edu.cn

2
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid files
  • Results
  • Conclusions

3
Multidimensional Data
  • Multidimensional data
  • Points, line segments, polygons, volumes in 2D,
    3D or higher
  • A d-D point in space is defined by its d
    coordinates, one along each axis
  • Applications
  • Geosciences, mechanical CAD, robotics, visual
    perception and autonomous navigation,
    environmental protection, and medical imaging
    (Gunther and Buchmann 1990)
  • Queries
  • Exact match query: look up a record by its
    coordinates
  • Range query: retrieve all records inside a box

4
Multidimensional access methods
  • No total ordering that preserves spatial
    proximity
  • Multidimensional access methods required
  • Hashing-based (grid files, linear hashing, etc.)
  • Tree-structured (K-D-B tree, R-tree, etc.)
  • Space-filling curves
  • Challenges, given the status quo of CPUs
  • Complex structures
  • Expensive computation
  • Intensive memory access

New hardware?
5
The Graphics Processing Unit (GPU)
  • Dedicated graphics rendering device
  • Ubiquitous commodity hardware
  • Parallel machine with massive SIMD processors
  • High memory bandwidth
  • Programmable API
  • Much larger aggregate FLOPS than the CPU

[Figure: speed comparison of GPU (aggregate) and CPU.
Source: NVIDIA CUDA Programming Guide]
6
Example: GeForce 8800GTX
  • 16 multiprocessors, each supporting up to 768
    concurrent threads, and containing
  • 8 processors, each at 1.35 GHz
  • 8192 registers
  • Shared memory
  • Constant cache
  • Texture cache
  • Observed performance: 330 GFLOPS
  • Device memory: 768 MB at 86 GB/s

7
Overview of Our Work
  • A GPU-based grid file
  • For static, memory-resident multidimensional
    point data
  • A hierarchical grid file variant
  • To handle data skew
  • Implementation on the GPU
  • 2x-8x faster than the CPU in query tests

8
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid files
  • Results
  • Conclusions

9
Database Processing on GPUs
  • General-purpose computing using GPUs (GPGPU)
  • Existing work maps computation onto the 3D
    graphics pipeline
  • Drawing geometries
  • OpenGL / DirectX programs

10
Programming on New Generation GPUs
  • CUDA: NVIDIA's Compute Unified Device
    Architecture
  • G80 series
  • API: an extension of C
  • Generalizes GPU resources
  • Hides graphics concepts
  • GPU as SIMD multiprocessors
  • Kernel programs
  • Workflow (a sketch follows the steps below)
  • Workflow

1. Data copy-in
2. Multithread processing
3. Results copy-back
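A minimal sketch of this workflow in CUDA follows. The kernel
body is a placeholder, not the paper's grid-file code; all
names are illustrative.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Placeholder kernel: one lightweight thread per element.
__global__ void process(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;  // stand-in for real per-thread work
}

int main() {
    const int n = 1 << 20;
    int *h = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) h[i] = i;

    int *d;
    cudaMalloc(&d, n * sizeof(int));

    // 1. Data copy-in: host memory -> device memory
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // 2. Multithread processing: one thread per element
    process<<<(n + 255) / 256, 256>>>(d, n);

    // 3. Results copy-back: device memory -> host memory
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h[0] = %d\n", h[0]);  // expect 1
    cudaFree(d);
    free(h);
    return 0;
}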
11
Grid Files
  • Hashing-based multidimensional access method
  • Orthogonal grid
  • Cells
  • Splitting planes
  • Scales
  • Directory
  • Dynamic insertion/deletion
  • Split / merge
  • Distribute the buckets evenly
  • Splitting: superlinear directory growth
  • Merging: deadlocks

[Figure: locating an example point P via the scales]
12
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid files
  • Results
  • Conclusions

13
Adapting the Grid File Structure
  • Adapt the traditional grid file to a static,
    memory-resident one
  • Built with CPU-GPU cooperation
  • Resulting structure
  • Scales
  • Directory entries: bucket offsets
  • Rearranged R: the buckets
  • Copy the structures into GPU memory (GRAM)
  • Store the scales in constant memory (see the
    structure sketch below)
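As a rough illustration, the resulting structure might be laid
out as below. D, CELLS_PER_DIM, and the field names are
assumptions for the sketch, not the paper's actual code.

#define D 2               // number of dimensions (assumed)
#define CELLS_PER_DIM 64  // cells per dimension (assumed)

// Scales: one splitting sequence per dimension, placed in
// constant memory so threads read them via the constant cache.
__constant__ unsigned int scales[D][CELLS_PER_DIM];

typedef struct {
    unsigned int id;
    unsigned int key[D];  // a d-dimensional point
} Record;

typedef struct {
    int    *dir;  // directory: dir[c] = offset of cell c's bucket;
                  // CELLS_PER_DIM^D + 1 entries, so dir[c+1] ends c
    Record *R;    // relation rearranged so that each bucket is one
                  // contiguous slice of R
} GridFile;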

14
Query Processing
  • A large number of queries Q processed in
    parallel
  • Coalesced reads
  • Exact match queries
  • Identify the cell containing the query record
  • Sequentially scan the bucket of that cell
  • Range queries
  • Scan all buckets overlapping the query box
  • Cells on the box boundary additionally require a
    point-level test (see the kernel sketch below)
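A sketch of an exact match kernel over the structure sketched
earlier, one thread per query. The locate() helper and the -1
"not found" convention are assumptions for illustration.

// Map one coordinate to its interval index along one axis;
// scales[d][c] is treated as the upper bound of interval c.
__device__ int locate(unsigned int v, const unsigned int *scale) {
    int c = 0;  // linear scan; a binary search also works
    while (c < CELLS_PER_DIM - 1 && v >= scale[c]) c++;
    return c;
}

__global__ void exact_match(const Record *q, int nq,
                            const int *dir, const Record *R, int *out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nq) return;
    Record query = q[t];  // coalesced read of the queries

    // Identify the cell containing the query record.
    int cell = 0;
    for (int d = 0; d < D; d++)
        cell = cell * CELLS_PER_DIM + locate(query.key[d], scales[d]);

    // Sequential scan in the bucket of that cell.
    out[t] = -1;  // -1 means not found
    for (int i = dir[cell]; i < dir[cell + 1]; i++) {
        bool hit = true;
        for (int d = 0; d < D; d++)
            if (R[i].key[d] != query.key[d]) hit = false;
        if (hit) { out[t] = (int)R[i].id; break; }
    }
}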

15
Conflict-free Writing of Results
  • Three-step scheme (a code sketch follows below)
  • Count
  • Prefix sum
  • Write

[Figure: threads 1-4 writing their results into disjoint
slices of the results array]
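A sketch of the three steps with a stand-in predicate. The
paper predates Thrust, so read the scan call as a placeholder
for any parallel prefix sum.

#include <thrust/scan.h>
#include <thrust/device_ptr.h>

__device__ bool matches(int v) { return (v & 1) == 0; }  // stand-in

// Step 1: each thread only counts its results.
__global__ void count_pass(const int *in, int n, int *counts) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) counts[t] = matches(in[t]) ? 1 : 0;
}

// Step 3: each thread writes at its private offset, so no two
// threads ever touch the same output location.
__global__ void write_pass(const int *in, int n,
                           const int *offsets, int *out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n && matches(in[t])) out[offsets[t]] = in[t];
}

// Step 2, driven from the host: an exclusive prefix sum over
// the counts turns them into per-thread write offsets.
void make_offsets(int *counts, int *offsets, int n) {
    thrust::device_ptr<int> c(counts), o(offsets);
    thrust::exclusive_scan(c, c + n, o);
}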
16
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid Files
  • Results
  • Conclusions

17
Hierarchical Grid Files
  • Buckets in the above structure may not be
    balanced
  • Querying a crowded bucket is more expensive:
    load imbalance
  • The hierarchical grid file recursively divides
    crowded cells
  • The info of each newborn sub-grid is appended to
    the directory

18
Query Processing
  • Each thread recursively decodes the offset
  • The recursion is rewritten as a while loop
  • Flow control causes threads to diverge
  • Only a small number of branches (< 5 levels)
  • Store the 1st-level scales in constant memory
    (see the decoding sketch below)
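A sketch of the while-loop decoding; the entry encoding and the
locate_cell() helper are assumptions, not the paper's pseudocode.
Here a directory entry with its top bit set points at a sub-grid
appended to the directory; anything else is a bucket offset.

#define SUBGRID_FLAG 0x80000000u  // assumed tag for sub-grid entries

// Placeholder: map a query point to a cell index within the grid
// (or sub-grid) whose directory starts at offset 'base'; the real
// code derives this from that level's scales.
__device__ unsigned int locate_cell(const unsigned int *key,
                                    unsigned int base) {
    return 0;  // placeholder body
}

__device__ unsigned int find_bucket(const unsigned int *dir,
                                    const unsigned int *key) {
    unsigned int entry = dir[locate_cell(key, 0)];
    while (entry & SUBGRID_FLAG) {                  // threads diverge,
        unsigned int base = entry & ~SUBGRID_FLAG;  // but < 5 levels deep
        entry = dir[base + locate_cell(key, base)];
    }
    return entry;  // offset of the bucket in the rearranged R
}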

19
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid Files
  • Results
  • Conclusions

20
Experimental setup
  • Record structure: uint id; uint key[d]
  • Hardware configuration
  • CPU: Intel P4 Dual-Core, 1 GB DRAM
  • GPU: GeForce 8800GTX, 768 MB GRAM
  • Exact match queries
  • Uniform data, skewed data (synthetic /
    real-world)
  • Range queries
  • Uniform data
  • In each test: time cost on CPU vs. GPU

21
  • Uniform data, varying number of dimensions
  • GPU 8x to 2x faster; the speedup decreases with
    dimensionality
  • The overhead of locating cells is proportional
    to the number of scales
  • Uniform data, varying number of tuples
  • GPU 6x to 8x faster; the speedup increases with
    size
  • Storing the scales in constant memory is 40%
    faster than not doing so

22
  • Synthetic skew, varying standard deviation
  • Both CPU and GPU suffer when the skew is severe
  • Less skewed data needs fewer levels
  • The hierarchical grid file on the GPU is >5x
    faster

23
  • Sphere dataset, varying number of points
  • GPU faster than CPU
  • The hierarchy gives no speedup on either CPU or
    GPU (only 1 level)
  • Dragon dataset, varying number of points
  • Max level is 3
  • CPU-H is 1.3x-1.5x faster than without the
    hierarchy
  • GPU-H is 2.3x-4.5x faster than without the
    hierarchy
  • The GPU benefits more from load balancing than
    the CPU does
  • Best: the hierarchical grid file on the GPU

24
  • Range query, uniform data, varying selectivity
  • Both CPU and GPU times increase linearly
  • GPU 4x-6x faster

25
Discussion
  • What's special about the GPU?
  • A parallel device with massive numbers (>1M) of
    lightweight threads
  • Matches query-intensive workloads
  • Simple data structures (as opposed to linked
    lists)
  • Matches array-based GRAM access
  • A single query is simple; the hierarchy further
    improves load balance
  • Matches SIMD processing
  • The GPU is potentially preferable to a
    multi-core CPU with a powerful instruction set
    but a small number of heavyweight threads

26
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid files
  • Results
  • Conclusions

27
Conclusions
  • In-memory grid files on the GPU
  • The GPU is well-suited for this acceleration
  • The hierarchical grid file handles skewed data
  • Future work
  • Dynamic insertion / deletion
  • Spatial access methods (R-trees) on GPUs
  • Queries on multi-core CPUs

28
Acknowledgements
  • Anonymous reviewers for their insightful comments
    and suggestions
  • People at the NVIDIA CUDA Forum, especially Mark
    Harris, for their help with the G80
    implementation issues
  • Dr. Lidan Shou of Zhejiang University for his
    lectures on multidimensional access methods

29
Thanks!
30
Backup
31
Details of building
  • Build a grid file from a given data set R
  • For each dimension, sort R along that dimension
    and sample quantiles as the scale
  • Partition the data space so as to balance the
    bucket size of each cell as much as possible
  • Build a histogram of the number of records in
    each bucket
  • For each record, use the scales to identify the
    bucket it belongs to
  • A prefix sum of the histogram gives the bucket
    offsets in the rearranged R
  • Scatter records into their buckets at the given
    offsets (see the sketch below)
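A serial host-side sketch of the histogram / prefix-sum /
scatter steps, reusing the Record type from the structure
sketch earlier. cell_of() stands in for the scale lookup, and
the real build uses CPU-GPU cooperation, not this serial loop.

#include <stdlib.h>
#include <string.h>

// Placeholder: identify the bucket (cell) a record belongs to
// via the scales; see the locate() sketch earlier.
static int cell_of(const Record *r) { return 0; }

void build(const Record *in, int n, int num_cells,
           int *dir /* num_cells + 1 entries */, Record *R_out) {
    // Histogram: number of records in each bucket.
    memset(dir, 0, (num_cells + 1) * sizeof(int));
    for (int i = 0; i < n; i++)
        dir[cell_of(&in[i]) + 1]++;

    // Prefix sum of the histogram: bucket offsets in the
    // rearranged R (dir[c] = start of bucket c, dir[num_cells] = n).
    for (int c = 0; c < num_cells; c++)
        dir[c + 1] += dir[c];

    // Scatter records into their buckets at the given offsets.
    int *cursor = (int *)calloc(num_cells, sizeof(int));
    for (int i = 0; i < n; i++) {
        int c = cell_of(&in[i]);
        R_out[dir[c] + cursor[c]++] = in[i];
    }
    free(cursor);
}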

32
Building Time
  • 16 million 2D records
  • Pure CPU-based building: 12 sec
  • 8 sec of that for sorting
  • GPU bitonic sort (Govindaraju et al., SIGMOD'05)
  • → reduces sorting to 3 sec

33
Range query details
  • Given the two end points, L and H, of the major
    diagonal of the query box
  • Obtain the two corresponding end cells, CL and CH
  • The coordinates of CL and CH bound all the cells
    in the desired range
  • For the points in boundary cells that have at
    least one coordinate equal to that of an end
    cell, the thread further performs a point-level
    test to check whether the points are really in
    the query box (see the sketch below)
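The point-level test itself is a per-axis containment check; a
minimal sketch, reusing the Record type from the earlier
structure sketch:

// True iff record r lies inside the query box [L, H] on every
// axis; points in strictly interior cells can skip this test.
__device__ bool in_box(const Record *r,
                       const unsigned int *L, const unsigned int *H) {
    for (int d = 0; d < D; d++)
        if (r->key[d] < L[d] || r->key[d] > H[d])
            return false;  // outside the box along axis d
    return true;
}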

34
Differences from MLGF / Buddy Tree
  • 1.
  • Both MLGF and the Buddy Tree cover only those
    cells that contain data points, and maintain a
    directory entry for each non-empty cell
  • Our hierarchical grid covers the entire data
    space and locates cells through shared scales
  • It is relatively simpler and more suitable for
    bulk loading in a parallel computing environment
  • 2.
  • As dynamic maintenance techniques, the two
    existing methods split an overflowed bucket into
    two at each level; thus the structures contain a
    relatively large number of levels in the tree or
    in the grid
  • Our hierarchical grid is a static structure, and
    the number of levels of sub-grids in a crowded
    cell is relatively small

35
Pseudocodes
36
Structure of a Hierarchical Grid File
37
Experimental setup
  • Relation: defaults to 16M tuples
  • Queries: the point query defaults to 1M query
    key values; the range query, to 100 query boxes