In-Memory Grid Files on Graphics Processors - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: In-Memory Grid Files on Graphics Processors


1
In-Memory Grid Files on Graphics Processors
  • Ke Yang, Bingsheng He, Rui Fang, Mian Lu, Naga K.
    Govindaraju, Qiong Luo, Pedro Sander, Jiaoying
    Shi†
  • HKUST: {keyang, saven, rayfang, mianlu, luo,
    psander}@cse.ust.hk
  • Microsoft Corporation, USA: nagag@microsoft.com
  • †Zhejiang University: jyshi@cad.zju.edu.cn

2
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid files
  • Results
  • Conclusions

3
Multidimensional Data
  • Multidimensional data
  • Points, line segments, polygons, volumes in 2D,
    3D or higher
  • A d-D point in space is defined by its d
    coordinates, one along each axis
  • Applications
  • Geosciences, mechanical CAD, robotics, visual
    perception and autonomous navigation,
    environmental protection, and medical imaging
    (Gunther and Buchmann 1990)
  • Queries
  • Exact match query: look up a record by its
    coordinates
  • Range query: retrieve all records inside a box

4
Multidimensional access methods
  • No total ordering that preserves spatial
    proximity
  • Multidimensional access methods required
  • Hashing-based (grid files, linear hashing, etc.)
  • Tree-structured (K-D-B tree, R-tree, etc.)
  • Space-filling curves
  • Challenges, given the status quo of CPUs
  • Complex structures
  • Expensive computation
  • Intensive memory access

New hardware?
5
The Graphics Processing Unit (GPU)
  • Dedicated graphics rendering device
  • Ubiquitous commodity hardware
  • Parallel machine with massive SIMD processors
  • High memory bandwidth
  • Programmable API
  • Much larger aggregate FLOPS than the CPU

[Figure: speed comparison of GPU (aggregate) and CPU.
Source: NVIDIA CUDA Programming Guide]
6
Example: GeForce 8800GTX
  • 16 multiprocessors, each supporting up to 768
    concurrent threads, and containing
  • 8 processors, each at 1.35 GHz
  • 8192 registers
  • Shared memory
  • Constant cache
  • Texture cache
  • Observed performance: 330 GFLOPS
  • Device memory: 768 MB at 86 GB/s

7
Overview of Our Work
  • A GPU-based grid file
  • For static, memory-resident multidimensional
    point data
  • A hierarchical grid file variant
  • To handle data skew
  • Implementation on the GPU
  • 2x-8x faster than the CPU in query tests

8
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid files
  • Results
  • Conclusions

9
Database Processing on GPUs
  • General-purpose computing using GPUs (GPGPU)
  • Existing work maps computation onto the 3D
    graphics pipeline
  • Drawing geometries
  • OpenGL / DirectX programs

10
Programming on New Generation GPUs
  • CUDA: NVIDIA's Compute Unified Device
    Architecture
  • G80 series
  • API: an extension of C
  • Generalizes GPU resources
  • Hides graphics concepts
  • GPU as SIMD multiprocessors
  • Kernel programs
  • Workflow (a sketch follows the steps below)
  • Workflow

1. Data copy-in
2. Multithread processing
3. Results copy-back
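A minimal sketch of this workflow in CUDA follows. The kernel
body is a placeholder, not the paper's grid-file code; all
names are illustrative.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Placeholder kernel: one lightweight thread per element.
__global__ void process(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;  // stand-in for real per-thread work
}

int main() {
    const int n = 1 << 20;
    int *h = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) h[i] = i;

    int *d;
    cudaMalloc(&d, n * sizeof(int));

    // 1. Data copy-in: host memory -> device memory
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // 2. Multithread processing: one thread per element
    process<<<(n + 255) / 256, 256>>>(d, n);

    // 3. Results copy-back: device memory -> host memory
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h[0] = %d\n", h[0]);  // expect 1
    cudaFree(d);
    free(h);
    return 0;
}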
11
Grid Files
  • Hashing-based multidimensional access method
  • Orthogonal grid
  • Cells
  • Splitting planes
  • Scales
  • Directory
  • Dynamic insertion/deletion
  • Split / merge
  • Distribute the buckets evenly
  • Splitting: superlinear directory growth
  • Merging: deadlocks

[Figure: locating an example point P via the scales]
12
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid files
  • Results
  • Conclusions

13
Adapting the Grid File Structure
  • Adapt the traditional grid file to a static,
    memory-resident one
  • Built with CPU-GPU cooperation
  • Resulting structure
  • Scales
  • Directory entries: bucket offsets
  • Rearranged R: the buckets
  • Copy the structures into GPU memory (GRAM)
  • Store the scales in constant memory (see the
    structure sketch below)
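As a rough illustration, the resulting structure might be laid
out as below. D, CELLS_PER_DIM, and the field names are
assumptions for the sketch, not the paper's actual code.

#define D 2               // number of dimensions (assumed)
#define CELLS_PER_DIM 64  // cells per dimension (assumed)

// Scales: one splitting sequence per dimension, placed in
// constant memory so threads read them via the constant cache.
__constant__ unsigned int scales[D][CELLS_PER_DIM];

typedef struct {
    unsigned int id;
    unsigned int key[D];  // a d-dimensional point
} Record;

typedef struct {
    int    *dir;  // directory: dir[c] = offset of cell c's bucket;
                  // CELLS_PER_DIM^D + 1 entries, so dir[c+1] ends c
    Record *R;    // relation rearranged so that each bucket is one
                  // contiguous slice of R
} GridFile;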

14
Query Processing
  • A large number of queries Q processed in
    parallel
  • Coalesced reads
  • Exact match queries
  • Identify the cell containing the query record
  • Sequentially scan the bucket of that cell
  • Range queries
  • Scan all buckets overlapping the query box
  • Cells on the box boundary additionally require a
    point-level test (see the kernel sketch below)
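A sketch of an exact match kernel over the structure sketched
earlier, one thread per query. The locate() helper and the -1
"not found" convention are assumptions for illustration.

// Map one coordinate to its interval index along one axis;
// scales[d][c] is treated as the upper bound of interval c.
__device__ int locate(unsigned int v, const unsigned int *scale) {
    int c = 0;  // linear scan; a binary search also works
    while (c < CELLS_PER_DIM - 1 && v >= scale[c]) c++;
    return c;
}

__global__ void exact_match(const Record *q, int nq,
                            const int *dir, const Record *R, int *out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nq) return;
    Record query = q[t];  // coalesced read of the queries

    // Identify the cell containing the query record.
    int cell = 0;
    for (int d = 0; d < D; d++)
        cell = cell * CELLS_PER_DIM + locate(query.key[d], scales[d]);

    // Sequential scan in the bucket of that cell.
    out[t] = -1;  // -1 means not found
    for (int i = dir[cell]; i < dir[cell + 1]; i++) {
        bool hit = true;
        for (int d = 0; d < D; d++)
            if (R[i].key[d] != query.key[d]) hit = false;
        if (hit) { out[t] = (int)R[i].id; break; }
    }
}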

15
Conflict-free Writing of Results
  • Three-step scheme (a code sketch follows below)
  • Count
  • Prefix sum
  • Write

[Figure: threads 1-4 writing their results into disjoint
slices of the results array]
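A sketch of the three steps with a stand-in predicate. The
paper predates Thrust, so read the scan call as a placeholder
for any parallel prefix sum.

#include <thrust/scan.h>
#include <thrust/device_ptr.h>

__device__ bool matches(int v) { return (v & 1) == 0; }  // stand-in

// Step 1: each thread only counts its results.
__global__ void count_pass(const int *in, int n, int *counts) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) counts[t] = matches(in[t]) ? 1 : 0;
}

// Step 3: each thread writes at its private offset, so no two
// threads ever touch the same output location.
__global__ void write_pass(const int *in, int n,
                           const int *offsets, int *out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n && matches(in[t])) out[offsets[t]] = in[t];
}

// Step 2, driven from the host: an exclusive prefix sum over
// the counts turns them into per-thread write offsets.
void make_offsets(int *counts, int *offsets, int n) {
    thrust::device_ptr<int> c(counts), o(offsets);
    thrust::exclusive_scan(c, c + n, o);
}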
16
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid Files
  • Results
  • Conclusions

17
Hierarchical Grid Files
  • Buckets in the above structure may not be
    balanced
  • Querying a crowded bucket is more expensive:
    load imbalance
  • The hierarchical grid file recursively divides
    crowded cells
  • The info of each newborn sub-grid is appended to
    the directory

18
Query Processing
  • Each thread recursively decodes the offset
  • The recursion is rewritten as a while loop
  • Flow control causes threads to diverge
  • Only a small number of branches (< 5 levels)
  • Store the 1st-level scales in constant memory
    (see the decoding sketch below)
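A sketch of the while-loop decoding; the entry encoding and the
locate_cell() helper are assumptions, not the paper's pseudocode.
Here a directory entry with its top bit set points at a sub-grid
appended to the directory; anything else is a bucket offset.

#define SUBGRID_FLAG 0x80000000u  // assumed tag for sub-grid entries

// Placeholder: map a query point to a cell index within the grid
// (or sub-grid) whose directory starts at offset 'base'; the real
// code derives this from that level's scales.
__device__ unsigned int locate_cell(const unsigned int *key,
                                    unsigned int base) {
    return 0;  // placeholder body
}

__device__ unsigned int find_bucket(const unsigned int *dir,
                                    const unsigned int *key) {
    unsigned int entry = dir[locate_cell(key, 0)];
    while (entry & SUBGRID_FLAG) {                  // threads diverge,
        unsigned int base = entry & ~SUBGRID_FLAG;  // but < 5 levels deep
        entry = dir[base + locate_cell(key, base)];
    }
    return entry;  // offset of the bucket in the rearranged R
}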

19
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid Files
  • Results
  • Conclusions

20
Experimental setup
  • Record structure: uint id; uint key[d]
  • Hardware configuration
  • CPU: Intel P4 Dual-Core, 1 GB DRAM
  • GPU: GeForce 8800GTX, 768 MB GRAM
  • Exact match queries
  • Uniform data, skewed data (synthetic /
    real-world)
  • Range queries
  • Uniform data
  • In each test: time cost on CPU vs. GPU

21
  • Uniform data, varying number of dimensions
  • GPU 8x to 2x faster; the speedup decreases with
    dimensionality
  • The overhead of locating cells is proportional
    to the number of scales
  • Uniform data, varying number of tuples
  • GPU 6x to 8x faster; the speedup increases with
    size
  • Storing the scales in constant memory is 40%
    faster than not doing so

22
  • Synthetic skew, varying standard deviation
  • Both CPU and GPU suffer when the skew is severe
  • Less skewed data needs fewer levels
  • The hierarchical grid file on the GPU is >5x
    faster

23
  • Sphere dataset, varying number of points
  • GPU faster than CPU
  • The hierarchy gives no speedup on either CPU or
    GPU (only 1 level)
  • Dragon dataset, varying number of points
  • Max level is 3
  • CPU-H is 1.3x-1.5x faster than without the
    hierarchy
  • GPU-H is 2.3x-4.5x faster than without the
    hierarchy
  • The GPU benefits more from load balancing than
    the CPU does
  • Best: the hierarchical grid file on the GPU

24
  • Range query, uniform data, varying selectivity
  • Both CPU and GPU times increase linearly
  • GPU 4x-6x faster

25
Discussion
  • What's special about the GPU?
  • A parallel device with massive numbers (>1M) of
    lightweight threads
  • Matches query-intensive workloads
  • Simple data structures (as opposed to linked
    lists)
  • Matches array-based GRAM access
  • A single query is simple; the hierarchy further
    improves load balance
  • Matches SIMD processing
  • The GPU is potentially preferable to a
    multi-core CPU with a powerful instruction set
    but a small number of heavyweight threads

26
Outline
  • Introduction
  • Background
  • Grid Files on GPU
  • Hierarchical Grid files
  • Results
  • Conclusions

27
Conclusions
  • In-memory grid files on the GPU
  • The GPU is well-suited for this acceleration
  • The hierarchical grid file handles skewed data
  • Future work
  • Dynamic insertion / deletion
  • Spatial access methods (R-trees) on GPUs
  • Queries on multi-core CPUs

28
Acknowledgements
  • Anonymous reviewers for their insightful comments
    and suggestions
  • People at the NVIDIA CUDA Forum, especially Mark
    Harris, for their help with the G80
    implementation issues
  • Dr. Lidan Shou of Zhejiang University for his
    lectures on multidimensional access methods

29
Thanks!
30
Backup
31
Details of building
  • Build a grid file from a given data set R
  • For each dimension, sort R along that dimension
    and sample quantiles as the scale
  • Partition the data space so as to balance the
    bucket size of each cell as much as possible
  • Build a histogram of the number of records in
    each bucket
  • For each record, use the scales to identify the
    bucket it belongs to
  • A prefix sum of the histogram gives the bucket
    offsets in the rearranged R
  • Scatter records into their buckets at the given
    offsets (see the sketch below)
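A serial host-side sketch of the histogram / prefix-sum /
scatter steps, reusing the Record type from the structure
sketch earlier. cell_of() stands in for the scale lookup, and
the real build uses CPU-GPU cooperation, not this serial loop.

#include <stdlib.h>
#include <string.h>

// Placeholder: identify the bucket (cell) a record belongs to
// via the scales; see the locate() sketch earlier.
static int cell_of(const Record *r) { return 0; }

void build(const Record *in, int n, int num_cells,
           int *dir /* num_cells + 1 entries */, Record *R_out) {
    // Histogram: number of records in each bucket.
    memset(dir, 0, (num_cells + 1) * sizeof(int));
    for (int i = 0; i < n; i++)
        dir[cell_of(&in[i]) + 1]++;

    // Prefix sum of the histogram: bucket offsets in the
    // rearranged R (dir[c] = start of bucket c, dir[num_cells] = n).
    for (int c = 0; c < num_cells; c++)
        dir[c + 1] += dir[c];

    // Scatter records into their buckets at the given offsets.
    int *cursor = (int *)calloc(num_cells, sizeof(int));
    for (int i = 0; i < n; i++) {
        int c = cell_of(&in[i]);
        R_out[dir[c] + cursor[c]++] = in[i];
    }
    free(cursor);
}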

32
Building Time
  • 16 million 2D records
  • Pure CPU-based building: 12 sec
  • 8 sec of that for sorting
  • GPU bitonic sort (Govindaraju et al., SIGMOD'05)
  • → reduces sorting to 3 sec

33
Range query details
  • Given the two end points, L and H, of the major
    diagonal of the query box
  • Obtain the two corresponding end cells, CL and CH
  • The coordinates of CL and CH bound all the cells
    in the desired range
  • For the points in boundary cells that have at
    least one coordinate equal to that of an end
    cell, the thread further performs a point-level
    test to check whether the points are really in
    the query box (see the sketch below)
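The point-level test itself is a per-axis containment check; a
minimal sketch, reusing the Record type from the earlier
structure sketch:

// True iff record r lies inside the query box [L, H] on every
// axis; points in strictly interior cells can skip this test.
__device__ bool in_box(const Record *r,
                       const unsigned int *L, const unsigned int *H) {
    for (int d = 0; d < D; d++)
        if (r->key[d] < L[d] || r->key[d] > H[d])
            return false;  // outside the box along axis d
    return true;
}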

34
Differences from MLGF / Buddy Tree
  • 1.
  • Both MLGF and the Buddy Tree cover only those
    cells that contain data points, and maintain a
    directory entry for each non-empty cell
  • Our hierarchical grid covers the entire data
    space and locates cells through shared scales
  • It is relatively simpler and more suitable for
    bulk loading in a parallel computing environment
  • 2.
  • As dynamic maintenance techniques, the two
    existing methods split an overflowed bucket into
    two at each level; thus the structures contain a
    relatively large number of levels in the tree or
    in the grid
  • Our hierarchical grid is a static structure, and
    the number of levels of sub-grids in a crowded
    cell is relatively small

35
Pseudocodes
36
Structure of a Hierarchical Grid File
37
Experimental setup
  • Relation: defaults to 16M tuples
  • Queries: the point query defaults to 1M query
    key values; the range query, to 100 query boxes