An Improved Indexing Scheme for Range Queries - PowerPoint PPT Presentation

1 / 20
About This Presentation
Title:

An Improved Indexing Scheme for Range Queries

Description:

Partition of attribute data into several buckets. Each bucket is identified by an ID ... Same as the general bucketization method ... – PowerPoint PPT presentation

Number of Views:53
Avg rating:3.0/5.0
Slides: 21
Provided by: yvonn54
Category:

less

Transcript and Presenter's Notes

Title: An Improved Indexing Scheme for Range Queries


1
An Improved Indexing Scheme for Range Queries
  • Yvonne Yao
  • Adviser Professor Huiping Guo

2
Database-as-a-Service
  • Business organizations handle a large amount of
    data (TB)
  • Cost of managing and maintaining these data
    onsite is high
  • DAS
  • DBMSs outsourcing
  • Clients rely on service providers for data
    management and maintenance
  • Cost is a lot lowered.
  • But

3
Database-as-a-Service
  • Security of data is not guaranteed
  • Service providers are untrusted
  • Store only an encrypted form of data onto the
    remote server
  • Only users with the correct key(s) can have
    access
  • How then can we query the encrypted data?
  • Retrieve and decrypt the entire table, and apply
    SQL statements on it. Too expensive!
  • A more realistic approach was discovered

4
Database-as-a-Service
5
Bucketization
  • Various approaches to build meta-data B-tree
    based, hash-based, and bucket-based
  • What is bucketization?
  • Partition of attribute data into several buckets
  • Each bucket is identified by an ID
  • Bucket IDs are stored, along with encrypted data,
    on the remote server
  • Client keeps partition information as meta-data
  • General bucketization approach
  • Equi-width
  • Equi-depth

6
Example 1
7
Example 1
8
Example 1
  • User query
  • SELECT
  • FROM grades
  • WHERE gpa lt 3.0
  • Qserver
  • SELECT
  • FROM egrades
  • WHERE gpaID Bucket_1 OR
  • gpaID Bucket_2 OR
  • gpaID Bucket_3
  • Size of superset is 29, of which 7 of them are
    false positives

9
Query Optimal Bucketization
  • General idea minimizing the bucket cost of each
    bucket
  • Input ltD (V, F), Mgt
  • V v1, v2, v3, , vn where v1 lt v2 lt v3 lt
    ltvn
  • F Frequency of each value
  • M Number of buckets to fill
  • Output a matrix indicating the boundary of each
    bucket

10
Query Optimal Bucketization
  • QOB
  • Finds optimum solutions to two smaller
    sub-problems
  • one contains the leftmost M-1 buckets covering
    the (n-i) smallest points
  • Another contains the rightmost single bucket
    covering the remaining i points
  • V v1, v2, v3, v4, v5, v6, , vn-3, vn-2, vn-1,
    vn
  • n-i points go to last i points go to
  • M-1 buckets last bucket

11
Example 2
12
Example 2
  • Qserver
  • SELECT
  • FROM egrades
  • WHERE gpaID Bucket_1 OR
  • gpaID Bucket_2 OR
  • gpaID Bucket_3
  • Same as the general bucketization method
  • In most cases, QOB can outperform the
    conventional bucketization strategy, but not
    always

13
Deviation Bucketization
  • Built upon QOB, takes the same parameters
  • Has two levels of buckets
  • First level same as those produced by QOB
  • Second level bucketization of deviation values,
    the difference between the value itself to the
    average of the bucket
  • Each first-level-bucket has at most M second
    level buckets
  • QOB has at most M buckets, while DB has at most
    M2 buckets

14
Deviation Bucketization
  • DB
  • Run QOB (D, M)
  • Construct First-Level-Buckets from boundary
    matrix
  • For each First-Level-Bucket
  • Initialize empty datasets vi and fi
  • For each vi in the bucket
  • vi vi ? vi avg()
  • fi fi ? 1
  • Create a new dataset di (vi, fi)
  • Run QOB(di, M)

15
Example 3
16
Example 3
  • Qserver
  • SELECT
  • FROM egrades
  • WHERE gpaID Bucket_1 OR
  • gpaID Bucket_2 OR
  • gpaID Bucket_3_1 OR
  • gpaID Bucket_3_2
  • In this case, no false positives are returned
  • Generally, false positives will still be
    returned, just the number of them will be greatly
    reduced

17
Experiments
  • Two datasets
  • Synthetic dataset 105 integers from 0, 999
  • Real dataset 103 data points from the Aspect
    column of the Forest CoverType database in UCIs
    KDD Archive
  • Two sets of queries
  • Qsyn
  • Qreal

18
Experiment 1
19
Experiment 2
20
  • Thank You
Write a Comment
User Comments (0)
About PowerShow.com