Boolean Ranking: Querying a Database by K-Constrained Optimization - PowerPoint PPT Presentation

About This Presentation

Title:

Boolean Ranking: Querying a Database by K-Constrained Optimization

Description:

Zhen Zhang, Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang. Presented at the ACM SIGMOD Conference (SIGMOD 2006), Chicago, June 2006.

Slides: 46

Provided by: OfficeofI77

Learn more at: https://crystal.uta.edu

Transcript and Presenter's Notes

Title: Boolean Ranking: Querying a Database by K-Constrained Optimization

1
Boolean + Ranking: Querying a Database by
K-Constrained Optimization

Zhen Zhang
Seung-won Hwang
Kevin C. Chang
Min Wang
Christian A. Lang
Yuan-chi Chang
Presented at the ACM SIGMOD Conference (SIGMOD 2006),
Chicago, June 2006

Presented By Pavan Kumar M.K. (1000618890)
Aditya Mangipudi (1000649172)
2
Outline

Introduction
Motivation
A* Search Algorithm
A*-Driven State Space Construction
Optimization Driven Configuration
OPT Search Algorithm
Experiments
Conclusion

3
Motivation

The widespread use of databases for managing
structured data, compounded by the expanded
reach of the Internet, has brought new
data retrieval and analysis scenarios
to RDBMSs.
In many of these scenarios, only the top-k results
are of interest to the user.

4
K-Constrained Optimization Query
Query: Select the top 5 second-year students in CSE
with the highest GPA
Boolean query (qualifying constraint): dept = CSE and year = 2
Ranking query (quantifying function): top 5 ranked by GPA
Goal: find the top answers

B: dept = CSE and year = 2
O: GPA
5
K-Constrained Optimization Query

Query Q = (G, k)
G: goal function, G = B · O
k: retrieval size

6
What is the query evaluation mechanism?
Given a ranking query combined with a Boolean query,
how should it be answered?
7
Current techniques lack a global search mechanism

If evaluated as separate operators
[Figure: Boolean query B, ranking query R]
If searched by an overall goal function G as a
ranking function
[Figure: Boolean query B, ranking query R]

Current techniques optimize only
condition by condition

8
Threshold Algorithm
[Figure: Threshold Algorithm example over attributes Att 1 and Att 2]

9
Assumptions

The Threshold Algorithm relies on the rigid
assumption that the function G is monotonic.
Monotonicity requires that G does not increase when
none of its parameters increases, i.e.,
G(x) <= G(y) whenever x_i <= y_i for every i.
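
To show where the monotonicity assumption enters, here is a minimal Python sketch of the Threshold Algorithm. It assumes every attribute has a list sorted in descending order and that higher attribute values are better. The function name `threshold_algorithm`, the toy objects, and the averaging function are illustrative assumptions, not taken from the slides.

```python
import heapq

def threshold_algorithm(objects, g, k):
    """Top-k by a monotonic g, using sorted access plus random access.

    objects: dict id -> tuple of attribute values (higher is better).
    g: monotonic scoring function over a tuple of attribute values.
    """
    m = len(next(iter(objects.values())))
    # One list of ids per attribute, sorted by that attribute, descending.
    sorted_lists = [
        sorted(objects, key=lambda o, i=i: objects[o][i], reverse=True)
        for i in range(m)
    ]
    seen = {}
    for depth in range(len(objects)):
        last = []
        for i in range(m):
            oid = sorted_lists[i][depth]      # sorted access on attribute i
            last.append(objects[oid][i])
            seen[oid] = g(objects[oid])       # random access for the full tuple
        # Because g is monotonic, no unseen object can score above this.
        threshold = g(tuple(last))
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Hypothetical data: two attributes per object, scored by their average.
objects = {"a": (0.9, 0.2), "b": (0.8, 0.8), "c": (0.1, 0.9), "d": (0.3, 0.3)}
average = lambda v: 0.5 * v[0] + 0.5 * v[1]
result = threshold_algorithm(objects, average, 2)
```

The early stop is only sound because the threshold bounds every unseen object, which is exactly what fails when G is not monotonic.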

10
Non-Monotonic Functions

Consider the example query below, which finds
houses in a certain price range with a good
price/sqrft ratio.
The function G here is non-monotonic: the score rises
as price approaches 300k and falls as price moves away from it.

SELECT h.address FROM House h
WHERE h.price <= 200k OR h.price >= 400k
ORDER BY h.size / |h.price - 300k|
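
A quick numeric check of the non-monotonicity, as a sketch. It assumes the ranking expression is size / |price - 300k| with price given in thousands; the absolute value is inferred from the scores shown on the Query Model slide, and the helper name `ranking` is hypothetical.

```python
def ranking(price_k, size):
    """Assumed ranking expression: size / |price - 300k|, price in thousands."""
    return size / abs(price_k - 300)

# Hold size fixed and raise the price step by step.
prices = (100, 200, 400, 600)
scores = [round(ranking(p, 2000), 2) for p in prices]
print(scores)
```

With size held at 2000, the score goes up between 100k and 200k and down between 400k and 600k, so no single sort order on price bounds the unseen scores.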
11
New Algorithm
[Figure: the new algorithm over attributes Att 1 and Att 2]

12
Need for encoding as a search problem

Existing algorithms build upon
problem-specific assumptions about the goal
functions or index traversals.
For example, the Threshold Algorithm assumes the
monotonicity of G and the use of sorted accesses
(interleaf navigation), on which its search
is implicitly hardwired.
For a Boolean query such as B: price > 100K, the
search is straightforward, because the constraint
expression B explicitly suggests how to carry
out a focused search, e.g., visiting only the
nodes whose locality potentially satisfies B.

13
Need for encoding as a search problem

In contrast, for a general k-constrained
optimization query, potentially involving
arbitrary ranking combined with Boolean
conditions and joins over multiple relations, e.g., a query Q
maximizing the size/price ratio, it is no longer
clear how to focus the search.
By encoding the query as a generic search with no
assumptions on G, the search is generalized to
support arbitrary G over potentially multiple
indices and a combination of both hierarchical
and interleaf traversals.

14
A* Algorithm

A* is a well-known search algorithm that finds
the shortest path from an initial state to a
designated goal state.
Widely used in the field of artificial
intelligence.
Uses best-first search traversal.
Uses heuristic information to carry out the
search in a guided manner.
With a suitable heuristic, A* is guaranteed to find
the correct answer (correctness) while visiting the
fewest states (optimality).
Examples: GPS navigation, Google Maps, many puzzles and games.
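
For readers who have not seen it, a minimal Python sketch of A* follows. It is a generic textbook version, not the variant used in the paper. The graph, the heuristic table `h`, and the function name `a_star` are illustrative assumptions; edge costs are assumed non-negative and `h` is assumed admissible (it never overestimates the remaining cost).

```python
import heapq

def a_star(graph, h, start, goal):
    """Return (cost, path) of a shortest path, or None if goal is unreachable.

    graph: dict node -> list of (neighbor, edge_cost), with costs >= 0.
    h: dict node -> admissible estimate of the remaining cost to goal.
    """
    frontier = [(h[start], 0, start, [start])]   # entries: (g + h, g, node, path)
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue                              # stale queue entry
        for nxt, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(
                    frontier, (new_g + h[nxt], new_g, nxt, path + [nxt]))
    return None

# Hypothetical toy graph; the true distances to "G" are S:5, A:4, B:2.
graph = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 2)]}
h = {"S": 4, "A": 3, "B": 2, "G": 0}
cost, path = a_star(graph, h, "S", "G")
```

The priority queue always expands the state with the smallest estimated total cost, which is the best-first behavior the slide refers to.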

15
Goal Function

For a tuple t with m attribute values, the goal
function G(t) maps the tuple to a non-negative
numeric score.

G(t) = B(t) · R(t)
     = R(t) if B(t) is true
     = 0 (i.e., the lowest score) if B(t) is false
16
Query Model
Tuple  Addr                 Price  Size  Score
1      Oak park, Chicago    600K   4500  15
2      Mattis, Champaign    350K   2000  0
3                           150K   1000  6.67
4                           250K   2000  0
5                           300K   3500  0
6                           80K    500   2.27

SELECT h.address FROM House h
WHERE h.price <= 200k OR h.price >= 400k
ORDER BY h.size / |h.price - 300k|
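
The scores on this slide can be reproduced with a few lines of Python. This is a sketch under two assumptions inferred from the listed scores rather than stated on the slide: the Boolean condition is price <= 200k or price >= 400k, and the ranking expression is size / |price - 300k| with price in thousands.

```python
houses = [  # (tuple id, price in thousands, size)
    (1, 600, 4500), (2, 350, 2000), (3, 150, 1000),
    (4, 250, 2000), (5, 300, 3500), (6, 80, 500),
]

def B(price):
    # Assumed Boolean condition, inferred from the slide's scores.
    return price <= 200 or price >= 400

def R(price, size):
    # Assumed ranking expression.
    return size / abs(price - 300)

def G(price, size):
    # Test B first so R is never evaluated at price == 300.
    return R(price, size) if B(price) else 0.0

scores = {tid: round(G(p, s), 2) for tid, p, s in houses}
top2 = sorted(scores, key=scores.get, reverse=True)[:2]
```

Scanning every tuple like this is the naive baseline; the rest of the talk is about reaching the same top answers while touching far fewer tuples.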
17
Landscape of Score Function - G
[Figure: landscape of the score function G over price and size for the six tuples of the Query Model slide, whose scores are 15, 0, 6.67, 0, 0, and 2.27]
18
OPT Framework

To realize k-constrained optimization over
databases, the paper develops the OPT
framework.
Objective: optimize G with the help of indices
as access methods over the tuples in D.
Discrete state search: from the view of using
indices, we search for the maximizing tuples
over the index nodes as discrete states.
Continuous function optimization: from the view
of maximizing goal functions, we optimize
G.

19
Evaluate the query as its nature suggests!
Optimize G over D, combining:
Function optimization of G
Discrete state search over D
20
B Tree Structure
Indices
Value Space
21
Some definitions first

States: States in a search graph represent
localities of values at different granularities,
from coarse to fine, eventually reaching tuples
in the database.
Region state
Tuple state
Transitions: While states give
locations on the map, transitions
capture the possible paths to follow to reach our
destination, the query answers.
Example: for two states u and v, there is a
transition (u, v) if v ∈ Next(u)

22
We view compound index as discrete space
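[Figure: tuples 1-6 plotted in the value space; vertical axis: price (k), with ticks at 100, 250, 350, 600; horizontal axis: size, with ticks at 1500, 3000, 4000, 4500]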
23
We view compound index as discrete space
[Figure: the price index (nodes b1, b2, b3, b6, b7) and the size index (nodes a1, a2, a3, a6, a7) partition the value space; each pair of nodes defines a region state Mij = (ai, bj), such as M11, M22, M23, M32, M33, M55, M56, M66, M67, M75, M76, and M77, with the tuples located in the leaf-level regions]
24
We view compound index as discrete space:
conceptually, a combined space
[Figure: the same price and size indices shown as one combined space of region states Mij = (ai, bj), including M11 and the leaf-level regions M55, M56, M66, M67, M75, M76, and M77 that hold the tuples]
25

Challenge 1: What is the search mechanism?

26
Encoding the problem into shortest path is
challenging
K-constrained optimization: find a tuple with maximal score
A* shortest path: find a path with minimal distance
A* gives the shortest path to a testable goal.
Here, the goal is to find the optimal tuple states with
maximal G-score.
27
Transformation needed

How to encode a tuple as a path?
Add a virtual target t* that is reachable only through
tuples
How to encode a maximal tuple as a minimal path?
The quality of a path depends solely on the tuple it
passes through
For a tuple state t:
D(t, t*) = -G(t)
For a transition between two states r, u:
D(r, u) = 0

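[Figure: search graph from M11 through region states (M22, M23, M32, M33; M55, M56, M66, M67, M75, M76, M77) down to tuples 1, 5, 4, 2; transitions between states have distance 0, and each tuple connects to the virtual target with distance -G(t), e.g., -G(1) and -G(4)]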
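
The encoding can be checked on a toy graph. The sketch below writes the virtual target as "t*", gives every state-to-state transition distance 0 and every tuple-to-target edge distance -G(t), and confirms that the shortest path ends at the tuple with the highest score. The region names and graph shape are hypothetical; the three scores are taken from the Query Model slide. Plain Bellman-Ford is used here instead of A*, simply because it handles the negative edges of this tiny example directly.

```python
def shortest_to_target(edges, source, target):
    """Bellman-Ford over a small edge list; returns (distance, path)."""
    nodes = {n for u, v, _ in edges for n in (u, v)}
    dist = {n: float("inf") for n in nodes}
    prev = {}
    dist[source] = 0.0
    for _ in range(len(nodes) - 1):
        for u, v, d in edges:
            if dist[u] + d < dist[v]:
                dist[v] = dist[u] + d
                prev[v] = u
    # Walk the predecessor links back from the target (assumed reachable).
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return dist[target], path[::-1]

G = {"t1": 15.0, "t3": 6.67, "t6": 2.27}        # scores from the Query Model slide
edges = [
    ("root", "regionA", 0.0), ("root", "regionB", 0.0),   # D(r, u) = 0
    ("regionA", "t1", 0.0), ("regionB", "t3", 0.0), ("regionB", "t6", 0.0),
]
edges += [(t, "t*", -score) for t, score in G.items()]    # D(t, t*) = -G(t)

dist, path = shortest_to_target(edges, "root", "t*")
```

The path length equals the negated score of the only tuple on the path, so minimizing distance is the same as maximizing G.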
28

Challenge 2: How to guide the search?

29
Functional Optimization perspective

Function optimization measures the quality of states
Function optimization aspects:
Defines proper heuristics
Identifies a set of initial states to start the
search

30
Structure of Procedure OPT

Input: G(x1, ..., xm) and the domain of values dom: xi
∈ [xi1, xi2]
Output: <O, U> = OPT(G, dom)
where O = local optima
U = upper bound score
OPTPOINT gives the O component of OPT
OPTMAX gives the U component of OPT

Approaches
Analytical method
Search based (e.g., hill
climbing)
Template based

31
States and Transitions
[Figure legend: High, Medium, Low]
The figure illustrates that different states hold
different promise. The search should favor
M77 over M67 because it is more promising.
32
1. Define admissible heuristics: measure the tightest
upper bound

To guarantee completeness
A* requires admissible heuristics, i.e., estimates that are
optimistic
To ensure admissible heuristics
Function optimization gives the tightest upper bound
Analytical approaches
Numeric analysis packages

H(region) = OPTMAX(G, region), i.e., the maximal value
of G in the region
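
For the house example, an upper bound of this kind can be derived analytically. The Python sketch below is one possible OPTMAX for rectangular regions, assuming G = size / |price - 300| where price <= 200 or price >= 400 (price in thousands) and G = 0 elsewhere. The function name `optmax` and the sample regions are illustrative assumptions, not the paper's implementation.

```python
def optmax(price_range, size_range):
    """Maximal value of the assumed G over a rectangular region.

    G grows with size and with closeness of price to 300, so the maximum
    sits at the largest size and at the feasible price nearest to 300.
    """
    (p_lo, p_hi), (_, s_hi) = price_range, size_range
    candidates = []
    if p_lo <= 200:
        candidates.append(min(p_hi, 200))   # feasible price nearest 300, low side
    if p_hi >= 400:
        candidates.append(max(p_lo, 400))   # feasible price nearest 300, high side
    if not candidates:
        return 0.0                          # region fails the Boolean condition
    return max(s_hi / abs(p - 300) for p in candidates)

# Hypothetical regions: (price range in thousands, size range).
high_end = optmax((350, 600), (4000, 6000))
infeasible = optmax((250, 350), (0, 6000))
low_end = optmax((0, 250), (0, 1500))
```

A bound computed this way is optimistic by construction: no tuple inside the region can score higher, which is what admissibility asks for.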
33
Consider Example
[Figure: tuples 1-6 in the price/size value space, with regions M67 and M77 marked]

h(M67) gives U = 0.
However, if we follow the link from M67 to M77, we
can reach tuple 1 with score 15.

34
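2. Configure descending space: disconnect uphills

To guarantee optimality
A* requires descending heuristics (the heuristic value
never increases along a transition)
To ensure descending heuristics
Remove uphill links

[Figure: state graph over M11, M55, M56, M66, M67, M75, M76, M77 and tuples 4, 1, 5, 2, with uphill links removed]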
35
Find the right start point: start from local optima

To guarantee correctness
Every tuple state must be reachable from the start
states
Taking only downhill links requires starting from high
points
To ensure reachability
Initial states should contain all local optima

[Figure: state graph over M11, M55, M56, M66, M67, M75, M76, M77 and tuples 4, 1, 2, 5, with the start states at the local optima]
36
Putting it together: executing A* on the
configured space
[Figure: top-down search over M11; M22, M23, M32, M33; M55, M56, M57, M66, M67, M75, M76, M77; and tuples 4, 1, 5, 2]

Search is implemented as a priority-queue-driven
traversal
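
A priority-queue-driven traversal of this kind can be sketched in Python as follows. This is a simplified, top-down-only illustration, not the paper's OPT algorithm: region states carry an upper bound, tuple states carry their exact score, and the queue always expands the most promising state. The two-region split on price, the helper names, and the assumed goal function (size / |price - 300| where price <= 200 or price >= 400, price in thousands) are all assumptions made for the example.

```python
import heapq
from itertools import count

def G(price, size):
    # Assumed goal function from the house example.
    return size / abs(price - 300) if (price <= 200 or price >= 400) else 0.0

def optmax(p_lo, p_hi, s_hi):
    # Upper bound of G over prices [p_lo, p_hi] and sizes up to s_hi.
    prices = ([min(p_hi, 200)] if p_lo <= 200 else []) \
           + ([max(p_lo, 400)] if p_hi >= 400 else [])
    return max((s_hi / abs(p - 300) for p in prices), default=0.0)

def region(p_lo, p_hi, s_hi, children):
    return {"bound": optmax(p_lo, p_hi, s_hi), "children": children}

def tup(tid, price, size):
    return {"bound": G(price, size), "tuple": tid}

def top_k(root, k):
    tie = count()                            # tie-breaker so dicts are never compared
    queue = [(-root["bound"], next(tie), root)]
    answers = []
    while queue and len(answers) < k:
        neg_bound, _, state = heapq.heappop(queue)
        if "tuple" in state:                 # a tuple at the head is the next answer
            answers.append((state["tuple"], -neg_bound))
            continue
        for child in state["children"]:      # expand a region state
            heapq.heappush(queue, (-child["bound"], next(tie), child))
    return answers

# Hypothetical two-level hierarchy over the six tuples of the Query Model slide.
root = region(0, 600, 4500, [
    region(0, 250, 2000, [tup(3, 150, 1000), tup(4, 250, 2000), tup(6, 80, 500)]),
    region(250, 600, 4500, [tup(1, 600, 4500), tup(2, 350, 2000), tup(5, 300, 3500)]),
])
answers = top_k(root, 2)
```

Because every region bound is at least as large as any score beneath it, a tuple that reaches the head of the queue cannot be beaten by anything still unexplored.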

37
Need of States and Transitions

Example: Given a set of states constructed from
the set of index graphs I, the search, in
principle, should follow the transitions to
look for the tuple states maximizing the goal
function. The search may follow either path:
M11 -> M33 -> M77 -> 1 (top-down search)
M57 -> M77 -> 1 (bottom-up search)

38
OPT Search Algorithm
[Figure: OPT search over the configured space: M11; M55, M56, M66, M67, M75, M76, M77; tuples 4, 1, 2, 5]
39
Optimality of OPT

OPT may incur different costs if started at
different initial states.
Top-down -> more hops; bottom-up -> fewer hops
Preference goes to bottom-up, but what if the
goal function is G = 1/((X - Y)^2 + 1)? Any value
satisfying
X = Y maximizes the function.

40
Experiments

Comparison vs.
Boolean then ranking
Ranking then Boolean
Metrics: nodes accessed (Nl, Nt)
Settings:
Benchmark queries over a real dataset
Controlled queries over synthetic datasets

41
Benchmark queries

Datasets
19,706 real estate listings crawled online
Queries
Q1: size bedrms / price-450k
40k < price < 50k
Q2: size ebedrms / price-350k
price < 400k, size > 4000
Q3: size/price, bedrms = 3 OR bedrms = 4

[Figure: results for Q1, Q2, and Q3]
42
Controlled queries

Datasets
Three randomly generated datasets of 100k points
Uniform, Gaussian, logvariatenormal
Queries
Linear average queries (e.g., 0.4a + 0.6b)
Nearest neighbor queries (e.g., (x-3)^2 + (y-4)^2)
Join queries (0.4R.a + 0.6S.b, R.c = R.d)

43
Conclusion

Problem:
Study K-constrained optimization queries as
Boolean + ranking
Abstraction:
Encode K-constrained optimization as a shortest
path problem
Framework:
Develop OPT to process K-constrained optimization queries

References
Z. Zhang, S. Hwang, K. C.-C. Chang, M. Wang, C. Lang, and Y. Chang.
Boolean + Ranking: Querying a Database by
K-Constrained Optimization.
In Proceedings of the 2006 ACM SIGMOD Conference
(SIGMOD 2006), pages 359-370, Chicago, June 2006.
www.wikipedia.org

45
Thank you!