Managing Uncertainty in Constantly-Evolving Environments


1
Managing Uncertainty in Constantly-Evolving Environments

Reynold Cheng (ckcheng_at_cs.purdue.edu)
http://www.cs.purdue.edu/homes/ckcheng
Department of Computer Science
PhD Oral Defense
Major Advisor: Prof. Sunil Prabhakar
March 31st, 2005

2
Sensor Databases
Goal: data retrieval in a correct, efficient, and scalable manner
3
Data Uncertainty

Due to limited network bandwidth and battery
power, readings are sampled
The value of the entity being monitored (e.g.,
temperature, location) is changing
Most of the time the database stores old values
Query results can be incorrect!

4
Answering Minimum Query with Database Readings
[Figure: recorded temperatures (x0, y0) and current temperatures (x1, y1) of sensors x and y, on a 0 to 30 °F axis.]

Database answer: X
Correct answer: Y

5
Bounding Uncertainty with Dead-Reckoning

Data values cannot change drastically
The system negotiates a bound d with the sensor

[Figure: the sensor holds (v, d) and sends its value v; the system stores the interval [v - d, v + d].]

Trade-off between data uncertainty and update
frequency
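
The trade-off can be illustrated with a minimal sketch, assuming the sensor sends an update only when its reading drifts more than d away from the last reported value v. The class name `DeadReckoningSensor` and the sample readings are hypothetical and not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass
class DeadReckoningSensor:
    """Hypothetical sensor that reports only when its value leaves [v - d, v + d]."""
    bound: float           # d: the negotiated bound
    reported: float        # v: the last value sent to the system
    updates_sent: int = 0

    def observe(self, actual):
        """Feed a new physical reading; return True if an update was sent."""
        if abs(actual - self.reported) > self.bound:
            self.reported = actual
            self.updates_sent += 1
            return True
        return False

    def interval(self):
        """Interval the system may assume for the current value."""
        return (self.reported - self.bound, self.reported + self.bound)


def count_updates(readings, bound):
    """Replay a list of readings and count how many updates the sensor sends."""
    sensor = DeadReckoningSensor(bound=bound, reported=readings[0])
    for value in readings[1:]:
        sensor.observe(value)
    return sensor.updates_sent


readings = [20.0, 20.4, 21.1, 22.5, 22.0, 19.0, 19.2]
print(count_updates(readings, bound=1.0))  # tight bound: 3 updates
print(count_updates(readings, bound=3.0))  # loose bound: 0 updates
```

A tighter bound keeps the stored interval narrow but costs more messages; a looser bound saves messages at the price of a wider uncertainty interval.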

6
Answering Minimum Query with Error-Bounded
Readings
[Figure: recorded temperatures x0 and y0 with the bound for the current temperature of sensors x and y, on a 0 to 30 °F axis.]

x certainly gives the minimum temperature reading

7
Answering Minimum Query with Error-Bounded
Readings
[Figure: recorded temperatures x0 and y0, each with an uncertainty pdf over the bound for the current temperature, for sensors x and y on a 0 to 30 °F axis.]

Answer: (X, 0.7), (Y, 0.3)
Answers are augmented with probabilistic guarantees
Measurement error is also a source of uncertainty

8
Probabilistic Queries

We represent the imprecision in the value of the
data as an interval with associated pdf
Query answers are augmented with probabilities
Probabilistic queries give us a correct (possibly
less precise) answer, instead of a potentially
incorrect answer
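
As a rough illustration of how a minimum query can return probabilities, the sketch below assumes independent sensors with uniform pdfs over their intervals and integrates numerically. The function names and the two example intervals are assumptions for illustration; they are not the values behind the (X, 0.7), (Y, 0.3) answer on the earlier slide.

```python
def uniform_cdf(x, lo, hi):
    """CDF of a uniform distribution on [lo, hi]."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def min_probabilities(intervals, steps=2000):
    """For each interval, estimate the probability that it holds the minimum.

    Uses P(i is min) = integral of f_i(x) * prod_{j != i} (1 - F_j(x)) dx,
    evaluated with the midpoint rule.
    """
    probs = []
    for i, (lo, hi) in enumerate(intervals):
        width = (hi - lo) / steps
        total = 0.0
        for k in range(steps):
            x = lo + (k + 0.5) * width
            others_larger = 1.0
            for j, (lo_j, hi_j) in enumerate(intervals):
                if j != i:
                    others_larger *= 1.0 - uniform_cdf(x, lo_j, hi_j)
            total += others_larger * width / (hi - lo)
        probs.append(total)
    return probs


# Sensor x in [5, 15] and sensor y in [10, 20] (illustrative values).
p_x, p_y = min_probabilities([(5.0, 15.0), (10.0, 20.0)])
print(round(p_x, 3), round(p_y, 3))  # 0.875 0.125
```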

9
Related Work

Wolfson et al. [DPD99] and Pfoser et al. [ISSD99] discussed probabilistic range queries for moving objects
Olston and Widom [SIGMOD02] discussed the tradeoff between precision and performance of querying replicated data
Probabilistic queries [ICDE03, SIGMOD03, TKDE04, VLDB04c]
Deshpande et al. presented probabilistic prediction for sensor values [VLDB04b]
Recent proposals [CIDR05a, CIDR05b] for probabilistic modeling of uncertain data

10
Probabilistic Databases

Probabilistic database: each tuple is augmented with a probability value (Barbara, Garcia-Molina and Porter) [TKDE92]
Dalvi and Suciu [VLDB04a] studied efficient query evaluation with ranked results
Probabilistic database in semi-structured model
XML data (Nierman and Jagadish) [VLDB02]
Acyclic data structure (Hung, Getoor and Subrahmanian) [ICDE03]

11
My Dissertation

Characterization of uncertainty in sensor and
mobile databases
Classification, evaluation and quality of
probabilistic queries
Efficient techniques for supporting uncertainty
management in databases

12
Outline

Introduction
Related Work
Uncertain Data and Probabilistic Queries
Indexing for Probabilistic Range Queries
Probabilistic Join Processing
Experimental Results
Prototype

14
Uncertainty Model
[Figure: uncertainty pdf fi(x) of Ti.z over the uncertainty interval [Li, Ri].]

Ti.z: dynamic attribute (e.g., temperature, location)
Used in various application domains, e.g.,
location uncertainty [DPD99, ISSD99]
DNA microarray data error [NAS02]

15
Classification of Probabilistic Queries

Nature of answer
Value-based: returns a single value, e.g., Average query ([l, u], pdf)
Entity-based: returns a set of objects, e.g., Range query ({(Ti, pi), pi > 0})
Aggregation
Aggregate: interplay between objects decides the result, e.g., Nearest-Neighbor query
Non-aggregate: whether an object satisfies a query is independent of others, e.g., Range query

16
Classification of Probabilistic Queries

Only the probabilistic range query (entity-based, non-aggregate class) had been studied before [WS99, ISSD99]

17
Quality of Probabilistic Result

Probabilistic queries: notion of result "quality"
Motivating example: range query (is Ti.z in range [l, u]?)
Regular range query: "yes" or "no"
Probabilistic range query
Propose metrics for each class of probabilistic queries

18
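Quality for Value-Aggregate Queries

Query result: $[l, u]$, $\{p(x) \mid x \in [l, u]\}$
$U[3, 4]$ is less ambiguous than $U[1, 100]$
Differential entropy: $H(X) = -\int_l^u p(x) \log_2 p(x)\,dx$
Measures the uncertainty associated with a random variable X with pdf p
$\max H(X) = \log_2(u - l)$ iff $X \sim U[l, u]$ (most uncertain)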
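
To make the entropy-based quality measure concrete, here is a small sketch that evaluates differential entropy for a piecewise-constant (histogram) pdf. Representing the pdf as bin edges plus probability masses is an assumption made for this example, and the function name is hypothetical.

```python
import math


def differential_entropy(edges, masses):
    """Differential entropy in bits of a histogram pdf.

    edges:  sorted bin boundaries, len(edges) == len(masses) + 1
    masses: probability mass of each bin (should sum to 1)
    """
    h = 0.0
    for left, right, mass in zip(edges, edges[1:], masses):
        if mass > 0:
            density = mass / (right - left)
            h -= mass * math.log2(density)
    return h


print(differential_entropy([3, 4], [1.0]))              # U[3, 4]: 0 bits
print(differential_entropy([1, 100], [1.0]))            # U[1, 100]: log2(99) bits
print(differential_entropy([1, 50.5, 100], [0.9, 0.1])) # skewed pdf on [1, 100]: lower
```

The uniform pdf on [1, 100] attains the maximum log2(99), while a pdf on the same interval that concentrates mass in one half scores lower, i.e. the answer is less ambiguous.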

19
Outline

Introduction
Related Work
Uncertain Data and Probabilistic Queries
Indexing for Probabilistic Range Queries
Probabilistic Join Processing
Experimental Results
Prototype

20
Probabilistic Range Query
[Figure: recorded temperatures and uncertainty intervals for the current temperatures of sensors T1 and T2, on a 0 to 30 °F axis.]

Answer: (T1, 0.2), (T2, 0.8)

21
Probabilistic Threshold Range Query (PTRQ)

Users are likely to be concerned only with results that have a high probability
Example: retrieve sensor ids with readings between 10°F and 25°F with probability ≥ 0.7
PTRQ: given [a, b] and p, return every Ti where Prob(value of Ti is inside [a, b]) ≥ p
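
A brute-force evaluation of a PTRQ might look like the sketch below, assuming a uniform pdf over each uncertainty interval. The sensor ids and intervals are made up for illustration, and no index is used.

```python
def range_probability(lo, hi, a, b):
    """P(value in [a, b]) for a value uniformly distributed on [lo, hi]."""
    overlap = min(hi, b) - max(lo, a)
    return max(0.0, overlap) / (hi - lo)


def ptrq(objects, a, b, p):
    """Return ids whose probability of lying in [a, b] is at least p."""
    return [
        oid
        for oid, (lo, hi) in objects.items()
        if range_probability(lo, hi, a, b) >= p
    ]


# Hypothetical sensors with uncertainty intervals in degrees Fahrenheit.
sensors = {"T1": (5.0, 15.0), "T2": (12.0, 22.0), "T3": (24.0, 34.0)}
print(ptrq(sensors, 10, 25, 0.7))  # ['T2']
print(ptrq(sensors, 10, 25, 0.5))  # ['T1', 'T2']
```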

22
Solving PTRQ with Interval Indexes

Use an R-tree or interval index [FOCS96, JCSS96, ADI00] to find intervals intersecting [a, b]
For each object retrieved, evaluate its probability of being within [a, b]
Return objects with probability ≥ p

23
The Problem of Current Indexes

Current interval indexes do not consider probabilities during search
Many irrelevant objects (probability < p) may be processed
Our approach:
Probability Threshold Indexing (PTI): 1D interval R-tree with uncertainty
Variance-based Clustering: transform intervals to 2D points and index based on variance

24
Pruning in a 1D R-Tree

Some intervals in the MBR may satisfy Q
Need to retrieve the contents of the MBR and
evaluate

25
x-bounds in a PTI Node
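[Figure: a PTI node annotated with its left-0.2-bound and right-0.2-bound; the labels "≤ 0.2" and "0.8" mark the probability mass on either side of a bound.]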
26
x-bounds in a PTI Node
[Figure: the left-0-bound and right-0-bound of a node coincide with the edges of its MBR.]
27
Pruning with x-bounds
[Figure: a node with its left-0.2-bound and right-0.2-bound.]

An MBR is not retrieved if there exists an x-bound such that p > x and b is on the left of the left-x-bound

An MBR is not retrieved if there exists an x-bound such that p > x and a is on the right of the right-x-bound
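
The pruning rule can be sketched as follows. The sketch assumes uniform pdfs and assumes that a left-x-bound is a position with at most probability x of any interval in the node lying to its left (symmetrically for the right-x-bound); the set of stored x values and all function names are illustrative choices, not the presentation's implementation.

```python
def left_x_bound(intervals, x):
    """Largest position with at most mass x of every interval to its left."""
    return min(lo + x * (hi - lo) for lo, hi in intervals)


def right_x_bound(intervals, x):
    """Smallest position with at most mass x of every interval to its right."""
    return max(hi - x * (hi - lo) for lo, hi in intervals)


def can_prune(intervals, a, b, p, xs=(0.0, 0.1, 0.3, 0.5, 0.7, 0.9)):
    """True if some stored x-bound proves no interval in the node reaches p."""
    for x in xs:
        if p > x and (
            b < left_x_bound(intervals, x) or a > right_x_bound(intervals, x)
        ):
            return True
    return False


def range_probability(lo, hi, a, b):
    """P(value in [a, b]) for a value uniformly distributed on [lo, hi]."""
    return max(0.0, min(hi, b) - max(lo, a)) / (hi - lo)


node = [(0.0, 10.0), (2.0, 12.0), (4.0, 20.0)]
print(can_prune(node, 0.0, 2.5, p=0.5))  # True: b = 2.5 is left of the left-0.3-bound
print(can_prune(node, 0.0, 2.5, p=0.2))  # False: (0, 10) has probability 0.25
```

With x = 0 the bounds are the MBR edges, so the test degenerates to the usual interval-overlap check.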

28
Implementation of PTI
29
Drawback of PTI

Extra overhead in storing x-bounds
Small intervals near edges limit gains

[Figure: the left-0.2-bound and right-0.2-bound of a node with small intervals near its edges.]
30
Clustering 2D points
[Figure: each interval [Li, Ri] is plotted as the 2D point (x, y) = (Li, Ri) above the line x = y; one cluster holds large intervals and another holds smaller intervals.]

When 2D points are clustered, intervals of different variances are separated

Points are clustered based on means and variances (variance-based clustering)

31
Answering PTRQ with 2D R-Tree

Construct a 2D R-tree over uncertain data by indexing (mean_i, variance_i)
Query the 2D R-tree
For a uniform pdf, a PTRQ can be converted to a 2D range query

32
Querying Uniform pdf
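[Figure: 1D view (uniform pdf) and 2D view of a query Q = [a, b] with p = 0.75; the 2D axes are x = Li and y = Ri, with the line x = y.]

With (x, y) = (Li, Ri), the query region is bounded by:
$y(1-p) + xp \ge a$ (intervals containing a)
$a < x < y < b$ (intervals in [a, b])
$x(1-p) + yp \le b$ (intervals containing b)
$b - a \ge p(y - x)$ (intervals containing [a, b])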
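
The conversion can be sanity-checked with a short sketch that compares the four linear conditions on the point (x, y) = (Li, Ri) against a direct probability computation. It assumes a uniform pdf and 0 < p ≤ 1; the function names and the random test harness are illustrative only.

```python
import random


def range_probability(x, y, a, b):
    """P(value in [a, b]) for a value uniformly distributed on [x, y]."""
    return max(0.0, min(y, b) - max(x, a)) / (y - x)


def satisfies_2d(x, y, a, b, p):
    """Linear tests on the 2D point (x, y); no probability is computed."""
    if y < a or x > b:            # no overlap with the query
        return False
    if a <= x and y <= b:         # interval inside [a, b]
        return True
    if x <= a and b <= y:         # interval contains [a, b]
        return b - a >= p * (y - x)
    if x < a:                     # interval contains only a
        return y * (1 - p) + x * p >= a
    return x * (1 - p) + y * p <= b   # interval contains only b


def count_disagreements(trials=10000, seed=0):
    """Compare the 2D tests with the direct computation on random inputs."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        x = rng.uniform(0, 100)
        y = x + rng.uniform(0.1, 50)
        a = rng.uniform(0, 100)
        b = a + rng.uniform(0.1, 50)
        p = rng.uniform(0.1, 1.0)
        direct = range_probability(x, y, a, b)
        if abs(direct - p) < 1e-9:    # skip ties that hinge on rounding
            continue
        if satisfies_2d(x, y, a, b, p) != (direct >= p):
            bad += 1
    return bad


print(count_disagreements())  # expected to report no disagreements
```

Because each condition is linear in (x, y), the qualifying points form a polygonal region that a 2D index can search directly.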
33
Outline

Introduction
Related Work
Uncertain Data and Probabilistic Queries
Indexing for Probabilistic Range Queries
Probabilistic Join Processing
Experimental Results
Prototype

34
Table Join

Join is an important database problem
Join operators: =, ≠, >, <

35
Join over Uncertainty

How do we define comparison operators for
uncertain data?

36
Semantics of Comparison Operators

We studied the meaning of =, ≠, >, <
Imprecision about two values satisfying a
comparison is expressed by a probability value

37
Equality Join

In a continuous domain, two real values are equal at a point with zero probability
Resolution c: a is equal to b if they are within c of each other
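
As an illustration of equality with resolution c, the sketch below estimates the probability that two uncertain values lie within c of each other. It assumes the values are independent with uniform pdfs and uses midpoint-rule integration; the function names are hypothetical.

```python
def uniform_cdf(x, lo, hi):
    """CDF of a uniform distribution on [lo, hi]."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def prob_equal_within(a, b, c, steps=2000):
    """Estimate P(|A - B| <= c) for independent uniform A on a and B on b.

    Integrates f_A(x) * (F_B(x + c) - F_B(x - c)) over A's interval.
    """
    (la, ra), (lb, rb) = a, b
    width = (ra - la) / steps
    total = 0.0
    for k in range(steps):
        x = la + (k + 0.5) * width
        total += uniform_cdf(x + c, lb, rb) - uniform_cdf(x - c, lb, rb)
    return total / steps


print(round(prob_equal_within((0.0, 10.0), (0.0, 10.0), c=1.0), 4))  # 0.19
print(prob_equal_within((0.0, 1.0), (5.0, 6.0), c=1.0))              # 0.0
```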

38
Efficient Join Processing

Computing joins is costly
The probability threshold p can help
Three pruning techniques, applied between:
two items,
two pages, and
two indices
Example: equality join with resolution c

39
Item-Level Pruning

Assume a cumulative distribution function $F_i(x)$
Let $l_{a,b,c} = \max(L_a - c, L_b - c)$
Let $u_{a,b,c} = \min(R_a + c, R_b + c)$
$P(a =_c b)$ is at most $\min\big(F_a(u_{a,b,c}) - F_a(l_{a,b,c}),\ F_b(u_{a,b,c}) - F_b(l_{a,b,c})\big)$ (*)
Equation (*) is easy to compute
If Equation (*) < p, then a and b can be pruned
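
A minimal sketch of this item-level test, assuming uniform pdfs so that the cumulative distribution functions have a closed form. The names `upper_bound` and `can_prune_pair` and the example intervals are hypothetical.

```python
def uniform_cdf(x, lo, hi):
    """CDF of a uniform distribution on [lo, hi]."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def upper_bound(a, b, c):
    """Upper bound on P(a and b are within c of each other)."""
    (la, ra), (lb, rb) = a, b
    l = max(la - c, lb - c)
    u = min(ra + c, rb + c)
    if u <= l:
        return 0.0   # the widened intervals do not meet at all
    mass_a = uniform_cdf(u, la, ra) - uniform_cdf(l, la, ra)
    mass_b = uniform_cdf(u, lb, rb) - uniform_cdf(l, lb, rb)
    return min(mass_a, mass_b)


def can_prune_pair(a, b, c, p):
    """The pair cannot reach threshold p if even the upper bound is below p."""
    return upper_bound(a, b, c) < p


a, b = (0.0, 10.0), (8.0, 30.0)
print(upper_bound(a, b, c=1.0))            # 3/22, about 0.136
print(can_prune_pair(a, b, c=1.0, p=0.2))  # True
print(can_prune_pair(a, b, c=1.0, p=0.1))  # False
```

The bound only needs four CDF evaluations per pair, so it can be checked before any exact probability is computed.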

40
Page-Level Pruning

Goal: prune pages R and S without examining individual items
Solution: place x-bounds on R and S, and perform 4 tests with x-bounds

41
Page-Level Pruning

[Figure: page R with its R.left-x-bound and R.right-x-bound.]
42
BNLJ and U-BNLJ

BNLJ
Block-Nested-Loop Join
Joins 2 lists of unordered pages
U-BNLJ
Place x-bounds on every page
Use page-level pruning techniques

43
Index-Level Pruning

Extension of page-level pruning
Construct PTI on the relation
Perform page-level pruning on the page storing the node (which consists of x-bounds for its child nodes)

44
INLJ and U-INLJ

INLJ
Index-Nested-Loop Join
Construct an R-tree or interval index for the inner relation
U-INLJ
Construct PTI for inner relation

45
Outline

Introduction
Related Work
Uncertain Data and Probabilistic Queries
Indexing for Probabilistic Range Queries
Probabilistic Join Processing
Experimental Results
Prototype

46
Experiment 1: Threshold Range Query

Compare number of I/Os between
1D R-tree on intervals only
PTI (1D R-tree with probability thresholds)
2D variance-based clustering (called Extensive)

47
Simulation Model

100K uncertain data items, with length uniformly distributed in [0, 10000] and a uniform uncertainty pdf
10K PTRQs with the length of [a, b] normally distributed and p ∈ [0.1, 1]
Each PTI node contains five x-bounds, where x ∈ {0.1, 0.3, 0.5, 0.7, 0.9}

48
Scalability of Indexes

Both PTI and Extensive outperform R-tree
Answering PTRQ with R-tree requires more
computation
Extensive needs about 50 less I/Os than PTI

49
Query Probability Threshold

R-tree does not benefit from the increasing value
of p
When p is 0.5, Extensive is 4 times better than
PTI

50
Experiment 2: Threshold Join

2 tables of uncertain data are generated
Compare interval joins and x-bound-enhanced
equality joins
BNLJ vs. U-BNLJ
INLJ vs. U-INLJ
Measure no. of candidate pairs

51
BNLJ vs. U-BNLJ
52
INLJ vs. U-INLJ
53
Effect of Resolution c
54
Outline

Introduction
Related Work
Uncertain Data and Probabilistic Queries
Indexing for Probabilistic Range Queries
Probabilistic Join Processing
Experimental Results
Prototype

55
U-DBMS Prototype

A system for handling uncertain data
Meta-queries for specifying data uncertainty (e.g., uncertainty interval, type of uncertainty pdf)
Extension of SQL operators to support different probabilistic query classes
Measurement of probabilistic answer quality
Allows easy addition of new uncertain data types (e.g., uncertain pdf) and query operators

56
Architecture of U-DBMS

UNCERTAIN class: data structures (e.g., histogram) and access methods (e.g., find variance of uncertainty)
New data types and operators do not interfere
with the original DBMS
Implemented with PL/SQL and external C code

57
Example Queries
58
Summary

Data staleness in sensor and mobile databases can lead to incorrect query answers
We propose the uncertainty interval as a trade-off between data imprecision and update frequency
We perform an extensive study of uncertainty management: classification, quality, and efficient evaluation of probabilistic queries; indexing; joins
Prototype implementation using PostgreSQL

59
Related Publications

[TKDE04] R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Querying imprecise data in moving object environments. In IEEE TKDE, 2004.
[VLDB04c] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter. Efficient indexing methods for probabilistic threshold queries over uncertain data. In VLDB, 2004.
[SIGMOD03] R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In ACM SIGMOD, 2003.
[ICDE03] R. Cheng, S. Prabhakar, and D. V. Kalashnikov. Querying imprecise data in moving object environments. In IEEE ICDE, 2003.
[HCI04] R. Cheng and S. Prabhakar. Using Uncertainty to Provide Privacy-Preserving and High-Quality Location-Based Services. In Workshop on Location Systems Privacy and Control, Mobile HCI04.
[VSSN04] K. Y. Lam, R. Cheng, B. Liang, and J. Chau. Sensor Node Selection for Execution of Continuous Probabilistic Threshold Queries in Wireless Sensor Networks. In VSSN, ACM Multimedia, 2004.

60
References

[FOCS96] L. Arge and J. S. Vitter. On dynamic interval management in external memory (extended abstract). In FOCS, pp. 560-569, 1996.
[TKDE92] D. Barbara, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE TKDE, 4(5):487-502, 1992.
[NAS02] J. Brody, B. Williams, B. Wold, and S. Quake. Significance and statistical errors in the analysis of DNA microarray data. Proc. Natl. Acad. Sci. USA, 99(20), 2002.
[CH89] C. Chatfield. The Analysis of Time Series: An Introduction. Chapman and Hall, 1989.
[VLDB04a] N. Dalvi and D. Suciu. Efficient Query Evaluation on Probabilistic Databases. In VLDB, 2004.
[VLDB04b] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-Driven Data Acquisition in Sensor Networks. In VLDB, 2004.
[CIDR05a] A. Deshpande, C. Guestrin, and S. Madden. Using Probabilistic Models for Data Management in Acquisitional Environments. In CIDR, 2005.
[ICDE03] E. Hung, L. Getoor, and V. S. Subrahmanian. PXML: A Probabilistic Semistructured Data Model and Algebra. In ICDE, 2003.

61
References

[JCSS96] P. C. Kanellakis, S. Ramaswamy, D. Vengroff, and J. S. Vitter. Indexing for data models with constraints and classes. J. Comp. Syst. Sci., 52(3):589-612, 1996.
[ADI00] Y. Manolopoulos, Y. Theodoridis, and V. J. Tsotras. Chapter 4: Access methods for intervals. In Advanced Database Indexing, Kluwer, 2000.
[VLDB02] A. Nierman and H. V. Jagadish. ProTDB: Probabilistic Data in XML. In VLDB, 2002.
[DPD99] O. Wolfson, P. Sistla, S. Chamberlain, and Y. Yesha. Updating and querying databases that track mobile units. Distributed and Parallel Databases, 7(3), 1999.
[SIGMOD02] C. Olston and J. Widom. Best-effort cache synchronization with source cooperation. In Proc. of the ACM SIGMOD, 2002.
[ISSD99] D. Pfoser and C. S. Jensen. Capturing the Uncertainty of Moving-Object Representations. In Proc. of the Sixth International Symposium on Spatial Databases, Hong Kong, July 20-23, 1999, pp. 111-132.
[CIDR05b] J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, 2005.

62
Thank You

Imprecise data is prevalent in an increasing number of applications. I believe uncertainty management is an emerging and important area.

"Life is Uncertain... Eat Dessert First!" - S. Gordon and H. Brecher
63
How to define uncertainty pdf?

The form of the uncertainty pdf depends on the application, e.g., a Gaussian distribution models measurement error
If no information about the pdf is known, a simple way is to assume a uniform pdf, which is a pessimistic estimation
Can also use more sophisticated techniques, based on time-series analysis of past data, for pdf derivation [CH89]

64
Classical Decomposition