Title: Managing Uncertainty in Constantly-Evolving Environments
1Managing Uncertainty in Constantly-Evolving Environments
- Reynold Cheng (ckcheng_at_cs.purdue.edu)
- http://www.cs.purdue.edu/homes/ckcheng
- Department of Computer Science
- PhD Oral Defense
- Major Advisor: Prof. Sunil Prabhakar
- March 31st, 2005
2Sensor Databases
Goal: data retrieval in a correct, efficient and scalable manner
3Data Uncertainty
- Due to limited network bandwidth and battery power, readings are sampled
- The value of the entity being monitored (e.g., temperature, location) is changing
- Most of the time the database stores old values
- Query results can be incorrect!
4Answering Minimum Query with Database Readings
[Figure: recorded temperatures x0, y0 and current temperatures x1, y1 of sensors x and y, on a 0-30 °F axis]
- Database answer: X
- Correct answer: Y
5Bounding Uncertainty with Dead-Reckoning
- Data values cannot change drastically
- The system negotiates a bound d with the sensor
[Figure: the sensor sends v; the system holds (v, d) and the bound [v - d, v + d]]
- Trade-off between data uncertainty and update frequency
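The trade-off can be illustrated with a minimal sketch. It assumes a common dead-reckoning policy, not stated on the slide: the sensor sends a new value only when its reading leaves [v - d, v + d] around the last value sent. The function name and the sample readings are hypothetical.

```python
def dead_reckoning_updates(readings, d):
    """Return the (index, value) updates a sensor sends under bound d.

    Assumed policy: an update is sent only when the reading differs from
    the last sent value by more than d.
    """
    updates = []
    last_sent = None
    for i, v in enumerate(readings):
        if last_sent is None or abs(v - last_sent) > d:
            updates.append((i, v))
            last_sent = v
    return updates


# Hypothetical temperature samples.
readings = [20.0, 20.5, 21.2, 22.5, 22.0, 19.9]
tight = dead_reckoning_updates(readings, d=1.0)  # small bound, more updates
loose = dead_reckoning_updates(readings, d=3.0)  # large bound, fewer updates
print(len(tight), len(loose))
```

A smaller d keeps the stored interval narrow at the cost of more messages; a larger d saves bandwidth but widens the uncertainty.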
6Answering Minimum Query with Error-Bounded Readings
[Figure: recorded temperatures x0, y0 and the bound for the current temperature of sensors x and y, on a 0-30 °F axis]
- x certainly gives the minimum temperature reading
7Answering Minimum Query with Error-Bounded Readings
[Figure: recorded temperatures x0, y0, the bound for the current temperature, and the uncertainty pdf of sensors x and y, on a 0-30 °F axis]
- Answer: (X, 0.7), (Y, 0.3)
- Answers augmented with probabilistic guarantees
- Measurement error also a source of uncertainty
8Probabilistic Queries
- We represent the imprecision in the value of the data as an interval with an associated pdf
- Query answers are augmented with probabilities
- Probabilistic queries give us a correct (possibly less precise) answer, instead of a potentially incorrect answer
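As an illustration of how a probabilistic minimum query attaches probabilities to its answers, the sketch below numerically integrates P(X < Y) for two independent values with uniform uncertainty pdfs. The intervals, the independence assumption, and the function name are hypothetical and are not taken from the slides.

```python
def prob_first_is_min(x, y, steps=10_000):
    """P(X < Y) for independent uniform X on x=(lo, hi) and Y on y=(lo, hi)."""
    (xl, xh), (yl, yh) = x, y
    width = (xh - xl) / steps
    total = 0.0
    for k in range(steps):
        t = xl + (k + 0.5) * width  # midpoint rule over X's interval
        # P(Y > t) for uniform Y, clamped to [0, 1].
        tail = min(1.0, max(0.0, (yh - t) / (yh - yl)))
        total += tail * width / (xh - xl)  # f_X(t) * P(Y > t) * dt
    return total


# Hypothetical uncertainty intervals for sensors x and y.
p_x = prob_first_is_min((5.0, 20.0), (15.0, 25.0))
p_y = 1.0 - p_x
print([("X", round(p_x, 3)), ("Y", round(p_y, 3))])
```

The query then returns both sensors, each paired with the probability that it holds the minimum.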
9Related Work
- Wolfson et al. [DPD99] and Pfoser et al. [ISSD99] discussed probabilistic range queries for moving objects
- Olston and Widom [SIGMOD02] discussed the trade-off between precision and performance of querying replicated data
- Probabilistic queries [ICDE03, SIGMOD03, TKDE04, VLDB04c]
- Deshpande et al. presented probabilistic prediction for sensor values [VLDB04b]
- Recent proposals [CIDR05a, CIDR05b] for probabilistic modeling of uncertain data
10Probabilistic Databases
- Probabilistic database: each tuple is augmented with a probability value (Barbara, Garcia-Molina and Porter) [TKDE92]
- Dalvi and Suciu [VLDB04a] studied efficient query evaluation with ranked results
- Probabilistic databases in the semi-structured model
- XML data (Nierman and Jagadish) [VLDB02]
- Acyclic data structure (Hung, Getoor and Subrahmanian) [ICDE03]
11My Dissertation
- Characterization of uncertainty in sensor and mobile databases
- Classification, evaluation and quality of probabilistic queries
- Efficient techniques for supporting uncertainty management in databases
12Outline
- Introduction
- Related Work
- Uncertain Data and Probabilistic Queries
- Indexing for Probabilistic Range Queries
- Probabilistic Join Processing
- Experimental Results
- Prototype
14Uncertainty Model
[Figure: uncertainty pdf fi(x) of Ti.z over the uncertainty interval [Li, Ri]]
- Ti.z: dynamic attribute (e.g., temperature, location)
- Used in various application domains, e.g., location uncertainty [DPD99, ISSD99] and DNA microarray data error [NAS02]
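A minimal sketch of this model, assuming a uniform uncertainty pdf over the interval. The class and method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniformUncertainValue:
    """A dynamic attribute Ti.z known only to lie in [lo, hi] (Li and Ri)."""
    lo: float
    hi: float

    def pdf(self, x: float) -> float:
        # Uniform uncertainty pdf: constant inside the interval, zero outside.
        if self.lo <= x <= self.hi:
            return 1.0 / (self.hi - self.lo)
        return 0.0

    def prob_in(self, a: float, b: float) -> float:
        # Probability mass of the pdf that falls inside [a, b].
        overlap = min(b, self.hi) - max(a, self.lo)
        return max(0.0, overlap) / (self.hi - self.lo)


t1 = UniformUncertainValue(lo=10.0, hi=30.0)
print(t1.prob_in(15.0, 25.0))
```

Other pdfs (for example a histogram or a Gaussian truncated to the interval) would only change `pdf` and `prob_in`; the interval itself stays the same.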
15Classification of Probabilistic Queries
- Nature of answer
- Value-based: returns a single value, e.g., Average query: ([l, u], pdf)
- Entity-based: returns a set of objects, e.g., Range query: {(Ti, pi), pi > 0}
- Aggregation
- Aggregate: interplay between objects decides the result, e.g., Nearest-Neighbor query
- Non-aggregate: whether an object satisfies a query is independent of others, e.g., Range query
16Classification of Probabilistic Queries
- Only probabilistic range queries (entity-based, non-aggregate class) were studied before [WS99, ISSD99]
17Quality of Probabilistic Result
- Probabilistic queries: notion of result "quality"
- Motivating example: range query (is Ti.z in range [l, u]?)
- Regular range query: "yes" or "no"
- Probabilistic range query
- Propose metrics for each class of probabilistic queries
18Quality for Value- Aggregate Queries
- Query result: [l, u], {p(x) | x ∈ [l, u]}
- U[3, 4] is less ambiguous than U[1, 100]
- Differential entropy: $H(X) = -\int_{l}^{u} p(x) \log_2 p(x)\,dx$
- Measures the uncertainty associated with a r.v. X with pdf p
- $\max H(X) = \log_2(u - l)$ iff $X \sim U[l, u]$ (most uncertain)
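The entropy-based metric can be checked numerically with a short sketch. It assumes the result pdf is piecewise constant (a histogram); the function name and the example histograms are hypothetical.

```python
import math


def differential_entropy(edges, probs):
    """H(X) in bits for a piecewise-constant pdf.

    edges: bin boundaries, probs: probability mass of each bin.
    """
    h = 0.0
    for lo, hi, p in zip(edges, edges[1:], probs):
        if p > 0.0:
            density = p / (hi - lo)
            h -= p * math.log2(density)  # integral of -p(x) log2 p(x) over the bin
    return h


narrow = differential_entropy([3.0, 4.0], [1.0])                # U[3, 4]
wide = differential_entropy([1.0, 100.0], [1.0])                # U[1, 100]
skewed = differential_entropy([1.0, 50.5, 100.0], [0.9, 0.1])   # same range, not uniform
print(narrow, wide, skewed)
```

The uniform result over [1, 100] reaches the maximum log2(99) bits, while the skewed pdf on the same range scores lower, i.e., it is a higher-quality answer under this metric.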
19Outline
- Introduction
- Related Work
- Uncertain Data and Probabilistic Queries
- Indexing for Probabilistic Range Queries
- Probabilistic Join Processing
- Experimental Results
- Prototype
20Probabilistic Range Query
[Figure: recorded temperatures and the uncertainty for the current temperature of T1 and T2, on a 0-30 °F axis]
21Probabilistic Threshold Range Query (PTRQ)
- Users are likely to be concerned with results with a high probability
- Example: retrieve sensor ids with readings between 10 °F and 25 °F with probability ≥ 0.7
- PTRQ: given [a, b] and p, return Ti where Prob(value of Ti is inside [a, b]) ≥ p
22Solving PTRQ with Interval Indexes
- Use an R-tree or interval index [FOCS96, JCSS96, ADI00] to find intervals intersecting [a, b]
- For each object retrieved, evaluate its probability of being within [a, b]
- Return objects with probability ≥ p
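The filter-and-refine procedure above can be sketched as follows. This is only an illustration: a linear scan stands in for the R-tree or interval index, the pdfs are assumed uniform, and the function name and data are hypothetical.

```python
def ptrq_scan(objects, a, b, p):
    """Answer a PTRQ over objects: dict id -> (L, R) uniform uncertainty intervals."""
    result = []
    for oid, (lo, hi) in objects.items():
        # Filter step: skip intervals that do not intersect [a, b].
        if hi < a or lo > b:
            continue
        # Refine step: probability mass of the uniform pdf inside [a, b].
        prob = (min(b, hi) - max(a, lo)) / (hi - lo)
        if prob >= p:
            result.append((oid, prob))
    return result


# Hypothetical sensor readings with their uncertainty intervals.
sensors = {"T1": (12.0, 22.0), "T2": (20.0, 40.0), "T3": (0.0, 8.0)}
print(ptrq_scan(sensors, a=10.0, b=25.0, p=0.7))
```

Here T2 intersects the query range and must be refined, yet it is discarded because only a quarter of its probability mass lies inside the range.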
23The Problem of Current Indexes
- Current interval indexes do not consider probabilities during search
- Many irrelevant objects (probability < p) may be processed
- Our approach
- Probability Threshold Indexing (PTI): 1D interval R-tree with uncertainty
- Variance-based Clustering: transform intervals to 2D points and index based on variance
24Pruning in a 1D R-Tree
- Some intervals in the MBR may satisfy Q
- Need to retrieve the contents of the MBR and
evaluate
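25x-bounds in a PTI Node
[Figure: a PTI node with its left-0.2-bound and right-0.2-bound; the labels "≤ 0.2" and "0.8" mark probability mass relative to a bound]
26x-bounds in a PTI Node
[Figure: the left-0-bound and right-0-bound coincide with the MBR]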
27Pruning with x-bounds
[Figure: a PTI node with its left-0.2-bound and right-0.2-bound]
- An MBR is not retrieved if there exists an x-bound with p > x and b on the left of the left-x-bound
- An MBR is not retrieved if there exists an x-bound with p > x and a on the right of the right-x-bound
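The two pruning rules can be sketched in a few lines. The sketch assumes that a left-x-bound is a position such that no interval in the node has more than x of its probability mass to the left of it (and symmetrically for a right-x-bound); this definition is inferred from the figures and is not spelled out in the slide text. The function name and the node's bound positions are hypothetical.

```python
def can_prune(left_bounds, right_bounds, a, b, p):
    """Decide whether a node can be skipped for the query ([a, b], p).

    left_bounds / right_bounds: dict mapping x to the position of the
    node's left-x-bound / right-x-bound.
    """
    for x, pos in left_bounds.items():
        if p > x and b < pos:   # query lies left of a left-x-bound
            return True
    for x, pos in right_bounds.items():
        if p > x and a > pos:   # query lies right of a right-x-bound
            return True
    return False


# Hypothetical node whose MBR is [0, 100].
left = {0.0: 0.0, 0.2: 15.0}
right = {0.0: 100.0, 0.2: 80.0}
print(can_prune(left, right, a=5.0, b=10.0, p=0.5))   # pruned via left-0.2-bound
print(can_prune(left, right, a=5.0, b=10.0, p=0.1))   # threshold too low to prune
```

Under the assumed definition, a query entirely left of the left-0.2-bound can capture at most 0.2 of any interval's mass, so no interval in the node can reach a threshold p > 0.2.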
28Implementation of PTI
29Drawback of PTI
- Extra overhead in storing x-bounds
- Small intervals near edges limit gains
[Figure: left-0.2-bound and right-0.2-bound of a node]
30Clustering 2D points
[Figure: intervals plotted as 2D points (Li, Ri), with axes x = Li and y = Ri and the line x = y; one cluster of large intervals and one cluster of smaller intervals]
- When 2D points are clustered, intervals of different variances are separated
- Points clustered based on means and variances (variance-based clustering)
31Answering PTRQ with 2D R-Tree
- Construct a 2D R-tree over uncertain data by indexing (mean_i, variance_i)
- Query the 2D R-tree
- For uniform pdf, a PTRQ can be converted to a 2D range query
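For a uniform pdf, the (mean, variance) key of an interval follows from the standard formulas for a uniform distribution. The sketch below shows the transformation only; the function name is hypothetical and the R-tree itself is not implemented.

```python
def interval_to_point(lo, hi):
    """Map an uncertainty interval [lo, hi] with uniform pdf to (mean, variance)."""
    mean = (lo + hi) / 2.0
    variance = (hi - lo) ** 2 / 12.0  # variance of a uniform distribution
    return mean, variance


print(interval_to_point(10.0, 22.0))
```

Intervals of similar length get similar variance coordinates, so they tend to end up in the same region of the 2D index.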
32Querying Uniform pdf
[Figure: 1D view (uniform pdf) and 2D view, with axes x = Li and y = Ri and the line x = y, of the query Q = ([a, b], p = 0.75)]
- Intervals in [a, b]: $a < x < y < b$
- Intervals containing a: $y(1-p) + xp \geq a$
- Intervals containing b: $x(1-p) + yp \leq b$
- Intervals containing [a, b]: $b - a \geq p(y - x)$
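Each linear constraint follows from requiring that the overlap of [x, y] with [a, b], divided by the interval length y - x, is at least p. The sketch below, with hypothetical function names and sample points, compares the linear 2D tests against the direct probability computation.

```python
def qualifies_2d(x, y, a, b, p):
    """Test the point (x, y) = (Li, Ri) against Q = ([a, b], p) with linear constraints."""
    if y < a or x > b:
        return False                      # no overlap with [a, b]
    if a <= x and y <= b:
        return True                       # interval inside [a, b]: probability 1
    if x < a and y > b:
        return b - a >= p * (y - x)       # interval contains both a and b
    if x < a:
        return y * (1 - p) + x * p >= a   # interval contains a only
    return x * (1 - p) + y * p <= b       # interval contains b only


def qualifies_direct(x, y, a, b, p):
    """Same test via the overlap fraction of a uniform pdf."""
    overlap = max(0.0, min(b, y) - max(a, x))
    return overlap / (y - x) >= p


points = [(0, 20), (8, 20), (12, 18), (25, 40), (28, 30.5), (5, 40), (9, 31), (40, 50)]
for x, y in points:
    print((x, y), qualifies_2d(x, y, 10, 30, 0.75), qualifies_direct(x, y, 10, 30, 0.75))
```

Because every constraint is linear in (x, y), the qualifying region in the 2D view is bounded by straight lines, which is what allows an R-tree range search to be used.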
33Outline
- Introduction
- Related Work
- Uncertain Data and Probabilistic Queries
- Indexing for Probabilistic Range Queries
- Probabilistic Join Processing
- Experimental Results
- Prototype
34Table Join
- Join is an important database problem
- Join operators: =, ≠, >, <
35Join over Uncertainty
- How do we define comparison operators for
uncertain data?
36 Semantics of Comparison Operators
- We studied the meaning of =, ≠, >, < over uncertain data
- Imprecision about two values satisfying a comparison is expressed by a probability value
37Equality Join
- In a continuous domain, two real values are exactly equal with zero probability
- Resolution c: a is equal to b if they are within c of each other
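To make the resolution-based equality concrete, the sketch below numerically integrates P(|A - B| ≤ c) for two independent values with uniform uncertainty pdfs. Independence, the uniform pdfs, and the function name are assumptions made for illustration.

```python
def prob_equal_within(a_iv, b_iv, c, steps=20_000):
    """P(|A - B| <= c) for independent uniform A on a_iv and B on b_iv."""
    (al, ah), (bl, bh) = a_iv, b_iv
    width = (ah - al) / steps
    total = 0.0
    for k in range(steps):
        t = al + (k + 0.5) * width  # midpoint rule over A's interval
        # Length of B's interval that lies within [t - c, t + c].
        overlap = max(0.0, min(bh, t + c) - max(bl, t - c))
        total += overlap / (bh - bl) * width / (ah - al)
    return total


print(prob_equal_within((0.0, 10.0), (0.0, 10.0), c=1.0))
```

With c = 0 the overlap length is always zero, which matches the observation that exact equality has probability zero in a continuous domain.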
38Efficient Join Processing
- Computing joins is costly
- Probability threshold p can help
- 3 pruning techniques: between 2 items, 2 pages, and 2 indices
- Example: equality join with resolution c
39Item-Level Pruning
- Assume cumulative distribution function $F_i(x)$
- Let $l_{a,b,c} = \max(L_a - c, L_b - c)$
- Let $u_{a,b,c} = \min(R_a + c, R_b + c)$
- $P(a =_c b)$ is at most $\min\big(F_a(u_{a,b,c}) - F_a(l_{a,b,c}),\; F_b(u_{a,b,c}) - F_b(l_{a,b,c})\big)$ (*)
- Equation (*) is easy to compute
- If Equation (*) < p, then a and b can be pruned
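A minimal sketch of this item-level test, assuming uniform pdfs so that the cumulative distribution function has a closed form. The function names and the example intervals are hypothetical.

```python
def uniform_cdf(lo, hi, x):
    """CDF of a uniform distribution on [lo, hi], clamped to [0, 1]."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def equality_upper_bound(a_iv, b_iv, c):
    """Upper bound on P(a =_c b) from the bounds l and u defined above."""
    (la, ra), (lb, rb) = a_iv, b_iv
    l = max(la - c, lb - c)
    u = min(ra + c, rb + c)
    if u <= l:
        return 0.0  # the two intervals are more than c apart
    mass_a = uniform_cdf(la, ra, u) - uniform_cdf(la, ra, l)
    mass_b = uniform_cdf(lb, rb, u) - uniform_cdf(lb, rb, l)
    return min(mass_a, mass_b)


bound = equality_upper_bound((0.0, 10.0), (8.0, 20.0), c=1.0)
p = 0.5
print(bound, "prune" if bound < p else "keep")
```

The bound needs only four CDF evaluations, so a pair can be discarded before the exact (and more expensive) join probability is computed.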
40Page-Level Pruning
- Goal: prune pages R and S without examining individual items
- Solution: place x-bounds on R and S, and perform 4 tests with x-bounds
41Page-Level Pruning
[Figure: Page R with R.left-x-bound and R.right-x-bound]
42BNLJ and U-BNLJ
- BNLJ
- Block-Nested-Loop Join
- Joins 2 lists of unordered pages
- U-BNLJ
- Place x-bounds on every page
- Use page-level pruning techniques
43Index-Level Pruning
- Extension of page-level pruning
- Construct PTI on the relation
- Perform page-level pruning on the page storing
the node (consists of x-bounds for children nodes)
44INLJ and U-INLJ
- INLJ
- Index-Nested-Loop Join
- Construct an R-tree or interval index for the inner relation
- U-INLJ
- Construct a PTI for the inner relation
45Outline
- Introduction
- Related Work
- Uncertain Data and Probabilistic Queries
- Indexing for Probabilistic Range Queries
- Probabilistic Join Processing
- Experimental Results
- Prototype
46Experiment 1: Threshold Range Query
- Compare number of I/Os between
- 1D R-tree on intervals only
- PTI (1D R-tree with probability thresholds)
- 2D variance-based clustering (called Extensive)
47Simulation Model
- 100K uncertain data items, with length uniformly distributed in [0, 10000] and uniform uncertainty pdf
- 10K PTRQs with length of [a, b] normally distributed and p ∈ [0.1, 1]
- Each PTI node contains five x-bounds, where x ∈ {0.1, 0.3, 0.5, 0.7, 0.9}
48Scalability of Indexes
- Both PTI and Extensive outperform R-tree
- Answering PTRQ with R-tree requires more computation
- Extensive needs about 50% fewer I/Os than PTI
49Query Probability Threshold
- R-tree does not benefit from the increasing value of p
- When p is 0.5, Extensive is 4 times better than PTI
50Experiment 2: Threshold Join
- 2 tables of uncertain data are generated
- Compare interval joins and x-bound-enhanced equality joins
- BNLJ vs. U-BNLJ
- INLJ vs. U-INLJ
- Measure the number of candidate pairs
51BNLJ vs. U-BNLJ
52INLJ vs. U-INLJ
53Effect of Resolution c
54Outline
- Introduction
- Related Work
- Uncertain Data and Probabilistic Queries
- Indexing for Probabilistic Range Queries
- Probabilistic Join Processing
- Experimental Results
- Prototype
55U-DBMS Prototype
- A system for handling uncertain data
- Meta-queries for specifying data uncertainty (e.g., uncertainty interval, type of uncertainty pdf, ...)
- Extension of SQL operators to support different probabilistic query classes
- Measurement of probabilistic answer quality
- Allows easy addition of new uncertain data types (e.g., uncertain pdf) and query operators
56Architecture of U-DBMS
- UNCERTAIN class: data structures (e.g., histogram) and access methods (e.g., find variance of uncertainty)
- New data types and operators do not interfere with the original DBMS
- Implemented with PL/SQL and external C code
57Example Queries
58Summary
- Data staleness in sensor and mobile databases can render query answers incorrect
- We propose the uncertainty interval as a trade-off between data imprecision and update frequency
- We perform an extensive study of uncertainty management: classification, quality and efficient evaluation of probabilistic queries, indexing, and joins
- Prototype implementation using PostgreSQL
59Related Publications
- [TKDE04] R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Querying imprecise data in moving object environments. In IEEE TKDE, 2004.
- [VLDB04c] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter. Efficient indexing methods for probabilistic threshold queries over uncertain data. In VLDB, 2004.
- [SIGMOD03] R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In ACM SIGMOD, 2003.
- [ICDE03] R. Cheng, S. Prabhakar, and D. V. Kalashnikov. Querying imprecise data in moving object environments. In IEEE ICDE, 2003.
- [HCI04] R. Cheng and S. Prabhakar. Using Uncertainty to Provide Privacy-Preserving and High-Quality Location-Based Services. In Workshop on Location Systems Privacy and Control, Mobile HCI04.
- [VSSN04] K.Y. Lam, R. Cheng, B. Liang and J. Chau. Sensor Node Selection for Execution of Continuous Probabilistic Threshold Queries in Wireless Sensor Networks. In VSSN, ACM Multimedia, 2004.
60References
- [FOCS96] L. Arge and J. S. Vitter. On dynamic interval management in external memory (extended abstract). In FOCS, pp. 560-569, 1996.
- [TKDE92] D. Barbara, H. Garcia-Molina and D. Porter. The management of probabilistic data. IEEE TKDE, 4(5):487-502, 1992.
- [NAS02] J. Brody, B. Williams, B. Wold, and S. Quake. Significance and statistical errors in the analysis of DNA microarray data. Proc. Natl. Acad. Sci. USA, 99(20), 2002.
- [CH89] C. Chatfield. The Analysis of Time Series: An Introduction. Chapman and Hall, 1989.
- [VLDB04a] N. Dalvi and D. Suciu. Efficient Query Evaluation on Probabilistic Databases. In VLDB, 2004.
- [VLDB04b] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein and W. Hong. Model-Driven Data Acquisition in Sensor Networks. In VLDB, 2004.
- [CIDR05a] A. Deshpande, C. Guestrin and S. Madden. Using Probabilistic Models for Data Management in Acquisitional Environments. In CIDR, 2005.
- [ICDE03] E. Hung, L. Getoor and V. S. Subrahmanian. PXML: A Probabilistic Semistructured Data Model and Algebra. In ICDE, 2003.
61References
- [JCSS96] P. C. Kanellakis, S. Ramaswamy, D. Vengroff, and J. S. Vitter. Indexing for data models with constraints and classes. J. Comp. Syst. Sci., 52(3):589-612, 1996.
- [ADI00] Y. Manolopoulos, Y. Theodoridis, and V. J. Tsotras. Chapter 4: Access methods for intervals. In Advanced Database Indexing, Kluwer, 2000.
- [VLDB02] A. Nierman and H. V. Jagadish. ProTDB: Probabilistic Data in XML. In VLDB, 2002.
- [DPD99] O. Wolfson, P. Sistla, S. Chamberlain, and Y. Yesha. Updating and querying databases that track mobile units. Distributed and Parallel Databases, 7(3), 1999.
- [SIGMOD02] C. Olston and J. Widom. Best-effort cache synchronization with source cooperation. In Proc. of the ACM SIGMOD, 2002.
- [ISSD99] D. Pfoser and C. S. Jensen. Capturing the Uncertainty of Moving-Object Representations. In Proc. of the Sixth International Symposium on Spatial Databases, Hong Kong, July 20-23, 1999, pp. 111-132.
- [CIDR05b] J. Widom. Trio: A system for integrated management of data, accuracy and lineage. In CIDR, 2005.
62Thank You
- Imprecise data is prevalent in an increasing number of applications. I believe uncertainty management is an emerging and important area.
"Life is Uncertain... Eat Dessert First!" - S. Gordon and H. Brecher
63How to define uncertainty pdf?
- The form of the uncertainty pdf depends on the application, e.g., a Gaussian distribution models measurement error
- If no information about the pdf is known, a simple way is to assume a uniform pdf: a pessimistic estimation
- Can also use more sophisticated techniques, based on time-series analysis of past data, for pdf derivation [CH89]
64Classical Decomposition
- For a discrete time series, let $X_t$ be a random variable at time t
- $X_t = m_t + s_t + Y_t$
- $m_t$: trend, a slowly-moving function (moving-average filter, exponential smoothing, curve fitting/regression)
- $s_t$: seasonal component, a periodic function
- $Y_t$: noise component
- Example: $m_t = 2t + 1$, $s_t = \sin(t)$, $Y_t \sim N(0, 1)$
- $\mathrm{pdf}(100) = N(201 + \sin(100), 1)$
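The example can be reproduced with a few lines: with trend 2t + 1, seasonal component sin(t), and standard normal noise, the value at time t is Gaussian with mean 2t + 1 + sin(t) and unit variance. The function name is hypothetical.

```python
import math


def predicted_pdf(t):
    """Return (mean, variance) of the Gaussian pdf of X_t for the example model."""
    trend = 2 * t + 1        # m_t
    seasonal = math.sin(t)   # s_t
    return trend + seasonal, 1.0  # Y_t ~ N(0, 1) contributes unit variance


mean, var = predicted_pdf(100)
print(mean, var)
```

In practice the trend and seasonal components would first be estimated from past readings; only the final evaluation step is shown here.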