Managing Uncertainty in Constantly-Evolving Environments

1
Managing Uncertainty in Constantly-Evolving Environments
  • Reynold Cheng (ckcheng_at_cs.purdue.edu)
  • http://www.cs.purdue.edu/homes/ckcheng
  • Department of Computer Science
  • PhD Oral Defense
  • Major Advisor: Prof. Sunil Prabhakar
  • March 31st, 2005

2
Sensor Databases
Goal: data retrieval in a correct, efficient and scalable manner
3
Data Uncertainty
  • Due to limited network bandwidth and battery
    power, readings are sampled
  • The value of the entity being monitored (e.g.,
    temperature, location) is changing
  • Most of the time the database stores old values
  • Query results can be incorrect!

4
Answering Minimum Query with Database Readings
[Figure: recorded temperatures (x0, y0) and current temperatures (x1, y1) of sensors x and y, on a 0 to 30 °F scale]
  • Database: X
  • Correct answer: Y

5
Bounding Uncertainty with Dead-Reckoning
  • Data values cannot change drastically
  • The system negotiates a bound d with the sensor

[Figure: the sensor sends value v to the system; the system keeps (v, d) and bounds the current value by [v - d, v + d]]
  • Trade-off between data uncertainty and update
    frequency
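
A minimal sketch of the trade-off above, assuming the sensor retransmits only when its reading drifts more than d from the last value it sent. The function name `dead_reckoning_updates` and the sample readings are illustrative and do not come from the presentation.

```python
def dead_reckoning_updates(readings, d):
    """Return the (index, value) pairs a sensor would transmit.

    Assumption: an update is sent only when the current reading leaves
    the interval [v - d, v + d] around the last transmitted value v.
    """
    updates = []
    v = None  # last value known to the system
    for i, r in enumerate(readings):
        if v is None or abs(r - v) > d:
            v = r
            updates.append((i, v))
    return updates


# Hypothetical temperature trace: a larger bound d means fewer updates
# but a wider uncertainty interval at the server.
readings = [20.0, 20.4, 21.1, 22.3, 22.0, 19.5]
tight = dead_reckoning_updates(readings, d=1.0)
loose = dead_reckoning_updates(readings, d=5.0)
```

With d = 1.0 the sensor sends four of the six readings; with d = 5.0 it sends only the first one.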

6
Answering Minimum Query with Error-Bounded Readings
[Figure: recorded temperatures x0 and y0 of sensors x and y, each with a bound for the current temperature, on a 0 to 30 °F scale]
  • x certainly gives the minimum temperature reading

7
Answering Minimum Query with Error-Bounded Readings
[Figure: recorded temperatures x0 and y0 of sensors x and y, each with a bound for the current temperature and an uncertainty pdf, on a 0 to 30 °F scale]
  • (X, 0.7), (Y, 0.3)
  • Answers augmented with probabilistic guarantees
  • Measurement error also a source of uncertainty

8
Probabilistic Queries
  • We represent the imprecision in the value of the
    data as an interval with associated pdf
  • Query answers are augmented with probabilities
  • Probabilistic queries give us a correct (possibly
    less precise) answer, instead of a potentially
    incorrect answer
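
One way to see how an answer such as (X, 0.7), (Y, 0.3) could arise is to integrate over the uncertainty intervals. The sketch below assumes independent objects with uniform pdfs, which the slides do not state for this example; `min_probabilities` and the two intervals are hypothetical.

```python
def uniform_cdf(L, R, t):
    """CDF of a uniform distribution on [L, R], clamped to [0, 1]."""
    return min(1.0, max(0.0, (t - L) / (R - L)))


def min_probabilities(intervals, steps=20000):
    """Probability that each object holds the minimum value.

    Assumptions: independent objects, uniform pdf on each [L, R].
    Uses the midpoint rule on p_i = integral of f_i(t) * prod_{j != i} (1 - F_j(t)) dt.
    """
    probs = []
    for i, (Li, Ri) in enumerate(intervals):
        width = (Ri - Li) / steps
        total = 0.0
        for k in range(steps):
            t = Li + (k + 0.5) * width
            others = 1.0
            for j, (Lj, Rj) in enumerate(intervals):
                if j != i:
                    others *= 1.0 - uniform_cdf(Lj, Rj, t)
            total += others
        # f_i(t) * dt equals 1 / steps for a uniform pdf
        probs.append(total / steps)
    return probs


# Hypothetical bounds: x in [10, 20] °F, y in [15, 25] °F
p_x, p_y = min_probabilities([(10.0, 20.0), (15.0, 25.0)])
```

For these two intervals the result is roughly 0.875 for x and 0.125 for y, and the probabilities sum to 1.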

9
Related Work
  • Wolfson et al. [DPD99] and Pfoser et al. [ISSD99]
    discussed probabilistic range queries for moving
    objects
  • Olston and Widom [SIGMOD02] discussed the tradeoff
    between precision and performance of querying
    replicated data
  • Probabilistic queries [ICDE03, SIGMOD03, TKDE04,
    VLDB04c]
  • Deshpande et al. presented probabilistic
    prediction for sensor values [VLDB04b]
  • Recent proposals [CIDR05a, CIDR05b] for
    probabilistic modeling of uncertain data

10
Probabilistic Databases
  • Probabilistic database: each tuple is augmented
    with a probability value (Barbara, Garcia-Molina &
    Porter) [TKDE92]
  • Dalvi & Suciu [VLDB04a] studied efficient query
    evaluation with ranked results
  • Probabilistic databases in the semi-structured model
      • XML data (Nierman & Jagadish) [VLDB02]
      • Acyclic data structure (Hung, Getoor &
        Subrahmanian) [ICDE03]

11
My Dissertation
  • Characterization of uncertainty in sensor and
    mobile databases
  • Classification, evaluation and quality of
    probabilistic queries
  • Efficient techniques for supporting uncertainty
    management in databases

12
Outline
  • Introduction
  • Related Work
  • Uncertain Data and Probabilistic Queries
  • Indexing for Probabilistic Range Queries
  • Probabilistic Join Processing
  • Experimental Results
  • Prototype

14
Uncertainty Model
[Figure: uncertainty interval [Li, Ri] of Ti.z, with the uncertainty pdf fi(x) defined over it]
  • Ti.z: dynamic attribute (e.g., temperature,
    location)
  • Used in various application domains, e.g.,
      • location uncertainty [DPD99, ISSD99]
      • DNA microarray data error [NAS02]

15
Classification of Probabilistic Queries
  • Nature of answer
      • Value-based: returns a single value,
        e.g., average query ([l, u], pdf)
      • Entity-based: returns a set of objects,
        e.g., range query ({(Ti, pi) : pi > 0})
  • Aggregation
      • Aggregate: interplay between objects decides the
        result, e.g., nearest-neighbor query
      • Non-aggregate: whether an object satisfies a
        query is independent of others,
        e.g., range query

16
Classification of Probabilistic Queries
  • Only the probabilistic range query (entity-based
    non-aggregate class) was studied before [WS99, ISSD99]

17
Quality of Probabilistic Result
  • Probabilistic queries: notion of result "quality"
  • Motivating example: range query (is Ti.z in range
    [l, u]?)
      • regular range query: "yes" or "no"
      • probabilistic range query
  • Propose metrics for each class of probabilistic
    queries

18
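Quality for Value-Aggregate Queries
  • Query result: $[l, u]$, $\{p(x) \mid x \in [l, u]\}$
  • $U[3, 4]$ is less ambiguous than $U[1, 100]$
  • Differential entropy: $H(X) = -\int_l^u p(x) \log_2 p(x)\,dx$
      • Measures the uncertainty associated with a random
        variable X with pdf p
      • $H(X) \le \log_2(u - l)$, with equality iff
        $X \sim U[l, u]$ (most uncertain)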
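
To make the entropy-based quality metric concrete, here is a small numerical sketch. It assumes the answer pdf is available as a Python callable; `differential_entropy` and the example pdfs are illustrative names, not part of the presentation.

```python
import math


def differential_entropy(pdf, l, u, steps=100000):
    """Approximate H(X) = -integral of p(x) * log2 p(x) over [l, u] (midpoint rule)."""
    dx = (u - l) / steps
    h = 0.0
    for k in range(steps):
        p = pdf(l + (k + 0.5) * dx)
        if p > 0:  # p * log2(p) tends to 0 as p tends to 0
            h -= p * math.log2(p) * dx
    return h


def uniform_pdf(l, u):
    """Uniform pdf on [l, u], returned as a callable."""
    return lambda x: 1.0 / (u - l)


def triangular_pdf(x):
    """Symmetric triangular pdf on [0, 2] with its peak at x = 1."""
    return x if x <= 1.0 else 2.0 - x


h_narrow = differential_entropy(uniform_pdf(3.0, 4.0), 3.0, 4.0)    # U[3, 4]
h_wide = differential_entropy(uniform_pdf(1.0, 100.0), 1.0, 100.0)  # U[1, 100]
h_tri = differential_entropy(triangular_pdf, 0.0, 2.0)
```

The narrow answer U[3, 4] scores 0 bits and U[1, 100] scores log2(99) bits, so a lower value indicates a less ambiguous answer. The triangular pdf on [0, 2] stays below the uniform maximum of log2(2) = 1 bit.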

19
Outline
  • Introduction
  • Related Work
  • Uncertain Data and Probabilistic Queries
  • Indexing for Probabilistic Range Queries
  • Probabilistic Join Processing
  • Experimental Results
  • Prototype

20
Probabilistic Range Query
[Figure: recorded temperatures of sensors T1 and T2 with the uncertainty for their current temperatures, on a 0 to 30 °F scale]
  • (T1, 0.2), (T2, 0.8)

21
Probabilistic Threshold Range Query (PTRQ)
  • Users are likely to be concerned with results
    that have a high probability
  • Example: retrieve the ids of sensors whose readings
    are between 10°F and 25°F with probability ≥ 0.7
  • PTRQ: given [a, b] and p, return every Ti where
    Prob(value of Ti is inside [a, b]) ≥ p

22
Solving PTRQ with Interval Indexes
  • Use an R-tree or interval index [FOCS96, JCSS96,
    ADI00] to find intervals intersecting [a, b]
  • For each object retrieved, evaluate its
    probability of being within [a, b]
  • Return objects with probability ≥ p
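
The three steps above can be sketched as a filter-and-refine loop. This sketch assumes uniform uncertainty pdfs with R > L and replaces the interval index with a plain scan; `prob_in_range`, `ptrq`, and the sensor intervals are hypothetical.

```python
def prob_in_range(L, R, a, b):
    """Probability that a value uniform on [L, R] lies inside [a, b] (assumes R > L)."""
    overlap = min(R, b) - max(L, a)
    if overlap <= 0:
        return 0.0
    return overlap / (R - L)


def ptrq(objects, a, b, p):
    """Return (id, probability) for objects inside [a, b] with probability >= p."""
    result = []
    for oid, (L, R) in objects.items():
        # Filter step: an interval index would only return intersecting intervals.
        if R < a or L > b:
            continue
        # Refinement step: evaluate the exact probability.
        prob = prob_in_range(L, R, a, b)
        if prob >= p:
            result.append((oid, prob))
    return result


# Hypothetical uncertainty intervals (°F) for three sensors
sensors = {"T1": (5.0, 30.0), "T2": (12.0, 22.0), "T3": (24.0, 34.0)}
answer = ptrq(sensors, a=10.0, b=25.0, p=0.7)
```

All three intervals intersect [10, 25], so an interval-only index would retrieve every one of them, yet only T2 qualifies at threshold 0.7.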

23
The Problem of Current Indexes
  • Current interval indexes do not consider
    probabilities during search
  • Many irrelevant objects (probability < p) may be
    processed
  • Our approach
      • Probability Threshold Indexing (PTI): 1D interval
        R-tree with uncertainty
      • Variance-based clustering: transform intervals to
        2D points and index based on variance

24
Pruning in a 1D R-Tree
  • Some intervals in the MBR may satisfy Q
  • Need to retrieve the contents of the MBR and
    evaluate

25
x-bounds in a PTI Node
[Figure: an MBR with its left-0.2-bound and right-0.2-bound; figure labels: ≤ 0.2 and 0.8]
26
x-bounds in a PTI Node
[Figure: the left-0-bound and right-0-bound coincide with the MBR boundaries]
27
Pruning with x-bounds
[Figure: an MBR with its left-0.2-bound and right-0.2-bound]
  • An MBR is not retrieved if there exists an
    x-bound such that
      • p > x, and
      • b is on the left of the left-x-bound
  • An MBR is also not retrieved if there exists an
    x-bound such that
      • p > x, and
      • a is on the right of the right-x-bound
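
A minimal sketch of the pruning test above. The dictionary layout (x mapped to a pair of bound positions) and the name `can_prune` are assumptions made for illustration; the slides do not describe how a PTI node stores its x-bounds.

```python
def can_prune(a, b, p, x_bounds):
    """Decide whether an MBR can be skipped for query [a, b] with threshold p.

    x_bounds maps x to (left_x_bound, right_x_bound); this layout is an
    assumption of the sketch.
    """
    for x, (left, right) in x_bounds.items():
        if p > x and (b < left or a > right):
            return True
    return False


# Hypothetical node: the 0-bounds are the MBR itself, the 0.2-bounds lie inside it.
node_bounds = {0.0: (0.0, 100.0), 0.2: (20.0, 80.0)}

pruned_high = can_prune(5.0, 15.0, 0.3, node_bounds)    # b < left-0.2-bound and p > 0.2
pruned_low = can_prune(5.0, 15.0, 0.1, node_bounds)     # p too small for the 0.2-bound
pruned_right = can_prune(85.0, 95.0, 0.5, node_bounds)  # a > right-0.2-bound
```

A plain 1D R-tree would retrieve the node in all three cases, since each query overlaps the MBR [0, 100].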

28
Implementation of PTI
29
Drawback of PTI
  • Extra overhead in storing x-bounds
  • Small intervals near edges limit gains

[Figure: an MBR with its left-0.2-bound and right-0.2-bound]
30
Clustering 2D points
[Figure: each interval [Li, Ri] plotted as a 2D point (Li, Ri), with axes x = Li and y = Ri and the line x = y; large intervals form one cluster and smaller intervals form another]
  • When 2D points are clustered, intervals of
    different variances are separated
  • Points are clustered based on means and variances
    (variance-based clustering)

31
Answering PTRQ with 2D R-Tree
  • Construct a 2D R-tree over uncertain data by
    indexing (mean_i, variance_i)
  • Query the 2D R-tree
  • For a uniform pdf, a PTRQ can be converted to a
    2D range query
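
For the uniform case, the indexed point follows directly from the interval endpoints. The sketch below uses the standard mean and variance of a uniform distribution; the function name and sample intervals are illustrative.

```python
def uniform_interval_to_point(L, R):
    """Map an uncertainty interval [L, R] with a uniform pdf to (mean, variance)."""
    mean = (L + R) / 2.0
    variance = (R - L) ** 2 / 12.0
    return mean, variance


# Hypothetical intervals; the resulting points are what a 2D R-tree would index.
points = [uniform_interval_to_point(L, R) for L, R in [(0, 12), (3, 4), (10, 40)]]
```

Intervals of very different lengths land far apart on the variance axis, which is what lets clustering separate them.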

32
Querying Uniform pdf
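[Figure: 1D view (uniform-pdf interval [Li, Ri] against the query range [a, b]) and 2D view (axes x = Li and y = Ri, line x = y) of a query Q with p = 0.75]
An interval with x = Li and y = Ri satisfies Q when:
  • Intervals containing a: $y(1-p) + xp \ge a$
  • Intervals in $[a, b]$: $a < x < y < b$
  • Intervals containing b: $x(1-p) + yp \le b$
  • Intervals containing $[a, b]$: $b - a \ge p(y - x)$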
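
The four linear conditions follow from requiring overlap / (y - x) ≥ p for a uniform pdf. The sketch below checks them against the direct probability for a query with p = 0.75; the function names and test intervals are assumptions, and the threshold is taken to be positive.

```python
def direct_probability(x, y, a, b):
    """Probability that a value uniform on [x, y] lies inside [a, b]."""
    overlap = min(y, b) - max(x, a)
    return max(0.0, overlap) / (y - x)


def satisfies_by_region(x, y, a, b, p):
    """Evaluate a PTRQ (p > 0) using only linear tests on the 2D point (x, y)."""
    if y < a or x > b:                 # no overlap with the query range
        return False
    if a <= x and y <= b:              # interval inside [a, b]: probability 1
        return True
    if x <= a and b <= y:              # interval contains [a, b]
        return b - a >= p * (y - x)
    if x < a:                          # interval contains a only
        return y * (1 - p) + x * p >= a
    return x * (1 - p) + y * p <= b    # interval contains b only


a, b, p = 10.0, 20.0, 0.75
cases = [(8.0, 18.0), (4.0, 16.0), (12.0, 18.0), (14.0, 24.0),
         (13.0, 21.0), (9.0, 21.0), (0.0, 40.0), (25.0, 30.0)]
by_region = [satisfies_by_region(x, y, a, b, p) for x, y in cases]
by_direct = [direct_probability(x, y, a, b) >= p for x, y in cases]
```

Because every test is linear in x and y, the qualifying points form a polygonal region that a 2D range search can cover.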
33
Outline
  • Introduction
  • Related Work
  • Uncertain Data and Probabilistic Queries
  • Indexing for Probabilistic Range Queries
  • Probabilistic Join Processing
  • Experimental Results
  • Prototype

34
Table Join
  • Join is an important database problem
  • Join operators: =, ≠, >, <

35
Join over Uncertainty
  • How do we define comparison operators for
    uncertain data?

36
Semantics of Comparison Operators
  • We studied the meaning of =, ≠, >, <
  • Imprecision about two values satisfying a
    comparison is expressed by a probability value

37
Equality Join
  • In a continuous domain, two real values are equal at
    a point with zero probability
  • Resolution c: a is equal to b if they are within
    c of each other
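
As an illustration of equality with resolution c, the probability that two uncertain values fall within c of each other can be computed numerically. The sketch assumes independent values with uniform pdfs, which the slide does not specify; `prob_equal_within` is a hypothetical name.

```python
def uniform_cdf(L, R, x):
    """CDF of a uniform distribution on [L, R], clamped to [0, 1]."""
    return min(1.0, max(0.0, (x - L) / (R - L)))


def prob_equal_within(a, b, c, steps=10000):
    """P(|A - B| <= c) for independent A ~ U[La, Ra] and B ~ U[Lb, Rb].

    Midpoint rule on the integral of f_A(x) * (F_B(x + c) - F_B(x - c)) dx.
    """
    (La, Ra), (Lb, Rb) = a, b
    dx = (Ra - La) / steps
    total = 0.0
    for k in range(steps):
        x = La + (k + 0.5) * dx
        total += uniform_cdf(Lb, Rb, x + c) - uniform_cdf(Lb, Rb, x - c)
    # f_A(x) * dx equals 1 / steps for a uniform pdf
    return total / steps


same = ((0.0, 1.0), (0.0, 1.0))
p_zero = prob_equal_within(*same, c=0.0)  # exact equality has probability 0
p_half = prob_equal_within(*same, c=0.5)
p_full = prob_equal_within(*same, c=1.0)
```

With c = 0 the probability is 0, matching the first bullet, and a larger resolution raises it (0.75 at c = 0.5 and 1 at c = 1 for two values uniform on [0, 1]).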

38
Efficient Join Processing
  • Computing joins is costly
  • Probability threshold p can help
  • 3 pruning techniques, between
      • 2 items,
      • 2 pages, and
      • 2 indices
  • Example: equality join with resolution c

39
Item-Level Pruning
  • Assume cumulative distribution function $F_i(x)$
  • Let $l_{a,b,c}$ be $\max(L_a - c, L_b - c)$
  • Let $u_{a,b,c}$ be $\min(R_a + c, R_b + c)$
  • $P(a =_c b)$ is at most
    $\min(F_a(u_{a,b,c}) - F_a(l_{a,b,c}),\ F_b(u_{a,b,c}) - F_b(l_{a,b,c}))$ (*)
  • Equation (*) is easy to compute
  • If Equation (*) < p, then a and b can be pruned
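
The bound above needs only four CDF evaluations per pair. The sketch below assumes uniform pdfs so that the CDF has a closed form; `equality_upper_bound` and the sample intervals are illustrative.

```python
def uniform_cdf(L, R):
    """Return the CDF of a uniform distribution on [L, R] as a callable."""
    def F(x):
        if x <= L:
            return 0.0
        if x >= R:
            return 1.0
        return (x - L) / (R - L)
    return F


def equality_upper_bound(a, b, c):
    """Upper bound on P(a =_c b) for uncertainty intervals a and b."""
    (La, Ra), (Lb, Rb) = a, b
    l = max(La - c, Lb - c)
    u = min(Ra + c, Rb + c)
    if u <= l:
        return 0.0  # the values can never be within c of each other
    Fa, Fb = uniform_cdf(La, Ra), uniform_cdf(Lb, Rb)
    return min(Fa(u) - Fa(l), Fb(u) - Fb(l))


def can_prune_pair(a, b, c, p):
    """Item-level pruning: skip the pair when the bound is below the threshold."""
    return equality_upper_bound(a, b, c) < p


bound = equality_upper_bound((0.0, 10.0), (9.0, 20.0), c=1.0)
prune = can_prune_pair((0.0, 10.0), (9.0, 20.0), c=1.0, p=0.5)
```

Here l = 8 and u = 11, so the bound is min(0.2, 2/11) = 2/11 and the pair is pruned for p = 0.5 without computing the exact join probability.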

40
Page-Level Pruning
  • Goal: prune pages R and S without examining
    individual items
  • Solution: place x-bounds on R and S, and perform
    4 tests with x-bounds

41
Page-Level Pruning
  • Goal: prune pages R and S without examining
    individual items
  • Solution: place x-bounds on R and S, and perform
    4 tests with x-bounds

[Figure: page R with its R.left-x-bound and R.right-x-bound]
42
BNLJ and U-BNLJ
  • BNLJ
      • Block-Nested-Loop Join
      • Joins 2 lists of unordered pages
  • U-BNLJ
      • Place x-bounds on every page
      • Use page-level pruning techniques

43
Index-Level Pruning
  • Extension of page-level pruning
  • Construct PTI on the relation
  • Perform page-level pruning on the page storing
    the node (consists of x-bounds for children nodes)

44
INLJ and U-INLJ
  • INLJ
      • Index-Nested-Loop Join
      • Construct an R-tree or interval index for the inner
        relation
  • U-INLJ
      • Construct a PTI for the inner relation

45
Outline
  • Introduction
  • Related Work
  • Uncertain Data and Probabilistic Queries
  • Indexing for Probabilistic Range Queries
  • Probabilistic Join Processing
  • Experimental Results
  • Prototype

46
Experiment 1: Threshold Range Query
  • Compare the number of I/Os between
      • 1D R-tree on intervals only
      • PTI (1D R-tree with probability thresholds)
      • 2D variance-based clustering (called Extensive)

47
Simulation Model
  • 100K uncertain data items, with length uniformly
    distributed in [0, 10000] and uniform uncertainty
    pdf
  • 10K PTRQs with length of [a, b] normally
    distributed and p ∈ [0.1, 1]
  • Each PTI node contains five x-bounds, where
    x ∈ {0.1, 0.3, 0.5, 0.7, 0.9}

48
Scalability of Indexes
  • Both PTI and Extensive outperform R-tree
  • Answering PTRQ with R-tree requires more
    computation
  • Extensive needs about 50% fewer I/Os than PTI

49
Query Probability Threshold
  • R-tree does not benefit from the increasing value
    of p
  • When p is 0.5, Extensive is 4 times better than
    PTI

50
Experiment 2: Threshold Join
  • 2 tables of uncertain data are generated
  • Compare interval joins and x-bound-enhanced
    equality joins
      • BNLJ vs. U-BNLJ
      • INLJ vs. U-INLJ
  • Measure the number of candidate pairs

51
BNLJ vs. U-BNLJ
52
INLJ vs. U-INLJ
53
Effect of Resolution c
54
Outline
  • Introduction
  • Related Work
  • Uncertain Data and Probabilistic Queries
  • Indexing for Probabilistic Range Queries
  • Probabilistic Join Processing
  • Experimental Results
  • Prototype

55
U-DBMS Prototype
  • A system for handling uncertain data
  • Meta-queries for specifying data uncertainty
    (e.g., uncertainty interval, type of uncertainty
    pdf)
  • Extension of SQL operators to support different
    probabilistic query classes
  • Measurement of probabilistic answer quality
  • Allows easy addition of new uncertain data types
    (e.g., uncertain pdf) and query operators

56
Architecture of U-DBMS
  • UNCERTAIN class: data structures (e.g.,
    histogram) and access methods (e.g., find
    variance of uncertainty)
  • New data types and operators do not interfere
    with the original DBMS
  • Implemented with PL/SQL and external C code

57
Example Queries
58
Summary
  • Data staleness in sensor and mobile databases can
    render incorrect query answers
  • We propose uncertainty interval as a trade-off
    between data imprecision and update frequency
  • We perform an extensive study of uncertainty
    management: classification, quality and efficient
    evaluation of probabilistic queries, indexing,
    joins
  • Prototype implementation using PostgreSQL

59
Related Publications
  • [TKDE04] R. Cheng, D. V. Kalashnikov, and S.
    Prabhakar. Querying imprecise data in moving
    object environments. In IEEE TKDE, 2004.
  • [VLDB04c] R. Cheng, Y. Xia, S. Prabhakar, R.
    Shah, and J. S. Vitter. Efficient indexing
    methods for probabilistic threshold queries over
    uncertain data. In VLDB 2004.
  • [SIGMOD03] R. Cheng, D. Kalashnikov, and S.
    Prabhakar. Evaluating probabilistic queries over
    imprecise data. In ACM SIGMOD 2003.
  • [ICDE03] R. Cheng, S. Prabhakar, and D. V.
    Kalashnikov. Querying imprecise data in moving
    object environments. In IEEE ICDE 2003.
  • [HCI04] R. Cheng and S. Prabhakar. Using
    Uncertainty to Provide Privacy-Preserving and
    High-Quality Location-Based Services. In Workshop
    on Location Systems Privacy and Control, Mobile
    HCI04.
  • [VSSN04] K.Y. Lam, R. Cheng, B. Liang and J.
    Chau. Sensor Node Selection for Execution of
    Continuous Probabilistic Threshold Queries in
    Wireless Sensor Networks. In VSSN, ACM Multimedia
    2004.

60
References
  • [FOCS96] L. Arge and J. S. Vitter. On dynamic
    interval management in external memory (extended
    abstract). In FOCS, p. 560-569, 1996.
  • [TKDE92] D. Barbara, H. Garcia-Molina and D.
    Porter. The management of probabilistic data.
    IEEE TKDE, 4(5):487-502, 1992.
  • [NAS02] J. Brody, B. Williams, B. Wold, and S.
    Quake. Significance and statistical errors in
    the analysis of DNA microarray data. Proc. Natl.
    Acad. Sci. USA, 2002, 199(20).
  • [CH89] C. Chatfield. The analysis of time series:
    an introduction. Chapman and Hall, 1989.
  • [VLDB04a] N. Dalvi and D. Suciu. Efficient Query
    Evaluation on Probabilistic Databases. In VLDB
    2004.
  • [VLDB04b] A. Deshpande, C. Guestrin, S. Madden,
    J. Hellerstein and W. Hong. Model-Driven Data
    Acquisition in Sensor Networks. In VLDB, 2004.
  • [CIDR05a] A. Deshpande, C. Guestrin and S.
    Madden. Using Probabilistic Models for Data
    Management in Acquisitional Environments. In
    CIDR, 2005.
  • [ICDE03] E. Hung, L. Getoor and V. S.
    Subrahmanian. PXML: A Probabilistic
    Semistructured Data Model and Algebra. In ICDE
    2003.

61
References
  • [JCSS96] P. C. Kanellakis, S. Ramaswamy, D.
    Vengroff, and J. S. Vitter. Indexing for data
    models with constraints and classes. In J. Comp.
    Syst. Sci, 52(3):589-612, 1996.
  • [ADI00] Y. Manolopoulos, Y. Theodoridis, and V.
    J. Tsotras. Chapter 4: Access methods for
    intervals. In Advanced Database Indexing, Kluwer,
    2000.
  • [VLDB02] A. Nierman and H. V. Jagadish. ProTDB:
    Probabilistic Data in XML. In VLDB 2002.
  • [DPD99] O. Wolfson, P. Sistla, S. Chamberlain,
    and Y. Yesha. Updating and querying databases
    that track mobile units. Distributed and Parallel
    Databases, 7(3), 1999.
  • [SIGMOD02] C. Olston and J. Widom. Best-effort
    cache synchronization with source cooperation.
    In Proc. of the ACM SIGMOD 2002.
  • [ISSD99] D. Pfoser and C. S. Jensen. Capturing
    the Uncertainty of Moving-Object Representations.
    In Proc. of the Sixth International Symposium on
    Spatial Databases, Hong Kong, July 20-23, 1999,
    pp. 111-132.
  • [CIDR05b] J. Widom. Trio: A system for integrated
    management of data, accuracy and lineage. In
    CIDR, 2005.

62
Thank You
  • Imprecise data is prevalent in an increasing number
    of applications. I believe uncertainty management
    is an emerging and important area.

Life is Uncertain... Eat Dessert First! - S.
Gordon and H. Brecher
63
How to define uncertainty pdf?
  • The form of the uncertainty pdf depends on the
    application; e.g., a Gaussian distribution models
    measurement error
  • If no information about the pdf is known, a simple
    way is to assume a uniform pdf (a pessimistic
    estimate)
  • Can also use more sophisticated techniques, based
    on time-series analysis of past data, for pdf
    derivation [CH89]

64
Classical Decomposition
  • For a discrete time series, let $X_t$ be a random
    variable at time t
  • $X_t = m_t + s_t + Y_t$
      • $m_t$: trend, a slowly-moving function
        (moving-average filter, exponential smoothing,
        curve fitting/regression)
      • $s_t$: seasonal component, a periodic function
      • $Y_t$: noise component
  • Example: $m_t = 2t + 1$, $s_t = \sin(t)$, $Y_t \sim N(0, 1)$
      • pdf(100) = $N(201 + \sin(100), 1)$
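
The example can be reproduced in a few lines. The sketch hard-codes the trend and seasonal components from the slide and returns the parameters of the resulting normal pdf; the function name `predicted_distribution` is illustrative.

```python
import math


def predicted_distribution(t):
    """Return (mean, variance) of X_t = m_t + s_t + Y_t for the slide's example.

    m_t = 2t + 1 (trend), s_t = sin(t) (seasonal), Y_t ~ N(0, 1) (noise).
    """
    trend = 2 * t + 1
    seasonal = math.sin(t)  # t is taken in radians
    noise_variance = 1.0
    return trend + seasonal, noise_variance


mean_100, var_100 = predicted_distribution(100)
```

At t = 100 the mean is 201 + sin(100) and the variance is 1, matching pdf(100) above.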