Probabilistic Data - PowerPoint PPT Presentation

Slides: 52
Provided by: dbCsBe
Learn more at: https://dsf.berkeley.edu

Transcript and Presenter's Notes

Title: Probabilistic Data


1
Probabilistic Data
  • Dan Suciu
  • University of Washington

joint with Nilesh Dalvi and Gerome Miklau
2
Probabilistic Data
  • Deterministic data
  • An item is either in the database or not
  • A tuple is either an answer to the query or not
  • Today ALL data is deterministic: relational, XML, ...
  • Probabilistic data
  • "An item is in the database" is a probabilistic event
  • "A tuple is an answer to the query" is a probabilistic event
  • May be applied to all data: relational, XML, ...

3
Long History
  • Cavallo & Pittarelli, 1987
  • Barbara, Garcia-Molina, Porter, 1992
  • Lakshmanan, Leone, Ross, Subrahmanian, 1997
  • Fuhr & Roelleke, 1997
  • Dalvi & Suciu, 2004
  • Widom, 2005

4
Why Now ?
  • Application pull: the need to manage imprecisions
    in data
  • Technology push: convergence of several
    techniques for probabilistic query processing

5
Application Pull
  • Imprecisions in data: non-matching data values,
    imprecise queries, inconsistent data, misaligned
    schemas, information extracted from text, etc.

6
Technology Push
This talk
Probabilistic data is fundamentally complex
  • Contributing technologies
  • Top-k algorithms
  • Efficient Monte Carlo Simulations
  • Derived from 0/1 Laws for sparse graphs
  • Efficient query processing on prob. data

This talk
See also: SIGMOD'05 Tutorial on Prob. DB
7
Outline
  • Applications
  • A probabilistic data model
  • Representation with explicit tuples
  • Representation with implicit tuples
  • Summary, Conclusions

8
App. 1: Similarity Predicates
SELECT DISTINCT F.title, F.year
FROM Actor A, Film F, Casts C
WHERE C.filmID = F.filmID
  and C.actorID = A.actorID
  and A.name ≈ 'Copolla'
  and F.year ≈ 1995
  and F.title ≈ 'rain man'
[Motro 88, Agrawal 03, Dalvi 04]
9
Imprecision in the query answer
[Dalvi, Suciu 04]
10
Probabilistic db ⇒ semantics for complex queries
SELECT *
FROM Actor A
WHERE A.name ≈ 'Kevin'
  and 1995 = (SELECT MIN(F.year)
              FROM Film F, Casts C
              WHERE C.filmid = F.filmid
                and C.actorid = A.actorid
                and F.rating ≈ 'high')
11
App. 2: Inconsistent Data
Q: Find products bought by both John and Sue
Repair semantics
A: No certain answers
Name → City is violated
Imprecision in the data
[Bertossi, Chomicki 03]
12
Probabilistic db ⇒ lower precision, higher recall
Q: Find products bought by both John and Sue
A: Gizmo (prob. 1/3), Camera (prob. 1/6)
Name → City
13
(No Transcript)
14
Probabilistic db ⇒ use statistics to increase recall
Data Source V1
Data Source V2
Assume avg(Buildings/Dept) = 5
A: Larry Big with prob. 0.2
[Dalvi, Suciu 05]
15
App. 4: Information Leakage
  • Secret S, public view V
  • Privacy preserving data mining
  • Publishing views
  • k-anonymity

Imprecision here is GOOD: it protects privacy
[Sweeney, Samarati; Evfimievski 03; Miklau 04; Miklau 05]
16
Probabilistic db ⇒ already being used
Pr[S] = a priori probability of secret S
Pr[S | V] = a posteriori probability of S
17
Summary of Applications
  • Many other applications: record linkage, quality
    in data integration, sensor data
  • More details in the SIGMOD'05 Tutorial
  • Imprecisions in the database or in query answers
  • Need complex probabilistic models:
  • tuple correlations
  • statistics

18
Outline
  • Applications
  • A probabilistic data model
  • Representation with explicit tuples
  • Representation with implicit tuples
  • Summary, Conclusions

19
Probabilistic Database
  • $I^p$ = probability distribution on all instances

$\Pr : \mathit{Inst} \to [0,1]$
$\sum_{I} \Pr[I] = 1$
A possible world is an instance $I$ s.t. $\Pr[I] > 0$
20
(No Transcript)
21
Query Semantics
Given a query $Q$ and probabilistic data $I^p$:
$Q(I^p)$ = a probability function on tuples
$\Pr[t \in Q] = \sum_{I \,:\, t \in Q(I)} \Pr[I]$
EVERY query has a semantics. Return the top-k answers to the user.
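The semantics above can be sketched directly as a sum over possible worlds. This is a minimal illustration, not an efficient evaluator: the three worlds and their probabilities below are hypothetical, and the names `query_probabilities` and `bought_by_both` are introduced here only for the example.

```python
from collections import defaultdict
from fractions import Fraction


def query_probabilities(worlds, query):
    """Pr[t in Q] = sum of Pr[I] over all worlds I such that t is in Q(I)."""
    probs = defaultdict(Fraction)  # Fraction() is 0
    for instance, pr in worlds:
        for t in query(instance):  # query returns a set, so t counts once per world
            probs[t] += pr
    return dict(probs)


def bought_by_both(sales):
    """Products bought by both John and Sue in one deterministic instance."""
    john = {product for name, product in sales if name == "John"}
    sue = {product for name, product in sales if name == "Sue"}
    return john & sue


# Hypothetical possible worlds over Sales(name, product); probabilities sum to 1.
worlds = [
    (frozenset({("John", "Gizmo"), ("Sue", "Gizmo")}), Fraction(1, 2)),
    (frozenset({("John", "Gizmo"), ("Sue", "Camera")}), Fraction(1, 4)),
    (frozenset({("John", "Camera"), ("Sue", "Camera"), ("Sue", "Gizmo")}), Fraction(1, 4)),
]

answers = query_probabilities(worlds, bought_by_both)
# Rank answers by probability, highest first (the "top-k" view of the result).
top_k = sorted(answers.items(), key=lambda kv: kv[1], reverse=True)
```

Enumerating worlds is exponential in general, which is why the rest of the talk looks for representations that avoid it.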
22
SELECT DISTINCT x.product
FROM Sales x, Sales y
WHERE x.name = 'John' and y.name = 'Sue'
  and x.product = y.product
Pr[I1] = 1/2
Pr[I2] = 1/12
Pr[I3] = 1/3
Pr[I4] = 1/12
23
Summary on Data Semantics
  • Possible worlds semantics: very general, allows
    arbitrary tuple correlations
  • Query semantics: EVERY query has a simple, well
    defined semantics

Problems: How do we represent a probabilistic DB?
How do we evaluate queries?
24
Outline
  • Applications
  • A probabilistic data model
  • Representation with explicit tuples
  • Representation with implicit tuples
  • Summary, Conclusions

25
(No Transcript)
26
(No Transcript)
27
App. 1: Similarity Predicates
Film
Cast
Step 1: compute prob
SELECT DISTINCT year
FROM Film, Cast
WHERE Film.filmID = Cast.filmID
  and title ≈ 'rain' and actorName ≈ 'Kevin'
28
App. 1: Similarity Predicates
Film
Cast
Step 1: compute prob
Step 2: execute query
SELECT DISTINCT year
FROM Film, Cast
WHERE Film.filmID = Cast.filmID
  and title ≈ 'rain' and actorName ≈ 'Kevin'
29
Disjoint/Independent Tuples
For any t1, t2, one of the following holds:
  • t1 and t2 are disjoint
  • t1 and t2 are independent
  • t1 and t2 are the same event

  • R.I = oid representing the independence class
  • R.D = oid representing the disjoint event
  • R.P = the probability
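A small sketch of how the three cases determine the probability that two tuples occur together. The mapping from the I and D attributes to the three cases is an assumption made for this illustration (different I means independent, same I with different D means disjoint, same I and same D means the same event); the slide does not spell it out, and the function names are hypothetical.

```python
from fractions import Fraction


def pr_both(t1, t2):
    """Pr[t1 and t2], under the ASSUMED reading of the I/D attributes."""
    if t1["I"] != t2["I"]:
        return t1["P"] * t2["P"]  # independent events multiply
    if t1["D"] != t2["D"]:
        return Fraction(0)        # disjoint events never co-occur
    return t1["P"]                # same event: both present exactly when one is


def pr_either(t1, t2):
    """Pr[t1 or t2] by inclusion-exclusion."""
    return t1["P"] + t2["P"] - pr_both(t1, t2)


# Hypothetical tuples: t1 and t2 are disjoint alternatives, t3 is independent of both.
t1 = {"I": 1, "D": 1, "P": Fraction(1, 3)}
t2 = {"I": 1, "D": 2, "P": Fraction(1, 6)}
t3 = {"I": 2, "D": 1, "P": Fraction(1, 2)}
```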
30
Example
31
App. 2: Constraint Violations
Independent AND disjoint tuples
Functional dependency Name → City violated
32
App. 2: Constraint Violations
Independent AND disjoint tuples
Step 1: compute prob
Step 2: evaluate query
Functional dependency Name → City violated
33
Query Evaluation
  • Need an algorithm for evaluating $Q(I^p)$
  • Will discuss only tuple-independent $I^p$
  • Idea: compute tuple probabilities

34
(No Transcript)
35
Extensional Query Plans
  • Each tuple t has a probability t.P
  • Algebra operators compute t.P
  • Data complexity: PTIME
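A minimal sketch of two extensional operators on tuple-independent relations, assuming the usual rules: a join multiplies probabilities and a duplicate-eliminating projection computes 1 minus the product of (1 - p). Relations are modelled as lists of (tuple, probability) pairs; the function names and the Film/Cast data are hypothetical. The example also shows that the same operators give a wrong answer when the plan projects over correlated tuples.

```python
from collections import defaultdict
from fractions import Fraction


def join(r, s, col_r, col_s):
    """Join on r[col_r] == s[col_s]; assumes the joined tuples are independent."""
    return [(t + u, p * q) for t, p in r for u, q in s if t[col_r] == u[col_s]]


def project(r, cols):
    """Project with duplicate elimination; assumes merged tuples are independent."""
    absent = defaultdict(lambda: 1)  # probability that no contributing tuple is present
    for t, p in r:
        absent[tuple(t[c] for c in cols)] *= 1 - p
    return [(t, 1 - a) for t, a in absent.items()]


# Hypothetical data: Film(filmID, year) and Cast(filmID, actorName).
film = [((1, 1995), Fraction(9, 10))]
cast = [((1, "Kevin"), Fraction(1, 2)), ((1, "Keven"), Fraction(2, 5))]

# Plan A: project Cast onto filmID first, then join, then project onto year.
plan_a = project(join(film, project(cast, [0]), 0, 0), [1])

# Plan B: join first, then project onto year. The two joined tuples share the
# same Film tuple, so they are NOT independent and the result is incorrect.
plan_b = project(join(film, cast, 0, 0), [1])
```

Here the true probability of the answer 1995 is 9/10 · (1 − 1/2 · 3/5) = 63/100, which plan A returns; plan B returns 81/125.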

36
Extensional Query Plans
[Figure: query plan operators]
37
(No Transcript)
38
Safe Plan Optimization 2/2
Optimization Algorithm.
0. Start with $Q$, proceed top-down
1. Try a projection $Q = \pi_A(Q')$ for any attribute $A$ for which $\pi_A(Q')$ is safe; if possible, continue optimizing $Q'$
2. Try a join $Q = Q' \bowtie Q''$ (note: all attrs in the join are in $\mathit{head}(Q)$)
3. If not possible, then FAIL.
39
Safe Plan Optimization 1/2
Joins, selections: always safe
Projection: $\pi_{A_1,\ldots,A_k}(Q)$ is safe iff for every relation $R$ in $Q$:
$A_1, \ldots, A_k, R.\mathit{key} \to \mathit{head}(Q)$
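The projection condition can be read as a functional-dependency check, which the sketch below implements with a standard attribute-closure loop. It assumes the functional dependencies of the query (key dependencies plus join equalities) are supplied by the caller; how to derive them is not shown on the slide. The attribute names and the helper names `closure` and `projection_is_safe` are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under a list of (lhs, rhs) dependencies."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closed and not set(rhs) <= closed:
                closed |= set(rhs)
                changed = True
    return closed


def projection_is_safe(proj_attrs, head_attrs, relation_keys, fds):
    """True iff, for every relation R, proj_attrs + R.key determines head(Q)."""
    return all(
        set(head_attrs) <= closure(set(proj_attrs) | set(key), fds)
        for key in relation_keys.values()
    )


# Hypothetical query: Film(filmID, year) joined with Cast(filmID, actor).
fds = [
    ({"F.filmID"}, {"F.year"}),    # filmID is the key of Film
    ({"F.filmID"}, {"C.filmID"}),  # join equality, one direction
    ({"C.filmID"}, {"F.filmID"}),  # join equality, other direction
]
keys = {"Film": {"F.filmID"}, "Cast": {"C.filmID", "C.actor"}}
head_of_join = {"F.filmID", "F.year", "C.filmID", "C.actor"}

# Projecting the join directly onto year loses C.actor, which nothing determines.
unsafe = projection_is_safe({"F.year"}, head_of_join, keys, fds)

# Projecting Cast alone onto filmID is fine: the key of Cast covers its head.
safe = projection_is_safe(
    {"C.filmID"}, {"C.filmID", "C.actor"}, {"Cast": keys["Cast"]}, fds
)
```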
40
(No Transcript)
41
(No Transcript)
42
(No Transcript)
43
Outline
  • Applications
  • A probabilistic data model
  • Representation with explicit tuples
  • Representation with implicit tuples
  • Summary, Conclusions

44
Implicit Tuples
  • Motivation: in reasoning about statistics, and in
    information leakage, we do not have the set of
    possible tuples
  • Roots: random graphs and 0/1 laws for sparse
    graphs

[Erdos, Renyi 60; Shelah, Spencer 88; Lynch 92]
45
Binomial Distribution (n, S)
Domain $D$, $|D| = n$
Expected size of $R$ = $S$
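A minimal sketch of this distribution, assuming each of the $n^k$ possible tuples of a $k$-ary relation is included independently with the same probability $p = S/n^k$, so that the expected size is $S$. The function names and the integer encoding of the domain are choices made for this illustration.

```python
import random
from itertools import product


def tuple_probability(n, S, arity=2):
    """Per-tuple inclusion probability that yields expected relation size S."""
    return S / n**arity


def sample_relation(n, S, arity=2, rng=random):
    """Draw one random instance of R over the domain {0, ..., n-1}."""
    p = tuple_probability(n, S, arity)
    return {t for t in product(range(n), repeat=arity) if rng.random() < p}


# One sample with a fixed seed, so the draw is reproducible.
R = sample_relation(10, 5, rng=random.Random(0))
```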
46
(No Transcript)
47
Query Evaluation Theorem
Theorem: For each conjunctive $Q$, $\exists\, c, d$ such that
$\Pr[Q] = c / n^d + O(1/n^{d+1})$
Corollary: $\lim_{n \to \infty} \Pr[Q \mid V]$ always exists
[Miklau, Dalvi, Suciu 05]
48
Example
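$n = |D|$, $p = S/n^2$
Compute $\Pr[Q]$ by brute force.
$Q = R(x,b), R(a,y)$ ($x, y$ = vars; $a, b$ = consts)
The $n^2$ tuples are: $(a,b), (a,c_2), (a,c_3), \ldots, (a,c_n), (c_2,b), \ldots, (c_n,b), (c_2,c_2), \ldots$
$\Pr[Q] = 1 - (1-p)\left[1 - \left(1 - (1-p)^{n-1}\right)^2\right]$
$= 1 - (1-p)\left[1 - \left((n-1)p + O(1/n^2)\right)^2\right]$
$= 1 - (1-p)\left[1 - (n-1)^2 p^2 + O(1/n^3)\right]$
$= p + n^2 p^2 + O(1/n^3)$
$= (S + S^2)/n^2 + O(1/n^3)$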
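The brute-force computation in this example can be checked numerically. The sketch below enumerates every instance over a tiny domain and compares the result with the closed form $1 - (1-p)[1 - (1-(1-p)^{n-1})^2]$, then compares the closed form with the leading term $(S+S^2)/n^2$ for a larger $n$. Encoding the constants $a, b$ as domain elements 0 and 1 is an arbitrary choice for the illustration.

```python
from itertools import product


def pr_q_closed_form(n, S):
    """Exact Pr[Q] for Q = R(x,b), R(a,y) with per-tuple probability S/n^2."""
    p = S / n**2
    return 1 - (1 - p) * (1 - (1 - (1 - p) ** (n - 1)) ** 2)


def pr_q_brute_force(n, S):
    """Sum the probabilities of all 2^(n^2) instances that satisfy Q (tiny n only)."""
    p = S / n**2
    a, b = 0, 1  # the constants, encoded as domain elements (requires n >= 2)
    tuples = [(x, y) for x in range(n) for y in range(n)]
    total = 0.0
    for bits in product([0, 1], repeat=len(tuples)):
        R = {t for t, bit in zip(tuples, bits) if bit}
        weight = p ** len(R) * (1 - p) ** (len(tuples) - len(R))
        has_x_b = any((x, b) in R for x in range(n))
        has_a_y = any((a, y) in R for y in range(n))
        if has_x_b and has_a_y:
            total += weight
    return total


def pr_q_leading_term(n, S):
    """The asymptotic approximation (S + S^2) / n^2."""
    return (S + S**2) / n**2
```

For example, `pr_q_brute_force(3, 2)` walks 512 instances and should agree with `pr_q_closed_form(3, 2)` up to floating-point error, while for `n = 1000` the closed form and the leading term differ only at order $1/n^3$.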
49
Examples
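$Q_1 = R(a,x)$: $\Pr[Q_1] \approx n \cdot S/n^2 = S/n$
$Q_2 = R(a,x), R(x,y), R(y,b)$: $\Pr[Q_2] \approx n^2 \cdot S^3/n^6 = S^3/n^4$
$Q_3 = R(a,b,x), R(y,b,c)$: $\Pr[Q_3] \approx n^2 \cdot S^2/n^6 = S^2/n^4$
Definition: Let $Q$ be a conjunctive query.
$D(Q)$ = total arity − number of variables
$C(Q)$ = product of all $S$'s
$d = ?$  $c = ?$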
50
Main Theorem (contd)
Definition. Let $Q$ be a conjunctive query. A unifier
is $h(Q)$, where $h$ is a homomorphism. $UQ(Q)$ = set of all
unifiers.
Theorem. $d = \min\{D(Q_0) \mid Q_0 \in UQ(Q)\}$
$c = \sum \{C(Q_0) \mid Q_0 \in UQ(Q),\ D(Q_0) = d\}$
51
Example
$Q = R(a,y), R(x,b)$ ($x, y$ = vars; $a, b$ = consts)
What is $\Pr[Q]$?
$\Pr[Q] = (S + S^2)/n^2 + O(1/n^3)$
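A small sketch of how $D$, $C$, $d$ and $c$ can be computed for this example. It assumes a query is a list of atoms (relation name, argument tuple), that variables are identified by an explicit set of names, and that the unifiers are listed by hand rather than enumerated; the value $S = 50$ and all function names are hypothetical.

```python
def D(query, variables):
    """Total arity of all atoms minus the number of distinct variables."""
    total_arity = sum(len(args) for _, args in query)
    used = {a for _, args in query for a in args if a in variables}
    return total_arity - len(used)


def C(query, sizes):
    """Product of the expected sizes S, one factor per atom."""
    result = 1
    for relation, _ in query:
        result *= sizes[relation]
    return result


def leading_term(unifiers, variables, sizes):
    """(d, c) such that Pr[Q] = c / n^d + O(1/n^(d+1)), given all unifiers of Q."""
    d = min(D(q, variables) for q in unifiers)
    c = sum(C(q, sizes) for q in unifiers if D(q, variables) == d)
    return d, c


variables = {"x", "y"}
sizes = {"R": 50}  # hypothetical expected size S of R

q = [("R", ("a", "y")), ("R", ("x", "b"))]
q_unified = [("R", ("a", "b"))]  # obtained by mapping x to a and y to b

d, c = leading_term([q, q_unified], variables, sizes)
```

With these inputs both unifiers have $D = 2$, so $d = 2$ and $c = S^2 + S$, matching $(S + S^2)/n^2$.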
52
Main Result (end)
$\lim_{n \to \infty} \Pr[Q \mid V] =$
  • $0$ when $d(QV) > d(V)$
  • $c(QV)/c(V)$ when $d(QV) = d(V)$
53
(No Transcript)
54
$\Pr[V] = (S + S^2)/n^2 + O(1/n^3)$
$\Pr[QV] = S/n^2 + O(1/n^3)$
$\Pr[Q \mid V] = \Pr[QV]/\Pr[V] \to 1/(1+S) = 1/51$
LAV statistics not quite certain answers
[Miklau, Dalvi, Suciu 05]
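The limit in this example can be reproduced with exact arithmetic. The sketch below takes the exponents and coefficients of the leading terms of $\Pr[QV]$ and $\Pr[V]$ as inputs; the value $S = 50$ is inferred from the stated result $1/51$, and the function name is hypothetical.

```python
from fractions import Fraction


def limit_conditional(d_qv, c_qv, d_v, c_v):
    """lim Pr[Q | V] as n grows, from the leading terms c/n^d of Pr[QV] and Pr[V]."""
    if d_qv > d_v:
        return Fraction(0)           # Pr[QV] vanishes faster than Pr[V]
    if d_qv == d_v:
        return Fraction(c_qv, c_v)   # same order: the ratio of coefficients
    raise ValueError("d(QV) < d(V) cannot happen, since Pr[QV] <= Pr[V]")


S = 50
# Pr[V] ~ (S + S^2)/n^2 and Pr[QV] ~ S/n^2, so both exponents are 2.
limit = limit_conditional(2, S, 2, S + S**2)
```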
55
(No Transcript)
56
(No Transcript)
57
(No Transcript)
58
(No Transcript)
59
Outline
  • Applications
  • A probabilistic data model
  • Representation with explicit tuples
  • Representation with implicit tuples
  • Summary, Conclusions

60
Summary
  • Data today: lots of imprecisions
  • ProbDB: a uniform technique to handle most of them
  • ProbDBs are more difficult to manage than DBs
  • Great opportunity for research that is both deep
    and relevant

61
Open Research Problems
  • Query processing: break P barrier
  • Representation formalisms: correlation, lineage
  • Theory: imprecise mappings
  • Explanation / visualization / smart-GUI

UW
Stanford
Penn
nobody
62
SIGMOD'05 Tutorial on probdb: www.cs.washington.edu/homes/suciu
The End
Questions ?