Title: Probabilistic Data
1Probabilistic Data
- Dan Suciu
- University of Washington
joint with Nilesh Dalvi and Gerome Miklau
2Probabilistic Data
- Deterministic data
- An item is in the database or not
- A tuple is an answer to the query or not
- Today ALL data is deterministic: relational, XML, ...
- Probabilistic data
- Whether an item is in the data is a probabilistic event
- Whether a tuple is an answer to the query is a probabilistic event
- May be applied to all data: relational, XML, ...
3Long History
- Cavallo & Pittarelli, 1987
- Barbara, Garcia-Molina & Porter, 1992
- Lakshmanan, Leone, Ross & Subrahmanian, 1997
- Fuhr & Roelleke, 1997
- Dalvi & S., 2004
- Widom, 2005
4Why Now ?
- Application pull: the need to manage imprecisions in data
- Technology push: convergence of several techniques for probabilistic query processing
5Application Pull
- Imprecisions in data: non-matching data values, imprecise queries, inconsistent data, misaligned schemas, information extracted from text, etc.
6Technology Push
This talk
Probabilistic data is fundamentally complex
- Contributing technologies
- Top-k algorithms
- Efficient Monte Carlo Simulations
- Derived from 0/1 Laws for sparse graphs
- Efficient query processing on prob. data
This talk
See also SIGMOD05 Tutorial on Prob. DB
7Outline
- Applications
- A probabilistic data model
- Representation with explicit tuples
- Representation with implicit tuples
- Summary, Conclusions
8App. 1 Similarity Predicates
SELECT DISTINCT F.title, F.year
FROM Actor A, Film F, Casts C
WHERE C.filmID = F.filmID
  and C.actorID = A.actorID
  and A.name ≈ 'Copolla'
  and F.year ≈ 1995
  and F.title ≈ 'rain man'
[Motro 88, Agrawal 03, Dalvi 04]
9Imprecision in the query answer
[Dalvi & S. 04]
10Probabilistic db ⇒ semantics for complex queries
SELECT *
FROM Actor A
WHERE A.name ≈ 'Kevin'
  and 1995 = (SELECT MIN(F.year)
              FROM Film F, Casts C
              WHERE C.filmid = F.filmid
                and C.actorid = A.actorid
                and F.rating ≈ 'high')
11App 2 Inconsistent Data
Q: Find products bought by both John and Sue
Repair semantics
A: No certain answers
Name → City is violated
Imprecision in the data
[Bertossi & Chomicki 03]
12Probabilistic db ⇒ Lower Precision, Higher Recall
Q: Find products bought by both John and Sue
A: Gizmo (prob. 1/3), Camera (prob. 1/6)
Name → City
14Probabilistic db ⇒ use statistics to increase recall
Data Source V1
Data Source V2
Assume avg(Buildings/Dept) = 5
A: Larry Big with prob. 0.2
[Dalvi & S. 05]
15App. 4 Information Leakage
- Secret S, public view V
- Privacy preserving data mining
- Publishing views
- k-anonymity
Imprecision here is GOOD: it protects privacy
[Sweeney, Samarati, Evfimievski 03, Miklau 04, Miklau 05]
16Probabilistic db ⇒ already being used
Pr[S] = a priori probability of secret S
Pr[S | V] = a posteriori probability of S
17Summary of Applications
- Many other applications: record linkage, quality in data integration, sensor data
- More details in SIGMOD05 Tutorial
- Imprecisions in database or in query answers
- Need complex probabilistic models
- tuple correlations
- statistics
18Outline
- Applications
- A probabilistic data model
- Representation with explicit tuples
- Representation with implicit tuples
- Summary, Conclusions
19Probabilistic Database
- $I^p$ = probability distribution on all instances
$\Pr : \mathit{Inst} \to [0,1]$
$\sum_I \Pr[I] = 1$
A possible world is an instance $I$ s.t. $\Pr[I] > 0$
21Query Semantics
Given query Q, probabilistic data $I^p$:
$Q(I^p)$ = a probability function on tuples
$\Pr[t \in Q] = \sum_{I : t \in Q(I)} \Pr[I]$
EVERY query has a semantics. Return to the user the top-k answers.
22Select distinct x.product
From Sales x, Sales y
Where x.name = 'John' and y.name = 'Sue'
  and x.product = y.product
Pr[I1] = 1/2
Pr[I2] = 1/12
Pr[I3] = 1/3
Pr[I4] = 1/12
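The possible-worlds semantics can be sketched in a few lines of Python. The four world probabilities are the ones on the slide; the contents of each instance are not in the transcript, so the Sales tuples below are invented for illustration (chosen so that the result agrees with the "Gizmo 1/3, Camera 1/6" answer quoted earlier). The names `WORLDS`, `query`, and `answer_probabilities` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Each world is (instance, probability). Instance contents are ASSUMED.
WORLDS = [
    ({("John", "Gizmo"), ("Sue", "Gadget")}, Fraction(1, 2)),                    # I1
    ({("John", "Camera"), ("Sue", "Camera")}, Fraction(1, 12)),                  # I2
    ({("John", "Gizmo"), ("Sue", "Gizmo")}, Fraction(1, 3)),                     # I3
    ({("John", "Gizmo"), ("John", "Camera"), ("Sue", "Camera")}, Fraction(1, 12)),  # I4
]


def query(sales):
    """Products bought by both John and Sue in one deterministic instance."""
    return {
        px
        for (nx, px) in sales
        for (ny, py) in sales
        if nx == "John" and ny == "Sue" and px == py
    }


def answer_probabilities(worlds, q):
    """Pr[t in Q] = sum of Pr[I] over all worlds I whose answer contains t."""
    probs = defaultdict(Fraction)
    for instance, pr in worlds:
        for t in q(instance):
            probs[t] += pr
    return dict(probs)


answers = answer_probabilities(WORLDS, query)
```

Enumerating worlds is exponential in general, which is why the rest of the talk looks for representations and plans that avoid it.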
23Summary on Data Semantics
- Possible worlds semantics: very general, allows arbitrary tuple correlations
- Query semantics: EVERY query has a simple, well-defined semantics
Problems: How do we represent a probabilistic DB? How do we evaluate queries?
24Outline
- Applications
- A probabilistic data model
- Representation with explicit tuples
- Representation with implicit tuples
- Summary, Conclusions
27App. 1 Similarity Predicates
Film
Cast
Step 1: compute prob
SELECT DISTINCT year
FROM Film, Cast
WHERE Film.filmID = Cast.FilmID
  and title ≈ 'rain' and actorName ≈ 'Kevin'
28App. 1 Similarity Predicates
Film
Cast
Step 1: compute prob
Step 2: execute query
SELECT DISTINCT year
FROM Film, Cast
WHERE Film.filmID = Cast.FilmID
  and title ≈ 'rain' and actorName ≈ 'Kevin'
29Disjoint/Independent Tuples
For any t1, t2, one of the following holds:
- t1 and t2 are disjoint
- t1 and t2 are independent
- t1 and t2 are the same event
R.I = oid representing the independence class
R.D = oid representing the disjoint event
R.P = the probability
30Example
31App2 Constraint Violations
Independent AND disjoint tuples
Functional dependency Name → City is violated
32App2 Constraint Violations
Independent AND disjoint tuples
Step 1: compute prob
Step 2: evaluate query
Functional dependency Name → City is violated
33Query Evaluation
- Need an algorithm for evaluating $Q(I^p)$
- Will discuss only tuple-independent $I^p$
- Idea: compute tuple probabilities
35Extensional Query Plans
- Each tuple t has a probability t.P
- Algebra operators compute t.P
- Data complexity: PTIME
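A minimal sketch of how algebra operators might compute t.P, assuming every input tuple is an independent event. It uses the usual independence formulas: a join multiplies probabilities, and a duplicate-eliminating projection combines a group as 1 - ∏(1 - p). These formulas give correct answers only when the tuples being combined really are independent. The encoding of a relation as a list of (tuple, probability) pairs, the function names, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def select(rel, pred):
    """Selection keeps each qualifying tuple with its probability unchanged."""
    return [(t, p) for t, p in rel if pred(t)]


def join(left, right, on):
    """Join: both input tuples must be present, so probabilities multiply."""
    return [
        (t1 + t2, p1 * p2)
        for t1, p1 in left
        for t2, p2 in right
        if on(t1, t2)
    ]


def project(rel, cols):
    """Projection with duplicate elimination: 1 - prod(1 - p) per group."""
    absent = defaultdict(lambda: Fraction(1))
    for t, p in rel:
        absent[tuple(t[c] for c in cols)] *= 1 - p
    return [(key, 1 - q) for key, q in absent.items()]


# Hypothetical data: Film(filmID, year) and Cast(filmID, actorName).
film = [((1, 1995), Fraction(1, 2)), ((2, 1995), Fraction(1, 4))]
cast = [((1, "Kevin"), Fraction(1, 2)), ((2, "Kevin"), Fraction(1, 2))]

joined = join(film, cast, lambda f, c: f[0] == c[0])
by_actor = project(joined, [3])  # column 3 is actorName in the joined tuple
```

Each operator touches every tuple a bounded number of times, which is consistent with the PTIME data complexity stated above.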
36Extensional Query Plans
[Figure: query plan tree; apart from a selection σ, the operator symbols did not survive extraction]
38Safe Plan Optimization 2/2
Optimization Algorithm.
0. Start with Q, proceed top-down
1. Try a projection Q = Π_A(Q') for any attribute A for which Π_A(Q') is safe; if possible, continue optimizing Q'
2. Try a join Q = Q' ⋈ Q'' (note: all attrs in the join are in head(Q))
3. If not possible, then FAIL.
39Safe Plan Optimization 1/2
Joins, selections -- always safe
Projection -- Π_{A1,...,Ak}(Q) is safe iff for every relation R in Q: A1, ..., Ak, R.key → head(Q)
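The projection condition reads as a functional-dependency test, so it can be sketched with a standard attribute-closure computation. This is an illustrative sketch only: the caller is assumed to supply every functional dependency that holds on Q (including those induced by keys and join conditions), and the schema, `closure`, and `projection_is_safe` are hypothetical names, not part of the talk.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under the functional dependencies `fds`.

    `fds` is a list of (lhs, rhs) pairs of attribute sets.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def projection_is_safe(kept, relation_keys, head, fds):
    """True iff for every relation R: kept attrs + R.key -> head(Q)."""
    return all(
        set(head) <= closure(set(kept) | set(key), fds)
        for key in relation_keys.values()
    )


# Hypothetical query: Film(filmID, year) joined with Cast(filmID, actor).
HEAD = {"filmID", "year", "actor"}
KEYS = {"Film": {"filmID"}, "Cast": {"filmID", "actor"}}
FDS = [({"filmID"}, {"year"})]  # filmID is the key of Film

safe_actor = projection_is_safe({"actor"}, KEYS, HEAD, FDS)
safe_year = projection_is_safe({"year"}, KEYS, HEAD, FDS)
```

In this example projecting onto `actor` passes the test, while projecting onto `year` fails it: two result tuples with the same year can share one Film tuple, so they are not independent.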
43Outline
- Applications
- A probabilistic data model
- Representation with explicit tuples
- Representation with implicit tuples
- Summary, Conclusions
44Implicit Tuples
- Motivation: in reasoning about statistics, and in information leakage, we do not have the set of possible tuples
- Roots: random graphs and 0/1 laws for sparse graphs
[Erdős & Rényi 60, Shelah & Spencer 88, Lynch 92]
45Binomial Distribution (n, S)
Domain D, |D| = n
Expected size of R = S
47Query Evaluation Theorem
Theorem: For each conjunctive query Q, $\exists\, c, d$ such that
$\Pr[Q] = c / n^d + O(1/n^{d+1})$
Corollary: $\lim_{n \to \infty} \Pr[Q \mid V]$ always exists
[Miklau, Dalvi & S. 05]
48Example
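$n = |D|$, $p = S/n^2$
Compute $\Pr[Q]$ by brute force.
$Q = R(x,b), R(a,y)$ (x, y = vars; a, b = consts)
The $n^2$ tuples are (a,b), (a,c2), (a,c3), ..., (a,cn), (c2,b), ..., (cn,b), (c2,c2), ...
$$\begin{aligned}
\Pr[Q] &= 1 - (1-p)\left[1 - \left(1 - (1-p)^{n-1}\right)^2\right] \\
&= 1 - (1-p)\left(1 - \left((n-1)p + O(1/n^2)\right)^2\right) \\
&= 1 - (1-p)\left(1 - (n-1)^2 p^2 + O(1/n^3)\right) \\
&= p + n^2 p^2 + O(1/n^3) \\
&= (S + S^2)/n^2 + O(1/n^3)
\end{aligned}$$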
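The derivation can be checked numerically. The sketch below assumes a domain of n elements in which the constants a and b are two distinct elements, and that each of the n² possible tuples is in R independently with probability p. It compares a literal enumeration of all worlds (feasible only for tiny n) with the closed form, and then the closed form with the asymptotic value (S + S²)/n². The function names and the test values are illustrative choices.

```python
from fractions import Fraction
from itertools import product


def pr_q_closed_form(n, p):
    """1 - (1-p) * (1 - (1 - (1-p)^(n-1))^2) for Q = R(x,b), R(a,y)."""
    miss = (1 - p) ** (n - 1)  # none of the n-1 tuples in one "arm" is present
    return 1 - (1 - p) * (1 - (1 - miss) ** 2)


def pr_q_brute_force(n, p):
    """Sum the probabilities of all 2^(n*n) worlds in which Q holds."""
    domain = range(n)
    a, b = 0, 1  # the two constants, assumed distinct (needs n >= 2)
    tuples = [(x, y) for x in domain for y in domain]
    total = 0
    for bits in product([0, 1], repeat=len(tuples)):
        world = {t for t, bit in zip(tuples, bits) if bit}
        # Q holds iff some tuple has second component b and some has first component a.
        holds = any(y == b for _, y in world) and any(x == a for x, _ in world)
        if holds:
            k = len(world)
            total += p ** k * (1 - p) ** (len(tuples) - k)
    return total


# Exact agreement on a tiny domain (n = 3 gives 512 worlds), here with S = 2.
exact_small = pr_q_brute_force(3, Fraction(2, 9))
closed_small = pr_q_closed_form(3, Fraction(2, 9))

# Gap between the closed form and the asymptotic value for a larger domain.
n, S = 1000, 5
gap = abs(pr_q_closed_form(n, S / n**2) - (S + S**2) / n**2)
```

For these values the gap is small relative to 1/n², in line with the O(1/n³) error term.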
49Examples
$Q_1 = R(a,x)$
$\Pr[Q_1] \approx n \cdot S/n^2 = S/n$
$Q_2 = R(a,x), R(x,y), R(y,b)$
$\Pr[Q_2] \approx n^2 \cdot S^3/n^6 = S^3/n^4$
$Q_3 = R(a,b,x), R(y,b,c)$
$\Pr[Q_3] \approx n^2 \cdot S^2/n^6$
Definition: Let Q be a conjunctive query.
D(Q) = total arity - number of variables
C(Q) = product of all S's
d = ?  c = ?
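The two quantities in the definition are easy to compute mechanically. The sketch below assumes a conjunctive query is encoded as a list of (relation name, argument tuple) atoms, that the caller says which argument names are variables, and that each relation has a known expected size S; this encoding and the names `D` and `C` as Python functions are assumptions for illustration.

```python
def D(query, variables):
    """Total arity of all atoms minus the number of distinct variables."""
    total_arity = sum(len(args) for _, args in query)
    used = {arg for _, args in query for arg in args if arg in variables}
    return total_arity - len(used)


def C(query, expected_size):
    """Product of the expected sizes S of the relations, one factor per atom."""
    result = 1
    for relation, _ in query:
        result *= expected_size[relation]
    return result


VARS = {"x", "y"}  # a, b, c are constants
Q1 = [("R", ("a", "x"))]
Q2 = [("R", ("a", "x")), ("R", ("x", "y")), ("R", ("y", "b"))]
Q3 = [("R", ("a", "b", "x")), ("R", ("y", "b", "c"))]

exponents = [D(q, VARS) for q in (Q1, Q2, Q3)]
coefficient_q2 = C(Q2, {"R": 5})  # with an illustrative S = 5
```

The exponents match the powers of n in the three examples: S/n, S³/n⁴, and n² · S²/n⁶ = S²/n⁴.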
50Main Theorem (contd)
Definition. Let Q be a conjunctive query. A unifier is h(Q), where h is a homomorphism; UQ(Q) = set of all unifiers.
Theorem. $d = \min \{ D(Q_0) \mid Q_0 \in UQ(Q) \}$, $c = \sum \{ C(Q_0) \mid Q_0 \in UQ(Q),\ D(Q_0) = d \}$
51Example
$Q = R(a,y), R(x,b)$ (x, y = vars; a, b = consts)
What is $\Pr[Q]$?
$\Pr[Q] = (S + S^2)/n^2 + O(1/n^3)$
52Main Result (end)
$\lim_{n \to \infty} \Pr[Q \mid V]$ =
- 0 when d(QV) > d(V)
- c(QV)/c(V) when d(QV) = d(V)
54$\Pr[V] = (S + S^2)/n^2 + O(1/n^3)$
$\Pr[QV] = S/n^2 + O(1/n^3)$
$\Pr[Q \mid V] = \Pr[QV]/\Pr[V] \to 1/(1+S) = 1/51$
LAV statistics not quite certain answers
[Miklau, Dalvi & S. 05]
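The limit rule from the main result can be written as a small function over the (c, d) pairs of QV and V. This is a sketch under the assumption that the leading terms c/n^d of both probabilities are already known; the function name is hypothetical, and S = 50 is the value implied by the slide's 1/(1+S) = 1/51.

```python
from fractions import Fraction


def limit_conditional(c_qv, d_qv, c_v, d_v):
    """lim Pr[Q | V] given Pr[QV] ~ c_qv / n^d_qv and Pr[V] ~ c_v / n^d_v."""
    if d_qv > d_v:
        return Fraction(0)  # Pr[QV] vanishes faster than Pr[V]
    if d_qv == d_v:
        return Fraction(c_qv, c_v)
    # QV implies V, so Pr[QV] <= Pr[V] and d(QV) < d(V) cannot occur.
    raise ValueError("d(QV) must be at least d(V)")


S = 50
# Pr[V] ~ (S + S^2)/n^2 and Pr[QV] ~ S/n^2, so both exponents are 2.
limit = limit_conditional(c_qv=S, d_qv=2, c_v=S + S**2, d_v=2)
```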
59Outline
- Applications
- A probabilistic data model
- Representation with explicit tuples
- Representation with implicit tuples
- Summary, Conclusions
60Summary
- Data today: lots of imprecisions
- ProbDB: a uniform technique to handle most of them
- ProbDBs are more difficult to manage than DBs
- Great opportunity for research that is both deep
and relevant
61Open Research Problems
- Query processing: break the #P barrier (UW)
- Representation formalisms: correlation, lineage (Stanford)
- Theory: imprecise mappings (Penn)
- Explanation / visualization / smart-GUI (nobody)
62SIGMOD05 Tutorial on probdb: www.cs.washington.edu/homes/suciu
The End
Questions ?