Management of Probabilistic Data: Foundations and Challenges - PowerPoint PPT Presentation

About This Presentation

Slides: 43

Provided by: DANS154

Learn more at: https://homes.cs.washington.edu

Transcript and Presenter's Notes

Title: Management of Probabilistic Data: Foundations and Challenges

1
Management of Probabilistic Data: Foundations
and Challenges

Nilesh Dalvi and Dan Suciu
University of Washington

2
Databases Are Deterministic

Applications since 1970s required precise
semantics
Accounting, inventory
Database tools are deterministic
A tuple is an answer or is not
Underlying theory assumes determinism
FO (First Order Logic)

3
Future of Data Management

We need to cope with uncertainties !
Represent uncertainties as probabilities
Extend data management tools to handle
probabilistic data
Major paradigm shift affecting both foundations
and systems

4
Uncertainties Everywhere

In the schema mappings
Data spaces
Pay as you go data integration
In the data mapping
Life science data integration
Object reconciliation, fuzzy joins
In the data itself
Data by the masses
Information Extraction
RFID data, sensor data

[Halevy2007]
[Philippi&Kohler2006]
[Arasu06]
[Gupta&Sarawagi2006]
[Welbourne2007]
5
Example 1: Data Integration in Life Sciences
[B. Louie et al. 2007]

U2 integrates several biological databases

Example: find functional annotations of ABCD1
EntrezProtein, Pfam, TIGRFAM, NCBI Blast, EntrezGene
User types Gene: ABCD1
U2 finds 80 related proteins
Ranks them by uncertainty score
The 9 correct functions are among the top 11
Need to represent uncertainties explicitly
6
Example 2: Information Extraction
...52 A Goregaon West Mumbai ...
[Gupta&Sarawagi2006]
20% of such extractions are correct
Here probabilities are meaningful
7
Example 3: RFID Ecosystem at UW
[Welbourne2007]
8

RFID data = noisy
SIGHTING(tagID, antennaID, time)
Derived data = probabilistic
John entered Room 524 at 9:15, prob = 0.6
John carried laptop x77 at 11:03, prob = 0.8
. . .
Queries
Which people were in Room 478 yesterday ?

Massive amounts of probabilistic data from RFIDs,
sensors
9
A Model for Uncertainties

Data is probabilistic
Queries formulated in a standard language
Answers are annotated with probabilities

This talk: Probabilistic Databases
10
Probabilistic Databases: Long History

[Cavallo&Pitarelli1987]
[Barbara, Garcia-Molina, Porter1992]
[Lakshmanan, Leone, Ross&Subrahmanian1997]
[Fuhr&Roellke1997]
[Dalvi&S2004]
[Widom2005]

Focus today: the Query Evaluation Problem
11
Has this been solved by AI ?
Fix q; Input: DB
Input: KB
12
Outline

Data model
Query evaluation
Challenges

13
What is a Probabilistic Database (PDB) ?
[Barbara et al. 1992]
[Figure: the probabilistic table HasObject^p, with its columns marked as keys, non-keys, and probability]
What does it mean?
14
Background
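Finite probability space $(\Omega, P)$

$\Omega = \{\omega_1, \dots, \omega_n\}$ = set of outcomes
$P : \Omega \to [0, 1]$
$P(\omega_1) + \dots + P(\omega_n) = 1$

Event: $E \subseteq \Omega$, $P(E) = \sum_{\omega \in E} P(\omega)$
Independent: $P(E_1 \cap E_2) = P(E_1) P(E_2)$
Mutually exclusive or disjoint: $P(E_1 \cap E_2) = 0$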
15
Possible Worlds Semantics
[Figure: a PDB and its possible worlds; the worlds shown have probabilities $p_1 p_3$, $p_1 p_4$, $p_1 (1 - p_3 - p_4 - p_5)$]
16
Definitions
Definition: A tuple-disjoint/independent table is
R(A1, A2, ..., Am, B1, ..., Bn, P)
Definition: A tuple-independent table is
R(A1, A2, ..., Am, P)
Definition: Semantics is given by possible worlds
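
To make the possible-worlds semantics concrete, here is a minimal sketch for the tuple-independent case only. The function name possible_worlds and the sample Meets rows are hypothetical and are not taken from the slides; the sketch assumes each tuple is present or absent independently with its stated probability.

```python
from itertools import product

def possible_worlds(table):
    """Yield (world, probability) pairs for a tuple-independent table.

    `table` maps each tuple to its marginal probability. Assumption: every
    tuple is present or absent independently of all the others.
    """
    tuples = list(table)
    for mask in product([False, True], repeat=len(tuples)):
        world = frozenset(t for t, keep in zip(tuples, mask) if keep)
        prob = 1.0
        for t, keep in zip(tuples, mask):
            prob *= table[t] if keep else 1.0 - table[t]
        yield world, prob

# Hypothetical rows for a Meets(Person1, Person2, Time, P) table.
meets = {
    ("John", "Mary", "9:15"): 0.6,
    ("John", "Sue", "11:03"): 0.8,
}

worlds = list(possible_worlds(meets))
total = sum(p for _, p in worlds)  # probabilities of all worlds sum to 1
```

A table with n tuples has 2^n worlds, so this enumeration only serves to illustrate the semantics; it is not a practical evaluation method.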
17
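HasObject(Object, Time, Person, P)
[Figure: a disjoint/independent table; groups of tuples are marked "Disjoint", and the groups are marked "Independent" of each other]
Meets(Person1, Person2, Time, P)
[Figure: a tuple-independent table; all tuples are marked "Independent"]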
18
Query Semantics
A Boolean query q is an event: $\{\omega \mid \omega \models q\}$
$P(q) = \sum_{\omega \models q} P(\omega)$
Did someone take MyBook to the CoffeeRoom?
q = HasObject(MyBook, x, t), EnterRoom(x, CoffeeRoom, t)
P(q) = 0.96
(meaning: quite likely!)
19
Discussion of Data Model

Tuple-disjoint/independent tables
Simple model, can store in any DBMS
More advanced models
Symbolic boolean expressions
Trio add lineage
Probabilistic Relational Models
Graphical models

[Fuhr and Roellke]
[Widom05, Das Sarma06, Benjelloun06]
[Getoor2006]
[Sen&Deshpande07]
20
Outline

Data model
Query evaluation
Probability of Boolean expressions
From queries to Boolean expressions
Data complexity of query evaluation
Challenges

21
Probability of Boolean Expressions
$\Phi = X_1 X_2 \lor X_1 X_3 \lor X_2 X_3$
$P(X_1) = p_1$, $P(X_2) = p_2$, $P(X_3) = p_3$
Compute $P(\Phi)$

$\Pr(\Phi) = (1 - p_1) p_2 p_3 + p_1 (1 - p_2) p_3 + p_1 p_2 (1 - p_3) + p_1 p_2 p_3$
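
The four-term sum above can be checked by brute force. The sketch below enumerates all truth assignments of independent Boolean variables and adds up the weights of the satisfying ones. The helper prob_of_formula and the numeric probabilities are illustrative assumptions, not values from the talk.

```python
from itertools import product

def prob_of_formula(formula, probs):
    """Probability that `formula` is true when each variable in `probs`
    is independently true with the given probability (exhaustive sum)."""
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if formula(assignment):
            weight = 1.0
            for name in names:
                weight *= probs[name] if assignment[name] else 1.0 - probs[name]
            total += weight
    return total

def phi(a):
    # X1X2 or X1X3 or X2X3: true when at least two variables are true.
    return (a["X1"] and a["X2"]) or (a["X1"] and a["X3"]) or (a["X2"] and a["X3"])

# Hypothetical probabilities chosen only for the check.
probs = {"X1": 0.5, "X2": 0.4, "X3": 0.9}
p1, p2, p3 = probs["X1"], probs["X2"], probs["X3"]

closed_form = (1 - p1) * p2 * p3 + p1 * (1 - p2) * p3 + p1 * p2 * (1 - p3) + p1 * p2 * p3
brute_force = prob_of_formula(phi, probs)
```

The enumeration visits 2^n assignments, which is why it is only a sanity check and not a general method.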
22
Background
Fix $P(X_1) = P(X_2) = \dots = P(X_n) = 1/2$
23
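Query q + Database PDB → $\Phi$
q = R(x, y), S(x, z)
[Figure: the probabilistic tables $R^p$ and $S^p$ of the PDB, with one Boolean variable per tuple]
$\Phi = X_1 Y_1 \lor X_1 Y_2 \lor X_2 Y_3 \lor X_2 Y_4 \lor X_2 Y_5$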
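
As a rough illustration of how a query and a database yield a Boolean expression, the sketch below builds the DNF for q = R(x, y), S(x, z) by pairing every R tuple with every S tuple that agrees on x. The table contents are hypothetical, picked so that the result has the same shape as the expression on this slide; the original tables are not recoverable from the transcript.

```python
def lineage(r_tuples, s_tuples):
    """DNF lineage of the Boolean query q = R(x, y), S(x, z).

    Each argument maps a Boolean variable name (e.g. "X1") to a tuple whose
    first attribute is x. Returns a list of conjuncts, each a pair of names.
    """
    clauses = []
    for r_name, (rx, _y) in r_tuples.items():
        for s_name, (sx, _z) in s_tuples.items():
            if rx == sx:  # join condition on the shared variable x
                clauses.append((r_name, s_name))
    return clauses

# Hypothetical table contents (assumption, not from the slides).
R = {"X1": ("a", 1), "X2": ("b", 2)}
S = {
    "Y1": ("a", 10), "Y2": ("a", 11),
    "Y3": ("b", 12), "Y4": ("b", 13), "Y5": ("b", 14),
}

phi = lineage(R, S)
text = " | ".join(x + y for x, y in phi)
```

With these inputs, text lists the five conjuncts X1Y1, X1Y2, X2Y3, X2Y4, X2Y5 joined by "|".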
24
Application to Query Evaluation
Corollary: Fix an FO query q. Exact evaluation of
Pr(q) on input PDB is in #P
Corollary: Fix a conjunctive query q.
Approximation of Pr(q) on input PDB is in
PTIME (FPTRAS)
[Graedel, Gurevich, Hirsch1998]
25
Background: Probabilistic Networks
q = R(x, y), S(x, z)
$\Phi = X_1 Y_1 \lor X_1 Y_2 \lor X_2 Y_3 \lor X_2 Y_4 \lor X_2 Y_5$

Inference hard in general
KR techniques exploit local properties
E.g. bounded treewidth → PTIME

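[Figure: probabilistic network for the expression, with OR and AND gates over the variables X1, X2 (probabilities p1, p2) and Y1, ..., Y5 (probabilities q1, ..., q5)]
[Zabiyaka&Darwiche06]
Note: for this query the treewidth is unbounded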
26
[DS2004]
q = R(x, y), S(x, z)
[Figure: safe plan for q]
The data complexity of this query is PTIME
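
The slide's plan diagram did not survive extraction, so the following is only a sketch of one plausible safe plan for q = R(x, y), S(x, z), assuming both tables are tuple-independent: project y and z away within each x, join on x, then project x away. All function names and table contents are hypothetical. A brute-force sum over possible worlds is included as a cross-check.

```python
from collections import defaultdict
from itertools import product

def independent_or(probs):
    """Probability that at least one of several independent events occurs."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

def safe_plan_prob(R, S):
    """P(q) for q = R(x, y), S(x, z); R and S map (x, other) to a probability.

    Assumes tuple-independent tables, so each step combines independent events.
    """
    r_by_x, s_by_x = defaultdict(list), defaultdict(list)
    for (x, _y), p in R.items():
        r_by_x[x].append(p)
    for (x, _z), p in S.items():
        s_by_x[x].append(p)
    # For a fixed x: (some R tuple present) AND (some S tuple present).
    per_x = [independent_or(r_by_x[x]) * independent_or(s_by_x[x])
             for x in r_by_x if x in s_by_x]
    # Different x values involve disjoint sets of tuples, hence independent.
    return independent_or(per_x)

def brute_force_prob(R, S):
    """Reference answer: sum the probabilities of the worlds where q holds."""
    items = [("R", t, p) for t, p in R.items()] + [("S", t, p) for t, p in S.items()]
    total = 0.0
    for mask in product([False, True], repeat=len(items)):
        weight, r_xs, s_xs = 1.0, set(), set()
        for (rel, t, p), keep in zip(items, mask):
            weight *= p if keep else 1.0 - p
            if keep:
                (r_xs if rel == "R" else s_xs).add(t[0])
        if r_xs & s_xs:  # q holds when some x occurs in both relations
            total += weight
    return total

# Hypothetical data.
R = {("a", 1): 0.5, ("b", 2): 0.3}
S = {("a", 10): 0.4, ("a", 11): 0.6, ("b", 12): 0.2}
safe = safe_plan_prob(R, S)
exact = brute_force_prob(R, S)
```

The plan touches each tuple once, while the brute-force check is exponential in the number of tuples.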
27
Dichotomy Theorem
Let q be a conjunctive query without self-joins

Theorem: One of the following holds
(1) Either q is in PTIME
(2) Or q is #P-hard

[DS2004]
In Case (1) q can be computed by a safe plan
and we call it a safe query
[Andritsos et al. 2006]
28
#P-Hard Queries
h1 = R(x), S(x, y), T(y)
h2 = R(x, y), S(y)
h3 = R(x, y), S(x, y)
. . .

PTIME Queries
R(x, y), S(x, z)
R(x, y), S(y), T(a, y)
R(x), S(x, y), T(y), U(u, y), W(a, u)
. . .

How do we decide if a query is in PTIME or #P-hard?
29
Hierarchical Queries
sg(x) = set of subgoals containing the variable x
in a key position
Definition: A query q is hierarchical if for all
x, y: $sg(x) \subseteq sg(y)$ or $sg(x) \supseteq sg(y)$
or $sg(x) \cap sg(y) = \emptyset$
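
The definition translates directly into a pairwise check on the sets sg(x). The sketch below is a minimal version under two assumptions: a query is given as a list of (relation name, variables) pairs, and each variable list already contains only the variables that occur in key positions (constants are omitted). The representation and function names are hypothetical.

```python
from itertools import combinations

def subgoals_of(query):
    """Map each variable to the set of subgoal indexes that contain it."""
    sg = {}
    for i, (_rel, variables) in enumerate(query):
        for v in variables:
            sg.setdefault(v, set()).add(i)
    return sg

def is_hierarchical(query):
    """True if, for every pair of variables, the subgoal sets are nested
    (one contains the other) or disjoint."""
    sg = subgoals_of(query)
    for x, y in combinations(sg, 2):
        a, b = sg[x], sg[y]
        if not (a <= b or a >= b or a.isdisjoint(b)):
            return False
    return True

# h1 = R(x), S(x, y), T(y): sg(x) and sg(y) overlap without nesting.
h1 = [("R", ["x"]), ("S", ["x", "y"]), ("T", ["y"])]
# R(x, y), S(x, z): sg(y) and sg(z) are both contained in sg(x).
q_safe = [("R", ["x", "y"]), ("S", ["x", "z"])]

h1_ok = is_hierarchical(h1)
q_safe_ok = is_hierarchical(q_safe)
```

Here h1 fails the test and R(x, y), S(x, z) passes it, matching the two lists of example queries in the talk.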
30
Case 1: Independent Tuples Only
[DS2004]
PTIME Queries
Fact: If q is hierarchical then q is in PTIME

The hierarchy gives the safe plan!
Root variable u → $\pi_{-u}$
Connected components → Join

31
Case 1: Independent Tuples Only
[DS2004]
#P-hard Queries
Recall: h1 = R(x), S(x, y), T(y)
h1 is #P-hard (reduction from Partitioned
Positive 2DNF)
[Provan&Ball83]
Fact: If q is non-hierarchical then it is #P-hard.
Proof: it contains h1: q = . . ., R(x, . . .),
S(x, y, . . .), T(y, . . .), . . .
Theorem: Testing if q is PTIME or #P-hard is in AC0
32
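Case 2: Independent/disjoint Tuples
PTIME Queries
R(x), S(x, y), T(y), U(u, y), W(a, u)

[Figure: safe plan for this query, built from the operators $\pi^D_{-u}$, Join_u, $\pi^D_{-y}$, Join_y, $\pi^I_{-x}$ (independent project), Join_x over the tables $W^p(a, u)$, $U^p(u, y)$, $T^p(y)$, $S^p(x, y)$, $R^p(x)$]

Root variable → $\pi^I$
CCs → Join
Constant key attrs → $\pi^D$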
33
Case 2: Independent/disjoint Tuples
#P-hard Queries
Recall: h1 = R(x), S(x, y), T(y)
h2 = R(x, y), S(y)
h3 = R(x, y), S(x, y)
h2 and h3 are #P-hard by reduction from PERMANENT
If the safe-plan algorithm fails on q, then q can
be rewritten to either h1 or h2 or h3 and
hence is #P-hard (see paper for details)
Theorem: Testing if q is PTIME or #P-hard is
PTIME-complete
34
Summary on Query Evaluation

We understand completely only queries w/o
self-joins
Lessons learned from our system MystiQ
When the query is safe
Evaluate it exactly, in the database engine
Performance close to regular SQL
When the query is unsafe
Approximate it, compute only top-k
Performance one or two orders of magnitude worse

[Re2007]
35
Outline

Data model
Query evaluation
Challenges

36
Query Optimization
[Re2007, Re2007b]

Even a #P-hard query often has subqueries that
are in PTIME. Needed:
Combine safe plans + probabilistic inference
Interesting independence/disjointness
Model a probabilistic engine as a black box

CHALLENGE: Integrate a black-box probabilistic
inference engine in a query processor.
37
Probabilistic Inference Algorithms

Open the box ! Logical to physical
Examine specific algorithms from KR
Variable elimination
Junction trees
Bounded treewidth

[Sen&Deshpande2007]
[Bravo&Ramakrishnan2007]
CHALLENGE: (1) Study the space of optimization
alternatives. (2) Estimate the cost of specific
probabilistic inference algorithms.
38
Open Theory Problems

Self-joins are much harder to study
Solved only for independent tuples
Extend to richer query language
Unions, predicates (such as <), aggregates
Do hardness results still hold for Pr = 1/2?

[DS2007]
CHALLENGE: Complete the analysis of the query
complexity over probabilistic databases
39
Complex Probabilistic Model

Independent and disjoint tuples are insufficient
for real applications
Capturing complex correlations
Lineage
Graphical models

[Das Sarma06, Benjelloun06]
[Getoor06, Sen&Deshpande07]
CHALLENGE: Explore the connection between complex
models and views
[Verma&Pearl1990]
40
Constraints
[Shen06, Andritsos06, Richardson06, Chaudhuri07]