Title: Probabilistic answers to relational queries (PARQ)
1Probabilistic answers to relational queries (PARQ)
- Octavian Udrea
- Yu Deng
- Edward Hung
- V. S. Subrahmanian
2Content
- Motivation and goals
- Running example
- Technical preliminaries
- CPO model
- CPO integration
- CPO inference algorithms
- Experimental results
- Ongoing work
4Motivation
- Query algebras do not take semantics into account when computing answers
- Data is not always precise
- Ambiguity, insufficient information
- Goal: Use probabilistic ontologies to improve query answer recall and quality
5The probabilistic solution
- Compute and return answers with high probability (> p_thr)
- Keep probabilities hidden from the user
- Problems
- How do we assign a probability to each data item?
- How do we choose p_thr?
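A minimal sketch of the thresholding idea on this slide: keep only answers whose probability exceeds the threshold and return them without exposing the probabilities. The function name, the candidate list and the 0.25 value are hypothetical; only the 75% figure comes from the running example later in the deck.

```python
from typing import Iterable, List, Tuple


def answers_above_threshold(
    scored: Iterable[Tuple[str, float]], p_thr: float
) -> List[str]:
    """Return answers with probability > p_thr, most probable first.

    The probabilities themselves are dropped, so the user only sees answers.
    """
    kept = [(answer, p) for answer, p in scored if p > p_thr]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [answer for answer, _ in kept]


# Hypothetical candidates; 0.75 mirrors the board-meeting example.
candidates = [
    ("advisory board meeting", 0.25),
    ("board of directors meeting", 0.75),
]
print(answers_above_threshold(candidates, p_thr=0.5))
```

Choosing `p_thr` remains the open question the slide raises; the sketch simply takes it as a parameter.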
6Concepts
- Constraint probabilistic ontologies
- Is-a graph with edges labeled with probabilities
- Including conditional probabilities
- Disjoint decompositions
- Ontologies associated with terms in a data source
- Attributes in a relation/XML
- Propositional entities in text sources
8Running example
Email fragment: "Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit..."
9Example decompositions
10Example probability labels
11Example conditional probabilities
12Running example: Sample queries
- Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit...
- What type of board meeting is being discussed?
- Since Ed Masters is present, there is a 75% probability it is a board of directors meeting
- What type of financial unit is referenced?
- Since the subject is marketing policy, there is a 65% probability it is the Financial Review Board.
14Technical preliminaries: POB
- POB schema
- C is a finite set of classes
- (C, ⇒) is a directed acyclic graph
- me produces clusters (disjoint decompositions) for each node
- Example: me(OrganizationUnit) covers Committee, Board, Team, Department, Legal, Executive, Financial, Marketing
- A labeling maps each edge in (C, ⇒) to a positive rational number in [0,1]
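The schema components listed above can be sketched as a small data structure. This is an illustration under stated assumptions, not the authors' implementation: edges are stored as (subclass, superclass) pairs, the split of the OrganizationUnit subclasses into two clusters is a guess, and the uniform 1/4 labels are invented.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]  # (subclass, superclass): this orientation is an assumption


@dataclass
class PobSchema:
    classes: FrozenSet[str]
    me: Dict[str, List[FrozenSet[str]]]  # class -> clusters of immediate subclasses
    label: Dict[Edge, Fraction]          # edge -> probability label

    def edges(self) -> List[Edge]:
        """Edges implied by me: one per (cluster member, parent class)."""
        return sorted((d, c) for c, clusters in self.me.items()
                      for cluster in clusters for d in cluster)

    def problems(self) -> List[str]:
        """Report edges with unknown classes or labels outside (0, 1]."""
        issues = []
        for d, c in self.edges():
            if d not in self.classes or c not in self.classes:
                issues.append(f"edge ({d}, {c}) uses an unknown class")
            p = self.label.get((d, c))
            if p is None or not (0 < p <= 1):
                issues.append(f"edge ({d}, {c}) needs a label in (0, 1]")
        return issues


# Hypothetical grouping of the slide's example classes into two clusters.
units = ["Committee", "Board", "Team", "Department"]
functions = ["Legal", "Executive", "Financial", "Marketing"]
schema = PobSchema(
    classes=frozenset(["OrganizationUnit", *units, *functions]),
    me={"OrganizationUnit": [frozenset(units), frozenset(functions)]},
    label={(d, "OrganizationUnit"): Fraction(1, 4) for d in units + functions},
)
print(schema.problems())  # prints []
```

`Fraction` is used because the labels are described as rational numbers; floats would work for a rougher sketch.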
15Back to the example
Email fragment: "Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit..."
16Constraint probabilities
- Simple constraints
- Only for entities NOT represented in the current ontology
- Nil constraint
- Constraint probabilities
- Pair (p, γ), with p in [0,1] and γ a conjunction of simple constraints
17Labeling
- Labeling should not be arbitrary
- Invalid labeling may lead to time-consuming consistency algorithms
- And to ambiguity in interpreting query answers
- Valid labeling
- No constraint refers to the entities associated with this ontology
- There is exactly one nil constraint probability on each edge
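The two validity conditions above can be checked mechanically. The sketch below assumes a hypothetical encoding: a simple constraint is an (entity, value) pair, a constraint probability is a (p, conjunction) pair, and the nil constraint is the empty conjunction. The class names and the 0.4 label are invented; 0.75 echoes the Ed Masters example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

SimpleConstraint = Tuple[str, str]  # (entity, value): hypothetical encoding
ConstraintProb = Tuple[float, FrozenSet[SimpleConstraint]]
Edge = Tuple[str, str]
NIL: FrozenSet[SimpleConstraint] = frozenset()  # nil constraint as empty conjunction


def labeling_problems(
    labels: Dict[Edge, List[ConstraintProb]], own_entities: Set[str]
) -> List[str]:
    """List violations of the two validity conditions, plus range errors."""
    issues: List[str] = []
    for edge, cps in labels.items():
        nil_count = sum(1 for _, gamma in cps if gamma == NIL)
        if nil_count != 1:
            issues.append(f"{edge}: expected one nil constraint probability, found {nil_count}")
        for p, gamma in cps:
            if not 0.0 <= p <= 1.0:
                issues.append(f"{edge}: probability {p} is outside [0, 1]")
            for entity, _ in gamma:
                if entity in own_entities:
                    issues.append(f"{edge}: constraint refers to own entity {entity!r}")
    return issues


labels = {
    ("BoardOfDirectorsMeeting", "BoardMeeting"): [
        (0.4, NIL),  # hypothetical unconditional label
        (0.75, frozenset({("attendee", "Ed Masters")})),
    ]
}
print(labeling_problems(labels, own_entities={"meeting"}))  # prints []
```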
19The CPO model
- CPO
- C is a finite set of classes
- (C, ⇒) is a directed acyclic graph
- me produces clusters (disjoint decompositions) for each node
- The labeling of (C, ⇒) is a valid labeling
- Note: there is no condition on the probabilities... yet!
20CPO enhanced data sources
- Associate CPOs with some attributes of a relation.
- Associate CPOs with elements in an XML data store.
- Associate CPOs with some keywords for text files.
- CPOk
- At most k probabilities on each edge
- CPO1 is a POB
21Answering queries
- Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit...
- What type of board meeting is being discussed?
- Since Ed Masters is present, there is a 75% probability it is a board of directors meeting
- Goal: Associate probabilities with possible answers.
22Probability path
Email fragment: "Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit..."
23Probability path
- if
- c ⇒ c1 ⇒ c2 ⇒ ... ⇒ ck ⇒ d
- f is a function defined on the chain
- f selects one probability on each edge
- is the set of constraints selected by f along with the probabilities
24CPO consistency
- CPO
- An arbitrary universe of objects O
- Interpretation e is a mapping from C to 2^O
- e is a taxonomic model iff
- We assign objects to each class
- Objects cannot be shared between classes in the same cluster
- ⇒ edges imply subset relations on the sets of objects assigned to each class
- If A ⇒ B is labeled with probability p, at least a fraction p of the objects in A are also assigned to B
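The conditions on this slide can be checked against a concrete interpretation. The sketch below is one reading under stated assumptions: "we assign objects to each class" is taken to mean every class is non-empty, edges are stored as (subclass, superclass), and the label p is a lower bound on the share of the superclass's objects that fall in the subclass. Class and object names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def taxonomic_model_problems(
    e: Dict[str, Set[str]],
    edges: Dict[Tuple[str, str], float],
    clusters: List[FrozenSet[str]],
) -> List[str]:
    """Check an interpretation e (class -> objects) against the slide's conditions."""
    issues: List[str] = []
    # Every class gets at least one object (assumed reading).
    for c, objs in e.items():
        if not objs:
            issues.append(f"{c} has no objects")
    # Classes in the same cluster may not share objects.
    for cluster in clusters:
        members = sorted(cluster)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                if e[a] & e[b]:
                    issues.append(f"{a} and {b} share objects")
    # Edges imply subset relations and a minimum share of objects.
    for (sub, sup), p in edges.items():
        if not e[sub] <= e[sup]:
            issues.append(f"{sub} is not a subset of {sup}")
        elif len(e[sub]) < p * len(e[sup]):
            issues.append(f"{sub} holds less than {p} of {sup}")
    return issues


e = {
    "Board": {"o1", "o2", "o3", "o4"},
    "Directors": {"o1", "o2"},
    "Advisory": {"o3"},
}
edges = {("Directors", "Board"): 0.5, ("Advisory", "Board"): 0.25}
clusters = [frozenset({"Directors", "Advisory"})]
print(taxonomic_model_problems(e, edges, clusters))  # prints []
```

Checking one given interpretation is cheap; the hard part discussed on the following slides is deciding whether any such interpretation exists.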
25CPO consistency (cont'd)
- CPO consistent ⇔ it has a taxonomic probabilistic model
- Deciding if a CPO is consistent is NP-complete
- The weight formula satisfiability problem.
- A non-deterministic algorithm for consistency checking is straightforward.
26Consistency approach
- Identify a subclass of CPOs for which we can check consistency
- Two parts
- Pseudoconsistency: this was done for POBs
- Well-structuredness: particular to CPOs
27Pseudoconsistent CPO
- CPO
- No two classes in the same cluster have a common subclass
- The graph is rooted
- For any distinct immediate subclasses of c, they either
- Have no common subclass
- Have a greatest common subclass different from them
- No cycles
- If c inherits from multiple clusters, all paths from descendants of c to the root go through c
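Two of the conditions above, rootedness and the ban on common subclasses within a cluster, can be sketched with plain graph traversal. This covers only part of pseudoconsistency and is not the authors' algorithm; the `children` map (class to immediate subclasses) and the class names are assumptions for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def descendants(children: Dict[str, Set[str]], c: str) -> Set[str]:
    """All strict subclasses of c, found by depth-first search."""
    seen: Set[str] = set()
    stack = list(children.get(c, ()))
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(children.get(d, ()))
    return seen


def roots(children: Dict[str, Set[str]], classes: Set[str]) -> Set[str]:
    """Classes with no superclass; a rooted graph has exactly one."""
    has_parent = {d for ds in children.values() for d in ds}
    return classes - has_parent


def cluster_conflicts(
    children: Dict[str, Set[str]], clusters: List[FrozenSet[str]]
) -> List[Tuple[str, str]]:
    """Pairs of classes in the same cluster that share a subclass."""
    conflicts = []
    for cluster in clusters:
        members = sorted(cluster)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                if descendants(children, a) & descendants(children, b):
                    conflicts.append((a, b))
    return conflicts


# Hypothetical ontology: AuditBoard inherits from two classes of one cluster.
classes = {"OrganizationUnit", "Board", "Committee", "AuditBoard"}
children = {
    "OrganizationUnit": {"Board", "Committee"},
    "Board": {"AuditBoard"},
    "Committee": {"AuditBoard"},
}
clusters = [frozenset({"Board", "Committee"})]
print(roots(children, classes))
print(cluster_conflicts(children, clusters))
```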
28Pseudoconsistency
29Weight factor
- A set P of non-nil constraint probabilities
- If P is the empty set, wf(P) = 0
- If P = {(p, γ)}, wf(P) = p
- wf(P ∪ Q) = wf(P) + wf(Q) − wf(P)·wf(Q)
- Intuitive meaning: how many objects from class A do I have to assign to class B and satisfy the constraints?
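A small sketch of the weight factor, assuming the combination rule reads wf(P ∪ Q) = wf(P) + wf(Q) − wf(P)·wf(Q) (the operators are an inclusion-exclusion reading of the slide, not confirmed by it) and that only the probabilities of the constraint probabilities matter. Folding the rule over the set one element at a time gives:

```python
from functools import reduce
from typing import Iterable


def wf(probabilities: Iterable[float]) -> float:
    """Weight factor of a set of non-nil constraint probabilities.

    Starts from wf(empty) = 0 and combines each p with
    acc + p - acc * p, the assumed rule for unions.
    """
    return reduce(lambda acc, p: acc + p - acc * p, probabilities, 0.0)


print(wf([]))          # 0.0
print(wf([0.75]))      # 0.75
print(wf([0.5, 0.5]))  # 0.75
```

Under this reading wf equals 1 minus the product of (1 − p) over the set, so it never exceeds 1 and does not depend on the order in which elements are combined.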
30More weight factors
- CPO
- c ⇒ d an edge
- We write
- We define
- Result: Conditions of taxonomic interpretation can be satisfied by selecting at most w(c,d)·O_d objects from d into c.
31Well-structured CPO
- Conditional constraints on edges from the same cluster must be disjoint
- Otherwise, impossible to compute a weight factor for the cluster edges.
- The sum of the weight factors for edges in a cluster is 1
32Well-structuredness
33Consistent CPOs revisited
- A pseudoconsistent and well-structured CPO is consistent
- Pseudoconsistency accounts for most of the conditions in the taxonomic interpretation
- Well-structuredness accounts for the assignment of objects to subclasses
34Consistency checking algorithm
- Pseudoconsistency is O(n²e) and well-structuredness is O(n²k²)
- n: the number of classes
- e: the number of edges
- k: the order of the CPO
- Algorithm based on
- Topological sort
- Dijkstra and derivatives
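As an illustration of the topological-sort building block named above, here is a sketch of Kahn's algorithm over the is-a graph. It returns an ordering with superclasses before subclasses, or `None` when a cycle exists. It is a generic technique, not the authors' checker; the `children` map and class names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def topological_order(children: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Kahn's algorithm: order classes so every class precedes its subclasses."""
    nodes = set(children) | {d for ds in children.values() for d in ds}
    indegree = {n: 0 for n in nodes}
    for ds in children.values():
        for d in ds:
            indegree[d] += 1
    # Start from classes without superclasses; sorting keeps output deterministic.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order: List[str] = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for d in sorted(children.get(n, ())):
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)
    # If some node was never released, the graph contains a cycle.
    return order if len(order) == len(nodes) else None


hierarchy = {
    "OrganizationUnit": {"Board", "Team"},
    "Board": {"BoardOfDirectors"},
}
print(topological_order(hierarchy))
print(topological_order({"A": {"B"}, "B": {"A"}}))  # None: cycle
```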
35CPO enhanced algebras
- CPO enhanced algebras formally defined for
- Relational data sources
- XML data stores
- Selection, projection, product, join, etc.
- Ongoing work
- RDF enhanced query algebra
- Directly related to RDF extraction from text.
37CPO integration motivation
ACME corp. CPO
EVIL corp. CPO
Email from ACME corp. to EVIL corp.: "During your last FO board meeting, the rising costs of quality assurance were not addressed. We would like to include this in our next auditing committee meeting...."
38Merging CPOs
- Two scenarios
- One data source that refers to similar entities but from different application domains.
- Example: ACME-EVIL correspondence
- Queries across multiple data sources
- Example: Two different CPOs associated with distinct relations during a join query.
39Interoperation constraints
- Since the CPOs being merged refer to similar entities, some classes may be equivalent
- Equality constraints: c1 = c2
- Possibly: immediate subclassing constraints
- Not really used: hardly feasible
40The integration problem
- Two CPOs S1 = (C1, ⇒1, me1, f1), S2 = (C2, ⇒2, me2, f2)
- Set of interoperation constraints I
- An integration witness is another CPO S = (C, ⇒, me, f) that satisfies S1, S2 and I
41Integration witness
- Every class c in C1 ∪ C2
- Appears in C OR
- c = d appears in I and d ∈ C
- i.e. no classes get lost
- Similarly, no edges are lost
- No constraints are lost
- If two identical constraint probabilities are on the same edge in both CPOs, take a probability p between the two
42Integration witness
- Immediate subclassing constraints add edges to S
- No cluster can be split as a result of merging
- S is pseudoconsistent and well-structured (if it's not, it's of no use)
- Open problem: If it is not, how can we minimally change it such that it has these properties?
43CPOmerge algorithm
- CPOmerge produces an integration witness if one exists
- O(n³): costly
- In practice, much more efficient through
- Caching
- Some properties are preserved if the original ontologies are pseudoconsistent and well-structured
44Who writes the interop constraints?
- User: not feasible
- How to infer them?
- Intuitive solution: If enough neighbors are in equality constraints, then infer that the respective nodes should be equivalent.
- But we still need some equivalence constraints to get started: use lexical distance
- How many neighbors are enough?
45ICI: Simple solution
- Neighbor: parent, immediate child, sibling from the same cluster
- We define
- ne: number of neighbors in equality constraints
- n_{c,d}: number of neighbors of c, d
- Why? Number of equal neighbors / Total number of neighbors (including self).
- Always < 1
- ICI algorithm: if pe exceeds threshold, assume they are equal
- Start with lexical distance
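The slide's formula for pe did not survive, so the sketch below is only one possible reading of the description: ne counts neighbors of c that are already equated with some neighbor of d, n_{c,d} is taken as the larger of the two neighbor counts, and the "including self" term adds 1 to the denominator, which keeps pe below 1. The seed equalities stand in for the lexical-distance step. All names and the 0.5 threshold are hypothetical.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def pe(c: str, d: str, nbrs1: Dict[str, Set[str]],
       nbrs2: Dict[str, Set[str]], equal: Set[Pair]) -> float:
    """Assumed score: equal neighbors / (neighbors + 1 for 'self')."""
    n1, n2 = nbrs1.get(c, set()), nbrs2.get(d, set())
    ne = sum(1 for x in n1 if any((x, y) in equal for y in n2))
    n_cd = max(len(n1), len(n2))
    return ne / (n_cd + 1)


def ici(nbrs1: Dict[str, Set[str]], nbrs2: Dict[str, Set[str]],
        seeds: Set[Pair], threshold: float) -> Set[Pair]:
    """Grow the set of equality constraints until no pair exceeds the threshold."""
    equal = set(seeds)
    changed = True
    while changed:
        changed = False
        for c in nbrs1:
            for d in nbrs2:
                if (c, d) not in equal and pe(c, d, nbrs1, nbrs2, equal) > threshold:
                    equal.add((c, d))
                    changed = True
    return equal


# Two hypothetical ontologies with neighbor sets already computed.
nbrs1 = {"Board": {"Unit", "Directors"}, "Unit": {"Board"}, "Directors": {"Board"}}
nbrs2 = {"Panel": {"OrgUnit", "Directorate"}, "OrgUnit": {"Panel"}, "Directorate": {"Panel"}}
seeds = {("Unit", "OrgUnit"), ("Directors", "Directorate")}
print(sorted(ici(nbrs1, nbrs2, seeds, threshold=0.5)))
```

Here Board and Panel are inferred equal because both of Board's neighbors are already equated with neighbors of Panel (2 / 3 > 0.5).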
47Give me a CPO
- Very little work so far on probabilistic ontologies.
- Nothing resembling CPOs around
- How do we infer them?
- How do we build disjoint decompositions?
- How do we infer probabilities?
48Building disjoint decompositions
- Take regular ontologies from the Web
- Many sources: daml.org, SchemaWeb, OntoBroker
- Modify CPOmerge to ignore labeling
- The merge result will contain disjoint decompositions
- Equality constraints can be inferred through ICI
49Infer probabilities: simple methods
- Simple methods
- Distribute probabilities uniformly within each cluster
- For each cluster L in me(c), d ⇒ c,
- For any distance function (lexical or otherwise)
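The uniform method can be sketched as follows, assuming it means each edge from a cluster member d to its parent c receives 1/|L|, where L is the cluster containing d (the slide's own formula was lost). The grouping of the OrganizationUnit subclasses into two clusters is hypothetical.

```python
from fractions import Fraction
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]  # (subclass d, superclass c)


def uniform_labels(me: Dict[str, List[FrozenSet[str]]]) -> Dict[Edge, Fraction]:
    """Give every edge d => c the label 1/|L| for the cluster L containing d."""
    labels: Dict[Edge, Fraction] = {}
    for c, clusters in me.items():
        for cluster in clusters:
            for d in cluster:
                labels[(d, c)] = Fraction(1, len(cluster))
    return labels


me = {
    "OrganizationUnit": [
        frozenset({"Committee", "Board", "Team", "Department"}),
        frozenset({"Legal", "Executive", "Financial", "Marketing"}),
    ]
}
labels = uniform_labels(me)
print(labels[("Board", "OrganizationUnit")])  # 1/4
```

With this choice the labels in each cluster sum to exactly 1, which is why exact fractions are used instead of floats.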
50Advanced methods
- Probabilistic relational models with structural uncertainty
- Work by Dr. Getoor et al.
- Classification approach
- Feature extraction determines entities of interest
- Create conditional probabilities on those entities
- User feedback approach
- General, applicable to any of the above
- (ongoing work)
52Experimental setup
- Java implementation
- CPO enhanced relational DB
- Movies database maintained by Dr. Wiederhold
- IMDB data
- IMDB to estimate recall
- Classifications from the Web to build initial CPO
53Consistency check inference
54Recall
55Precision
56Answer quality
57Query running time
58ICI quality
59Bottom line
- Clear improvement in query answer quality
- Some time penalty, but reasonable
- Very little user intervention
- CPOs are suited for a wide variety of data sources
- Potentially, they can be used to convey semantics across heterogeneous data sources
61Current experimental setup
- DBLP data
- over 60 years of scientific publications
- XML data set
- CPOs from complex ontologies
- DBLP classification
- ACM classification of subjects
62Goals (1)
- Determine the efficiency of advanced CPO inference methods
- Experimentally determine the best approach in terms of minimizing user feedback
63Goals (2)
- Use CPOs with RDF databases
- For extracting RDF from text as a means of using semantic information
- For answering queries from RDF databases
- Benefits
- Probabilistic model is clearly formalized
- Proven improvement in answer quality
- Experimentally determine what the probability
threshold may be for various domains