Probabilistic answers to relational queries (PARQ)

Transcript and Presenter's Notes

1
Probabilistic answers to relational queries (PARQ)
  • Octavian Udrea
  • Yu Deng
  • Edward Hung
  • V. S. Subrahmanian

2
Content
  • Motivation and goals
  • Running example
  • Technical preliminaries
  • CPO model
  • CPO integration
  • CPO inference algorithms
  • Experimental results
  • Ongoing work

4
Motivation
  • Query algebras do not take semantics into account
    when computing answers
  • Data is not always precise
  • Ambiguity, insufficient information
  • Goal: use probabilistic ontologies to improve
    query answer recall and quality

5
The probabilistic solution
  • Compute and return answers with high probability
    (> pthr)
  • Keep probabilities hidden from the user
  • Problems
  • How do we assign a probability to each data item?
  • How do we choose pthr?
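
The thresholding idea on this slide can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation; the function name `answers_above_threshold`, the candidate answers, and the value 0.5 for pthr are hypothetical.

```python
def answers_above_threshold(scored_answers, p_thr):
    """Keep answers whose probability exceeds p_thr and hide the probabilities."""
    return [answer for answer, p in scored_answers if p > p_thr]


# Hypothetical candidates: (answer, probability) pairs produced by some scorer.
candidates = [("Board of Directors", 0.75), ("Advisory Board", 0.25)]

# The user sees only the surviving answers, never the probabilities.
print(answers_above_threshold(candidates, 0.5))
```

The comparison is strict, so an answer whose probability equals pthr is not returned.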

6
Concepts
  • Constraint probabilistic ontologies
  • Is-a graph with edges labeled with probabilities
  • Including conditional probabilities
  • Disjoint decompositions
  • Ontologies associated with terms in a data source
  • Attributes in a relation/XML
  • Propositional entities in text sources

8
Running example
Email fragment: Ed Masters opposed the new
marketing policy during the board meeting... Eric
claimed Ed was not aware of the situation in the
financial unit...
9
Example decompositions
Email fragment: Ed Masters opposed the new
marketing policy during the board meeting... Eric
claimed Ed was not aware of the situation in the
financial unit...
10
Example probability labels
Email fragment: Ed Masters opposed the new
marketing policy during the board meeting... Eric
claimed Ed was not aware of the situation in the
financial unit...
11
Example conditional probabilities
Email fragment: Ed Masters opposed the new
marketing policy during the board meeting... Eric
claimed Ed was not aware of the situation in the
financial unit...
12
Running example: Sample queries
  • Ed Masters opposed the new marketing policy
    during the board meeting... Eric claimed Ed was
    not aware of the situation in the financial
    unit...
  • What type of board meeting is being discussed?
  • Since Ed Masters is present, there is a 75%
    probability it is a board of directors meeting
  • What type of financial unit is referenced?
  • Since the subject is marketing policy, there is a
    65% probability it is the Financial Review Board.

14
Technical preliminaries: POB
  • POB schema (C, >, me, ℘)
  • C is a finite set of classes
  • (C, >) is a directed acyclic graph
  • me produces clusters (disjoint decompositions)
    for each node
  • me(OrganizationUnit): Committee, Board, Team,
    Department, Legal, Executive, Financial,
    Marketing
  • ℘ maps each edge in (C, >) to a positive
    rational number in [0,1]
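
One way to hold such a schema in memory is sketched below. This is an illustration only: the class name `PobSchema`, its fields, the choice to store edges as (subclass, superclass) pairs, the grouping of four OrganizationUnit subclasses into a single cluster, and the 1/4 labels are all assumptions, not taken from the slides.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class PobSchema:
    classes: set = field(default_factory=set)
    edges: dict = field(default_factory=dict)  # (subclass, superclass) -> probability
    me: dict = field(default_factory=dict)     # superclass -> list of clusters

    def add_cluster(self, parent, members):
        """Register one cluster under `parent`; `members` maps subclass -> probability."""
        # Validate every label first so a bad value leaves the schema unchanged.
        for child, p in members.items():
            if not 0 < p <= 1:
                raise ValueError(f"label for {child} must be a positive number in [0, 1]")
        for child, p in members.items():
            self.edges[(child, parent)] = Fraction(p)
        self.classes.add(parent)
        self.classes.update(members)
        self.me.setdefault(parent, []).append(frozenset(members))


schema = PobSchema()
quarter = Fraction(1, 4)  # hypothetical label, chosen only for illustration
schema.add_cluster(
    "OrganizationUnit",
    {"Committee": quarter, "Board": quarter, "Team": quarter, "Department": quarter},
)
```

Rational labels are stored as `Fraction` values so that sums over a cluster can be compared exactly.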

15
Back to the example
Email fragment: Ed Masters opposed the new
marketing policy during the board meeting... Eric
claimed Ed was not aware of the situation in the
financial unit...
16
Constraint probabilities
  • Simple constraints
  • Only for entities NOT represented in the current
    ontology
  • Nil constraint
  • Constraint probabilities
  • Pair (p, γ), with p in [0,1] and γ a
    conjunction of simple constraints

17
Labeling
  • Labeling should not be arbitrary
  • Invalid labeling may lead to time-consuming
    consistency algorithms
  • And to ambiguity in interpreting query answers
  • Valid labeling
  • No constraint refers to the entities associated
    with this ontology
  • There is exactly one nil constraint probability
    on each edge

19
The CPO model
  • CPO (C, >, me, ℘)
  • C is a finite set of classes
  • (C, >) is a directed acyclic graph
  • me produces clusters (disjoint decompositions)
    for each node
  • ℘ is a valid labeling for (C, >)
  • Note: there is no condition on the
    probabilities... yet!

20
CPO enhanced data sources
  • Associate CPOs with some attributes of a
    relation.
  • Associate CPOs with elements in an XML data
    store.
  • Associate CPOs with some keywords for text files.
  • CPOk
  • At most k probabilities on each edge
  • CPO1 is a POB

21
Answering queries
  • Ed Masters opposed the new marketing policy
    during the board meeting... Eric claimed Ed was
    not aware of the situation in the financial
    unit...
  • What type of board meeting is being discussed?
  • Since Ed Masters is present, there is a 75%
    probability it is a board of directors meeting
  • Goal: associate probabilities with possible
    answers.

22
Probability path
Email fragment: Ed Masters opposed the new
marketing policy during the board meeting... Eric
claimed Ed was not aware of the situation in the
financial unit...
23
Probability path
  • if
  • c > c1 > c2 > ... > ck > d
  • f is a function defined on the chain
  • f selects one probability on each edge
  • is the set of constraints selected
    by f along with the probabilities

24
CPO consistency
  • CPO
  • An arbitrary universe of objects O
  • Interpretation e is a mapping from C to 2^O
  • e is a taxonomic model iff
  • We assign objects to each class
  • Objects cannot be shared between classes in the
    same cluster
  • > edges imply subset relations on the sets of
    objects assigned to each class
  • If A > B is labeled with probability p, at least
    p percent of objects in A are also assigned to B
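
The conditions above can be checked mechanically for a given interpretation. The sketch below is an illustration under stated assumptions: each edge is stored as a (subclass, superclass) pair, the label p is a fraction in [0,1], and the "at least p" condition is read as "the subclass holds at least p times the objects of the superclass". The function name and the sample data are hypothetical.

```python
def is_taxonomic_model(e, edges, clusters):
    """Check an interpretation e (class -> set of objects) against the three conditions.

    edges:    dict mapping (subclass, superclass) -> probability label p
    clusters: list of clusters, each a list of class names
    """
    # Classes in the same cluster must not share objects.
    for cluster in clusters:
        seen = set()
        for cls in cluster:
            if seen & e[cls]:
                return False
            seen |= e[cls]
    for (sub, sup), p in edges.items():
        # An edge implies a subset relation between the assigned object sets.
        if not e[sub] <= e[sup]:
            return False
        # Assumed reading of the probability condition.
        if len(e[sub]) < p * len(e[sup]):
            return False
    return True


# Hypothetical interpretation over four objects.
e = {"Unit": {1, 2, 3, 4}, "Board": {1, 2, 3}, "Team": {4}}
edges = {("Board", "Unit"): 0.75, ("Team", "Unit"): 0.25}
clusters = [["Board", "Team"]]
print(is_taxonomic_model(e, edges, clusters))
```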

25
CPO consistency (contd)
  • A CPO is consistent iff it has a taxonomic
    probabilistic model
  • Deciding if a CPO is consistent is NP-complete
  • The weight formula satisfiability problem.
  • A non-deterministic algorithm for consistency
    checking is straightforward.

26
Consistency approach
  • Identify a subclass of CPOs for which we can
    check consistency
  • Two parts
  • Pseudoconsistency: this was done for POBs
  • Well-structuredness: particular to CPOs

27
Pseudoconsistent CPO
  • CPO
  • No two classes in the same cluster have a common
    subclass
  • The graph is rooted
  • Any two distinct immediate subclasses of c
    either
  • Have no common subclass
  • Have a greatest common subclass different from
    them
  • No cycles
  • If c inherits from multiple clusters, all paths
    from descendants of c to the root go through c

28
Pseudoconsistency
29
Weight factor
  • A set P of non-nil constraint probabilities
  • If P is the empty set, wf(P) = 0
  • If P = {(p, γ)}, wf(P) = p
  • wf(P ∪ Q) = wf(P) + wf(Q) - wf(P)·wf(Q)
  • Intuitive meaning: how many objects from class A
    do I have to assign to class B and satisfy the
    constraints?
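
A small sketch of the weight factor follows. The operators in the combination rule are lost in the transcript, so this assumes the rule reads wf(P ∪ Q) = wf(P) + wf(Q) - wf(P)·wf(Q); the function name `wf` follows the slide, while the constraint strings in the example are hypothetical placeholders.

```python
from functools import reduce


def wf(constraint_probs):
    """Weight factor of a collection of (p, constraint) pairs.

    Folds the assumed rule wf(P ∪ Q) = wf(P) + wf(Q) - wf(P)*wf(Q)
    over the pairs, starting from wf(empty set) = 0.
    """
    return reduce(lambda acc, pair: acc + pair[0] - acc * pair[0], constraint_probs, 0.0)


# Two hypothetical constraint probabilities on the same edge.
labels = [(0.5, "subject = marketing"), (0.5, "sender = Ed Masters")]
print(wf(labels))
```

Under this reading the base cases on the slide fall out directly: an empty set gives 0 and a single pair gives its own p.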

30
More weight factors
  • CPO
  • c > d is an edge
  • We write
  • We define
  • Result: conditions of taxonomic interpretation
    can be satisfied by selecting at most w(c,d)·|Od|
    objects from d into c.

31
Well-structured CPO
  • Conditional constraints on edges from the same
    cluster must be disjoint
  • Otherwise, it is impossible to compute a weight
    factor for the cluster edges.
  • The sum of the weight factors for edges in a
    cluster is 1

32
Well-structuredness
33
Consistent CPOs revisited
  • A pseudoconsistent and well-structured CPO is
    consistent
  • Pseudoconsistency accounts for most of the
    conditions in the taxonomic interpretation
  • Well-structuredness accounts for the
    assignment of objects to subclasses

34
Consistency checking algorithm
  • Pseudoconsistency is O(n^2 e) and
    well-structuredness is O(n^2 k^2)
  • n: the number of classes
  • e: the number of edges
  • k: the order of the CPO
  • Algorithm based on
  • Topological sort
  • Dijkstra and derivatives

35
CPO enhanced algebras
  • CPO enhanced algebras formally defined for
  • Relational data sources
  • XML data stores
  • Selection, projection, product, join, etc.
  • Ongoing work
  • RDF enhanced query algebra
  • Directly related to RDF extraction from text.

37
CPO integration motivation
ACME corp. CPO
EVIL corp. CPO
Email from ACME corp. to EVIL corp.: During your
last FO board meeting, the rising costs of
quality assurance were not addressed. We would
like to include this in our next auditing
committee meeting...
38
Merging CPOs
  • Two scenarios
  • One data source that refers to similar entities
    but from different application domains.
  • Example: ACME-EVIL correspondence
  • Queries across multiple data sources
  • Example: two different CPOs associated with
    distinct relations during a join query.

39
Interoperation constraints
  • Since the CPOs being merged refer to similar
    entities, some classes may be equivalent
  • Equality constraints: c1 = c2
  • Possibly immediate subclassing constraints
  • Not really used: hardly feasible

40
The integration problem
  • Two CPOs S1 = (C1, >1, me1, ℘1), S2 = (C2, >2,
    me2, ℘2)
  • Set of interoperation constraints I
  • An integration witness is another CPO S = (C, >,
    me, ℘) that satisfies S1, S2 and I

41
Integration witness
  • Every class c in C1 ∪ C2
  • Appears in C, OR
  • c = d appears in I and d ∈ C
  • i.e. no classes get lost
  • Similarly, no edges are lost
  • No constraints are lost
  • If two identical constraint probabilities are on
    the same edge in both CPOs, take a probability p
    between the two

42
Integration witness
  • Immediate subclassing constraints add edges to S
  • No cluster can be split as a result of merging
  • S is pseudoconsistent and well-structured (if
    it's not, it's of no use)
  • Open problem: if it is not, how can we minimally
    change it such that it has these properties?

43
CPOmerge algorithm
  • CPOmerge produces an integration witness if
    one exists
  • O(n^3): costly
  • In practice, much more efficient through
  • Caching
  • Some properties are preserved if the original
    ontologies are pseudo-consistent and
    well-structured

44
Who writes the interop constraints?
  • User: not feasible
  • How to infer them?
  • Intuitive solution: if enough neighbors are in
    equality constraints, then infer that the
    respective nodes should be equivalent.
  • But we still need some equivalence constraints to
    get started: use lexical distance
  • How many neighbors are enough?

45
ICI: Simple solution
  • Neighbor: parent, immediate child, sibling from
    the same cluster
  • We define
  • ne: number of neighbors in equality constraints
  • nc,d: number of neighbors of c, d
  • Why? Number of equal neighbors / Total number of
    neighbors (including self).
  • Always < 1
  • ICI algorithm: if pe exceeds threshold, assume
    they are equal
  • Start with lexical distance
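
The neighbor ratio can be sketched as below. The exact formula is missing from the transcript, so this is one possible reading and not the authors' definition: it assumes the numerator counts neighbors of c or d that take part in a known equality constraint with a neighbor of the other class, and that "including self" adds one to the total number of neighbors. The function name, class names, and the 0.5 threshold are hypothetical, and class names are assumed to be distinct across the two ontologies.

```python
def equality_score(neighbors_c, neighbors_d, equal_pairs):
    """Ratio of neighbors in equality constraints to all neighbors plus one."""
    matched = set()
    for x in neighbors_c:
        for y in neighbors_d:
            if (x, y) in equal_pairs or (y, x) in equal_pairs:
                matched.update((x, y))  # both endpoints count as equal neighbors
    # Assumed reading of "including self": one extra unit in the denominator,
    # which keeps the ratio strictly below 1.
    total = len(neighbors_c) + len(neighbors_d) + 1
    return len(matched) / total


# Hypothetical neighborhoods of class c (first CPO) and class d (second CPO).
neighbors_c = {"a1", "a2"}
neighbors_d = {"b1", "b2", "b3"}
equal_pairs = {("a1", "b1"), ("a2", "b2")}  # e.g. seeded by lexical distance

score = equality_score(neighbors_c, neighbors_d, equal_pairs)
assume_equal = score > 0.5  # 0.5 is an illustrative threshold
print(score, assume_equal)
```

Inferred equalities could then be added to `equal_pairs` and the scores recomputed until no new pair crosses the threshold.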

47
Give me a CPO
  • Very little work so far on probabilistic
    ontologies.
  • Nothing resembling CPOs around
  • How do we infer them
  • How do we build disjoint decompositions?
  • How do we infer probabilities?

48
Building disjoint decompositions
  • Take regular ontologies from the Web
  • Many sources daml.org, SchemaWeb, OntoBroker
  • Modify CPOmerge to ignore labeling
  • The merge result will contain disjoint
    decompositions
  • Equality constraints can be inferred through ICI

49
Inferring probabilities: simple methods
  • Simple methods
  • Distribute probabilities uniformly within each
    cluster
  • For each cluster L in me(c), d > c,
  • For any distance function (lexical or otherwise)
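
The uniform method can be sketched in a few lines. This covers only the uniform case, since the distance-based formula is not recoverable from the transcript; the function name `uniform_labels`, the dictionary layout, and the choice to key labels by (subclass, superclass) are assumptions.

```python
from fractions import Fraction


def uniform_labels(me):
    """Give each edge into a cluster the label 1/|cluster|.

    me: dict mapping a class to its list of clusters (lists of subclasses).
    Returns a dict mapping (subclass, superclass) -> Fraction.
    """
    labels = {}
    for parent, clusters in me.items():
        for cluster in clusters:
            share = Fraction(1, len(cluster))
            for child in cluster:
                labels[(child, parent)] = share
    return labels


# Hypothetical cluster built from class names in the running example.
me = {"OrganizationUnit": [["Committee", "Board", "Team", "Department"]]}
labels = uniform_labels(me)
print(labels[("Board", "OrganizationUnit")])
```

With this choice the labels of every cluster sum to exactly 1.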

50
Advanced methods
  • Probabilistic relational models with structural
    uncertainty
  • Work by Dr. Getoor et al.
  • Classification approach
  • Feature extraction determines entities of
    interest
  • Create conditional probabilities on those
    entities
  • User feedback approach
  • General, applicable to any of the above
  • (ongoing work)

52
Experimental setup
  • Java implementation
  • CPO enhanced relational DB
  • Movies database maintained by Dr. Wiederhold
  • IMDB data
  • IMDB to estimate recall
  • Classifications from the Web to build initial CPO

53
Consistency check inference
54
Recall
55
Precision
56
Answer quality
57
Query running time
58
ICI quality
59
Bottom line
  • Clear improvement in query answer quality
  • Some time penalty, but reasonable
  • Very little user intervention
  • CPOs are suited for a wide variety of data
    sources
  • Potentially, they can be used to convey semantics
    across heterogeneous data sources

61
Current experimental setup
  • DBLP data
  • over 60 years of scientific publications
  • XML data set
  • CPOs from complex ontologies
  • DBLP classification
  • ACM classification of subjects

62
Goals (1)
  • Determine the efficiency of advanced CPO
    inference methods
  • Experimentally determine the best approach in
    terms of minimizing user feedback

63
Goals (2)
  • Use CPOs with RDF databases
  • For extracting RDF from text as a means of using
    semantic information
  • For answering queries from RDF databases
  • Benefits
  • Probabilistic model is clearly formalized
  • Proven improvement in answer quality
  • Experimentally determine what the probability
    threshold may be for various domains

64
  • Thank you