Databases with Uncertainty and Lineage

1
Databases with Uncertainty and Lineage
  • Omar Benjelloun, Anish Das Sarma, Alon Halevy,
    Martin Theobald, Jennifer Widom

2
Abstract
  • This paper introduces ULDBs, an extension of
    relational databases with simple yet expressive
    constructs for representing and manipulating both
    lineage and uncertainty.
  • The ULDB representation is complete, and it
    permits straightforward implementation of many
    relational operations.
  • Study minimization of ULDB representations under
    both data-minimality and lineage-minimality.

3
Abstract
  • Provide an algorithm for the new operation of
    extracting a database subset in the presence of
    interconnected uncertainty.
  • Show how ULDBs enable a new approach to query
    processing in probabilistic databases.
  • Describe the current state of the Trio system,
    the implementation of ULDBs under development at
    Stanford.

4
Outline
  • Introduction
  • Preliminaries
  • Combining Lineage and Uncertainty
  • Querying ULDBs
  • Confidences and Probabilistic Data
  • The Trio System
  • Related Work
  • Conclusion and Future Work

5
Introduction
  • Motivated by a diverse set of applications (data
    integration, deduplication, scientific data
    management), we became interested in the
    combination of uncertainty and lineage.
  • Relationship between them:
  • 1. Lineage can be used for understanding and
    resolving uncertainty (e.g. web search).
  • 2. Lineage is also important for uncertainty
    within a single database: it correlates the
    uncertainty in source data and in derived results.

6
Uncertainty, Lineage and Data Integration
  • In this paper, we highlight the need for modeling
    uncertainty and lineage in the context of data
    integration systems.
  • Three kinds of uncertainty: the data, the mapping
    between the schemas, and the mapping between data
    objects in different sources.
  • Lineage keeps track of the origins of data,
    giving a powerful tool to manage, explain and
    potentially correct the uncertain truth.

7
Preliminaries
  • Databases with Lineage (LDB)
  • D: a database, made up of a set of relations
  • Each relation is a multiset of tuples, and each
    tuple has a unique identifier
  • The lineage of a tuple identifies the data from
    which it was derived (e.g. by queries)
  • Lineage of derived tuples consists of references
    to other tuples via their unique identifiers.
  • The set of symbols known by an LDB consists of
    all identifiers in its relations plus a set E of
    external symbols.

9
Databases with Lineage
  • When operations are performed, there is often an
    obvious lineage function for the tuples in the
    result (example 1: join).
  • The operations we consider in this paper all have
    simple lineage functions (for negation and
    aggregation, lineage is less obvious).
  • An important aspect of LDBs: we cannot consider
    each relation in isolation (e.g. when querying).
  • Even though a relation may have duplicates, each
    tuple has its own lineage (e.g. tuples 41 and 42).
  • Lineage is particularly important in data
    integration settings.
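
As a rough illustration of the "obvious lineage function" for a join, the sketch below gives each result tuple the identifiers of the two input tuples it was derived from. The relation contents, the tuple identifiers, and the `join_with_lineage` helper are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
from itertools import count


def join_with_lineage(left, right, key_left, key_right, start_id):
    """Join two relations and record the lineage of every result tuple.

    Each relation is a dict mapping a tuple identifier to a row (a dict).
    Returns (result, lineage): result maps fresh identifiers to joined rows,
    lineage maps each fresh identifier to the set of source identifiers.
    """
    ids = count(start_id)
    result, lineage = {}, {}
    for lid, lrow in left.items():
        for rid, rrow in right.items():
            if lrow[key_left] == rrow[key_right]:
                new_id = next(ids)
                result[new_id] = {**lrow, **rrow}
                lineage[new_id] = {lid, rid}  # derived from exactly these two
    return result, lineage


# Hypothetical base relations (identifiers are made up).
saw = {
    11: {"witness": "Amy", "car": "Acura"},
    12: {"witness": "Betty", "car": "Mazda"},
}
drives = {
    21: {"person": "Hank", "car": "Acura"},
    22: {"person": "Jimmy", "car": "Mazda"},
    23: {"person": "Billy", "car": "Mazda"},
}

accuses, lineage = join_with_lineage(saw, drives, "car", "car", start_id=41)
```

Even if two result rows carried identical values, each would keep its own identifier and its own lineage set, which is the point made about duplicates above.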

10
Uncertain Databases
  • An uncertain database represents a set of possible
    instances, each of which is one possible state of
    the database.
11
Uncertain databases
  • Some sets of possible instances cannot be
    represented as x-relations. For example,
    x-relations cannot express the fact that if Amy
    accuses Jimmy (Mazda), then she must also accuse
    Billy.
  • Causes of uncertainty: the source itself, the
    extraction process, contradictory sources.
  • A formalism is complete if it can represent any
    finite set of possible instances.
  • x-relations are not complete.
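
A minimal sketch of the possible-instance semantics, assuming the usual x-relation model in which every x-tuple contributes exactly one of its alternatives, or nothing when it is marked with "?". The data and the `possible_instances` helper are hypothetical.

```python
from itertools import product


def possible_instances(x_relation):
    """Enumerate the possible instances of an x-relation.

    x_relation is a list of (alternatives, maybe) pairs; `maybe` plays the
    role of the '?' annotation. Each instance is a tuple of chosen rows.
    """
    choices = []
    for alternatives, maybe in x_relation:
        options = list(alternatives)
        if maybe:
            options.append(None)  # None stands for "x-tuple absent"
        choices.append(options)
    for pick in product(*choices):
        yield tuple(alt for alt in pick if alt is not None)


# Hypothetical x-relation: Amy saw an Acura or a Mazda; Betty maybe saw an Acura.
saw = [
    ([("Amy", "Acura"), ("Amy", "Mazda")], False),
    ([("Betty", "Acura")], True),
]
instances = list(possible_instances(saw))
```

Because the choice made in one x-tuple never constrains the choice made in another, a dependency such as "if Amy accuses Jimmy then she must accuse Billy" cannot be encoded this way, which is why x-relations alone are not complete.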

12
Combining Lineage and Uncertainty
Semantics of a ULDB: a set of possible instances,
where each instance is an LDB.
15
Completeness
  • 1. Add ?(Pj)(j).
  • 2. Mimic the lineage in Pj.
  • 3. Remove PW, but retain its symbols as E.
  • 4. The construction of D is done.
16
Well-Behaved Lineage
  • Base x-tuple: an x-tuple with empty lineage.
  • A property of well-behaved lineage: the possible
    instances of a well-behaved ULDB are determined
    entirely by the base x-tuples.
17
Proof
  • Obvious (by definition).
  • Assume D1 and D2 pick the same alternatives or ?
    for every base x-tuple, and D1 ≠ D2.
  • S(i,j) has the minimum distance. Contradiction!
18
Uncertainty Considered by Well-behaved Lineage
  • A finite set of base facts that are either
    mutually exclusive or independent.
  • Possibly correlated data derived from these base
    facts in a way that propagates uncertainty but
    does not affect.
  • Thus, well-behaved lineage is well suited for
    database queries.

19
Querying ULDBs
20
DL-monotonic Queries
21
DL-monotonic Queries
  • Intuitively, any operation that can produce its
    results in a tuple-by-tuple fashion is
    DL-monotonic.
  • Selection, projection, join, and union are all
    DL-monotonic.
  • Aggregation, duplicate-elimination, and some set
    operations are not.
  • In the remainder of this section, we assume all
    queries Q to be DL-monotonic.
  • In follow-on work, we are extending the approach
    to other operations.
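
To make the "tuple-by-tuple" intuition concrete, here is a small sketch of a selection applied to an x-relation. Each surviving alternative points back to the single source alternative it came from. The data layout, the identifiers, and the rule used for the "?" flag (the result x-tuple is optional if the source was optional or if any alternative was filtered out) are assumptions made for this illustration.

```python
def select(x_relation, predicate, start_xid):
    """Apply a selection to an x-relation one alternative at a time.

    x_relation maps an x-tuple id to {"alts": [...], "maybe": bool}.
    Returns (result, lineage), where lineage maps (new_xid, j) to the set
    containing the source alternative (xid, i) it was derived from.
    """
    result, lineage = {}, {}
    next_xid = start_xid
    for xid, xt in x_relation.items():
        kept = [(i, row) for i, row in enumerate(xt["alts"], start=1)
                if predicate(row)]
        if not kept:
            continue  # no alternative satisfies the predicate
        result[next_xid] = {
            "alts": [row for _, row in kept],
            # Optional if the source was, or if some alternative was dropped.
            "maybe": xt["maybe"] or len(kept) < len(xt["alts"]),
        }
        for j, (i, _) in enumerate(kept, start=1):
            lineage[(next_xid, j)] = {(xid, i)}
        next_xid += 1
    return result, lineage


saw = {
    11: {"alts": [("Amy", "Acura"), ("Amy", "Mazda")], "maybe": False},
    12: {"alts": [("Betty", "Acura")], "maybe": True},
}
acuras, lineage = select(saw, lambda row: row[1] == "Acura", start_xid=31)
```

Each result alternative is produced from one input alternative without looking at any other, which is what fails for aggregation or duplicate elimination.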

22
Applying a Query to a ULDB
23
  • Based on evaluating Q over a conventional
    database, so complexity does not increase due to
    uncertainty.
  • We can implement this algorithm readily using a
    standard DBMS.
25
Proof(sketch)
DiDi Rqi Q(Di)
If D is well-behaved, then D' is well-behaved.
26
ULDB Minimality
  • Data minimality

x-tuple 6 is impossible, Extraneous!
27
t appears in a possible instance with D, NOT
EXTRANEOUS!
Proof
(Obviously)
28
the number of alternatives in x-tuple ti, that
are not extraneous
the set of base x-tuples, from which ti is
derived
30
Lineage Minimality
31
Membership Queries
32
Extraction
  • 1. After querying, we may not want the original
    set D.
  • 2. Constraints across relations make it
    interesting.
  • 3. Important in data integration.
33
Extraction
  • Ensure the possible instances of the extracted
    relations are preserved.
  • Lineage that is not within the extracted
    relations is converted from internal to external.
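
The second point can be sketched as a simple partition of each kept tuple's lineage. This only illustrates the internal/external split; it does not attempt the full extraction algorithm that preserves possible instances. The relation names, identifiers, and the `externalize` helper are hypothetical.

```python
def externalize(lineage, extracted_relations):
    """Split lineage references into internal and external ones.

    lineage maps (relation, tuple_id) to a set of (relation, tuple_id)
    references. References to relations outside the extracted subset are
    reported as external symbols.
    """
    result = {}
    for tup, sources in lineage.items():
        if tup[0] not in extracted_relations:
            continue  # this tuple is not part of the extracted subset
        result[tup] = {
            "internal": {s for s in sources if s[0] in extracted_relations},
            "external": {s for s in sources if s[0] not in extracted_relations},
        }
    return result


lineage = {
    ("Accuses", 41): {("Saw", 11), ("Drives", 21)},
    ("Suspects", 71): {("Accuses", 41)},
}
extracted = externalize(lineage, {"Accuses", "Suspects"})
```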

34
Extraction
35
Confidences and Probabilistic Data
  • Confidence Values
  • We assume ULDBs to be well-behaved and
    D-minimized.

36
Confidence Values
  • Example

For example, the possible instance where Amy saw
an Acura, Betty saw a Mazda, and Hank does not
drive an Acura has confidence
0.8 × 0.6 × (1 − 0.6) = 0.192, i.e. about 0.2.
  • Two properties:
  • The sum of the probabilities of a ULDB's possible
    instances is 1.
  • The confidence of a base alternative a (resp.
    ? on an x-tuple t) equals the sum of the
    confidences of the possible instances where a
    (resp. no alternative of t) is picked.
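
The example and the first property can be checked with a short sketch that multiplies the confidences of independent base choices. The values 0.8 (Amy, Acura), 0.6 (Betty, Mazda) and 0.6 (Hank, Acura) come from the example above; the 0.4 given to Betty's other alternative, the layout of the `base` dictionary, and the helper names are assumptions for illustration.

```python
from itertools import product
from math import prod

# Base x-tuples: alternatives with confidences. Any leftover probability
# mass is treated as the '?' choice (x-tuple absent).
base = {
    "Amy": [("Acura", 0.8)],
    "Betty": [("Acura", 0.4), ("Mazda", 0.6)],  # 0.4 is an assumed value
    "Hank": [("Acura", 0.6)],
}


def options(alternatives):
    """Return the alternatives plus an explicit '?' option if mass is left."""
    leftover = 1.0 - sum(conf for _, conf in alternatives)
    opts = list(alternatives)
    if leftover > 1e-12:
        opts.append((None, leftover))
    return opts


def instance_confidences(base):
    """Yield (choice, confidence) for every possible instance."""
    names = list(base)
    for pick in product(*(options(base[n]) for n in names)):
        choice = {n: label for n, (label, _) in zip(names, pick)}
        yield choice, prod(conf for _, conf in pick)


instances = list(instance_confidences(base))
target = {"Amy": "Acura", "Betty": "Mazda", "Hank": None}
conf = next(c for choice, c in instances if choice == target)  # about 0.192
total = sum(c for _, c in instances)                           # about 1.0
```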

37
Query processing
  • Lineage allows us to decouple ULDB query
    processing with confidences into two steps: data
    computation and confidence computation.
  • The confidence value for every result alternative
    a is a function of the confidence values for the
    base alternatives reachable by a's transitive
    lineage. Thus, we can compute the confidence
    afterward.

38
Confidence Computation
  • Example

39
Confidence Computation
  • Example

Assuming alternatives (61,1) and (62,1) are independent.
40
Confidence Computation
  • Example

Query plan 3: decoupled approach
  • First compute the data: (Hank), ID 71.
  • Then compute the confidence value:
    λ(71,1) = (51,1) ∧ ((11,1) ∨ (12,1))
    Pr(71,1) = Pr((51,1) ∧ ((11,1) ∨ (12,1)))
             = 0.6 × 0.88 = 0.528
  • Advantages
  • The data computation step has the flexibility to
    use the most efficient execution plan.
  • The confidence values can be computed selectively
    and on-demand.
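
The confidence step of the decoupled approach can be sketched by evaluating the lineage formula of (71,1) over independent base alternatives. The base confidences 0.8 for (11,1) and 0.4 for (12,1) are assumptions chosen to be consistent with the 0.88 in the example (1 − 0.2 × 0.6 = 0.88); the brute-force `formula_probability` helper is for illustration only and is not the method used by Trio.

```python
from itertools import product


def formula_probability(formula, probs):
    """Probability that a Boolean lineage formula is true.

    probs maps each independent base alternative to its confidence.
    All truth assignments are enumerated, so this is exponential in the
    number of base alternatives and only suitable for tiny examples.
    """
    names = sorted(probs)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total


# Assumed confidences of the base alternatives (see lead-in).
probs = {(11, 1): 0.8, (12, 1): 0.4, (51, 1): 0.6}


def lineage_71(world):
    """Lineage of (71,1): (51,1) AND ((11,1) OR (12,1))."""
    return world[(51, 1)] and (world[(11, 1)] or world[(12, 1)])


conf_71 = formula_probability(lineage_71, probs)  # about 0.528
```

Since the data for (71,1) is computed first and the formula is only evaluated afterward, the confidence can be skipped entirely for results nobody asks about.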

41
Methods for optimizing the computation
  • The confidence value for a derived alternative
    can be computed from a set of closest
    independent descendants(CIDs).
  • CIDs also enable memoization, which avoids
    performing redundant confidence
    computations. (Trio)
  • If transitive lineage is maintained, it can then
    also be applied to speed up confidence
    computation.
  • In the case where we wish to compute confidence
    for an x-tuple or an entire x-relation, batch
    techniques can be used based on the structure
    guaranteed by well-behaved lineage. (Trio)

42
The Trio System
  • A relational DBMS that supports uncertainty and
    lineage.
  • Based on the ULDB data model; accepts queries in
    the TriQL language (an extension of SQL with
    uncertainty- and lineage-specific features).
  • The current incarnation of the Trio system
    (Trio-One) is primarily layered on top of a
    conventional relational DBMS.

43
General architecture
45
Encoding ULDB Data
An x-relation T whose x-tuples may have confidence
and lineage is stored with these extra columns:
  • aid: a unique alternative identifier (across the
    table)
  • xid: identifies the x-tuple that this alternative
    belongs to (across the table)
  • conf: the confidence of the alternative
  • num: a nonnegative integer that tracks whether
    the alternative's x-tuple has a ?
Lineage information is stored in a separate table
lin_T(aid, src_aid, src_table). A tuple (a1, a2, T2)
in lin_T means that T's alternative a1 has
alternative a2 from table T2 in its lineage.
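
A minimal sketch of this encoding using Python's built-in `sqlite3` module. The column names `aid`, `xid`, `conf`, `num` and the `lin_T(aid, src_aid, src_table)` layout follow the description above; the `Accuses` table, its data columns, the identifiers, and the value stored in `num` are hypothetical placeholders, since the slides do not spell out how `num` is interpreted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Accuses (
        aid     INTEGER PRIMARY KEY,  -- unique alternative identifier
        xid     INTEGER NOT NULL,     -- x-tuple this alternative belongs to
        conf    REAL,                 -- confidence of the alternative
        num     INTEGER NOT NULL,     -- '?' bookkeeping (placeholder value)
        witness TEXT,
        person  TEXT
    );
    CREATE TABLE lin_Accuses (
        aid       INTEGER NOT NULL,   -- alternative in Accuses
        src_aid   INTEGER NOT NULL,   -- alternative it was derived from
        src_table TEXT NOT NULL       -- table holding that source alternative
    );
""")

# One hypothetical alternative and its two lineage references.
conn.execute(
    "INSERT INTO Accuses VALUES (?, ?, ?, ?, ?, ?)",
    (1, 71, None, 1, "Amy", "Hank"),
)
conn.executemany(
    "INSERT INTO lin_Accuses VALUES (?, ?, ?)",
    [(1, 11, "Saw"), (1, 51, "Drives")],
)

# Look up the immediate lineage of alternative 1.
sources = conn.execute(
    "SELECT src_table, src_aid FROM lin_Accuses "
    "WHERE aid = ? ORDER BY src_table",
    (1,),
).fetchall()
```

Keeping lineage in its own table means the data columns can be queried with ordinary SQL, which fits a system layered on a conventional DBMS.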
46
Encoding ULDB Data
47
Trio Queries
Two versions: transient and stored.
48
Basic Rewriting Scheme
  • Two phases: translation phase and execution phase
  • Tfetch: call to the original TriQL query
  • Sfetch: call to an SQL query
  • A Tfetch may call several Sfetches.
  • num field (example)
  • For the stored version: create table, insert
49
Built-In Predicates and Functions
  • Conf()
  • Maybe()
  • true if and only if the x-tuple has a ?
  • Lineage()
  • lineage(x,y) is true whenever y is reachable
    from x by following lineage
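
The reachability test behind the lineage predicate can be sketched in Python as a graph search over lineage references. This is an analogue for illustration, not TriQL syntax; the `reachable` helper and the identifiers in the sample lineage map are hypothetical.

```python
def reachable(lineage, x, y):
    """Return True if y can be reached from x by following lineage edges.

    lineage maps an identifier to the set of identifiers it was derived
    from. The search is transitive: it follows one or more edges.
    """
    stack, seen = [x], set()
    while stack:
        node = stack.pop()
        for src in lineage.get(node, ()):
            if src == y:
                return True
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return False


# Hypothetical lineage: 71 derives from 51 and 61; 61 derives from 11 and 12.
lineage = {71: {51, 61}, 61: {11, 12}}

direct = reachable(lineage, 71, 51)      # one step
transitive = reachable(lineage, 71, 11)  # two steps, via 61
unrelated = reachable(lineage, 61, 51)   # 51 is not in 61's lineage
```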

50
Querying uncertainty
51
Query-Defined Result Confidence
  • Confidence is computed on-demand
  • (A COMPUTE CONFIDENCE clause can be added to a
    query)
  • A query can override the default result
    confidence values

52
Additional Trio Features
  • Lineage
  • Maintains a schema-level lineage graph.
  • ExplainLineage(), BaseLineage()
  • Confidence Computation
  • invoke BaseLineage(a), and compute the
    confidence based on base alternatives
  • Coexistence Checks
  • Extraneous Data Removal

They are interconnected, sharing code.
53
Related Work
  • No previously proposed representation integrates
    both lineage and uncertainty.

54
Conclusions and Future Work
  • ULDBs: a representation for databases with both
    uncertainty and lineage.
  • DL-monotonic queries and their lineage.
  • The Trio system.
  • The main features of ULDBs are crucial for data
    integration applications.
  • Open question: ULDBs are not expressive enough to
    fully represent the effects on data of some
    complex data integration operations.
  • Another challenging question in data integration:
    the relationship between the data in a ULDB and
    the data in the data sources.

55
Current and Future Directions of Work in ULDBs
  • Updates
  • primitives, algorithms
  • Implementation
  • data layout, indexing, partitioning
  • likely to entail modifying the prototype
    inside a DBMS
  • Theory
  • dependency theory, query containment,
    sampling, and statistics
  • Long-Term Goals
  • uncertainty in the form of continuous
    distributions, incomplete relations, and
    versioning of data, uncertainty and lineage.

56
THE END!
Thank you all !