Title: Databases with Uncertainty and Lineage
1Databases with Uncertainty and Lineage
- Omar Benjelloun, Anish Das Sarma, Alon Halevy,
Martin Theobald, Jennifer Widom
2Abstract
- This paper introduces ULDBs, an extension of
relational databases with simple yet expressive
constructs for representing and manipulating both
lineage and uncertainty.
- The ULDB representation is complete, and it
permits straightforward implementation of many
relational operations.
- Study minimization of ULDB representations under
both data-minimality and lineage-minimality.
3Abstract
- Provide an algorithm for the new operation of
extracting a database subset in the presence of
interconnected uncertainty.
- Show how ULDBs enable a new approach to query
processing in probabilistic databases.
- Describe the current state of the Trio system,
the implementation of ULDBs under development at
Stanford.
4Outline
- Introduction
- Preliminaries
- Combining Lineage and Uncertainty
- Querying ULDBs
- Confidences and Probabilistic Data
- The Trio System
- Related Work
- Conclusion and Future Work
5Introduction
- Motivated by a diverse set of applications (data
integration, deduplication, scientific data
management), we became interested in the
combination of uncertainty and lineage.
- Relationship between them:
- 1. Lineage can be used for understanding and
resolving uncertainty (e.g., web search).
- 2. Lineage is also important for uncertainty
within a single database: it correlates uncertainty
in the source data with uncertainty in the result.
6Uncertainty, Lineage and Data Integration
- In this paper, we highlight the need for modeling
uncertainty and lineage in the context of data
integration systems.
- Three kinds of uncertainty: the data, the mapping
between the schemas, and the mapping between data
objects in different sources.
- Lineage keeps track of the origins of data,
giving a powerful tool to manage, explain, and
potentially correct the uncertain truth.
7Preliminaries
- Databases with Lineage (LDB)
- D: a database
- a set of relations
- each relation: a multiset of tuples
- all identifiers in relations
- Lineage of a tuple identifies the data from
which it was derived (e.g., by queries).
- Lineage of derived tuples consists of references
to other tuples via their unique identifiers.
- The set of symbols known by an LDB consists of all
identifiers in its relations together with a set E
of external symbols.
8(No Transcript)
9Databases with Lineage
- When operations are performed, there is often an
obvious lineage function for the tuples in the
result (Example 1: join).
- The operations we consider in this paper all have
simple lineage functions (for negation and
aggregation, lineage is less obvious).
- An important aspect of LDBs: we cannot consider
each relation in isolation (queries).
- Even though a relation may have duplicates, each
tuple has its own lineage (e.g., tuples 41 and 42).
- Lineage is particularly important in data
integration settings.
10Uncertain Databases
An uncertain database
represents a set of possible instances, each of
which is one possible state of the database.
11Uncertain databases
Can't be represented as x-relations: they can't express
the fact that if Amy accuses Jimmy (Mazda), then she
must accuse Billy.
- Causes of uncertainty: the source itself, the process
of extracting, contradictory sources.
- Complete: the formalism can represent any
finite set of possible instances.
- x-relations are not complete.
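To make the possible-instances semantics concrete, here is a minimal Python sketch that enumerates the instances of a small x-relation. The data layout (a list of `(alternatives, maybe)` pairs), the function name `possible_instances`, and the sample rows are assumptions made for illustration, not the representation used in the paper or in Trio.

```python
from itertools import product

# Hypothetical x-relation Saw(witness, car). Each x-tuple is a pair
# (alternatives, maybe); maybe=True stands for the '?' annotation.
saw = [
    ([("Amy", "Acura")], True),                         # may be absent
    ([("Betty", "Acura"), ("Betty", "Mazda")], False),  # exactly one is present
]

def possible_instances(x_relation):
    """Yield every possible instance: pick one alternative per x-tuple,
    or no alternative at all when the x-tuple carries a '?'."""
    choices = []
    for alternatives, maybe in x_relation:
        options = list(alternatives)
        if maybe:
            options.append(None)  # None means the x-tuple is absent
        choices.append(options)
    for pick in product(*choices):
        yield [t for t in pick if t is not None]

instances = list(possible_instances(saw))
for inst in instances:
    print(inst)
```

Because the choice for each x-tuple is made independently, the enumeration is always a full cross product. That is why a dependency between two x-tuples, such as "if Amy accuses Jimmy, then she must accuse Billy", cannot be expressed with x-relations alone.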
12Combining Lineage and Uncertainty
Semantics of a ULDB: a set of possible instances,
where each instance is an LDB.
13(No Transcript)
14(No Transcript)
15Completeness
1. Add ?(Pj)(j).
2. Mimic the lineage in Pj.
3. Remove PW but retain its symbols as external symbols E.
4. The construction of D is done.
16Well-Behaved Lineage
Base x-tuple: an x-tuple with empty lineage.
A property of well-behaved lineage: the possible
instances of a well-behaved ULDB are determined
entirely by the base x-tuples.
17Proof
Obviously (by definition).
Assume D1 and D2 pick the same alternatives or ? for
every base x-tuple, and D1 ≠ D2.
S(i,j) has the minimum distance. Contradiction!
18Uncertainty Considered by Well-behaved Lineage
- A finite set of base facts that are either
mutually exclusive or independent.
- Possibly correlated data derived from these base
facts in a way that propagates uncertainty but
does not affect it.
- Thus, well-behaved lineage is well suited for
database queries.
19Querying ULDBs
20DL-monotonic Queries
21DL-monotonic Queries
- Intuitively, any operation that can produce its
results in a tuple-by-tuple fashion is
DL-monotonic.
- Selection, projection, join, and union are all
DL-monotonic.
- Aggregation, duplicate-elimination, and some set
operations are not.
- In the remainder of this section, we assume all
queries Q to be DL-monotonic.
- In follow-on work, we are extending the approach
to other operations.
22Applying a Query to a ULDB
23The algorithm is based on evaluating Q over a
conventional database, so complexity does not
increase due to uncertainty. We can implement this
algorithm readily using a standard DBMS.
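The idea of evaluating a query alternative by alternative and recording lineage can be sketched as follows. This is a simplified illustration under stated assumptions: the relations, identifiers, the function `join_with_lineage`, and the grouping of result alternatives by their source x-tuples are all hypothetical, and '?' handling is omitted. It is not the algorithm as implemented in Trio.

```python
from collections import defaultdict

# Hypothetical data. Keys are alternative ids (x-tuple id, alternative index).
saw = {
    (11, 1): ("Amy", "Acura"),
    (12, 1): ("Betty", "Acura"),
    (12, 2): ("Betty", "Mazda"),
}
drives = {
    (51, 1): ("Hank", "Acura"),
    (52, 1): ("Jimmy", "Mazda"),
}

def join_with_lineage(left, right, first_xid):
    """Join on car, treating each alternative as an ordinary tuple.
    Results derived from the same pair of source x-tuples become
    alternatives of one result x-tuple; each result alternative keeps
    the ids of the two alternatives it was derived from as lineage."""
    groups = defaultdict(list)
    for (lx, la), (witness, seen_car) in left.items():
        for (rx, ra), (driver, car) in right.items():
            if seen_car == car:
                groups[(lx, rx)].append(
                    ((witness, driver), {(lx, la), (rx, ra)})
                )
    result, lineage = {}, {}
    for offset, key in enumerate(sorted(groups)):
        for idx, (values, sources) in enumerate(groups[key], start=1):
            aid = (first_xid + offset, idx)
            result[aid] = values
            lineage[aid] = sources
    return result, lineage

accuses, lam = join_with_lineage(saw, drives, first_xid=31)
for aid in sorted(accuses):
    print(aid, accuses[aid], sorted(lam[aid]))
```

The join itself is an ordinary nested-loop join over plain tuples, which is the point of the slide: the uncertainty is carried by the recorded lineage rather than by a more expensive evaluation strategy.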
24(No Transcript)
25Proof(sketch)
Di' = Di + Ri, where Ri = Q(Di)
If D is well-behaved, then D' is well-behaved.
26ULDB Minimality
x-tuple 6 is impossible, Extraneous!
27t appears in a possible instance with D, NOT
EXTRANEOUS!
Proof
(Obviously)
28 the number of alternatives in x-tuple ti, that
are not extraneous
the set of base x-tuples, from which ti is
derived
29(No Transcript)
30Lineage Minimality
31Membership Queries
32Extraction
1. After querying, we may not want the original
set D.
2. Constraints across relations make it
interesting.
3. Important in data integration.
33Extraction
- Ensure the possible instances of the extracted
relations are preserved.
- Lineage that is not within the extracted
relations is converted from internal to external.
34Extraction
35Confidences and Probabilistic Data
- Confidence Values
- We assume ULDBs to be well-behaved and
D-minimized.
36Confidence Values
For example, the possible instance where Amy saw
an Acura, Betty saw a Mazda, and Hank does not
drive an Acura has confidence
0.8 × 0.6 × (1 − 0.6) = 0.192 ≈ 0.20.
- Two properties
- The sum of the confidences of its possible
instances is 1.
- The confidence of a base alternative a (resp.
? on an x-tuple t) equals the sum of the
confidences of the possible instances where a
(resp. no alternative of t) is picked.
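The two properties can be checked numerically on a small example. The sketch below assumes three independent base x-tuples; the confidences 0.8, 0.6, and 0.6 follow the slide's arithmetic, while the value 0.4 for Betty's other alternative, the dictionary layout, and the function name are assumptions made for illustration.

```python
from itertools import product
from math import isclose, prod

# Hypothetical base x-tuples: alternative -> confidence.
# Any probability mass not covered by the alternatives belongs to '?'.
base = {
    "amy":   {"Acura": 0.8},
    "betty": {"Acura": 0.4, "Mazda": 0.6},
    "hank":  {"Acura": 0.6},
}

def instances_with_confidence(x_tuples):
    """Yield (picks, confidence) for every possible instance, assuming
    the base x-tuples are independent of one another."""
    names = list(x_tuples)
    options = []
    for name in names:
        alts = list(x_tuples[name].items())
        absent = 1.0 - sum(x_tuples[name].values())
        if absent > 1e-12:
            alts.append((None, absent))  # None = '?' picked
        options.append(alts)
    for combo in product(*options):
        picks = {n: alt for n, (alt, _) in zip(names, combo)}
        yield picks, prod(c for _, c in combo)

all_instances = list(instances_with_confidence(base))

# Property 1: the confidences of all possible instances sum to 1.
total = sum(conf for _, conf in all_instances)

# Property 2: the confidence of a base alternative equals the summed
# confidence of the instances in which it is picked.
amy_acura = sum(c for picks, c in all_instances if picks["amy"] == "Acura")

# The instance from the slide: Amy/Acura, Betty/Mazda, Hank absent.
target = next(
    c for picks, c in all_instances
    if picks == {"amy": "Acura", "betty": "Mazda", "hank": None}
)
print(isclose(total, 1.0), isclose(amy_acura, 0.8), round(target, 3))
```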
37Query processing
- Lineage allows us to decouple ULDB query
processing with confidences into two steps: data
computation and confidence computation.
- The confidence value for every result alternative
a is a function of the confidence values for the
base alternatives reachable through a's transitive
lineage. Thus, we can compute the confidence
afterward.
38Confidence Computation
39Confidence Computation
Assuming alternatives (61,1) and (62,1) are independent.
40Confidence Computation
Query plan 3: decoupled approach
First compute the data: (Hank), with ID 71.
Then compute the confidence value:
λ(71,1) = (51,1) ∧ ((11,1) ∨ (12,1))
Pr(71,1) = Pr((51,1) ∧ ((11,1) ∨ (12,1)))
= 0.6 × 0.88 = 0.528
- Advantages
- The data computation step has the flexibility to
use the most efficient execution plan.
- The confidence values can be computed selectively
and on-demand.
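The confidence step of the decoupled approach can be sketched as evaluating the lineage formula of alternative (71,1), that is (51,1) AND ((11,1) OR (12,1)), over the confidences of the base alternatives. The base confidences 0.8, 0.4, and 0.6 are assumed values chosen to match the slide's 0.6 × 0.88 arithmetic, and the helpers `p_and` and `p_or` are hypothetical. The product rules used here are valid only because the three base alternatives belong to different, independent x-tuples.

```python
# Assumed confidences of the base alternatives reachable from (71,1).
base_conf = {(11, 1): 0.8, (12, 1): 0.4, (51, 1): 0.6}

def p_and(*probs):
    """Probability that all independent events hold."""
    out = 1.0
    for p in probs:
        out *= p
    return out

def p_or(*probs):
    """Probability that at least one independent event holds."""
    none_hold = 1.0
    for p in probs:
        none_hold *= 1.0 - p
    return 1.0 - none_hold

# Lineage of (71,1): (51,1) AND ((11,1) OR (12,1)).
conf_71_1 = p_and(
    base_conf[(51, 1)],
    p_or(base_conf[(11, 1)], base_conf[(12, 1)]),
)
print(round(conf_71_1, 3))
```

Since the formula is evaluated only over base alternatives, the data computation step is free to use any execution plan; the confidence is attached afterward and only for the results that need it.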
41Methods for optimizing the computation
- The confidence value for a derived alternative
can be computed from a set of closest
independent descendants (CIDs).
- CIDs also enable memoization, which avoids
performing redundant confidence
computations. (Trio)
- If transitive lineage is maintained, it can then
also be applied to speed up confidence
computation.
- In the case where we wish to compute confidence
for an x-tuple or an entire x-relation, batch
techniques can be used based on the structure
guaranteed by well-behaved lineage. (Trio)
42The Trio System
- A relational DBMS that supports uncertainty and
lineage.
- Based on the ULDB data model, and accepts queries
in the TriQL language (an extension of SQL with
uncertainty and lineage-specific features).
- The current incarnation of the Trio
system (Trio-One) is primarily layered on top of a
conventional relational DBMS.
43General architecture
44(No Transcript)
45Encoding ULDB Data
An x-relation whose x-tuples may have confidence
and lineage.
- aid: a unique alternative identifier (across the
table)
- xid: identifies the x-tuple that this
alternative belongs to (across the table)
- conf: the confidence of the alternative
- num: a nonnegative integer that tracks whether the
alternative's x-tuple has a ?
Lineage information is stored in a separate table
lin_T(aid, src_aid, src_table). A tuple (a1, a2, T2) in
lin_T means that T's alternative a1 has alternative
a2 from table T2 in its lineage.
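A rough sketch of this encoding on top of a conventional DBMS, using Python's built-in `sqlite3` module. Only the column names `aid`, `xid`, `conf`, `num` and the lineage table shape `(aid, src_aid, src_table)` come from the slide. The table names `saw_enc` and `lin_accuses`, the data columns, the sample rows, and the convention that `num = 1` marks an x-tuple with a '?' are assumptions for illustration; Trio's actual encoding of `num` may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE saw_enc (
    aid     INTEGER PRIMARY KEY,  -- unique alternative identifier
    xid     INTEGER NOT NULL,     -- x-tuple this alternative belongs to
    conf    REAL,                 -- confidence of the alternative
    num     INTEGER NOT NULL,     -- assumed: 1 if the x-tuple has a '?', else 0
    witness TEXT,
    car     TEXT
);
CREATE TABLE lin_accuses (
    aid       INTEGER,            -- derived alternative
    src_aid   INTEGER,            -- alternative in its lineage
    src_table TEXT                -- table that holds src_aid
);
""")

conn.executemany(
    "INSERT INTO saw_enc VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 11, 0.8, 1, "Amy", "Acura"),
        (2, 12, 0.4, 0, "Betty", "Acura"),
        (3, 12, 0.6, 0, "Betty", "Mazda"),
    ],
)
# Hypothetical derived alternative 10 with alternative 1 in its lineage.
conn.execute("INSERT INTO lin_accuses VALUES (?, ?, ?)", (10, 1, "saw_enc"))

# All alternatives of x-tuple 12.
alts_of_12 = conn.execute(
    "SELECT aid, car, conf FROM saw_enc WHERE xid = 12 ORDER BY aid"
).fetchall()

# Lineage of derived alternative 10.
sources_of_10 = conn.execute(
    "SELECT src_aid, src_table FROM lin_accuses WHERE aid = 10"
).fetchall()
print(alts_of_12, sources_of_10)
```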
46Encoding ULDB Data
47Trio Queries
Two versions: transient and stored
48Basic Rewriting Scheme
- Two phases
- Translation phase, execution phase
Tfetch: call to the original TriQL query
Sfetch: call to the SQL query
A Tfetch may call several Sfetches.
num field
example
For the stored version: create table, insert
49Built-In Predicates and Functions
- Conf()
- Maybe()
- true if and only if the x-tuple has a ?
- Lineage()
- lineage(x,y) is true whenever y is reachable
from x
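The reachability test behind a predicate like lineage(x,y) can be sketched as a graph traversal over lineage links. The dictionary layout, the identifiers, and the function `lineage_reachable` are hypothetical, and the choice that x does not count as reachable from itself is an assumption; this is not TriQL's implementation.

```python
def lineage_reachable(lineage, x, y):
    """Return True if y is reachable from x by following lineage links
    for one or more steps."""
    stack = list(lineage.get(x, ()))
    seen = set()
    while stack:
        current = stack.pop()
        if current == y:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(lineage.get(current, ()))
    return False

# Hypothetical lineage: alternative id -> ids it was derived from.
lam = {
    (71, 1): {(31, 1), (32, 1)},
    (31, 1): {(11, 1), (51, 1)},
    (32, 1): {(12, 1), (51, 1)},
}

print(lineage_reachable(lam, (71, 1), (11, 1)))  # follows two links
print(lineage_reachable(lam, (31, 1), (12, 1)))  # no path
```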
50Querying uncertainty
51Query-Defined Result Confidence
- Confidence is computed on-demand
- (A COMPUTE CONFIDENCE clause can be added to a
query)
- A query can override the default result
confidence values
52Additional Trio Features
- Lineage
- Maintains a schema-level lineage graph.
- ExplainLineage(), BaseLineage()
- Confidence Computation
- invoke BaseLineage(a), and compute the
confidence based on base alternatives
- Coexistence Checks
- Extraneous Data Removal
They are interconnected, sharing code.
53Related Work
- No previously proposed representation
integrates both lineage and uncertainty.
54Conclusions and Future Work
- ULDBs: a representation for databases with both
uncertainty and lineage.
- DL-monotonic queries and their lineage.
- The Trio system.
- The main features of ULDBs are crucial for data
integration applications.
- Open question: ULDBs are not expressive enough to
fully represent the effects on data of some
complex data integration operations.
- Another challenging question in data integration:
the relationship between the data in a ULDB and
the data in the data sources.
55Current and Future Directions of Work in ULDBs
- Updates
- primitives, algorithms
- Implementation
- data layout, indexing, partitioning
- likely to entail modifying the prototype
inside a DBMS
- Theory
- dependency theory, query containment,
sampling, and statistics
- Long-Term Goals
- uncertainty in the form of continuous
distributions, incomplete relations, and
versioning of data, uncertainty and lineage.
56THE END!
Thank you all !