Trio: A System for Data, Uncertainty, and Lineage - PowerPoint PPT Presentation

1 / 61
About This Presentation
Title:

Trio: A System for Data, Uncertainty, and Lineage

Description:

Trio: A System for Data, Uncertainty, and Lineage Jennifer Widom et al Stanford University – PowerPoint PPT presentation

Number of Views:158
Avg rating:3.0/5.0
Slides: 62
Provided by: Jennife1316
Category:

less

Transcript and Presenter's Notes

Title: Trio: A System for Data, Uncertainty, and Lineage


1
Trio A System for Data, Uncertainty, and Lineage
  • Jennifer Widom et al
  • Stanford University

2
Depiction of Trio Project
3
Some Context
  • Trio Project
  • Theyre building a new kind of DBMS in which
  • Data
  • Accuracy
  • Lineage
  • are all first-class interrelated concepts

4
The Trio in Trio
  • Data
  • Student 123 is majoring in Econ (123,Econ) ?
    Major
  • Uncertainty
  • Student 123 is majoring in Econ or CS
  • (123, Econ ? CS) ? Major
  • With confidence 60 student 456 is a CS major
  • (456, CS 0.6) ? Major
  • Lineage
  • 456 ? HardWorker derived from
  • (456, CS) ? Major
  • CS is hard ? some web page

5
Depiction
Data
Uncertainty
Lineage (sourcing)
6
Original Motivation for the Project
  • New Application Domains
  • Many involve data that is uncertain
  • (approximate, probabilistic, inexact, incomplete,
    imprecise, fuzzy, inaccurate,...)
  • Many of the same ones need to track the lineage
    (provenance) of their data

Neither uncertainty nor lineage is supported in
current database systems
7
Importance of Lineage
  • technical Lineage
  • Enables simple and consistent representation of
    uncertain data
  • Correlates uncertainty in query results with
    uncertainty in the input data
  • Can make computation over uncertain data more
    efficient
  • fluffy Applications use lineage to reduce or
    resolve uncertainty

8
Sample Applications
  • Information extraction
  • Find label entities in unstructured text
  • Often probabilistic
  • Information integration
  • Combine data from multiple sources
  • Inconsistencies
  • Scientific experiments
  • Inexact/incomplete data
  • Many levels of derived data products

9
Sample Applications
  • Sensor data management
  • Approximate readings
  • Missing readings
  • Levels of data aggregation
  • Deduplication (data cleaning)
  • Object linkage, entity resolution
  • Often heuristic/probabilistic
  • Approximate query processing
  • Fast but inexact answers

10
The Usual DBMS Features
  • (From first lecture of any Intro to
    Databases class)
  • Efficient,
  • Convenient,
  • Safe,
  • Multi-User storage of and access to
  • Massive amounts of
  • Persistent data

11
Completeness vs. Closure
Completeness blueyellow
All sets-of-instances
Representable sets-of-instances
Closure arrow stays in blue
Op2
Op1
Proposition An incomplete representation is
still interesting if its expressive enough and
closed under all required operations
12
Operations Semantics
  • Easy and natural (re)definition for any standard
  • database operation (call it Op)

D
D
Closure up-arrow always exists
Op direct implementation
possible instances
rep. of instances
Op on each instance
I1, I2, , In
J1, J2, , Jm
Note Completeness ? Closure
13
Incompleteness
person day
Jennifer Monday
Mike Tuesday
person day
Jennifer Monday
person day
Mike Tuesday
Instance3
Instance2
Instance1
person day
Jennifer Monday
Mike Tuesday
generates 4th instance empty relation
?
?
14
Non-Closure Under Join
day food
Monday chicken
Tuesday fish
person day
Mike Monday,Tuesday
?
Result has two possible instances
person day food
Mike Monday chicken
person day food
Mike Tuesday fish
Instance2
Instance1
Not representable with or-sets and ?
15
Another Trio in Trio
  • Data Model
  • Simplest extension to relational model thats
    sufficiently expressive
  • Query Language
  • Simple extension to SQL with well-defined
    semantics and intuitive behavior
  • System
  • A complete open-source DBMS that people
    want to use

16
Another Trio in Trio
  • Data Model
  • Uncertainty-Lineage Databases (ULDBs)
  • Query Language
  • TriQL
  • System
  • Trio-One built on top of standard DBMS

17
Remainder of Talk
  • Data Model
  • Uncertainty-Lineage Databases (ULDBs)
  • Query Language
  • TriQL
  • System
  • Trio-One built on top of standard DBMS
  • 4. Demo

18
Quote from Jennifer
  • We are not about machine learning or
    probabilistic reasoning!
  • We are about efficient and convenient storage,
    manipulation, and retrieval of large data sets
    (with uncertainty and lineage in them)

19
Running Example Crime-Solving
  • Saw (witness, color, car) // may be uncertain
  • Drives (person, color, car) // may be uncertain
  • Suspects (person) pperson(Saw ? Drives)

20
In Standard Relational DBMS
Drives (person, color, car) Drives (person, color, car) Drives (person, color, car)
Jimmy red Toyota
Billy blue Honda
Frank red Mazda
Frank green Mazda
Saw (witness, color, car) Saw (witness, color, car) Saw (witness, color, car)
Amy red Mazda
Betty blue Honda
Carol green Toyota
Create Table Suspects as Select person From Saw,
Drives Where Saw.color Drives.color And
Saw.car Drives.car
Suspects
Frank
Billy

21
Data Model Uncertainty
  • An uncertain database represents a set of
  • possible instances
  • Amy saw either a Honda or a Toyota
  • Jimmy drives a Toyota, a Mazda, or both
  • Betty saw an Acura with confidence 0.5 or a
    Toyota with confidence 0.3
  • Hank is a suspect with confidence 0.7

22
Their Model for Uncertainty
  • 1. Alternatives
  • 2. ? (Maybe) Annotations
  • 3. Confidences

23
Our Model for Uncertainty
  • 1. Alternatives uncertainty about value
  • 2. ? (Maybe) Annotations
  • 3. Confidences

Saw (witness, color, car) Saw (witness, color, car)
Amy red, Honda ? red, Toyota ? orange, Mazda
Three possible instances
24
Our Model for Uncertainty
  • 1. Alternatives
  • 2. ? (Maybe) uncertainty about presence
  • 3. Confidences

Saw (witness, color, car) Saw (witness, color, car)
Amy red, Honda ? red, Toyota ? orange, Mazda
Betty blue, Acura
?
Six possible instances
25
Our Model for Uncertainty
  • 1. Alternatives
  • 2. ? (Maybe) Annotations
  • 3. Confidences weighted uncertainty

Saw (witness, color, car) Saw (witness, color, car)
Amy red, Honda 0.5 ? red, Toyota 0.3 ? orange, Mazda 0.2
Betty blue, Acura 0.6
?
Six possible instances, each with a probability
26
Models for Uncertainty
  • Our model (so far) is not especially new
  • We spent some time exploring the space of models
    for uncertainty
  • Tension between understandability and
    expressiveness
  • Our model is understandable
  • But it is not complete, or even closed under
    common operations

27
Our Model is Not Complete or Closed
Drives (person, car)
Jimmy, Toyota ? Jimmy, Mazda
Billy, Honda ? Frank, Honda
Hank, Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
Suspects pperson(Saw ? Drives)
Suspects
Jimmy
Billy ? Frank
Hank
Does not correctly capture possible instances in
the result
CANNOT
?
?
?
28
to the Rescue
Lineage
  • Lineage (provenance) where data came from
  • Internal lineage
  • External lineage
  • In Trio A function ? from data elements to
  • other data elements (or external sources)

29
Example with Lineage
ID Drives (person, car)
21 Jimmy, Toyota ? Jimmy, Mazda
22 Billy, Honda ? Frank, Honda
23 Hank, Honda
ID Saw (witness, car) Saw (witness, car)
11 Cathy Honda ? Mazda
Suspects pperson(Saw ? Drives)
ID Suspects
31 Jimmy
32 Billy ? Frank
33 Hank
?(31) (11,2),(21,2)
?
?(32,1) (11,1),(22,1) ?(32,2) (11,1),(22,2)
?
?(33) (11,1), 23
?
30
Example with Lineage
ID Drives (person, car)
21 Jimmy, Toyota ? Jimmy, Mazda
22 Billy, Honda ? Frank, Honda
23 Hank, Honda
ID Saw (witness, car) Saw (witness, car)
11 Cathy Honda ? Mazda
Suspects pperson(Saw ? Drives)
ID Suspects
31 Jimmy
32 Billy ? Frank
33 Hank
?
?
?
31
Trio Data Model
Uncertainty-Lineage Databases (ULDBs)
  • Alternatives
  • ? (Maybe) Annotations
  • Confidences
  • Lineage
  • ULDBs are closed and complete

32
Formal Definition of ULDB
33
Formal definition of Completeness
34
Proof of Completeness
  • Proof Construct R with x-relations S1, .., Sn,
    corresponding to R1, .. ,Rn
  • Construct an extra relation PW that encodes the
    possible instances. PW contains exactly one
    x-tuple (1)(2)...(m).
  • Each Si is constructed as follows,
  • For every Pj , each tuple t in Ri forms a maybe
    x-tuple with just one alternative with value t.
  • Duplicates within and across possible instances
    are preserved in Si.
  • We add (j) in PW to the lineage of alternatives
    in tuples copied from Pj .
  • This now exactly encodes the data in each of the
    possible instances.

35
Continuous, Proof
  • The correct lineage is obtained as follows,
  • We look at the lineage j in Pj and mimic it in
    the x-tuples it contributes in S1 through Sn.
  • For example, if j(t1) t2 in Pj, where t1 is a
    subset of R1 and t2 is a subset of R2, then the
    x-tuple that t2 gave in S2 is added to the
    lineage of the x-tuple from t1 in S1.
  • As a final step, we remove the extra relation PW
    but retain its symbols as external lineage.
  • Therefore, each possible LDB of D now has the
    same schema as each Pj , and represents exactly
    the same data and internal lineage.

36
ULDBs Minimality
  • A ULDB relation R represents a set of possible
  • instances
  • Does every tuple in R appear in some possible
    instance? (no extraneous tuples)
  • Does every maybe-tuple in R not appear in some
    possible instance? (no extraneous ?s)
  • Also

Data-minimality
Lineage-minimality
37
Data Minimality Examples
  • Extraneous ?

. . .
20 Billy ? Frank
. . .
?(20,1)(10,1) ?(20,2)(10,2)
?
extraneous
. . .
10 Billy, Honda ? Frank, Mazda
. . .
38
Data Minimality Examples
  • Extraneous tuple

?
Diane
extraneous
?
?
Diane Acura
Diane Mazda
Diane Mazda ? Acura
39
Querying ULDBs
TriQL
  • Simple extension to SQL
  • Formal semantics, intuitive meaning
  • Query uncertainty, confidences, and lineage

40
Simple TriQL Example
ID Drives (person, car)
21 Jimmy, Toyota ? Jimmy, Mazda
22 Billy, Honda ? Frank, Honda
23 Hank, Honda
ID Saw (witness, car) Saw (witness, car)
11 Cathy Honda ? Mazda
Create Table Suspects as Select person From Saw,
Drives Where Saw.car Drives.car
ID Suspects
31 Jimmy
32 Billy ? Frank
33 Hank
?(31)(11,2),(21,2)
?
?(32,1)(11,1),(22,1) ?(32,2)(11,1),(22,2)
?
?
?(33)(11,1),23
41
Formal Semantics
  • Relational (SQL) query Q on ULDB D

implementation of Q
D
D
D Result
operational semantics
possible instances
representation of instances
Q on each instance
D1, D2, , Dn
Q(D1), Q(D2), , Q(Dn)
42
TriQL Querying Confidences
  • Built-in function Conf()
  • SELECT person
  • FROM Saw, Drives
  • WHERE Saw.car Drives.car
  • AND Conf(Saw) gt 0.5 AND Conf(Drives) gt 0.8

43
TriQL Querying Lineage
  • Built-in join predicate Lineage()
  • SELECT Suspects.person
  • FROM Suspects, Saw
  • WHERE Lineage(Suspects,Saw)
  • AND Saw.witness Amy
  • X gt Y shorthand for Lineage(X,Y)

44
Operational Semantics
SELECT attr-list FROM X1, X2, ..., Xn WHERE
predicate
  • Over standard relational database
  • For each tuple in cross-product of X1, X2, ...,
    Xn
  • Evaluate the predicate
  • If true, project attr-list to create result tuple

45
Operational Semantics
SELECT attr-list FROM X1, X2, ..., Xn WHERE
predicate
  • Over ULDB
  • For each tuple in cross-product of X1, X2, ...,
    Xn
  • Create super tuple T from all combinations of
    alternatives
  • Evaluate predicate on each alternative in T
    keep only the true ones
  • Project attr-list on each alternative to create
    result tuple
  • Details ?, lineage, confidences

46
Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
47
Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
48
Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
49
Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
50
Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
(Cathy,Honda,Hank,Honda) ? (Cathy,Mazda,Hank,Honda)
51
Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
(Cathy,Honda,Hank,Honda) ? (Cathy,Mazda,Hank,Honda)
52
Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
(Cathy,Honda,Hank,Honda) ? (Cathy,Mazda,Hank,Honda)
53
Additional TriQL Constructs
  • Horizontal subqueries
  • Refer to tuple alternatives as a relation
  • Aggregations low, high, expected
  • Unmerged (horizontal duplicates)
  • Flatten, GroupAlts
  • NoLineage, NoConf, NoMaybe
  • Query-computed confidences
  • Data modification statements

54
Trio-Specific Additional Features
  • Lineage tracing
  • On-demand confidence computation
  • Coexistence checks
  • Extraneous data removal
  • Interrelated algorithms

55
The Trio System
  • Version 1 (Trio-One)
  • On top of standard DBMS
  • Surprisingly easy and complete, reasonably
    efficient

56
Trio-One Overview
TrioExplorer (GUI client)
  • DDL commands
  • TriQL queries
  • Schema browsing
  • Table browsing
  • Explore lineage
  • On-demand
  • confidence
  • computation

Command-line client
Trio API and translator (Python)
  • Partition and
  • verticalize
  • Shared IDs for
  • alternatives
  • Columns for
  • confidence,?

Standard SQL
  • Table types
  • Schema-level
  • lineage structure

Standard relational DBMS
  • Conf()
  • Lineage()

Encoded Data Tables
Trio Metadata
  • One per result
  • table
  • Uses unique IDs
  • Encodes formulas

Trio Stored Procedures
Lineage Tables
57
Strengths
  • First DBMS with uncertainty and lineage
  • Has many applications like I showed earlier
  • Done by Stanford!

58
Weaknesses
  • Paper has lots of definitions rather than
    explanations.
  • Proofs are written in whole text rather multiple
    lines with math symbols.

59
Future Work
  • More forms of uncertainty
  • Continuous uncertainty (intervals, Gaussians)
  • Correlated uncertainty
  • Incomplete relations
  • More forms of lineage
  • External lineage
  • Update lineage
  • Confidence-based queries
  • Threshold Top-K

60
Future Work
  • Conjunctive lineage sufficient for most
    operations
  • Disjunctive lineage for duplicate-elimination
  • Negative lineage for difference
  • General case after several queries
  • Boolean formula

61
Search stanford trio
Write a Comment
User Comments (0)
About PowerShow.com