Title: Trio: A System for Data, Uncertainty, and Lineage
1Trio A System for Data, Uncertainty, and Lineage
- Jennifer Widom et al
- Stanford University
2Depiction of Trio Project
3Some Context
- Trio Project
- Theyre building a new kind of DBMS in which
- Data
- Accuracy
- Lineage
- are all first-class interrelated concepts
4The Trio in Trio
- Data
- Student 123 is majoring in Econ (123,Econ) ?
Major - Uncertainty
- Student 123 is majoring in Econ or CS
- (123, Econ ? CS) ? Major
- With confidence 60 student 456 is a CS major
- (456, CS 0.6) ? Major
- Lineage
- 456 ? HardWorker derived from
- (456, CS) ? Major
- CS is hard ? some web page
5Depiction
Data
Uncertainty
Lineage (sourcing)
6Original Motivation for the Project
- New Application Domains
- Many involve data that is uncertain
- (approximate, probabilistic, inexact, incomplete,
imprecise, fuzzy, inaccurate,...) - Many of the same ones need to track the lineage
(provenance) of their data
Neither uncertainty nor lineage is supported in
current database systems
7Importance of Lineage
- technical Lineage
- Enables simple and consistent representation of
uncertain data - Correlates uncertainty in query results with
uncertainty in the input data - Can make computation over uncertain data more
efficient - fluffy Applications use lineage to reduce or
resolve uncertainty
8Sample Applications
- Information extraction
- Find label entities in unstructured text
- Often probabilistic
- Information integration
- Combine data from multiple sources
- Inconsistencies
- Scientific experiments
- Inexact/incomplete data
- Many levels of derived data products
9Sample Applications
- Sensor data management
- Approximate readings
- Missing readings
- Levels of data aggregation
- Deduplication (data cleaning)
- Object linkage, entity resolution
- Often heuristic/probabilistic
- Approximate query processing
- Fast but inexact answers
10The Usual DBMS Features
- (From first lecture of any Intro to
Databases class) - Efficient,
- Convenient,
- Safe,
- Multi-User storage of and access to
- Massive amounts of
- Persistent data
11Completeness vs. Closure
Completeness blueyellow
All sets-of-instances
Representable sets-of-instances
Closure arrow stays in blue
Op2
Op1
Proposition An incomplete representation is
still interesting if its expressive enough and
closed under all required operations
12Operations Semantics
- Easy and natural (re)definition for any standard
- database operation (call it Op)
D
D
Closure up-arrow always exists
Op direct implementation
possible instances
rep. of instances
Op on each instance
I1, I2, , In
J1, J2, , Jm
Note Completeness ? Closure
13Incompleteness
person day
Jennifer Monday
Mike Tuesday
person day
Jennifer Monday
person day
Mike Tuesday
Instance3
Instance2
Instance1
person day
Jennifer Monday
Mike Tuesday
generates 4th instance empty relation
?
?
14Non-Closure Under Join
day food
Monday chicken
Tuesday fish
person day
Mike Monday,Tuesday
?
Result has two possible instances
person day food
Mike Monday chicken
person day food
Mike Tuesday fish
Instance2
Instance1
Not representable with or-sets and ?
15Another Trio in Trio
- Data Model
- Simplest extension to relational model thats
sufficiently expressive - Query Language
- Simple extension to SQL with well-defined
semantics and intuitive behavior - System
- A complete open-source DBMS that people
want to use
16Another Trio in Trio
- Data Model
- Uncertainty-Lineage Databases (ULDBs)
- Query Language
- TriQL
- System
- Trio-One built on top of standard DBMS
17Remainder of Talk
- Data Model
- Uncertainty-Lineage Databases (ULDBs)
- Query Language
- TriQL
- System
- Trio-One built on top of standard DBMS
- 4. Demo
18Quote from Jennifer
- We are not about machine learning or
probabilistic reasoning! - We are about efficient and convenient storage,
manipulation, and retrieval of large data sets
(with uncertainty and lineage in them)
19Running Example Crime-Solving
- Saw (witness, color, car) // may be uncertain
- Drives (person, color, car) // may be uncertain
- Suspects (person) pperson(Saw ? Drives)
20In Standard Relational DBMS
Drives (person, color, car) Drives (person, color, car) Drives (person, color, car)
Jimmy red Toyota
Billy blue Honda
Frank red Mazda
Frank green Mazda
Saw (witness, color, car) Saw (witness, color, car) Saw (witness, color, car)
Amy red Mazda
Betty blue Honda
Carol green Toyota
Create Table Suspects as Select person From Saw,
Drives Where Saw.color Drives.color And
Saw.car Drives.car
Suspects
Frank
Billy
21Data Model Uncertainty
- An uncertain database represents a set of
- possible instances
- Amy saw either a Honda or a Toyota
- Jimmy drives a Toyota, a Mazda, or both
- Betty saw an Acura with confidence 0.5 or a
Toyota with confidence 0.3 - Hank is a suspect with confidence 0.7
22Their Model for Uncertainty
- 1. Alternatives
- 2. ? (Maybe) Annotations
- 3. Confidences
23Our Model for Uncertainty
- 1. Alternatives uncertainty about value
- 2. ? (Maybe) Annotations
- 3. Confidences
Saw (witness, color, car) Saw (witness, color, car)
Amy red, Honda ? red, Toyota ? orange, Mazda
Three possible instances
24Our Model for Uncertainty
- 1. Alternatives
- 2. ? (Maybe) uncertainty about presence
- 3. Confidences
Saw (witness, color, car) Saw (witness, color, car)
Amy red, Honda ? red, Toyota ? orange, Mazda
Betty blue, Acura
?
Six possible instances
25Our Model for Uncertainty
- 1. Alternatives
- 2. ? (Maybe) Annotations
- 3. Confidences weighted uncertainty
Saw (witness, color, car) Saw (witness, color, car)
Amy red, Honda 0.5 ? red, Toyota 0.3 ? orange, Mazda 0.2
Betty blue, Acura 0.6
?
Six possible instances, each with a probability
26Models for Uncertainty
- Our model (so far) is not especially new
- We spent some time exploring the space of models
for uncertainty - Tension between understandability and
expressiveness - Our model is understandable
- But it is not complete, or even closed under
common operations
27Our Model is Not Complete or Closed
Drives (person, car)
Jimmy, Toyota ? Jimmy, Mazda
Billy, Honda ? Frank, Honda
Hank, Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
Suspects pperson(Saw ? Drives)
Suspects
Jimmy
Billy ? Frank
Hank
Does not correctly capture possible instances in
the result
CANNOT
?
?
?
28 to the Rescue
Lineage
- Lineage (provenance) where data came from
- Internal lineage
- External lineage
- In Trio A function ? from data elements to
- other data elements (or external sources)
29Example with Lineage
ID Drives (person, car)
21 Jimmy, Toyota ? Jimmy, Mazda
22 Billy, Honda ? Frank, Honda
23 Hank, Honda
ID Saw (witness, car) Saw (witness, car)
11 Cathy Honda ? Mazda
Suspects pperson(Saw ? Drives)
ID Suspects
31 Jimmy
32 Billy ? Frank
33 Hank
?(31) (11,2),(21,2)
?
?(32,1) (11,1),(22,1) ?(32,2) (11,1),(22,2)
?
?(33) (11,1), 23
?
30Example with Lineage
ID Drives (person, car)
21 Jimmy, Toyota ? Jimmy, Mazda
22 Billy, Honda ? Frank, Honda
23 Hank, Honda
ID Saw (witness, car) Saw (witness, car)
11 Cathy Honda ? Mazda
Suspects pperson(Saw ? Drives)
ID Suspects
31 Jimmy
32 Billy ? Frank
33 Hank
?
?
?
31Trio Data Model
Uncertainty-Lineage Databases (ULDBs)
- Alternatives
- ? (Maybe) Annotations
- Confidences
- Lineage
- ULDBs are closed and complete
32Formal Definition of ULDB
33Formal definition of Completeness
34Proof of Completeness
- Proof Construct R with x-relations S1, .., Sn,
corresponding to R1, .. ,Rn - Construct an extra relation PW that encodes the
possible instances. PW contains exactly one
x-tuple (1)(2)...(m). - Each Si is constructed as follows,
- For every Pj , each tuple t in Ri forms a maybe
x-tuple with just one alternative with value t. - Duplicates within and across possible instances
are preserved in Si. - We add (j) in PW to the lineage of alternatives
in tuples copied from Pj . - This now exactly encodes the data in each of the
possible instances.
35Continuous, Proof
- The correct lineage is obtained as follows,
- We look at the lineage j in Pj and mimic it in
the x-tuples it contributes in S1 through Sn. - For example, if j(t1) t2 in Pj, where t1 is a
subset of R1 and t2 is a subset of R2, then the
x-tuple that t2 gave in S2 is added to the
lineage of the x-tuple from t1 in S1. - As a final step, we remove the extra relation PW
but retain its symbols as external lineage. - Therefore, each possible LDB of D now has the
same schema as each Pj , and represents exactly
the same data and internal lineage.
36ULDBs Minimality
- A ULDB relation R represents a set of possible
- instances
- Does every tuple in R appear in some possible
instance? (no extraneous tuples) - Does every maybe-tuple in R not appear in some
possible instance? (no extraneous ?s) - Also
Data-minimality
Lineage-minimality
37Data Minimality Examples
. . .
20 Billy ? Frank
. . .
?(20,1)(10,1) ?(20,2)(10,2)
?
extraneous
. . .
10 Billy, Honda ? Frank, Mazda
. . .
38Data Minimality Examples
?
Diane
extraneous
?
?
Diane Acura
Diane Mazda
Diane Mazda ? Acura
39Querying ULDBs
TriQL
- Simple extension to SQL
- Formal semantics, intuitive meaning
- Query uncertainty, confidences, and lineage
40Simple TriQL Example
ID Drives (person, car)
21 Jimmy, Toyota ? Jimmy, Mazda
22 Billy, Honda ? Frank, Honda
23 Hank, Honda
ID Saw (witness, car) Saw (witness, car)
11 Cathy Honda ? Mazda
Create Table Suspects as Select person From Saw,
Drives Where Saw.car Drives.car
ID Suspects
31 Jimmy
32 Billy ? Frank
33 Hank
?(31)(11,2),(21,2)
?
?(32,1)(11,1),(22,1) ?(32,2)(11,1),(22,2)
?
?
?(33)(11,1),23
41Formal Semantics
- Relational (SQL) query Q on ULDB D
implementation of Q
D
D
D Result
operational semantics
possible instances
representation of instances
Q on each instance
D1, D2, , Dn
Q(D1), Q(D2), , Q(Dn)
42TriQL Querying Confidences
- Built-in function Conf()
- SELECT person
- FROM Saw, Drives
- WHERE Saw.car Drives.car
- AND Conf(Saw) gt 0.5 AND Conf(Drives) gt 0.8
43TriQL Querying Lineage
- Built-in join predicate Lineage()
- SELECT Suspects.person
- FROM Suspects, Saw
- WHERE Lineage(Suspects,Saw)
- AND Saw.witness Amy
- X gt Y shorthand for Lineage(X,Y)
44Operational Semantics
SELECT attr-list FROM X1, X2, ..., Xn WHERE
predicate
- Over standard relational database
- For each tuple in cross-product of X1, X2, ...,
Xn - Evaluate the predicate
- If true, project attr-list to create result tuple
45Operational Semantics
SELECT attr-list FROM X1, X2, ..., Xn WHERE
predicate
- Over ULDB
- For each tuple in cross-product of X1, X2, ...,
Xn - Create super tuple T from all combinations of
alternatives - Evaluate predicate on each alternative in T
keep only the true ones - Project attr-list on each alternative to create
result tuple - Details ?, lineage, confidences
46Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
47Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
48Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
49Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
50Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
(Cathy,Honda,Hank,Honda) ? (Cathy,Mazda,Hank,Honda)
51Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
(Cathy,Honda,Hank,Honda) ? (Cathy,Mazda,Hank,Honda)
52Operational Semantics Example
SELECT person FROM Saw, Drives WHERE Saw.car
Drives.car
Drives (person, car) Drives (person, car)
Jim ? Bill Mazda
Hank Honda
Saw (witness, car) Saw (witness, car)
Cathy Honda ? Mazda
(Cathy,Honda,Jim,Mazda)?(Cathy,Honda,Bill,Mazda)?(Cathy,Mazda,Jim,Mazda)?(Cathy,Mazda,Bill,Mazda)
(Cathy,Honda,Hank,Honda) ? (Cathy,Mazda,Hank,Honda)
53Additional TriQL Constructs
- Horizontal subqueries
- Refer to tuple alternatives as a relation
- Aggregations low, high, expected
- Unmerged (horizontal duplicates)
- Flatten, GroupAlts
- NoLineage, NoConf, NoMaybe
- Query-computed confidences
- Data modification statements
54Trio-Specific Additional Features
- Lineage tracing
- On-demand confidence computation
- Coexistence checks
- Extraneous data removal
- Interrelated algorithms
55The Trio System
- Version 1 (Trio-One)
- On top of standard DBMS
- Surprisingly easy and complete, reasonably
efficient
56Trio-One Overview
TrioExplorer (GUI client)
- DDL commands
- TriQL queries
- Schema browsing
- Table browsing
- Explore lineage
- On-demand
- confidence
- computation
Command-line client
Trio API and translator (Python)
- Partition and
- verticalize
- Shared IDs for
- alternatives
- Columns for
- confidence,?
Standard SQL
- Table types
- Schema-level
- lineage structure
Standard relational DBMS
Encoded Data Tables
Trio Metadata
- One per result
- table
- Uses unique IDs
- Encodes formulas
Trio Stored Procedures
Lineage Tables
57Strengths
- First DBMS with uncertainty and lineage
- Has many applications like I showed earlier
- Done by Stanford!
58Weaknesses
- Paper has lots of definitions rather than
explanations. - Proofs are written in whole text rather multiple
lines with math symbols.
59Future Work
- More forms of uncertainty
- Continuous uncertainty (intervals, Gaussians)
- Correlated uncertainty
- Incomplete relations
- More forms of lineage
- External lineage
- Update lineage
- Confidence-based queries
- Threshold Top-K
60Future Work
- Conjunctive lineage sufficient for most
operations - Disjunctive lineage for duplicate-elimination
- Negative lineage for difference
- General case after several queries
- Boolean formula
61Search stanford trio