Title: Trio: A System for Data, Uncertainty, and Lineage
1. Trio: A System for Data, Uncertainty, and Lineage
- Search "stanford trio"
- http://i.stanford.edu/trio
2. People
- Current
- Jennifer Widom (faculty)
- Omar Benjelloun (post-doc)
- Parag Agrawal, Anish Das Sarma, Shubha Nabar (PhD)
- Michi Mutsuzaki (MS)
- Tomoe Sugihara (visitor)
- Incoming
- Martin Theobald (post-doc)
- Raghu Murthy (MS)
- Ander de Keijzer (visitor)
- Alums
- Alon Halevy, Ashok Chandra (visitors)
- Chris Hayworth (MS)
3. Why Uncertainty and Lineage?
- Many applications seem to need both
- From a technical standpoint, it turns out that lineage...
- Enables simple and consistent representation of uncertain data
- Correlates uncertainty in query results with uncertainty in the input data
- Can make computation over uncertain data more efficient
4. Trio Components
- Data Model
- ULDBs (Uncertainty-Lineage Databases)
- Simple extension to relational model
- Query Language
- TriQL: simple extension to SQL, well-defined semantics and intuitive behavior
- System
- Version 1: complete system and GUI built on top of conventional DBMS
5. Running Example: Crime-Solving
- Saw(witness,car) // may be uncertain
- Drives(person,car) // may be uncertain
- Suspects(person) = π_person(Saw ⋈ Drives)
6. Our Model for Uncertainty
- 1. Alternatives
- 2. ? (Maybe) Annotations
- 3. Confidences
7. Our Model for Uncertainty
- 1. Alternatives: uncertainty about value
- 2. ? (Maybe) Annotations
- 3. Confidences
Three possible instances
8. Our Model for Uncertainty
- 1. Alternatives
- 2. ? (Maybe): uncertainty about presence
- 3. Confidences
Six possible instances
9. Our Model for Uncertainty
- 1. Alternatives
- 2. ? (Maybe) Annotations
- 3. Confidences: weighted uncertainty
Six possible instances, each with a probability
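The way alternatives, ? annotations, and confidences combine into a set of possible instances can be sketched in a few lines of Python. This is a minimal illustration, not Trio code: the witnesses, cars, and confidence values are made up (the slide's table did not survive extraction), and x-tuples are assumed to be independent of one another.

```python
from itertools import product

# Hypothetical x-tuples (not the slide's data). Each x-tuple has a list of
# (alternative, confidence) pairs and a flag for the '?' (maybe) annotation.
saw = [
    {"alts": [(("Amy", "Honda"), 0.5),
              (("Amy", "Toyota"), 0.3),
              (("Amy", "Mazda"), 0.2)],
     "maybe": False},
    {"alts": [(("Betty", "Acura"), 0.6)], "maybe": True},
]

def choices(xtuple):
    """All ways to instantiate one x-tuple: pick one alternative, or,
    if it is marked '?', omit it with the leftover probability."""
    opts = list(xtuple["alts"])
    if xtuple["maybe"]:
        absent = 1.0 - sum(c for _, c in xtuple["alts"])
        opts.append((None, absent))
    return opts

def possible_instances(relation):
    """Enumerate (instance, probability) pairs, assuming independent x-tuples."""
    result = []
    for combo in product(*(choices(t) for t in relation)):
        instance = frozenset(alt for alt, _ in combo if alt is not None)
        prob = 1.0
        for _, c in combo:
            prob *= c
        result.append((instance, prob))
    return result

instances = possible_instances(saw)
for inst, p in instances:
    print(sorted(inst), round(p, 2))
```

With three alternatives for the first x-tuple and a present/absent choice for the second, the sketch enumerates 3 × 2 = 6 instances whose probabilities sum to 1.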
10. Models for Uncertainty
- Our model (so far) is not especially new
- We spent some time exploring the space of models for uncertainty [ICDE 06, journal]
- Tension between understandability and expressiveness
- Our model is understandable
- But it is not complete, or even closed under common operations
11. Our Model is Not Closed
Suspects = π_person(Saw ⋈ Drives)
Does not (and CANNOT) correctly capture the possible instances in the result
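The closure problem can be made concrete with a small Python sketch. The names and cars are hypothetical, and the check is deliberately limited: it compares the true result instances against one natural candidate representation (the result alternatives with every x-tuple marked ?), rather than proving that no representation exists.

```python
from itertools import product

def instances(xrel):
    """Possible instances of a list of x-tuples, each given as
    (alternatives, maybe). X-tuples are assumed independent."""
    per_tuple = [list(alts) + ([None] if maybe else []) for alts, maybe in xrel]
    return {frozenset(t for t in combo if t is not None)
            for combo in product(*per_tuple)}

# Hypothetical inputs (the slide's tables did not survive extraction).
saw = [([("Cathy", "Honda"), ("Cathy", "Mazda")], False)]
drives = [
    ([("Jimmy", "Toyota"), ("Jimmy", "Mazda")], False),
    ([("Billy", "Honda"), ("Frank", "Honda")], False),
    ([("Hank", "Honda")], False),
]

def suspects(saw_inst, drives_inst):
    """Projection on person of (Saw join Drives), on ordinary relations."""
    return frozenset(p for (_, c1) in saw_inst
                     for (p, c2) in drives_inst if c1 == c2)

# Correct answer: run the query on every combination of input instances.
true_result = {suspects(s, d)
               for s in instances(saw) for d in instances(drives)}

# Candidate result without lineage: same alternatives, every x-tuple '?'.
naive = [(["Jimmy"], True), (["Billy", "Frank"], True), (["Hank"], True)]
naive_result = instances(naive)

# Instances the candidate admits but the query can never produce.
extra = naive_result - true_result
print(len(true_result), len(naive_result), len(extra))
```

The candidate admits instances such as {Jimmy, Hank}, which would require Cathy to have seen a Mazda and a Honda at the same time. Without something that ties result tuples back to the input alternatives, the independent ? annotations lose that correlation.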
12. Lineage to the Rescue
- Lineage
- Captures where data came from
- In Trio: a function λ from alternatives to other alternatives (or external sources)
13. Example with Lineage
Suspects = π_person(Saw ⋈ Drives)
λ(31) = (11,2), (21,2)
λ(32,1) = (11,1), (22,1)
λ(32,2) = (11,1), (22,2)
λ(33) = (11,1), 23
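A short Python sketch shows how the lineage entries above constrain which result alternatives can appear together. The tuple IDs and lineage sets follow the slide; the number of alternatives per base x-tuple is an assumption consistent with that lineage, base x-tuples are assumed to carry no ?, and the bare ID 23 is written as (23, 1).

```python
from itertools import product

# Base x-tuples: id -> number of alternatives (an assumption consistent
# with the lineage on the slide; none of them is treated as '?').
base = {11: 2, 21: 2, 22: 2, 23: 1}

# Conjunctive lineage: result alternative -> set of base alternatives.
lineage = {
    (31, 1): {(11, 2), (21, 2)},
    (32, 1): {(11, 1), (22, 1)},
    (32, 2): {(11, 1), (22, 2)},
    (33, 1): {(11, 1), (23, 1)},
}

def result_instances():
    """For each choice of one alternative per base x-tuple, keep exactly
    the result alternatives whose entire lineage was chosen."""
    ids = sorted(base)
    out = set()
    for picks in product(*(range(1, base[i] + 1) for i in ids)):
        chosen = set(zip(ids, picks))
        out.add(frozenset(alt for alt, src in lineage.items() if src <= chosen))
    return out

insts = result_instances()
for inst in sorted(insts, key=sorted):
    print(sorted(inst))
```

Because (31,1) needs (11,2) while (33,1) needs (11,1), the two can never appear in the same instance. The sketch yields four result instances: the empty one, {(31,1)}, {(32,1), (33,1)}, and {(32,2), (33,1)}.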
14. Uncertainty-Lineage Databases (ULDBs)
- 1. Alternatives
- 2. ? (Maybe) Annotations
- 3. Confidences
- 4. Lineage
- ULDBs are closed and complete
- VLDB 06
15. ULDBs: Lineage
- Conjunctive lineage sufficient for most operations
- Duplicate-elimination: disjunctive lineage
- Difference: negative lineage
- General case after multiple operations/queries: Boolean formula
16. ULDBs: Interesting Questions
- Data-minimality: extraneous alternatives, extraneous ?
- Lineage-minimality: harder
- Membership: tuple and table, some-instance and all-instances
- Coexistence: multiple tuples
- Extraction: remove tables, retain possible-instances
17. Example: Extraneous Data
[Figure: example tables with one item labeled "extraneous"]
18. Example: Coexistence
[Figure: example tables with tuples labeled "Can't coexist"]
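19. Querying ULDBs: Semantics
[Commutative diagram]
- D to D + Result: implementation of Q (operational semantics)
- D to D1, D2, ..., Dn: possible instances
- D1, D2, ..., Dn to Q(D1), Q(D2), ..., Q(Dn): Q on each instance
- D + Result to Q(D1), Q(D2), ..., Q(Dn): representation of instances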
20. Querying ULDBs: TriQL
- Basic TriQL = SQL with new semantics
- Obeys commutative diagram for uncertain data
- Tracks lineage
- Query results: new table or on-the-fly
- Implemented: TriQL, also built-in predicates conf(), lineage(), lineage()
21. Additional TriQL Constructs
- Language manual on web site
- Horizontal subqueries
- Refer to tuple alternatives as a relation
- Unmerged (horizontal duplicates)
- Flatten, GroupAlts
- NoLineage, NoConf, NoMaybe
- Query-specified confidences done
- Data modification statements
22. Confidence Computation
- Confidences computed on-demand based on lineage
- Confidence of alternative A is function of confidences in λ(A)
- Permits any query plan for data computation
- Default probabilistic interpretation, but queries can override
SELECT person, min(conf(Saw), conf(Drives)) AS conf
FROM Saw, Drives WHERE Saw.car = Drives.car
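The idea of computing a confidence on demand from lineage can be sketched as follows. This is an illustration under stated assumptions, not Trio's algorithm: the confidence values and alternative IDs are hypothetical, the lineage is purely conjunctive, and the base alternatives are taken to be independent, so the probabilistic default reduces to a product. Passing `min` mimics the query-specified override in the TriQL example.

```python
from math import prod

# Hypothetical base confidences, keyed by (table, alternative id).
base_conf = {("Saw", 1): 0.8, ("Drives", 1): 0.5}

# Conjunctive lineage of a derived alternative (illustrative IDs).
lineage = {("Suspects", 1): [("Saw", 1), ("Drives", 1)]}

def conf(alt, combine=prod):
    """Confidence of an alternative, computed on demand by walking its
    lineage down to base data and combining the confidences found there."""
    if alt in base_conf:
        return base_conf[alt]
    return combine(conf(src, combine) for src in lineage[alt])

default_conf = conf(("Suspects", 1))           # product of 0.8 and 0.5
min_conf = conf(("Suspects", 1), combine=min)  # smaller of 0.8 and 0.5
print(default_conf, min_conf)
```

Because the confidence is derived from lineage rather than carried through the query plan, the data itself can be computed by any plan first. Lineage that is a general Boolean formula, or that reaches the same base alternative along several paths, needs more care than this product rule.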
23. Trio System: Version 1
TrioExplorer (GUI client)
- DDL commands
- TriQL queries
- Schema browsing
- Table browsing
- Explore lineage
- On-demand confidence computation
Command-line client
Trio API and translator (Python)
Standard SQL
Standard relational DBMS
Encoded Data Tables
- Verticalize
- Shared IDs for alternatives
- Columns for confidence, ?
Trio Metadata
- Table types
- Schema-level lineage structure
Trio Stored Procedures
- conf()
- lineage()
- lineage()
Lineage Tables
- One per result table
- Uses unique IDs
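One way to picture the encoded data tables is a conventional table with one row per alternative, a shared ID tying the alternatives of an x-tuple together, and extra columns for confidence and ?. The sketch below uses Python's built-in sqlite3 module. The table name, column names, and data are hypothetical, since the slides do not give Trio's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical vertical encoding: one row per alternative.
conn.execute("""
    CREATE TABLE saw_enc (
        xid     INTEGER,  -- shared by all alternatives of one x-tuple
        aid     INTEGER,  -- alternative number within the x-tuple
        conf    REAL,     -- confidence of this alternative
        maybe   INTEGER,  -- 1 if the x-tuple carries a '?'
        witness TEXT,
        car     TEXT,
        PRIMARY KEY (xid, aid)
    )
""")

rows = [
    (11, 1, 0.5, 0, "Amy", "Honda"),
    (11, 2, 0.3, 0, "Amy", "Toyota"),
    (11, 3, 0.2, 0, "Amy", "Mazda"),
    (12, 1, 0.6, 1, "Betty", "Acura"),
]
conn.executemany("INSERT INTO saw_enc VALUES (?, ?, ?, ?, ?, ?)", rows)

# Reassemble x-tuples from the vertical encoding by grouping on xid.
xtuples = {}
query = ("SELECT xid, conf, maybe, witness, car "
         "FROM saw_enc ORDER BY xid, aid")
for xid, conf, maybe, witness, car in conn.execute(query):
    entry = xtuples.setdefault(xid, {"maybe": bool(maybe), "alts": []})
    entry["alts"].append((witness, car, conf))

for xid, entry in xtuples.items():
    print(xid, entry)
```

Keeping every alternative as an ordinary row means the underlying DBMS can store, index, and join the data with standard SQL, which fits the slide's layering of a translator over a standard relational DBMS.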
24. Current and Future Topics
- Algorithms: confidence computation, coexistence, extraneous data
- Minimize lineage traversal
- Memoization
- Batch operations
- System
- Full query language
- More internal processing ?
- Storage and indexing
- Statistics and query optimization
25. Current and Future Topics
- Top-K by confidence
- Extend basic uncertainty model
- Incomplete relations
- Continuous uncertainty
- Correlated uncertainty ?
- External lineage, update lineage, versioning