Data integration - PowerPoint PPT Presentation

About This Presentation
Title:

Data integration

Description:

Different data models: relational, object-oriented, XML, ... 'Keanu Reeves' or 'Reeves, Keanu' or 'Reeves, K.' etc. Describe source contents ... – PowerPoint PPT presentation

Number of Views:70
Avg rating:3.0/5.0
Slides: 30
Provided by: jiang79
Category:

less

Transcript and Presenter's Notes

Title: Data integration


1
Data integration
  • Most slides are borrowed from Li Chen

2
Motivation
Biblio sever
Legacy database
Plain text files
Support seamless access to autonomous and
heterogeneous information sources.
3
Applications
  • Comparison shopping

Lowest price of the DVD The Matrix?
  • Supply-chain management

Buyer 1
Supplier 1
Buyer 2
Supplier 2
Integrator


Supplier M
Buyer M
4
Mediation architecture
Mediator
Wrapper
Wrapper
Wrapper
Source 1
Source 2
Source n
TSIMMIS (Stanford), Garlic (IBM), Infomaster
(Stanford), Disco (INRIA), Information Manifold
(ATT), Hermes(UMD), Tukwila (UW), InfoSleuth
(MCC),
5
Challenges
  • Sources are heterogeneous
  • Different data models relational,
    object-oriented, XML,
  • Different schemas and representations. E.g.,
  • Keanu Reeves or Reeves, Keanu or Reeves,
    K. etc.
  • Describe source contents
  • Use source data to answer queries
  • Sources have limited query capabilities
  • Data quality

6
Outline
  • Basics theories of conjunctive queries
  • Global-as-view (GAV) approach to data
    integration
  • Local-as-view (LAV) approach to data integration

7
Conjunctive Queries (CQs) in Datalog
  • Most common form of query equivalent to
    select-project-join (SPJ) queries
  • Useful for data integration
  • Form q(X) - p1(X1), p2(X2),, pn(Xn).
  • Head q(X) represents the query answers
  • Body p1(X1), p2(X2),, pn(Xn) represents the
    query conditions
  • Each pi(Xi) is called a subgoal
  • Shared variables represent join conditions
  • Constants represent Attributeconst selection
    conditions
  • A relation can appear in multiple predicates
    (subgoals)

8
Conjunctive Queries example
  • Schema
  • student(name, courseNum), course(number,
    Instructor)
  • SQL
  • SELECT name
  • FROM student, course
  • WHERE student.courseNumcourse.number
    AND instructorLi
  • Equal to
  • ans(SN) - student(SN, CN), course(CN,Li).
  • Predicates student and course correspond to
    relations names
  • Two subgoals student(SN, CN) and course(CN,Li)
  • Variables SN, CN. Constant Li
  • Shared variable, CN, corresponds to
    student.courseNumcourse.number
  • Variable SN in the head the answer to the query

9
Answer to a CQ
  • For a CQ Q on database D, the answer Q(D) is a
    set of heads of Q if we
  • Substitute constants for variables in the body of
    Q in all possible ways
  • Require all subgoals to be true
  • Example ans(SN) - student(SN, CN),
    course(CN,Li).
  • Tuples are also called EDB (external database)
    facts student(Jack, 184), student(Tom,215), ,
    course(184,Li), course(215,Li),
  • Answer Jack SN?Jack,CN?184
  • Answer Tom SN?Tom,CN?215
  • Answer Jack SN?Jack,CN?215 (duplicate
    eliminated)

Course
Student
10
Query containment
  • For two queries Q1 and Q2, we say Q1 is contained
    in Q2, denoted Q1?Q2, if any database D, we have
    Q1(D) ?Q2(D).
  • We say Q1 and Q2 are equivalent, denoted Q1?Q2,
    if Q1(D) ?Q2(D) and Q1(D) ? Q2(D).
  • Example
  • Q1 ans(SN) - student(SN, CN), course(CN, Li).
  • Q2 ans(SN) - student(SN, CN), course(CN, INS).
  • We have Q1(D) ? Q2(D).

11
Another example
  • Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y).
  • Q2 p(X,Y) - r(X,W), b(W,W), r(W,Y).
  • We have Q2 ?Q1
  • Proof
  • For any DB D, suppose p(x,y) is in Q2(D). Then
    there is a w such that r(x,w), b(w,w), and r(w,y)
    are in D.
  • For Q1, consider the substitution X? x, W? w, Z?
    w, Y? y.
  • Thus the head of Q1 becomes p(x,y), meaning that
    p(x,y) is also in Q1(D).
  • In general, how to test containment of CQs?
  • Containment mappings

12
Containment mappings
  • A containment mapping from Q2 to Q1 Map
    variables of Q2 to variables of Q1, such that
  • Head of Q2 becomes head of Q1
  • Each subgoal of Q2 becomes some subgoal of Q1.
  • It is not necessary that every subgoal of Q1 is
    the target of some subgoal of Q2.
  • Q1 ?Q2 iff there is a containment mapping from
    Q2 to Q1.
  • Note that the containment mapping is opposite the
    containment --- it goes from the larger
    (containing CQ) to the smaller (contained CQ).
  • Example
  • Q1 p(X,Y) - r(X,W), b(W,Z), r(Z,Y).
  • Q2 p(X,Y) - r(X,W), b(W,W), r(W,Y).
  • Containment mapping from Q1 to Q2 X ? X, Y ? Y,
    W ? W, Z ? W
  • No containment mapping from Q2 to Q1
  • For b(W,W) in Q2, its only possible target in Q1
    is b(W,Z)
  • However, we cannot have a mapping W?W and W?Z,
    since each variable cannot be mapped to two
    different variables

13
Extending CQs
  • CQs with built-in predicates
  • We can add more conditions to variables in a CQ.
  • Example
  • student(name, GPA, courseNum),
    course(number,instructor,year)
  • Q1(SN) - student(SN, G, CN), course(CN, Li),
    Ggt3.5.
  • Q2(SN) - student(SN, G, CN), course(CN, Li),
    Ggt3.5, Y lt 2002.
  • Q2(SN) ? Q1(SN).
  • Datalog queries
  • a (possibly infinite) set of CQs with (possibly)
    recursion
  • Example r(Parent, Child)
  • Query finding all ancestors of Tom
  • ancestor(P,C) - r(P, C).
  • ancestor(P,C) - ancestor(P,X), r(X, C).
  • result(P) - ancestor(P, tom).

14
Outline
  • Basics theories of conjunctive queries
  • Global-as-view (GAV) approach to data
    integration
  • Local-as-view (LAV) approach to data integration

15
GAV approach to data integration
  • Readings
  • Jeffrey Ullman, Information Integration Using
    Logical Views, ICDT 1997.
  • Ramana Yerneni, Chen Li, Hector Garcia-Molina,
    and Jeffrey Ullman, Computing Capabilities of
    Mediators, SIGMOD 1999.

16
Global-as-view Approach
med(Dealer,City,Make,Year) R1 R2
Mediator
R1(Dealer,City)
R2(Dealer, Make, Year)
  • Mediator exports views defined on source
    relations
  • med(Dealer,City,Make,Year) R1 R2
  • A query is posted on mediator views
  • SELECT FROM med
  • WHERE Year 2001
  • ans(D,C,M, 2001) - med(D,C,M,2001).
  • Mediator expands query to source queries
  • SELECT FROM R1, R2
  • WHERE Year 2001
  • ans(D,C,M,2001) - R1(D,C), R2(D,M,
    2001).

17
GAV Approach
  • Project TSIMMIS at Stanford
  • Advantages
  • User queries are easy to define
  • Query transformation generation is
    straightforward
  • Disadvantages
  • Not all source information is exported
  • Not easily scalable every time a new source is
    added, mediator views need to be changed.
  • Research issues
  • Efficient query execution?
  • Deal with limited source capabilities?

18
Limited source capabilities
  • Complete scans of relations not possible
  • Reasons
  • Legacy databases or structured files limited
    interfaces
  • Security/Privacy
  • Performance concerns
  • Example 1 legacy databases with restrictive
    interfaces

title
author
Given an author, return the books.
Ullman
DBMS
Knuth
TeX


19
Another example Web search forms
www.imdb.com
20
Problems
  • How to describe source restrictions?
  • How to compute mediator restrictions from
    sources?
  • How to answer queries efficiently given these
    restrictions?
  • How to compute as many answers as possible to a
    query?

21
Computing mediator restrictions
  • Motivation do not want users to be frustrated by
    submitting a query that cannot be answerable by
    the mediator
  • Example
  • Source 1 book(author?, title, price)
  • Capability bff
  • I.e., we must provide an author, and can get
    title and price info
  • Source 2 review(title?, reviewer, rate)
  • Capability bff
  • I.e., we must provide a book title, and can get
    other info
  • Mediator view
  • MedView(A?,T,P,RV,RT) - book(A,T,P),review(T,RV,R
    T).
  • Query on the mediator view
  • Ans(RT) - MedView(A, db, P, RV, RT).
  • I.e., find the review rates of DB books
  • But the mediator cannot answer this query, since
    we do not know the authors.
  • We want to tell the user beforehand what queries
    can be answered

22
Outline
  • Basics theories of conjunctive queries
  • Global-as-view (GAV) approach to data
    integration
  • Local-as-view (LAV) approach to data
    integration.

23
Local-as-view (LAV) approach
Mediator
sources
  • There are global predicates, e.g., car,
    person, book, etc.
  • They can been seen as mediator views
  • The content of each source is described using
    these global predicates
  • A query to the mediator is also defined on the
    global predicates
  • The mediator finds a way to answer the query
    using the source contents

24
Example
Mediator
S1(Dealer,City)
S2(Dealer,Make,Year)
  • Global predicates Loc(Dealer,City),Sell(Dealer,Ma
    ke,Year)
  • Source content defined on global predicates
  • S1(Dealer,City) - Loc(Dealer, City).
  • S2(Dealer,Make,Year) - Sell(Dealer, Make,
    Year).
  • In general, each definition could be more
    complicated, rather than direct copies.
  • Queries defined on global predicates.
  • Q ans(D,M,Y) - Loc(D, windsor), Sell(D, M,
    Y).
  • Users do not know source views.
  • The mediator decides how to use source views to
    answer queries.
  • Answering queries using views
  • ans(D, M, Y) - S1(D,windsor), S2(D,M,Y).

25
Answering queries using views
Mediator
Query
V1(Dealer,City)- Loc(Dealer, City). V2(Dealer,Mak
e,Year)-Sell(Dealer, Make, Year). V3(D,C,M,Y) -
Loc(D,C),Sell(D,M,Y). V4(D,C,M,Y) -
Loc(D,C),Sell(D,M,Y), Ylt1970.
  • Source views can be complicated SPJs, arithmetic
    comparisons,
  • Not easy to decide how to answer a query using
    source views
  • Query ans(D,M) - Loc(D,windsor'),
    Sell(D,M,Y).
  • Rewriting
  • ans(D, M) - V3(D,windsor, M,Y).
  • ans(D, M) - V1(D,windsor), V2(D,M,Y).
  • Equivalent rewriting compute the same answer
    as the query
  • A rewriting can join multiple source views

26
Arithmetic comparisons
Mediator
V(D,C,M,Y)- Loc(D,C),Sell(D,M,Y),Ylt1970
  • Comparisons can make the problem even trickier
  • Query ans(D,M) - Loc(D,windsor'),
    Sell(D,M,Y).
  • Rewriting ans(D,M) - V(D,windsor, M,Y).
  • Contained rewriting only retrieve cars before
    1970.
  • Query ans(D,M)- Loc(D, windsor'), Sell(D,M,Y),
    Y lt 1960
  • Rewriting ans(D,M) - V(D,windsor, M, Y), Y lt
    1960

27
Rewritings
  • Given a query Q and a set of views V
  • A conjunctive query P is called a rewriting of
    Q using V if P only uses views in V, and P
    computes a partial answer of Q. That is Pexp ?Q.
    A rewriting is also called a contained
    rewriting (CR).
  • A conjunctive query P is called an equivalent
    rewriting (ER) of Q using V if P only uses views
    in V, and P computes the exact answer of Q. That
    is Pexp ? Q.

28
Bucket algorithm
  • Query
  • q(x)-car(x), sell(x, d), loc(d, windsor).
  • Views
  • v1(x) - car(x).
  • v2(x) - car(x), sell(x, d).
  • v3(x,d) - sell(x, d), loc(d, windsor).
  • v4(x) - sell(x, d), loc(d, windsor).

q(x)-v1(x), v2(x), v3(x,d). q(x)-v1(x),
v3(x,d). q(x)-v1(x), v4(x). q(x)-v2(x),
v3(x,d). q(x)-v2(x), v4(x).
29
Projects using the LAV approach
  • Projects Information Manifold, Infomaster,
    Tukwila,
  • Advantages
  • Scalable new sources easy to add without
    modifying the mediator views
  • All we need to do is to define the new source
    using the existing mediator views (predicates)
  • Disadvantages
  • Hard to decide how to answer a query using views
  • Reading Alon Halevy, Answering Queries Using
    Views A Survey.
Write a Comment
User Comments (0)
About PowerShow.com