Data Integration and Physical Storage - PowerPoint PPT Presentation

1 / 44
About This Presentation
Title:

Data Integration and Physical Storage

Description:

... also define mappings between ... We find all views that define book with author and title, and we compose ... Define each source as the subset of a query ... – PowerPoint PPT presentation

Number of Views:40
Avg rating:3.0/5.0
Slides: 45
Provided by: zack4
Category:

less

Transcript and Presenter's Notes

Title: Data Integration and Physical Storage


1
Data Integrationand Physical Storage
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 550 Database Information Systems
  • November 15, 2005

2
Mappings between Schemas
  • LSD provides attribute correspondences, but not
    complete mappings
  • Mappings generally are posed as views define
    relations in one schema (typically either the
    mediated schema or the source schema), given data
    in the other schema
  • This allows us to restructure or recompose
    decompose our data in a new way
  • We can also define mappings between values in a
    view
  • We use an intermediate table defining
    correspondences a concordance table
  • It can be filled in using some type of code, and
    corrected by hand

3
A Few Mapping Examples
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
  • MotionPicture(ID, Title, Year)Participant(ID,
    Name, Role)

PieceOfArt(I, A, S, T, Movie) - Movie(T, Y, A,
_, S1, S2), ID T Y, S S1 S2
Movie(T, Y, D, E, S1, S2) - MotionPicture(I, T,
Y), Participant(I, D, Dir), Participant(I, E,
Editor), Participant(I, S1, Star1),
Participant(I, S2, Star2)
T1
T2
Need a concordance table from CustIDs to PennIDs
4
Two Important Approaches
  • TSIMMIS Garcia-Molina97 Stanford
  • Focus semistructured data (OEM), OQL-based
    language (Lorel)
  • Creates a mediated schema as a view over the
    sources
  • Spawned a UCSD project called MIX, which led to a
    company now owned by BEA Systems
  • Other important systems of this vein Kleisli/K2
    _at_ Penn
  • Information Manifold Levy96 ATT Research
  • Focus local-as-view mappings, relational model
  • Sources defined as views over mediated schema
  • Requires a special
  • Led to peer-to-peer integration approaches
    (Piazza, etc.)
  • Focus Web-based queriable sources

5
TSIMMIS
  • One of the first systems to support
    semi-structured data, which predated XML by
    several years OEM
  • An instance of a global-as-view mediation
    system
  • We define our global schema as views over the
    sources

6
XML vs. Object Exchange Model
ltbookgt ltauthorgtBernsteinlt/authorgt
ltauthorgtNewcomerlt/authorgt lttitlegtPrinciples of
TPlt/titlegtlt/bookgt ltbookgt ltauthorgtChamberlinlt/au
thorgt lttitlegtDB2 UDBlt/titlegtlt/bookgt
O1 book O2 author Bernstein O3
author Newcomer O4 title Principles of
TP O5 book O6 author Chamberlin
O7 title DB2 UDB
7
Queries in TSIMMIS
  • Specified in OQL-style language called Lorel
  • OQL was an object-oriented query language that
    looks like SQL
  • Lorel is, in many ways, a predecessor to XQuery
  • Based on path expressions over OEM structures
  • select book where book.title DB2 UDB and
    book.author Chamberlin
  • This is basically like XQuery, which well use in
    place of Lorel and the MSL template language.
    Previous query restated
  • for b in AllData()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b

8
Query Answering in TSIMMIS
  • Basically, its view unfolding, i.e., composing a
    query with a view
  • The query is the one being asked
  • The views are the MSL templates for the wrappers
  • Some of the views may actually require
    parameters, e.g., an author name, before theyll
    return answers
  • Common for web forms (see Amazon, Google, )
  • XQuery functions (XQuerys version of views)
    support parameters as well, so well see these in
    action

9
A Wrapper Definition in MSL
  • Wrappers have templates and binding patterns (X)
    in MSL
  • B - B ltbook ltauthor Xgtgt // select
    from book where author X //
  • This reformats a SQL query over Book(author,
    year, title)
  • In XQuery, this might look like
  • define function GetBook(x AS xsdstring) as book
    for b in sql(Amazon.DB, select
    from book where author x )return
    ltbookgtb/titleltauthorgtxlt/authorgtlt/bookgt


book
author
title


The union of GetBooks results is unioned with
others to form the view Mediator()
10
How to Answer the Query
  • Given our query
  • for b in Mediator()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b
  • Find all wrapper definitions that
  • Contain output enough structure to match the
    conditions of the query
  • Or have already tested the conditions for us!

11
Query Composition with Views
  • We find all views that define book with author
    and title, and we compose the query with each
  • define function GetBook(x AS xsdstring) as book
    for b in sql(Amazon.DB, select
    from book where author x )return
    ltbookgt b/title ltauthorgtxlt/authorgtlt/bookgt
  • for b in Mediator()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b

book
author
title


12
Matching View Output to Our Querys Conditions
  • Determine that b/book/author/text() ?? x by
    matching the pattern on the functions output
  • define function GetBook(x AS xsdstring) as book
    for b in sql(Amazon.DB, select
    from book where author x )return
    ltbookgt b/title ltauthorgtxlt/author
    gtlt/bookgt
  • let x Chamberlinfor b in
    GetBook(x)/bookwhere b/title/text() DB2
    UDB return b

book
author
title


13
The Final Step Unfolding
  • let x Chamberlinfor b in ( for b in
    sql(Amazon.com, select from book where
    author x ) return ltbookgt b/title
    ltauthorgtxlt/authorgtlt/bookgt
  • )/bookwhere b/title/text() DB2 UDB
    return b
  • How do we simplify further to get to here?
  • for b in sql(Amazon.com, select from
    book where authorChamberlin)where
    b/title/text() DB2 UDB return b

14
Virtues of TSIMMIS
  • Early adopter of semistructured data, greatly
    predating XML
  • Can support data from many different kinds of
    sources
  • Obviously, doesnt fully solve heterogeneity
    problem
  • Presents a mediated schema that is the union of
    multiple views
  • Query answering based on view unfolding
  • Easily composed in a hierarchy of mediators

15
Limitations of TSIMMIS Approach
  • Some data sources may contain data with certain
    ranges or properties
  • Books by Aho, Students at UPenn,
  • If we ask a query for students at Columbia, dont
    want to bother querying students at Penn
  • How do we express these?
  • Mediated schema is basically the union of the
    various MSL templates as they change, so may
    the mediated schema

16
An Alternate ApproachThe Information Manifold
(Levy et al.)
  • When you integrate something, you have some
    conceptual model of the integrated domain
  • Define that as a basic frame of reference,
    everything else as a view over it
  • Local as View
  • May have overlapping/incomplete sources
  • Define each source as the subset of a query over
    the mediated schema
  • We can use selection or join predicates to
    specify that a source contains a range of values
  • ComputerBooks() ? Books(Title, , Subj), Subj
    Computers

17
The Local-as-View Model
  • The basic model is the following
  • Local sources are views over the mediated
    schema
  • Sources have the data mediated schema is
    virtual
  • Sources may not have all the data from the domain
    open-world assumption
  • The system must use the sources (views) to answer
    queries over the mediated schema

18
Query Answering
  • Assumption conjunctive queries, set semantics
  • Suppose we have a mediated schema author(aID,
    isbn, year), book(isbn, title, publisher)
  • Suppose we have the query
  • q(a, t) - author(a, i, _), book(i, t, p), t
    DB2 UDB
  • and sources
  • s1(a,t) ? author(a, i, _), book(i, t, p), t
    123
  • s5(a, t, p) ? author(a, i, _), book(i,t), p
    SAMS
  • We want to compose the query with the source
    mappings but theyre in the wrong direction!
  • Yet everything in s1, s5 is an answer to the
    query!

19
Answering Queries Using Views
  • Numerous recently-developed algorithms for these
  • Inverse rules Duschka et al.
  • Bucket algorithm Levy et al.
  • MiniCon Pottinger Halevy
  • Also related chase and backchase Popa,
    Tannen, Deutsch
  • Requires conjunctive queries

20
Summary of Data Integration
  • Local-as-view integration has replaced
    global-as-view as the standard
  • More robust way of defining mediated schemas and
    sources
  • Mediated schema is clearly defined, less likely
    to change
  • Sources can be more accurately described
  • Methods exist for query reformulation, including
    inverse rules
  • Integration requires standardization on a single
    schema
  • Can be hard to get consensus
  • Today we have peer-to-peer data integration,
    e.g., Piazza Halevy et al., Orchestra Ives et
    al., Hyperion Miller et al.
  • Some other aspects of integration were addressed
    in related papers
  • Overlap between sources coverage of data at
    sources
  • Semi-automated creation of mappings and wrappers
  • Data integration capabilities in commercial
    products BEAs Liquid Data, IBMs DB2
    Information Integrator, numerous packages from
    middleware companies

21
Performance What Governs It?
  • Speed of the machine of course!
  • But also many software-controlled factors that we
    must understand
  • Caching and buffer management
  • How the data is stored physical layout,
    partitioning
  • Auxiliary structures indices
  • Locking and concurrency control (well talk about
    this later)
  • Different algorithms for operations query
    execution
  • Different orderings for execution query
    optimization
  • Reuse of materialized views, merging of query
    subexpressions answering queries using views
    multi-query optimization

22
Our General Emphasis
  • Goal cover basic principles that are applied
    throughout database system design
  • Use the appropriate strategy in the appropriate
    place
  • Every (reasonable) algorithm is good somewhere
  • And a corollary database people reinvent a lot
    of things and add minor tweaks

23
Whats the Base in Database?
  • Could just be a file with random access
  • What are the advantages and disadvantages?
  • DBs generally require raw disk access
  • Need to know when a page is actually written to
    disk, vs. queued by the OS
  • Predictable performance, less fragmentation
  • May want to exploit striping or contiguous
    regions
  • Typically divided into extents and pages

24
Buffer Management
  • Could keep DB in RAM
  • Main-memory DBs like TimesTen
  • But many DBs are still too big we read replace
    pages
  • May need to force to disk or pin in buffer
  • Policies for page replacement, prefetching
  • LRU, as in Operating Systems (not as good as you
    might think why not?)
  • MRU (one-time sequential scans)
  • Clock, etc.
  • DBMIN (min pages, local policy)

Tuple Reads/Writes
Buffer Mgr
25
Storing Tuples in Pages
t1
  • Tuples
  • Many possible layouts
  • Dynamic vs. fixed lengths
  • Ptrs, lengths vs. slots
  • Tuples grow down, directories grow up
  • Identity and relocation
  • Objects and XML are harder
  • Horizontal, path, vertical partitioning
  • Generally no algorithmic way of deciding
  • Generally want to leave some space for insertions

t2
t3
26
Alternatives for Organizing Files
  • Many alternatives, each ideal for some situation,
    and poor for others
  • Heap files for full file scans or frequent
    updates
  • Data unordered
  • Write new data at end
  • Sorted Files if retrieved in sort order or want
    range
  • Need external sort or an index to keep sorted
  • Hashed Files if selection on equality
  • Collection of buckets with primary overflow
    pages
  • Hashing function over search key attributes

27
Model for Analyzing Access Costs
  • We ignore CPU costs, for simplicity
  • p(T) The number of data pages in table T
  • r(T) Number of records in table T
  • D (Average) time to read or write disk page
  • Measuring number of page I/Os ignores gains of
    pre-fetching blocks of pages thus, I/O cost is
    only approximated.
  • Average-case analysis based on several
    simplistic assumptions.
  • Good enough to show the overall trends!

28
Assumptions in Our Analysis
  • Single record insert and delete
  • Heap files
  • Equality selection on key exactly one match
  • Insert always at end of file
  • Sorted files
  • Files compacted after deletions
  • Selections on sort field(s)
  • Hashed files
  • No overflow buckets, 80 page occupancy

29
Cost of Operations
  • Several assumptions underlie these (rough)
    estimates!

30
Speeding Operations over Data
  • Three general data organization techniques
  • Indexing
  • Sorting
  • Hashing

31
Technique I Indexing
  • An index on a file speeds up selections on the
    search key attributes for the index (trade space
    for speed).
  • Any subset of the fields of a relation can be the
    search key for an index on the relation.
  • Search key is not the same as key (minimal set of
    fields that uniquely identify a record in a
    relation).
  • An index contains a collection of data entries,
    and supports efficient retrieval of all data
    entries k with a given key value k.

32
Alternatives for Data Entry k in Index
  • Three alternatives
  • Data record with key value k
  • Clustered ? fast lookup
  • Index is large only 1 can exist
  • ltk, rid of data record with search key value kgt,
    OR
  • ltk, list of rids of data records with search key
    kgt
  • Can have secondary indices
  • Smaller index may mean faster lookup
  • Often not clustered ? more expensive to use
  • Choice of alternative for data entries is
    orthogonal to the indexing technique used to
    locate data entries with a given key value k.

33
Classes of Indices
  • Primary vs. secondary primary has primary key
  • Clustered vs. unclustered order of records and
    index approximately same
  • Alternative 1 implies clustered, but not
    vice-versa
  • A file can be clustered on at most one search key
  • Dense vs. Sparse dense has index entry per data
    value sparse may skip some
  • Alternative 1 always leads to dense index
  • Every sparse index is clustered!
  • Sparse indexes are smaller however, some useful
    optimizations are based on dense indexes

34
Clustered vs. Unclustered Index
  • Suppose Index Alternative (2) used, records are
    stored in Heap file
  • Perhaps initially sort data file, leave some gaps
  • Inserts may require overflow pages

Index entries
UNCLUSTERED
CLUSTERED
direct search for
data entries
Data entries
Data entries
(Index File)
(Data file)
Data Records
Data Records
35
B Tree The DB Worlds Favorite Index
  • Insert/delete at log F N cost
  • (F fanout, N leaf pages)
  • Keep tree height-balanced
  • Minimum 50 occupancy (except for root).
  • Each node contains d lt m lt 2d entries. d is
    called the order of the tree.
  • Supports equality and range searches efficiently.

Index Entries
(Direct search)
Data Entries
("Sequence set")
36
Example B Tree
  • Search begins at root, and key comparisons direct
    it to a leaf.
  • Search for 5, 15, all data entries gt 24 ...

Root
30
17
24
13
39
3
5
19
20
22
24
27
38
2
7
14
16
29
33
34
  • Based on the search for 15, we know it is not
    in the tree!

37
B Trees in Practice
  • Typical order 100. Typical fill-factor 67.
  • average fanout 133
  • Typical capacities
  • Height 4 1334 312,900,700 records
  • Height 3 1333 2,352,637 records
  • Can often hold top levels in buffer pool
  • Level 1 1 page 8 Kbytes
  • Level 2 133 pages 1 Mbyte
  • Level 3 17,689 pages 133 MBytes

38
Inserting Data into a B Tree
  • Find correct leaf L.
  • Put data entry onto L.
  • If L has enough space, done!
  • Else, must split L (into L and a new node L2)
  • Redistribute entries evenly, copy up middle key.
  • Insert index entry pointing to L2 into parent of
    L.
  • This can happen recursively
  • To split index node, redistribute entries evenly,
    but push up middle key. (Contrast with leaf
    splits.)
  • Splits grow tree root split increases height.
  • Tree growth gets wider or one level taller at
    top.

39
Inserting 8 into Example B Tree
  • Observe how minimum occupancy is guaranteed in
    both leaf and index pg splits.
  • Recall that all data items are in leaves, and
    partition values for keys are in intermediate
    nodes
  • Note difference between copy-up and push-up.

40
Inserting 8 Example Copy up
Root
24
30
17
13
39
3
5
19
20
22
24
27
38
2
7
14
16
29
33
34
Want to insert here no room, so split copy up
8
Entry to be inserted in parent node.
(Note that 5 is copied up and
5
continues to appear in the leaf.)
3
5
2
7
8
41
Inserting 8 Example Push up
Need to split node push up
Root
24
30
17
13
5
39
3
19
20
22
24
27
38
2
14
16
29
33
34
5
7
8
Entry to be inserted in parent node.
(Note that 17 is pushed up and only appears once
in the index. Contrast this with a leaf split.)
17
5
24
30
13
42
Deleting Data from a B Tree
  • Start at root, find leaf L where entry belongs.
  • Remove the entry.
  • If L is at least half-full, done!
  • If L has only d-1 entries,
  • Try to re-distribute, borrowing from sibling
    (adjacent node with same parent as L).
  • If re-distribution fails, merge L and sibling.
  • If merge occurred, must delete entry (pointing to
    L or sibling) from parent of L.
  • Merge could propagate to root, decreasing height.

43
B Tree Summary
  • B tree and other indices ideal for range
    searches, good for equality searches.
  • Inserts/deletes leave tree height-balanced logF
    N cost.
  • High fanout (F) means depth rarely more than 3 or
    4.
  • Almost always better than maintaining a sorted
    file.
  • Typically, 67 occupancy on average.
  • Note Order (d) concept replaced by physical
    space criterion in practice (at least
    half-full).
  • Records may be variable sized
  • Index pages typically hold more entries than
    leaves

44
Other Kinds of Indices
  • Multidimensional indices
  • R-trees, kD-trees,
  • Text indices
  • Inverted indices
  • Structural indices
  • Object indices access support relations, path
    indices
  • XML and graph indices dataguides, 1-indices,
    d(k) indices
  • Describe parent-child, path relationships
Write a Comment
User Comments (0)
About PowerShow.com