Modern Databases - PowerPoint PPT Presentation

Modern Databases Willem Visser RW334 – PowerPoint PPT presentation

1
Modern Databases
  • Willem Visser, RW334

2
The Web is Changing the Game
  • Databases used to be the domain of corporations
    with limited amounts of data and limited numbers
    of users
  • Very valuable information, but not a lot of it
  • Important users, but not many of them
  • In the modern web-driven world
  • Enormous amounts of data are being generated
  • Millions of users are interested in that data

3
What is Wrong here?

4
What is Wrong here?
How to make the DB scale?

5
Partition and Distribute the Data

6
Distributed Database
  • A single logical database spread physically
    across computers in multiple locations that are
    connected by a data communications link

7
Major Objectives
  • Location Transparency
  • User does not have to know the location of the
    data
  • Data requests automatically forwarded to
    appropriate sites
  • Local Autonomy
  • Local site can operate with its database when
    network connections fail
  • Each site controls its own data, security,
    logging, recovery

8
Distributed Databases Advantages
  • Increased reliability/availability
  • Local control over data
  • Modular growth
  • Lower communication costs
  • Faster response for certain queries

9
Distributed Database Disadvantages
  • Software cost and complexity
  • Processing overhead
  • Data integrity exposure
  • Slower response for certain queries

10
Options for Distributing a Database
  • Data replication
  • Copies of data distributed to different sites
  • Horizontal partitioning/Sharding
  • Different rows of a table distributed to
    different sites
  • Vertical partitioning
  • Different columns of a table distributed to
    different sites
  • Combinations of the above
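
The three options above can be sketched in a few lines of Python. This is a minimal illustration, not a real distribution layer: the sample rows, the site names, and the "key modulo number of sites" placement rule are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical table: each row is a dict keyed by column name.
ROWS = [
    {"id": 1, "name": "Joe's Bar", "addr": "Maple"},
    {"id": 2, "name": "Sue's Bar", "addr": "Oak"},
    {"id": 3, "name": "Moe's Bar", "addr": "Elm"},
]


def replicate(rows, sites):
    """Data replication: every site gets a full copy of the rows."""
    return {site: [dict(row) for row in rows] for site in sites}


def shard(rows, sites, key="id"):
    """Horizontal partitioning: each row is stored at exactly one site."""
    placement = defaultdict(list)
    for row in rows:
        # Assumed placement rule: key modulo the number of sites.
        site = sites[row[key] % len(sites)]
        placement[site].append(row)
    return dict(placement)


def split_columns(rows, column_groups, key="id"):
    """Vertical partitioning: each site stores some columns plus the key."""
    return {
        site: [{c: row[c] for c in (key, *cols)} for row in rows]
        for site, cols in column_groups.items()
    }
```

Note that the vertical split repeats the key column at every site, so the original rows can be rebuilt by joining on it.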

11
Data Replication
  • Advantages
  • Reliability
  • Fast response
  • May avoid complicated distributed transaction
    integrity routines (if replicated data is
    refreshed at scheduled intervals)
  • Decouples nodes (transactions proceed even if
    some nodes are down)
  • Reduced network traffic at prime time (if updates
    can be delayed)

12
Data Replication (cont.)
  • Disadvantages
  • Additional requirements for storage space
  • Additional time for update operations
  • Complexity and cost of updating
  • Integrity exposure of getting incorrect data if
    replicated data is not updated simultaneously

Therefore, replication works better for non-volatile
(read-only) data

13
Factors in Choice of Distributed Strategy
  • Funding, autonomy, security
  • Site data referencing patterns
  • Growth and expansion needs
  • Technological capabilities
  • Costs of managing complex technologies
  • Need for reliable service

14
Distributed DBMS
  • Distributed database requires distributed DBMS
  • Functions of a distributed DBMS
  • Locate data with a distributed data dictionary
  • Determine location from which to retrieve data
    and process query components
  • DBMS translation between nodes with different
    local DBMSs (using middleware)
  • Data management functions: security, concurrency,
    deadlock control, query optimization, failure
    recovery
  • Data consistency (via multiphase commit
    protocols)
  • Global primary key control
  • Scalability
  • Data and stored procedure replication
  • Allowing for different DBMSs and application code
    at different nodes

15
Distributed DBMS Transparency Objectives
  • Location Transparency
  • User/application does not need to know where data
    resides
  • Replication Transparency
  • User/application does not need to know about
    duplication
  • Failure Transparency
  • Either all or none of the actions of a
    transaction are committed
  • Each site has a transaction manager
  • Logs transactions and before and after images
  • Concurrency control scheme to ensure data
    integrity
  • Requires special commit protocol
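
The slides do not name a specific commit protocol. One common choice is two-phase commit, sketched below under that assumption. The `Site` class and its `can_commit` flag are hypothetical stand-ins for a local transaction manager; logging and timeout handling are left out.

```python
class Site:
    """A hypothetical participant with its own local transaction manager."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumed flag: can local work be made durable?
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(sites):
    """Commit at every site or at none; returns True if committed."""
    # Phase 1: collect a vote from every site (no short-circuiting).
    votes = [site.prepare() for site in sites]
    # Phase 2: commit only on a unanimous yes, otherwise abort everywhere.
    if all(votes):
        for site in sites:
            site.commit()
        return True
    for site in sites:
        site.abort()
    return False
```

A single "no" vote aborts the transaction at every site, which is the all-or-none behavior that failure transparency asks for.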

16
Query Optimization
  • In a query involving a multi-site join and,
    possibly, a distributed database with replicated
    files, the distributed DBMS must decide where to
    access the data and how to proceed with the join.
    This is a three-step process:
  • Query decomposition: the query is rewritten and
    simplified
  • Data localization: the query is fragmented so that
    each fragment references data at only one site
  • Global optimization
  • Order in which to execute query fragments
  • Data movement between sites
  • Where parts of the query will be executed
  • Semi-join operation: only the joining attribute
    of the query is sent from one site to the other,
    rather than all selected attributes

17
Brewer's CAP Theorem
  • Eric Brewer, Keynote at ACM Symposium on the
    Principles of Distributed Computing 2000
  • You cannot have all three of
  • Consistency
  • Availability
  • Partition Tolerance
  • The system must keep functioning under anything
    short of complete network failure
  • Theorem proven in 2002 by Gilbert and Lynch
  • See http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

18
Dealing with CAP?
  • Drop Partition Tolerance
  • Don't partition, but then you have serious
    scalability issues, which is probably why you
    wanted to partition in the first place
  • Drop Availability
  • Wait for all the partitions to sync before
    allowing any usage
  • This is as bad for scalability as having no
    partitioning
  • Drop Consistency
  • Eventual Consistency seems to work in most cases
  • If you have to drop one, this is the preferred
    option
  • Goes against most DB principles
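
To make "eventual consistency" concrete, here is a toy sketch of two replicas that accept writes independently and converge when they later exchange state. The version-number scheme with "highest version wins" is an assumption chosen for brevity, not something the slides prescribe; real systems use more careful conflict resolution.

```python
class Replica:
    """A hypothetical replica storing key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Keep the write only if it is newer than what is stored.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange state in both directions so both hold the newest versions."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)
```

Between a write and the next `sync`, a reader at the other replica sees stale (or missing) data. That window is the consistency being given up in exchange for availability.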

19
Relational DB?
  • Seems like we are assuming the DB must still be
    relational
  • Web also forces a new concept
  • Not all data look the same anymore!
  • Email messages, images, news documents, Facebook
    updates, tweets, ...
  • Relations are too rigid
  • Semi-structured data

20
The Information-Integration Problem
  • Related data exists in many places and could, in
    principle, work together.
  • But different databases differ in
  • Model (relational, object-oriented?).
  • Schema (normalized/ not normalized?).
  • Terminology: are consultants employees?
    Retirees? Subcontractors?
  • Conventions (meters versus feet?).
  • How do we model information residing in
    heterogeneous sources (if we cannot combine it
    all in a single new database)?

21
Example
  • Suppose we are integrating information about bars
    in some town.
  • Every bar has a database.
  • One may use a relational DBMS; another keeps the
    menu in an MS-Word document.
  • One stores the phones of distributors, another
    does not.
  • One distinguishes ales from other beers, another
    doesn't.
  • One counts beer inventory by bottles, another by
    cases.

22
Semi-structured Data
  • Purpose: represent data from independent sources
    more flexibly than either relational or
    object-oriented models.
  • Think of objects, but with the type of each
    object its own business, not that of its class.
  • Labels to indicate meaning of substructures.
  • Data is self-describing: structural information
    is part of the data.

23
Graphs of Semi-structured Data
  • Nodes = objects.
  • Labels on arcs (attributes, relationships).
  • Atomic values at leaf nodes (nodes with no arcs
    out).
  • Flexibility: no restriction on
  • Labels out of a node.
  • Number of successors with a given label.
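
One way to picture such a graph is as nested Python dictionaries: each node maps arc labels to a list of successors, and leaves are plain values. The data below is a guess at a small beer-and-bar graph, for illustration only. In particular, the direction of the `servedAt` arc and the shape of the `prize` node are assumptions.

```python
# Each node is a dict: arc label -> list of successor nodes.
# Leaves are atomic values (strings, numbers).
joes = {"name": ["Joe's"], "addr": ["Maple"]}
bud = {
    "name": ["Bud"],
    "manf": ["A.B."],
    "prize": [{"year": [1995], "award": ["Gold"]}],
    "servedAt": [joes],  # assumed arc direction: beer -> bar
}
mlob = {"name": ["M'lob"], "manf": ["A.B."]}  # no prize: labels may be absent
root = {"beer": [bud, mlob], "bar": [joes]}   # two successors share a label


def follow(node, *labels):
    """Return all nodes reached from `node` along the given arc labels."""
    frontier = [node]
    for label in labels:
        frontier = [
            child
            for n in frontier
            if isinstance(n, dict)          # leaves have no arcs out
            for child in n.get(label, [])   # a missing label simply matches nothing
        ]
    return frontier
```

For example, `follow(root, "beer", "name")` visits both beer nodes, while `follow(root, "beer", "prize", "award")` silently skips the beer that has no `prize` arc.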

24
Example Data Graph
[Figure: data graph. A root node has two arcs labeled beer and one labeled bar. Other arc labels: name, manf, prize, year, award, servedAt, addr. Leaf values: Bud, M'lob, A.B., Gold, 1995, Joe's, Maple.]

25
XML
  • XML = Extensible Markup Language.
  • While HTML uses tags for formatting (e.g.,
    italic), XML uses tags for semantics (e.g.,
    this is an address).
  • Key idea: create tag sets for a domain (e.g.,
    bars), and translate all data into properly
    tagged XML documents.
  • Well-formed XML: XML which is syntactically
    correct
  • Tags and their nesting are otherwise arbitrary.
  • Valid XML: XML which conforms to a DTD (document
    type definition)
  • The DTD imposes some structure on the tags, but is
    much more flexible than a relational database schema.
  • DTD and XML Schema
  • Meta-data for XML
  • Describe which XML structures are valid

26
XML and Semi-structured Data
  • Well-Formed XML with nested tags is exactly the
    same idea as trees of semi-structured data.
  • XML also enables non-tree structures (with
    references to IDs of nodes), as does the
    semi-structured data model.

27
Example Well-Formed XML
  • <?xml version = "1.0" standalone = "yes" ?>
  • <BARS>
  • <BAR><NAME>Joe's Bar</NAME>
  • <BEER><NAME>Bud</NAME>
  • <PRICE>2.50</PRICE></BEER>
  • <BEER><NAME>Miller</NAME>
  • <PRICE>3.00</PRICE></BEER>
  • </BAR>
  • <BAR> ...
  • </BARS>
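
Well-formedness can be checked mechanically: a parser either accepts the document or reports a syntax error. The sketch below uses Python's standard `xml.etree.ElementTree` on a cut-down version of the bars document with a single complete BAR element. The helper name `is_well_formed` is made up for this example.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" standalone="yes"?>
<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
</BARS>"""


def is_well_formed(text):
    """Return True if the text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Once parsed, the nested tags can be walked like a tree.
bars = ET.fromstring(DOC)
prices = {
    beer.findtext("NAME"): float(beer.findtext("PRICE"))
    for beer in bars.iter("BEER")
}
```

A document such as `<BARS><BAR></BARS>`, where a tag is never closed, fails the check regardless of what the tags mean.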

28
Example
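  • The <BARS> XML document is a tree:

[Figure: tree rooted at BARS with children BAR, BAR, BAR, ... The first BAR has children NAME (Joe's Bar), BEER, and BEER. The first BEER has NAME (Bud) and PRICE (2.50); the second has NAME (Miller) and PRICE (3.00).]

29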
DTD Elements
  • The description of an element consists of its
    name (tag), and a parenthesized description of
    any nested tags.
  • Includes order of subtags and their multiplicity.
  • Leaves (text elements) have #PCDATA (Parsed
    Character DATA) in place of nested tags.

30
Example DTD
  • <!DOCTYPE BARS [
  • <!ELEMENT BARS (BAR*)>
  • <!ELEMENT BAR (NAME, BEER+)>
  • <!ELEMENT NAME (#PCDATA)>
  • <!ELEMENT BEER (NAME, PRICE)>
  • <!ELEMENT PRICE (#PCDATA)>
  • ]>
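
A DTD content model is essentially a regular expression over child tag names, so a toy validity check can be written with Python's `re` module. The sketch assumes that BARS holds zero or more BAR elements and that each BAR holds a NAME followed by one or more BEER elements. It is a hand-rolled illustration, not a DTD parser: Python's standard `xml.etree.ElementTree` does not validate against DTDs, and the `MODELS` table and `conforms` helper are invented here.

```python
import re
import xml.etree.ElementTree as ET

# Assumed content models, as regexes over space-terminated child tag names.
# None marks a text-only (#PCDATA) leaf.
MODELS = {
    "BARS": r"(BAR )*",
    "BAR": r"NAME (BEER )+",
    "BEER": r"NAME PRICE ",
    "NAME": None,
    "PRICE": None,
}


def conforms(elem):
    """Check an element and its descendants against MODELS."""
    if elem.tag not in MODELS:
        return False                      # undeclared element
    children = list(elem)
    model = MODELS[elem.tag]
    if model is None:
        return not children               # leaves may not contain tags
    # Order and multiplicity of subtags must match the content model.
    sequence = "".join(child.tag + " " for child in children)
    if re.fullmatch(model, sequence) is None:
        return False
    return all(conforms(child) for child in children)
```

A BAR with no BEER, or a BEER whose PRICE comes before its NAME, is still well-formed XML but fails this check.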

31
Querying XML
  • Why query XML-documents?
  • special XML databases
  • major DBMSs speak XML
  • Does the world need a new query language?
  • Most of the world's business data is stored in
    relational databases
  • The relational language SQL is mature and
    well-established
  • Can SQL be adapted to query XML data?
  • Leverage existing software
  • Leverage existing user skills

32
XML vs Relational Data
  • Relational data is "flat": rows and columns
  • XML data is nested, and its depth may be
    irregular and unpredictable
  • Relations can represent hierarchic data by
    foreign keys or by structured datatypes
  • In XML it is natural to search for objects at
    unknown levels of the hierarchy
  • "Find all the red things"

33
XML vs Relational Data (cont.)
  • Relational data is uniform and repetitive
  • All bank accounts are similar in structure
  • Metadata can be factored out to a system catalog
  • XML data is highly variable
  • Every web page is different
  • Each XML object needs to be self-describing
  • Metadata is distributed throughout the document
  • Queries may access metadata as well as data:
    "Find elements whose name is the same as their
    content"
  • //*[name(.) = string(.)]
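
The query "find elements whose name is the same as their content" mixes metadata (the tag name) with data (the text). A rough Python equivalent, using the standard `xml.etree.ElementTree` module and a made-up sample document, might look like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two elements happen to contain their own name.
doc = ET.fromstring(
    "<items><red>red</red><color>blue</color><blue>blue</blue></items>"
)

# Visit every element at any depth and compare its tag name (metadata)
# with its full text content, including text of descendants (data).
matches = [
    elem.tag
    for elem in doc.iter()
    if "".join(elem.itertext()) == elem.tag
]
```

A relational query has no natural counterpart, because column names live in the system catalog rather than in the rows being queried.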

34
XML vs Relational Data (cont.)
  • Relational queries return uniform sets of rows
  • The results of an XML query may have mixed types
    and complex structures
  • "Red things": a flag, a cherry, a stop sign, ...
  • Elements can be mixed with atomic values
  • XML queries need to be able to perform structural
    transformations
  • Example: invert a hierarchy

35
XML vs Relational Data (cont.)
  • The rows of a relation are unordered
  • Any desired output ordering must be derived from
    values
  • The elements in an XML document are ordered
  • Implications for query
  • Preserve input order in query results
  • Specify an output ordering at multiple levels
  • "Find the fifth step"
  • "Find all the tools used before the hammer"

36
XML vs Relational Data (cont.)
  • Relational data is "dense"
  • Every row has a value in every column
  • A "null" value is needed for missing or
    inapplicable data
  • XML data can be "sparse"
  • Missing or inapplicable elements can be "empty"
    or "not there"
  • This gives XML a degree of freedom not present in
    relational databases

37
XPATH and XQUERY
  • XPATH is a language for describing paths in XML
    documents.
  • Really think of the semi-structured data graph
    and its paths.
  • Why do we need a path description language? We
    can't get at the data using just
    Relation.Attribute expressions.
  • XQUERY is a full query language for XML documents
    with power similar to OQL (Object Query Language,
    query language for object-oriented databases).
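
Path expressions can be tried out with Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath (not the full language, and not XQuery). The document below is a cut-down version of the bars example, and the variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

bars_doc = ET.fromstring(
    "<BARS>"
    "<BAR><NAME>Joe's Bar</NAME>"
    "<BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>"
    "<BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>"
    "</BAR>"
    "</BARS>"
)

# A fixed path from the root: the names of all beers in all bars.
beer_names = [e.text for e in bars_doc.findall("./BAR/BEER/NAME")]

# "//" style search: PRICE elements at any depth below the root.
all_prices = [e.text for e in bars_doc.findall(".//PRICE")]

# A predicate on a child: the price of the beer whose NAME is Bud.
bud_price = [e.text for e in bars_doc.findall(".//BEER[NAME='Bud']/PRICE")]
```

Each path walks the document tree by tag name, which is exactly what a Relation.Attribute expression cannot express when the nesting depth varies.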