Modern Databases

About This Presentation

Title:

Modern Databases

Description:

Modern Databases Willem Visser RW334 – PowerPoint PPT presentation

Number of Views:199

Avg rating:3.0/5.0

Slides: 38

Provided by: Informati556

Category:

more less

Transcript and Presenter's Notes

Title: Modern Databases

1
Modern Databases

Willem VisserRW334

2
The Web is Changing the Game

Databases used to be the domain of corporations
with limited amounts of data and limited amounts
of users
Very valuable information, but not a lot of it
Important users, but not many of them
In the modern web-driven world
Enormous amounts of data are being generated
Millions of users are interested in that data

3
What is Wrong here?
4
What is Wrong here?
How to make the DB scale?
5
Partition and Distribute the Data
6
Distributed Database

A single logical database spread physically
across computers in multiple locations that are
connected by a data communications link

7
Major Objectives

Location Transparency
User does not have to know the location of the
data
Data requests automatically forwarded to
appropriate sites
Local Autonomy
Local site can operate with its database when
network connections fail
Each site controls its own data, security,
logging, recovery

8
Distributed Databases Advantages

Increased reliability/availability
Local control over data
Modular growth
Lower communication costs
Faster response for certain queries

9
Distributed Database Disadvantages

Software cost and complexity
Processing overhead
Data integrity exposure
Slower response for certain queries

10
Options forDistributing a Database

Data replication
Copies of data distributed to different sites
Horizontal partitioning/Sharding
Different rows of a table distributed to
different sites
Vertical partitioning
Different columns of a table distributed to
different sites
Combinations of the above

11
Data Replication

Advantages
Reliability
Fast response
May avoid complicated distributed transaction
integrity routines (if replicated data is
refreshed at scheduled intervals)
Decouples nodes (transactions proceed even if
some nodes are down)
Reduced network traffic at prime time (if updates
can be delayed)

12
Data Replication (cont.)

Disadvantages
Additional requirements for storage space
Additional time for update operations
Complexity and cost of updating
Integrity exposure of getting incorrect data if
replicated data is not updated simultaneously

Therefore, better when used for non-volatile
(read-only) data
13
Factors in Choice ofDistributed Strategy

Funding, autonomy, security
Site data referencing patterns
Growth and expansion needs
Technological capabilities
Costs of managing complex technologies
Need for reliable service

14
Distributed DBMS

Distributed database requires distributed DBMS
Functions of a distributed DBMS
Locate data with a distributed data dictionary
Determine location from which to retrieve data
and process query components
DBMS translation between nodes with different
local DBMSs (using middleware)
Data management functions security, concurrency,
deadlock control, query optimization, failure
recovery
Data consistency (via multiphase commit
protocols)
Global primary key control
Scalability
Data and stored procedure replication
Allowing for different DBMSs and application code
at different nodes

15
Distributed DBMSTransparency Objectives

Location Transparency
User/application does not need to know where data
resides
Replication Transparency
User/application does not need to know about
duplication
Failure Transparency
Either all or none of the actions of a
transaction are committed
Each site has a transaction manager
Logs transactions and before and after images
Concurrency control scheme to ensure data
integrity
Requires special commit protocol

16
Query Optimization

In a query involving a multi-site join and,
possibly, a distributed database with replicated
files, the distributed DBMS must decide where to
access the data and how to proceed with the join.
Three step process
Query decompositionrewritten and simplified
Data localizationquery fragmented so that
fragments reference data at only one site
Global optimization
Order in which to execute query fragments
Data movement between sites
Where parts of the query will be executed
Semi join operation only the joining attribute
of the query is sent from one site to the other,
rather than all selected attributes

17
Brewers CAP Theorem

Eric Brewer, Keynote at ACM Symposium on the
Principles of Distributed Computing 2000
You cannot have all three of
Consistency
Availability
Partition Tolerance
Nothing short of complete network failure and the
system must keep functioning
Theorem proven in 2002 by Gilbert and Lynch
See http//www.julianbrowne.com/article/viewer/bre
wers-cap-theorem

18
Dealing with CAP?

Drop Partitioning Tolerance
Dont partition, but then you have serious
scalability issues, which is probably why you
want to partition in the first place
Drop Availability
Wait for all the partitions to sync before
allowing any usage
This is as bad for scalability as having no
partitioning
Drop Consistency
Eventual Consistency seems to work in most cases
If you have to drop one, this is the preferred
option
Flies against most DB principles

19
Relational DB?

Seems like we are assuming the DB must still be
relational
Web also forces a new concept
Not all data look the same anymore!
Email messages, Images, News documents, Facebook
updates, Tweets,
Relations are too rigid
Semi-structured data

20
The Information-Integration Problem

Related data exists in many places and could, in
principle, work together.
But different databases differ in
Model (relational, object-oriented?).
Schema (normalized/ not normalized?).
Terminology are consultants employees?
Retirees? Subcontractors?
Conventions (meters versus feet?).
How do we model information residing in
heterogeneous sources (if we cannot combine it
all in a single new database)?

21
Example

Suppose we are integrating information about bars
in some town.
Every bar has a database.
One may use a relational DBMS another keeps the
menu in an MS-Word document.
One stores the phones of distributors, another
does not.
One distinguishes ales from other beers, another
doesnt.
One counts beer inventory by bottles, another by
cases.

22
Semi-structured Data

Purpose represent data from independent sources
more flexibly than either relational or
object-oriented models.
Think of objects, but with the type of each
object its own business, not that of its class.
Labels to indicate meaning of substructures.
Data is self-describing structural information
is part of the data.

23
Graphs of Semistructured Data

Nodes objects.
Labels on arcs (attributes, relationships).
Atomic values at leaf nodes (nodes with no arcs
out).
Flexibility no restriction on
Labels out of a node.
Number of successors with a given label.

24
Example Data Graph
root
beer
beer
bar
manf
manf
prize
A.B.
name
name
year
award
servedAt
Bud
Gold
1995
Mlob
name
addr
Maple
Joes
25
XML

XML Extensible Markup Language.
While HTML uses tags for formatting (e.g.,
italic), XML uses tags for semantics (e.g.,
this is an address).
Key idea create tag sets for a domain (e.g.,
bars), and translate all data into properly
tagged XML documents.
Well formed XML - XML which is syntactically
correct
tags and their nesting totally arbitrary.
Valid XML - XML which has DTD (document type
definition)
imposes some structure on the tags, but much
more flexible than relational database schema.
DTD and XML Schema
Meta-data for XML
Describe what are valid XML structures

26
XML and Semi-structured Data

Well-Formed XML with nested tags is exactly the
same idea as trees of semi-structured data.
XML also enables non-tree structures (with
references to IDs of nodes), as does the
semi-structured data model.

27
Example Well-Formed XML

lt?xml version 1.0 standalone yes ?gt
ltBARSgt
ltBARgtltNAMEgtJoes Barlt/NAMEgt
ltBEERgtltNAMEgtBudlt/NAMEgt
ltPRICEgt2.50lt/PRICEgtlt/BEERgt
ltBEERgtltNAMEgtMillerlt/NAMEgt
ltPRICEgt3.00lt/PRICEgtlt/BEERgt
lt/BARgt
ltBARgt
lt/BARSgt

28
Example

The ltBARSgt XML document is

BARS
BAR
BAR
BAR
NAME
. . .
BEER
BEER
Joes Bar
PRICE
PRICE
NAME
NAME
Bud
2.50
Miller
3.00
29
DTD Elements

The description of an element consists of its
name (tag), and a parenthesized description of
any nested tags.
Includes order of subtags and their multiplicity.
Leaves (text elements) have PCDATA (Parsed
Character DATA ) in place of nested tags.

30
Example DTD

lt!DOCTYPE BARS
lt!ELEMENT BARS (BAR)gt
lt!ELEMENT BAR (NAME, BEER)gt
lt!ELEMENT NAME (PCDATA)gt
lt!ELEMENT BEER (NAME, PRICE)gt
lt!ELEMENT PRICE (PCDATA)gt
gt

31
Querying XML

Why query XML-documents?
special XML databases
major DBMSs speak XML
Does the world need a new query language?
Most of the world's business data is stored in
relational databases
The relational language SQL is mature and
well-established
Can SQL be adapted to query XML data?
Leverage existing software
Leverage existing user skills

32
XML vs Relational Data

Relational data is "flat rows and columns
XML data is nested and its depth may be
irregular and unpredictable
Relations can represent hierarchic data by
foreign keys or by structured datatypes
In XML it is natural to search for objects at
unknown levels of the hierarchy
"Find all the red things

33
XML vs Relational Data (cont.)

Relational data is uniform and repetitive
All bank accounts are similar in structure
Metadata can be factored out to a system catalog
XML data is highly variable
Every web page is different
Each XML object needs to be self-describing
Metadata is distributed throughout the document
Queries may access metadata as well as data
"Find elements whose name is the same as their
content
//name(.) string(.)

34
XML vs Relational Data (cont.)

Relational queries return uniform sets of rows
The results of an XML query may have mixed types
and complex structures
"Red things" a flag, a cherry, a stopsign, ...
Elements can be mixed with atomic values
XML queries need to be able to perform structural
transformations
Example invert a hierarchy

35
XML vs Relational Data (cont.)

The rows of a relation are unordered
Any desired output ordering must be derived from
values
The elements in an XML document are ordered
Implications for query
Preserve input order in query results
Specify an output ordering at multiple levels
"Find the fifth step
"Find all the tools used before the hammer

36
XML vs Relational Data (cont.)

Relational data is "dense
Every row has a value in every column
A "null" value is needed for missing or
inapplicable data
XML data can be "sparse
Missing or inapplicable elements can be "empty
or "not there
This gives XML a degree of freedom not present in
relational databases

37
XPATH and XQUERY

XPATH is a language for describing paths in XML
documents.
Really think of the semi-structured data graph
and its paths.
Why do we need path description language cant
get at the data using just Relation.Attribute
expressions.
XQUERY is a full query language for XML documents
with power similar to OQL (Object Query Language,
query language for object-oriented databases).

Write a Comment

User Comments (0)

About PowerShow.com

Modern Databases - PowerPoint PPT Presentation

Modern Databases

Modern Databases Willem Visser RW334 – PowerPoint PPT presentation