Title: Modern Databases
1Modern Databases
2The Web is Changing the Game
- Databases used to be the domain of corporations
with limited amounts of data and limited amounts
of users - Very valuable information, but not a lot of it
- Important users, but not many of them
- In the modern web-driven world
- Enormous amounts of data are being generated
- Millions of users are interested in that data
3What is Wrong here?
4What is Wrong here?
How to make the DB scale?
5Partition and Distribute the Data
6Distributed Database
- A single logical database spread physically
across computers in multiple locations that are
connected by a data communications link
7Major Objectives
- Location Transparency
- User does not have to know the location of the
data - Data requests automatically forwarded to
appropriate sites - Local Autonomy
- Local site can operate with its database when
network connections fail - Each site controls its own data, security,
logging, recovery
8Distributed Databases Advantages
- Increased reliability/availability
- Local control over data
- Modular growth
- Lower communication costs
- Faster response for certain queries
9Distributed Database Disadvantages
- Software cost and complexity
- Processing overhead
- Data integrity exposure
- Slower response for certain queries
10Options forDistributing a Database
- Data replication
- Copies of data distributed to different sites
- Horizontal partitioning/Sharding
- Different rows of a table distributed to
different sites - Vertical partitioning
- Different columns of a table distributed to
different sites - Combinations of the above
11Data Replication
- Advantages
- Reliability
- Fast response
- May avoid complicated distributed transaction
integrity routines (if replicated data is
refreshed at scheduled intervals) - Decouples nodes (transactions proceed even if
some nodes are down) - Reduced network traffic at prime time (if updates
can be delayed)
12Data Replication (cont.)
- Disadvantages
- Additional requirements for storage space
- Additional time for update operations
- Complexity and cost of updating
- Integrity exposure of getting incorrect data if
replicated data is not updated simultaneously
Therefore, better when used for non-volatile
(read-only) data
13Factors in Choice ofDistributed Strategy
- Funding, autonomy, security
- Site data referencing patterns
- Growth and expansion needs
- Technological capabilities
- Costs of managing complex technologies
- Need for reliable service
14Distributed DBMS
- Distributed database requires distributed DBMS
- Functions of a distributed DBMS
- Locate data with a distributed data dictionary
- Determine location from which to retrieve data
and process query components - DBMS translation between nodes with different
local DBMSs (using middleware) - Data management functions security, concurrency,
deadlock control, query optimization, failure
recovery - Data consistency (via multiphase commit
protocols) - Global primary key control
- Scalability
- Data and stored procedure replication
- Allowing for different DBMSs and application code
at different nodes
15Distributed DBMSTransparency Objectives
- Location Transparency
- User/application does not need to know where data
resides - Replication Transparency
- User/application does not need to know about
duplication - Failure Transparency
- Either all or none of the actions of a
transaction are committed - Each site has a transaction manager
- Logs transactions and before and after images
- Concurrency control scheme to ensure data
integrity - Requires special commit protocol
16Query Optimization
- In a query involving a multi-site join and,
possibly, a distributed database with replicated
files, the distributed DBMS must decide where to
access the data and how to proceed with the join.
Three step process - Query decompositionrewritten and simplified
- Data localizationquery fragmented so that
fragments reference data at only one site - Global optimization
- Order in which to execute query fragments
- Data movement between sites
- Where parts of the query will be executed
- Semi join operation only the joining attribute
of the query is sent from one site to the other,
rather than all selected attributes
17Brewers CAP Theorem
- Eric Brewer, Keynote at ACM Symposium on the
Principles of Distributed Computing 2000 - You cannot have all three of
- Consistency
- Availability
- Partition Tolerance
- Nothing short of complete network failure and the
system must keep functioning - Theorem proven in 2002 by Gilbert and Lynch
- See http//www.julianbrowne.com/article/viewer/bre
wers-cap-theorem
18Dealing with CAP?
- Drop Partitioning Tolerance
- Dont partition, but then you have serious
scalability issues, which is probably why you
want to partition in the first place - Drop Availability
- Wait for all the partitions to sync before
allowing any usage - This is as bad for scalability as having no
partitioning - Drop Consistency
- Eventual Consistency seems to work in most cases
- If you have to drop one, this is the preferred
option - Flies against most DB principles
19Relational DB?
- Seems like we are assuming the DB must still be
relational - Web also forces a new concept
- Not all data look the same anymore!
- Email messages, Images, News documents, Facebook
updates, Tweets, - Relations are too rigid
- Semi-structured data
20The Information-Integration Problem
- Related data exists in many places and could, in
principle, work together. - But different databases differ in
- Model (relational, object-oriented?).
- Schema (normalized/ not normalized?).
- Terminology are consultants employees?
Retirees? Subcontractors? - Conventions (meters versus feet?).
- How do we model information residing in
heterogeneous sources (if we cannot combine it
all in a single new database)?
21Example
- Suppose we are integrating information about bars
in some town. - Every bar has a database.
- One may use a relational DBMS another keeps the
menu in an MS-Word document. - One stores the phones of distributors, another
does not. - One distinguishes ales from other beers, another
doesnt. - One counts beer inventory by bottles, another by
cases.
22Semi-structured Data
- Purpose represent data from independent sources
more flexibly than either relational or
object-oriented models. - Think of objects, but with the type of each
object its own business, not that of its class. - Labels to indicate meaning of substructures.
- Data is self-describing structural information
is part of the data.
23Graphs of Semistructured Data
- Nodes objects.
- Labels on arcs (attributes, relationships).
- Atomic values at leaf nodes (nodes with no arcs
out). - Flexibility no restriction on
- Labels out of a node.
- Number of successors with a given label.
24Example Data Graph
root
beer
beer
bar
manf
manf
prize
A.B.
name
name
year
award
servedAt
Bud
Gold
1995
Mlob
name
addr
Maple
Joes
25XML
- XML Extensible Markup Language.
- While HTML uses tags for formatting (e.g.,
italic), XML uses tags for semantics (e.g.,
this is an address). - Key idea create tag sets for a domain (e.g.,
bars), and translate all data into properly
tagged XML documents. - Well formed XML - XML which is syntactically
correct - tags and their nesting totally arbitrary.
- Valid XML - XML which has DTD (document type
definition) - imposes some structure on the tags, but much
more flexible than relational database schema. - DTD and XML Schema
- Meta-data for XML
- Describe what are valid XML structures
26XML and Semi-structured Data
- Well-Formed XML with nested tags is exactly the
same idea as trees of semi-structured data. - XML also enables non-tree structures (with
references to IDs of nodes), as does the
semi-structured data model.
27Example Well-Formed XML
- lt?xml version 1.0 standalone yes ?gt
- ltBARSgt
- ltBARgtltNAMEgtJoes Barlt/NAMEgt
- ltBEERgtltNAMEgtBudlt/NAMEgt
- ltPRICEgt2.50lt/PRICEgtlt/BEERgt
- ltBEERgtltNAMEgtMillerlt/NAMEgt
- ltPRICEgt3.00lt/PRICEgtlt/BEERgt
- lt/BARgt
- ltBARgt
- lt/BARSgt
28Example
- The ltBARSgt XML document is
BARS
BAR
BAR
BAR
NAME
. . .
BEER
BEER
Joes Bar
PRICE
PRICE
NAME
NAME
Bud
2.50
Miller
3.00
29DTD Elements
- The description of an element consists of its
name (tag), and a parenthesized description of
any nested tags. - Includes order of subtags and their multiplicity.
- Leaves (text elements) have PCDATA (Parsed
Character DATA ) in place of nested tags.
30Example DTD
- lt!DOCTYPE BARS
- lt!ELEMENT BARS (BAR)gt
- lt!ELEMENT BAR (NAME, BEER)gt
- lt!ELEMENT NAME (PCDATA)gt
- lt!ELEMENT BEER (NAME, PRICE)gt
- lt!ELEMENT PRICE (PCDATA)gt
- gt
31Querying XML
- Why query XML-documents?
- special XML databases
- major DBMSs speak XML
- Does the world need a new query language?
- Most of the world's business data is stored in
relational databases - The relational language SQL is mature and
well-established - Can SQL be adapted to query XML data?
- Leverage existing software
- Leverage existing user skills
32XML vs Relational Data
- Relational data is "flat rows and columns
- XML data is nested and its depth may be
irregular and unpredictable - Relations can represent hierarchic data by
foreign keys or by structured datatypes - In XML it is natural to search for objects at
- unknown levels of the hierarchy
- "Find all the red things
33XML vs Relational Data (cont.)
- Relational data is uniform and repetitive
- All bank accounts are similar in structure
- Metadata can be factored out to a system catalog
- XML data is highly variable
- Every web page is different
- Each XML object needs to be self-describing
- Metadata is distributed throughout the document
- Queries may access metadata as well as data
"Find elements whose name is the same as their
content - //name(.) string(.)
34XML vs Relational Data (cont.)
- Relational queries return uniform sets of rows
- The results of an XML query may have mixed types
and complex structures - "Red things" a flag, a cherry, a stopsign, ...
- Elements can be mixed with atomic values
- XML queries need to be able to perform structural
transformations - Example invert a hierarchy
35XML vs Relational Data (cont.)
- The rows of a relation are unordered
- Any desired output ordering must be derived from
values - The elements in an XML document are ordered
- Implications for query
- Preserve input order in query results
- Specify an output ordering at multiple levels
- "Find the fifth step
- "Find all the tools used before the hammer
36XML vs Relational Data (cont.)
- Relational data is "dense
- Every row has a value in every column
- A "null" value is needed for missing or
inapplicable data - XML data can be "sparse
- Missing or inapplicable elements can be "empty
or "not there - This gives XML a degree of freedom not present in
relational databases
37XPATH and XQUERY
- XPATH is a language for describing paths in XML
documents. - Really think of the semi-structured data graph
and its paths. - Why do we need path description language cant
get at the data using just Relation.Attribute
expressions. - XQUERY is a full query language for XML documents
with power similar to OQL (Object Query Language,
query language for object-oriented databases).