Title: Distributed Databases
1. Distributed Databases
- David Nelson
- CAT
- May 2006
2. Contents
- Internet Databases
- Client/Server Architectures, advantages and disadvantages
- Web Database Approaches
- XML
- Semi-structured data
- Distributed Database Systems
- Definitions
- Homogeneous and Heterogeneous Systems
- Federated DBMS Systems
- Interoperability
- The Grid
3. Traditional Architecture
- Traditional database systems are based on a two-tier client-server architecture
- Client
- User interface
- Main business and data processing logic
- Database Server
- Server-side validation
- Database access
4. Web Architecture
- The need for enterprise scalability causes problems that a three-tier architecture can solve
- Generalised to n-tier
- Client
- Application Server
- Business logic
- Data processing logic
- Database Server
- Server-side validation
- Database access
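A minimal sketch of the three tiers, assuming an in-memory SQLite database stands in for the database server. The staff table, its columns, the salaries and the function names are hypothetical and chosen only for illustration.

```python
import sqlite3

# Database server tier: an in-memory SQLite database stands in for the DBMS.
def open_database():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE staff ("
        "staff_no TEXT PRIMARY KEY, "
        "salary INTEGER NOT NULL CHECK (salary >= 0))"  # server-side validation
    )
    conn.executemany("INSERT INTO staff VALUES (?, ?)",
                     [("SL21", 30000), ("SG37", 12000)])  # hypothetical rows
    return conn

# Application server tier: business and data processing logic.
def staff_earning_over(conn, threshold):
    rows = conn.execute(
        "SELECT staff_no FROM staff WHERE salary > ? ORDER BY staff_no",
        (threshold,))
    return [staff_no for (staff_no,) in rows]

# Client tier: presentation only; it never talks to the database directly.
def render(staff_numbers):
    return ", ".join(staff_numbers) or "(no staff found)"

conn = open_database()
report = render(staff_earning_over(conn, 15000))
print(report)
```

Because the client only calls the application tier, the database server can be replaced or scaled without changing the client.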
5. Web as a Database Platform
- Advantages
- DBMS advantages
- Simplicity
- Platform independence
- Graphical User Interface
- Standardization
- Cross-platform support
- Transparent network access
- Scalable deployment
- Innovation
6. Web as a Database Platform
- Disadvantages
- Reliability
- Security
- Cost
- Scalability
- Limited HTML Functionality
- Statelessness
- Bandwidth
- Performance
- Immaturity of development tools
7. Approaches
- CGI
- Server Side Includes
- HTTP Cookies
- API (non-CGI gateways)
- ODBC
- Java (JDBC, JSQL, JRB)
- JavaScript, JScript
- Microsoft Active Platform (ASP, ADO, ActiveX)
- PHP (Hypertext Preprocessor)
- XML
8. Extensible Markup Language (XML)
- A simplified version of SGML, designed specifically for Web documents
- A meta-language for creating customised tags which provide functionality not available in HTML
- links can point to multiple documents
- links can be bi-directional
- links to relative objects
- broken into
- document data
- document type definition (DTD), against which documents are validated
- stylesheet (XSL standard)
9. Semi-structured Data
- Typical data models (e.g. relational) are structured
- i.e. they have a separate schema
- Semi-structured data is self-describing
- aka schemaless
- no separate description of the type/structure of the data
- e.g. XML
10. Sample XML Database
- <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
- <?xml-stylesheet type="text/xsl" href="staff_list.xsl"?>
- <!DOCTYPE STAFFLIST SYSTEM "staff_list.dtd">
- <STAFFLIST>
- <STAFF branchNo="B005">
- <STAFFNO>SL21</STAFFNO>
- <NAME>
- <FNAME>John</FNAME><LNAME>White</LNAME>
- </NAME>
- <POSITION>Manager</POSITION>
- </STAFF>
- <STAFF branchNo="B003">
- ...
- </STAFFLIST>
- Annotations: UTF-8 is a Unicode encoding; STAFFLIST is the root element (only 1 per document); branchNo is an attribute; elements are ordered, attributes are unordered
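Because the data is self-describing, a program can discover its structure without consulting a separate schema. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and only the first STAFF record of the sample; the helper name describe is ours.

```python
import xml.etree.ElementTree as ET

# The first STAFF record from the sample; the truncated second record is omitted.
STAFF_XML = """<STAFFLIST>
  <STAFF branchNo="B005">
    <STAFFNO>SL21</STAFFNO>
    <NAME><FNAME>John</FNAME><LNAME>White</LNAME></NAME>
    <POSITION>Manager</POSITION>
  </STAFF>
</STAFFLIST>"""

def describe(element, depth=0):
    """Return one line per element: tag, attributes and text, indented by depth."""
    text = (element.text or "").strip()
    lines = ["  " * depth + f"{element.tag} {element.attrib} {text!r}"]
    for child in element:  # children are visited in document order
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(STAFF_XML)
outline = describe(root)
print("\n".join(outline))
```

The tag names and the branchNo attribute are read from the document itself, which is what schemaless means here.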
11. Sample DTD
- <!ELEMENT STAFFLIST (STAFF)*>
- <!ELEMENT STAFF (NAME, POSITION, DOB?, SALARY)>
- <!ELEMENT NAME (FNAME, LNAME)>
- <!ELEMENT FNAME (#PCDATA)>
- <!ELEMENT LNAME (#PCDATA)>
- <!ELEMENT POSITION (#PCDATA)>
- <!ATTLIST STAFF branchNo CDATA #IMPLIED>
12. Sample StyleSheet
- <?xml version="1.0"?>
- <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
- <xsl:template match="/">
- <html><body>
- <center><h2>DreamHome Estate agents</h2></center>
- <table border="1" bgcolor="#ffffff">
- <tr>
- <th>staffNo</th>
- <!-- repeat for other column headings -->
- </tr>
- <xsl:for-each select="STAFFLIST/STAFF">
- <tr>
- <td><xsl:value-of select="STAFFNO"/></td>
- <td><xsl:value-of select="NAME/FNAME"/></td>
- </tr>
- </xsl:for-each></table></body></html>
- </xsl:template>
- </xsl:stylesheet>
13. Benefits of XML
- Simplicity
- Open standard and platform/vendor-independent
- Extensibility
- Reuse
- Separation of content and presentation
- Improved load balancing
14. Benefits of XML
- Support for integration of data from multiple sources
- Ability to describe data from a wide variety of applications
- More advanced search engines
- XQuery
15. XML Schema
- ...
- <xsd:group name="STAFFTYPE">
- <xsd:element name="STAFF">
- <xsd:complexType>
- <xsd:sequence>
- <xsd:element name="STAFFNO" type="STAFFNOTYPE"/>
- <xsd:element name="NAME">
- <xsd:complexType>
- <xsd:sequence>
- <xsd:element name="FNAME" type="xsd:string"/>
- <xsd:element name="LNAME" type="xsd:string"/>
- </xsd:sequence>
- </xsd:complexType>
- ...
16. Querying
- W3C working group has produced
- XML Query Requirements
- XML Query Data Model
- XML Query Algebra
- projection, iteration, selection, join, sorting, aggregation
- XQuery
- a query language for XML
17. XQuery Queries
- List the staff at branch B005 with a salary greater than 15000
- for $S in doc("staff_list.xml")//STAFF
- where $S/SALARY > 15000 and
- $S/@branchNo = "B005"
- return $S/STAFFNO
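The same selection can be traced step by step in a general-purpose language. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the SALARY values and the second and third staff records are invented for illustration, and the function returns the text of each STAFFNO rather than the element itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: salaries and all records except SL21 are invented.
STAFF_XML = """<STAFFLIST>
  <STAFF branchNo="B005"><STAFFNO>SL21</STAFFNO><SALARY>30000</SALARY></STAFF>
  <STAFF branchNo="B005"><STAFFNO>SL41</STAFFNO><SALARY>9000</SALARY></STAFF>
  <STAFF branchNo="B003"><STAFFNO>SG37</STAFFNO><SALARY>24000</SALARY></STAFF>
</STAFFLIST>"""

def staff_numbers(root, branch_no, min_salary):
    """Mirror the for / where / return clauses of the query."""
    result = []
    for staff in root.iter("STAFF"):                    # for ... //STAFF
        # A missing SALARY becomes NaN, which never passes the comparison.
        salary = float(staff.findtext("SALARY", default="nan"))
        if salary > min_salary and staff.get("branchNo") == branch_no:  # where
            result.append(staff.findtext("STAFFNO"))    # return
    return result

root = ET.fromstring(STAFF_XML)
matches = staff_numbers(root, "B005", 15000)
print(matches)
```

The query language expresses in four lines what the loop spells out by hand, which is the case for a declarative language over XML.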
18. XML Databases
- Native
- XML is the primary data store of the DBMS
- Semi-structured databases
- e.g. Lore
- XML Enabled
- Traditional RDBMS provides mappings between XML and data store
- Can be stored
- Coarse grained
- Medium grained
- Fine grained
- E.g. Oracle, SQL Server, SQL:2003
19. Fine Grained Approach
- Good for queries which need to inspect/manipulate specific elements in the XML document
- Not good for queries which manipulate (e.g. retrieve/store) the entire document
- [Diagram: Document, Element (parent), Child, CharData, Attribute]
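A minimal sketch of fine-grained storage, assuming SQLite as the relational store. The two-table layout (element, attribute) is a simplified, hypothetical version of the diagram, and the sample document is cut down from the staff list.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = """<STAFFLIST>
  <STAFF branchNo="B005"><STAFFNO>SL21</STAFFNO><POSITION>Manager</POSITION></STAFF>
</STAFFLIST>"""

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per element and one row per attribute.
conn.executescript("""
    CREATE TABLE element (id INTEGER PRIMARY KEY, parent INTEGER,
                          tag TEXT, chardata TEXT);
    CREATE TABLE attribute (element_id INTEGER, name TEXT, value TEXT);
""")

def shred(node, parent_id=None):
    """Store node and its subtree, one row per element, depth first."""
    cur = conn.execute(
        "INSERT INTO element (parent, tag, chardata) VALUES (?, ?, ?)",
        (parent_id, node.tag, (node.text or "").strip()))
    node_id = cur.lastrowid
    for name, value in node.attrib.items():
        conn.execute("INSERT INTO attribute VALUES (?, ?, ?)",
                     (node_id, name, value))
    for child in node:
        shred(child, node_id)

shred(ET.fromstring(DOC))

# Element-level query: the position of every member of staff at branch B005.
positions = [row[0] for row in conn.execute("""
    SELECT child.chardata
    FROM element AS child
    JOIN attribute AS a ON a.element_id = child.parent
    WHERE child.tag = 'POSITION' AND a.name = 'branchNo' AND a.value = 'B005'
""")]
print(positions)
```

The element query is a simple join, but rebuilding the whole document would need a recursive walk over the element table, which is why this layout suits element queries and not whole-document ones.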
20. Coarse Grained Approach
- One table
- Best for queries which manipulate the whole document
- e.g. retrieve/store a document
- Worst for queries which manipulate elements
- e.g. retrieve children of a tag
21. Medium Grained Approach
- A compromise between fine and coarse grained
- Slice the document tree up into sections
- Store sub-sections using a coarse grained approach
- Good for both types of queries
22. Distributed Database Definitions
- Distributed Database System
- users of the DDBS can run applications at each node
- Federated Database System
- a DDBS is usually a single application distributed over various sites
- an FDBS is a set of multiple cooperating systems
- a simple solution for interoperability (as discussed later)
23. Distributed Database Systems
- System needs facilities to be able to
- perform distributed query optimization
- manage distributed transactions
- manage data replication
- Homogeneous DDBS: simplest case
- several sites, each running their own applications on the same DBMS with the same schema and transactions
- location transparency
- can communicate over large distances, and are autonomous
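A minimal sketch of location transparency in the homogeneous case, assuming two in-memory SQLite databases stand in for two sites that share one schema. The site names, the rows and the helper names make_site and global_query are hypothetical.

```python
import sqlite3

def make_site(rows):
    """Create one site: same DBMS, same schema, its own local data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE staff (staff_no TEXT, branch_no TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO staff VALUES (?, ?, ?)", rows)
    return conn

# Hypothetical horizontal fragments, one per site.
SITES = {
    "london": make_site([("SL21", "B005", 30000), ("SL41", "B005", 9000)]),
    "glasgow": make_site([("SG37", "B003", 12000), ("SG5", "B003", 24000)]),
}

def global_query(sql, params=()):
    """Run the same query at every site and merge the results.

    The caller never names a site. There is no query optimization,
    distributed transaction or replication handling in this sketch.
    """
    merged = []
    for conn in SITES.values():
        merged.extend(conn.execute(sql, params).fetchall())
    return sorted(merged)

well_paid = global_query(
    "SELECT staff_no FROM staff WHERE salary > ?", (15000,))
print(well_paid)
```

Sending every query to every site is the naive strategy; a distributed query optimizer would first work out which sites can hold matching rows.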
24. Heterogeneous DDBMS
- Several existing databases (using different DBMSs) linked into a single system
- Problems
- variation in costs of operation between sites
- some operations may not be available at some sites
- some DBMSs cannot read records of others
- varying base types
- Requesting site must
- have detailed knowledge of operation of remote system
- assume remote system has only rudimentary functionality
- make programmer do query composition by hand
25. Federated DBMS
- A collection of independently managed, heterogeneous database systems
- allow partial and controlled sharing of data without affecting existing applications
- [Diagram, repeated for each member database: federated schema, federated to local schema mapping, local schema]
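A minimal sketch of a federated-to-local schema mapping, assuming two member databases represented as lists of Python dictionaries. The site names, attribute names and rows are all hypothetical.

```python
# Hypothetical member databases with independently designed local schemas.
LOCAL_ROWS = {
    "site_a": [{"staffNo": "SL21", "salary": 30000}],
    "site_b": [{"emp_id": "SG37", "pay": 12000}],
}

# Federated-to-local schema mapping: federated attribute -> local attribute.
MAPPING = {
    "site_a": {"staff_no": "staffNo", "salary": "salary"},
    "site_b": {"staff_no": "emp_id", "salary": "pay"},
}

def federated_select(attributes):
    """Answer a query phrased in the federated schema from every member.

    Local rows are read, never changed, so existing local applications
    are unaffected.
    """
    result = []
    for site, rows in LOCAL_ROWS.items():
        local_names = MAPPING[site]
        for row in rows:
            result.append({a: row[local_names[a]] for a in attributes})
    return result

staff = federated_select(["staff_no", "salary"])
print(staff)
```

Sharing is partial and controlled because only attributes listed in a site's mapping are reachable through the federated schema.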
26. Interoperability
- The web is the ultimate interoperable database platform
- Need to be able to query using various sources on the web
- Without duplication of data as in a data warehouse
27. Interoperability
- IEEE (1990) Definition
- the ability of two or more systems or components to exchange information and to use the information that has been exchanged
- IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries
- Current simple solutions
- transformation
- mediation
28. Interoperability Definition 2
- The ability to request and receive services between various systems and use their functionality
- More than data exchange
- Implies a close integration
29. Interoperability Definitions 3
- Semantic Interoperability
- agreements about content description standards
- Ontologies
- Structural Interoperability
- Specifying semantic schemas such that they can be shared, e.g. RDF
- Syntactic Interoperability
- How to tag and mark data to facilitate exchange
- E.g. XML
30. Features
- Exchange of messages and requests
- Use of each other's functionality
- Client-server abilities
- Distribution
- Operate multiple systems as single unit
- Communication despite incompatibilities
- Extensibility and evolution
31. The Problems and Difficulties
- Different data models
- There can be major semantic differences even within the same data model
- Properties may be called by different names
- Different data types may be used
- What about recreating locally defined functions?
- All this implies we know where the databases are and we have a physical means of getting to them
32. The Problems and Difficulties
- Databases are by their nature protectors of data; they do not share easily
- Many (particularly legacy systems) do not have any form of web interface
- Most databases are security protected
- Databases do not advertise their services to the web
- Even client/server databases operate within a cocoon of silence
33. EBCDIC
- EBCDIC /eb's-dik/, /eb'seedik/, or /eb'k-dik/ n.
- abbreviation, Extended Binary Coded Decimal Interchange Code
- A character set used on early IBM computers. It exists in at least six mutually incompatible versions, all featuring such delights as non-contiguous letter sequences and the absence of several punctuation characters fairly important for modern computer languages (exactly which characters are absent varies according to which version of EBCDIC you're looking at). IBM adapted EBCDIC from punched card code in the early 1960s and promulgated it as a customer-control tactic, spurning the already established ASCII standard. Today, IBM claims to be an open-systems company, but IBM's own description of the EBCDIC variants and how to convert between them is still internally classified top-secret.
- EBCDIC is the most common alternate character code but there are others.
- http://www.cheverus.org/advanced/data/EBCDIC.html
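The non-contiguous letter sequences and the incompatibility between versions can be observed directly. A minimal sketch, assuming Python's standard cp037 and cp500 codecs as two representative EBCDIC variants; the helper name code_points is ours.

```python
# cp037 and cp500 are two EBCDIC variants shipped with Python's standard codecs.
def code_points(text, encoding):
    """Return the byte value of each character of text in the given encoding."""
    return list(text.encode(encoding))

ascii_ij = code_points("ij", "ascii")    # 'i' and 'j' are adjacent in ASCII
ebcdic_ij = code_points("ij", "cp037")   # in EBCDIC there is a gap between them
print(ascii_ij, ebcdic_ij)

# The same character can map to different bytes in different EBCDIC variants.
bracket_037 = "[".encode("cp037")
bracket_500 = "[".encode("cp500")
print(bracket_037 != bracket_500)
```

A program that steps through the alphabet by adding 1 to a character code works in ASCII and fails in EBCDIC, which is one concrete way data exchange between such systems goes wrong.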
34. Some Simple Integration Problems 1
- Differing schema
- author char(50) vs. author_surname char(50), author_inits char(10)
- title varchar(300) vs. title varchar(200)
- keyword set(char(30)) vs. keywd array(8) (char(30))
- both are valid schemas in SQL-3
- also differing value formats: A.N.Other, A N Other, Other N A, ...
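A minimal sketch of a transformation between the two schemas, assuming records are held as Python dictionaries and that author names arrive as space-separated initials followed by the surname (one of several formats listed above, chosen arbitrarily). The function name is hypothetical.

```python
def to_second_schema(record):
    """Transform a record from the first schema into the second.

    Raises ValueError where the target schema cannot hold the value.
    """
    # Assumption: the author string is "initials ... surname", space separated.
    *inits, surname = record["author"].split()
    if len(record["title"]) > 200:           # varchar(300) -> varchar(200)
        raise ValueError("title too long for varchar(200)")
    if len(record["keyword"]) > 8:           # set -> array(8)
        raise ValueError("more than 8 keywords")
    return {
        "author_surname": surname,
        "author_inits": " ".join(inits),
        "title": record["title"],
        "keywd": sorted(record["keyword"]),  # sets are unordered; pick an order
    }

converted = to_second_schema(
    {"author": "A N Other", "title": "Databases", "keyword": {"xml", "rdf"}})
print(converted)
```

Even this small case needs three decisions the schemas themselves do not record: the name format, what to do with long titles, and how to order and cap the keywords.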
35. Some Simple Problems 2
- Homogeneous Models
- the same information may be held as attribute name, relation name or a value in different databases
- e.g. library fines
- as a dedicated relation: Fine(amount, borrowed_id)
- as an attribute: Loan(id, isbn, date_out, fine)
- or as a value: Charge(1.25, fine)
36. Complex Problems
- Heterogeneous models
- Need to relate model constructions to one another, for example
- relate classes in object-oriented to user-defined types in object-relational
- All problems are magnified at this level!
37. Data Models
- We are only touching the surface in repositories and data warehouses
38. XML and RDF
- Resource Description Framework
- XML Schema defines a grammar
- therefore we have all the problems shown previously (e.g. names)
- RDF provides a way to encode domain models
- an infrastructure that enables the encoding, exchange and reuse of structured meta-data (W3C)
- this is what we need for interoperable systems
39. RDF Data Model
- RDF Data Model consists of three objects
- Resource
- anything that can have a URI
- Property
- a specific attribute which is used to describe a resource
- Statement
- a combination of a resource, a property and a value
- known as the subject, predicate and object
- e.g. The author of http://www.myhome.net/staff_list.xml is Fred Smith
40. RDF Example
- The statement would be defined in RDF (simplified) as
- <?xml version="1.0"?>
- <RDF>
- <Description about="http://www.myhome.net/staff_list.xml">
- <author>Fred Smith</author>
- <created>25 May 2006</created>
- </Description>
- </RDF>
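Each property element inside a Description corresponds to one statement. A minimal sketch that reads the simplified RDF above into (subject, predicate, object) triples, assuming Python's standard xml.etree.ElementTree module; like the slide, it ignores RDF namespaces, so it is not a general RDF parser.

```python
import xml.etree.ElementTree as ET

# The simplified RDF from the slide (namespace declarations omitted, as there).
RDF_XML = """<?xml version="1.0"?>
<RDF>
  <Description about="http://www.myhome.net/staff_list.xml">
    <author>Fred Smith</author>
    <created>25 May 2006</created>
  </Description>
</RDF>"""

def statements(rdf_text):
    """Return (subject, predicate, object) triples, one per property element."""
    triples = []
    for description in ET.fromstring(rdf_text).iter("Description"):
        subject = description.get("about")            # the resource
        for prop in description:
            triples.append((subject, prop.tag, prop.text))
    return triples

triples = statements(RDF_XML)
for triple in triples:
    print(triple)
```

The example therefore holds two statements about the same resource: one for its author and one for its creation date.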
41. The Grid
- Original Motivation
- the need for a distributed computing infrastructure for advanced science and engineering (Walker)
- Used originally in science for large number-crunching applications
- but now finding wider appeal
- Compare to the national power grid
- Interoperability is a key issue
42. Examples of Computational and Information Grids
- NASA Information Power Grid
- access to large-scale computing resources, large databases, and high-end instruments
- dynamically co-allocated resources (e.g. supercomputers)
- AstroGrid
- a virtual observatory
- European Data Grid
- high energy physics, biology and Earth observation
- distributed, large-scale data intensive computing
43. Summary
- Distributed and web databases are increasingly important areas
- XML is being increasingly used in data models, data transmission and data integration
- Interoperability is the key issue and the major research area in database systems
- XML and RDF have the potential to be a stepping stone to achieving this
- The Grid is an example of a system which could require interoperability to integrate database systems
44. Further Reading
- Connolly and Begg, Database Systems, chapters 22, 23, 29 and 30
- Ozsu and Valduriez, Principles of Distributed Database Systems, 2nd edition
- everything you ever wanted to know about distributed database systems
- Chaudhri and Zicari, Succeeding with Object Databases, 2001
- D Walker, Emerging Distributed Computing Technologies, Cardiff University
- http://www.cs.cf.ac.uk/User/David.W.Walker/IGDS/GridCourse.htm
- an introduction to the Grid
- XML and RDF
- www.w3schools.com