Saturday, October 28, 2000. XML Research Issues in Database Perspective - KISS'00 Fall

1
XML Research Issues in Database Perspective

Kyuseok Shim
shim@cs.kaist.ac.kr
http://cs.kaist.ac.kr/shim
Korea Advanced Institute of Science and Technology

2
XML Working Groups

Core XML
XML, namespaces, XML Infoset
XML Linking
XPath, XPointer, XLink
XML Schema
XML Schema
XML Query
XML Query, XML Query Data Model
Document Object Model (DOM)
XSL

3
XML

A W3C standard to complement HTML
An instance of semistructured data [Abi97]
Document Type Definition (DTD)
Origin: SGML
Tags describe the semantics of the data
HTML simply specifies how the data item is to be
displayed
An element can contain a sequence of nested
sub-elements
Sub-elements may themselves be tagged elements or
character data

4
Document Type Definition (DTD)

A part of the XML specification
An XML document may have a DTD
Grammar for describing the structure of an XML
document
The structure of an element is specified by a
regular expression
Terminology for XML
well-formed: if tags are correctly closed
valid: if it has a DTD and conforms to it
For exchanges of data, validation is useful
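
The difference between the two terms can be illustrated with a minimal Python sketch. It checks only well-formedness, using the standard-library `xml.etree.ElementTree` parser; that parser does not validate against a DTD, so checking validity would need a validating parser instead. The helper name `is_well_formed` is an assumption introduced for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses, i.e. all tags are correctly closed and nested."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Correctly closed tags.
print(is_well_formed("<book><booktitle>The Selfish Gene</booktitle></book>"))  # True
# <booktitle> is never closed.
print(is_well_formed("<book><booktitle>The Selfish Gene</book>"))  # False
```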

5
Document Type Definition (DTD)

Syntax
, (comma): sequence
|: or
(): grouping
?, *, +: zero or one, zero or more, one or more
occurrences
ANY allows an arbitrary XML fragment to be
nested within the element
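
Because a DTD content model is a regular expression over element names, it can be checked with an ordinary regex engine. The sketch below is a simplified illustration, not a DTD validator: it handles only element names and the operators `,`, `|`, `()`, `?`, `*`, `+`, and ignores `#PCDATA`, `ANY`, `EMPTY` and mixed content. The function names `content_model_to_regex` and `conforms` are assumptions made up for this example.

```python
import re

def content_model_to_regex(model: str) -> "re.Pattern[str]":
    """Translate a DTD content model such as '(name, (address | affiliation))'
    into a Python regex over a space-terminated sequence of child names."""
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[(),|?*+]", model):
        if token == ",":
            continue                      # sequence is plain concatenation
        elif token == "(":
            parts.append("(?:")           # grouping without capturing
        elif token in ")|?*+":
            parts.append(token)           # same meaning in both notations
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))

def conforms(children: list, model: str) -> bool:
    """Check whether the ordered list of child element names matches the model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None

model = "(name, (address | affiliation))"
print(conforms(["name", "address"], model))                 # True
print(conforms(["name", "address", "affiliation"], model))  # False
```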

6
A DTD Example

<!ENTITY USA "United States of America">
<!ELEMENT book (booktitle, author)>
<!ATTLIST book id ID #IMPLIED>
<!ELEMENT booktitle (#PCDATA)>
<!ELEMENT author (name, (address | affiliation))>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address ANY>
<!ELEMENT affiliation (#PCDATA)>

7
An XML Document Example

<book id="123">
  <booktitle> The Selfish Gene </booktitle>
  <author id="dawkins">
    <name> Richard Dawkins </name>
    <address>
      <city> Timbuktu </city>
      <zip> 99999 </zip>
    </address>
  </author>
</book>
<book>
  <booktitle> The C Programming Language </booktitle>
  <author>
    <name> Brian W. Kernighan </name>
    <address> <country> &USA; </country> </address>
  </author>
  <author>
    <name> Dennis M. Ritchie </name>
    <affiliation> Bell Labs </affiliation>

8
An XML Namespace

Provides a simple method for qualifying element
and attribute names used in Extensible Markup
Language documents by associating them with
namespaces identified by URI references.
Is a collection of names, identified by a URI
reference, which are used in XML documents as
element types and attribute names.
<x xmlns:edi='http://ecommerce.org/schema'>
  <!-- the 'price' element's namespace is http://ecommerce.org/schema -->
  <edi:price units='Euro'>32.18</edi:price>
</x>
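
A short Python sketch of how a namespace-aware parser sees this example. It uses the standard-library `xml.etree.ElementTree`, which expands a prefixed name to `{URI}localname`; the variable names are chosen for this illustration only.

```python
import xml.etree.ElementTree as ET

doc = """<x xmlns:edi='http://ecommerce.org/schema'>
  <!-- the 'price' element's namespace is http://ecommerce.org/schema -->
  <edi:price units='Euro'>32.18</edi:price>
</x>"""

root = ET.fromstring(doc)

# The prefix 'edi' is only a local shorthand; the URI is what identifies the name.
ns = {"edi": "http://ecommerce.org/schema"}
price = root.find("edi:price", ns)

print(price.tag)                       # {http://ecommerce.org/schema}price
print(price.get("units"), price.text)  # Euro 32.18
```

Any other prefix bound to the same URI would select the same element, since matching is done on the URI and not on the prefix.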

9
XML Schemas

Recently proposed
http://www.w3c.org/TR/xmlschema-1
http://www.w3c.org/TR/xmlschema-2
Unifies previous schema proposals
Generalizes DTDs
Uses XML syntax

10
XML Schema

<elementType name="article">
  <sequence>
    <elementTypeRef name="title"/>
    <elementTypeRef name="author" minOccurs="0"/>
  </sequence>
</elementType>
DTD: <!ELEMENT article (title, author*)>

11
XTRACT: Extracting DTD from XML Documents

[Garofalakis, Gionis, Rastogi, Seshadri, Shim 99]
DTDs contain valuable information on the
structure of the documents
play a critical role in the storage as well as
formulation and optimization of queries
DTDs are not mandatory
an XML database frequently does not have
accompanying DTDs
XTRACT can infer concise and semantically
meaningful DTDs for XML documents

12
XTRACT: Motivation

DTD is very useful!
Plays a crucial role in efficient storage of XML
data
[SHT99, DFS99]: the DTD is exploited to generate an
effective relational schema
Helps devise efficient plans for queries
[GW97, FS97]: the DTD allows the search to be
restricted to only the relevant portions of the data
Aids users in forming meaningful queries over the
XML database
However, an XML document may not always have an
accompanying DTD

13
XTRACT: Related Work

Mining DTDs from a collection of XML documents
has not been addressed in the literature
Extraction of schema from semistructured data
[NAM98, GW97, FS97]
attempts to find typing for semistructured data
finding a typing is tantamount to grouping
objects that have similar edges
In DTD, outgoing edges from a type can be
described by an arbitrary regular expression
No ordering is imposed for edges

14
XTRACT: Related Work

[Gol67, Gol78, Ang78]
Infer formal languages from examples
Purely theoretical and focus on investigating the
computational complexity of the language
inference problem
[KMU95]
Infers a pattern language from positive examples
MDL principle was used
Assume the set of simple patterns is available
Cannot find general regular expressions
Patterns are not known a priori

15
XTRACT: Problem Formulation

Given a set I of N input sequences nested within
elements e
Compute a DTD for e such that every sequence in I
conforms to the DTD

16
XTRACT: Naive Approaches

Factor as much as possible
e.g. t, ta, taa, taaa, taaaa
t | t(a | a(a | a(a | aa)))
much more voluminous and a lot less intuitive
Find the automaton with the smallest number of
states that accepts I and derive regular
expressions from the automaton
may not be the shortest regular expression

17
XTRACT: Desirable DTDs

The DTD should be concise (i.e. small in size)
easy to understand and succinct
The DTD should be precise
not cover too many sequences not contained in I
not too general and captures the structure of the
input sequences
Trade-off!

18
XTRACT: Example

I = {ab, abab, ababab}
(a | b)*
a gross over-generalization of the input
completely fails to capture any structure
inherent in input
ab | abab | ababab, ab | ab(ab | abab)
accurately reflect the structure of the input
sequences but do not generalize
(ab)*
succinct and generalizes the input sequences
without losing too much structure information

19
XTRACT: MDL Principle

An information-theoretic measure for quantifying
and thereby resolving the tradeoff between the
conciseness and preciseness
MDL principle has been successfully applied in a
variety of situations
e.g. decision tree classifiers
Roughly speaking, the best theory to infer from a
set of data is the one that minimizes the sum of
the length of the theory, in bits (conciseness)
the length of the data, in bits, when encoded
with the help of the theory (preciseness)

20
XTRACT: Example

I = {ab, abab, ababab}
(a | b)*
abab: cost of 5 (1 for the number of repetitions (4) +
4 characters to represent the chosen characters)
MDL cost: 6 (encoding DTD) + 3 + 5 + 7 = 21
ab | abab | ababab
MDL cost: 14 + 3 = 17
ab | ab(ab | abab)
MDL cost: 14 + 1 + 2 + 2 = 19
(ab)*
MDL cost: 5 + 3 = 8
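
The arithmetic above can be reproduced with a small Python sketch. It is a simplification made for illustration: costs are counted in characters and choices rather than bits, and the per-sequence encoding cost of each candidate is hard-coded from the reasoning on the slide instead of being derived from the regular expression. The names `dtd_cost` and `encoders` are assumptions of this example.

```python
sequences = ["ab", "abab", "ababab"]

def dtd_cost(regex: str) -> int:
    """Cost of the theory: number of symbols in the candidate DTD (spaces ignored)."""
    return len(regex.replace(" ", ""))

# Cost of encoding one input sequence with the help of each candidate DTD.
encoders = {
    # repetition count (1) plus one choice of a or b per character
    "(a | b)*": lambda s: 1 + len(s),
    # one choice among the three alternatives
    "ab | abab | ababab": lambda s: 1,
    # outer choice, plus an inner choice when the second branch is taken
    "ab | ab(ab | abab)": lambda s: 1 if s == "ab" else 2,
    # only the repetition count
    "(ab)*": lambda s: 1,
}

costs = {
    regex: dtd_cost(regex) + sum(encode(s) for s in sequences)
    for regex, encode in encoders.items()
}
best = min(costs, key=costs.get)

for regex, cost in costs.items():
    print(f"{regex:22} MDL cost {cost}")
print("chosen:", best)  # chosen: (ab)*
```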

21
XTRACT

Generalization
generates zero or more candidate DTDs by
replacing patterns in the input sequences with
meta-characters like *
e.g. abab -> (ab)*, bbbe -> b*e
Factorization
factors common subexpressions from the
generalized candidate DTDs
e.g. bd | be -> b(d | e)
Minimum Description Length (MDL) Principle
MDL ranks each candidate DTD and chooses the
minimum cost DTD
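
The generalization and factorization steps can be sketched in a few lines of Python. This is a deliberately naive illustration and not the heuristics XTRACT itself uses: generalization here only collapses immediately repeated blocks via a backreference regex, and factorization only pulls out a common prefix. The function names `generalize` and `factor_prefix` are assumptions of this example.

```python
import os
import re

def generalize(sequence: str) -> str:
    """Replace each run of a repeated block with block* (e.g. abab -> (ab)*)."""
    def to_star(match: "re.Match[str]") -> str:
        block = match.group(1)
        return f"{block}*" if len(block) == 1 else f"({block})*"
    # (.+?)\1+ matches the shortest block that is immediately repeated at least once.
    return re.sub(r"(.+?)\1+", to_star, sequence)

def factor_prefix(alternatives: list) -> str:
    """Factor a common prefix out of alternatives (e.g. bd | be -> b(d | e))."""
    prefix = os.path.commonprefix(alternatives)
    rests = [alt[len(prefix):] for alt in alternatives]
    if len(alternatives) < 2 or not prefix or "" in rests:
        return " | ".join(alternatives)   # nothing useful to factor
    return f"{prefix}({' | '.join(rests)})"

print(generalize("abab"))           # (ab)*
print(generalize("bbbe"))           # b*e
print(factor_prefix(["bd", "be"]))  # b(d | e)
```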

22
XTRACT: Example
23
XML Storage

Existing approaches unnecessarily sacrifice either
efficiency or flexibility
Traditional DBMSs (RDB or OODB) have rigid
schemas.
Integrating a new site requires complex mapping
and potential loss of information
Integrating a new site may require schema
evolution.
Existing fully semi-structured data storage
techniques sacrifice query efficiency and space.
they require excessive interpretation (harming
query efficiency) and
redundant storage

24
XML Storage

Need to store and query XML data flexibly and
efficiently
improve the tradeoffs for storage space and
query efficiency for a given degree of
flexibility.
allow the user to choose the degree of storage
flexibility

25
XML Storage

text file
relational DBMS
object-oriented DBMS
build special purpose repository

26
XML Storage: Text File

Flat streams are stored using the file system or a
BLOB manager in a DBMS
e.g. [Abiteboul, Cluet, Milo VLDB 93]
Pros
simple
fast for storing and retrieving whole documents
takes less space than one might think
reasonable clustering
Cons
incremental update is difficult
requires a special-purpose query processor
accessing document structure is only possible
through parsing

27
XML Storage: Relational DBMS

Advantages
RDBMS products are mature and scale well
Traditional and semi-structured data can co-exist
RDBMS can process even complex queries on large
databases within seconds
Disadvantages
expensive to reconstruct the original XML data
from relational data
updates are both complicated and expensive in
certain cases
extra effort is needed to translate XML queries and
updates into SQL

28
XML Storage: RDBMS (1)

[Florescu, Kossmann IEEE Data Eng. Bulletin 99]

29
XML Storage: RDBMS (2)

[Shanmugasundaram et al. 99]
process DTD to generate a relational schema
Use DTD graph and element graph
three approaches
Basic
Shared
Hybrid

30
DTD
31
XML Document
32
The Basic Inline Technique

Creates relations for every element
an XML document can be rooted at any element in a
DTD
element graph is used to decide the relations
Inlines as many descendants as possible
e.g. the author relation has attributes
firstname, lastname, address and authorid
Creates a separate relation to handle * nodes in the
DTD graph using a foreign key
Expresses the recursive relationship using the
notion of relational keys

33
Building an Element Graph

Do a depth-first traversal of the DTD graph
starting at the element node
Each node is marked as visited the first time it is
reached
Each node is unmarked once all of its children
have been traversed
If an unmarked node in the DTD graph is reached, a
new node with the same name is created in the
element graph
If an attempt is made to traverse a marked DTD
node, a backpointer edge is added
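
The traversal described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the DTD graph is given as a dictionary from element name to child names, operator nodes such as * or ? are left out, and the sample graph is a hypothetical simplification rather than the DTD used in the talk. The name `build_element_graph` is introduced for this example.

```python
def build_element_graph(dtd_graph: dict, root: str):
    """Depth-first expansion of the DTD graph starting at `root`.

    Returns (nodes, edges, backpointers):
      nodes        - element names, indexed by element-graph node id
      edges        - (parent_id, child_id) pairs
      backpointers - (from_id, to_id) pairs added when a marked node is hit
    """
    nodes, edges, backpointers = [], [], []
    marked = {}  # DTD node name -> id of its copy on the current DFS path

    def visit(name, parent_id):
        if name in marked:
            # Attempt to traverse a marked DTD node: add a backpointer edge.
            backpointers.append((parent_id, marked[name]))
            return
        node_id = len(nodes)          # unmarked node: create a new copy
        nodes.append(name)
        if parent_id is not None:
            edges.append((parent_id, node_id))
        marked[name] = node_id        # mark on first visit
        for child in dtd_graph.get(name, []):
            visit(child, node_id)
        del marked[name]              # unmark once all children are traversed

    visit(root, None)
    return nodes, edges, backpointers

# Hypothetical, simplified DTD graph (no operator nodes).
dtd_graph = {
    "book": ["booktitle", "author"],
    "author": ["name", "address"],
    "monograph": ["title", "author", "editor"],
    "editor": ["monograph"],
}

nodes, edges, backpointers = build_element_graph(dtd_graph, "editor")
print(nodes)         # ['editor', 'monograph', 'title', 'author', 'name', 'address']
print(backpointers)  # [(1, 0)]  monograph points back to editor
```

Because a node is unmarked after its children are done, an element that is reachable along two different paths is copied twice in the element graph; only a cycle produces a backpointer.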

34
DTD Graph
35
An Example Element Graph
36
Creation of Relations

Given an element graph, relations are created as
follows
A relation is produced for the root element
All descendant elements are inlined into that
relation except
children directly below a * node
each node having a backpointer edge pointing to
it
A separate relation is created for each of the
above exception nodes
Each relation has ID and parentID fields

37
Basic Inline Schema
38
Basic Inline Technique

Pros
List all authors of books
Cons
List all authors having first name Jack (5
separate queries)
A large number of relations is created

39
Shared Inline Technique

Relations are created for all elements in the DTD
graph whose nodes have in-degree greater than one
Nodes with an in-degree of one are inlined
Nodes with an in-degree of zero are made separate
relations
Of mutually recursive elements all having
in-degree one, one of them is made a separate
relation
e.g. monograph and editor
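
A minimal Python sketch of these in-degree rules. It assumes the DTD graph is a dictionary from element name to child names and leaves out operator nodes, so any rule that depends on * nodes is not modelled; the sample graph is a hypothetical simplification and `shared_inline_relations` is a name made up for this example.

```python
def shared_inline_relations(dtd_graph: dict) -> set:
    """Return the element names that get their own relation."""
    names = set(dtd_graph) | {c for cs in dtd_graph.values() for c in cs}
    indegree = {n: 0 for n in names}
    parent = {}
    for p, children in dtd_graph.items():
        for c in set(children):
            indegree[c] += 1
            parent[c] = p   # unique (and only used) for in-degree-one nodes

    # In-degree zero or greater than one: separate relation. In-degree one: inlined.
    relations = {n for n in names if indegree[n] != 1}

    while True:
        # Everything reachable from a relation through inlined children is covered.
        covered, stack = set(), list(relations)
        while stack:
            n = stack.pop()
            if n not in covered:
                covered.add(n)
                stack.extend(c for c in dtd_graph.get(n, []) if c not in relations)
        leftover = names - covered
        if not leftover:
            return relations
        # Leftover nodes hang off a cycle of in-degree-one nodes: walk up the
        # parent links until a node repeats, and make that node a relation.
        n, seen = min(leftover), set()
        while n not in seen:
            seen.add(n)
            n = parent[n]
        relations.add(n)

dtd_graph = {
    "book": ["booktitle", "author"],
    "author": ["name", "address"],
    "monograph": ["title", "author", "editor"],
    "editor": ["monograph"],
}
print(sorted(shared_inline_relations(dtd_graph)))  # ['author', 'book', 'editor']
```

Here `book` has in-degree zero, `author` has in-degree two, and `editor` is the one element picked from the mutually recursive pair monograph and editor; which of the two is picked is arbitrary.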

40
Shared Inlining Technique

Small number of relations compared to Basic
schema
Use isRoot field for inlining problems
Requires only one query for finding all authors
Still Basic is superior for reducing the number
of joins

41
Hybrid Inlining Technique

Additionally inlines elements with in-degree
greater than one that are not recursive or
reached through a * node
e.g. author is inlined with book and monograph

42
XML Storage: STORED

[Deutsch, Fernandez, Suciu SIGMOD 99]
Maps semistructured data into relational data
Integrates both relational and overflow systems
Uses a data mining algorithm to find frequent
subtrees
due to the fact that there is no notion of DTD in
semistructured data
Overflow mapping is used to ensure losslessness
overflow objects or object parts are stored in a
separate semistructured data object repository
Incremental updates and ordering of elements are
not considered

43
XML Storage: STORED

Derive the schema from the data with a data mining algorithm

44
XML Storage: OODBMS

Stores XML elements with the structured semantics
Flexible locking down to element level
In RDBMS, due to disassembly of XML data into
various tables, implementing an effective locking
scheme is hard
With a flat file, no portion of a document
being modified is available to other users
Use a separate record for each tree node
Systems available
POET (POET Content Management system)
Excelon (ObjectDesign)
LORE

45
XML Storage: NATIX

[Kanne, Moerkotte ICDE 00]
Native repository
Classical record manager
Accesses raw disk or file system files
Provides a memory space divided into segments
(equal sized pages)
Tree storage manager
maps the trees used to model documents
Schema manager
maintains the system catalog data (e.g. DTD)
system catalog is stored in XML format

46
NATIX

Store whole document in one record, instead of
storing each tree node in a separate record
Semantically split large trees based on the
underlying tree structure
Partition the data into subtrees and store each
subtree in a record
Connected subtrees residing in other records are
represented by proxy objects
proxy objects consist of RID
substituting all proxies by the respective
subtrees reconstructs the original data tree
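
The proxy idea can be illustrated with a short Python sketch. It is not NATIX's actual splitting algorithm: here a subtree is moved to its own record whenever it has at least `threshold` nodes, records live in a plain dictionary keyed by an integer RID, and the names `Proxy`, `store` and `load` are assumptions of this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proxy:
    rid: int  # record id of the record holding the replaced subtree

# A tree node is a (label, children) tuple.

def size(node) -> int:
    label, children = node
    return 1 + sum(size(c) for c in children)

def store(node, records: dict, threshold: int) -> int:
    """Store `node` as a record, moving large subtrees into records of their own."""
    label, children = node
    kept = []
    for child in children:
        if size(child) >= threshold:
            kept.append(Proxy(store(child, records, threshold)))
        else:
            kept.append(child)
    rid = len(records)
    records[rid] = (label, kept)
    return rid

def load(rid: int, records: dict):
    """Rebuild the original tree by substituting every proxy with its subtree."""
    def expand(node):
        if isinstance(node, Proxy):
            return load(node.rid, records)
        label, children = node
        return (label, [expand(c) for c in children])
    return expand(records[rid])

doc = ("book", [
    ("booktitle", []),
    ("author", [("name", []), ("address", [("city", []), ("zip", [])])]),
])
records = {}
root_rid = store(doc, records, threshold=3)
print(len(records))                   # 3
print(load(root_rid, records) == doc) # True
```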

47
XML Query Processing

[McHugh, Widom Workshop 99]
Expand regular path expressions at compile time
using structural summary
Guarantee to visit, at run-time, a subset of the
objects visited with the original path expression
e.g. Library. -
Proceedings.Conference.Paper
Books.Book
Movies.Movie.BasedOn

48
XML Query Processing

[Fernandez, Suciu ICDE 98]
Optimize regular path expressions
Restrict navigation to only a fragment of the
data
Use state extents to eliminate and reduce
navigation
[McHugh, Widom VLDB 99]
Propose a cost-based query optimizer
Transform a query into a logical query plan
Explore the space of possible physical plans
Introduce new types of indexes for efficient
traversals through data graphs
Suggest an appropriate set of statistics and
devise methods for computing and storing
statistics

49
XML Query Processing

[Christophides, Cluet, Simeon SIGMOD 00]
Propose an XML algebra
Captures the expressive power of semistructured
or XML query languages
Can wrap more structured languages such as SQL or
OQL
New optimization techniques
Exploit type information
Push query evaluation to external source

50
XML View of Relational Data

[Fernandez, Tan, Suciu WWW 00]
Mediator system
Automatically convert the relational data into
XML
An XML view of the relational database is defined
using a declarative query language
Some other application formulates a query over
the virtual view
Fully exploits the underlying RDBMS query engine

51
XML View of Relational Data

[Shanmugasundaram et al. VLDB 00]
Propose new scalar and aggregate functions in SQL to
construct complex XML documents
Explore different execution plans for generating
the contents of XML documents
Constructing the XML document inside the relational
engine benefits performance the most
Outer union plan

52
Metadata Management

Generic data model
Not impossible, but unlikely
Proliferation of data models
No proof that any one is superior
Semantics aren't fully captured in any data model

53
Metadata Management

Philip Bernstein, VLDB 00's Panel
Generality - representation of metadata must
apply to all application areas
Usefulness - exploit application-specific
semantics
Is there an effective middle-ground?
Define generic high-level operations on models
and mappings, e.g., Match, Merge, Select,
Compose,
Match(M1, M2, ?, map), Merge(M1, M2, map),
Compose(map1, map2)
Implement operations on a DBMS

54
Metadata Management
55
Metadata Management: Clio

[Miller, Haas, Hernandez VLDB 00]
Tool to support mapping between data
representations
Mapping represented as SQL
Heterogeneous query middleware to examine data
and schemas
Build database competencies in query and schema
management, data mining
Exploit user knowledge of target semantics
Enhance user knowledge of source schema and data
Provide knowledge of query subtleties,
alternative mappings

56
Metadata Management: Clio

User indicates what schema and data values are
needed for target
Tool enumerates and ranks mappings
Many possible subtle differences
Best mappings are simple, but lose the least
information possible
Allows immediate user feedback

57
Filtering XML Documents

[Altinel, Franklin VLDB 00]
XFilter provides highly efficient matching of
XML documents to user profiles
Event filtering system
Highly scalable
Use XPath as a profile language
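
What "matching documents to profiles" means can be shown with a naive Python sketch. It is only a functional illustration and not the technique XFilter uses to achieve its efficiency: every profile is simply evaluated against the parsed document with the limited XPath subset of the standard-library `xml.etree.ElementTree`, which accepts only paths relative to the root element. The profile data and the name `matching_users` are hypothetical.

```python
import xml.etree.ElementTree as ET

def matching_users(xml_text: str, profiles: dict) -> list:
    """Return the users whose XPath profile selects at least one element."""
    root = ET.fromstring(xml_text)
    return sorted(user for user, path in profiles.items()
                  if root.find(path) is not None)

# Hypothetical user profiles, written relative to the document root.
profiles = {
    "alice": ".//name",           # any name element
    "bob": ".//affiliation",      # any affiliation element
    "carol": "./author/address",  # an address directly under a top-level author
}

doc = "<book><author><name>Richard Dawkins</name><address/></author></book>"
print(matching_users(doc, profiles))  # ['alice', 'carol']
```

Evaluating every profile separately costs time proportional to the number of profiles per document, which is exactly what a scalable filtering system has to avoid.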

58
XML Data Compression

[Liefke, Suciu SIGMOD 00]
Structure, consisting of tags and attributes, is
compressed separately
Group related data items and compress each
related group separately
Apply semantic compression
Automatic data mining tools to cluster data need
to be developed
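
The first two ideas, compressing structure separately and grouping related data items, can be sketched in Python. This is a toy illustration under stated assumptions and not the authors' compressor: data items are grouped simply by their enclosing tag, attributes and text after child elements are ignored, each group is compressed with the standard-library `zlib`, and the structure encoding (`<tag`, `#` for a data item, `>` for an end tag) is made up for this example.

```python
import xml.etree.ElementTree as ET
import zlib
from collections import defaultdict

def compress(xml_text: str):
    """Return (compressed structure, {tag: compressed data container})."""
    root = ET.fromstring(xml_text)
    structure = []                  # tags only, with '#' marking a data item
    containers = defaultdict(list)  # data items grouped by enclosing tag

    def walk(elem):
        structure.append("<" + elem.tag)
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
            structure.append("#")
        for child in elem:
            walk(child)
        structure.append(">")

    walk(root)
    packed_structure = zlib.compress(" ".join(structure).encode("utf-8"))
    packed_data = {tag: zlib.compress("\n".join(items).encode("utf-8"))
                   for tag, items in containers.items()}
    return packed_structure, packed_data

doc = ("<book><booktitle>The C Programming Language</booktitle>"
       "<author><name>Brian W. Kernighan</name></author>"
       "<author><name>Dennis M. Ritchie</name></author></book>")

packed_structure, packed_data = compress(doc)
print(sorted(packed_data))  # ['booktitle', 'name']
print(zlib.decompress(packed_data["name"]).decode("utf-8").split("\n"))
# ['Brian W. Kernighan', 'Dennis M. Ritchie']
```

Grouping by tag puts similar values next to each other, which is what lets a general-purpose or type-specific compressor do better than it would on the interleaved text.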

59
Future Research Issues

XML views of traditional databases
Relational database
Object-relational database
XML Storage
Object-relational databases
Alternative storage methods
Indexes for XML data
XML query processing and optimization
Centralized and distributed processing

60
Future Research Issues

Schema mapping
Mixing structure search with full-text search
XML-based mediators
XML data compression

61
Summary

XML poses many challenges to the database
community
XML Storage Issues
XML Indexes
DTD Extraction
Query language
Query processing
Metadata Management

62
Biography of Kyuseok Shim

Kyuseok Shim is an Assistant Professor in the
Computer Science Department at KAIST in Korea. He
is also currently an Advisory Committee Member
for ACM SIGKDD. Before joining KAIST, he was a
member of technical staff (MTS) in the Database
Systems Research Department at Bell Laboratories.
While he was at Bell Laboratories, he started and
worked on the Serendip data mining project and
the eXcalibur XML storage project. Before joining
Bell Laboratories, he worked for Rakesh Agrawal's
Quest data mining project at IBM Almaden Research
Center. He also worked with Surajit Chaudhuri as
a summer intern for two summers at Hewlett
Packard Laboratories. He received the B.S. degree in
Electrical Engineering from Seoul National
University, and the M.S. and Ph.D. degrees in
Computer Science from the University of Maryland,
College Park.
Kyuseok Shim has been working in the area
of databases focusing on XML, data mining, data
warehousing, query processing and query
optimization. He has published more than 30
research papers in prestigious international
conferences and journals. He has also served as a
program committee member on several international
conferences including ICDE'97, SIGKDD'98,
SIGMOD'99, SIGKDD'99, ICDE'00, VLDB'00 and
SIGKDD'01.