Title: XML Research Issues in Database Perspective
1XML Research Issues in Database Perspective
- Kyuseok Shim
- shim_at_cs.kaist.ac.kr
- http://cs.kaist.ac.kr/shim
- Korea Advanced Institute of Science and Technology
2XML Working Groups
- Core XML
- XML, namespaces, XML Infoset
- XML Linking
- XPath, XPointer, XLink
- XML Schema
- XML Schema
- XML Query
- XML Query, XML Query Data Model
- Document Object Model (DOM)
- XSL
3XML
- A W3C standard to complement HTML
- An instance of semistructured data [Abi97]
- Document Type Definition (DTD)
- Origin: SGML
- Tags describe the semantics of the data
- HTML simply specifies how the data item is to be displayed
- An element can contain a sequence of nested sub-elements
- Sub-elements may themselves be tagged elements or character data
4Document Type Definition (DTD)
- A part of XML specification
- An XML document may have a DTD
- Grammar for describing the structure of an XML document
- The structure of an element is specified by a regular expression
- Terminology for XML
- well-formed if tags are correctly closed
- valid if it has a DTD and conforms to it
- For exchanges of data, validation is useful
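The difference between well-formed and valid can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree module, which checks well-formedness only and does not validate against a DTD. The helper name is_well_formed and the two sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, i.e. tags are correctly nested and closed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The first string closes every tag; the second leaves <booktitle> open.
good = "<book><booktitle>The Selfish Gene</booktitle></book>"
bad = "<book><booktitle>The Selfish Gene</book>"

print(is_well_formed(good), is_well_formed(bad))
```

Checking validity would additionally require a validating parser that reads the DTD, which the standard library module used here does not provide.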
5Document Type Definition (DTD)
- Syntax
- , (comma): sequence
- |: or
- (): grouping
- ?, *, +: zero or one, zero or more, one or more occurrences
- ANY allows an arbitrary XML fragment to be nested within the element
6A DTD Example
- <!ENTITY USA "United States of America">
- <!ELEMENT book (booktitle, author)>
- <!ATTLIST book id ID #IMPLIED>
- <!ELEMENT booktitle (#PCDATA)>
- <!ELEMENT author (name, (address | affiliation))>
- <!ELEMENT name (#PCDATA)>
- <!ELEMENT address ANY>
- <!ELEMENT affiliation (#PCDATA)>
7An XML Document Example
- <book id="123">
- <booktitle> The Selfish Gene </booktitle>
- <author id="dawkins">
- <name> Richard Dawkins </name>
- <address>
- <city> Timbuktu </city>
- <zip> 99999 </zip>
- </address>
- </author>
- </book>
- <book>
- <booktitle> The C Programming Language </booktitle>
- <author>
- <name> Brian W. Kernighan </name>
- <address> <country> USA </country> </address>
- </author>
- <author>
- <name> Dennis M. Ritchie </name>
- <affiliation> Bell Labs </affiliation>
8An XML Namespace
- Provides a simple method for qualifying element and attribute names used in Extensible Markup Language documents by associating them with namespaces identified by URI references.
- Is a collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names.
- <x xmlns:edi='http://ecommerce.org/schema'>
- <!-- the 'price' element's namespace is http://ecommerce.org/schema -->
- <edi:price units='Euro'>32.18</edi:price>
- </x>
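A short sketch of how a namespace-aware parser resolves the edi prefix in the example above. It uses Python's standard xml.etree.ElementTree, which reports each qualified name as {namespace-URI}local-name. The variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

doc = """<x xmlns:edi='http://ecommerce.org/schema'>
  <edi:price units='Euro'>32.18</edi:price>
</x>"""

root = ET.fromstring(doc)
price = root[0]  # the only child of <x>

# The prefix is replaced by the URI it is bound to.
print(price.tag)
print(price.get("units"), price.text)

# Lookups can use a prefix again through an explicit prefix-to-URI mapping.
same = root.find("edi:price", {"edi": "http://ecommerce.org/schema"})
```

The prefix itself carries no meaning: any other prefix bound to the same URI would yield the same qualified name.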
9XML Schemas
- Recently proposed
- http://www.w3c.org/TR/xmlschema-1
- http://www.w3c.org/TR/xmlschema-2
- Unifies previous schema proposals
- Generalizes DTDs
- Uses XML syntax
10XML Schema
- <elementType name="article">
- <sequence>
- <elementTypeRef name="title"/>
- <elementTypeRef name="author" minOccurs="0"/>
- </sequence>
- </elementType>
- DTD: <!ELEMENT article (title, author)>
11XTRACT Extracting DTD from XML Documents
- [Garofalakis, Gionis, Rastogi, Seshadri, Shim 99]
- DTDs contain valuable information on the structure of the documents
- play a critical role in the storage as well as the formulation and optimization of queries
- DTDs are not mandatory
- XML databases frequently do not have accompanying DTDs
- XTRACT can infer concise and semantically meaningful DTDs for XML documents
12XTRACT Motivation
- DTD is very useful!
- Plays a crucial role in efficient storage of XML data
- [SHT99, DFS99] DTD is exploited to generate an effective relational schema
- Helps devise efficient plans for queries
- [GW97, FS97] DTD allows restricting the search to only the relevant portions of the data
- Aids users in forming meaningful queries over the XML database
- However, an XML document may not always have an accompanying DTD
13XTRACT Related Work
- Mining DTDs from a collection of XML documents has not been addressed in the literature
- Extraction of schema from semistructured data
- [NAM98, GW97, FS97]
- attempts to find a typing for semistructured data
- finding a typing is tantamount to grouping objects that have similar edges
- In a DTD, outgoing edges from a type can be described by an arbitrary regular expression
- No ordering is imposed for edges
14XTRACT Related Work
- [Gol67, Gol78, Ang78]
- Infer formal languages from examples
- Purely theoretical and focus on investigating the computational complexity of the language inference problem
- [KMU95]
- Infers a pattern language from positive examples
- MDL principle was used
- Assume the set of simple patterns is available
- Cannot find general regular expressions
- Patterns are not known a priori
15XTRACT Problem Formulation
- Given a set I of N input sequences nested within element e
- Compute a DTD for e such that every sequence in I conforms to the DTD
16XTRACT Naive Approaches
- Factor as much as possible
- e.g. t, ta, taa, taaa, taaaa
- t | t (a | a (a | a (a | aa)))
- much more voluminous and a lot less intuitive
- Find the automaton with the smallest number of states that accepts I and derive regular expressions from the automaton
- may not be the shortest regular expression
17XTRACT Desirable DTDs
- The DTD should be concise (i.e. small in size)
- easy to understand and succinct
- The DTD should be precise
- not cover too many sequences not contained in I
- not too general and captures the structure of the input sequences
- Trade-off!
18XTRACT Example
- I = {ab, abab, ababab}
- (a | b)*
- a gross over-generalization of the input
- completely fails to capture any structure inherent in the input
- ab | abab | ababab, ab | ab(ab | abab)
- accurately reflect the structure of the input sequences but do not generalize
- (ab)*
- succinct and generalizes the input sequences without losing too much structure information
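The conciseness versus precision trade-off among these candidates can be checked mechanically. The sketch below, written with Python's re module, confirms that each candidate accepts every input sequence and then counts how many strings over {a, b} of length at most 6 each one accepts. The length bound, the function names and the use of Python regular expressions as a stand-in for DTD content models are assumptions made for illustration.

```python
import re
from itertools import product

I = ["ab", "abab", "ababab"]
candidates = ["(a|b)*", "ab|abab|ababab", "ab|ab(ab|abab)", "(ab)*"]

def covers(pattern, sequences):
    """True if every input sequence conforms to the candidate expression."""
    return all(re.fullmatch(pattern, s) is not None for s in sequences)

def language_size(pattern, alphabet="ab", max_len=6):
    """Count the strings of length <= max_len accepted by the expression."""
    total = 0
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            if re.fullmatch(pattern, "".join(chars)):
                total += 1
    return total

for pattern in candidates:
    print(pattern, covers(pattern, I), language_size(pattern))
```

All four candidates cover I, but (a|b)* also accepts every other string over the alphabet, while the two enumerating candidates accept nothing beyond the input.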
19XTRACT MDL Principle
- An information-theoretic measure for quantifying and thereby resolving the tradeoff between conciseness and preciseness
- MDL principle has been successfully applied in a variety of situations
- e.g. decision tree classifiers
- Roughly speaking, the best theory to infer from a set of data is the one that minimizes the sum of
- the length of the theory, in bits (conciseness)
- the length of the data, in bits, when encoded with the help of the theory (preciseness)
20XTRACT Example
- I = {ab, abab, ababab}
- (a | b)*
- abab: cost of 5 (1 for the number of repetitions (4) + 4 characters to represent the chosen characters)
- MDL cost: 6 (encoding DTD) + 3 + 5 + 7 = 21
- ab | abab | ababab
- MDL cost: 14 + 3 = 17
- ab | ab(ab | abab)
- MDL cost: 14 + 1 + 2 + 2 = 19
- (ab)*
- MDL cost: 5 + 3 = 8
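The cost figures on this slide can be reproduced with a small sketch. It follows the slide's simplified accounting (one unit per character of the DTD expression, one unit per repetition count or per choice made while encoding a sequence), not a bit-level encoding. The per-candidate encoders are hand-written assumptions for these four expressions only.

```python
I = ["ab", "abab", "ababab"]

def dtd_cost(expr):
    """Cost of the theory: one unit per character of the expression."""
    return len(expr)

# Cost of encoding one input sequence with the help of each candidate DTD.
encoders = {
    "(a|b)*": lambda s: 1 + len(s),                     # repetition count + one choice per character
    "ab|abab|ababab": lambda s: 1,                      # index of the chosen alternative
    "ab|ab(ab|abab)": lambda s: 1 if s == "ab" else 2,  # outer choice, then inner choice
    "(ab)*": lambda s: 1,                               # repetition count only
}

def mdl_cost(expr):
    """Sum of the theory length and the length of the data encoded with it."""
    return dtd_cost(expr) + sum(encoders[expr](s) for s in I)

costs = {expr: mdl_cost(expr) for expr in encoders}
best = min(costs, key=costs.get)
print(costs, best)
```

Under this accounting the minimum-cost candidate is (ab)*, the expression that is both short and still cheap to encode the input with.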
21XTRACT
- Generalization
- generalizes zero or more candidate DTDs by replacing patterns in the input sequence with meta-characters like *
- e.g. abab -> (ab)*, bbbe -> b*e
- Factorization
- factors common subexpressions from the generalized candidate DTDs
- e.g. bd | be -> b (d | e)
- Minimum Description Length (MDL) Principle
- MDL ranks each candidate DTD and chooses the minimum cost DTD
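A toy sketch of the generalization and factorization steps, limited to the kinds of rewrites shown in the examples on this slide. It assumes every symbol is a single character and is not the XTRACT algorithm itself. The function names generalize and factor_common_prefix are hypothetical.

```python
import os
import re

def generalize(seq):
    """Replace repetitions with the * meta-character (single-character symbols assumed)."""
    n = len(seq)
    # Case 1: the whole sequence is k >= 2 copies of one block.
    for size in range(1, n // 2 + 1):
        if n % size == 0 and seq[:size] * (n // size) == seq:
            block = seq[:size]
            return f"({block})*" if size > 1 else f"{block}*"
    # Case 2: collapse each run of one repeated symbol.
    return re.sub(r"(.)\1+", r"\1*", seq)

def factor_common_prefix(alternatives):
    """Rewrite x y1 | x y2 as x (y1 | y2) when all alternatives share a prefix x."""
    prefix = os.path.commonprefix(alternatives)
    rest = [alt[len(prefix):] for alt in alternatives]
    if not prefix or len(alternatives) < 2 or "" in rest:
        return " | ".join(alternatives)  # nothing to factor in this simple sketch
    return f"{prefix} ({' | '.join(rest)})"

print(generalize("abab"), generalize("bbbe"))
print(factor_common_prefix(["bd", "be"]))
```

Each rewrite yields one more candidate expression; ranking the candidates is then left to the MDL step.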
22XTRACT Example
23XML Storage
- Existing approaches sacrifice either efficiency or flexibility unnecessarily
- Traditional DBMSs (RDB or OODB) have rigid schemas.
- Integrating a new site requires complex mapping and potential loss of information
- Integrating a new site may require schema evolution.
- Existing fully semi-structured data storage techniques sacrifice query efficiency and space.
- they require excessive interpretation (harming query efficiency) and
- redundant storage
24XML Storage
- Need to store and query XML data flexibly and efficiently
- improve the tradeoffs for storage space and query efficiency for a given degree of flexibility.
- allow the user to choose the degree of storage flexibility
25XML Storage
- text file
- relational DBMS
- object-oriented DBMS
- build special purpose repository
26XML Storage Text File
- To store the flat streams, a file system or a BLOB manager in a DBMS is used
- e.g. [Abiteboul, Cluet, Milo VLDB93]
- Pros
- simple
- fast for storing and retrieving whole documents
- uses less space than one might think
- reasonable clustering
- Cons
- incremental update is difficult
- requires a special purpose query processor
- accessing document structure is only possible through parsing
27XML Storage Relational DBMS
- Advantages
- RDBMS products are mature and scale well
- Traditional and semi-structured data can co-exist
- RDBMS can process even complex queries on large databases within seconds
- Disadvantages
- expensive to reconstruct the original XML data from relational data
- updates are both complicated and expensive in certain cases
- extra effort to translate XML queries and updates into SQL
28XML Storage RDBMS (1)
- [Florescu, Kossmann IEEE Data Eng. Bulletin 99]
29XML Storage RDBMS (2)
- [Shanmugasundaram et al. 99]
- process DTD to generate a relational schema
- Use DTD graph and element graph
- three approaches
- Basic
- Shared
- Hybrid
30DTD
31XML Document
32The Basic Inline Technique
- Creates relations for every element
- an XML document can be rooted at any element in a DTD
- element graph is used to decide the relations
- Inlines as many descendants as possible
- e.g. the author relation has attributes firstname, lastname, address and authorid
- Creates a separate relation to handle * in the DTD graph using a foreign key
- Expresses the recursive relationship using the notion of relational keys
33Building an Element Graph
- Do a depth first traversal of the DTD graph starting at the element node
- Each node is marked as visited the first time it is reached
- Each node is unmarked once all of its children have been traversed
- If an unmarked node in the DTD graph is reached, a new node with the same name is created in the element graph
- If an attempt is made to traverse a marked DTD node, a backpointer edge is added
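The traversal above can be sketched as follows. The DTD graph is represented as a dictionary from a node name to its children. The sample graph, the integer node identifiers and the choice to point a backpointer at the most recently created copy of the marked node are assumptions made for illustration.

```python
from itertools import count

def build_element_graph(dtd_graph, start):
    """Depth-first expansion of a DTD graph into an element graph."""
    nodes = []         # (node_id, name) in creation order
    edges = []         # (parent_id, child_id) regular edges
    backpointers = []  # (from_id, to_id) edges to a node still on the current path
    marked = {}        # name -> element-graph id while the node is marked
    ids = count()

    def visit(name, parent_id):
        if name in marked:
            # Attempt to traverse a marked node: add a backpointer edge instead.
            backpointers.append((parent_id, marked[name]))
            return
        node_id = next(ids)            # unmarked node reached: create a new node
        nodes.append((node_id, name))
        if parent_id is not None:
            edges.append((parent_id, node_id))
        marked[name] = node_id         # mark on first visit
        for child in dtd_graph.get(name, []):
            visit(child, node_id)
        del marked[name]               # unmark once all children are traversed

    visit(start, None)
    return nodes, edges, backpointers

# Hypothetical DTD graph fragment with a recursive monograph/editor pair.
dtd_graph = {
    "monograph": ["title", "author", "editor"],
    "author": ["name"],
    "editor": ["*"],
    "*": ["monograph"],
}
nodes, edges, backpointers = build_element_graph(dtd_graph, "monograph")
print(nodes, edges, backpointers)
```

Because a node is unmarked after its subtree is finished, a DTD node reachable along two different paths is copied twice in the element graph, while recursion produces a backpointer rather than an infinite expansion.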
34DTD Graph
35An Example Element Graph
36Creation of Relations
- Given an element graph, relations are created as follows
- A relation is produced for the root element
- All descendant elements are inlined into that relation except
- children directly below a * node
- each node having a backpointer edge pointing to it
- A separate relation is created for each of the above exception nodes
- Each relation has ID and parentID fields
37Basic Inline Schema
38Basic Inline Technique
- Pros
- List all authors of books
- Cons
- List all authors having first name Jack (5 separate queries)
- A large number of relations is created
39Shared Inline Technique
- Relations are created for all elements in the DTD graph whose nodes have in-degree greater than one
- Nodes with an in-degree of one are inlined
- Nodes with an in-degree of zero are made separate relations
- Of mutually recursive elements all having in-degree one, one of them is made a separate relation
- e.g. monograph and editor
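The in-degree rules above can be sketched on a small DTD graph. The sketch covers only the three in-degree rules and deliberately leaves out the handling of mutually recursive elements. The sample graph and the function name shared_inline_plan are assumptions made for illustration.

```python
from collections import Counter

def shared_inline_plan(dtd_graph):
    """Split element names into separate relations and inlined elements by in-degree."""
    indegree = Counter(child for children in dtd_graph.values() for child in children)
    names = set(dtd_graph) | set(indegree)
    # In-degree zero or greater than one: separate relation.
    relations = sorted(n for n in names if indegree[n] != 1)
    # In-degree exactly one: inlined into the parent's relation.
    inlined = sorted(n for n in names if indegree[n] == 1)
    return relations, inlined

# Hypothetical DTD graph: author is shared by book and article.
dtd_graph = {
    "book": ["booktitle", "author"],
    "article": ["title", "author"],
    "author": ["name"],
    "name": ["firstname", "lastname"],
}
relations, inlined = shared_inline_plan(dtd_graph)
print(relations)
print(inlined)
```

Here author becomes a single relation shared by book and article, so one query over that relation finds all authors.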
40Shared Inlining Technique
- Small number of relations compared to the Basic schema
- Use isRoot field for inlining problems
- Requires only one query for finding all authors
- Still, Basic is superior for reducing the number of joins
41Hybrid Inlining Technique
- Additionally inlines elements with in-degree greater than one that are not recursive or reached through a * node
- e.g. author is inlined with book and monograph
42XML Storage STORED
- [Deutsch, Fernandez, Suciu SIGMOD99]
- Maps semistructured data into relational data
- Integrate both relational and overflow systems
- Use a data mining algorithm to find frequent subtrees
- because there is no notion of DTD in semistructured data
- Overflow mapping is used to ensure lossless storage
- overflow objects or object parts are stored in a separate semistructured data object repository
- Incremental updates and ordering of elements are not considered
43XML Storage STORED
- Derive schema from data with data mining algorithm
44XML Storage OODBMS
- Stores XML elements with the structured semantics
- Flexible locking down to element level
- In an RDBMS, due to the disassembly of XML data into various tables, implementing an effective locking scheme is hard
- With a flat file, no portion of a document being modified is available to other users
- Use a separate record for each tree node
- Systems available
- POET (POET Content Management system)
- Excelon (ObjectDesign)
- LORE
45XML Storage NATIX
- [Kanne, Moerkotte ICDE00]
- Native repository
- Classical record manager
- Accesses raw disk or file system files
- Provides a memory space divided into segments (equal sized pages)
- Tree storage manager
- maps the trees used to model documents
- Schema manager
- maintains the system catalog data (e.g. DTD)
- system catalog is stored in XML format
46NATIX
- Store whole document in one record, instead of storing each tree node in a separate record
- Semantically split large trees based on the underlying tree structure
- Partition the data into subtrees and store each subtree in a record
- Connected subtrees residing in other records are represented by proxy objects
- proxy objects consist of an RID
- substituting all proxies by the respective subtrees reconstructs the original data tree
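A minimal sketch of the proxy idea described above: each record holds a subtree, a proxy holds only the record identifier (RID) of another record, and the original tree is recovered by substituting every proxy with the subtree it points to. The class names, the in-memory dictionary standing in for the record manager and the sample document are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Proxy:
    rid: int  # identifier of the record that stores the connected subtree

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # Node or Proxy entries

def reconstruct(item, records):
    """Replace every proxy by the subtree stored under its RID."""
    if isinstance(item, Proxy):
        return reconstruct(records[item.rid], records)
    return Node(item.tag, [reconstruct(child, records) for child in item.children])

def tags(node):
    """Tags of the tree in document order."""
    return [node.tag] + [t for child in node.children for t in tags(child)]

# One document partitioned into three records, keyed by RID.
records = {
    1: Node("book", [Node("booktitle"), Proxy(2)]),
    2: Node("author", [Node("name"), Proxy(3)]),
    3: Node("address", [Node("city"), Node("zip")]),
}
tree = reconstruct(records[1], records)
print(tags(tree))
```

A query that only touches booktitle reads record 1 alone; the other records are fetched only when a proxy is followed.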
47XML Query Processing
- [McHugh, Widom Workshop 99]
- Expand regular path expressions at compile time using a structural summary
- Guarantee to visit, at run-time, a subset of the objects visited with the original path expression
- e.g. Library. -
- Proceedings.Conference.Paper
- Books.Book
- Movies.Movie.BasedOn
48XML Query Processing
- [Fernandez, Suciu ICDE 98]
- Optimize regular path expressions
- Restrict navigation to only a fragment of the data
- Use state extents to eliminate and reduce navigation
- [McHugh, Widom VLDB 99]
- Propose a cost-based query optimizer
- Transform a query into a logical query plan
- Explore the space of possible physical plans
- Introduce new types of indexes for efficient traversals through data graphs
- Suggest an appropriate set of statistics and devise methods for computing and storing statistics
49XML Query Processing
- [Christophides, Cluet, Simeon SIGMOD 00]
- Propose an XML algebra
- Captures the expressive power of semistructured or XML query languages
- Can wrap more structured languages such as SQL or OQL
- New optimization techniques
- Exploit type information
- Push query evaluation to external source
50XML View of Relational Data
- [Fernandez, Tan, Suciu WWW 00]
- Mediator system
- Automatically convert the relational data into XML
- An XML view of the relational database is defined using a declarative query language
- Some other application formulates a query over the virtual view
- Fully exploit the underlying RDBMS query engine
51XML View of Relational Data
- [Shanmugasundaram et al. VLDB 00]
- Propose new scalar and aggregate functions in SQL to construct complex XML documents
- Explore different execution plans for generating the contents of XML documents
- Constructing XML documents inside the relational engine benefits performance the most
- Outer union plan
52Metadata Management
- Generic data model
- Not impossible, but unlikely
- Proliferation of data models
- No proof any one is superior
- Semantics aren't fully captured in any data model
53Metadata Management
- [Philip Bernstein VLDB 00's Panel]
- Generality - representation of metadata must apply to all application areas
- Usefulness - exploit application-specific semantics
- Is there an effective middle-ground?
- Define generic high-level operations on models and mappings, e.g., Match, Merge, Select, Compose, ...
- Match(M1, M2, ?, map), Merge(M1, M2, map), Compose(map1, map2)
- Implement operations on a DBMS
54Metadata Management
55Metadata Management Clio
- [Miller, Haas, Hernandez VLDB 00]
- Tool to support mapping between data representations
- Mapping represented as SQL
- Heterogeneous query middleware to examine data and schemas
- Build database competencies in query and schema management, data mining
- Exploit user knowledge of target semantics
- Enhance user knowledge of source schema and data
- Provide knowledge of query subtleties, alternative mappings
56Metadata Management Clio
- User indicates what schema and data values are needed for the target
- Tool enumerates and ranks mappings
- Many possible subtle differences
- Best mappings are simple, but lose the least information possible
- Allows immediate user feedback
57Filtering XML Documents
- [Altinel, Franklin VLDB 00]
- XFilter provides highly efficient matching of XML documents to user profiles
- Event filtering system
- Highly scalable
- Use XPath as a profile language
58XML Data Compression
- [Liefke, Suciu SIGMOD 00]
- Structure, consisting of tags and attributes, is compressed separately
- Group related data items and compress each related group separately
- Apply semantic compression
- Automatic data mining tools to cluster data need to be developed
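The first two ideas above (separating structure from data and compressing related data items together) can be sketched with Python's standard zlib and xml.etree.ElementTree modules. Grouping values by element or attribute name, the token format of the structure stream and the sample document are assumptions made for illustration; semantic compression is not shown.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def split_and_compress(xml_text):
    """Compress the tag/attribute skeleton and each group of data values separately."""
    root = ET.fromstring(xml_text)
    structure = []                  # skeleton of tags and attribute names
    containers = defaultdict(list)  # group name -> data values

    def walk(elem):
        structure.append("<" + elem.tag)
        for name, value in elem.attrib.items():
            structure.append("@" + name)
            containers["@" + name].append(value)
        text = (elem.text or "").strip()
        if text:
            structure.append("#")   # placeholder for a data value
            containers[elem.tag].append(text)
        for child in elem:
            walk(child)
        structure.append(">")

    walk(root)

    def pack(items):
        return zlib.compress("\n".join(items).encode("utf-8"))

    return pack(structure), {name: pack(values) for name, values in containers.items()}

# Made-up sample data for illustration.
doc = """<books>
  <book id="1"><booktitle>The Selfish Gene</booktitle><zip>99999</zip></book>
  <book id="2"><booktitle>The C Programming Language</booktitle><zip>12345</zip></book>
</books>"""
packed_structure, packed_data = split_and_compress(doc)
print(sorted(packed_data))
print(zlib.decompress(packed_data["zip"]).decode("utf-8").split("\n"))
```

Values with similar content, such as all zip codes, end up in the same stream, which is what gives a general-purpose compressor more redundancy to exploit.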
59Future Research Issues
- XML views of traditional databases
- Relational database
- Object-relational database
- XML Storage
- Object-relational databases
- Alternative storage methods
- Indexes for XML data
- XML query processing and optimization
- Centralized and distributed processing
60Future Research Issues
- Schema mapping
- Mixing structure search with full-text search
- XML-based mediators
- XML data compression
61Summary
- XML provides a lot of challenges to the database community
- XML Storage Issues
- XML Indexes
- DTD Extraction
- Query language
- Query processing
- Metadata Management
62Biography of Kyuseok Shim
- Kyuseok Shim is an Assistant Professor in the
Computer Science Department at KAIST in Korea. He
is also currently an Advisory Committee Member
for ACM SIGKDD. Before joining KAIST, he was a
member of technical staff (MTS) in the Database
Systems Research Department at Bell Laboratories.
While at Bell Laboratories, he started and
worked on the Serendip data mining project and
the eXcalibur XML storage project. Before joining
Bell Laboratories, he worked on Rakesh Agrawal's
Quest data mining project at IBM Almaden Research
Center. He also worked with Surajit Chaudhuri as
a summer intern for two summers at Hewlett
Packard Laboratories. He received a B.S. degree in
Electrical Engineering from Seoul National
University, and M.S. and Ph.D. degrees in
Computer Science from the University of Maryland,
College Park.
- Kyuseok Shim has been working in the area
of databases, focusing on XML, data mining, data
warehousing, query processing and query
optimization. He has published more than 30
research papers in prestigious international
conferences and journals. He has also served as a
program committee member for several international
conferences including ICDE'97, SIGKDD'98,
SIGMOD'99, SIGKDD'99, ICDE'00, VLDB'00 and
SIGKDD'01.