Integrating Keyword Search into XML Query Processing - PowerPoint PPT Presentation

About This Presentation

Slides: 45

Provided by: alexk53

Transcript and Presenter's Notes

Title: Integrating Keyword Search into XML Query Processing

1
Integrating Keyword Search into XML Query Processing
XML Query Language (XML-QL)
Extending XML-QL with Keyword Search
Extended XML-QL Implementation Using RDBMS

Presentation By
Alex Kremer
Ariel Rosenblatt

2
Bibliography (well-formed, but invalid)

Bibliography
Article elements are from different sources
Same information, but using different XML Schemas
/ DTDs (Document Type Definitions)

3
XML Queries

XML is becoming the Data Storage and Exchange
Format of choice in many applications
Handling of XML data requires a rich and powerful
Query Language
Allow for querying the content and structure of
an XML document
Varying or unknown structures can make
formulating queries very difficult

4
XML Queries: Why not SQL/OQL?

XML is not rigidly structured
In XML the schema can exist with the data as tag
names
If a DTD is not available, the schema is built while
the document is parsed
Elements may be missing or may occur multiple
times
This flexibility is crucial for EDI (Electronic
Data Interchange)

5
XML Query Requirements: W3C Working Group

Goals
Support different usage scenarios
Define the data model and query operators
Define query language syntax
Interoperate with other XML working groups

6
XML Query Requirements: Usage Scenarios

Human-readable documents
Manuals, Books, Articles
Data-oriented documents
XML representation of
Database data, Object data,
XML representation might be either
Physical or Virtual

7
XML Query Requirements: Usage Scenarios (Contd.)

Mixed model documents
Hybrid of document-oriented and data-oriented
Catalogues, Patient health records,
Administrative data
Configuration files, User profiles,
Administrative logs

8
XML Query Requirements: Usage Scenarios (Contd.)

Filtering streams
On-line filtering / extracting / transforming /
routing, of XML data streams
Logs of email messages, Network packets, Stock
market data, Newswire feeds
Document Object Model (DOM)
Perform queries on DOM structures to return sets
of nodes that meet the specified criteria

9
XML Query Requirements: Usage Scenarios (Contd.)

Multiple syntactic environments for queries
embedded in
URL, XML, JSP or ASP pages, a string in a
general-purpose programming language

10
XML Query Requirements: Interoperability

Results must be returned in a DOM compatible
manner
XPath (used in XPointer and XSLT)
XPath expressibility and search facilities should
be used in query syntax
Usage of XML Schema (XSDL) and/or DTD

11
XML Query Languages: Proposals to W3C

XQL (heavily based on XPath)
XML-QL

12
XML-QL

It is declarative
It is relationally complete; in particular, it can
express joins
Simple enough to enable optimizations
It can extract data from existing XML documents
and construct new documents (transformations)

13
XML-QL Syntax
WHERE ( xml-pattern ELEMENT_AS elem_var ) IN url, ( predicate )
CONSTRUCT xml-pattern variable

WHERE clause specifies how to filter data from
the input XML dataset
CONSTRUCT clause specifies how to assemble the
query results in XML

14
XML-QL Example 1
WHERE <article>
  <author><name>$N</name></author>
  <title>$T</title>
</article> ELEMENT_AS $E IN "bibliography.xml",
$N like "Florescu"
CONSTRUCT <result> $E </result>

Yields the following result

15
XML-QL Explained: The Data Model

A Set of XML documents must be represented (XML
Data Set)
XML elements in a dataset can be partitioned
according to their types
Need to represent information in a loss-less
manner (original data set must be recreatable
from the representation)

16
XML-QL Explained: Data Model Representation

[Figure: graph representation of bibliography.xml. The root node ID00 (Bibliography) has four article edges, to nodes ID01, ID04, ID08 and ID14. Edges labeled id, link, title, date, author, source and name lead from these nodes either to further element nodes (ID02 to ID13) or to leaf values such as "Daniela Florescu", "W3C" and 20000815.]

17
XML-QL Explained: Data Model Representation

Dataset D is represented as a graph G_D
Nodes
Element e → node N_e, uniquely labeled ID_e
Data value v → leaf L_v, uniquely labeled v
Edges
(N_e, N_e') labeled with the tag of e', if e' is
directly nested within e (<e><e'></e'></e>)
(N_e, L_v) labeled with , if v is directly
contained within e (<e>v</e>)
(N_e, L_v) labeled with attribute name a, if v is
the value of attribute a of element e (<e
a="v"></e>)
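
The node and edge rules of this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the sample document, the function name build_graph, the ID numbering order and the "#text" label used for content edges (the slide's own label did not survive in the transcript) are all assumptions.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical two-article sample; the real bibliography.xml is not reproduced here.
SAMPLE = """<bibliography>
  <article id="1">
    <author><name>Daniela Florescu</name></author>
    <title>Integrating Keyword Search</title>
  </article>
  <article id="2">
    <author name="Daniela Florescu"/>
    <title>A Query Language</title>
  </article>
</bibliography>"""


def build_graph(xml_text):
    """Return (root_id, edges); each edge is (source, label, target, target_is_leaf)."""
    counter = itertools.count()
    edges = []

    def visit(elem):
        node_id = f"ID{next(counter):02d}"      # element e -> node labeled ID_e
        for attr, value in elem.attrib.items():
            # attribute edge: labeled with the attribute name, ends in a leaf value
            edges.append((node_id, attr, value, True))
        text = (elem.text or "").strip()
        if text:
            # content edge: "#text" is an assumed label for directly contained data
            edges.append((node_id, "#text", text, True))
        for child in elem:
            # nesting edge: labeled with the tag of the nested element
            edges.append((node_id, child.tag, visit(child), False))
        return node_id

    return visit(ET.fromstring(xml_text)), edges


root_id, graph_edges = build_graph(SAMPLE)
for edge in graph_edges:
    print(edge)
```

In this sketch the second article stores the author name as an attribute, so it produces an attribute edge from the author node straight to a leaf instead of a nested name node.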

18
XML-QL Explained: Query Processing

An XML pattern can also be modeled by a graph
Some labels in the graph are now variables
The result of evaluating query q on the
input D is
the set of all mappings from the graph G_q to the graph G_D
that preserve the constant labels
Each mapping induces a substitution of the
variables in the query by constant
values
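
As a rough illustration of this mapping idea, the sketch below matches a small pattern against an XML tree and collects the variable bindings of every successful mapping. It works directly on the element tree rather than on an explicit graph, and the pattern encoding (tuples of tag, text and children, with "$" marking variables), the sample document and all function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not the original bibliography.xml.
SAMPLE = """<bibliography>
  <article id="1">
    <author><name>Daniela Florescu</name></author>
    <title>Integrating Keyword Search</title>
  </article>
  <article id="2">
    <author name="Daniela Florescu"/>
    <title>A Query Language</title>
  </article>
</bibliography>"""

# Pattern node: (tag, text, children); text is None, a constant, or a "$variable".
PATTERN = ("article", None, [
    ("author", None, [("name", "$N", [])]),
    ("title", "$T", []),
])


def match(elem, pattern, binding):
    """Yield every extension of `binding` that maps `pattern` onto `elem`."""
    tag, text, children = pattern
    if elem.tag != tag:                      # constant labels must be preserved
        return
    if text is not None:
        value = (elem.text or "").strip()
        if text.startswith("$"):             # variable: bind it, or check an earlier binding
            if binding.get(text, value) != value:
                return
            binding = {**binding, text: value}
        elif text != value:                  # constant text must match exactly
            return
    yield from match_children(elem, children, binding)


def match_children(elem, patterns, binding):
    """Map each child pattern onto some child element of `elem`."""
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for child in elem:
        for extended in match(child, first, binding):
            yield from match_children(elem, rest, extended)


def evaluate(xml_text, pattern):
    root = ET.fromstring(xml_text)
    return [b for e in root.iter(pattern[0]) for b in match(e, pattern, {})]


bindings = evaluate(SAMPLE, PATTERN)
# A predicate such as "N like Florescu" becomes a filter over the bindings.
florescu = [b for b in bindings if "Florescu" in b["$N"]]
print(florescu)
```

Only the first sample article matches: the second one has no nested name element, which is the situation Example 2 later works around.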

19
XML-QL Explained: A Query Graph for Example 1
WHERE <article>
  <author><name>$N</name></author>
  <title>$T</title>
</article> ELEMENT_AS $E IN "bibliography.xml",
$N like "Florescu"
CONSTRUCT <result> $E </result>

[Figure: query graph for Example 1. An article node has a title edge leading to the variable T and an author edge followed by a name edge leading to the value Florescu.]

20
XML-QL Explained: Query Processing, Example 1

[Figure: the query graph of Example 1 matched against the data graph of bibliography.xml. Annotations in the figure: "No <author>"; "No <name>: name is an attribute"; "Match! Add ID08 to the results: E = ID08, T = Integrating Keyword Search".]

21
XML-QL Advanced Queries: Example 2 (More
Florescu)
WHERE <article>
  <><author><name>$N</name></author></>
  <title>$T</title>
</article> ELEMENT_AS $E IN "bibliography.xml",
$N like "Florescu"
CONSTRUCT <result> $E </result>
union
WHERE <article>
  <><author><_ name=$N></_></author></>
  <title>$T</title>
</article> ELEMENT_AS $E IN "bibliography.xml",
$N like "Florescu"
CONSTRUCT <result> $E </result>

We now also find articles in which the author name
is given as an attribute

22
XML-QL Disadvantages

We need to know the XML structure in order to
query
We can still write queries that retrieve all
the information available, but
these queries can easily grow very complex, as
seen in Example 2

23
XML-QL: Keyword Search Extension

Addition of a special predicate called contains to
XML-QL
Tests the existence of a given word within an XML
element
Works on partially known or unknown XML
structure
Allows querying several XML documents with
different structures

24
Extended XML-QL: The contains Predicate

The contains predicate has 4 arguments: (E,
word, depth, location)
E is an XML element variable
word is the word we are searching for
depth is an integer expression limiting the depth
at which the word is found within the element
location is a boolean expression over the set of
constants
tag_name, attribute_name, content,
attribute_value
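
A direct, index-free reading of the four arguments can be sketched as follows. This is a hypothetical in-memory check, not the RDBMS implementation discussed later. The way depth is counted (an element's tag at depth 0, its attribute names and content one level below, attribute values two levels below) is an assumption, as are the sample document and the simplification of the boolean location expression to a set of allowed constants.

```python
import xml.etree.ElementTree as ET

LOCATIONS = frozenset({"tag_name", "attribute_name", "content", "attribute_value"})

# Hypothetical sample data, not the original bibliography.xml.
SAMPLE = """<bibliography>
  <article id="1">
    <author><name>Daniela Florescu</name></author>
    <title>Integrating Keyword Search</title>
  </article>
  <article id="2">
    <author name="Daniela Florescu"/>
    <title>A Query Language</title>
  </article>
</bibliography>"""


def contains(elem, word, depth, locations=LOCATIONS):
    """True if `word` occurs within `elem`, at most `depth` levels down, in an allowed location."""

    def visit(node, level):
        if level > depth:
            return False
        candidates = [
            ("tag_name", level, [node.tag]),
            ("attribute_name", level + 1, list(node.attrib)),
            ("content", level + 1, (node.text or "").split()),
            ("attribute_value", level + 2,
             [w for v in node.attrib.values() for w in v.split()]),
        ]
        for location, at_depth, tokens in candidates:
            if location in locations and at_depth <= depth and word in tokens:
                return True
        # Text that follows a child element (mixed content) is ignored in this sketch.
        return any(visit(child, level + 1) for child in node)

    return visit(elem, 0)


authors = ET.fromstring(SAMPLE).findall("article/author")
print([contains(a, "Florescu", 3, {"content", "attribute_value"}) for a in authors])
```

With the location limited to content or attribute values, both sample authors qualify, whether the name is a nested element or an attribute.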

25
Extended XML-QL: Example 3

We can use the extended XML-QL to formulate a
query which yields the same result as Example 2

WHERE <article>
  <author></author> ELEMENT_AS $A
  <title>$T</title>
</article> ELEMENT_AS $E IN "bibliography.xml",
contains($A, "Florescu", 3, content or attribute_value)
CONSTRUCT <result> $E </result>
26
Extended XML-QL: Example 4

We are able to query unstructured data (full text
search) within a set of articles

WHERE <article></article> ELEMENT_AS $E IN "bibliography.xml",
contains($E, "Florescu", 3, any)
CONSTRUCT <result> $E </result>
Yielding the result
27
Implementing the contains predicate

The authors suggest an implementation of the
XML-QL extension on top of a Commercial RDBMS
Oracle 8, IBM DB2, MS-SQL,

28
Implementation Using RDBMS

Reasons
Easy to implement an extended XML query processor
Universally available
An RDBMS allows XML data to be mixed with other
(relational) data
Very good performance over large volumes of data

29
Relational Support for Full-text Indexing

Use of extended Inverted Files to implement
The contains predicate
Finding of relevant XML data sources (URLs) in a
distributed environment
We will use RDBMS to implement Inverted Files

30
Inverting Files

For our needs the inverted file will contain
tuples of the following format
<word, elID, depth, location>
Examples from bibliography.xml
<article, elID01, 0, tag>
<id, elID01, 1, attr>
<Requirements, elID01, 2, value>
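
The tuple format can be illustrated with a small generator that walks a document and emits one tuple per word occurrence, for every enclosing element. This is a sketch under assumptions: the sample document is made up so that it reproduces the three example tuples above, elIDs are assigned in document order, and the location value "value" is used for both element content and attribute values.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment chosen to reproduce the example tuples on this slide.
SAMPLE = """<bibliography>
  <article id="1">
    <title>XML Query Requirements</title>
  </article>
</bibliography>"""


def occurrences(node, level=0):
    """Yield (word, depth, location) for everything found in the subtree of `node`."""
    yield node.tag, level, "tag"
    for attr, value in node.attrib.items():
        yield attr, level + 1, "attr"
        for word in value.split():
            yield word, level + 2, "value"
    for word in (node.text or "").split():
        yield word, level + 1, "value"
    for child in node:
        yield from occurrences(child, level + 1)


def inverted_file(xml_text):
    """Return <word, elID, depth, location> tuples for every element of the document."""
    root = ET.fromstring(xml_text)
    tuples = []
    for number, elem in enumerate(root.iter()):   # document order: root is elID00
        el_id = f"elID{number:02d}"
        tuples.extend((word, el_id, depth, loc) for word, depth, loc in occurrences(elem))
    return tuples


index = inverted_file(SAMPLE)
for row in index:
    print(row)
```

Because each word is recorded once per enclosing element, the same occurrence of "Requirements" appears at depth 1 for the title element, depth 2 for the article and depth 3 for the root.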

31
Storing Inverted Files in RDBMS: Unique Internal
elIDs

Unique element IDs are modeled as records
containing
Document locators (URLs)
Element locators within the document
Using absolute positions (start, end)
Using unique identifiers specified by DTD
(explicit id attribute)
Why not XPointer?

32
Storing Inverted Files in RDBMS: Unique elID
Schemas

After normalization the authors propose the
following schema
Elements(elID, docid, start_pos, end_pos, type,
id_val)
Documents(docid, URL)
From this point on, elID can be used as an internal
key for faster processing
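
One possible rendering of these two relations, using SQLite through Python's standard sqlite3 module, is sketched below. The column types, constraints and inserted sample values are assumptions; the slides only give the relation and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (
    docid INTEGER PRIMARY KEY,
    URL   TEXT NOT NULL UNIQUE          -- document locator
);
CREATE TABLE Elements (
    elID      INTEGER PRIMARY KEY,      -- internal key used by the inverted files
    docid     INTEGER NOT NULL REFERENCES Documents(docid),
    start_pos INTEGER,                  -- absolute position of the element start
    end_pos   INTEGER,                  -- absolute position of the element end
    type      TEXT NOT NULL,            -- tag name of the element
    id_val    TEXT                      -- explicit id attribute, if the DTD defines one
);
""")

# Made-up sample rows, for illustration only.
conn.execute("INSERT INTO Documents VALUES (1, 'bibliography.xml')")
conn.execute("INSERT INTO Elements VALUES (8, 1, 120, 310, 'article', NULL)")

# Resolving an internal elID back to a document locator and an element locator.
located = conn.execute(
    """SELECT d.URL, e.start_pos, e.end_pos
       FROM Elements e JOIN Documents d ON d.docid = e.docid
       WHERE e.elID = ?""",
    (8,),
).fetchone()
print(located)
```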

33
Storing Inverted Files in RDBMS

The natural way is a single table with the schema
contains(elID, word, depth, location)
Huge! We partition it into word tables, one for each
keyword <word> in the dataset
<word>(elID, depth, location)
Virtually all IR (Information Retrieval) systems
use partitioning by word

34
Storing Inverted Files in RDBMS: Further
Partitioning

We use further partitioning to optimize the query
processing
The type (tag) of the element is usually known at
predicate evaluation time
by looking at the XML pattern of the query
We further partition the individual <word> tables
by the type of the element they are in
<word>-<type>(elID, depth, location)
Table examples: Name-author, Florescu-name
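
The two-level partitioning can be pictured as grouping the inverted-file tuples by word and by the type of the element each tuple refers to. The sketch below does this with an in-memory dictionary instead of real database tables; the input tuples, the elID-to-type map and the function name partition are invented for the illustration.

```python
from collections import defaultdict


def partition(tuples, element_type):
    """Group (word, elID, depth, location) tuples into '<word>-<type>' tables."""
    tables = defaultdict(list)
    for word, el_id, depth, location in tuples:
        table_name = f"{word}-{element_type[el_id]}"
        tables[table_name].append((el_id, depth, location))
    return dict(tables)


# Made-up tuples: elID02 is an author element, elID03 the name element nested in it.
TUPLES = [
    ("Florescu", "elID02", 2, "value"),
    ("Florescu", "elID03", 1, "value"),
    ("name", "elID02", 1, "tag"),
]
ELEMENT_TYPE = {"elID02": "author", "elID03": "name"}

word_type_tables = partition(TUPLES, ELEMENT_TYPE)
for name, rows in sorted(word_type_tables.items()):
    print(name, rows)
```

A predicate on an author element then only needs to read the small table for that word and that type, rather than the whole inverted file.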

35
Implementation: Extended XML-QL Query Processing

Two Ways
Replicating the whole XML data in an RDBMS
XML-QL processing is entirely performed in an
RDBMS
Distributed XML Query Processing
only index (contains) is stored in an RDBMS

36
Replicating the XML Data in an RDBMS

The binary table approach
For each type (tag name or attribute name), a
table is built with the following schema
<type>(parent, element, value)
parent is the element that contains the element of type
<type>
element is null if a <type> has no sub-elements
or if <type> is an attribute name (in that case
we are usually interested in the value)
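
A rough sketch of how a document could be shredded into such per-type tables follows. It deviates from the rule above in one respect, stated here as an assumption: every element receives an element ID, including elements without sub-elements, because the later SQL examples join on title.element; only attribute rows get a null element. The sample document, the ID numbering and the function name shred are also illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not the original bibliography.xml.
SAMPLE = """<bibliography>
  <article id="1">
    <author><name>Daniela Florescu</name></author>
    <title>Integrating Keyword Search</title>
  </article>
  <article id="2">
    <author name="Daniela Florescu"/>
    <title>A Query Language</title>
  </article>
</bibliography>"""


def shred(xml_text):
    """Return {type: [(parent, element, value), ...]} for tag and attribute names."""
    root = ET.fromstring(xml_text)
    ids = {elem: number for number, elem in enumerate(root.iter())}
    tables = {}

    def add(type_name, row):
        tables.setdefault(type_name, []).append(row)

    def text_of(elem):
        return (elem.text or "").strip() or None

    add(root.tag, (None, ids[root], text_of(root)))          # the root has no parent
    for parent in root.iter():
        for attr, value in parent.attrib.items():
            add(attr, (ids[parent], None, value))             # attribute: element is null
        for child in parent:
            add(child.tag, (ids[parent], ids[child], text_of(child)))
    return tables


binary_tables = shred(SAMPLE)
for type_name, rows in binary_tables.items():
    print(type_name, rows)
```

Note that the name table ends up holding both the nested name element and the name attribute, since tag names and attribute names share one namespace of types in this scheme.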

37
Replicating the XML Data in an RDBMS: XML-QL
Queries

Every XML-QL query can be translated into an
equivalent SQL query
The SQL query will process the binary tables of
the replicated XML Data

38
XML-QL to SQL: Example 5 (from Example 1)
WHERE <article>
  <author><name>$N</name></author>
  <title>$T</title>
</article> ELEMENT_AS $E IN "bibliography.xml",
$N like "Florescu"
CONSTRUCT <result> $E </result>

SELECT article.element
FROM article, author, name, title
WHERE article.element = author.parent
  AND author.element = name.parent
  AND article.element = title.parent /* title exists */
  AND name.value like 'Florescu'
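
To see the translated query run, the sketch below loads a few hand-made rows into binary tables in an in-memory SQLite database and executes the SQL. The rows, the column types and the '%Florescu%' wildcard pattern are assumptions; the wildcards are added here so that the pattern matches a full name such as "Daniela Florescu".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("article", "author", "name", "title"):
    conn.execute(f"CREATE TABLE {table} (parent INTEGER, element INTEGER, value TEXT)")

# Made-up rows: article 1 has a nested <name>; article 5 has an author without one.
conn.executemany("INSERT INTO article VALUES (?, ?, ?)", [(0, 1, None), (0, 5, None)])
conn.executemany("INSERT INTO author VALUES (?, ?, ?)", [(1, 2, None), (5, 6, None)])
conn.executemany("INSERT INTO name VALUES (?, ?, ?)", [(2, 3, "Daniela Florescu")])
conn.executemany(
    "INSERT INTO title VALUES (?, ?, ?)",
    [(1, 4, "Integrating Keyword Search"), (5, 7, "A Query Language")],
)

QUERY = """
SELECT article.element
FROM article, author, name, title
WHERE article.element = author.parent
  AND author.element = name.parent
  AND article.element = title.parent   /* title exists */
  AND name.value LIKE '%Florescu%'
"""
matching_articles = [row[0] for row in conn.execute(QUERY)]
print(matching_articles)
```

Each nesting step of the XML pattern turns into one equality join between a parent table's element column and a child table's parent column.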
39
Extended XML-QL to SQL: Keyword Search

Processing the contains predicate involves usage
of inverted file tables
The word-type table has to be joined with the
previous result
The word-type table is the resulting table of the
word by type partitioning

40
Extended XML-QL to SQL: Example 6
WHERE <article>
  <author></author> ELEMENT_AS $A
  <title>$Ttext</title> ELEMENT_AS $T
</article> ELEMENT_AS $E IN "bibliography.xml",
contains($A, "Florescu", 3, any),
contains($T, "Integrating", 3, any)
CONSTRUCT <result> $Ttext </result>

SELECT title.value
FROM article, author, name, title,
     "Florescu-author", "Integrating-title"
WHERE article.element = author.parent
  AND author.element = "Florescu-author".elID
  AND article.element = title.parent
  AND title.element = "Integrating-title".elID
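
The join with the word-type tables can be tried out in the same way. In the sketch below the binary tables and the two word-type tables are filled with made-up rows in SQLite; the explicit depth conditions and the table aliases are additions of this example and do not appear in the slide's query, and the unused name table is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE author  (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE title   (parent INTEGER, element INTEGER, value TEXT);
-- Word-type partitions of the inverted file; hyphenated names must be quoted.
CREATE TABLE "Florescu-author"   (elID INTEGER, depth INTEGER, location TEXT);
CREATE TABLE "Integrating-title" (elID INTEGER, depth INTEGER, location TEXT);

INSERT INTO article VALUES (0, 1, NULL), (0, 5, NULL);
INSERT INTO author  VALUES (1, 2, NULL), (5, 6, NULL);
INSERT INTO title   VALUES (1, 4, 'Integrating Keyword Search'),
                           (5, 7, 'A Query Language');
INSERT INTO "Florescu-author"   VALUES (2, 2, 'content'), (6, 2, 'attribute_value');
INSERT INTO "Integrating-title" VALUES (4, 1, 'content');
""")

QUERY = """
SELECT title.value
FROM article, author, title,
     "Florescu-author" AS fa, "Integrating-title" AS it
WHERE article.element = author.parent
  AND author.element = fa.elID
  AND article.element = title.parent
  AND title.element = it.elID
  AND fa.depth <= 3 AND it.depth <= 3   -- assumed rendering of the depth argument
"""
titles = [row[0] for row in conn.execute(QUERY)]
print(titles)
```

Both sample authors mention Florescu, but only the first article's title contains "Integrating", so a single title is returned.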
41
Distributed XML Query Processing

XML data can be indexed in RDBMS, but
The XML data cannot be stored in the RDBMS
Reasons: volume (the entire WWW) or legal constraints
The mediator (query interface)
Uses inverted files in RDBMS, but
Accesses the data sources to compute the full
query result (Expensive!)
Load relevant documents/elements into RDBMS and
process the query as described before
(XML-QL to SQL)

42
Distributed XML Query Processing: Elements
Retrieval

Use of Inverted Files for the retrieval of
relevant documents/elements
Evaluate contains predicates to disqualify
irrelevant elements
Further reduce the dataset needed to process the
remaining basic XML-QL query
This is an optimization since retrieval of remote
data is expensive
Load the relevant documents/elements

43
Distributed XML Query Processing: Reducing
Retrieval
WHERE <article>
  <author><name>$N</name></author>
  <title>$T</title>
</article> ELEMENT_AS $E IN "bibliography.xml",
$T like "XML"
CONSTRUCT <result> $N </result>

Get the intersection of the elID sets from
author-article
name-article
title-article
XML-article
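
The pruning step amounts to a set intersection over the elIDs stored in the four word-type tables. A minimal sketch with invented elID sets (the real tables would live in the RDBMS) could look like this:

```python
# Invented elID sets standing in for the contents of the four word-type tables.
WORD_TYPE_TABLES = {
    "author-article": {1, 5, 8},
    "name-article": {1, 8},
    "title-article": {1, 5, 8, 14},
    "XML-article": {8, 14},
}


def candidate_elements(tables, names):
    """Return the elIDs that appear in every one of the named word-type tables."""
    return set.intersection(*(tables[name] for name in names))


candidates = candidate_elements(
    WORD_TYPE_TABLES,
    ["author-article", "name-article", "title-article", "XML-article"],
)
print(candidates)
```

Only the article elements left in the intersection need to be fetched from their remote sources before the remaining XML-QL query is evaluated.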

44
Conclusions

XML-QL can be extended to support keyword search
Use of RDBMS
Inverted Files can be stored and queried using an
RDBMS
XML data itself can be replicated and queried in
the RDBMS
Keyword search and overall XML query processing
can be carried out very efficiently
Data structure influence
The more structure is known, the faster a query
will be executed
Totally unstructured queries can be executed very
fast
The more structure is known, the higher the
quality of the query results