Title: Integrating Keyword Search into XML Query Processing
1 Integrating Keyword Search into XML Query Processing
- XML Query Language (XML-QL)
- Extending XML-QL with Keyword Search
- Extended XML-QL Implementation Using RDBMS
- Presentation by
- Alex Kremer
- Ariel Rosenblatt
2 Bibliography (well-formed, but invalid)
- Bibliography
- Article elements are from different sources
- Same information, but using different XML Schemas / DTDs (Document Type Definitions)
3 XML Queries
- XML is becoming the data storage and exchange format of choice in many applications
- Handling XML data requires a rich and powerful query language
- It must allow querying both the content and the structure of an XML document
- Varying or unknown structures can make formulating queries very difficult
4 XML Queries: Why not SQL/OQL?
- XML is not rigidly structured
- In XML the schema can exist with the data, as tag names
- If a DTD is not available, the schema is built while the document is parsed
- Elements may be missing or may occur multiple times
- This flexibility is crucial for EDI (Electronic Data Interchange)
5 XML Query Requirements: W3C Working Group
- Goals
- Support different usage scenarios
- Define data model and query operators
- Define query language syntax
- Interoperate with other XML working groups
6 XML Query Requirements: Usage Scenarios
- Human-readable documents
- Manuals, Books, Articles
- Data-oriented documents
- XML representation of
- Database data, Object data,
- XML representation might be either
- Physical or Virtual
7 XML Query Requirements: Usage Scenarios (contd.)
- Mixed-model documents
- Hybrid of document-oriented and data-oriented
- Catalogues, Patient health records,
- Administrative data
- Configuration files, User profiles, Administrative logs
8 XML Query Requirements: Usage Scenarios (contd.)
- Filtering streams
- On-line filtering / extracting / transforming / routing of XML data streams
- Logs of email messages, network packets, stock market data, newswire feeds
- Document Object Model (DOM)
- Perform queries on DOM structures to return sets of nodes that meet the specified criteria
9 XML Query Requirements: Usage Scenarios (contd.)
- Multiple syntactic environments for queries, which may be embedded in
- A URL, XML, JSP or ASP pages, a string in a general-purpose programming language
10 XML Query Requirements: Interoperability
- Results must be returned in a DOM-compatible manner
- XPath (used in XPointer and XSLT)
- XPath expressibility and search facilities should be used in the query syntax
- Usage of XML Schema (XSDL) and/or DTD
11 XML Query Languages: Proposals to W3C
- XQL (heavily based on XPath)
- XML-QL
12 XML-QL
- It is declarative
- It is relationally complete; in particular, it can express joins
- Simple enough to enable optimizations
- It can extract data from existing XML documents and construct new documents (transformations)
13 XML-QL Syntax
WHERE ( xml-pattern ELEMENT_AS elem_var ) IN url, ( predicate )
CONSTRUCT xml-pattern variable
- The WHERE clause specifies how to filter data from the input XML dataset
- The CONSTRUCT clause specifies how to assemble the query results in XML
14 XML-QL: Example 1
WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $N like "Florescu"
CONSTRUCT <result> $E </result>
- Yields the following result
15 XML-QL Explained: The Data Model
- A set of XML documents must be represented (XML data set)
- XML elements in a dataset can be partitioned according to their types
- Information must be represented in a lossless manner (the original data set must be recreatable from the representation)
16 XML-QL Explained: Data Model Representation
[Figure: graph representation of bibliography.xml. The root node ID00 (Bibliography) has four article edges to the nodes ID01, ID04, ID08 and ID14. Edge labels include id, link, date, source, title, author and name; leaf values include "Daniela Florescu", "W3C" and 20000815.]
17 XML-QL Explained: Data Model Representation
- Dataset D is represented as a graph GD
- Nodes
- Element e → node Ne, uniquely labeled IDe
- Data value v → leaf Lv, uniquely labeled v
- Edges
- (Ne, Ne') labeled with the tag of e', if e' is directly nested within e (<e><e'></e'></e>)
- (Ne, Lv) labeled with a special content label, if v is directly contained within e (<e>v</e>)
- (Ne, Lv) labeled with the attribute name a, if v is the value of attribute a of element e (<e a="v"></e>)
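The node and edge rules above can be sketched in a few lines of Python with the standard-library xml.etree.ElementTree parser. The sample document, the "#content" label used for element-to-text edges and the pre-order IDnn numbering are assumptions made for illustration; they are not taken from the slides' bibliography.xml.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical sample data, not the real bibliography.xml.
SAMPLE = (
    '<bibliography>'
    '<article id="1"><author><name>Daniela Florescu</name></author>'
    '<title>A Query Language</title></article>'
    '<article id="2"><author name="Daniela Florescu"/>'
    '<title>Integrating Keyword Search</title></article>'
    '</bibliography>'
)

CONTENT = "#content"  # assumed label for edges from an element to its text


def build_graph(root):
    """Return (ids, edges); each edge is a (source, label, target) triple."""
    ids = {}          # Element object -> node label such as "ID03"
    edges = []
    counter = count()

    def visit(elem):
        ids[elem] = f"ID{next(counter):02d}"
        for attr, value in elem.attrib.items():
            edges.append((ids[elem], attr, value))         # attribute edge to a leaf
        text = (elem.text or "").strip()
        if text:
            edges.append((ids[elem], CONTENT, text))       # content edge to a leaf
        for child in elem:
            visit(child)
            edges.append((ids[elem], child.tag, ids[child]))  # nesting edge

    visit(root)
    return ids, edges


ids, edges = build_graph(ET.fromstring(SAMPLE))
for edge in edges:
    print(edge)
```

Storing the graph as a flat list of labeled edges keeps it close to the relational representation used later in the presentation.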
18 XML-QL Explained: Query Processing
- An XML pattern can also be modeled by a graph
- Some labels in the graph are now variables
- The result of evaluating query q on the input D is
- Each mapping from the graph Gq to the graph GD that preserves the constant labels
- Each such mapping induces a substitution of the variables in the query by constant values
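19 XML-QL Explained: A Query Graph for Example 1
WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $N like "Florescu"
CONSTRUCT <result> $E </result>
[Figure: query graph for Example 1. An article node has a title edge leading to the variable T and an author edge followed by a name edge leading to a value matching Florescu.]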
20 XML-QL Explained: Query Processing, Example 1
[Figure: the query graph of Example 1 matched against the data graph of bibliography.xml. Annotations: one article has no <author>; in another article name is an attribute, so there is no <name> element; article ID08 matches and is added to the results with E = ID08 and T = "Integrating Keyword Search".]
21 XML-QL Advanced Queries: Example 2 (More Florescu)
WHERE <article>
        <><author><name>$N</name></author></>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $N like "Florescu"
CONSTRUCT <result> $E </result>
union
WHERE <article>
        <><author><_ name=$N></_></author></>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $N like "Florescu"
CONSTRUCT <result> $E </result>
- We now also look for articles where the author name is an attribute
22 XML-QL: Disadvantages
- We need to know the XML structure in order to query
- We can still write queries that retrieve all the information available, but
- These queries can easily grow very complex, as seen previously
23 XML-QL: Keyword Search Extension
- Addition of a special predicate called contains to XML-QL
- Tests the existence of a given word within an XML element
- Works on partially known or unknown XML structure
- Allows querying several XML documents with different structures
24 Extended XML-QL: The contains Predicate
- The contains predicate has 4 arguments: (E, word, depth, location)
- E is an XML element variable
- word is the word we are searching for
- depth is an integer expression limiting the depth at which the word is found within the element
- location is a boolean expression over the set of constants
- tag_name, attribute_name, content, attribute_value
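A rough Python sketch of how such a predicate could be evaluated directly on a parsed element. The depth convention is an assumption inferred from the example tuples on the inverted-file slide: an element's own tag is at depth 0, its attribute names and text content at depth 1, its attribute values at depth 2, and everything inside a child element is one level deeper. The location expression is modeled as a set of allowed constants, and the sample element is made up.

```python
import xml.etree.ElementTree as ET

# The four location constants; ANY stands for "any location".
ANY = {"tag_name", "attribute_name", "content", "attribute_value"}


def occurrences(elem, level=0):
    """Yield (word, depth, location) for every word in or below elem."""
    yield elem.tag, level, "tag_name"
    for attr, value in elem.attrib.items():
        yield attr, level + 1, "attribute_name"
        for word in value.split():
            yield word, level + 2, "attribute_value"
    # Text directly inside elem: its leading text plus the tails of its children.
    for text in [elem.text] + [child.tail for child in elem]:
        for word in (text or "").split():
            yield word, level + 1, "content"
    for child in elem:
        yield from occurrences(child, level + 1)


def contains(elem, word, depth, locations=ANY):
    """True if word occurs within elem at depth <= depth in an allowed location."""
    return any(
        w == word and d <= depth and loc in locations
        for w, d, loc in occurrences(elem)
    )


# Hypothetical article element.
article = ET.fromstring(
    '<article id="1"><author><name>Daniela Florescu</name></author>'
    '<title>A Query Language</title></article>'
)
author = article.find("author")
print(contains(author, "Florescu", 3, {"content", "attribute_value"}))  # True
print(contains(article, "Florescu", 2))                                 # False
```

Scanning the element on every call is only practical for small inputs, which is why an index is needed for larger data sets.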
25 Extended XML-QL: Example 3
- We can use the extended XML-QL to formulate a query which yields the same result as Example 2
WHERE <article>
        <author></author> ELEMENT_AS $A
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      contains($A, "Florescu", 3, content or attribute_value)
CONSTRUCT <result> $E </result>
26 Extended XML-QL: Example 4
- We are able to query unstructured data (full-text search) within a set of articles
WHERE <article></article> ELEMENT_AS $E IN "bibliography.xml",
      contains($E, "Florescu", 3, any)
CONSTRUCT <result> $E </result>
Yielding the result
27 Implementing the contains predicate
- The authors suggest implementing the XML-QL extension on top of a commercial RDBMS
- Oracle 8, IBM DB2, MS-SQL,
28 Implementation Using RDBMS
- Reasons
- Easy to implement an extended XML query processor
- Universally available
- An RDBMS allows mixing XML data with other (relational) data
- Very good performance over large volumes of data
29 Relational Support for Full-text Indexing
- Use of extended inverted files to implement
- The contains predicate
- Finding relevant XML data sources (URLs) in a distributed environment
- We will use an RDBMS to implement the inverted files
30 Inverting Files
- For our needs the inverted file will contain tuples of the following format
- <word, elID, depth, location>
- Examples from bibliography.xml
- <article, elID01, 0, tag>
- <id, elID01, 1, attr>
- <Requirements, elID01, 2, value>
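As an illustration, tuples of this shape can be generated from a parsed document as follows. This is a sketch under assumptions: the sample XML is made up, elIDs are assigned in document order, the four location names are the constants of the contains predicate rather than the abbreviations used above, and depths follow the convention suggested by the examples (own tag 0, attribute names and text 1, attribute values 2, one more per nesting level).

```python
import xml.etree.ElementTree as ET


def occurrences(elem, level=0):
    """Yield (word, depth, location) for every word in or below elem."""
    yield elem.tag, level, "tag_name"
    for attr, value in elem.attrib.items():
        yield attr, level + 1, "attribute_name"
        for word in value.split():
            yield word, level + 2, "attribute_value"
    for text in [elem.text] + [child.tail for child in elem]:
        for word in (text or "").split():
            yield word, level + 1, "content"
    for child in elem:
        yield from occurrences(child, level + 1)


def invert(root):
    """Return the set of (word, elID, depth, location) tuples for all elements."""
    tuples = set()
    for n, elem in enumerate(root.iter()):      # document order, root first
        el_id = f"elID{n:02d}"
        for word, depth, location in occurrences(elem):
            tuples.add((word, el_id, depth, location))
    return tuples


# Hypothetical document; elID01 is its single article element.
root = ET.fromstring(
    '<bibliography><article id="1">'
    '<author><name>Daniela Florescu</name></author>'
    '<title>A Query Language</title>'
    '</article></bibliography>'
)
inverted = invert(root)
for entry in sorted(inverted):
    print(entry)
```

Each word is listed once per enclosing element, with the depth measured relative to that element, so a contains test becomes a lookup instead of a document scan.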
31 Storing Inverted Files in RDBMS: Unique Internal elIDs
- Unique element IDs are modeled as records containing
- Document locators (URLs)
- Element locators within the document
- Using absolute positions (start, end)
- Using unique identifiers specified by the DTD (explicit id attribute)
- Why not XPointer?
32 Storing Inverted Files in RDBMS: Unique elID Schemas
- After normalization the authors propose the following schema
- Elements(elID, docid, start_pos, end_pos, type, id_val)
- Documents(docid, URL)
- From this point on, elID can be used as an internal key for faster processing
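One possible rendering of this schema, sketched with Python's built-in sqlite3 module. The column types, the key and foreign-key constraints and the inserted sample row (including its byte offsets) are assumptions for illustration; the slides only give the table and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (
    docid INTEGER PRIMARY KEY,
    URL   TEXT NOT NULL UNIQUE
);
CREATE TABLE Elements (
    elID      INTEGER PRIMARY KEY,                           -- internal key
    docid     INTEGER NOT NULL REFERENCES Documents(docid),
    start_pos INTEGER,                                       -- absolute position locator
    end_pos   INTEGER,
    type      TEXT NOT NULL,                                 -- tag name
    id_val    TEXT                                           -- explicit id attribute, if any
);
""")

# Hypothetical rows: one document and one element located by offsets.
conn.execute("INSERT INTO Documents VALUES (1, 'bibliography.xml')")
conn.execute("INSERT INTO Elements VALUES (1, 1, 15, 210, 'article', NULL)")

# Resolve an internal elID back to a document URL and a position range.
row = conn.execute("""
    SELECT d.URL, e.start_pos, e.end_pos
    FROM Elements e JOIN Documents d ON d.docid = e.docid
    WHERE e.elID = 1
""").fetchone()
print(row)
```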
33 Storing Inverted Files in RDBMS
- The natural way is a single table with the schema
- contains(elID, word, depth, location)
- Huge! We partition it into word tables, one for each keyword <word> in the dataset
- <word>(elID, depth, location)
- Virtually all IR (Information Retrieval) systems use partitioning by word
34 Storing Inverted Files in RDBMS: Further Partitioning
- We use further partitioning to optimize the query processing
- The type (tag) of the element is usually known at predicate evaluation time
- by looking at the XML pattern of the query
- We further partition the individual <word> tables by the type of the element they are in
- <word>-<type>(elID, depth, location)
- Table examples from bibliography.xml: Name-author, Florescu-name
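The two partitioning steps can be pictured as grouping the postings by word and element type. The sketch below uses plain Python dictionaries in place of relational tables; the posting rows and integer elIDs are invented for illustration.

```python
from collections import defaultdict

# Hypothetical postings: (word, elID, type of that element, depth, location).
POSTINGS = [
    ("Florescu", 3, "name", 1, "content"),
    ("Florescu", 2, "author", 2, "content"),
    ("Florescu", 1, "article", 3, "content"),
    ("name", 2, "author", 1, "tag_name"),
]


def partition(postings):
    """Group postings into one '<word>-<type>' table of (elID, depth, location) rows."""
    tables = defaultdict(list)
    for word, el_id, el_type, depth, location in postings:
        tables[f"{word}-{el_type}"].append((el_id, depth, location))
    return dict(tables)


tables = partition(POSTINGS)
print(sorted(tables))
print(tables["Florescu-name"])
```

A predicate on an author element then only has to read the Florescu-author table instead of every occurrence of the word.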
35 Implementation: Extended XML-QL Query Processing
- Two ways
- Replicating the whole XML data in an RDBMS
- XML-QL processing is performed entirely in the RDBMS
- Distributed XML query processing
- Only the index (for contains) is stored in an RDBMS
36 Replicating the XML Data in an RDBMS
- The binary table approach
- For each type (tag name or attribute name), a table is built with the following schema
- <type>(parent, element, value)
- parent is the element that contains the element of type <type>
- element is null if the <type> element has no sub-elements or if <type> is an attribute name (in that case we are usually interested in the value)
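A small Python sketch of shredding a document into such binary tables, kept as in-memory lists rather than real relations. The sample document and the integer element numbers are assumptions; None plays the role of SQL NULL, and mixed content is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def shred(root):
    """Return {type: [(parent, element, value), ...]} for a parsed document."""
    tables = defaultdict(list)
    ids = {elem: n for n, elem in enumerate(root.iter())}  # document-order IDs

    def visit(elem, parent_id):
        # element is NULL when the element has no sub-elements
        el_id = ids[elem] if len(elem) else None
        value = (elem.text or "").strip() or None
        tables[elem.tag].append((parent_id, el_id, value))
        for attr, attr_value in elem.attrib.items():
            tables[attr].append((ids[elem], None, attr_value))  # attribute row
        for child in elem:
            visit(child, ids[elem])

    visit(root, None)
    return dict(tables)


# Hypothetical document.
root = ET.fromstring(
    '<bibliography><article id="1">'
    '<author><name>Daniela Florescu</name></author>'
    '<title>A Query Language</title>'
    '</article></bibliography>'
)
tables = shred(root)
for type_name, rows in tables.items():
    print(type_name, rows)
```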
37 Replicating the XML Data in an RDBMS: XML-QL Queries
- Every XML-QL query can be translated into an equivalent SQL query
- The SQL query will process the binary tables of the replicated XML data
38 XML-QL to SQL: Example 5 (from Example 1)
WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $N like "Florescu"
CONSTRUCT <result> $E </result>
SELECT article.element
FROM article, author, name, title
WHERE article.element = author.parent
  AND author.element = name.parent
  AND article.element = title.parent /* title exists */
  AND name.value LIKE '%Florescu%'
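To see a translated query of this kind run, the sketch below loads four tiny binary tables into an in-memory SQLite database and executes the SQL. The rows are invented (two articles, one of them by a placeholder author), and the '%Florescu%' wildcards are an assumption about the intended substring match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical binary-table contents: (parent, element, value).
DATA = {
    "article": [(0, 1, None), (0, 5, None)],
    "author":  [(1, 2, None), (5, 6, None)],
    "name":    [(2, None, "Daniela Florescu"), (6, None, "Jane Doe")],
    "title":   [(1, None, "A Query Language"), (5, None, "Another Title")],
}
for table, rows in DATA.items():
    conn.execute(f"CREATE TABLE {table} (parent INTEGER, element INTEGER, value TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)

QUERY = """
SELECT article.element
FROM article, author, name, title
WHERE article.element = author.parent
  AND author.element = name.parent
  AND article.element = title.parent   /* title exists */
  AND name.value LIKE '%Florescu%'
"""
result = conn.execute(QUERY).fetchall()
print(result)  # [(1,)]
```

Only article 1 has an author whose name contains the keyword, so it is the single element returned.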
39 Extended XML-QL to SQL: Keyword Search
- Processing the contains predicate involves the inverted file tables
- The word-type table has to be joined with the previous result
- The word-type table is the table that results from partitioning by word and type
40 Extended XML-QL to SQL: Example 6
WHERE <article>
        <author></author> ELEMENT_AS $A
        <title>$Ttext</title> ELEMENT_AS $T
      </article> ELEMENT_AS $E IN "bibliography.xml",
      contains($A, "Florescu", 3, any),
      contains($T, "Integrating", 3, any)
CONSTRUCT <result> $Ttext </result>
SELECT title.value
FROM article, author, title, "Florescu-author", "Integrating-title"
WHERE article.element = author.parent
  AND author.element = "Florescu-author".elID
  AND article.element = title.parent
  AND title.element = "Integrating-title".elID
41 Distributed XML Query Processing
- XML data can be indexed in an RDBMS, but
- The XML data itself cannot be stored in the RDBMS
- Reasons: volume (the entire WWW) or legal restrictions
- The mediator (query interface)
- Uses the inverted files in the RDBMS, but
- Accesses the data sources to compute the full query result (expensive!)
- Loads the relevant documents/elements into the RDBMS and processes the query as described before (XML-QL to SQL)
42 Distributed XML Query Processing: Elements Retrieval
- Use of inverted files for the retrieval of relevant documents/elements
- Evaluate contains predicates to disqualify irrelevant elements
- Further reduce the dataset needed to process the remaining basic XML-QL query
- This is an optimization, since retrieval of remote data is expensive
- Load the relevant documents/elements
43 Distributed XML Query Processing: Reducing Retrieval
WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN "bibliography.xml",
      $T like "XML"
CONSTRUCT <result> $N </result>
- Get the intersection of the elID sets from
- author-article
- name-article
- title-article
- XML-article
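The intersection step can be illustrated with ordinary Python sets. The elIDs below are invented; each set stands for the elID column of one word-type table.

```python
# Hypothetical elID columns of the four word-type tables.
INDEX = {
    "author-article": {1, 2, 3},
    "name-article":   {1, 3},
    "title-article":  {1, 2, 3, 4},
    "XML-article":    {3, 4},
}


def candidate_elements(index, table_names):
    """Return the elIDs that occur in every one of the named tables."""
    return set.intersection(*(index[name] for name in table_names))


candidates = candidate_elements(INDEX, list(INDEX))
print(candidates)  # {3}
```

Only the surviving elements need to be fetched from the remote source before the remaining XML-QL query is evaluated on them.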
44 Conclusions
- XML-QL can be extended to support keyword search
- Use of RDBMS
- Inverted files can be stored and queried using an RDBMS
- XML data itself can be replicated and queried in the RDBMS
- Keyword search and overall XML query processing can be carried out very efficiently
- Data structure influence
- The more structure is known, the faster a query will be executed
- Totally unstructured queries can be executed very fast
- The more structure is known, the higher the quality of the query results