Title: Information Retrieval in XML Documents
1Information Retrieval in XML Documents
Supervisor: Dr. Mourad Ykhlef
IS 531
2Agenda
- Quick overview of XML
- Structured Information Retrieval in XML Documents
3XML Data
- A special case of semistructured data
- A W3C standard to complement HTML
- Origin: SGML
- Motivation
- HTML describes presentation
- XML describes content
4XML Syntax
<student>
  <name>
    <first>Faisal</first>
    <last>Alharbi</last>
  </name>
  <id>425121597</id>
  <email>fbadrany_at_kacst.edu.sa</email>
</student>
[Figure: tree view of the same data. student has the children name, id (425121597), and email (fbadrany_at_kacst.edu.sa); name has the children first (Faisal) and last (Alharbi).]
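The tree view of the student element can be reproduced with a short Python sketch. It uses the standard library's `xml.etree.ElementTree`; the helper name `leaf_paths` is hypothetical and only serves this illustration.

```python
import xml.etree.ElementTree as ET

STUDENT_XML = """
<student>
  <name>
    <first>Faisal</first>
    <last>Alharbi</last>
  </name>
  <id>425121597</id>
  <email>fbadrany_at_kacst.edu.sa</email>
</student>
"""

def leaf_paths(element, prefix=""):
    """Yield (path, text) for every leaf element at or below `element`."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    children = list(element)
    if not children:
        yield path, (element.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)

root = ET.fromstring(STUDENT_XML.strip())
paths = dict(leaf_paths(root))

# Print each root-to-leaf path together with its literal content.
for path, text in paths.items():
    print(path, "=", text)
```

Each printed line corresponds to one root-to-leaf branch of the tree, for example `student/name/first = Faisal`.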
5DTD
<!DOCTYPE db [
  <!ELEMENT db (student*)>
  <!ELEMENT student (name, id, email)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT id (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
- The root element is db
- db contains an arbitrary number of student elements
- student contains three sub-elements
- The three sub-elements contain only character data
6XML Schema
- Valid XML is well-formed XML that conforms to a DTD or schema
- XML Schema is a W3C standard
- The syntax used is XML
<xsd:element name="student">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="id" type="xsd:string"/>
      <xsd:element name="email" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
7IR in XML
- Existing approaches to storing and searching XML documents take a database perspective
- This approach is based on IR and proposes an index that facilitates search and ranking
- The proposed index combines
- An inverted file
- A path index
8Index organization
- Involves literals and tags
- Term recognition
- Normalization algorithms to remove unimportant terms (stop words)
- Stemming algorithm
- Estimating the distribution (within-document frequency)
- Generate summary tree
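The term-processing steps listed here (term recognition, stop-word removal, stemming, frequency counting) can be sketched in a few lines of Python. This is a toy illustration, not the stemmer or stop list used in the presented work: `STOP_WORDS` and `SUFFIXES` are hypothetical and chosen only so that the example terms from the slides come out as shown there.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop list, for illustration only.
STOP_WORDS = {"a", "an", "and", "many", "of", "the"}
# Assumption: a toy suffix list; a real system would use a proper stemmer.
SUFFIXES = ("ical", "ure", "es", "e", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Recognize terms, drop stop words, stem, and count frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return Counter(stem(t) for t in kept)

title_terms = index_terms("aspects of architecture modern architecture")
abstract_terms = index_terms("geometrical figures Many classes create")
print(title_terms)
print(abstract_terms)
```

The toy suffix stripper mishandles many ordinary words (it would turn "class" into "clas"), which is why a real index would rely on an established stemming algorithm.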
9Index organization
XML Doc
<article>
  <author>Nick Hoffman</author>
  <title>aspects of architecture</title>
  <abstract>geometrical figures</abstract>
  <pages>324-333</pages>
</article>
<article>
  <author>Paul Fisher</author>
  <title>modern architecture</title>
  <abstract>Many classes create</abstract>
  <pages>122-128</pages>
</article>
Processing stages: Tags, Literals, Stopper, Normalization, Stemmer, Statistics
Result (each term is followed by its frequency):
<article>
  <author> (Nick Hoffman):1 (Paul Fisher):1 </author>
  <title> aspect:1 architect:2 modern:1 </title>
  <abstract> geometr:1 figur:1 creat:1 class:1 </abstract>
</article>
10Summary Tree
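[Figure: summary tree rooted at article, with one child per tag and the term statistics attached to each child]
article
  author: (Nick Hoffman):1 (Paul Fisher):1
  title: aspect:1 architect:2 modern:1
  abstract: geometr:1 figur:1 creat:1 class:1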
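The structural half of building such a summary tree, merging elements that share a tag so that each tag appears once, might look like the following Python sketch. It stores raw literals rather than normalized terms and keeps every tag it finds (including pages), so it shows only the merging step. The class `SummaryNode`, the function `summarize`, and the wrapping `collection` element are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the two articles are wrapped in a hypothetical <collection>
# root so that the text is a single well-formed XML document.
DOC = """
<collection>
  <article>
    <author>Nick Hoffman</author>
    <title>aspects of architecture</title>
    <abstract>geometrical figures</abstract>
    <pages>324-333</pages>
  </article>
  <article>
    <author>Paul Fisher</author>
    <title>modern architecture</title>
    <abstract>Many classes create</abstract>
    <pages>122-128</pages>
  </article>
</collection>
"""

class SummaryNode:
    def __init__(self, tag):
        self.tag = tag
        self.content = Counter()  # literal -> frequency
        self.children = {}        # tag -> SummaryNode

def summarize(element, node):
    """Merge `element` into `node`, sharing one child node per tag."""
    text = (element.text or "").strip()
    if text:
        node.content[text] += 1
    for child in element:
        if child.tag not in node.children:
            node.children[child.tag] = SummaryNode(child.tag)
        summarize(child, node.children[child.tag])

summary = SummaryNode("article")
for article in ET.fromstring(DOC.strip()).findall("article"):
    summarize(article, summary)

for tag, child in summary.children.items():
    print(tag, dict(child.content))
```

Both articles contribute to the same author, title, and abstract nodes, which is the property that keeps the summary tree small.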
11Loading XML Summary Tree
12Example
13Inserting Summary Tree
- Insert(T, P, I)
- I1: If the index structure is empty, a new root is created, referenced by P. Then the recursive function in step I2 is invoked by AddSummaryTree(T, P).
- I2: AddSummaryTree(t, p)
  - if there is no child c of p such that tag(c) = tag(t) then
    - make a new child node c of p such that tag(c) = tag(t)
  - UpdateInvertedFile(I, t, c)
  - for each child x of t do AddSummaryTree(x, c)
- UpdateInvertedFile(I, t, c)
  - Comments: content(t) is the literal content of the tag t in the summary tree. I is the inverted file where the terms of content(t) will be stored. c is the node in the path index to which all the terms in content(t) will be linked.
  - Method: Updates the inverted file.
  - For each term x in content(t) do
    - If x is not in the vocabulary of I, then add x to I.
    - Update the inverted list of the term x accordingly.
    - Make a reference from c to the entry x.
[Figure: summary tree T, rooted at article, with the children author ((Nick Hoffman):1 (Paul Fisher):1), title (aspect:1 architect:2 modern:1), and abstract (geometr:1 figur:1 creat:1 class:1)]
[Figure: path index P, rooted at article, with the children author, abstract, and title]
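The insertion procedure can be sketched in Python as follows. The function names mirror Insert, AddSummaryTree, and UpdateInvertedFile from the slide; everything else is an assumption for illustration, in particular the extra `doc_id` parameter, the layout of the inverted file as a dict from term to `{doc_id: frequency}`, and the classes `SummaryNode` and `PathNode`.

```python
class SummaryNode:
    """A node t of the summary tree T."""
    def __init__(self, tag, content=None, children=None):
        self.tag = tag
        self.content = content or {}    # term -> frequency
        self.children = children or []  # list of SummaryNode

class PathNode:
    """A node of the path index P."""
    def __init__(self, tag):
        self.tag = tag
        self.children = {}  # tag -> PathNode
        self.terms = set()  # references to inverted file entries

def update_inverted_file(I, t, c, doc_id):
    for term, freq in t.content.items():
        postings = I.setdefault(term, {})  # add the term to I if missing
        postings[doc_id] = postings.get(doc_id, 0) + freq
        c.terms.add(term)                  # reference from c to the entry

def add_summary_tree(I, t, p, doc_id):
    c = p.children.get(t.tag)
    if c is None:                          # no child of p with tag(t)
        c = PathNode(t.tag)
        p.children[t.tag] = c
    update_inverted_file(I, t, c, doc_id)
    for x in t.children:
        add_summary_tree(I, x, c, doc_id)

def insert(T, P, I, doc_id):
    if P is None:                          # step I1: empty index structure
        P = PathNode("")
    add_summary_tree(I, T, P, doc_id)      # step I2
    return P

T = SummaryNode("article", children=[
    SummaryNode("author", {"Nick Hoffman": 1, "Paul Fisher": 1}),
    SummaryNode("title", {"aspect": 1, "architect": 2, "modern": 1}),
    SummaryNode("abstract", {"geometr": 1, "figur": 1, "creat": 1, "class": 1}),
])
I = {}
P = insert(T, None, I, doc_id=1)
print(sorted(P.children["article"].children))
print(I["architect"])
```

Inserting a second summary tree with the same tags reuses the existing path nodes, so only the inverted file grows.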
14Deleting Summary Tree
- Input: The docID of the document that will be removed from the index.
- Output: The index after removing the document.
- Method: Deletes the document from the index.
- For each term x in the vocabulary of the inverted file that can be identified as being a term of the document with ID docID do
  - Delete the contribution of x in the inverted file.
  - Update the path index to the root by removing any link to x.
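A minimal Python sketch of the deletion step is given below. The data layout is an assumption: the inverted file maps each term to `{doc_id: frequency}`, and the path index is flattened to a dict from path to the set of linked terms. The sketch also assumes that a link is removed only when no other document still contains the term, which the slide does not state explicitly.

```python
def delete_document(inverted_file, path_links, doc_id):
    """Remove every contribution of `doc_id` from the index, in place."""
    for term in list(inverted_file):
        postings = inverted_file[term]
        if doc_id not in postings:
            continue                 # term does not belong to this document
        del postings[doc_id]         # delete the contribution of the term
        if not postings:
            # Assumption: unlink the term only once no document uses it.
            del inverted_file[term]
            for terms in path_links.values():
                terms.discard(term)

# Hypothetical index holding two documents with IDs 1 and 2.
inverted_file = {"architect": {1: 2, 2: 1}, "modern": {1: 1}}
path_links = {"article/title": {"architect", "modern"}}

delete_document(inverted_file, path_links, 1)
print(inverted_file)
print(path_links)
```

After the call, "modern" has disappeared from both structures, while "architect" survives because document 2 still contains it.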
15Query Evaluation
16Query Evaluation
- Input: The query Q with x1, x2, ..., xn conjunctive terms.
- Output: A set S of documents.
- Method: Normalize the query terms x1, ..., xn and set S = ∅. Decompose the query into several conjunctive terms.
- For each term xi (1 ≤ i ≤ n) of the query do
  - Let Ti be the set containing the inverted lists that match the path of the term xi. Let Ai be the set containing the document entries that match the path of the term xi.
  - Extract from Ti the list of documents pointed to by the literal part of xi and store this list in the set Bi.
- The result set S is given by [equation not preserved in the source]
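The evaluation loop can be sketched in Python under two stated assumptions. First, because the terms are conjunctive, the sketch takes S to be the intersection of the sets Bi; the slide's own formula is not available, so this is a guess at its simplest form. Second, the index layout (path mapped to term mapped to a set of document IDs) and the lowercase-only normalization are hypothetical simplifications.

```python
def evaluate(query, index, normalize=str.lower):
    """Evaluate a conjunctive query.

    query: list of (path, literal) pairs, one per term xi
    index: dict mapping path -> {term -> set of document IDs}
    """
    S = None
    for path, literal in query:
        T_i = index.get(path, {})                # lists matching the path
        B_i = T_i.get(normalize(literal), set()) # documents for the literal
        # Assumption: conjunctive terms are combined by intersection.
        S = set(B_i) if S is None else S & B_i
    return S if S is not None else set()

# Hypothetical index in which each article is a document (IDs 1 and 2).
index = {
    "article/title": {"aspect": {1}, "architect": {1, 2}, "modern": {2}},
    "article/abstract": {"geometr": {1}, "class": {2}},
}

print(evaluate([("article/title", "architect"), ("article/title", "modern")], index))
print(evaluate([("article/title", "Architect"), ("article/abstract", "geometr")], index))
```

A term whose path or literal is absent from the index yields an empty Bi, which makes the whole conjunctive result empty.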
17Ranking
Ranking is based on
- The term distribution
- The structural position of the term
Let fp1 be assigned to article/title and fp2 to article/abstract. Then fp1 > fp2.
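One way to combine the two ranking signals is a weighted sum, sketched below in Python. The slides do not give a ranking formula, so the multiplication of term frequency by a path factor is purely an assumption, and the numeric weights are hypothetical values chosen only to satisfy fp1 > fp2.

```python
# Assumption: hypothetical path factors with fp1 (title) > fp2 (abstract).
PATH_WEIGHTS = {"article/title": 2.0, "article/abstract": 1.0}

def score(hits):
    """Score one document from its (path, term_frequency) matches."""
    return sum(PATH_WEIGHTS.get(path, 0.0) * tf for path, tf in hits)

# Hypothetical matches for two documents and the same query term.
matches = {
    "doc_a": [("article/abstract", 1)],
    "doc_b": [("article/title", 1)],
}

ranked = sorted(matches, key=lambda doc: score(matches[doc]), reverse=True)
print(ranked)
```

With equal term frequencies, the document that matches in the title ranks above the one that matches in the abstract.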
18Performance
- The Cystic Fibrosis collection has been used
- This consists of 1239 XML documents.
- Its size is 6MB
- It comes with 100 queries
- The index overhead is 2.5MB (42%)
- The time required to build the index is 55 seconds (on a typical Pentium III with 256MB RAM)
19Thank you