Information Retrieval in XML Documents - PowerPoint PPT Presentation


Description:

'student' contains three sub-elements. The three sub-elements contain only chars. 9/17/09 ... xsd:element Valid XML is well-formed XML that conforms to a DTD or schema ...


Provided by: faculty7


1
Information Retrieval in XML Documents

By Faisal Alharbi
Supervisor: Dr. Mourad Ykhlef
IS 531
2
Agenda

Quick overview of XML
Structured Information Retrieval in XML Documents

3
XML Data

A special case of semistructured data
A W3C standard to complement HTML
Origin: SGML
Motivation:
  HTML describes presentation
  XML describes content

4
XML Syntax

<student>
  <name>
    <first>Faisal</first>
    <last>Alharbi</last>
  </name>
  <id>425121597</id>
  <email>fbadrany_at_kacst.edu.sa</email>
</student>

Tree view of the document:

student
  name
    first: Faisal
    last: Alharbi
  id: 425121597
  email: fbadrany_at_kacst.edu.sa
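The tree reading of the student record can be reproduced with a short Python sketch. It uses the standard library's `xml.etree.ElementTree`; the helper name `leaf_paths` is made up for this illustration and is not part of the presentation.

```python
import xml.etree.ElementTree as ET

# The student record from the slide, as a string.
STUDENT_XML = """
<student>
  <name>
    <first>Faisal</first>
    <last>Alharbi</last>
  </name>
  <id>425121597</id>
  <email>fbadrany_at_kacst.edu.sa</email>
</student>
"""

def leaf_paths(element, prefix=""):
    """Return (path, text) pairs for every element that has no children."""
    path = f"{prefix}/{element.tag}"
    children = list(element)
    if not children:
        return [(path, (element.text or "").strip())]
    pairs = []
    for child in children:
        pairs.extend(leaf_paths(child, path))
    return pairs

root = ET.fromstring(STUDENT_XML)
for path, text in leaf_paths(root):
    print(path, "=", text)
```

Each printed line pairs a root-to-leaf path with its character data, which is the same information the tree view carries.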
5
DTD
The root element is db
db contains an arbitrary number of student elements

<!DOCTYPE db [
  <!ELEMENT db (student*)>
  <!ELEMENT student (name,id,email)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT id (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>

Each student element contains three sub-elements
The three sub-elements contain only character data
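To make the content model concrete, here is a minimal Python sketch that checks a document against the db/student rules by hand. Python's standard library does not validate against a DTD, so this is a hand-rolled check under that limitation, not a DTD validator; the names `CONTENT_MODEL`, `conforms`, and `_check` are assumptions for this example, and the check ignores stray text between child elements.

```python
import xml.etree.ElementTree as ET

# Content model from the DTD: element name -> ordered child tags.
# An empty tuple means the element may hold only character data (#PCDATA).
CONTENT_MODEL = {
    "student": ("name", "id", "email"),
    "name": (),
    "id": (),
    "email": (),
}

def _check(element):
    """Check one element and, recursively, its children."""
    expected = CONTENT_MODEL.get(element.tag)
    if expected is None:
        return False  # tag is not declared in the DTD
    children = list(element)
    if tuple(child.tag for child in children) != expected:
        return False  # wrong children or wrong order
    return all(_check(child) for child in children)

def conforms(xml_text):
    """Return True if the document follows the db/student content model."""
    root = ET.fromstring(xml_text)
    if root.tag != "db":
        return False
    # db may contain any number of student elements, and nothing else.
    return all(child.tag == "student" and _check(child) for child in root)
```

Under this check, a student whose name element holds nested first and last elements would be rejected, because the DTD declares name as character data only.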
6
XML Schema

Valid XML is well-formed XML that conforms to a
DTD or schema
XML Schema is a W3C standard
The syntax used is XML

<xsd:element name="student">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="id" type="xsd:string"/>
      <xsd:element name="email" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

7
IR in XML

Existing approaches to storing and searching XML documents take a database perspective
The approach presented here is based on IR and proposes an index that facilitates search and ranking
The proposed index combines
  An inverted file
  A path index

8
Index organization

Involves literals and tags
Term recognition
Normalization algorithms to remove unimportant terms (stop words)
Stemming algorithm
Estimating the term distribution (within-document frequency)
Generating the summary tree
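The literal-processing steps listed above (term recognition, stop-word removal, stemming, frequency counting) can be sketched in Python as follows. The stop list and the suffix-stripping `stem` function are assumptions chosen so that the titles and abstracts used in this presentation reduce to the stems shown on the slides; a real system would use a proper stemmer such as Porter's and a full stop list.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop list, not the one used by the authors.
STOP_WORDS = {"of", "the", "a", "an", "and", "many"}

# Assumption: a crude suffix stripper standing in for a real stemmer.
SUFFIXES = ("ical", "ure", "es", "s", "e")

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize, drop stop words, stem, and count within-document frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())       # term recognition
    kept = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return Counter(stem(t) for t in kept)              # stemming + statistics

print(index_terms("aspects of architecture modern architecture"))
```

Feeding the two example titles through `index_terms` yields aspect, architect (twice), and modern, matching the frequencies shown in the summary tree.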

9
Index organization
XML Doc

<article>
  <author>Nick Hoffman</author>
  <title>aspects of architecture</title>
  <abstract>geometrical figures</abstract>
  <pages>324-333</pages>
</article>
<article>
  <author>Paul Fisher</author>
  <title>modern architecture</title>
  <abstract>Many classes create</abstract>
  <pages>122-128</pages>
</article>

Pipeline stages shown in the diagram: Tags, Literals, Stopper, Normalization, Stemmer, Statistics

Result after processing (the number after each term is its frequency):

<article>
  <author> (Nick Hoffman)1 (Paul Fisher)1 </author>
  <title> aspect1 architect2 modern1 </title>
  <abstract> geometr1 figur1 creat1 class1 </abstract>
</article>
10
Summary Tree
article
  author: (Nick Hoffman)1 (Paul Fisher)1
  title: aspect1 architect2 modern1
  abstract: geometr1 figur1 creat1 class1
11
Loading XML Summary Tree
12
Example
13
Inserting Summary Tree

Insert(T, P, I)
I1: If the index structure is empty, a new root is created, referenced by P. Then the recursive function in step I2 is invoked as AddSummaryTree(T, P).
I2: AddSummaryTree(t, p)
  if there is no child c of p such that tag(c) = tag(t) then
    make a new child node c of p such that tag(c) = tag(t)
  (otherwise c is the existing child)
  UpdateInvertedFile(I, t, c)
  for each child x of t do AddSummaryTree(x, c)

UpdateInvertedFile(I, t, c)
Comments: content(t) is the literal content of the tag t in the summary tree. I is the inverted file where the terms of content(t) will be stored. c is the node in the path index to which all the terms in content(t) will be linked.
Method: Updates the inverted file.
For each term x in content(t) do
  If x is not in the vocabulary of I then add x to I.
  Update the inverted list of the term x accordingly.
  Make a reference from c to the entry x.
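The Insert, AddSummaryTree, and UpdateInvertedFile steps can be sketched in Python roughly as below. The class names `SummaryNode` and `PathNode`, the dictionary layout of the inverted file, and the extra `doc_id` parameter are assumptions made for this illustration; the slides do not specify these details, and the postings here are keyed by document only, which is a simplification.

```python
from dataclasses import dataclass, field

@dataclass
class SummaryNode:
    """Node of a document's summary tree: a tag plus term frequencies."""
    tag: str
    content: dict = field(default_factory=dict)   # term -> frequency
    children: list = field(default_factory=list)

@dataclass
class PathNode:
    """Node of the path index P, with links to inverted-file entries."""
    tag: str
    children: dict = field(default_factory=dict)  # tag -> PathNode
    terms: set = field(default_factory=set)       # references into I

def update_inverted_file(inverted, t, c, doc_id):
    """Store the terms of content(t) in I and link them from c."""
    for term, freq in t.content.items():
        postings = inverted.setdefault(term, {})  # add x to I if missing
        postings[doc_id] = postings.get(doc_id, 0) + freq
        c.terms.add(term)                         # reference from c to x

def add_summary_tree(inverted, t, p, doc_id):
    """Step I2: merge summary node t under path node p."""
    c = p.children.get(t.tag)
    if c is None:                                 # no child with tag(c) = tag(t)
        c = PathNode(t.tag)
        p.children[t.tag] = c
    update_inverted_file(inverted, t, c, doc_id)
    for x in t.children:
        add_summary_tree(inverted, x, c, doc_id)

def insert(t, p, inverted, doc_id):
    """Step I1: create a root if the index is empty, then recurse."""
    if p is None:
        p = PathNode("")
    add_summary_tree(inverted, t, p, doc_id)
    return p
```

Inserting two article summary trees one after the other reuses the same article and title path nodes, so the path index grows only when a document introduces a new tag path.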

Figure: the summary tree T and the path index P.

T
  article
    author: (Nick Hoffman)1 (Paul Fisher)1
    title: aspect1 architect2 modern1
    abstract: geometr1 figur1 creat1 class1

P
  article
    author
    title
    abstract
14
Deleting Summary Tree

Input: The docID of the document that will be removed from the index.
Output: The index after removing the document.
Method: Deletes the document from the index.
For each term x in the vocabulary of the inverted file that can be identified as being a term of the document with ID docID do
  Delete the contribution of x in the inverted file.
  Update the path index up to the root by removing any link to x.
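A minimal Python sketch of the deletion procedure follows. It assumes a simplified layout that the slides do not prescribe: the inverted file maps each term to a `{doc_id: frequency}` dictionary, and the path index is flattened to a mapping from path string to the set of linked terms. Under that assumption a link is removed only when no remaining document contains the term.

```python
def delete_document(inverted, path_terms, doc_id):
    """Remove doc_id from the index.

    inverted:   term -> {doc_id: frequency}
    path_terms: path string -> set of terms linked from that path node
    """
    for term in list(inverted):           # copy keys, we delete while iterating
        postings = inverted[term]
        if doc_id not in postings:
            continue                      # x is not a term of this document
        del postings[doc_id]              # delete the contribution of x
        if not postings:                  # no document uses the term any more
            del inverted[term]
            for terms in path_terms.values():
                terms.discard(term)       # remove any link to x

index = {"architect": {1: 1, 2: 1}, "aspect": {1: 1}, "modern": {2: 1}}
links = {"article/title": {"architect", "aspect", "modern"}}
delete_document(index, links, 1)
print(index, links)
```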

15
Query Evaluation
16
Query Evaluation

Input: The query Q with conjunctive terms x1, x2, ..., xn.
Output: A set S of documents.
Method: Normalize the query terms x1, ..., xn and set S = ∅. Decompose the query into its conjunctive terms.
For each term xi (1 ≤ i ≤ n) of the query do
  Let Ti be the set containing the inverted lists that match the path of the term xi. Let Ai be the set containing the document entries that match the path of the term xi.
  Extract from Ti the list of documents pointed to by the literal part of xi and store this list in the set Bi.
The result set S is given by [formula not preserved in this transcript]
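The evaluation loop can be illustrated with a short Python sketch. Two things here are assumptions, since the transcript does not preserve the slide's formula for S: the result is taken to be the intersection of the per-term document sets Bi (the usual reading of a conjunctive query), and the index is modelled as a mapping from path to term to document IDs. Normalization is reduced to lower-casing.

```python
def evaluate(index, query, normalize=str.lower):
    """Evaluate a conjunctive query.

    index: path -> {term: set of doc ids}
    query: list of (path, literal) pairs, one per conjunctive term
    """
    result = None
    for path, literal in query:
        t_i = index.get(path, {})                  # inverted lists under this path
        b_i = t_i.get(normalize(literal), set())   # documents holding the literal
        # Assumption: conjunctive terms are combined by set intersection.
        result = set(b_i) if result is None else result & b_i
    return result if result is not None else set()

index = {
    "article/title": {"architect": {1, 2}, "modern": {2}},
    "article/author": {"paul fisher": {2}},
}
print(evaluate(index, [("article/title", "architect"),
                       ("article/author", "Paul Fisher")]))
```

The same literal under a different path, for example architect under article/abstract, matches nothing, which is the point of combining the inverted file with the path index.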

17
Ranking
Ranking is based on
  The term distribution
  The structural position of the term
Let fp1 be assigned to article/title and fp2 to article/abstract. Then fp1 > fp2.
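One way to combine the two ranking signals is sketched below in Python. The slide states only that fp1 > fp2; the concrete factor values, the weighted-sum scoring formula, and the names `PATH_FACTOR`, `score`, and `rank` are assumptions made for illustration and are not taken from the presentation.

```python
# Assumption: illustrative path factors; the slide only states fp1 > fp2.
PATH_FACTOR = {"article/title": 2.0, "article/abstract": 1.0}

def score(occurrences):
    """occurrences: list of (path, term_frequency) pairs for one document."""
    # Assumption: frequency weighted by the factor of the path it occurs in.
    return sum(PATH_FACTOR.get(path, 1.0) * tf for path, tf in occurrences)

def rank(docs):
    """docs: doc_id -> occurrences. Return doc ids, highest score first."""
    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)

docs = {
    "a": [("article/abstract", 1)],  # query term appears once in the abstract
    "b": [("article/title", 1)],     # query term appears once in the title
}
print(rank(docs))
```

With equal term frequency, the document that matches in the title ranks above the one that matches in the abstract, which is the behaviour fp1 > fp2 is meant to produce.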
18
Performance

The Cystic Fibrosis collection has been used
It consists of 1239 XML documents
Its size is 6 MB
It comes with 100 queries
The index overhead is 2.5 MB (42% of the collection size)
The time required to build the index is 55 seconds (on a typical Pentium III with 256 MB RAM)

19
Thank you
