Information Retrieval in XML Documents

1
Information Retrieval in XML Documents
  • By Faisal Alharbi

Supervisor: Dr. Mourad Ykhlef
IS 531
2
Agenda
  • Quick overview of XML
  • Structured Information Retrieval in XML Documents

3
XML Data
  • A special case of semistructured data
  • A W3C standard to complement HTML
  • Origin: SGML
  • Motivation: HTML describes presentation, whereas
    XML describes content

4
XML Syntax
  • <student>
  • <name>
  • <first>Faisal</first>
  • <last>Alharbi</last>
  • </name>
  • <id>425121597</id>
  • <email>fbadrany_at_kacst.edu.sa</email>
  • </student>

[Figure: tree view of the document. The root student has children name, id (425121597), and email (fbadrany_at_kacst.edu.sa); name has children first (Faisal) and last (Alharbi).]
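
The correspondence between the markup and its tree can be checked with a few lines of Python. This is only a sketch using the standard library's `xml.etree.ElementTree`; the helper name `leaf_paths` is made up for this illustration and is not part of the presentation.

```python
import xml.etree.ElementTree as ET

doc = """<student>
  <name>
    <first>Faisal</first>
    <last>Alharbi</last>
  </name>
  <id>425121597</id>
  <email>fbadrany_at_kacst.edu.sa</email>
</student>"""

def leaf_paths(elem, prefix=""):
    """Yield (path, text) for every leaf element below elem (hypothetical helper)."""
    path = f"{prefix}/{elem.tag}"
    children = list(elem)
    if not children:
        yield path, (elem.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)

root = ET.fromstring(doc)
paths = list(leaf_paths(root))
for path, text in paths:
    print(path, "=", text)
```

Each printed line is one root-to-leaf path of the tree, which is the same notion of path that the index described later relies on.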
5
DTD
The root element is db.
db contains an arbitrary number of student elements.
  • <!DOCTYPE db [
  • <!ELEMENT db (student*)>
  • <!ELEMENT student (name,id,email)>
  • <!ELEMENT name (#PCDATA)>
  • <!ELEMENT id (#PCDATA)>
  • <!ELEMENT email (#PCDATA)>
  • ]>

student contains three sub-elements.
The three sub-elements contain only character data.
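
Python's standard library parses XML but does not validate it against a DTD, so the sketch below checks the same three constraints by hand: root `db`, any number of `student` children, and each `student` holding `name`, `id`, `email` in that order with character data only. The function name `conforms` and the sample strings are assumptions made for illustration; a real setup would use a validating parser.

```python
import xml.etree.ElementTree as ET

# Mirrors <!ELEMENT student (name,id,email)>
EXPECTED_CHILDREN = ["name", "id", "email"]

def conforms(xml_text):
    """Hand-written stand-in for DTD validation (illustrative only)."""
    root = ET.fromstring(xml_text)
    if root.tag != "db":
        return False
    for student in root:                      # (student*): zero or more
        if student.tag != "student":
            return False
        if [c.tag for c in student] != EXPECTED_CHILDREN:
            return False
        # (#PCDATA): the sub-elements may not contain further elements
        if any(len(c) > 0 for c in student):
            return False
    return True

good = ("<db><student><name>Faisal Alharbi</name><id>425121597</id>"
        "<email>fbadrany_at_kacst.edu.sa</email></student></db>")
bad = "<db><student><id>425121597</id><name>Faisal Alharbi</name></student></db>"

print(conforms(good), conforms(bad))
```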
6
XML Schema
  • Valid XML is well-formed XML that conforms to a
    DTD or schema
  • XML Schema is a W3C standard
  • The syntax used is XML
  • <xsd:element name="student">
  • <xsd:complexType><xsd:sequence>
  • <xsd:element name="name" type="xsd:string"/>
  • <xsd:element name="id" type="xsd:string"/>
  • <xsd:element name="email" type="xsd:string"/>
  • </xsd:sequence></xsd:complexType>
  • </xsd:element>

7
IR in XML
  • Existing approaches to storing and searching XML
    documents take a database perspective
  • The approach presented here is based on IR and
    proposes an index that facilitates search and
    ranking
  • The proposed index combines an inverted file and
    a path index

8
Index organization
  • Involves literals and tags
  • Term recognition
  • Normalization algorithms to remove unimportant
    terms (stop words)
  • Stemming algorithm
  • Estimating the term distribution
    (within-document frequency)
  • Generating the summary tree
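
The literal-processing steps listed above (term recognition, stop-word removal, stemming, frequency counting) can be sketched as follows. The stop-word list and the suffix-stripping `toy_stem` are assumptions chosen so that the sketch stays short; the slides do not say which stop list or stemming algorithm is used, and a real system would rely on an established stemmer such as Porter's.

```python
import re
from collections import Counter

# Illustrative stop-word list, not taken from the presentation
STOP_WORDS = {"of", "many", "the", "a", "an", "and"}

def toy_stem(word):
    """Crude suffix stripping for illustration only (not a real stemmer)."""
    for suffix in ("ical", "ure", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Return stemmed term -> frequency for one piece of literal content."""
    words = re.findall(r"[a-z]+", text.lower())        # term recognition
    kept = [w for w in words if w not in STOP_WORDS]    # stop-word removal
    return Counter(toy_stem(w) for w in kept)           # stemming + counting

titles = index_terms("aspects of architecture modern architecture")
print(titles)
```

Applied to the two article titles used in the example on the next slide, this yields the frequencies aspect 1, architect 2, modern 1.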

9
Index organization
XML Doc
  • <article>
  • <author>Nick Hoffman</author>
  • <title>aspects of architecture</title>
  • <abstract>geometrical figures</abstract>
  • <pages>324-333</pages>
  • </article>
  • <article>
  • <author>Paul Fisher</author>
  • <title>modern architecture</title>
  • <abstract>Many classes create</abstract>
  • <pages>122-128</pages>
  • </article>

Result after processing, with each term followed by its frequency:
<article> <author> (Nick Hoffman):1 (Paul Fisher):1 </author>
<title> aspect:1 architect:2 modern:1 </title>
<abstract> geometr:1 figur:1 creat:1 class:1 </abstract>
</article>
[Figure: processing stages, labeled Tags, Literals, Stopper, Normalization, Stemmer, Statistics.]
10
Summary Tree
[Figure: summary tree rooted at article. Each child lists its terms with their frequencies.]
author: (Nick Hoffman):1 (Paul Fisher):1
title: aspect:1 architect:2 modern:1
abstract: geometr:1 figur:1 creat:1 class:1
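
How elements with the same tag collapse into one summary node can be sketched as below. Several things here are assumptions made for illustration: the `SummaryNode` class, the `<db>` wrapper element around the two articles, and plain lower-cased whitespace tokens in place of the stop-word and stemming steps.

```python
import xml.etree.ElementTree as ET
from collections import Counter

class SummaryNode:
    """One node of the summary tree (hypothetical representation)."""
    def __init__(self, tag):
        self.tag = tag
        self.terms = Counter()   # term -> frequency under this tag
        self.children = {}       # child tag -> SummaryNode

def merge(elem, node):
    """Fold XML element elem into the summary node that carries the same tag."""
    text = (elem.text or "").strip()
    if text:
        node.terms.update(text.lower().split())
    for child in elem:
        sub = node.children.setdefault(child.tag, SummaryNode(child.tag))
        merge(child, sub)

# <db> is only a wrapper so that both articles form one well-formed document
xml_text = """<db>
<article><author>Nick Hoffman</author><title>aspects of architecture</title></article>
<article><author>Paul Fisher</author><title>modern architecture</title></article>
</db>"""

summary = SummaryNode("article")
for article in ET.fromstring(xml_text):
    merge(article, summary)

print(summary.children["title"].terms)
```

Both articles end up under a single article node, and the shared title node counts "architecture" twice, mirroring the frequency of 2 for architect in the figure.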
11
Loading XML Summary Tree
12
Example
13
Inserting Summary Tree
  • Insert(T, P, I)
  • I1: If the index structure is empty, a new root is
    created, referenced by P. Then the recursive
    function in step I2 is invoked by
    AddSummaryTree(T, P).
  • I2: AddSummaryTree(t, p)
  • if there is no child c of p such that tag(c) =
    tag(t)
  • then make a new child node c of p such that
    tag(c) = tag(t)
  • UpdateInvertedFile(I, t, c)
  • for each child x of t do AddSummaryTree(x, c)
  • UpdateInvertedFile(I, t, c)
  • Comments: content(t) is the literal content of
    the tag t in the summary tree. I is the inverted
    file where the terms of content(t) will be
    stored. c is the node in the path index to which
    all the terms in content(t) will be linked.
  • Method: Updates the inverted file.
  • For each term x in content(t) do
  • If x is not in the vocabulary of I then
    add x to I.
  • Update the inverted list of the term x
    accordingly.
  • Make a reference from c to the entry x.

[Figure: the summary tree T (article with children author, title, and abstract, each carrying its terms and frequencies) is inserted into the path index P (article with children author, abstract, and title).]
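
The insertion procedure can be sketched in Python as follows. The `Node` class, the artificial empty-tag root, the posting layout `(doc_id, frequency, path node)`, and the extra `doc_id` argument are all assumptions added for this illustration; the slide only gives the signatures Insert(T, P, I), AddSummaryTree(t, p), and UpdateInvertedFile(I, t, c).

```python
from collections import defaultdict

class Node:
    """A node of a summary tree or of the path index (hypothetical layout)."""
    def __init__(self, tag, content=None):
        self.tag = tag
        self.content = content or {}   # summary tree only: term -> frequency
        self.children = []
        self.entries = set()           # path index only: linked vocabulary entries

def update_inverted_file(I, t, c, doc_id):
    for term, freq in t.content.items():
        I[term].append((doc_id, freq, c))   # defaultdict adds unseen terms
        c.entries.add(term)                 # reference from c to the entry

def add_summary_tree(I, t, p, doc_id):
    c = next((n for n in p.children if n.tag == t.tag), None)
    if c is None:                           # no child of p with tag(c) = tag(t)
        c = Node(t.tag)
        p.children.append(c)
    update_inverted_file(I, t, c, doc_id)
    for x in t.children:
        add_summary_tree(I, x, c, doc_id)

def insert(T, P, I, doc_id):
    if P is None:                           # empty index: create a new root
        P = Node("")
    add_summary_tree(I, T, P, doc_id)
    return P

I = defaultdict(list)
T1 = Node("article")
T1.children.append(Node("title", {"aspect": 1, "architect": 1}))
T2 = Node("article")
T2.children.append(Node("title", {"modern": 1, "architect": 1}))
P = insert(T1, None, I, doc_id=1)
P = insert(T2, P, I, doc_id=2)
print(sorted(I), len(I["architect"]))
```

Inserting the second tree reuses the existing article/title path and only appends postings, which is why the path index stays small even as documents are added.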
14
Deleting Summary Tree
  • Input: The docID of the document that will be
    removed from the index.
  • Output: The index after removing the document.
  • Method: Deletes the document from the index.
  • For each term x in the vocabulary of the
    inverted file that can be identified as being a
    term of the document with ID docID do
  • Delete the contribution of x in the inverted
    file.
  • Update the path index up to the root by removing
    any link to x.
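
A minimal sketch of the deletion step, under a simplified representation chosen for this illustration: the inverted file maps each term to a list of `(doc_id, path, frequency)` postings, and the path index is flattened to a mapping from path string to the set of linked terms. A link is removed only when no remaining document uses the term under that path, which is one reading of "removing any link to x".

```python
def delete_document(inverted, path_terms, doc_id):
    """Remove every contribution of doc_id from both structures (sketch)."""
    for term in list(inverted):
        postings = inverted[term]
        remaining = [p for p in postings if p[0] != doc_id]
        if len(remaining) == len(postings):
            continue                         # term does not occur in doc_id
        removed_paths = {p[1] for p in postings if p[0] == doc_id}
        still_used = {p[1] for p in remaining}
        for path in removed_paths - still_used:
            path_terms[path].discard(term)   # drop the path-index link
        if remaining:
            inverted[term] = remaining
        else:
            del inverted[term]               # term leaves the vocabulary

inverted = {
    "architect": [(1, "article/title", 1), (2, "article/title", 1)],
    "aspect": [(1, "article/title", 1)],
    "modern": [(2, "article/title", 1)],
}
path_terms = {"article/title": {"architect", "aspect", "modern"}}

delete_document(inverted, path_terms, doc_id=1)
print(sorted(inverted), sorted(path_terms["article/title"]))
```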

15
Query Evaluation
16
Query Evaluation
  • Input: The query Q with conjunctive terms x1,
    x2, ..., xn.
  • Output: A set S of documents.
  • Method: Normalize the query terms x1, ..., xn
    and set S = ∅. Decompose the query into several
    conjunctive terms.
  • For each term xi (1 ≤ i ≤ n) of the query do
  • Let Ti be the set containing the inverted lists
    that match the path of the term xi. Let Ai be the
    set containing the document entries that match
    the path of the term xi.
  • Extract from Ti the list of documents pointed to
    by the literal part of xi and store this list in
    the set Bi.
  • The result set S is given by
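
The formula for S is not preserved in this transcript. The sketch below therefore assumes the simplest reading of "conjunctive terms": S is the intersection of the sets Bi. That combination rule, the `(path, literal)` query format, the nested-dictionary index, and lower-casing as the only normalization are all assumptions made for illustration.

```python
def evaluate(index, query):
    """index: path -> {term -> set of doc IDs}; query: list of (path, literal)."""
    result = None
    for path, literal in query:
        T_i = index.get(path, {})                 # inverted lists matching the path
        B_i = T_i.get(literal.lower(), set())     # documents for the literal part
        result = set(B_i) if result is None else result & B_i
    return result or set()

index = {
    "article/title": {"architect": {1, 2}, "aspect": {1}, "modern": {2}},
    "article/abstract": {"figur": {1}},
}

print(evaluate(index, [("article/title", "architect"), ("article/title", "modern")]))
```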

17
Ranking
Ranking is based on
  • The term distribution
  • The structural position of the term

Let fp1 be assigned to article/title and fp2 to
article/abstract. Then fp1 > fp2.
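
One way to combine the two ranking signals is to multiply each term frequency by a weight attached to the path where the term occurs. The slide does not give a scoring formula, so treating fp as a per-path weight, the weighted sum, and the concrete values 2.0 and 1.0 are purely illustrative assumptions that merely respect fp1 > fp2.

```python
# Hypothetical weights; only the ordering fp1 > fp2 comes from the slide
PATH_WEIGHT = {"article/title": 2.0, "article/abstract": 1.0}

def score(matches):
    """matches: list of (path, term_frequency) pairs for one document."""
    return sum(PATH_WEIGHT.get(path, 1.0) * tf for path, tf in matches)

in_title = score([("article/title", 1)])
in_abstract = score([("article/abstract", 1)])
print(in_title, in_abstract)
```

With equal term frequency, a match in the title outranks a match in the abstract.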
18
Performance
  • The Cystic Fibrosis collection was used for the
    evaluation
  • It consists of 1239 XML documents
  • Its size is 6 MB
  • It comes with 100 queries
  • The index overhead is 2.5 MB (42%)
  • Building the index takes 55 s (on a typical
    Pentium III with 256 MB RAM)

19
Thank you