1
Searching XML Documents via XML Fragments

D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass
and A. Soffer

Presented by Hui Fang
2
Background(1) --- Data
Why is semi-structured data important?
Unstructured data
Lacks the logical structure of a document.
Well-structured data
Lack of flexibility; lack of extensibility.
Semi-structured data (e.g. XML)
<paper>
  <title>XIRQL</title>
  <author>N. Fuhr</author>
  <author>K. Großjohann</author>
  <conf>SIGIR</conf>
</paper>
3
XML in a nutshell
<book id="25">
  <year>1997</year>
  <author>Karen Sparck Jones</author>
  <author>Peter Willett</author>
  <publisher>Morgan Kaufmann</publisher>
  <title>Readings in Information Retrieval</title>
</book>

Hierarchical data format
Nested element structure having a root
Self-describing data (tags): the schema is attached
to the data itself.
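
As a minimal sketch of what "hierarchical" and "self-describing" mean in practice, the book record above can be parsed with Python's standard `xml.etree.ElementTree` and walked from the root. The helper name `walk` is an illustrative choice, not something from the slides.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """
<book id="25">
  <year>1997</year>
  <author>Karen Sparck Jones</author>
  <author>Peter Willett</author>
  <publisher>Morgan Kaufmann</publisher>
  <title>Readings in Information Retrieval</title>
</book>
"""


def walk(element, path=""):
    """Yield (path, text) for every element, depth-first from the root."""
    current = f"{path}/{element.tag}"
    yield current, (element.text or "").strip()
    for child in element:
        yield from walk(child, current)


root = ET.fromstring(BOOK_XML.strip())
rows = list(walk(root))
for path, text in rows:
    # The tag names themselves describe the data: no external schema needed.
    print(path, "->", text)
```

Each printed path starts at the single root element `book`, which is the nesting the slide refers to.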

4
Background(2) --- Query
How can we query semi-structured data (e.g. XML data)?
5
Related Work

DB-oriented approaches
E.g. XML-QL, XQL, XQuery

WHERE <book>
        <title>Harry Potter</title>
        <author>$a</author>,
        <year>$y</year>
      </book> in books.xml,
      $y > 2002
CONSTRUCT <result>
            <author>$a</author>
          </result>

DB+IR approaches
E.g. XIRQL
IR-oriented approaches
E.g. this paper

6
Problem Refinement---CAS Search

Document collection
XML documents
Each document is a hierarchical structure of
nested elements
Markup in the document mainly serves to expose
the logical structure of a document.
Query
content + explicit references to the XML
structure
specifies the target element that needs to be returned

An example
Retrieve all articles from the years 1999-2000
that deal with works on nonmonotonic reasoning. Do
not retrieve articles that are calendars/calls for
papers.
7
Approach

Compare apples to apples
Recall vector space models:
Both documents and queries are expressed in free
text.
Compare unstructured data to unstructured data
This paper:
Search XML documents via XML fragments

8
Query---XML Fragments(1)

Topic 1: Find all books about fishing
<book> fishing </book>
Topic 2: Find all books having a title about
search
<book> <title> fishing </title> </book>

More intuitive; more flexible
9
Query --- XML Fragment(2)

Limited expressiveness
E.g. finding figures that describe the CORBA
architecture and the paragraphs that refer to
those figures.

Requires a join operation between two elements:
figures and paragraphs
10
Recall Text Retrieval Task

Given a query
According to the retrieval formula, compute the
relevance score for each document
Rank the documents according to relevance score.
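
The three steps above can be sketched with a plain tf-idf vector space model. This is a minimal illustration and not the exact formula used in the paper: the idf variant `log(1 + N/df)`, the whitespace tokenizer, and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Map each term to tf * idf; terms unseen in the collection get weight 0."""
    tf = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Score every document against the query and sort by descending score."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n = len(tokenized)
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    idf = {term: math.log(1 + n / df[term]) for term in df}
    query_vec = tfidf_vector(query.lower().split(), idf)
    scores = {
        name: cosine(query_vec, tfidf_vector(tokens, idf))
        for name, tokens in tokenized.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "d1": "fishing in rivers and lakes",
    "d2": "information retrieval and search engines",
    "d3": "fly fishing fishing gear",
}
ranking = rank("fishing gear", docs)
for name, score in ranking:
    print(name, round(score, 3))
```

Here `d3` ranks first because it contains both query terms, `d1` second, and `d2` scores zero.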

11
Extending the Vector Space Model(1)

Indexing unit: a (term, context) pair (t, c)
E.g. (Harry Potter, /book/title)
Can be matched with
(Harry Potter, /book)
(Harry Potter, /book/sec/title)
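
A minimal sketch of how such indexing units might be extracted, assuming the context of a term is simply the tag path from the root to the element that directly contains it. The function name `term_context_pairs` and the lowercasing, whitespace-splitting tokenizer are illustrative assumptions; the same function can be applied to a document or to an XML fragment used as a query.

```python
import xml.etree.ElementTree as ET


def term_context_pairs(xml_text):
    """Return (term, context) pairs, where context is the path to the enclosing element."""
    pairs = []

    def visit(element, path):
        context = f"{path}/{element.tag}"
        # Text directly inside the element, before its first child.
        for term in (element.text or "").lower().split():
            pairs.append((term, context))
        for child in element:
            visit(child, context)
            # Text after a child's closing tag still belongs to this element.
            for term in (child.tail or "").lower().split():
                pairs.append((term, context))

    visit(ET.fromstring(xml_text), "")
    return pairs


pairs = term_context_pairs("<book><title>Harry Potter</title></book>")
print(pairs)
print(term_context_pairs("<book>fishing</book>"))
```

The first call yields the two terms of the title, each paired with `/book/title`; the second yields `fishing` paired with `/book`.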
Retrieval Formula
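
The formula itself did not survive in this transcript. As a rough sketch of the idea only: each query pair is compared with every document pair that has the same term, and the product of their weights is scaled by how closely the two contexts resemble each other. The resemblance function below (the shorter path must be a subsequence of the longer one, scored by relative length) is an assumption chosen so that the matches listed on this slide get a non-zero score. It is not claimed to be the paper's definition, and no length normalization is applied.

```python
def is_subsequence(short, long):
    """True if the items of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(step in remaining for step in short)


def context_resemblance(query_ctx, doc_ctx):
    """Hypothetical resemblance in [0, 1]; 1.0 means identical paths."""
    q = query_ctx.strip("/").split("/")
    d = doc_ctx.strip("/").split("/")
    short, long = (q, d) if len(q) <= len(d) else (d, q)
    if not is_subsequence(short, long):
        return 0.0
    return (1 + len(short)) / (1 + len(long))


def score(query_pairs, doc_pairs):
    """Both arguments map (term, context) -> weight."""
    total = 0.0
    for (q_term, q_ctx), q_weight in query_pairs.items():
        for (d_term, d_ctx), d_weight in doc_pairs.items():
            if q_term == d_term:
                total += q_weight * d_weight * context_resemblance(q_ctx, d_ctx)
    return total


query = {("harry", "/book/title"): 1.0}
doc = {
    ("harry", "/book/sec/title"): 1.0,  # partial context match
    ("harry", "/article/title"): 1.0,   # different root: no match
}
print(score(query, doc))
```

Setting the resemblance to 1 for every pair of contexts reduces this to ordinary term matching that ignores structure.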

12
Extending the Vector Space Model(2)
If c is rare, idf(t,c) would be high in spite of
t being very common.
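
The effect can be seen on a toy collection. In this sketch, document frequency is counted once per (t, c) pair and once per term with contexts merged; the idf variant `log(N/df)` and the context paths are assumptions made for illustration.

```python
import math
from collections import defaultdict


def idf_tables(docs):
    """docs: list of sets of (term, context) pairs, one set per document."""
    n = len(docs)
    df_pair = defaultdict(int)
    df_term = defaultdict(int)
    for pairs in docs:
        for pair in pairs:
            df_pair[pair] += 1
        # Merged view: a term counts once per document, whatever its context.
        for term in {term for term, _ in pairs}:
            df_term[term] += 1
    idf_pair = {pair: math.log(n / df) for pair, df in df_pair.items()}
    idf_term = {term: math.log(n / df) for term, df in df_term.items()}
    return idf_pair, idf_term


docs = [
    {("the", "/article/bdy")},
    {("the", "/article/bdy")},
    {("the", "/article/bdy")},
    {("the", "/article/bdy"), ("the", "/article/fm/kwd")},
]
idf_pair, idf_term = idf_tables(docs)
print(idf_pair[("the", "/article/fm/kwd")])  # high: the context is rare
print(idf_term["the"])                       # zero: the term is in every document
```

Merging contexts before computing idf is one way to avoid rewarding a common term merely because it appears in a rare context, which may be what the merge-idf runs in the evaluation refer to.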
13
Evaluation

Runs
Partial-match
Partial-match.merge-idf
Partial-match.merge
Fuzzy-match.merge-idf
Flat (ignore context)

14
Result(1)

Result for free-text-oriented topics
An example topic
<yr>1995,1996,1997,1998,1999</yr>
<bdy>XML Electronic commerce</bdy>

15
Result(2)

Result for context-oriented topics
An example topic
<atl>Content-Based retrieval of video
databases</atl>

16
Summary

Using XML fragments with an extended vector space
model is promising.
Use different solutions for different types of
applications
Something wrong?

17
Another Problem --- CO Search

Document collection
XML documents
Query
a set of keywords
Task: find the smallest element satisfying the query

Challenge: rank components instead of documents
18
Possible Solutions
Possible Method (1): treat each component as a
document.
Problem with this method: XML components are
nested.
<article> t1 <sec> <p> t2 </p> </sec> </article>
19
Possible Solutions (Cont.)
Possible Method (2): count TF at the component
level; compute N and DF at the document level.
<article> <sec>t1</sec> <sec>t1</sec> <sec>t2</sec> </article>
20
Proposed Solution

Create an index for each component type
Elements in each index are regarded as documents
Keep N, DF, TF for the specific component type
Can apply the regular vector space model on each
index
Given a query
Run the query in parallel on each index
Return one ranked list of results from each
index
Normalize the scores in each index into the range
(0,1)
Achieved by computing
Merge the normalized results into one ranked
list of all components

Assumptions: the set of potential components to be
returned is known in advance, and components of the
same type are not nested.
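
The normalize-and-merge step can be sketched as follows. The normalization formula is missing from this transcript, so the sketch assumes the simplest choice, dividing by the top score in each index (which gives values in (0, 1]). That is a stand-in and not necessarily what the authors computed. The per-index result lists and component identifiers are made-up sample data, standing in for the output of running the query on each index.

```python
def normalize(results):
    """Scale one index's scores by its top score (an assumed normalization)."""
    top = max((score for _, score in results), default=0.0)
    if top <= 0:
        return []
    return [(component, score / top) for component, score in results]


def merge(per_index_results):
    """Combine per-type ranked lists into one list sorted by normalized score."""
    merged = []
    for component_type, results in per_index_results.items():
        for component, score in normalize(results):
            merged.append((component_type, component, score))
    return sorted(merged, key=lambda row: row[2], reverse=True)


# Raw scores from different indices are on different scales.
per_index_results = {
    "article": [("a1", 12.0), ("a2", 3.0)],
    "sec": [("a1/sec[2]", 8.0), ("a2/sec[1]", 6.0)],
    "p": [("a1/sec[2]/p[1]", 40.0)],
}
merged = merge(per_index_results)
for component_type, component, score in merged:
    print(component_type, component, score)
```

Without normalization the paragraph's raw score of 40.0 would dominate regardless of relevance, because N, DF and TF are kept separately for each component type.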
21
Conclusion

Possible solutions to the following
challenges.
Challenge 1 (Information/Doc Unit): What is an
appropriate information unit?
The document may no longer be the most natural unit
Components in a document may be more appropriate
Challenge 2 (Query): What is an appropriate query
language?
Keyword (free-text) queries are no longer the only
choice
Constraints on the structure can be posed
22
References

Retrieving the Most Relevant XML Components, by
Y. Mass and M. Mandelbrod. INEX'03 workshop.
Searching XML Documents via XML Fragments, by D.
Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and
A. Soffer. SIGIR'03.
XIRQL: A Query Language for Information Retrieval
in XML Documents, by N. Fuhr and K. Großjohann.
SIGIR'01.
