Searching XML Documents via XML Fragments

1
Searching XML Documents via XML Fragments
  • D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass
    and A. Soffer

Presented by Hui Fang
2
Background(1) --- Data
Why is semi-structured data important?
Unstructured data: lacks the logical structure of a document.
Well-structured data: lacks flexibility and extensibility.
Semi-structured data, e.g.
<paper> <title> XIRQL </title> <author> N. Fuhr </author>
<author> K. Großjohann </author> <conf> SIGIR </conf> </paper>
3
XML in a nutshell
<book id="25"> <year>1997</year> <author>Karen Sparck Jones</author>
<author>Peter Willett</author> <publisher>Morgan Kaufmann</publisher>
<title>Readings in Information Retrieval</title> </book>
  • Hierarchical data format
  • Nested element structure having a root
  • Self describing data (tags), schema is attached
    to the data itself.
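
As a small illustration of the hierarchy and self-describing tags, the book example can be parsed and flattened into root-to-element paths. This is a minimal sketch using Python's standard library; the helper name `element_paths` is hypothetical, and the `id="25"` attribute syntax is assumed from the example.

```python
import xml.etree.ElementTree as ET

BOOK = """<book id="25">
  <year>1997</year>
  <author>Karen Sparck Jones</author>
  <author>Peter Willett</author>
  <publisher>Morgan Kaufmann</publisher>
  <title>Readings in Information Retrieval</title>
</book>"""

def element_paths(xml_text):
    """Return (path, text) for every element, walking down from the root."""
    root = ET.fromstring(xml_text)
    out = []

    def walk(el, prefix):
        path = f"{prefix}/{el.tag}"
        out.append((path, (el.text or "").strip()))
        for child in el:
            walk(child, path)

    walk(root, "")
    return out

paths = element_paths(BOOK)
for path, text in paths:
    print(path, text)
```

Each path such as /book/title names the context in which a piece of text occurs, which is the structural information a plain-text document lacks.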

4
Background(2) --- Query
How to query semi-structured data (e.g. XML data)?
5
Related Work
  • DB-oriented approaches
  • E.g. XML-QL, XQL, XQuery

WHERE <book> <title>Harry Potter</title> <author>$a</author>,
<year>$y</year> </book> in books.xml,
$y > 2002 CONSTRUCT <result>
<author>$a</author> </result>
  • DB+IR approaches
  • E.g. XIRQL
  • IR-oriented approaches
  • E.g. this paper

6
Problem Refinement---CAS Search
  • Document collection
  • XML documents
  • Each document is a hierarchical structure of
    nested elements
  • Markup in the document mainly serves for exposing
    the logical structure of a document.
  • Query
  • contains explicit references to the XML
    structure
  • specifies the target element that needs to be returned

An example
Retrieve all articles from the years 1999-2000
that deal with works on nonmonotonic reasoning. Do
not retrieve articles that are calendars/calls for
papers.
7
Approach
  • Compare apples to apples
  • Recall vector space models
  • Both documents and queries are expressed in free
    text.
  • Compare unstructured data to unstructured data
  • This paper
  • Search XML documents via XML fragments

8
Query---XML Fragments(1)
  • Topic 1: Find all books about fishing
  • <book> fishing </book>
  • Topic 2: Find all books having a title about
    search
  • <book> <title> fishing </title> </book>

More intuitive, more flexible
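
To make the idea concrete, a fragment query can be broken into (term, context) pairs, where the context is the path of enclosing tags. The following is a minimal sketch, not the authors' implementation; the function name `fragment_to_pairs` and the lowercasing and whitespace tokenization are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def fragment_to_pairs(fragment):
    """Turn an XML fragment query into a list of (term, context) pairs."""
    root = ET.fromstring(fragment)
    pairs = []

    def walk(el, prefix):
        ctx = f"{prefix}/{el.tag}"
        # Text directly inside this element: its leading text plus the
        # text that follows each child element.
        chunks = [el.text or ""] + [child.tail or "" for child in el]
        for chunk in chunks:
            for term in chunk.split():
                pairs.append((term.lower(), ctx))
        for child in el:
            walk(child, ctx)

    walk(root, "")
    return pairs

print(fragment_to_pairs("<book> fishing </book>"))
print(fragment_to_pairs("<book> <title> fishing </title> </book>"))
```

The two fragments differ only in the context attached to the term, which is what lets the same keyword express two different information needs.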
9
Query---XML Fragments(2)
  • Limited expressiveness
  • E.g. finding figures that describe the CORBA
    architecture and the paragraphs that refer to
    those figures.

Requires a join operation between two elements:
figures and paragraphs.
10
Recall Text Retrieval Task
  • Given a query
  • According to the retrieval formula, compute the
    relevance score for each document
  • Rank the documents according to relevance score.
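
The three steps above can be sketched with a plain TF-IDF score. This is a minimal sketch under stated assumptions: the scoring formula (term frequency times log(N/df), summed over query terms) is a generic textbook choice, not a formula taken from the slides, and the function name `rank` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def rank(docs, query):
    """docs: {doc_id: text}. Returns [(doc_id, score)] sorted by score, best first."""
    n = len(docs)
    tfs = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter(term for tf in tfs.values() for term in tf)
    scores = {}
    for doc_id, tf in tfs.items():
        scores[doc_id] = sum(
            tf[t] * math.log(n / df[t]) for t in query.lower().split() if t in tf
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": "xml retrieval xml",
    "d2": "database retrieval",
    "d3": "cooking recipes",
}
ranked = rank(docs, "xml")
print(ranked)
```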

11
Extending the Vector Space Model(1)
  • Indexing unit
  • E.g. (Harry Potter, /book/title)
  • Can be matched with
  • (Harry Potter, /book)
  • (Harry Potter, /book/sec/title)
  • Retrieval Formula
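
The retrieval formula itself is not preserved in this transcript. As an illustration of how two contexts could be compared, the sketch below gives a weight to a query context that is a subsequence of a document context. The weighting (1 + |query context|) / (1 + |document context|) and the function names are assumptions chosen for the example, not a formula recovered from the slide.

```python
def is_subsequence(short, long):
    """True if the tags of `short` appear in `long` in the same order."""
    it = iter(long)
    return all(tag in it for tag in short)

def context_resemblance(query_ctx, doc_ctx):
    """Hypothetical weight in [0, 1] for matching a query context to a document context."""
    q = query_ctx.strip("/").split("/")
    d = doc_ctx.strip("/").split("/")
    if is_subsequence(q, d):
        return (1 + len(q)) / (1 + len(d))
    return 0.0

print(context_resemblance("/book/title", "/book/title"))
print(context_resemblance("/book", "/book/title"))
print(context_resemblance("/book/title", "/book/sec/title"))
print(context_resemblance("/article", "/book/title"))
```

Under this assumed weighting, an exact context match scores 1, a looser match scores less, and unrelated contexts score 0, so the weight can scale the contribution of each matching term.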

12
Extending the Vector Space Model(2)
If context c is rare, idf(t, c) would be high in spite of
term t being very common.
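
A toy calculation shows the effect. The sketch assumes idf = log(N / df) and a collection of four documents stored as sets of (term, context) pairs; the data and function names are hypothetical. Computing idf over the term alone, ignoring context, is shown as one conceivable remedy and is an assumption rather than something taken from the slides.

```python
import math

def idf_pair(docs, term, ctx):
    """idf computed over documents containing the exact (term, context) pair."""
    df = sum(1 for d in docs if (term, ctx) in d)
    return math.log(len(docs) / df) if df else 0.0

def idf_term(docs, term):
    """idf computed over documents containing the term in any context."""
    df = sum(1 for d in docs if any(t == term for t, _ in d))
    return math.log(len(docs) / df) if df else 0.0

docs = [
    {("xml", "/article/bdy"), ("xml", "/article/fn")},
    {("xml", "/article/bdy")},
    {("xml", "/article/bdy")},
    {("xml", "/article/bdy")},
]

rare_context = idf_pair(docs, "xml", "/article/fn")  # log(4/1)
merged = idf_term(docs, "xml")                       # log(4/4)
print(rare_context, merged)
```

The term occurs in every document, yet its idf in the rare context is log 4, while the context-independent idf is 0.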
13
Evaluation
  • Runs
  • Partial-match
  • Partial-match.merge-idf
  • Partial-match.merge
  • Fuzzy-match.merge-idf
  • Flat (ignore context)

14
Result(1)
  • Result for free-text-oriented topics
  • An example topic
  • <yr>1995,1996,1997,1998,1999</yr>
  • <bdy>XML Electronic commerce</bdy>

15
Result(2)
  • Result for context-oriented topics
  • An example topic
  • <atl>Content-Based retrieval of video
    databases</atl>

16
Summary
  • Using XML fragments with an extended vector space
    model is promising.
  • Use different solutions for different types of
    applications
  • Something wrong?

17
Another Problem --- CO Search
  • Document collection
  • XML documents
  • Query
  • a set of keywords
  • Task: Find the smallest element satisfying the query

Challenge: rank components instead of documents.
18
Possible Solutions
Possible method (1): treat each component as a
document.
Problem with this method: XML components are
nested.
<article> t1 <sec> <p> t2 </p> </sec> </article>
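
The nesting problem can be seen by counting document frequency when every element of the example is treated as its own document. This is a minimal sketch; the function name `component_df` is hypothetical, and each component's text is assumed to include the text of its descendants.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def component_df(xml_text):
    """Treat every element as a document; return (N, df per term)."""
    root = ET.fromstring(xml_text)
    components = list(root.iter())  # the root and all nested elements
    df = Counter()
    for el in components:
        # The text of a component includes the text of its descendants.
        terms = set("".join(el.itertext()).split())
        df.update(terms)
    return len(components), df

n, df = component_df("<article> t1 <sec> <p> t2 </p> </sec> </article>")
print(n, dict(df))
```

A single occurrence of t2 is counted in three "documents" (article, sec and p), while t1 is counted once, so the statistics are distorted by nesting depth rather than by how common a term is.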
19
Possible Solutions (Cont.)
Possible method (2): count TF at the component
level; compute N, DF at the document level.
<article> <sec>t1</sec> <sec>t1</sec> <sec>t2</sec> </article>
20
Proposed Solution
  • Create an index for each component type
  • Elements in each index are regarded as documents
  • Keep N, DF, TF for the specific component type
  • Can apply the regular vector space model on each
    index
  • Given a query
  • Run the query in parallel on each index
  • Return a ranked list of results from each
    index
  • Normalize the scores in each index into the range
    (0,1)
  • Achieved by computing
  • Merge the normalized results into one ranked
    list of all components

Assumptions: the set of potential component types to be
returned is known in advance, and components of the same
type are not nested.
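
The pipeline above can be sketched as follows. This is a minimal sketch, not the authors' system: the scoring is a generic TF-IDF sum, the indexes are queried in a simple loop rather than in parallel, and, because the normalization formula is not preserved in this transcript, scores are divided by the best score in each index as a stand-in. All names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_indexes(components):
    """components: iterable of (component_type, component_id, text).
    Returns one index per component type: {type: {id: term frequencies}}."""
    indexes = defaultdict(dict)
    for ctype, cid, text in components:
        indexes[ctype][cid] = Counter(text.lower().split())
    return indexes

def score_index(index, terms):
    """TF-IDF scores with N and DF taken from this index only."""
    n = len(index)
    scores = {}
    for cid, tf in index.items():
        s = 0.0
        for t in terms:
            df = sum(1 for other in index.values() if t in other)
            if df and tf[t]:
                s += tf[t] * math.log(1 + n / df)
        if s > 0:
            scores[cid] = s
    return scores

def search(indexes, query):
    """Query every index, normalize per index, merge into one ranked list."""
    terms = query.lower().split()
    merged = []
    for ctype, index in indexes.items():
        scores = score_index(index, terms)
        top = max(scores.values(), default=0.0)
        for cid, s in scores.items():
            # Stand-in normalization: divide by the best score in this index.
            merged.append((s / top, ctype, cid))
    return sorted(merged, reverse=True)

components = [
    ("article", "a1", "xml retrieval with fragments"),
    ("article", "a2", "database systems"),
    ("sec", "s1", "xml xml retrieval"),
    ("sec", "s2", "xml indexing"),
    ("sec", "s3", "query languages"),
]
results = search(build_indexes(components), "xml retrieval")
for score, ctype, cid in results:
    print(f"{score:.3f} {ctype} {cid}")
```

Keeping separate statistics per component type avoids mixing articles with sections when computing N and DF, and the per-index normalization makes scores from different indexes comparable before merging.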
21
Conclusion
  • Possible solutions to the following
    challenges.
  • Challenge 1 (Information/Doc Unit): What is an
    appropriate information unit?
  • The document may no longer be the most natural unit
  • Components in a document may be more appropriate
  • Challenge 2 (Query): What is an appropriate query
    language?
  • Keyword (free text) query is no longer the only
    choice
  • Constraints on the structures can be posed

22
References
  • Retrieving the Most Relevant XML Components, by
    Y. Mass, M. Mandelbrod. INEX03 workshop.
  • Searching XML Documents via XML Fragments, by D.
    Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and
    A. Soffer. SIGIR03.
  • XIRQL: A Query Language for Information Retrieval
    in XML Documents, by N. Fuhr, K. Großjohann.
    SIGIR02.