Title: Searching XML Documents via XML Fragments
1. Searching XML Documents via XML Fragments
- D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer
Presented by Hui Fang
2. Background (1): Data
Why is semi-structured data important?
- Unstructured data: lacks the logical structure of a document.
- Well-structured data: lacks flexibility and extensibility.
- Semi-structured data (e.g. XML):
<paper>
  <title> XIRQL </title>
  <author> N. Fuhr </author>
  <author> K. Großjohann </author>
  <conf> SIGIR </conf>
</paper>
3. XML in a nutshell
<book id="25">
  <year>1997</year>
  <author>Karen Sparck Jones</author>
  <author>Peter Willett</author>
  <publisher>Morgan Kaufmann</publisher>
  <title>Readings in Information Retrieval</title>
</book>
- Hierarchical data format
- Nested element structure having a root
- Self-describing data (tags): the schema is attached to the data itself.
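The hierarchical, self-describing structure can be made concrete with a small sketch. It uses Python's standard xml.etree.ElementTree to walk the book example and list each text value together with its root-to-element path. The helper name element_paths is hypothetical and only serves this illustration.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """
<book id="25">
  <year>1997</year>
  <author>Karen Sparck Jones</author>
  <author>Peter Willett</author>
  <publisher>Morgan Kaufmann</publisher>
  <title>Readings in Information Retrieval</title>
</book>
"""

def element_paths(xml_text):
    """Return (path, text) pairs for every element that carries text."""
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(node, prefix):
        # The path records the nesting from the root down to this element.
        path = prefix + "/" + node.tag
        text = (node.text or "").strip()
        if text:
            pairs.append((path, text))
        for child in node:
            walk(child, path)

    walk(root, "")
    return pairs

if __name__ == "__main__":
    for path, text in element_paths(BOOK_XML):
        print(path, "->", text)
```

The tags alone are enough to recover paths such as /book/title, which is what "schema attached to the data" means in practice.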
4. Background (2): Query
How can semi-structured data (e.g. XML data) be queried?
5. Related Work
- DB-oriented approaches
- E.g. XML-QL, XQL, XQuery
WHERE <book> <title>Harry Potter</title> <author>$a</author> <year>$y</year> </book> IN "books.xml", $y > 2002
CONSTRUCT <result> <author>$a</author> </result>
- DB+IR approaches
- E.g. XIRQL
- IR-oriented approaches
- E.g. this paper
6. Problem Refinement: CAS Search
- Document collection
- XML documents
- Each document is a hierarchical structure of nested elements
- Markup in the document mainly serves to expose the logical structure of the document.
- Query
- Content plus explicit references to the XML structure
- Specifies the target element that needs to be returned
An example
Retrieve all articles from the years 1999-2000 that deal with works on nonmonotonic reasoning. Do not retrieve articles that are calendars or calls for papers.
7. Approach
- Compare apples with apples
- Recall vector space models
- Both documents and queries are expressed in free text.
- Compare unstructured data to unstructured data
- This paper
- Search XML documents via XML fragments
8. Query: XML Fragments (1)
- Topic 1: Find all books about fishing
- <book> fishing </book>
- Topic 2: Find all books having a title about fishing
- <book> <title> fishing </title> </book>
More intuitive, more flexible
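A minimal sketch of how such a fragment query might be broken down: each query term is paired with the path of the element that directly encloses it. The function name fragment_to_pairs and the lowercase, whitespace-based tokenization are assumptions made for illustration, not details taken from the paper.

```python
import xml.etree.ElementTree as ET

def fragment_to_pairs(fragment):
    """Split an XML fragment query into (term, context) pairs."""
    root = ET.fromstring(fragment)
    pairs = []

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        # Text directly inside this element, plus text trailing its children.
        chunks = [node.text or ""]
        for child in node:
            walk(child, path)
            chunks.append(child.tail or "")
        for term in " ".join(chunks).lower().split():
            pairs.append((term, path))

    walk(root, "")
    return pairs

if __name__ == "__main__":
    print(fragment_to_pairs("<book> fishing </book>"))
    print(fragment_to_pairs("<book> <title> fishing </title> </book>"))
```

The two topics differ only in the context attached to the same term: /book in the first case, /book/title in the second.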
9. Query: XML Fragments (2)
- Limited expressiveness
- E.g. finding figures that describe the CORBA architecture and the paragraphs that refer to those figures.
This requires a join operation between two elements: figures and paragraphs.
10. Recall: Text Retrieval Task
- Given a query
- Compute the relevance score of each document according to the retrieval formula
- Rank the documents by relevance score.
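The three steps can be sketched with a small TF-IDF cosine ranker. The weighting tf * log(1 + N/df) is one common variant chosen here as an assumption; the slides do not specify which formula is meant. The function name rank is hypothetical.

```python
import math
from collections import Counter

def rank(query, docs):
    """Rank documents by a simple TF-IDF cosine score (one common variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    def vector(tokens):
        tf = Counter(tokens)
        # Terms unseen in the collection are ignored.
        return {t: tf[t] * math.log(1 + n_docs / df[t]) for t in tf if t in df}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    q_vec = vector(query.lower().split())
    scores = [(cosine(q_vec, vector(tokens)), i) for i, tokens in enumerate(tokenized)]
    # Highest score first; ties broken by document position.
    return sorted(scores, key=lambda s: (-s[0], s[1]))

if __name__ == "__main__":
    docs = ["xml retrieval with fragments", "relational database systems", "xml query languages"]
    for score, i in rank("xml retrieval", docs):
        print(round(score, 3), docs[i])
```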
11. Extending the Vector Space Model (1)
- Indexing unit
- E.g. (Harry Potter, /book/title)
- Can be matched with
- (Harry Potter, /book)
- (Harry Potter, /book/sec/title)
- Retrieval Formula
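The context matching listed above can be illustrated with a hedged sketch. It assumes, for illustration only, that two contexts match when the shorter path is a subsequence of the longer one, and that the match strength is (1 + shorter length) / (1 + longer length). Both the symmetric treatment and the scoring are assumptions of this sketch, and the names is_subsequence and context_resemblance are hypothetical.

```python
def is_subsequence(short, long):
    """True if every tag of `short` appears in `long` in the same order."""
    remaining = iter(long)
    # `tag in remaining` advances the iterator, so order is respected.
    return all(tag in remaining for tag in short)

def context_resemblance(ctx_a, ctx_b):
    """Assumed resemblance score in [0, 1] between two path contexts."""
    a = [t for t in ctx_a.split("/") if t]
    b = [t for t in ctx_b.split("/") if t]
    short, long = (a, b) if len(a) <= len(b) else (b, a)
    if not is_subsequence(short, long):
        return 0.0
    return (1 + len(short)) / (1 + len(long))

if __name__ == "__main__":
    for other in ["/book", "/book/sec/title", "/book/title", "/article/title"]:
        print("/book/title", "vs", other, "->", context_resemblance("/book/title", other))
```

Under these assumptions /book/title partially matches both /book and /book/sec/title, matches itself fully, and does not match /article/title.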
12. Extending the Vector Space Model (2)
If the context c is rare, idf(t,c) will be high even when the term t is very common.
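A small numeric sketch of this effect, using made-up counts: the same term gets a very different idf depending on whether document frequency is counted per (term, context) pair or for the term regardless of context. The collection size, the counts, the contexts, and the plain log(N/df) formula are all hypothetical choices for illustration.

```python
import math

def idf(n_docs, doc_freq):
    """Plain inverse document frequency; one common variant among several."""
    return math.log(n_docs / doc_freq)

# Hypothetical counts: documents containing the term inside each context.
N_DOCS = 1000
DF_BY_CONTEXT = {
    ("retrieval", "/article/bdy"): 400,        # common term, common context
    ("retrieval", "/article/fig/caption"): 2,  # same term, rare context
}

def idf_per_context(term, context):
    """idf(t, c): document frequency is counted per (term, context) pair."""
    return idf(N_DOCS, DF_BY_CONTEXT[(term, context)])

def idf_ignoring_context(doc_freq_any_context):
    """idf(t): document frequency counts the term in any context."""
    return idf(N_DOCS, doc_freq_any_context)

if __name__ == "__main__":
    print(idf_per_context("retrieval", "/article/bdy"))
    print(idf_per_context("retrieval", "/article/fig/caption"))
    # Assumed: 400 documents contain the term somewhere.
    print(idf_ignoring_context(400))
```

With these numbers the rare context inflates the idf of a term that occurs in 40% of the documents, which is the distortion the slide points out.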
13. Evaluation
- Runs
- Partial-match
- Partial-match.merge-idf
- Partial-match.merge
- Fuzzy-match.merge-idf
- Flat (ignore context)
14. Result (1)
- Result for free-text-oriented topics
- An example topic
- <yr>1995,1996,1997,1998,1999</yr>
- <bdy>XML Electronic commerce</bdy>
15. Result (2)
- Result for context-oriented topics
- An example topic
- <atl>Content-Based retrieval of video databases</atl>
16. Summary
- Using XML fragments with an extended vector space model is promising.
- Use different solutions for different types of applications
- Something wrong?
17. Another Problem: CO Search
- Document collection
- XML documents
- Query
- A set of keywords
- Task: find the smallest element satisfying the query
Challenge: rank components instead of whole documents
18. Possible Solutions
Possible method (1): treat each component as a document.
Problem with this method: XML components are nested.
<article> t1 <sec> <p> t2 </p> </sec> </article>
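One way to see the nesting problem is to count, for the example above, in how many components each term occurs when every element is treated as its own document. This is a sketch of one reading of the slide; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ARTICLE = "<article> t1 <sec> <p> t2 </p> </sec> </article>"

def component_term_counts(xml_text):
    """List (tag, term counts) for every element, including nested text."""
    root = ET.fromstring(xml_text)
    return [(elem.tag, Counter("".join(elem.itertext()).split()))
            for elem in root.iter()]

def component_df(xml_text):
    """Count in how many components each term occurs (component = document)."""
    df = Counter()
    for _, terms in component_term_counts(xml_text):
        df.update(terms.keys())
    return df

if __name__ == "__main__":
    print(component_df(ARTICLE))
```

Although t1 and t2 each occur once in the text, t2 is counted in p, sec and article, so its document frequency is inflated purely by nesting.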
19. Possible Solutions (Cont.)
Possible method (2): count TF at the component level; compute N and DF at the document level.
<article> <sec>t1</sec> <sec>t1</sec> <sec>t2</sec> </article>
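A hedged sketch of what the example may be illustrating: with DF taken at the document level, t1 and t2 look equally frequent, even though t1 appears in two of the three sections and t2 in only one. This interpretation is an assumption, and df_at_levels is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ARTICLE = "<article> <sec>t1</sec> <sec>t1</sec> <sec>t2</sec> </article>"

def df_at_levels(documents, component_tag):
    """Return (document-level DF, component-level DF) for XML strings."""
    doc_df, comp_df = Counter(), Counter()
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        # Document level: each term counts once per document.
        doc_df.update(set("".join(root.itertext()).split()))
        # Component level: each term counts once per matching component.
        for comp in root.iter(component_tag):
            comp_df.update(set("".join(comp.itertext()).split()))
    return doc_df, comp_df

if __name__ == "__main__":
    doc_df, sec_df = df_at_levels([ARTICLE], "sec")
    print("document-level DF:", dict(doc_df))
    print("section-level DF:", dict(sec_df))
```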
20. Proposed Solution
- Create an index for each component type
- Elements in each index are regarded as documents
- Keep N, DF and TF for the specific component type
- The regular vector space model can be applied to each index
- Given a query
- Run the query in parallel on each index
- Obtain a ranked list of results from each index
- Normalize the scores in each index into the range (0,1)
- Achieved by computing
- Merge the normalized results into one ranked list of all components
Assumes that the set of potential components to be returned is known in advance. Assumes no nesting of the same component.
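The pipeline above can be sketched as follows. Several details are assumptions of this sketch rather than the method in the cited work: the scoring is a plain TF-IDF dot product, the normalization divides by the top score of each index (the slide's own normalization formula is not shown), and the indexes are queried one after another instead of in parallel. All function names are hypothetical.

```python
import math
from collections import Counter

def build_index(components):
    """components: list of (component_id, text) of ONE component type."""
    docs = {cid: Counter(text.lower().split()) for cid, text in components}
    df = Counter(t for tf in docs.values() for t in tf)
    return {"docs": docs, "df": df, "n": len(docs)}

def search(index, query):
    """Score every component in one index with a plain TF-IDF dot product."""
    scores = {}
    for cid, tf in index["docs"].items():
        score = sum(tf[t] * math.log(1 + index["n"] / index["df"][t])
                    for t in query.lower().split() if t in tf)
        if score > 0:
            scores[cid] = score
    return scores

def normalize(scores):
    """Scale one index's scores by its top score (an assumed normalization)."""
    top = max(scores.values(), default=0.0)
    return {cid: s / top for cid, s in scores.items()} if top else {}

def merged_search(indexes, query):
    """Query each per-type index, normalize, and merge into one ranked list."""
    merged = []
    for comp_type, index in indexes.items():
        for cid, score in normalize(search(index, query)).items():
            merged.append((score, comp_type, cid))
    return sorted(merged, key=lambda r: (-r[0], r[1], r[2]))

if __name__ == "__main__":
    indexes = {
        "sec": build_index([("s1", "xml retrieval"), ("s2", "databases")]),
        "p": build_index([("p1", "xml xml fragments"), ("p2", "xml")]),
    }
    for score, comp_type, cid in merged_search(indexes, "xml"):
        print(round(score, 3), comp_type, cid)
```

Because N, DF and TF are kept per component type, scores from different indexes are not directly comparable, which is why some normalization step is needed before merging.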
21. Conclusion
- Possible solutions to the following challenges.
- Challenge 1 (Information/Doc Unit): What is an appropriate information unit?
- A document may no longer be the most natural unit
- Components in a document may be more appropriate
- Challenge 2 (Query): What is an appropriate query language?
- Keyword (free-text) queries are no longer the only choice
- Constraints on the structure can be posed
22. References
- Retrieving the Most Relevant XML Components, by Y. Mass and M. Mandelbrod. INEX'03 workshop.
- Searching XML Documents via XML Fragments, by D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer. SIGIR'03.
- XIRQL: A Query Language for Information Retrieval in XML Documents, by N. Fuhr and K. Großjohann. SIGIR'02.