Fast Index News Archive - PowerPoint PPT Presentation

Slides: 17
Provided by: atanasr

Transcript and Presenter's Notes


1
Fast Index News Archive
  • Mariana Barca
  • Guadalupe Canahuate
  • Laura Stoia
  • Rodi Tountcheva

2
What is FINA?
  • A news database with a web-based search engine
    that allows for fast searches.
  • Users often search for certain meanings rather
    than for certain words
  • Searching for news is time-consuming when a large
    number of articles is available
  • Few existing news systems offer search options
    based on different search techniques

3
Goals
  • We try to address the previously mentioned issues
    by developing a search system that:
  • implements both exact search (within the title)
    and LSI (Latent Semantic Indexing, a semantic
    relatedness search explained in the following
    slides)
  • is generally available through a web interface
  • provides access to a large archive of articles, and
    adapts the search techniques to this amount of
    information

4
Why is FINA interesting?
  • Useful for users who like to search and read news
    online
  • Interesting for researchers because it provides a
    basis to implement further ideas for evaluation,
    optimization and comparison between search
    techniques

5
Overall Design
6
LSI Engine - Technique
  • A vector space-based approach to information
    retrieval
  • Advantage: capable of finding documents with
    similar content that may not contain the search
    words.
  • Given:
  • A collection of d documents
  • A set of t terms (words) appearing in those
    documents

7
LSI Engine - Technique
  • Construct a t x d matrix X where X(i,j) is the
    number of occurrences of the ith term in the jth
    document
  • Decompose X as
  • $X = T_0 S_0 D_0^{\top}$
  • Singular Value Decomposition

The X matrix
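
The construction of X described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_term_document_matrix`, the whitespace tokenizer, and the two sample sentences are all hypothetical and not part of FINA, which is written in Java and stems words before counting them.

```python
from collections import Counter

def build_term_document_matrix(documents):
    """Return (terms, X) where X[i][j] counts term i in document j."""
    # Naive whitespace tokenization (an assumption; the slides stem first).
    tokenized = [doc.lower().split() for doc in documents]
    # The t distinct terms, sorted so that row order is deterministic.
    terms = sorted({word for words in tokenized for word in words})
    counts = [Counter(words) for words in tokenized]
    # One row per term (t rows), one column per document (d columns).
    X = [[c[term] for c in counts] for term in terms]
    return terms, X

# Made-up sample documents, for illustration only.
docs = ["stocks fall as markets react", "markets rally as stocks rise"]
terms, X = build_term_document_matrix(docs)
```

Each row of X corresponds to one term and each column to one document, so X has t rows and d columns as stated above.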
8
LSI Engine - Singular Value Decomposition
  • Dimensionality reduction: a large number of
    features is reduced to a smaller one.
  • Advantages:
  • Requires less space
  • Searches are faster
  • It brings out correlations in the data:
    documents that are semantically similar become
    closer
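
In the notation of Deerwester et al. (listed in the references), the reduction can be written as a rank-k approximation that keeps only the k largest singular values. The number of retained dimensions k is an assumed parameter here; the slides do not say which value FINA uses.

```latex
X \approx \hat{X} = T \, S \, D^{\top},
\qquad T \in \mathbb{R}^{t \times k},
\quad S = \operatorname{diag}(\sigma_1, \dots, \sigma_k),
\quad D \in \mathbb{R}^{d \times k},
\qquad \sigma_1 \ge \dots \ge \sigma_k > 0,
\quad k \ll \min(t, d)
```

Each document is then represented by one row of D, that is, by k numbers instead of t term counts, which is where the savings in space and search time come from.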

9
LSI Engine - Implementation
  • Written in Java
  • Stems the words (e.g., sing, singer, singing all
    map to the same root) and produces the matrices
    needed for LSI-based similarity computation in the
    database ($D$ and $T S^{-1}$)
  • Uses a free third-party package for matrix
    transformation (SVDLIBC)
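
To make the stemming step concrete, here is a deliberately simplified Python sketch. It is not the stemmer used in FINA (the slides do not name one) and it is far cruder than a real algorithm such as Porter's; the function `toy_stem` and its suffix list are hypothetical and chosen only so that the slide's own example works.

```python
def toy_stem(word, suffixes=("ing", "er", "s")):
    """Strip at most one known suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when a root of three or more characters remains,
        # so that "sing" itself is not reduced to "s".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# The example from the slide: all three forms share one root.
stems = [toy_stem(w) for w in ("sing", "singer", "singing")]
```

Mapping word forms to a shared root keeps the number of rows in the term-document matrix smaller than counting every surface form separately would.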

10
Database
11
Query Execution
  • LSI Search
  • Get stemmed words and counts from the Web
    Interface and compute weights for those words in
    the dictionary ($v_q$)
  • Multiply the resulting vector by $T S^{-1}$ and save
    the query feature vector ($Q = v_q T S^{-1}$)
  • Compute the cosine between Q and the feature
    vector v of every news item (inner product),
    $\cos(Q, v)$
  • Display the k most similar results and return a
    queryId to the Web Interface
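
The LSI search steps above can be sketched as follows, reading the query formula as Q = vq x T x S^-1. All names (`fold_in_query`, `cosine`, `top_k`) and all numeric values are made up for illustration, and S is passed as the list of its diagonal entries. FINA itself is written in Java and keeps these matrices in a database, so this is a sketch of the arithmetic only.

```python
import math

def fold_in_query(vq, T, S):
    """Project a weighted term vector into LSI space: Q = vq x T x S^-1."""
    return [
        sum(vq[i] * T[i][j] for i in range(len(vq))) / S[j]
        for j in range(len(S))
    ]

def cosine(a, b):
    """Cosine similarity: the inner product of the normalized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(Q, D, k):
    """Return the indices of the k rows of D most similar to Q."""
    scored = sorted(((cosine(Q, row), j) for j, row in enumerate(D)), reverse=True)
    return [j for _, j in scored[:k]]

# Made-up values: 3 terms, 2 latent dimensions, 3 news items.
T = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S = [1.0, 2.0]                    # diagonal of S
D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vq = [1.0, 0.0, 1.0]              # query weights over the 3 terms

Q = fold_in_query(vq, T, S)
ranking = top_k(Q, D, k=2)
```

Because the query is reduced to the same small number of dimensions as the stored document vectors, each comparison costs only a handful of multiplications regardless of the vocabulary size.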

12
Query Execution (contd)
  • Relevance Feedback
  • Get a queryId, a list of relevant NewsIds and a
    list of non-relevant NewsIds
  • Compute the average feature vector (FV) of the
    relevant news (R_FV)
  • Compute the average FV of the non-relevant news
    (U_FV)
  • Compute a new Q_FV for the query:
  • $Q_{FV} \leftarrow a \, Q_{FV} + b \, R_{FV} - c \, U_{FV}$
  • $a = 1$, $b = 0.75$, and $c = 0.25$
  • Compute the cosine between Q_FV and the feature
    vector of every news item (inner product)
  • Display the k most similar results
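
The query update can be illustrated with a short Python sketch. It assumes the usual Rocchio-style form in which the relevant average is added and the non-relevant average is subtracted; the operators were lost in this transcript, so the signs are an assumption. The helper names `mean_vector` and `update_query` and the sample vectors are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def update_query(q_fv, relevant, non_relevant, a=1.0, b=0.75, c=0.25):
    """Return a*Q_FV + b*R_FV - c*U_FV (subtraction of U_FV is assumed)."""
    dim = len(q_fv)
    # An empty feedback list contributes a zero vector.
    r_fv = mean_vector(relevant) if relevant else [0.0] * dim
    u_fv = mean_vector(non_relevant) if non_relevant else [0.0] * dim
    return [a * q + b * r - c * u for q, r, u in zip(q_fv, r_fv, u_fv)]

# Made-up feature vectors with 2 latent dimensions.
new_q = update_query(
    [1.0, 0.0],
    relevant=[[0.0, 2.0], [0.0, 4.0]],
    non_relevant=[[4.0, 0.0]],
)
```

With these weights the updated query moves toward the items marked relevant and, more weakly, away from those marked non-relevant; it is then re-ranked by cosine similarity exactly as in the initial search.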

13
(No Transcript)
14
Conclusion
  • This project combines various areas of computer
    science and engineering: artificial intelligence,
    databases, natural language processing,
    algorithms and web design.
  • General techniques that are extensively used in
    these fields, such as SVD, LSI and relevance
    feedback, have been implemented in this system.
  • Performance improvements and resource management
    issues were taken into account for the design of
    FINA.

15
Thank you!
  • Questions and comments are welcome!

16
References
  • "Telcordia LSI Engine Implementation and
    Scalability Issues" - C. Chen, N. Stoffel, M.
    Post, C. Basu, D. Bassu and C. Behrens (2001)
  • "Indexing by Latent Semantic Analysis" - S.
    Deerwester, S. T. Dumais, G. W. Furnas, T. K.
    Landauer and R. Harshman (1990)
  • "Adaptive Nearest Neighbor Search for Relevance
    Feedback in Large Image Databases" - P. Wu and B.
    S. Manjunath (2001)
  • "Content-Based Image Retrieval with Relevance
    Feedback in Mars" - Y. Rui, T. S. Huang and S.
    Mehrotra (1997)