Indexing by Latent Semantic Analysis - PowerPoint PPT Presentation
1
Indexing by Latent Semantic Analysis
  • Written by Deerwester, Dumais, Furnas, Landauer,
    and Harshman
  • (1990)
  • Reviewed by Cinthia Levy

2
Latent Semantic Indexing
  • Term-matching
  • Most retrieval systems match words of a query
    (keywords) with words of a document.
  • Problem
  • What if users want to retrieve information
    based upon conceptual content?

3
Latent Semantic Indexing
  • Expressing a concept in keywords is
    complicated and unreliable
  • Synonymy: many ways to define a concept.
    Results in poor recall.
  • Polysemy: most words have multiple meanings.
    Results in poor precision.

4
Latent Semantic Indexing
  • Three factors contribute to the failure of IR
    systems to overcome the problems associated
    with synonymy and polysemy:
  • Identification of index terms is incomplete
  • No automatic method adequately addresses polysemy
  • Technical: the way current IR systems work

5
Latent Semantic Indexing
  • Goal
  • ...to build an IR system that predicts what
    terms really are implied by a query or what
    terms really apply to a document (i.e. the
    latent semantics).

6
Latent Semantic Indexing
  • Choosing a model
  • Proximity model: similar items are put near
    each other in some space or structure.

7
Latent Semantic Indexing
  • Existing proximity models include:
  • Hierarchical, partition, and overlapping clusterings
  • Ultrametric and additive trees
  • Factor-analytic and multidimensional distance models

8
Latent Semantic Indexing
  • Alternate model was considered, based on the
    following criteria
  • Adjustable representational richness
  • Explicit representation of both terms and
    documents
  • Computational tractability for large datasets

9
Latent Semantic Indexing
  • Singular value decomposition (SVD),
  • or two-mode factor analysis,
  • satisfied all three criteria!
  • SVD: a fully automatic statistical method used
    to determine associations among terms in a
    large document collection and to create a
    semantic or concept space.
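The factorization named on this slide can be sketched with NumPy. The toy term-document matrix and its term labels below are illustrative, not taken from the paper:

```python
import numpy as np

# Hypothetical toy term-document matrix: rows = terms, columns = documents.
# Cell values are raw term frequencies (made-up numbers for illustration).
A = np.array([
    [1, 0, 1, 0],   # "human"
    [1, 1, 0, 0],   # "interface"
    [0, 1, 1, 1],   # "system"
    [0, 0, 0, 1],   # "survey"
], dtype=float)

# SVD factors A into term vectors (U), singular values (s),
# and document vectors (Vt): A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sanity check: the factorization reconstructs A exactly
# (up to floating-point error).
A_hat = U @ np.diag(s) @ Vt
print(np.allclose(A, A_hat))  # True
```

The "concept space" arises when only the largest singular values are kept, which a later slide's sketch covers.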

10
Latent Semantic Indexing
  • Basis of LSI
  • Documents are condensed to contain only content
    words with semantic meaning
  • Patterns of word distribution (co-occurrence) are
    analyzed across a collection of documents.

11
Latent Semantic Indexing
  • Basis of LSI
  • Document collection is examined as a whole
  • Documents with many words in common are
    semantically close.
  • Documents with few words in common are
    semantically distant.

12
Latent Semantic Indexing
  • Steps of LSI
  • Format document: stop words removed, punctuation
    removed, no capitalization.
  • Select content words: words with no semantic
    value are removed using a stop list.
  • Apply stemming: reduces words to their root form.
  • (not applied in Deerwester, et al.)

13
Latent Semantic Indexing
  • Result: a list of content words
  • The list of content words is used to generate a
    term-document matrix.
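The preprocessing steps above can be sketched in a few lines of Python. The stop list and the sample documents are illustrative assumptions, and stemming is omitted (as it was in Deerwester, et al.):

```python
import re
from collections import Counter

# Illustrative stop list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}

def content_words(doc):
    """Lowercase, strip punctuation, drop stop words."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def term_document_matrix(docs):
    """Build a term-document matrix: rows = terms, columns = documents."""
    counts = [Counter(content_words(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

docs = ["The user interface of the system.",
        "A survey of user response time."]
terms, matrix = term_document_matrix(docs)
print(terms)  # ['interface', 'response', 'survey', 'system', 'time', 'user']
```

Each cell holds the frequency of a term in a document; "user", appearing in both sample documents, gets the row [1, 1].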

14
Latent Semantic Indexing
  • Term-document matrix

15
Latent Semantic Indexing
  • Term-document matrix
  • Term weighting is applied to each value
  • (not applied in Deerwester, et al.)
  • The SVD algorithm is applied to the matrix
  • The matrix represents vectors in a
    multi-dimensional space
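The SVD step above is normally truncated: keeping only the k largest singular values places terms and documents in a k-dimensional concept space. A minimal sketch, with an illustrative matrix and k chosen arbitrarily:

```python
import numpy as np

# Illustrative term-document matrix (5 terms x 4 documents).
A = np.array([
    [1., 1., 0., 0.],
    [1., 0., 0., 0.],
    [0., 1., 1., 0.],
    [0., 0., 1., 1.],
    [0., 0., 0., 1.],
])

k = 2  # number of latent dimensions to keep (an assumption)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Document coordinates in the k-dimensional latent space:
# the columns of diag(sk) @ Vtk, one per document.
doc_coords = (np.diag(sk) @ Vtk).T
print(doc_coords.shape)  # (4, 2)
```

The rank-k product Uk @ diag(sk) @ Vtk is the closest rank-k approximation to A, which is what lets LSI smooth over word-choice variation.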

16
Latent Semantic Indexing
  • Visual representation of a three-dimensional
    space
  • Content words form three orthogonal axes
    (mutually perpendicular)
  • eggs
  • bacon
  • coffee

17
Latent Semantic Indexing
  • If you draw a line from the origin of the
    graph to each of these points, you obtain a set
    of vectors in 'bacon-eggs-and-coffee' space. The
    size and direction of each vector tells you how
    many of the three key items were in any
    particular order, and the set of all the vectors
    taken together tells you something about the kind
    of breakfast people favor on a Saturday morning.
  • Retrieved from
  • http://javelina.cet.middlebury.edu/lsa/out/lsa_explanation.htm
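The comparison of vectors described above is usually done with the cosine of the angle between them. A small sketch of the quoted 'bacon-eggs-and-coffee' space, with made-up order counts:

```python
import math

# Each order is a vector of (eggs, bacon, coffee) counts
# (illustrative numbers, not from the quoted page).
order_a = [2, 1, 1]   # two eggs, one bacon, one coffee
order_b = [4, 2, 2]   # same proportions, larger order
order_c = [0, 0, 3]   # coffee only

def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print(round(cosine(order_a, order_b), 3))  # 1.0 -- same direction
print(round(cosine(order_a, order_c), 3))  # smaller -- different mix
```

Cosine similarity depends only on direction, not magnitude, so a big order and a small order of the same breakfast look identical, which mirrors how LSI compares documents of different lengths.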

18
Latent Semantic Indexing

Retrieved from http://lsi.research.telcordia.com/lsi-bin/lsiQuery
19
Latent Semantic Indexing
Romans 1:22 Professing themselves to be wise,
they became fools. Romans 16:6 Greet Mary, who
bestowed much labour on us. Matthew 24:22 And
except those days should be shortened, there
should no flesh be saved: but for the elect's
sake those days shall be shortened. John 3:17
For God sent not his Son into the world to
condemn the world; but that the world through him
might be saved.

20
Latent Semantic Indexing
  • (Deerwester, et al.)
  • The system was compared to:
  • Straight term matching
  • Voorhees' system
  • SMART
  • Using:
  • 1. a collection of medical abstracts (MED)
  • 2. a collection of information science abstracts (CISI)

21
Latent Semantic Indexing
  • Summary of analyses
  • LSI performed better than or equal to simple term
    matching
  • LSI was shown to be superior to system described
    by Voorhees
  • LSI performed better than or equal to SMART

22
Latent Semantic Indexing
  • Conclusion
  • LSI represents both terms and documents in the
    same space, which provides for the retrieval of
    relevant information.
  • LSI does not rely on literal matching and thus
    retrieves more relevant information than other
    methods.
  • LSI offers an adequate solution to the problem of
    synonymy but only a partial solution to the
    problem of polysemy.