The Vector Space Model - PowerPoint PPT Presentation

Provided by: Miketh8


1
The Vector Space Model
  • Part 3: Advanced Topics
  • Search engine indexes
  • Latent Semantic Indexing
  • Clustering

2
The Inverted Index
  • For fast querying, an inverted index is needed
  • This lists the documents that contain each word
  • Similarity only needs to be calculated for
    documents containing the query words
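
A minimal sketch of the idea on this slide, assuming whitespace tokenisation and a small hypothetical document collection (the function names `build_inverted_index` and `candidate_documents` are illustrative, not from the presentation):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def candidate_documents(index, query):
    """Return ids of documents containing at least one query word."""
    candidates = set()
    for word in query.lower().split():
        # .get avoids creating empty entries for unknown words
        candidates |= index.get(word, set())
    return candidates

# Hypothetical toy collection, for illustration only.
docs = {
    1: "dogs chase cats",
    2: "canines are loyal",
    3: "cats sleep all day",
}
index = build_inverted_index(docs)
print(sorted(candidate_documents(index, "cats dogs")))  # [1, 3]
```

Only documents 1 and 3 would then need a similarity score for this query; document 2 is never touched.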

3
Search Engine Overview
  • Need both normal (document-to-word) indexes
    and inverted (word-to-document) indexes
  • Carefully designed to minimise memory and
    maximise access speed
  • System must be designed to maximise speed in
    response to user queries
  • E.g. need to minimise or eliminate hard disk
    accesses
  • Efficient ranking through a combination of text
    and link mining

4
Latent Semantic Indexing
  • Users may describe their interests with
    different words from those used in a document
  • E.g. "dogs" vs. "canines" (synonyms)
  • Latent Semantic Indexing (LSI) is designed to
    overcome this

5
The LSI algorithm
  • Start with a matrix of documents and word
    frequencies
  • Apply Singular Value Decomposition (a matrix
    factorisation that, once truncated, reduces the
    number of dimensions)
  • Result: shorter document vectors, where the
    vector entries may represent concepts
  • E.g., ideally, one vector entry will represent
    occurrences of all synonyms of dog
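
The steps on this slide can be sketched with NumPy's `np.linalg.svd`. This is an illustration under stated assumptions, not the presenter's implementation: the document-word matrix is a hypothetical toy example, the helper name `lsi` is made up, and k = 2 is chosen by hand.

```python
import numpy as np

# Hypothetical toy data: rows are documents, columns are counts of the
# words dog, canine, cat, feline (chosen only for illustration).
A = np.array([
    [2, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

def lsi(matrix, k):
    """Truncated SVD: keep only the k largest singular values."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    doc_vectors = U[:, :k] * s[:k]  # each document as k concept weights
    word_map = Vt[:k, :]            # each column: one word as k concept weights
    return doc_vectors, word_map

doc_vectors, word_map = lsi(A, k=2)
print(A.shape, "->", doc_vectors.shape)  # (5, 4) -> (5, 2)

# In this toy example "dog" and "canine" get the same concept weights.
print(np.allclose(word_map[:, 0], word_map[:, 1]))  # True
```

Real collections are far less tidy than this one, so concepts rarely line up with synonym groups this cleanly.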

6
LSI advantages and disadvantages
  • Advantages:
    • May find relevant documents even if they do
      not contain the user's query words
    • Dimension reduction can speed up query-time
      calculations (and also clustering applications)
  • Disadvantages:
    • Slow to calculate and very complicated
    • May confuse users
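
To illustrate the first advantage, the sketch below compares a query with a document that shares none of its words, before and after reduction. The matrix, the vocabulary, and the choice of two concepts are assumptions made for this example; projecting the query with the same word-to-concept mapping is one common way to do this, not necessarily the one the presenter had in mind.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical counts of the words dog, canine, cat, feline per document.
A = np.array([
    [2, 0, 0, 0],   # document 0 only uses "dog"
    [1, 1, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)
query = np.array([0, 1, 0, 0], dtype=float)  # the single word "canine"

# Keep the 2 strongest concepts (k = 2 is an assumption for this toy data).
_, _, Vt = np.linalg.svd(A, full_matrices=False)
to_concepts = Vt[:2].T

raw = cosine(query, A[0])
reduced = cosine(query @ to_concepts, A[0] @ to_concepts)
print(round(raw, 3), round(reduced, 3))  # 0.0 1.0
```

In the original word space the query and document 0 have no words in common, so their similarity is zero; in the reduced space they fall on the same concept.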

7
Clustering
  • The VSM can be used to cluster documents by
    calculating their similarity using the cosine
    measure
  • Clustering techniques include:
    • k-means
    • multi-dimensional scaling
    • self-organising maps
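
A rough sketch of clustering with the cosine measure, assuming non-zero document vectors. It uses a k-means variant that assigns documents by cosine similarity (sometimes called spherical k-means), with the first k documents as starting centroids; the data and the function name `kmeans_cosine` are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def kmeans_cosine(X, k, iterations=10):
    """Cluster the rows of X, assigning each to its most similar centroid."""
    # Normalise rows so a dot product equals the cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = X[:k].copy()  # assumption: first k documents as seeds
    for _ in range(iterations):
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                mean = members.mean(axis=0)
                centroids[j] = mean / np.linalg.norm(mean)
    return labels

# Hypothetical word-count vectors for four documents.
X = np.array([
    [2, 1, 0, 0],
    [0, 0, 1, 2],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
], dtype=float)

print(round(cosine_similarity(X[0], X[2]), 3))  # 0.8
print(kmeans_cosine(X, k=2).tolist())           # [0, 1, 0, 1]
```

Seeding with the first k documents keeps the example deterministic; practical implementations usually choose the starting centroids more carefully.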

8
Clustering, continued
  • Clustering documents can help:
    • Show users what information is available
    • Show how their documents relate to others in
      the system
    • Present the results of search queries by topic

10
  • Baeza-Yates, Ribeiro-Neto. Modern Information
    Retrieval. Addison Wesley. (In library)
  • Salton. Introduction to Modern Information
    Retrieval. (In library)
  • Arasu, A., Cho, J., Garcia-Molina, H., Paepcke,
    A., Raghavan, S. (2001). Searching the Web. ACM
    Transactions on Internet Technology, 1(1), 2-43.
    (ACM Digital Library)
  • Brin, S., Page, L. (1998). The anatomy of a
    large scale hypertextual Web search engine.
    Computer Networks and ISDN Systems, 30(1-7),
    107-117. (Available on the web)
  • Chowdhury. Introduction to Modern Information
    Retrieval (2nd ed.). Facet Publishing. (Not in
    library)