Provided by: LIB160
1
Stemming Algorithms
2
Overview
  • Stemming provides searchers with ways of finding
    morphological variants of search terms
    • e.g., stemming, stemmed, stem
  • Stemming can be used to reduce the size of index
    files
  • Indexing-time stemming
    • Efficiency at search time and index file
      compression
    • Information about the full terms will be lost, or
      additional storage will be required to store both
      the stemmed and un-stemmed forms
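
A minimal sketch of indexing-time stemming. The stemmer (`toy_stem`) is a deliberately crude, hypothetical stand-in for a real one, and the three documents are made up for illustration. It shows the two points above: the stemmed index has fewer entries, and the full terms cannot be recovered from it.

```python
from collections import defaultdict

# Made-up documents, for illustration only.
DOCS = {
    1: "stemming reduces index size",
    2: "stemmed terms",
    3: "stem the terms",
}

def toy_stem(term):
    """Hypothetical crude stemmer: strip one suffix, then undouble."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            term = term[: -len(suffix)]
            break
    # Undouble a trailing double consonant ("stemm" -> "stem").
    if len(term) >= 2 and term[-1] == term[-2] and term[-1] not in "aeiou":
        term = term[:-1]
    return term

def build_index(docs, normalize=lambda t: t):
    """Map each (normalized) term to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

full_index = build_index(DOCS)               # one entry per surface form
stemmed_index = build_index(DOCS, toy_stem)  # one entry per stem

print(len(full_index), len(stemmed_index))   # 8 6
print(stemmed_index["stem"])                 # [1, 2, 3]
```

Keeping both `full_index` and `stemmed_index` preserves the full terms, at the cost of the additional storage mentioned above.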

3
Overview (Cont.)
  • Criteria for judging stemmers
    • Correctness
      • Over-stemming: too much of a term is removed
        • Can cause unrelated terms to be conflated →
          retrieval of non-relevant documents
      • Under-stemming: too little of a term is removed
        • Prevents related terms from being conflated →
          relevant documents will not be retrieved
    • Retrieval effectiveness: precision and recall
    • Compression performance
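
One way to make the correctness criterion concrete is to count conflation errors over word pairs. The sketch below assumes hand-made groups of related words (the grouping is an illustrative judgment, not data from the slides) and uses hypothetical truncation stemmers so that both kinds of error show up.

```python
from itertools import combinations

def conflation_errors(groups, stem):
    """Return (under, over) stemming error counts over all word pairs.

    groups: list of word lists; words in the same list count as related.
    stem:   any callable mapping a word to its stem.
    """
    group_of = {word: i for i, group in enumerate(groups) for word in group}
    under = over = 0
    for a, b in combinations(group_of, 2):
        related = group_of[a] == group_of[b]
        merged = stem(a) == stem(b)
        if related and not merged:
            under += 1      # related terms kept apart
        elif merged and not related:
            over += 1       # unrelated terms conflated
    return under, over

# Assumed relatedness judgments, for illustration only.
GROUPS = [
    ["connect", "connected", "connection"],
    ["universe", "universal"],
    ["university"],
]

def truncate(n):
    """Hypothetical stemmer: keep only the first n characters."""
    return lambda word: word[:n]

print(conflation_errors(GROUPS, truncate(4)))  # (0, 2): aggressive, over-stems
print(conflation_errors(GROUPS, truncate(8)))  # (4, 0): timid, under-stems
```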

4
Taxonomy for Stemming Algorithms
Conflation (Stemming) Methods
  • Manual
  • Automatic (Stemmers)
    • Affix Removal
      • Longest Match
      • Simple Removal
    • Successor Variety
    • Table Lookup
    • N-gram
5
Example of Search-Time Stemming
  • In the CATALOG system, terms are stemmed at
    search time rather than at indexing time
  • Provides a naïve system user with the advantage of
    term conflation while requiring little knowledge
    of the system or of searching techniques
  • User can turn off the stemmer
  • Having a user select the terms from the set found
    by the stemmer also reduces the likelihood of
    false matches
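
A rough sketch of search-time stemming in the spirit of the bullets above. It is not the CATALOG implementation: the index contents, the function names, and the five-character truncation stemmer are all hypothetical. The index stores full terms; the stemmer is applied only when a query arrives, can be switched off, and the user can narrow the candidate terms.

```python
# Hypothetical unstemmed index: full term -> document ids.
INDEX = {
    "compute": [1],
    "computer": [2, 3],
    "computing": [3],
    "compulsory": [4],
    "company": [5],
}

def crude_stem(term):
    """Hypothetical stemmer: keep the first five characters."""
    return term[:5]

def candidate_terms(query_term, index, stem=crude_stem):
    """Index terms that share a stem with the query term."""
    target = stem(query_term)
    return sorted(t for t in index if stem(t) == target)

def search(query_term, index, use_stemmer=True, selected=None):
    """Return matching document ids.

    use_stemmer=False gives exact matching (stemmer turned off).
    selected, if given, is the subset of candidates the user chose.
    """
    if use_stemmer:
        terms = candidate_terms(query_term, index)
        if selected is not None:
            terms = [t for t in terms if t in selected]
    else:
        terms = [query_term] if query_term in index else []
    return sorted({doc for t in terms for doc in index[t]})

print(candidate_terms("computers", INDEX))
# ['compulsory', 'compute', 'computer', 'computing']
print(search("computers", INDEX))                      # [1, 2, 3, 4]
print(search("computers", INDEX, use_stemmer=False))   # []
print(search("computers", INDEX,
             selected={"compute", "computer", "computing"}))  # [1, 2, 3]
```

Dropping "compulsory" from the selection removes the false match on document 4.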

6
Table Lookup
  • Store a table of all index terms and their stems
  • Terms from queries and indexes could then be
    stemmed via table lookup
  • Problems
  • No such data for English
  • Domain-dependent vocabulary may not use standard
    English
  • Storage overhead
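
A minimal sketch of a table-lookup stemmer. The table entries are made up for illustration; as noted above, no complete table of this kind is available for English.

```python
# Hypothetical term -> stem table; a real one would need every index term.
STEM_TABLE = {
    "stemming": "stem",
    "stemmed": "stem",
    "stems": "stem",
    "indexes": "index",
    "indices": "index",
}

def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the lowercased term itself if it is unknown."""
    term = term.lower()
    return table.get(term, term)

print(lookup_stem("Stemming"))  # stem
print(lookup_stem("indices"))   # index
print(lookup_stem("genomes"))   # genomes (not in the table, so left unstemmed)
```

The last call shows the domain-vocabulary problem: a term missing from the table passes through unchanged.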

7
Affix Removal Stemmers
  • Affix removal algorithms remove suffixes and/or
    prefixes from terms leaving a stem
  • Most stemmers are iterative longest match
    stemmers
  • Remove the longest possible string of characters
    from a word according to a set of rules
  • This process is repeated until no more characters
    can be removed
  • The Porter algorithm is an affix removal stemmer
  • It consists of a set of condition/action rules
  • Conditions on the stem, conditions on the suffix,
    and conditions on the rules
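
The iterative longest-match idea can be sketched as follows. The suffix list and the minimum stem length are hypothetical choices made for this example; real stemmers of this kind use much larger rule sets with per-rule conditions.

```python
# Hypothetical suffix list, for illustration only.
SUFFIXES = ["ness", "ing", "ful", "ed", "ly", "s"]

def strip_longest(word, suffixes=SUFFIXES, min_stem=3):
    """Remove the longest matching suffix that leaves at least min_stem letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def iterative_longest_match(word, suffixes=SUFFIXES, min_stem=3):
    """Repeat longest-match removal until no more characters can be removed."""
    while True:
        shorter = strip_longest(word, suffixes, min_stem)
        if shorter == word:
            return word
        word = shorter

print(iterative_longest_match("hopefulness"))  # hope  (ness, then ful)
print(iterative_longest_match("cheerfully"))   # cheer (ly, then ful)
print(iterative_longest_match("sing"))         # sing  (stem would be too short)
```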

8
Porter's Algorithm -- Conditions
  • Stem conditions
    • Measure m of a stem, where the stem has the form
      [C](VC)^m[V]
      • V: a sequence of vowels (a, e, i, o, u, and y if
        preceded by a consonant)
      • C: a sequence of consonants
      • [ ]: optional occurrence
    • *<X>: the stem ends with a given letter X
    • *v*: the stem contains a vowel
    • *d: the stem ends in a double consonant
    • *o: the stem ends with a consonant-vowel-consonant
      sequence, where the final consonant is not w,
      x, or y
  • Suffix conditions: (current_suffix == pattern)
  • Rule conditions: (rule was used)
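
The stem conditions other than the measure can be sketched as small predicates. The function names are chosen for this example and input is assumed to be lowercase; the treatment of y follows the vowel definition given above.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """True if word[i] is a consonant; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def ends_with_letter(stem, x):
    """*<X>: the stem ends with the letter x."""
    return stem.endswith(x)

def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    """*d: the stem ends in a double consonant."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):
    """*o: ends consonant-vowel-consonant, final consonant not w, x, or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

print(contains_vowel("bl"), contains_vowel("sky"))               # False True
print(ends_double_consonant("hopp"), ends_double_consonant("hope"))  # True False
print(ends_cvc("hop"), ends_cvc("snow"))                         # True False
```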

9
Measure Example
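
The slide's own example did not survive in this transcript. As a stand-in, here is a small sketch that computes the measure m of a stem, that is, the number of vowel-consonant sequences (VC) in the form [C](VC)^m[V]. The function names are chosen for this example, and the sample words are the ones used in Porter's published description of the algorithm.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """True if word[i] is a consonant; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the vowel-run to consonant-run transitions in a lowercase stem."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cur_is_consonant = is_consonant(stem, i)
        if prev_is_vowel and cur_is_consonant:
            m += 1          # a VC sequence has just been completed
        prev_is_vowel = not cur_is_consonant
    return m

for word in ["tr", "ee", "tree", "y", "by"]:
    print(word, measure(word))       # all m = 0
for word in ["trouble", "oats", "trees", "ivy"]:
    print(word, measure(word))       # all m = 1
for word in ["troubles", "private", "oaten", "orrery"]:
    print(word, measure(word))       # all m = 2
```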
10
Porter's Algorithm -- Actions and Steps
  • Actions are rewrite rules of the form
    • old_suffix → new_suffix
  • Rules are divided into steps
  • The rules in a step are examined in sequence, and
    only one rule from a step can apply
  • The longest possible suffix is always removed
    because of the ordering of the rules within a step
  • Algorithm
    • step1a(word)
    • step1b(stem)
    • if (the second or third rule of step 1b was
      used)
      • step1b1(stem)
    • step1c(stem)
    • step2(stem)
    • step3(stem)
    • step4(stem)
    • step5a(stem)
    • step5b(stem)
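
A sketch of how one step can be driven by an ordered rule list. The slides do not list the individual rules; the four rules below are those of step 1a in Porter's published description, and `apply_step` is a name chosen for this example. Step 1a has no stem conditions, so each rule is just an old suffix and its replacement.

```python
# Step 1a rewrite rules (old_suffix, new_suffix), longest suffix first.
STEP_1A = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]

def apply_step(word, rules):
    """Examine the rules in sequence and apply only the first that matches."""
    for old, new in rules:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word

def step1a(word):
    return apply_step(word, STEP_1A)

print(step1a("caresses"))  # caress
print(step1a("ponies"))    # poni
print(step1a("caress"))    # caress (the ss rule fires, so s is not removed)
print(step1a("cats"))      # cat
```

Because the rules are ordered from longest suffix to shortest and only one may fire, "caress" is protected by the ss rule rather than losing its final s.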

11
Summary
  • Stemming will, in general, increase recall at the
    cost of decreased precision
  • Studies of the effects of stemming on retrieval
    effectiveness are equivocal, but in general
    stemming has either no effect, or a positive
    effect, on retrieval performance where the
    measures used include both recall and precision
  • Stemming can have a marked effect on the size of
    index files, sometimes decreasing the size of
    the file by as much as 50 percent