INSYS 300 Text Analysis - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
INSYS 300: Text Analysis
2
Indexing for Actual Texts
  • People generally want retrieval based on
    concepts.
  • Computers work with tokens (they don't care
    about meaning).
  • Words are somewhere in between.
  • We want to extract the meaningful words but deal
    with them as tokens.

3
Some Types of Text Analysis for Text Search
  • Word (token) extraction
  • Stop words
  • Stemming
  • Frequency counts and Zipf's Law

4
Stop Words
  • Many of the most frequently used words in English
    are worthless in searching: they generally
    glue sentences together but usually don't
    carry meaning.
  • We ignore them in making an index, so we call them
    stop words.
  • Examples: the, of, and, to, ...
  • Why do we need to remove stop words?
  • Reduce indexing file size
  • stop words account for 20-30% of total word counts
  • Improve efficiency
  • stop words are not useful for searching
  • stop words always have a large number of hits
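The filtering step described above can be sketched in a few lines of Python; the stop list here is a tiny illustrative sample, not a standard list:

```python
# Sketch of stop-word removal during indexing.
# STOP_WORDS is an illustrative sample, not a standard list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it"}

def index_terms(text):
    """Tokenize and drop stop words before indexing."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(index_terms("The analysis of text is a key step in search"))
# -> ['analysis', 'text', 'key', 'step', 'search']
```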

5
Stop Words (Cont'd)
  • Potential problems of removing stop words
  • A small stop list does not improve indexing much
  • A large stop list may eliminate words that
    might be useful for someone or for some purpose
  • Stop words might be part of phrases
  • Stop words need to be processed for both indexing
    and queries.

6
Stemming
  • In English we have many forms of a root word.
  • Techniques are used to find the root/stem of a
    word:
  • lookup   user 15        engineering 12
             users 4        engineered 23
             used 5         engineer 12
             using 5
  • stem     use            engineer

7
Advantages of Stemming
  • Improving effectiveness
  • matching similar words
  • Reducing index size
  • combining words with the same root may reduce
    index size by as much as 40-50%
  • Criteria for stemming
  • correctness
  • retrieval effectiveness
  • compression performance

8
Basic Stemming Methods
  • We want a simple form of morphological analysis,
    so we do stemming with tables and rules.
  • Remove endings (such as -s, -ing):
  • if a word ends with a consonant other than "s",
    followed by an "s", then delete the "s".
  • if a word ends in "es", drop the "s".
  • if a word ends in "ing", delete the "ing" unless
    the remaining word consists of only one letter
    or of "th".
  • if a word ends with "ed", preceded by a consonant,
    delete the "ed" unless this leaves only a single
    letter.
  • ...

9
Basic Stemming (Cont'd)
  • Transform the remainder:
  • if a word ends with "ies" but not "eies" or
    "aies", then "ies" --> "y".
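The removal and transform rules above can be collected into one function; a sketch where the order the rules are tried (most specific first) is my own assumption:

```python
# Sketch of the table/rule stemming described in the slides.
# Rule ordering (most specific first) is an assumption.
VOWELS = set("aeiou")

def simple_stem(word):
    # "ies" (but not "eies"/"aies") -> "y"
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # ends in "es": drop the "s"
    if word.endswith("es"):
        return word[:-1]
    # consonant (other than "s") followed by "s": delete the "s"
    if (word.endswith("s") and len(word) > 1
            and word[-2] not in VOWELS and word[-2] != "s"):
        return word[:-1]
    # "ing": delete unless one letter or "th" would remain
    if word.endswith("ing") and len(word) > 4 and word[:-3] != "th":
        return word[:-3]
    # consonant + "ed": delete "ed" unless a single letter would remain
    if (word.endswith("ed") and len(word) > 3
            and word[-3] not in VOWELS):
        return word[:-2]
    return word

print(simple_stem("flies"), simple_stem("users"), simple_stem("engineered"))
# -> fly user engineer
```

Note that such rules over-stem some words (e.g. "using" becomes "us", not "use"); the slide's lookup table conflates these under one stem.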

10
Porter Stemming Algorithm
  • A set of condition/action rules
  • conditions on the stem
  • conditions on the suffix
  • conditions on the rules
  • different combinations of conditions activate
    different rules.
  • Implementation (stem.c)
  • Stem(word)
  •   ...
  •   ReplaceEnd(word, step1a_rule)
  •   rule = ReplaceEnd(word, step1b_rule)
  •   if ((rule == 106) || (rule == 107))
  •     ReplaceEnd(word, 1b1_rule)

11
Soundex: Sound-Based Coding
  • Soundex rules
  • Letter                       Numeric equivalent
  • B, F, P, V                   1
  • C, G, J, K, Q, S, X, Z       2
  • D, T                         3
  • L                            4
  • M, N                         5
  • R                            6
  • A, E, H, I, O, U, W, Y       not coded
  • Robert --> 6163
  • Words which sound similar often have the same codes.
  • A high compression rate
  • Robert: 6 letters x 8 bits versus 6163: 4 digits x 4 bits
  • The code is not unique (bad/bed)
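A minimal encoder for the table above; note it follows the slide's convention of coding every letter numerically (giving Robert -> 6163), whereas standard Soundex keeps the first letter (R163):

```python
# Sketch of the slide's sound-based coding table.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def soundex_like(word):
    digits = []
    prev = None
    for ch in word.upper():
        code = CODES.get(ch)           # vowels, H, W, Y are not coded
        if code and code != prev:      # collapse adjacent duplicates
            digits.append(code)
        prev = code
    return "".join(digits)

print(soundex_like("Robert"))                           # -> 6163
print(soundex_like("bad"), soundex_like("bed"))         # same code: not unique
```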

12
Morphological Analysis
  • In linguistics, finding the root form of a word
    is "morphological analysis" (morph == shape).
  • English is very odd:
  • sanction: to approve
  • sanction: to forbid

13
Example 3: N-gram Coding
  • A (letter) n-gram is n consecutive letters.
  • A di-gram (or bi-gram) is 2 consecutive letters.
  • A tri-gram is 3 consecutive letters.
  • All di-grams of the word "statistics" are:
  • st ta at ti is st ti ic cs
  • Unique: at cs ic is st ta ti
  • All di-grams of "statistical" are:
  • st ta at ti is st ti ic ca al
  • Unique: al at ca ic is st ta ti
  • Note the similarity to the di-grams for
    "statistics".
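Extracting di-grams is a one-liner; a sketch reproducing the "statistics" example above:

```python
# Sketch: extract the di-grams (2-letter n-grams) of a word.
def digrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

print(digrams("statistics"))
# -> ['st', 'ta', 'at', 'ti', 'is', 'st', 'ti', 'ic', 'cs']
print(sorted(set(digrams("statistics"))))
# -> ['at', 'cs', 'ic', 'is', 'st', 'ta', 'ti']
```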

14
Calculating N-Gram Similarity
  • The similarity of two words can be calculated by
  • S = 2C / (A + B)
  • where
  • A is the number of unique di-grams in the first
    word
  • B is the number of unique di-grams in the second
    word
  • C is the number of unique di-grams shared by the
    two words
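With A, B, and C as defined above, a common instantiation is the Dice coefficient S = 2C / (A + B); the slide's formula image did not survive, so the Dice form is my assumption:

```python
# Sketch: di-gram similarity via the Dice coefficient S = 2C / (A + B),
# where A, B are unique di-gram counts and C is the shared count.
def digram_set(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def similarity(w1, w2):
    a, b = digram_set(w1), digram_set(w2)
    return 2 * len(a & b) / (len(a) + len(b))

print(similarity("statistics", "statistical"))
# -> 0.8  (A = 7, B = 8, C = 6)
```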

15
Letter and Word Frequency Counts
  • The idea
  • the best a computer can do is count
  • count the number of times a word occurs in a
    document
  • count the number of documents in a collection
    that contain a word
  • Use occurrence frequencies to indicate the relative
    importance of a word in a document:
  • if a word appears often in a document, the
    document likely deals with subjects related to
    the word.
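Both counts (within-document term frequency and per-collection document frequency) are one pass with a Counter; a sketch over a toy collection:

```python
# Sketch: per-document term counts and collection-wide document frequency.
from collections import Counter

docs = [
    "text analysis of text data",
    "word frequency counts in text",
    "stemming reduces word forms",
]

term_freqs = [Counter(d.split()) for d in docs]   # counts within each document
doc_freq = Counter()                              # documents containing each word
for tf in term_freqs:
    doc_freq.update(tf.keys())

print(term_freqs[0]["text"])   # -> 2 ("text" occurs twice in doc 0)
print(doc_freq["text"])        # -> 2 ("text" appears in two documents)
```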

16
Frequency Counts (Cont'd)
  • Using occurrence frequencies to select the most
    useful words to index a document collection:
  • if a word appears in every document, it is not a
    good indexing word
  • if a word appears in only one or two documents,
    it may not be a good indexing word

17
Effects of Frequency
  • Use frequency data to
  • decide the head threshold (words too common to index)
  • decide the tail threshold (words too rare to index)
  • decide the variance of counting

18
More about Counting
  • Zipf's Law
  • in a large, well-written English document,
  • r × f = c
  • where
  • r is the rank of a word when words are ordered
    by frequency,
  • f is the number of times the given word is used
    in the document, and
  • c is a constant.
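The relation r × f ≈ c can be checked against the Brown-corpus figures on the next slide; a sketch using three of those rows:

```python
# Sketch: check Zipf's law r * f ≈ c on a few (word, rank, relative
# frequency) rows taken from the slide's Brown-corpus table.
brown = [("the", 1, 0.070), ("of", 2, 0.036), ("to", 4, 0.026)]

for word, r, f in brown:
    # the products stay within the same rough range, as Zipf predicts
    print(word, round(r * f, 3))
```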

19
Many Properties of Text Approximate Zipf's Law
  • Word frequencies (from the Brown Corpus)
  • Word    Rank    Freq     Rank × Freq
  • the     1       0.070    0.070
  • of      2       0.036    0.072
  • and     3       0.029    0.057
  • to      4       0.026    0.104
  • a       5       0.023    0.065
  • in      6       0.021    0.126
  • that    7       0.011    0.077
  • is      8       0.010    0.080
  • was     9       0.010    0.090
  • he      10      0.010    0.100

20
Many Properties of Text Approximate Zipf's Law
  • English Letter Usage Statistics
  • Letter use frequencies (count, percent)
  • E  72881  12.4%
  • T  52397   8.9%
  • A  47072   8.0%
  • O  45116   7.6%
  • N  41316   7.0%
  • I  39710   6.7%
  • H  38334   6.5%

21
Many Properties of Text Approximate Zipf's Law
  • Doubled letter frequencies (count, percent)
  • LL  2979  20.6%
  • EE  2146  14.8%
  • SS  2128  14.7%
  • OO  2064  14.3%
  • TT  1169   8.1%
  • RR  1068   7.4%
  • --   701   4.8%
  • PP   628   4.3%
  • FF   430   2.9%

22
Many Properties of Text Approximate Zipf's Law
  • Initial letter frequencies (count, percent)
  • T  20665  15.2%
  • A  15564  11.4%
  • H  11623   8.5%
  • W   9597   7.0%
  • I   9468   6.9%
  • S   9376   6.9%
  • O   8205   6.0%
  • M   6293   4.6%
  • B   5831   4.2%

23
Many Properties of Text Approximate Zipf's Law
  • Ending letter frequencies (count, percent)
  • E  26439  19.4%
  • D  17313  12.7%
  • S  14737  10.8%
  • T  13685  10.0%
  • N  10525   7.7%
  • R   9491   6.9%
  • Y   7915   5.8%
  • O   6226   4.5%

24
Summary: Automatic Indexing
  • 1. Parse (divide up) individual words (tokens).
  • 2. Remove stop words.
  • 3. Do stemming.
  • 4. Create the indexing structure:
  • inverted indexing
  • other structures
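The four steps above can be sketched end-to-end; a minimal Python sketch with a toy stop list and a naive plural-stripping stemmer standing in for the real components:

```python
# Sketch of the indexing pipeline: tokenize, remove stop words, stem,
# build an inverted index. STOP and naive_stem are toy stand-ins.
STOP = {"the", "of", "and", "a"}

def naive_stem(t):
    # toy stemmer: strip a trailing plural "s"
    return t[:-1] if t.endswith("s") and len(t) > 3 else t

def build_index(docs):
    index = {}
    for doc_id, text in enumerate(docs):
        for tok in text.lower().split():               # 1. parse tokens
            if tok in STOP:                            # 2. drop stop words
                continue
            term = naive_stem(tok)                     # 3. stem
            index.setdefault(term, set()).add(doc_id)  # 4. invert
    return index

idx = build_index(["The users of the system", "a system engineer"])
print(idx)
# -> {'user': {0}, 'system': {0, 1}, 'engineer': {1}}
```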

25
Frequencies and Vector Weighting
  • A document is represented as a vector (Salton):
  • (W1, W2, ..., Wn)
  • Binary weights:
  • Wi = 1 if the corresponding term is in the
    document
  • Wi = 0 if the term is not in the document
  • TF (Term Frequency):
  • Wi = tfi, where tfi is the number of times
    term i occurs in the document
  • TF-IDF (Term Frequency × Inverse Document Frequency):
  • Wi = tfi × idfi = tfi × (1 + log(N / dfi)), where dfi is the
    number of documents containing term i, and N is
    the total number of documents in the collection.
  • Could have other weights, e.g.
  • extra weight if a term is in the title of a
    document
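The TF-IDF weight Wi = tfi × (1 + log(N / dfi)) can be computed directly; a sketch over a toy three-document collection:

```python
# Sketch: TF-IDF weighting w_i = tf_i * (1 + log(N / df_i))
# over a toy collection.
import math
from collections import Counter

docs = [
    "text analysis of text",
    "word counts in text",
    "stemming of word forms",
]
N = len(docs)
tfs = [Counter(d.split()) for d in docs]
df = Counter()                      # document frequency of each term
for tf in tfs:
    df.update(tf.keys())

def tfidf(term, doc_id):
    tf = tfs[doc_id][term]
    return tf * (1 + math.log(N / df[term])) if tf else 0.0

print(round(tfidf("text", 0), 3))   # -> 2.811  (tf=2, df=2, N=3)
```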

26
Importance of Associations
  • Counting word pairs
  • If two words appear together very often
    (collocation), they are likely to be a phrase.
  • Counting document pairs
  • If two documents have many words in common, they
    are likely related.

27
More on Associations
  • Counting citation pairs
  • If documents A and B both cite documents C and D,
    then A and B might be related.
  • If documents C and D are often cited together,
    they are likely related.
  • Counting link patterns
  • Get all pages that link to my pages.
  • Get all pages that contain links similar to
    those in my pages.
  • PageRank (Google)
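Co-citation counting, as described above, is a pair count over each document's reference list; a sketch with a toy citation graph:

```python
# Sketch: co-citation counting — documents often cited together are
# likely related. The citation graph here is a toy example.
from itertools import combinations
from collections import Counter

citations = {                 # document -> set of documents it cites
    "A": {"C", "D"},
    "B": {"C", "D", "E"},
    "F": {"C"},
}

cocited = Counter()
for cited in citations.values():
    for pair in combinations(sorted(cited), 2):
        cocited[pair] += 1    # count each co-cited pair once per citer

print(cocited[("C", "D")])    # -> 2 (C and D are cited together by A and B)
```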