Title: INSYS 300 Text Analysis
2. Indexing for Actual Texts
- People generally want retrieval based on concepts.
- Computers work with tokens (they don't care about meaning).
- Words are somewhere in between: we want to get the meaningful words but deal with them as tokens.
3. Some Types of Text Analysis for Text Search
- Word (token) extraction
- Stop words
- Stemming
- Frequency counts and Zipf's Law
4. Stop Words
- Many of the most frequently used words in English are worthless in searching: they generally glue sentences together but usually don't carry meaning.
- We ignore them when building an index, so we call them stop words.
- Examples: the, of, and, to, ...
- Why do we need to remove stop words?
  - Reduce indexing file size
    - Stop words account for 20-30% of total word counts.
  - Improve efficiency
    - Stop words are not useful for searching.
    - Stop words always have a large number of hits.
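- A minimal sketch of stop-word filtering during indexing; the six-word stop list and the tokens are illustrative only:

    #include <stdio.h>
    #include <string.h>

    /* Illustrative stop list; real systems use much larger lists. */
    static const char *STOP_WORDS[] = { "the", "of", "and", "to", "a", "in" };
    static const int N_STOP = sizeof(STOP_WORDS) / sizeof(STOP_WORDS[0]);

    /* Return 1 if the token is a stop word, 0 otherwise. */
    static int is_stop_word(const char *token)
    {
        for (int i = 0; i < N_STOP; i++)
            if (strcmp(token, STOP_WORDS[i]) == 0)
                return 1;
        return 0;
    }

    int main(void)
    {
        const char *tokens[] = { "the", "index", "of", "a", "collection" };
        for (int i = 0; i < 5; i++)
            if (!is_stop_word(tokens[i]))
                printf("index term: %s\n", tokens[i]);  /* keeps "index", "collection" */
        return 0;
    }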
5. Stop Words (Cont'd)
- Potential problems of removing stop words
  - A small stop list does not improve indexing much.
  - A large stop list may eliminate words that are useful for some users or some purposes.
  - Stop words might be part of phrases (e.g., "to be or not to be").
  - Stop words need to be processed in both indexing and queries.
6. Stemming
- In English we have many forms of a root word.
- Stemming techniques find the root/stem of a word.
- Example: lookup counts for "use" and "engineer" variants:

    user   15     engineering  12
    users   4     engineered   23
    used    5     engineer     12
    using   5
    stem: use     stem: engineer
7. Advantages of Stemming
- Improving effectiveness
  - matching similar words
- Reducing index size
  - Combining words with the same root may reduce index size by as much as 40-50%.
- Criteria for stemming
  - correctness
  - retrieval effectiveness
  - compression performance
8. Basic Stemming Methods
- We want a simple form of morphological analysis, so we do stemming with tables and rules.
- Remove endings (such as "s", "ing"):
  - If a word ends with a consonant other than "s", followed by an "s", then delete the "s".
  - If a word ends in "es", drop the "s".
  - If a word ends in "ing", delete the "ing" unless the remaining word consists of only one letter or of "th".
  - If a word ends with "ed", preceded by a consonant, delete the "ed" unless this leaves only a single letter.
  - ...
9. Basic Stemming (Cont'd)
- Transform the remainder:
  - If a word ends with "ies" but not "eies" or "aies", then ies --> y.
- A sketch of these rules as code follows.
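- A sketch of the removal and transformation rules above in C; the rule ordering (most specific suffix first) is an assumption, since the slides do not give one:

    #include <stdio.h>
    #include <string.h>

    static int is_vowel(char c)
    {
        return strchr("aeiou", c) != NULL;
    }

    /* Return 1 if word ends with the given suffix. */
    static int ends_with(const char *w, const char *suf)
    {
        size_t lw = strlen(w), ls = strlen(suf);
        return lw >= ls && strcmp(w + lw - ls, suf) == 0;
    }

    static void stem(char *w)
    {
        size_t n = strlen(w);

        /* "ies" (but not "eies"/"aies") --> "y" */
        if (ends_with(w, "ies") && !ends_with(w, "eies") && !ends_with(w, "aies")) {
            strcpy(w + n - 3, "y");
        /* "es" --> drop the "s" */
        } else if (ends_with(w, "es")) {
            w[n - 1] = '\0';
        /* consonant (other than "s") + "s" --> drop the "s" */
        } else if (n >= 2 && w[n - 1] == 's' && w[n - 2] != 's' && !is_vowel(w[n - 2])) {
            w[n - 1] = '\0';
        /* "ing" --> drop unless only one letter or "th" would remain */
        } else if (ends_with(w, "ing") && n > 4 && !(n == 5 && ends_with(w, "thing"))) {
            w[n - 3] = '\0';
        /* consonant + "ed" --> drop unless a single letter would remain */
        } else if (ends_with(w, "ed") && n > 3 && !is_vowel(w[n - 3])) {
            w[n - 2] = '\0';
        }
    }

    int main(void)
    {
        char words[][16] = { "flies", "boxes", "cats", "using", "thing", "engineered" };
        for (int i = 0; i < 6; i++) {
            stem(words[i]);
            printf("%s\n", words[i]);  /* fly, boxe, cat, us, thing, engineer */
        }
        return 0;
    }

- Note how crude the rules are: "boxes" becomes "boxe" and "using" becomes "us", which is why stemmers like Porter's add more conditions.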
10. Porter Stemming Algorithm
- A set of condition/action rules
  - conditions on the stem
  - conditions on the suffix
  - conditions on the rules
  - Different combinations of conditions activate different rules.
- Implementation (stem.c):

    Stem(word)
        ...
        ReplaceEnd(word, step1a_rule);
        rule = ReplaceEnd(word, step1b_rule);
        if ((rule == 106) || (rule == 107))
            ReplaceEnd(word, 1b1_rule);
        ...
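- A minimal sketch of such a condition/action rule table in C; the rule IDs, suffixes, and minimum-stem conditions below are illustrative stand-ins, not Porter's actual tables:

    #include <stdio.h>
    #include <string.h>

    /* One condition/action rule: if the word ends with old_suffix and the
       remaining stem is long enough, replace the suffix with new_suffix. */
    struct rule {
        int id;                 /* rule identifier returned on a match   */
        const char *old_suffix; /* condition on the suffix               */
        const char *new_suffix; /* action: replacement text              */
        int min_stem;           /* condition on the stem: minimum length */
    };

    /* Illustrative step-1 rules; Porter's real tables are larger. */
    static const struct rule step1_rules[] = {
        { 101, "sses", "ss", 1 },
        { 102, "ies",  "i",  1 },
        { 106, "ed",   "",   2 },
        { 107, "ing",  "",   2 },
        { 0, NULL, NULL, 0 }
    };

    /* Apply the first matching rule; return its id, or 0 if none matched. */
    static int ReplaceEnd(char *word, const struct rule *rules)
    {
        size_t n = strlen(word);
        for (const struct rule *r = rules; r->old_suffix; r++) {
            size_t ls = strlen(r->old_suffix);
            if (n >= ls + (size_t)r->min_stem &&
                strcmp(word + n - ls, r->old_suffix) == 0) {
                strcpy(word + n - ls, r->new_suffix);
                return r->id;
            }
        }
        return 0;
    }

    int main(void)
    {
        char word[32] = "caresses";
        int rule = ReplaceEnd(word, step1_rules);  /* rule 101: sses -> ss */
        printf("%s (rule %d)\n", word, rule);      /* caress (rule 101)    */
        return 0;
    }

- Each table entry carries all three kinds of condition, and the returned rule id is what lets later steps (like the 1b1 rules above) depend on which rule fired.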
11. Soundex: Sound-Based Coding
- Soundex rules (letter --> numeric equivalent):
  - B, F, P, V                  1
  - C, G, J, K, Q, S, X, Z      2
  - D, T                        3
  - L                          4
  - M, N                        5
  - R                          6
  - A, E, H, I, O, U, W, Y      not coded
- Example: Robert --> 6163
- Words which sound similar often have the same code.
- A high compression rate: "Robert" as text takes 6 x 8 bits, while the code 6163 needs only 3-4 bits per digit.
- The code is not unique ("bad" and "bed" get the same code, 13).
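- A sketch of this coding in C, following the table above. Like the "Robert --> 6163" example, it codes every letter; full Soundex also keeps the first letter and collapses adjacent repeats, refinements omitted here:

    #include <ctype.h>
    #include <stdio.h>

    /* Numeric equivalent for each letter per the table above; 0 = not coded. */
    static int soundex_digit(char c)
    {
        switch (toupper((unsigned char)c)) {
        case 'B': case 'F': case 'P': case 'V':
            return 1;
        case 'C': case 'G': case 'J': case 'K':
        case 'Q': case 'S': case 'X': case 'Z':
            return 2;
        case 'D': case 'T':
            return 3;
        case 'L':
            return 4;
        case 'M': case 'N':
            return 5;
        case 'R':
            return 6;
        default:
            return 0;  /* vowels, H, W, Y: not coded */
        }
    }

    /* Code a word by mapping each letter; uncoded letters are dropped. */
    static void soundex(const char *word, char *code)
    {
        while (*word) {
            int d = soundex_digit(*word++);
            if (d)
                *code++ = (char)('0' + d);
        }
        *code = '\0';
    }

    int main(void)
    {
        char code[32];
        soundex("Robert", code);
        printf("Robert -> %s\n", code);  /* Robert -> 6163 */
        soundex("bad", code);
        printf("bad -> %s\n", code);     /* 13, same as "bed": not unique */
        return 0;
    }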
12. Morphological Analysis
- In linguistics, finding the root form of a word is morphological analysis ("morph" = shape).
- English is very odd; some words are their own opposites:
  - sanction: to approve
  - sanction: to forbid
13. Example 3: N-gram Coding
- A (letter) n-gram is n consecutive letters.
  - A di-gram (or bi-gram) is 2 consecutive letters.
  - A tri-gram is 3 consecutive letters.
- All di-grams of the word "statistics":
  - st ta at ti is st ti ic cs
  - Unique: at cs ic is st ta ti
- All di-grams of "statistical":
  - st ta at ti is st ti ic ca al
  - Unique: al at ca ic is st ta ti
- Note the similarity to the di-grams for "statistics".
14. Calculating N-Gram Similarity
- The similarity of two words can be calculated (Dice's coefficient) as

    S = 2C / (A + B)

- where
  - A is the number of unique di-grams in the first word,
  - B is the number of unique di-grams in the second word,
  - C is the number of unique di-grams shared by the two words.
- A sketch of the calculation follows.
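- A sketch of the calculation in C, using the "statistics"/"statistical" example from the previous slide:

    #include <stdio.h>
    #include <string.h>

    #define MAX_GRAMS 64

    /* Collect the unique di-grams of a word; return how many were found. */
    static int unique_digrams(const char *w, char grams[][3])
    {
        int count = 0;
        for (size_t i = 0; i + 1 < strlen(w); i++) {
            char g[3] = { w[i], w[i + 1], '\0' };
            int seen = 0;
            for (int j = 0; j < count; j++)
                if (strcmp(grams[j], g) == 0) { seen = 1; break; }
            if (!seen && count < MAX_GRAMS)
                strcpy(grams[count++], g);
        }
        return count;
    }

    /* Dice similarity: S = 2C / (A + B). */
    static double digram_similarity(const char *w1, const char *w2)
    {
        char g1[MAX_GRAMS][3], g2[MAX_GRAMS][3];
        int a = unique_digrams(w1, g1);
        int b = unique_digrams(w2, g2);
        int c = 0;
        for (int i = 0; i < a; i++)
            for (int j = 0; j < b; j++)
                if (strcmp(g1[i], g2[j]) == 0) { c++; break; }
        return (2.0 * c) / (a + b);
    }

    int main(void)
    {
        /* A = 7, B = 8, C = 6 shared, so S = 12/15 = 0.8 */
        printf("%.2f\n", digram_similarity("statistics", "statistical"));
        return 0;
    }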
15. Letter and Word Frequency Counts
- The idea: the best a computer can do is count.
  - Count the number of times a word occurs in a document.
  - Count the number of documents in a collection that contain a word.
- Use occurrence frequencies to indicate the relative importance of a word in a document:
  - If a word appears often in a document, the document likely deals with subjects related to that word.
16. Frequency Counts (Cont'd)
- Use occurrence frequencies to select the most useful words for indexing a document collection:
  - If a word appears in every document, it is not a good indexing word.
  - If a word appears in only one or two documents, it may not be a good indexing word.
- A sketch of selecting index words by document frequency follows.
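- A minimal sketch of this selection in C; the toy three-document collection and the exact thresholds (df > 1 and df < N) are illustrative assumptions:

    #include <stdio.h>
    #include <string.h>

    /* Illustrative toy collection; each document is a NULL-terminated word list. */
    #define N_DOCS 3
    static const char *docs[N_DOCS][4] = {
        { "cat", "sat", "mat", NULL },
        { "cat", "ran", NULL,  NULL },
        { "cat", "mat", "hat", NULL },
    };

    /* Document frequency: in how many documents does the word appear? */
    static int doc_freq(const char *word)
    {
        int df = 0;
        for (int d = 0; d < N_DOCS; d++)
            for (int w = 0; docs[d][w]; w++)
                if (strcmp(docs[d][w], word) == 0) { df++; break; }
        return df;
    }

    int main(void)
    {
        const char *candidates[] = { "cat", "mat", "ran" };
        for (int i = 0; i < 3; i++) {
            int df = doc_freq(candidates[i]);
            /* Keep words that are neither everywhere nor nearly nowhere. */
            if (df > 1 && df < N_DOCS)
                printf("good index word: %s (df=%d)\n", candidates[i], df);
            else
                printf("poor index word: %s (df=%d)\n", candidates[i], df);
        }
        return 0;
    }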
17. Effects of Frequency
- Use frequency data to
  - decide a head threshold (cut off the most frequent words),
  - decide a tail threshold (cut off the rarest words),
  - decide the variance of the counts.
18. More about Counting
- Zipf's Law: in a large, well-written English document,

    r × f ≈ c

- where
  - r is the rank of a word (by frequency),
  - f is the number of times the word is used in the document,
  - c is a constant.
19. Many Properties of Text Approximate Zipf's Law
- Word frequencies (from the Brown Corpus):

    Word   Rank   Freq    Rank x Freq
    the      1    0.070      0.070
    of       2    0.036      0.072
    and      3    0.029      0.057
    to       4    0.026      0.104
    a        5    0.023      0.065
    in       6    0.021      0.126
    that     7    0.011      0.077
    is       8    0.010      0.080
    was      9    0.010      0.090
    he      10    0.010      0.100
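- A small sketch that recomputes r x f from the table above and predicts frequencies at deeper ranks; taking c ≈ 0.07 (the rank-1 product) is an assumption for illustration:

    #include <stdio.h>

    /* Brown Corpus frequencies from the table above. */
    static const struct { const char *word; int rank; double freq; } rows[] = {
        { "the", 1, 0.070 }, { "of", 2, 0.036 }, { "and", 3, 0.029 },
        { "to",  4, 0.026 }, { "a",  5, 0.023 }, { "in",  6, 0.021 },
        { "that",7, 0.011 }, { "is", 8, 0.010 }, { "was", 9, 0.010 },
        { "he", 10, 0.010 },
    };

    int main(void)
    {
        /* Zipf's Law says rank * freq should be roughly constant. */
        for (int i = 0; i < 10; i++)
            printf("%-5s r*f = %.3f\n", rows[i].word,
                   rows[i].rank * rows[i].freq);

        /* With c ~= 0.07 we can predict the frequency at any rank. */
        double c = 0.07;
        printf("predicted freq at rank 50: %.4f\n", c / 50);
        return 0;
    }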
20. Many Properties of Text Approximate Zipf's Law
- English letter usage statistics (letter, count, percent of all letters):

    E   72881   12.4%
    T   52397    8.9%
    A   47072    8.0%
    O   45116    7.6%
    N   41316    7.0%
    I   39710    6.7%
    H   38334    6.5%
21. Many Properties of Text Approximate Zipf's Law
- Doubled letter frequencies (pair, count, percent of doubled letters):

    LL   2979   20.6%
    EE   2146   14.8%
    SS   2128   14.7%
    OO   2064   14.3%
    TT   1169    8.1%
    RR   1068    7.4%
    --    701    4.8%
    PP    628    4.3%
    FF    430    2.9%
22. Many Properties of Text Approximate Zipf's Law
- Initial letter frequencies (letter, count, percent of initial letters):

    T   20665   15.2%
    A   15564   11.4%
    H   11623    8.5%
    W    9597    7.0%
    I    9468    6.9%
    S    9376    6.9%
    O    8205    6.0%
    M    6293    4.6%
    B    5831    4.2%
23. Many Properties of Text Approximate Zipf's Law
- Ending letter frequencies (letter, count, percent of ending letters):

    E   26439   19.4%
    D   17313   12.7%
    S   14737   10.8%
    T   13685   10.0%
    N   10525    7.7%
    R    9491    6.9%
    Y    7915    5.8%
    O    6226    4.5%
24. Summary: Automatic Indexing
- 1. Parse (divide up) the text into individual words (tokens).
- 2. Remove stop words.
- 3. Do stemming.
- 4. Create the indexing structure:
  - inverted index
  - other structures
- A sketch of the whole pipeline follows.
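- A minimal end-to-end sketch of steps 2-4 in C; step 1 (parsing) is assumed done, and the stop list and one-rule stemmer are crude stand-ins for the real components:

    #include <stdio.h>
    #include <string.h>

    #define MAX_TERMS 64
    #define MAX_DOCS  8

    /* Inverted index: for each term, which documents it appears in. */
    static char terms[MAX_TERMS][32];
    static int  postings[MAX_TERMS][MAX_DOCS];
    static int  n_docs_for[MAX_TERMS];
    static int  n_terms = 0;

    static int is_stop_word(const char *t)
    {
        static const char *stop[] = { "the", "of", "and", "to", "a", "in" };
        for (int i = 0; i < 6; i++)
            if (strcmp(t, stop[i]) == 0) return 1;
        return 0;
    }

    /* Very crude stemming stand-in: strip a trailing "s". */
    static void stem(char *t)
    {
        size_t n = strlen(t);
        if (n > 2 && t[n - 1] == 's') t[n - 1] = '\0';
    }

    static void index_token(const char *token, int doc_id)
    {
        char t[32];
        strncpy(t, token, 31); t[31] = '\0';
        if (is_stop_word(t)) return;          /* step 2 */
        stem(t);                              /* step 3 */
        for (int i = 0; i < n_terms; i++)     /* step 4: add a posting */
            if (strcmp(terms[i], t) == 0) {
                int k = n_docs_for[i];
                if (k == 0 || postings[i][k - 1] != doc_id)
                    postings[i][n_docs_for[i]++] = doc_id;
                return;
            }
        if (n_terms >= MAX_TERMS) return;
        strcpy(terms[n_terms], t);
        postings[n_terms][0] = doc_id;
        n_docs_for[n_terms++] = 1;
    }

    int main(void)
    {
        const char *doc0[] = { "the", "cats", "sat", NULL };
        const char *doc1[] = { "a", "cat", "ran", NULL };
        const char **docs[] = { doc0, doc1 };

        for (int d = 0; d < 2; d++)
            for (int w = 0; docs[d][w]; w++)
                index_token(docs[d][w], d);

        /* Prints: cat: d0 d1 / sat: d0 / ran: d1 */
        for (int i = 0; i < n_terms; i++) {
            printf("%s:", terms[i]);
            for (int k = 0; k < n_docs_for[i]; k++)
                printf(" d%d", postings[i][k]);
            printf("\n");
        }
        return 0;
    }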
25. Frequencies and Vector Weighting
- A document is represented as a vector (Salton): (W1, W2, ..., Wn)
- Binary weights:
  - Wi = 1 if the corresponding term is in the document
  - Wi = 0 if the term is not in the document
- TF (term frequency):
  - Wi = tfi, where tfi is the number of times term i occurs in the document
- TF-IDF (term frequency x inverse document frequency):
  - Wi = tfi x idfi = tfi x (1 + log(N/dfi)), where dfi is the number of documents containing term i, and N is the total number of documents in the collection.
- Could have other weights, e.g. extra weight if a term is in the title of a document.
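- A short sketch of the TF-IDF formula in C; the collection size and counts are made up, and using the natural log is an assumption since the slide does not fix a base:

    #include <math.h>
    #include <stdio.h>

    /* TF-IDF weighting as above: Wi = tfi * (1 + log(N / dfi)). */
    int main(void)
    {
        double N = 1000.0;            /* documents in the collection      */
        struct { const char *term; double tf, df; } terms[] = {
            { "retrieval", 5, 20 },   /* frequent here, rare elsewhere    */
            { "system",    5, 900 },  /* frequent here, common everywhere */
        };

        for (int i = 0; i < 2; i++) {
            double w = terms[i].tf * (1.0 + log(N / terms[i].df));
            printf("%-10s tf=%.0f df=%.0f weight=%.2f\n",
                   terms[i].term, terms[i].tf, terms[i].df, w);
        }
        /* "retrieval" gets a much higher weight than "system",
           even though both occur 5 times in the document. */
        return 0;
    }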
26. Importance of Associations
- Counting word pairs
  - If two words appear together very often (collocation), they are likely to be a phrase.
- Counting document pairs
  - If two documents have many common words, they are likely related.
- A sketch of counting adjacent word pairs follows.
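- A minimal sketch of counting adjacent word pairs to find phrase candidates; the token stream is illustrative:

    #include <stdio.h>
    #include <string.h>

    #define MAX_PAIRS 64

    /* Count how often each adjacent word pair occurs in a token stream. */
    static char pairs[MAX_PAIRS][64];
    static int  counts[MAX_PAIRS];
    static int  n_pairs = 0;

    static void count_pair(const char *w1, const char *w2)
    {
        char key[64];
        snprintf(key, sizeof key, "%s %s", w1, w2);
        for (int i = 0; i < n_pairs; i++)
            if (strcmp(pairs[i], key) == 0) { counts[i]++; return; }
        if (n_pairs < MAX_PAIRS) {
            strcpy(pairs[n_pairs], key);
            counts[n_pairs++] = 1;
        }
    }

    int main(void)
    {
        const char *tokens[] = { "information", "retrieval", "and",
                                 "information", "retrieval", "systems" };
        for (int i = 0; i + 1 < 6; i++)
            count_pair(tokens[i], tokens[i + 1]);

        for (int i = 0; i < n_pairs; i++)
            if (counts[i] > 1)  /* frequent pairs are phrase candidates */
                printf("candidate phrase: %s (%d)\n", pairs[i], counts[i]);
        return 0;
    }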
27. More on Associations
- Counting citation pairs
  - If documents A and B both cite documents C and D, then A and B might be related (bibliographic coupling).
  - If documents C and D are often cited together, they are likely related (co-citation).
- Counting link patterns
  - Get all pages that link to my pages.
  - Get all pages that contain links similar to those on my pages.
  - PageRank (Google)