INSYS 300 Text Analysis - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
INSYS 300: Text Analysis
2
Indexing for Actual Texts
  • People generally want retrieval based on
    concepts.
  • Computers work with tokens (they don't care
    about meaning).
  • Words are somewhere in between.
  • We want to extract the meaningful words but deal
    with them as tokens.

3
Some Types of Text Analysis for Text Search
  • Word (token) extraction
  • Stop words
  • Stemming
  • Frequency counts and Zipf's Law

4
Stop Words
  • Many of the most frequently used words in English
    are worthless in searching: they generally
    glue sentences together but usually don't
    carry meaning.
  • We ignore them in making an index, so we call them
    stop words.
  • Examples: the, of, and, to, ...
  • Why do we need to remove stop words?
  • Reduce indexing file size
  • stop words account for 20-30% of total word counts
  • Improve efficiency
  • stop words are not useful for searching
  • stop words always have a large number of hits
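The filtering step described above can be sketched in a few lines of Python; the stop list here is a tiny illustrative sample, not a standard list:

```python
# Sketch of stop-word removal during indexing.
# STOP_WORDS is an illustrative sample, not a standard list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it"}

def index_terms(text):
    """Tokenize and drop stop words before indexing."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(index_terms("The analysis of text is a key step in search"))
# -> ['analysis', 'text', 'key', 'step', 'search']
```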

5
Stop Words (Cont'd)
  • Potential problems of removing stop words
  • A small stop list does not improve indexing much
  • A large stop list may eliminate words that
    might be useful for someone or for some purpose
  • Stop words might be part of phrases
  • Stop words need to be processed for both indexing
    and queries.

6
Stemming
  • In English we have many forms of a root word.
  • Techniques are used to find the root/stem of a
    word:
  • lookup   user 15        engineering 12
             users 4        engineered 23
             used 5         engineer 12
             using 5
  • stem     use            engineer

7
Advantages of Stemming
  • Improving effectiveness
  • matching similar words
  • Reducing index size
  • combining words with the same root may reduce
    index size by as much as 40-50%
  • Criteria for stemming
  • correctness
  • retrieval effectiveness
  • compression performance

8
Basic Stemming Methods
  • We want a simple form of morphological analysis,
    so we do stemming with tables and rules.
  • Remove endings (such as -s, -ing):
  • if a word ends with a consonant other than "s",
    followed by an "s", then delete the "s".
  • if a word ends in "es", drop the "s".
  • if a word ends in "ing", delete the "ing" unless
    the remaining word consists of only one letter
    or of "th".
  • if a word ends with "ed", preceded by a consonant,
    delete the "ed" unless this leaves only a single
    letter.
  • ...

9
Basic Stemming (Cont'd)
  • Transform the remainder:
  • if a word ends with "ies" but not "eies" or
    "aies", then "ies" --> "y".
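The removal and transform rules above can be collected into one function; a sketch where the order the rules are tried (most specific first) is my own assumption:

```python
# Sketch of the table/rule stemming described in the slides.
# Rule ordering (most specific first) is an assumption.
VOWELS = set("aeiou")

def simple_stem(word):
    # "ies" (but not "eies"/"aies") -> "y"
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # ends in "es": drop the "s"
    if word.endswith("es"):
        return word[:-1]
    # consonant (other than "s") followed by "s": delete the "s"
    if (word.endswith("s") and len(word) > 1
            and word[-2] not in VOWELS and word[-2] != "s"):
        return word[:-1]
    # "ing": delete unless one letter or "th" would remain
    if word.endswith("ing") and len(word) > 4 and word[:-3] != "th":
        return word[:-3]
    # consonant + "ed": delete "ed" unless a single letter would remain
    if (word.endswith("ed") and len(word) > 3
            and word[-3] not in VOWELS):
        return word[:-2]
    return word

print(simple_stem("flies"), simple_stem("users"), simple_stem("engineered"))
# -> fly user engineer
```

Note that such rules over-stem some words (e.g. "using" becomes "us", not "use"); the slide's lookup table conflates these under one stem.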

10
Porter Stemming Algorithm
  • A set of condition/action rules
  • conditions on the stem
  • conditions on the suffix
  • conditions on the rules
  • different combinations of conditions activate
    different rules.
  • Implementation (stem.c)
  • Stem(word)
  •   ...
  •   ReplaceEnd(word, step1a_rule)
  •   rule = ReplaceEnd(word, step1b_rule)
  •   if ((rule == 106) || (rule == 107))
  •     ReplaceEnd(word, 1b1_rule)

11
Soundex: Sound-Based Coding
  • Soundex rules
  • Letter                       Numeric equivalent
  • B, F, P, V                   1
  • C, G, J, K, Q, S, X, Z       2
  • D, T                         3
  • L                            4
  • M, N                         5
  • R                            6
  • A, E, H, I, O, U, W, Y       not coded
  • Robert --> 6163
  • Words which sound similar often have the same codes.
  • A high compression rate
  • Robert: 6 letters x 8 bits versus 6163: 4 digits x 4 bits
  • The code is not unique (bad/bed)
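A minimal encoder for the table above; note it follows the slide's convention of coding every letter numerically (giving Robert -> 6163), whereas standard Soundex keeps the first letter (R163):

```python
# Sketch of the slide's sound-based coding table.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def soundex_like(word):
    digits = []
    prev = None
    for ch in word.upper():
        code = CODES.get(ch)           # vowels, H, W, Y are not coded
        if code and code != prev:      # collapse adjacent duplicates
            digits.append(code)
        prev = code
    return "".join(digits)

print(soundex_like("Robert"))                           # -> 6163
print(soundex_like("bad"), soundex_like("bed"))         # same code: not unique
```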

12
Morphological Analysis
  • In linguistics, finding the root form of a word
    is "morphological analysis" (morph == shape).
  • English is very odd:
  • sanction: to approve
  • sanction: to forbid

13
Example 3: N-gram Coding
  • A (letter) n-gram is n consecutive letters.
  • A di-gram (or bi-gram) is 2 consecutive letters.
  • A tri-gram is 3 consecutive letters.
  • All di-grams of the word "statistics" are:
  • st ta at ti is st ti ic cs
  • Unique: at cs ic is st ta ti
  • All di-grams of "statistical" are:
  • st ta at ti is st ti ic ca al
  • Unique: al at ca ic is st ta ti
  • Note the similarity to the di-grams for
    "statistics".
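Extracting di-grams is a one-liner; a sketch reproducing the "statistics" example above:

```python
# Sketch: extract the di-grams (2-letter n-grams) of a word.
def digrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

print(digrams("statistics"))
# -> ['st', 'ta', 'at', 'ti', 'is', 'st', 'ti', 'ic', 'cs']
print(sorted(set(digrams("statistics"))))
# -> ['at', 'cs', 'ic', 'is', 'st', 'ta', 'ti']
```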

14
Calculating N-Gram Similarity
  • The similarity of two words can be calculated by
  • S = 2C / (A + B)
  • where
  • A is the number of unique di-grams in the first
    word
  • B is the number of unique di-grams in the second
    word
  • C is the number of unique di-grams shared by the
    two words
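With A, B, and C as defined above, a common instantiation is the Dice coefficient S = 2C / (A + B); the slide's formula image did not survive, so the Dice form is my assumption:

```python
# Sketch: di-gram similarity via the Dice coefficient S = 2C / (A + B),
# where A, B are unique di-gram counts and C is the shared count.
def digram_set(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def similarity(w1, w2):
    a, b = digram_set(w1), digram_set(w2)
    return 2 * len(a & b) / (len(a) + len(b))

print(similarity("statistics", "statistical"))
# -> 0.8  (A = 7, B = 8, C = 6)
```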

15
Letter and Word Frequency Counts
  • The idea
  • the best a computer can do is count
  • count the number of times a word occurs in a
    document
  • count the number of documents in a collection
    that contain a word
  • Use occurrence frequencies to indicate the relative
    importance of a word in a document:
  • if a word appears often in a document, the
    document likely deals with subjects related to
    the word.
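Both counts (within-document term frequency and per-collection document frequency) are one pass with a Counter; a sketch over a toy collection:

```python
# Sketch: per-document term counts and collection-wide document frequency.
from collections import Counter

docs = [
    "text analysis of text data",
    "word frequency counts in text",
    "stemming reduces word forms",
]

term_freqs = [Counter(d.split()) for d in docs]   # counts within each document
doc_freq = Counter()                              # documents containing each word
for tf in term_freqs:
    doc_freq.update(tf.keys())

print(term_freqs[0]["text"])   # -> 2 ("text" occurs twice in doc 0)
print(doc_freq["text"])        # -> 2 ("text" appears in two documents)
```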

16
Frequency Counts (Cont'd)
  • Using occurrence frequencies to select the most
    useful words to index a document collection:
  • if a word appears in every document, it is not a
    good indexing word
  • if a word appears in only one or two documents,
    it may not be a good indexing word

17
Effects of Frequency
  • Use frequency data to
  • decide the head threshold (words too common to index)
  • decide the tail threshold (words too rare to index)
  • decide the variance of counting

18
More about Counting
  • Zipf's Law
  • in a large, well-written English document,
  • r × f = c
  • where
  • r is the rank of a word when words are ordered
    by frequency,
  • f is the number of times the given word is used
    in the document, and
  • c is a constant.
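The relation r × f ≈ c can be checked against the Brown-corpus figures on the next slide; a sketch using three of those rows:

```python
# Sketch: check Zipf's law r * f ≈ c on a few (word, rank, relative
# frequency) rows taken from the slide's Brown-corpus table.
brown = [("the", 1, 0.070), ("of", 2, 0.036), ("to", 4, 0.026)]

for word, r, f in brown:
    # the products stay within the same rough range, as Zipf predicts
    print(word, round(r * f, 3))
```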

19
Many Properties of Text Approximate Zipf's Law
  • Word frequencies (from the Brown Corpus)
  • Word    Rank    Freq     Rank × Freq
  • the     1       0.070    0.070
  • of      2       0.036    0.072
  • and     3       0.029    0.057
  • to      4       0.026    0.104
  • a       5       0.023    0.065
  • in      6       0.021    0.126
  • that    7       0.011    0.077
  • is      8       0.010    0.080
  • was     9       0.010    0.090
  • he      10      0.010    0.100

20
Many Properties of Text Approximate Zipf's Law
  • English Letter Usage Statistics
  • Letter use frequencies (count, percent)
  • E  72881  12.4%
  • T  52397   8.9%
  • A  47072   8.0%
  • O  45116   7.6%
  • N  41316   7.0%
  • I  39710   6.7%
  • H  38334   6.5%

21
Many Properties of Text Approximate Zipf's Law
  • Doubled letter frequencies (count, percent)
  • LL  2979  20.6%
  • EE  2146  14.8%
  • SS  2128  14.7%
  • OO  2064  14.3%
  • TT  1169   8.1%
  • RR  1068   7.4%
  • --   701   4.8%
  • PP   628   4.3%
  • FF   430   2.9%

22
Many Properties of Text Approximate Zipf's Law
  • Initial letter frequencies (count, percent)
  • T  20665  15.2%
  • A  15564  11.4%
  • H  11623   8.5%
  • W   9597   7.0%
  • I   9468   6.9%
  • S   9376   6.9%
  • O   8205   6.0%
  • M   6293   4.6%
  • B   5831   4.2%

23
Many Properties of Text Approximate Zipf's Law
  • Ending letter frequencies (count, percent)
  • E  26439  19.4%
  • D  17313  12.7%
  • S  14737  10.8%
  • T  13685  10.0%
  • N  10525   7.7%
  • R   9491   6.9%
  • Y   7915   5.8%
  • O   6226   4.5%

24
Summary: Automatic Indexing
  • 1. Parse (divide up) individual words (tokens).
  • 2. Remove stop words.
  • 3. Do stemming.
  • 4. Create the indexing structure:
  • inverted indexing
  • other structures
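The four steps above can be sketched end-to-end; a minimal Python sketch with a toy stop list and a naive plural-stripping stemmer standing in for the real components:

```python
# Sketch of the indexing pipeline: tokenize, remove stop words, stem,
# build an inverted index. STOP and naive_stem are toy stand-ins.
STOP = {"the", "of", "and", "a"}

def naive_stem(t):
    # toy stemmer: strip a trailing plural "s"
    return t[:-1] if t.endswith("s") and len(t) > 3 else t

def build_index(docs):
    index = {}
    for doc_id, text in enumerate(docs):
        for tok in text.lower().split():               # 1. parse tokens
            if tok in STOP:                            # 2. drop stop words
                continue
            term = naive_stem(tok)                     # 3. stem
            index.setdefault(term, set()).add(doc_id)  # 4. invert
    return index

idx = build_index(["The users of the system", "a system engineer"])
print(idx)
# -> {'user': {0}, 'system': {0, 1}, 'engineer': {1}}
```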

25
Frequencies and Vector Weighting
  • A document is represented as a vector (Salton):
  • (W1, W2, ..., Wn)
  • Binary weights:
  • Wi = 1 if the corresponding term is in the
    document
  • Wi = 0 if the term is not in the document
  • TF (Term Frequency):
  • Wi = tfi, where tfi is the number of times
    term i occurs in the document
  • TF-IDF (Term Frequency × Inverse Document Frequency):
  • Wi = tfi × idfi = tfi × (1 + log(N / dfi)), where dfi is the
    number of documents containing term i, and N is
    the total number of documents in the collection.
  • Could have other weights, e.g.
  • extra weight if a term is in the title of a
    document
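The TF-IDF weight Wi = tfi × (1 + log(N / dfi)) can be computed directly; a sketch over a toy three-document collection:

```python
# Sketch: TF-IDF weighting w_i = tf_i * (1 + log(N / df_i))
# over a toy collection.
import math
from collections import Counter

docs = [
    "text analysis of text",
    "word counts in text",
    "stemming of word forms",
]
N = len(docs)
tfs = [Counter(d.split()) for d in docs]
df = Counter()                      # document frequency of each term
for tf in tfs:
    df.update(tf.keys())

def tfidf(term, doc_id):
    tf = tfs[doc_id][term]
    return tf * (1 + math.log(N / df[term])) if tf else 0.0

print(round(tfidf("text", 0), 3))   # -> 2.811  (tf=2, df=2, N=3)
```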

26
Importance of Associations
  • Counting word pairs
  • If two words appear together very often
    (collocation), they are likely to be a phrase.
  • Counting document pairs
  • If two documents have many words in common, they
    are likely related.

27
More on Associations
  • Counting citation pairs
  • If documents A and B both cite documents C and D,
    then A and B might be related.
  • If documents C and D are often cited together,
    they are likely related.
  • Counting link patterns
  • Get all pages that link to my pages.
  • Get all pages that contain links similar to
    those in my pages.
  • PageRank (Google)
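Co-citation counting, as described above, is a pair count over each document's reference list; a sketch with a toy citation graph:

```python
# Sketch: co-citation counting — documents often cited together are
# likely related. The citation graph here is a toy example.
from itertools import combinations
from collections import Counter

citations = {                 # document -> set of documents it cites
    "A": {"C", "D"},
    "B": {"C", "D", "E"},
    "F": {"C"},
}

cocited = Counter()
for cited in citations.values():
    for pair in combinations(sorted(cited), 2):
        cocited[pair] += 1    # count each co-cited pair once per citer

print(cocited[("C", "D")])    # -> 2 (C and D are cited together by A and B)
```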