Title: SIMS 290-2: Applied Natural Language Processing
1. SIMS 290-2: Applied Natural Language Processing
Marti Hearst, Oct 23, 2006
(Slides developed by Preslav Nakov)
2. Today
- Feature selection
- TF.IDF Term Weighting
- Weka Input File Format
3. Features for Text Categorization
- Linguistic features
  - Words
    - lowercase? (should we convert to lowercase?)
    - normalized? (e.g. texts → text)
  - Phrases
  - Word-level n-grams
  - Character-level n-grams
  - Punctuation
  - Part of Speech
- Non-linguistic features
  - document formatting
  - informative character sequences (e.g. <)
4. When Do We Need Feature Selection?
- If the algorithm cannot handle all possible features
  - e.g. language identification for 100 languages using all words
  - e.g. text classification using n-grams
- Good features can result in higher accuracy
- What if we just keep all features?
  - Even the unreliable features can be helpful
  - But we need to weight them
  - In the extreme case, the bad features can have a weight of 0 (or very close), which is a form of feature selection!
5. Why Feature Selection?
- Not all features are equally good!
- Bad features: best to remove
  - Infrequent
    - unlikely to be seen again
    - co-occurrence with a class can be due to chance
  - Too frequent
    - mostly function words
  - Uniform across all categories
- Good features: should be kept
  - Co-occur with a particular category
  - Do not co-occur with other categories
- The rest: good to keep
6. Types of Feature Selection?
- Feature selection reduces the number of features
- Usually by
  - Eliminating features
  - Weighting features
  - Normalizing features
- Sometimes by transforming parameters
  - e.g. Latent Semantic Indexing using Singular Value Decomposition
- Method may depend on problem type
  - For classification and filtering, may want to use information from example documents to guide selection
7. Feature Selection
- Task-independent methods
  - Document Frequency (DF)
  - Term Strength (TS)
- Task-dependent methods
  - Information Gain (IG)
  - Mutual Information (MI)
  - χ² statistic (CHI)
- Empirically compared by Yang & Pedersen (1997)
8. Pedersen & Yang Experiments
- Compared feature selection methods for text categorization
- 5 feature selection methods
  - DF, MI, CHI, (IG, TS)
- Features were just words, not phrases
- 2 classifiers
  - kNN: k-Nearest Neighbor
  - LLSF: Linear Least Squares Fit
- 2 data collections
  - Reuters-22173
  - OHSUMED (subset of MEDLINE; 1990-1991 used)
9. Document Frequency (DF)
- DF = number of documents a term appears in
- Based on Zipf's Law
- Remove the rare terms (seen 1-2 times)
  - Spurious
  - Unreliable: can be just noise
  - Unlikely to appear in new documents
- Plus:
  - Easy to compute
  - Task independent: do not need to know the classes
- Minus:
  - Ad hoc criterion
  - For some applications, rare terms can be good discriminators (e.g., in IR)
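To make the DF criterion concrete, here is a minimal sketch of removing terms below a document-frequency threshold; the function name, threshold, and toy corpus are ours, not from the slides:

    from collections import Counter

    def df_filter(tokenized_docs, min_df=2):
        """Keep only terms whose document frequency is at least min_df."""
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(doc))               # count each term once per document
        return {t for t, n in df.items() if n >= min_df}

    docs = [["jaguar", "car", "engine"],
            ["car", "engine", "speed"],
            ["jaguar", "jungle"]]
    print(df_filter(docs, min_df=2))          # {'jaguar', 'car', 'engine'}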
10. Stop Word Removal
- Common words from a predefined list
- Mostly from closed-class categories
  - unlikely to have a new word added
  - include auxiliaries, conjunctions, determiners, prepositions, pronouns, articles
- But also some open-class words, like numerals
- Bad discriminators
  - uniformly spread across all classes
  - can be safely removed from the vocabulary
- Is this always a good idea? (e.g. author identification)
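Stop word removal is just a lookup against the predefined list; a minimal sketch, with a tiny hypothetical list standing in for a real one:

    STOP_WORDS = {"the", "a", "of", "and", "is", "to"}     # tiny hypothetical sample

    def remove_stop_words(tokens):
        """Drop tokens that appear in the stop word list."""
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words(["The", "jaguar", "is", "a", "cat"]))   # ['jaguar', 'cat']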
11. χ² statistic (CHI)
- χ² statistic (pronounced "kai square")
- A commonly used method of comparing proportions
- Measures the lack of independence between a term and a category (Yang & Pedersen)
12. χ² statistic (CHI)

                 Term = jaguar   Term ≠ jaguar
  Class = auto         2              500
  Class ≠ auto         3             9500

- Is "jaguar" a good predictor for the "auto" class?
- We want to compare:
  - the observed distribution above, and
  - the null hypothesis that "jaguar" and "auto" are independent
13. χ² statistic (CHI)
- Under the null hypothesis ("jaguar" and "auto" independent): how many co-occurrences of "jaguar" and "auto" do we expect?
- If independent: Pr(j,a) = Pr(j) × Pr(a)
- So, we expect N × Pr(j,a), i.e. N × Pr(j) × Pr(a)
  - Pr(j) = (2+3)/N
  - Pr(a) = (2+500)/N
  - N = 2+3+500+9500 = 10005
- Which is N × (5/N) × (502/N) = 2510/N = 2510/10005 ≈ 0.25

                 Term = jaguar   Term ≠ jaguar
  Class = auto         2              500
  Class ≠ auto         3             9500
14. χ² statistic (CHI)
Under the null hypothesis ("jaguar" and "auto" independent): how many co-occurrences of "jaguar" and "auto" do we expect?

                 Term = jaguar   Term ≠ jaguar
  Class = auto       2 (0.25)         500
  Class ≠ auto       3               9500

  observed: fo   (expected: fe)
15. χ² statistic (CHI)
Under the null hypothesis ("jaguar" and "auto" independent): how many co-occurrences of "jaguar" and "auto" do we expect?

                 Term = jaguar   Term ≠ jaguar
  Class = auto       2 (0.25)       500 (502)
  Class ≠ auto       3 (4.75)      9500 (9498)

  observed: fo   (expected: fe)
16?2 statistic (CHI)
?2 is interested in (fo fe)2/fe summed over all
table entries The null hypothesis is rejected
with confidence .999, since 12.9 gt 10.83 (the
value for .999 confidence).
Term jaguar Term jaguar Term ? jaguar Term ? jaguar
Class auto 2 (0.25) 500 (502)
Class ? auto 3 (4.75) 9500 (9498)
expected fe
observed fo
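A quick way to check the arithmetic above is to compute the expected counts and the statistic directly; the function name and the rounding noted in the comment are ours, not from the slides:

    def chi_square_2x2(observed):
        """Pearson chi-square for a 2x2 table: sum of (fo - fe)^2 / fe over all cells."""
        n = sum(sum(row) for row in observed)
        row_totals = [sum(row) for row in observed]
        col_totals = [sum(col) for col in zip(*observed)]
        chi2 = 0.0
        for i, row in enumerate(observed):
            for j, fo in enumerate(row):
                fe = row_totals[i] * col_totals[j] / n   # expected count under independence
                chi2 += (fo - fe) ** 2 / fe
        return chi2

    # rows: class = auto / class != auto; columns: term = jaguar / term != jaguar
    print(chi_square_2x2([[2, 500], [3, 9500]]))   # ~12.85 (the slides' 12.9 uses rounded fe)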
17. χ² statistic (CHI)
There is a simpler formula for χ², in terms of the cell counts of the 2×2 table for term t and category c:

  A = #(t, c)     documents in c that contain t
  B = #(t, ¬c)    documents not in c that contain t
  C = #(¬t, c)    documents in c that do not contain t
  D = #(¬t, ¬c)   documents not in c that do not contain t
  N = A + B + C + D

  χ²(t, c) = N × (AD - CB)² / ((A + C)(B + D)(A + B)(C + D))
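Plugging the jaguar/auto counts into the simpler formula gives the same statistic; the helper below is a sketch we added, with A, B, C, D as defined above:

    def chi_square_simple(a, b, c, d):
        """chi^2(t, c) from the 2x2 cell counts A, B, C, D."""
        n = a + b + c + d
        return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

    # jaguar/auto: A = 2, B = 3, C = 500, D = 9500
    print(chi_square_simple(2, 3, 500, 9500))   # ~12.85, matching the direct computation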
18. χ² statistic (CHI)
- How to use χ² for multiple categories?
- Compute χ² for each category and then combine:
  - To require a feature to discriminate well across all categories, take the expected value of χ²
  - Or, to weight a feature for a single category, take the maximum
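Written out, the two combination schemes from Yang & Pedersen are:

  χ²_avg(t) = Σ_i Pr(c_i) × χ²(t, c_i)    (expected value across categories)
  χ²_max(t) = max_i χ²(t, c_i)            (maximum over categories)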
19. χ² statistic (CHI)
- Pluses
  - normalized and thus comparable across terms
  - χ²(t,c) is 0 when t and c are independent
  - can be compared to the χ² distribution with 1 degree of freedom
- Minuses
  - unreliable for low-frequency terms
20. Information Gain
- A measure of the importance of the feature for predicting the presence of the class
- Has an information-theoretic justification
- Defined as: the number of bits of information gained by knowing the term is present or absent
- Based on Information Theory
  - We won't go into this in detail here
21. Information Gain (IG)
IG = number of bits of information gained by knowing whether the term is present or absent. t is the term being scored, ci is a class variable.

  IG(t) = H(c) - Pr(t) H(c|t) - Pr(¬t) H(c|¬t)

  H(c):     entropy of the class variable
  H(c|t):   specific conditional entropy given that the term is present
  H(c|¬t):  specific conditional entropy given that the term is absent
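A small sketch of this computation over per-class document counts; the function names and the toy counts are ours:

    from math import log2

    def entropy(probs):
        """Shannon entropy in bits; zero-probability outcomes are skipped."""
        return -sum(p * log2(p) for p in probs if p > 0)

    def information_gain(counts):
        """IG(t) from counts[c] = (docs in class c with the term, docs in c without it)."""
        n = sum(w + wo for w, wo in counts.values())
        n_t = sum(w for w, _ in counts.values())              # documents containing the term
        n_not = n - n_t
        h_c = entropy([(w + wo) / n for w, wo in counts.values()])
        h_c_t = entropy([w / n_t for w, _ in counts.values()]) if n_t else 0.0
        h_c_not = entropy([wo / n_not for _, wo in counts.values()]) if n_not else 0.0
        return h_c - (n_t / n) * h_c_t - (n_not / n) * h_c_not

    # jaguar/auto example: 2 auto docs with the term, 500 without; 3 others with, 9500 without
    print(information_gain({"auto": (2, 500), "other": (3, 9500)}))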
22. Mutual Information (MI)
- The probability of seeing x and y together
- vs.
- The probability of seeing x anywhere times the probability of seeing y anywhere (independently)
- MI = log( P(x,y) / (P(x) P(y)) )
     = log P(x,y) - log( P(x) P(y) )
- From Bayes' law: P(x,y) = P(x|y) P(y)
     = log( P(x|y) P(y) ) - log( P(x) P(y) )
- MI = log P(x|y) - log P(x)
23. Mutual Information (MI)
- Rare terms get higher scores
- Approximation, in terms of the 2×2 cell counts (A = #(t,c), B = #(t,¬c), C = #(¬t,c), N = A + B + C + D):

    I(t, c) ≈ log( (A × N) / ((A + C) × (A + B)) )

- Does not use term absence
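The helper below (our name; natural log) just evaluates that approximation on the jaguar/auto counts:

    from math import log

    def mutual_information(a, b, c, n):
        """MI(t,c) ~ log( P(t,c) / (P(t) P(c)) ), estimated from the 2x2 counts."""
        return log((a * n) / ((a + c) * (a + b)))

    # jaguar/auto: A = 2, B = 3, C = 500, N = 10005
    print(mutual_information(2, 3, 500, 10005))   # ~2.08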
24. Using Mutual Information
- Compute MI for each category and then combine
- If we want to discriminate well across all categories, take the expected value of MI
- To discriminate well for a single category, take the maximum
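As with χ², the two combinations are usually written as:

  MI_avg(t) = Σ_i Pr(c_i) × I(t, c_i)
  MI_max(t) = max_i I(t, c_i)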
25. Mutual Information
- Pluses
  - I(t,c) is 0 when t and c are independent
  - Has a sound information-theoretic interpretation
- Minuses
  - Small numbers produce unreliable results
  - Does not use term absence
26. CHI max, IG, DF
[Figure comparing CHI max, IG, and DF with Term Strength and Mutual Information; from Yang & Pedersen '97]
27. Feature Comparison
- DF, IG and CHI are good and strongly correlated
  - thus using DF is good, cheap, and task-independent
  - can be used when IG and CHI are too expensive
- MI is bad
  - favors rare terms (which are typically bad)
28. Term Weighting
- In the study just shown, terms were (mainly) treated as binary features
  - If a term occurred in a document, it was assigned 1
  - Else 0
- Often it is useful to weight the selected features
- Standard technique: TF.IDF
29. TF.IDF Term Weighting
- TF: term frequency
  - definition: TF = tij, the frequency of term i in document j
  - purpose: makes the frequent words for the document more important
- IDF: inverted document frequency
  - definition: IDF = log(N/ni)
    - ni: number of documents containing term i
    - N: total number of documents
  - purpose: makes rare words across documents more important
- TF.IDF (for term i in document j)
  - definition: tij × log(N/ni)
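A compact sketch of these definitions on a toy tokenized corpus; the function name and the corpus are ours:

    from math import log
    from collections import Counter

    def tf_idf(tokenized_docs):
        """Weight w_ij = tf_ij * log(N / n_i) for each term i in each document j."""
        n = len(tokenized_docs)
        df = Counter()                        # n_i: number of documents containing term i
        for doc in tokenized_docs:
            df.update(set(doc))
        weights = []
        for doc in tokenized_docs:
            tf = Counter(doc)                 # tf_ij: frequency of term i in document j
            weights.append({t: tf[t] * log(n / df[t]) for t in tf})
        return weights

    docs = [["the", "jaguar", "purrs"], ["the", "car", "engine", "purrs"], ["the", "car"]]
    print(tf_idf(docs)[0])    # "the" gets weight 0: it appears in every document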
30. Term Normalization
- Combine different words into a single representation
- Stemming / morphological analysis
  - bought, buy, buys → buy
- General word categories
  - $23.45, 5.30 Yen → MONEY
  - 1984, 10,000 → DATE, NUM
  - PERSON
  - ORGANIZATION
  - (Covered in the Information Extraction segment)
- Generalize with lexical hierarchies
  - WordNet, MeSH
  - (Covered later in the semester)
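For the word-category bullets above, a very rough illustrative normalizer might look like this; the patterns are hypothetical and only cover the slide's examples:

    import re

    def normalize(token):
        """Map raw tokens to coarse categories (MONEY, DATE, NUM); rough patterns only."""
        if re.fullmatch(r"\$\d[\d,]*(\.\d+)?", token):
            return "MONEY"
        if re.fullmatch(r"(19|20)\d\d", token):
            return "DATE"                     # bare four-digit years
        if re.fullmatch(r"\d[\d,]*(\.\d+)?", token):
            return "NUM"
        return token

    print([normalize(t) for t in ["$23.45", "1984", "10,000", "jaguar"]])
    # ['MONEY', 'DATE', 'NUM', 'jaguar']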
31. What Do People Do in Practice?
- Feature selection
  - infrequent term removal
    - infrequent across the whole collection (i.e. DF)
    - seen in a single document
  - most frequent term removal (i.e. stop words)
- Normalization
  - Stemming (often)
  - Word classes (sometimes)
- Feature weighting: TF.IDF or IDF
- Dimensionality reduction (sometimes)
32. Weka
- Java-based tool for large-scale machine-learning problems
- Tailored towards text analysis
- http://weka.sourceforge.net/wekadoc/
33. Weka Input Format
- Expects a particular input file format
- Called ARFF: Attribute-Relation File Format
- Consists of a Header and a Data section
- http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)
34. WEKA File Format: ARFF

  @relation heart-disease-simplified

  @attribute age numeric
  @attribute sex {female, male}
  @attribute chest_pain_type {typ_angina, asympt, non_anginal, atyp_angina}
  @attribute cholesterol numeric
  @attribute exercise_induced_angina {no, yes}
  @attribute class {present, not_present}

  @data
  63,male,typ_angina,233,no,not_present
  67,male,asympt,286,yes,present
  67,male,asympt,229,yes,present
  38,female,non_anginal,?,no,not_present
  ...

- Numerical attribute: e.g. age, cholesterol
- Nominal attribute: e.g. sex, chest_pain_type (allowed values listed in braces)
- Other attribute types: String, Date
- Missing value: ?

http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)
35. WEKA Sparse File Format
- Value 0 is not represented explicitly
- Same header (i.e. @relation and @attribute tags)
- The @data section is different
- Instead of:
    @data
    0, X, 0, Y, "class A"
    0, 0, W, 0, "class B"
- We have:
    @data
    {1 X, 3 Y, 4 "class A"}
    {2 W, 4 "class B"}
- This saves LOTS of space for text applications
- Why?
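Because bag-of-words feature vectors are almost entirely zeros, only a handful of index/value pairs per document need to be written. A minimal sketch of producing such a file; the helper, file name, and toy attributes are ours, not part of Weka:

    def write_sparse_arff(path, relation, attributes, rows):
        """Write a sparse ARFF file; each row is a dict {attribute_index: value},
        and zero-valued attributes are simply omitted from the @data lines."""
        with open(path, "w") as f:
            f.write("@relation %s\n\n" % relation)
            for name, a_type in attributes:
                f.write("@attribute %s %s\n" % (name, a_type))
            f.write("\n@data\n")
            for row in rows:
                pairs = ", ".join("%d %s" % (i, v) for i, v in sorted(row.items()))
                f.write("{%s}\n" % pairs)

    attributes = [("jaguar", "numeric"), ("car", "numeric"),
                  ("jungle", "numeric"), ("class", "{auto, wildlife}")]
    rows = [{1: 2, 3: "auto"},                 # car=2, all other counts 0
            {0: 1, 2: 3, 3: "wildlife"}]       # jaguar=1, jungle=3
    write_sparse_arff("example.arff", "toy_text", attributes, rows)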
36. Next Time
- Wed: Guest lecture by Peter Jackson
  - "Pure and Applied Research in NLP: The Good, the Bad, and the Lucky"
- Following week
  - Text Categorization Algorithms
  - How to use Weka