Title: SIMS 290-2: Applied Natural Language Processing
1SIMS 290-2 Applied Natural Language Processing
Barbara Rosario Sept 27, 2004
2Today
- Classification
- Text categorization (and other applications)
- Various issues regarding classification
- Clustering vs. classification, binary vs. multi-way, flat vs. hierarchical classification
- Introduce the steps necessary for a classification task
- Define classes
- Label text
- Features
- Training and evaluation of a classifier
3Classification
- Goal: Assign objects from a universe to two or more classes or categories
- Examples:

Problem                    Object     Categories
Tagging                    Word       POS
Sense disambiguation       Word       The word's senses
Information retrieval      Document   Relevant/not relevant
Sentiment classification   Document   Positive/negative
Author identification      Document   Authors
4Author identification
- They agreed that Mrs. X should only hear of the
departure of the family, without being alarmed on
the score of the gentleman's conduct; but even
this partial communication gave her a great deal
of concern, and she bewailed it as exceedingly
unlucky that the ladies should happen to go away,
just as they were all getting so intimate
together.
- Gas looming through the fog in divers places in
the streets, much as the sun may, from the
spongey fields, be seen to loom by husbandman and
ploughboy. Most of the shops lighted two hours
before their time--as the gas seems to know, for
it has a haggard and unwilling look. The raw
afternoon is rawest, and the dense fog is
densest, and the muddy streets are muddiest near
that leaden-headed old obstruction, appropriate
ornament for the threshold of a leaden-headed old
corporation, Temple Bar.
5Author identification
- Jane Austen (1775-1817), Pride and Prejudice
- Charles Dickens (1812-70), Bleak House
6Author identification
- Federalist papers
- 77 short essays written in 1787-1788 by Hamilton, Jay and Madison to persuade NY to ratify the US Constitution; published under a pseudonym
- The authorship of 12 papers was in dispute (the disputed papers)
- In 1964 Mosteller and Wallace solved the problem
- They identified 70 function words as good candidates for authorship analysis
- Using statistical inference they concluded the author was Madison
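The idea of using function words as authorship evidence can be sketched as follows. This is only an illustration of counting function-word rates, not Mosteller and Wallace's statistical method; the short word list and the sample sentence are made up for the example.

```python
from collections import Counter
import re

# Hypothetical, tiny list of function words (an assumption for this sketch);
# the 70 words used by Mosteller and Wallace are not reproduced here.
FUNCTION_WORDS = ["the", "of", "to", "and", "upon", "by", "on", "while"]

def function_word_rates(text, words=FUNCTION_WORDS):
    """Return occurrences per 1000 tokens for each function word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {w: 1000.0 * counts[w] / total for w in words}

# Made-up sample sentence, used only to exercise the function.
sample = "Upon the whole, the power of the states depends upon the people."
rates = function_word_rates(sample)
for word, rate in rates.items():
    print(f"{word:6s} {rate:7.1f}")
```

Rate vectors like these, computed per essay, are the kind of features an authorship classifier could compare across candidate authors.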
7Function words for Author Identification
8Function words for Author Identification
9Classification
- Goal: Assign objects from a universe to two or more classes or categories
- Examples:

Problem                    Object     Categories
Author identification      Document   Authors
Language identification    Document   Language
10Language identification
- Tutti gli esseri umani nascono liberi ed eguali
in dignità e diritti. Essi sono dotati di ragione
e di coscienza e devono agire gli uni verso gli
altri in spirito di fratellanza.
- Alle Menschen sind frei und gleich an Würde und
Rechten geboren. Sie sind mit Vernunft und
Gewissen begabt und sollen einander im Geist der
Brüderlichkeit begegnen.
- Universal Declaration of Human Rights, UN, in 363
languages
11Language identification
- égaux
- eguali
- iguales
- edistämään
- Ü
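Characteristic character sequences like the ones above are enough for a simple language identifier. The sketch below compares character trigrams of an input against small reference profiles. It uses only the two Universal Declaration sentences quoted in these slides as reference text, so it is a toy that knows just Italian and German; the function names are assumptions of this example.

```python
from collections import Counter

def trigrams(text):
    """Count overlapping character trigrams in lower-cased, whitespace-normalized text."""
    t = " ".join(text.lower().split())
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

# Reference samples: the first sentences of the two UDHR excerpts in the slides.
SAMPLES = {
    "Italian": "Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti.",
    "German": "Alle Menschen sind frei und gleich an Würde und Rechten geboren.",
}
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}

def identify(text):
    """Return the language whose profile shares the most trigram occurrences."""
    grams = trigrams(text)

    def overlap(profile):
        return sum(min(n, profile[g]) for g, n in grams.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

print(identify("Essi sono dotati di ragione e di coscienza"))
print(identify("Sie sind mit Vernunft und Gewissen begabt"))
```

With realistic amounts of reference text per language, the same overlap idea scales to many languages.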
12Classification
- Goal: Assign objects from a universe to two or more classes or categories
- Examples:

Problem                    Object     Categories
Author identification      Document   Authors
Language identification    Document   Language
Text categorization        Document   Topics
13Text categorization
- Topic categorization: classify the document into semantic topics

The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.

One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.
14Text categorization
- http://news.google.com/
- Reuters
- Collection of (21,578) newswire documents.
- For research purposes: a standard text collection to compare systems and algorithms
- 135 valid topic categories
15Reuters
16Reuters
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks
off tomorrow, March 3, in Indianapolis with 160
of the nations pork producers from 44 member
states determining industry positions on a number
of issues, according to the National Pork
Producers Council, NPPC. Delegates to the
three day Congress will be considering 26
resolutions concerning various issues, including
the future direction of farm policy and the tax
law as it applies to the agriculture sector. The
delegates will also debate whether to endorse
concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the
congress, will feature the latest in technology
in all areas of the industry, the NPPC added.
Reuter 3</BODY></TEXT></REUTERS>
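A record like this carries its own labels, so building a labeled training set is mostly a parsing job. The sketch below pulls the topic labels, the title and the train/test split out of one record with regular expressions. It assumes the record is stored with its original SGML angle brackets and quoted attributes; the abbreviated record string and the function name are made up for the example.

```python
import re

# Abbreviated copy of the record above (body text omitted for brevity).
record = (
    '<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" '
    'OLDID="12981" NEWID="798">'
    "<TOPICS><D>livestock</D><D>hog</D></TOPICS>"
    "<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>"
    "</REUTERS>"
)

def parse_record(sgml):
    """Extract the split, the topic labels and the title from one record."""
    topics_block = re.search(r"<TOPICS>(.*?)</TOPICS>", sgml, re.S)
    topics = re.findall(r"<D>(.*?)</D>", topics_block.group(1)) if topics_block else []
    title = re.search(r"<TITLE>(.*?)</TITLE>", sgml, re.S)
    split = re.search(r'LEWISSPLIT="([^"]*)"', sgml)
    return {
        "split": split.group(1) if split else None,
        "topics": topics,
        "title": title.group(1) if title else None,
    }

parsed = parse_record(record)
print(parsed)
```

Note that one document can carry several topic labels (here livestock and hog).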
17Text categorization examples
- Topic categorization
- http://news.google.com/
- Reuters.
- Spam filtering
- Determine if a mail message is spam (or not)
- Customer service message classification
18Classification vs. Clustering
- Classification assumes labeled data: we know how many classes there are and we have examples for each class (labeled data).
- Classification is supervised
- In clustering we don't have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are
- Clustering is unsupervised
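The difference can be made concrete on toy one-dimensional data. In the sketch below, the supervised part uses the given labels to compute one centroid per class, while the unsupervised part sees only the points and has to find two groups by itself. The data, the nearest-centroid rule and the 2-means procedure are illustrative choices, not methods prescribed by the lecture.

```python
def centroid(values):
    return sum(values) / len(values)

# Supervised: every example comes with a class label.
labeled = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (8.0, "B"), (9.0, "B")]

def train_nearest_centroid(examples):
    """Compute one centroid per labeled class."""
    by_label = {}
    for x, y in examples:
        by_label.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_label.items()}

def classify(model, x):
    """Assign x to the class with the closest centroid."""
    return min(model, key=lambda y: abs(x - model[y]))

# Unsupervised: only the points are available; labels are never used.
def two_means(points, iterations=10):
    """Split points into two clusters with a simple k-means loop (k = 2)."""
    centers = [min(points), max(points)]  # deterministic starting centers
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[nearest].append(p)
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

model = train_nearest_centroid(labeled)
points = [x for x, _ in labeled]
print(classify(model, 7.0))
print(two_means(points))
```

The clustering output has no class names attached: it only says which points belong together.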
21Classification
[Figure: Class1, Class2]
22Classification
[Figure: Class1, Class2]
27Clustering
28Categories (Labels, Classes)
- Labeling data
- 2 problems
- Decide the possible classes (which ones, how many)
- Domain and application dependent
- http://news.google.com
- Label text
- Difficult, time consuming, inconsistency between
annotators
29Reuters
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks
off tomorrow, March 3, in Indianapolis with 160
of the nations pork producers from 44 member
states determining industry positions on a number
of issues, according to the National Pork
Producers Council, NPPC. Delegates to the
three day Congress will be considering 26
resolutions concerning various issues, including
the future direction of farm policy and the tax
law as it applies to the agriculture sector. The
delegates will also debate whether to endorse
concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the
congress, will feature the latest in technology
in all areas of the industry, the NPPC added.
Reuter 3</BODY></TEXT></REUTERS>
Why not topic = policy?
30Binary vs. multi-way classification
- Binary classification: two classes
- Multi-way classification: more than two classes
- Sometimes it can be convenient to treat a multi-way problem like a binary one: one class versus all the others, for all classes
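The one-versus-all-the-others reduction amounts to relabeling the same examples once per class. A minimal sketch, with a made-up label list and a hypothetical helper name:

```python
def one_vs_rest(labels):
    """For each class, build a binary labeling: 1 for that class, 0 for all others."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

# Made-up multi-way labels for four documents.
labels = ["sport", "health", "world", "sport"]
binary_tasks = one_vs_rest(labels)
for cls, targets in binary_tasks.items():
    print(cls, targets)
```

One binary classifier would then be trained per entry of `binary_tasks`, and a new document is typically assigned the class whose classifier is most confident.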
31 Flat vs. Hierarchical classification
- Flat classification: relations between the classes undetermined
- Hierarchical classification: hierarchy where each node is the sub-class of its parent node
32Single- vs. multi-category classification
- In single-category text classification, each text belongs to exactly one category
- In multi-category text classification, each text can have zero or more categories
33LabeledText class in NLTK
- LabeledText class
>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
>>> labeled_text.text()
'Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix.'
>>> labeled_text.label()
'sport'
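The LabeledText class shown here comes from the NLTK release used in this course and may not exist in the NLTK version you have installed (worth checking). To experiment with the same idea without it, a stand-in can be sketched in a few lines. The class below is a hypothetical substitute written for this example, not NLTK's implementation; it only mirrors the two accessors used above.

```python
class LabeledText:
    """Hypothetical stand-in: pairs a text with an optional category label."""

    def __init__(self, text, label=None):
        self._text = text
        self._label = label

    def text(self):
        return self._text

    def label(self):
        return self._label

    def __repr__(self):
        return f"{self._text!r}/{self._label!r}"


labeled_text = LabeledText("Schumacher took on the Shanghai circuit.", "sport")
print(labeled_text.text())
print(labeled_text.label())
```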
34NLTK The Classifier Interface
- classify: determines which label is most appropriate for a given text token, and returns a labeled text token with that label.
- labels: returns the list of category labels that are used by the classifier.
>>> token = Token("The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment.")
>>> my_classifier.classify(token)
The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment./health
>>> my_classifier.labels()
("sport", "health", "world",)
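The two-method interface can be mimicked in plain Python to see how a classifier object is used. The sketch below is an assumption-laden illustration: the `Classifier` base class and the keyword-matching `KeywordClassifier` are invented for this example, and `classify` returns just the label rather than a labeled token as the NLTK interface above does.

```python
from abc import ABC, abstractmethod


class Classifier(ABC):
    """Hypothetical interface with the two methods described above."""

    @abstractmethod
    def classify(self, text):
        """Return the most appropriate label for the text."""

    @abstractmethod
    def labels(self):
        """Return the category labels this classifier can assign."""


class KeywordClassifier(Classifier):
    """Toy classifier: picks the label whose keyword set overlaps the text most."""

    def __init__(self, keywords, default=None):
        self._keywords = keywords  # maps label -> set of lower-case words
        self._default = default    # returned when no keyword matches

    def labels(self):
        return tuple(self._keywords)

    def classify(self, text):
        words = set(text.lower().split())
        best = max(self._keywords, key=lambda lab: len(words & self._keywords[lab]))
        return best if words & self._keywords[best] else self._default


# Made-up keyword lists for the three labels used in the slides.
my_classifier = KeywordClassifier({
    "sport": {"champion", "circuit", "prix"},
    "health": {"health", "disease", "heart"},
    "world": {"election", "treaty"},
})
print(my_classifier.labels())
print(my_classifier.classify("More importance should be attached to the prevention of heart disease"))
```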
35Features
>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
- Here the classification takes as input the whole string
- What's the problem with that?
- What are the features that could be useful for this example?
36Feature terminology
- Feature: An aspect of the text that is relevant to the task
- Some typical features
- Words present in text
- Frequency of words
- Capitalization
- Are there named entities (NE)?
- WordNet
- Others?
37Feature terminology
- Feature: An aspect of the text that is relevant to the task
- Feature value: the realization of the feature in the text
- Words present in text: Kerry, Schumacher, China
- Frequency of word: Kerry(10), Schumacher(1)
- Are there dates? Yes/no
- Are there PERSONS? Yes/no
- Are there ORGANIZATIONS? Yes/no
- WordNet: Holonyms (China is part of Asia), Synonyms (China, People's Republic of China, mainland China)
38Feature Types
- Boolean (or Binary) Features
- Features that generate boolean (binary) values.
- Boolean features are the simplest and the most common type of feature.
- f1(text) = 1 if text contains Kerry, 0 otherwise
- f2(text) = 1 if text contains PERSON, 0 otherwise
39Feature Types
- Integer Features
- Features that generate integer values.
- Integer features can be used to give classifiers access to more precise information about the text.
- f1(text) = Number of times text contains Kerry
- f2(text) = Number of times text contains PERSON
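Boolean and integer versions of the same two features can be written as small functions. This is a minimal sketch: real PERSON detection needs a named-entity recognizer, so the fixed name set below is a hypothetical stand-in, and the sample sentence is made up.

```python
import re

# Hypothetical stand-in for a named-entity recognizer: a fixed set of person names.
KNOWN_PERSONS = {"kerry", "schumacher"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Boolean features: 1 if the property holds, 0 otherwise.
def contains_kerry(text):
    return 1 if "kerry" in tokenize(text) else 0

def contains_person(text):
    return 1 if any(t in KNOWN_PERSONS for t in tokenize(text)) else 0

# Integer features: how many times the property occurs.
def count_kerry(text):
    return tokenize(text).count("kerry")

def count_persons(text):
    return sum(1 for t in tokenize(text) if t in KNOWN_PERSONS)

text = "Kerry met Schumacher in Shanghai; Kerry then left."
features = [contains_kerry(text), contains_person(text),
            count_kerry(text), count_persons(text)]
print(features)
```

The integer features keep information the boolean ones throw away: both boolean values are 1 here, while the counts distinguish two mentions of Kerry from three person mentions overall.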
40Features in NLTK
- Feature Detectors
- Features can be defined using feature detector functions, which map LabeledTexts to values
- Method detect, which takes a labeled text, and returns a feature value.
>>> def ball(ltext):
...     return ("ball" in ltext.text())
>>> fdetector = FunctionFeatureDetector(ball)
>>> document1 = "John threw the ball over the fence".split()
>>> fdetector.detect(LabeledText(document1))
1
>>> document2 = "Mary solved the equation".split()
>>> fdetector.detect(LabeledText(document2))
0
41Features in NLTK
- Feature Detector Lists: data structures that represent the feature detector functions for a set of features.
- Feature Value Lists
42Feature selection
- How do we choose the right features?
- Next lecture
43Classification
- Define classes
- Label text
- Extract Features
- Choose a classifier
>>> my_classifier.classify(token)
- The Naive Bayes Classifier
- NN (perceptron)
- SVM
- ... (next Monday)
- Train it (and test it)
- Use it to classify new examples
44Training
- (We'll see what we mean exactly by training when we talk about the algorithms)
- Adaptation of the classifier to the data
- Usually the classifier is defined by a set of parameters
- Training is the procedure for finding a good set of parameters
- Goodness is determined by an optimization criterion such as misclassification rate
- Some classifiers are guaranteed to find the optimal set of parameters
45(Linear) Classification
Linear classifier: g(x) = wx + w0
Parameters: w, w0
[Figure: Class1, Class2]
46(Linear) Classification
Linear classifier: g(x) = wx + w0
Changing the parameters w, w0
[Figure: Class1, Class2]
47(Linear) Classification
Linear classifier: g(x) = wx + w0
For each set of parameters w, w0, calculate the error
[Figure: Class1, Class2]
48(Linear) Classification
Linear classifier: g(x) = wx + w0
For each set of parameters w, w0, calculate the error
[Figure: Class1, Class2]
49(Linear) Classification
Linear classifier: g(x) = wx + w0
For each set of parameters w, w0, calculate the error
Choose the classifier with the lower rate of misclassification
[Figure: Class1, Class2]
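The procedure in these slides (try parameter sets, compute the error of each, keep the best) can be sketched for a one-dimensional linear classifier g(x) = wx + w0. The training points and the small grid of candidate parameters are made up for illustration; practical training algorithms search the parameter space far more cleverly than this brute-force loop.

```python
# Toy 1-D training data: (x, class), with class +1 or -1 (made-up values).
train = [(0.5, -1), (1.0, -1), (1.5, -1), (3.0, 1), (3.5, 1), (4.0, 1)]

def g(x, w, w0):
    """Linear discriminant g(x) = w*x + w0."""
    return w * x + w0

def predict(x, w, w0):
    """Class +1 on the non-negative side of the boundary, -1 otherwise."""
    return 1 if g(x, w, w0) >= 0 else -1

def error_rate(data, w, w0):
    """Fraction of examples the classifier gets wrong."""
    wrong = sum(1 for x, y in data if predict(x, w, w0) != y)
    return wrong / len(data)

# Candidate parameter sets (w, w0): a crude grid chosen for this example.
candidates = [(w, w0) for w in (-1.0, 1.0) for w0 in (-4.0, -3.0, -2.0, -1.0, 0.0)]

for w, w0 in candidates:
    print(f"w={w:+.1f} w0={w0:+.1f} error={error_rate(train, w, w0):.2f}")

# Training = choosing the parameters with the lowest misclassification rate.
best_w, best_w0 = min(candidates, key=lambda p: error_rate(train, *p))
print("chosen:", best_w, best_w0)
```

Several parameter sets can reach the same lowest error; `min` simply keeps the first one it encounters.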
50Testing, evaluation of the classifier
- After choosing the parameters of the classifier (i.e. after training it) we need to test how well it's doing on a test set (not included in the training set)
- Calculate misclassification on the test set
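Keeping the test examples out of training matters because the error on the training set is usually optimistic. The sketch below trains a one-parameter threshold classifier on one part of a made-up data set and measures misclassification on the held-out part; the data, the 6/4 split and the midpoint training rule are all assumptions of this example.

```python
def train_threshold(examples):
    """Place the decision threshold midway between the two class means."""
    neg = [x for x, y in examples if y == -1]
    pos = [x for x, y in examples if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def misclassification_rate(threshold, examples):
    """Fraction of examples on the wrong side of the threshold."""
    wrong = sum(1 for x, y in examples if (1 if x >= threshold else -1) != y)
    return wrong / len(examples)

# Made-up labeled data: (x, class).
data = [(0.5, -1), (3.0, 1), (1.0, -1), (3.5, 1), (1.5, -1), (4.0, 1),
        (2.4, 1), (0.8, -1), (2.6, -1), (3.8, 1)]

# Hold out the last four examples; they are never seen during training.
train, test = data[:6], data[6:]

threshold = train_threshold(train)
train_error = misclassification_rate(threshold, train)
test_error = misclassification_rate(threshold, test)
print("threshold:", threshold)
print("training error:", train_error)
print("test error:", test_error)
```

Here the classifier is perfect on the training set but makes a mistake on the test set, which is the number that should be reported.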
51Evaluating classifiers
- Contingency table for the evaluation of a binary classifier

                     GREEN is correct   RED is correct
GREEN was assigned   a                  b
RED was assigned     c                  d

- Accuracy = (a+d)/(a+b+c+d)
- Precision: P_GREEN = a/(a+b), P_RED = d/(c+d)
- Recall: R_GREEN = a/(a+c), R_RED = d/(b+d)
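The measures follow directly from the four cells of the contingency table, where a and b are the counts in the "GREEN was assigned" row and c and d the counts in the "RED was assigned" row. A minimal sketch with made-up counts (the function name is an assumption, and it does not guard against empty rows or columns):

```python
def evaluate(a, b, c, d):
    """Accuracy, precision and recall from the 2x2 contingency table.

    a: GREEN assigned, GREEN correct    b: GREEN assigned, RED correct
    c: RED assigned, GREEN correct      d: RED assigned, RED correct
    Assumes no row or column of the table sums to zero.
    """
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision_green": a / (a + b),
        "precision_red": d / (c + d),
        "recall_green": a / (a + c),
        "recall_red": d / (b + d),
    }

# Made-up counts for 100 test examples.
scores = evaluate(a=40, b=10, c=5, d=45)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```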
52Training size
- The more the better! (usually)
- Results for text classification
53Training size
54Training size
55Training Size
56Next Time and Upcoming
- Define classes
- Label text
- Features (Wednesday)
- Classifiers (next week)
- The Naive Bayes Classifier
- NN (perceptron)
- SVM
- Decision trees
- K nearest neighbor
- Maximum Entropy models