Title: Introduction to Automatic Text Classification
1 Introduction to Automatic Text Classification
2 Overview
- What is Text Classification (TC)
- Motivation of Automatic TC
- How Automatic TC is done
- Preprocessing
- k-Nearest Neighbour
- How we know it works
- Example: Email Classification
- Summary
3 What is Text Classification
- TC is commonly defined as the task of classifying natural language documents into a pre-defined set of semantic categories.
- For example: Entertainment, Health, Business, Technology, etc.
4 Motivation of Automatic TC
- Categorised data are easier for users to browse
- An organisational view of the data provides more effective retrieval
- Efficient search alone is not enough
6 Motivation of Automatic TC
- Manual text classification is time-consuming and expensive
- MEDLINE (National Library of Medicine) indexed over 600k citations in 2006 using Medical Subject Headings (MeSH; 23,000 categories)
- Yahoo! Directories: over 500k categories
7 Motivation of Automatic TC
- Fatal drug mix killed US R&B star
- Grammy-nominated R&B star Gerald Levert was killed by an accidental mixture of over-the-counter and prescription drugs, according to a US coroner.
- The singer, who died last November, had pain killers, anxiety medication and allergy drugs in his bloodstream, said Cleveland coroner Kevin Chartrand.
- The official cause of death was acute intoxication, and the death was ruled to be accidental, he said.
- Levert found fame in R&B trio LeVert, and had a UK top 10 hit with Casanova.
- He also recorded as a solo artist, and worked with soul legends such as Anita Baker, Barry White and Patti LaBelle.
- Source: BBC, Sunday, 11 February 2007, 13:03 GMT
- Category: Music? Health? Entertainment? R&B? USA? Medicine? UK?
8 How Automatic TC is done: Learning Task
- Binary setting
- Simplest problem, e.g. spam vs non-spam
- Multi-class setting
- E.g. the task of classifying a news story into one of the categories in the BBC directory
- Can be treated as n binary tasks
- Multi-label setting
- One document can belong to several categories, exactly one, or none at all
9 How Automatic TC is done: Knowledge Engineering
- In the late 1980s
- Knowledge Engineering
- Experts hand-craft classification rules
- Rules
- Rule 1: (R&B or star or soul) and (singer or artist) → Music
- Rule 2: (drug or prescription) and medication → Medicine
- Rule 3: (anxiety or pain or allergy) and acute → Health
- Rule 4: (play or fame) and award → Entertainment
- Rule ...
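The hand-crafted rules above can be read as Boolean tests over the words in a document. The following is a minimal Python sketch of that idea, not the system described in the slides: the function names `tokenize` and `classify` are hypothetical, and exact lowercase word matching (so "drugs" does not match "drug") is an assumption, since the slides do not say how terms are matched.

```python
import re

# Each rule is (category, test over the set of words in a document).
# The four rules mirror Rules 1-4; the matching scheme is an assumption.
RULES = [
    ("Music", lambda w: bool(w & {"r&b", "star", "soul"}) and bool(w & {"singer", "artist"})),
    ("Medicine", lambda w: bool(w & {"drug", "prescription"}) and "medication" in w),
    ("Health", lambda w: bool(w & {"anxiety", "pain", "allergy"}) and "acute" in w),
    ("Entertainment", lambda w: bool(w & {"play", "fame"}) and "award" in w),
]


def tokenize(text):
    """Lowercase the text and return its set of word tokens ('&' is kept so 'R&B' survives)."""
    return set(re.findall(r"[a-z0-9&]+", text.lower()))


def classify(text):
    """Return every category whose rule fires; a document may match several rules or none."""
    words = tokenize(text)
    return [category for category, rule in RULES if rule(words)]


if __name__ == "__main__":
    print(classify("The R&B star and singer took prescription medication for acute pain."))
```

Even this toy version hints at the maintenance problem: every new category or change of definition means editing the rule table by hand.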
10 How Automatic TC is done: Knowledge Engineering
- Still inefficient and impractical when
- The number of categories is large
- Category definitions change over time
- The application is personalised and no expert/knowledge engineer is available
- Inconsistency issues arise as the rule set gets larger
11 How Automatic TC is done: Machine Learning
- Since the 1990s
- The learning algorithm is given a small set of manually classified documents (training documents/dataset)
- Documents to be classified are the test documents/dataset
- It produces a classification rule automatically
- A.k.a. a supervised learning problem
- But how do we make the learning algorithm learn from the training documents?
12 How Automatic TC is done: Machine Learning - Preprocessing
- Pre-processing
- Representing text
- Bag-of-words approach: Term Frequency (TF)
- Feature selection
- Stopword removal
- Feature construction
- Stemming
- Term weighting: Document Frequency (DF), Inverse Document Frequency (IDF)
- The bag-of-words approach may not be the best method for other languages
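The preprocessing steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the pipeline used in the slides: the stopword list is a tiny illustrative sample, `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter's, and IDF is taken as log(N / DF), which is one common variant.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"a", "an", "and", "by", "in", "of", "the", "to", "was", "with"}


def crude_stem(word):
    """Very rough suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenise, remove stopwords, stem, and return a bag of words (term -> TF)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)


def inverse_document_frequency(bags):
    """IDF(t) = log(N / DF(t)), where DF(t) counts the documents that contain t."""
    n_docs = len(bags)
    df = Counter(term for bag in bags for term in bag)
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf(bag, idf):
    """Weight each term frequency by the term's IDF (0.0 for unseen terms)."""
    return {term: tf * idf.get(term, 0.0) for term, tf in bag.items()}
```

With this weighting, a term that occurs in every training document gets an IDF of zero, so it contributes nothing when documents are compared.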
13 How Automatic TC is done: Machine Learning - Preprocessing
- Fatal drug mix killed US R&B star
- Grammy-nominated R&B star Gerald Levert was killed by an accidental mixture of over-the-counter and prescription drugs, according to a US coroner.
- The singer, who died last November, had pain killers, anxiety medication and allergy drugs in his bloodstream, said Cleveland coroner Kevin Chartrand.
- The official cause of death was acute intoxication, and the death was ruled to be accidental, he said.
- Levert found fame in R&B trio LeVert, and had a UK top 10 hit with Casanova.
- He also recorded as a solo artist, and worked with soul legends such as Anita Baker, Barry White and Patti LaBelle.
- Source: BBC, Sunday, 11 February 2007, 13:03 GMT
19 How Automatic TC is done: Machine Learning - kNN
- k-Nearest Neighbour (kNN)
- Documents located close to each other are more likely to belong to the same class
- k is a pre-defined parameter that determines how many neighbouring training documents are considered when classifying a test document
- k is an integer, e.g. 1, 3, 5, 7 or 10
- Cosine similarity is commonly used to measure the closeness of two documents
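As a rough illustration of the closeness measure, cosine similarity between two bag-of-words vectors can be computed as below. The sparse dictionary representation (term to weight) and the choice to return 0.0 for an empty document are assumptions made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors given as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention assumed here for documents with no terms
    return dot / (norm_a * norm_b)
```

For non-negative term weights the result lies between 0 (no shared terms) and 1 (same direction), and it does not depend on document length, which is one reason it suits text.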
22 How Automatic TC is done: Machine Learning - kNN
- Weighted-sum voting scheme
23 How Automatic TC is done: Machine Learning - kNN
- The score for a category is the sum of the similarity scores between the point to be classified and all of its k nearest neighbours that belong to the given category.
- To restate:
\[ \mathrm{score}(x, c) = \sum_{d \in \mathrm{kNN}(x)} \mathrm{sim}(x, d) \, I(d, c) \]
where
- x is the new point
- c is a class (e.g. black or white)
- d is a classified point among the k nearest neighbours of x
- sim(x, d) is the similarity between x and d
- I(d, c) = 1 if point d belongs to class c, and I(d, c) = 0 otherwise.
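The weighted-sum voting scheme can be sketched as follows. This is one possible reading of the description above, not the original implementation: the names `cosine` and `knn_classify` are hypothetical, cosine similarity is assumed as sim(x, d), and the training set is assumed to be non-empty with k at least 1.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm > 0 else 0.0


def knn_classify(x, training, k):
    """Classify x by weighted-sum voting.

    training is a list of (vector, category) pairs. Returns the winning
    category together with the score of every category seen among the neighbours.
    """
    scored = [(cosine(x, vector), category) for vector, category in training]
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    scores = defaultdict(float)
    for similarity, category in neighbours:
        # I(d, c) is 1 only for the neighbour's own category, so each
        # neighbour adds its similarity to exactly one category score.
        scores[category] += similarity
    best = max(scores, key=scores.get)
    return best, dict(scores)


if __name__ == "__main__":
    training = [({"a": 1}, "black"), ({"a": 1, "c": 1}, "white"), ({"c": 1}, "white")]
    x = {"a": 3, "c": 1}
    for k in (1, 3):
        print(k, knn_classify(x, training, k))
```

In the small example, the single nearest neighbour is black, but with k = 3 the two white neighbours together outscore it, which shows how the choice of k can change the decision.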
24 Exercise
- Imagine a language made up of five English letters, A, B, C, D and E, with B, D and E being stopwords. The kNN system has been trained with 3 training documents, which belong to two different categories, and the task is to classify a new document (the test document) into one of the two categories using the process of automatic text classification with kNN (k = 1).
- [Tables: Preprocessed Training Documents; Unpreprocessed Test Document]
25 How we know it works
- Given n test documents and m categories under consideration, a classifier makes n × m binary decisions. A two-by-two contingency table can be computed for each category.
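One way to build the per-category contingency tables is sketched below. The function name `contingency_tables` and the representation of labels as one set of categories per document are assumptions chosen for illustration.

```python
def contingency_tables(true_labels, predicted_labels, categories):
    """Count TP, FP, FN and TN per category.

    true_labels and predicted_labels each hold one set of categories per
    test document, in the same document order.
    """
    tables = {c: {"TP": 0, "FP": 0, "FN": 0, "TN": 0} for c in categories}
    for truth, predicted in zip(true_labels, predicted_labels):
        for c in categories:  # one binary decision per (document, category) pair
            if c in predicted:
                key = "TP" if c in truth else "FP"
            else:
                key = "FN" if c in truth else "TN"
            tables[c][key] += 1
    return tables
```

Each table's four cells sum to n, so the m tables together account for all n × m binary decisions.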
26 How we know it works
- Performance measures
- Precision (p)
- Recall (r)
- F1-measure
- Accuracy
27 How we know it works
- Precision = TP / (TP + FP), where TP + FP > 0 (otherwise undefined).
- Of the times we predicted a document was in the class, how often were we correct?
- Recall = TP / (TP + FN), where TP + FN > 0 (otherwise undefined).
- Did we find all of the documents that belong in the class?
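The two definitions translate directly into code. In this sketch the undefined cases are signalled by returning None, which is a design choice made here for illustration, not something the slides prescribe.

```python
def precision(tp, fp):
    """TP / (TP + FP); None when the classifier never predicted the class."""
    return tp / (tp + fp) if tp + fp > 0 else None


def recall(tp, fn):
    """TP / (TP + FN); None when no test document belongs to the class."""
    return tp / (tp + fn) if tp + fn > 0 else None
```

For instance, a classifier that assigns 10 documents to a class, 8 of them correctly, while the class really contains 16 documents, has precision 8 / 10 and recall 8 / 16.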
28 How we know it works
- F1-measure = 2(p × r) / (p + r)
- The harmonic mean of precision and recall, with both weighted equally
- A single performance measure for comparing different learning algorithms
- Accuracy = (No. of TP for all categories) / (No. of all test documents)
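A small sketch of the two measures follows. Returning 0.0 for F1 when precision and recall are both zero is a convention assumed here, and `accuracy` takes the total number of correct decisions as its first argument.

```python
def f1_measure(p, r):
    """Harmonic mean of precision p and recall r (0.0 if both are zero, by assumption)."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def accuracy(correct, total):
    """Fraction of test documents that were classified correctly."""
    return correct / total
```

Because it is a harmonic mean, F1 is pulled towards the lower of the two values: precision 0.9 with recall 0.1 gives F1 = 0.18, whereas their arithmetic mean would be 0.5.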
29 Example: Email Classification
- Emails are classified into folders
- Multi-class setting
- Emails are constantly being received
- kNN is updated weekly, i.e. received emails that have been filed into folders are added to the training dataset
- Text in the email body and sender field is used to represent an email
- BOW representation, with stemming but no stopword removal
- Dataset: Enron Email Corpus
30 Example: Email Classification
- Results
- User ID 5 received 87 emails in 18 weeks and keeps them in 7 folders
- kNN correctly classified 72 emails
- Accuracy = 72 / 87 = 0.8276 (82.76%)
- User ID 70 received 881 emails in 114 weeks and keeps them in 69 folders
- kNN correctly classified 517 emails
- Accuracy = 517 / 881 = 0.5868 (58.68%)
- More folders mean a more complex classification problem
31 Summary
- Categorised data means more effective retrieval and search
- The exponential growth in the number of electronic documents makes automatic TC a must
- Simple yet robust techniques can deliver practical solutions to real-world problems
- kNN is one of the most effective methods (and arguably the simplest)
- Personal Information Management (PIM) is a new direction for TC
32 Other Resources
- Sebastiani, F. Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, 2002.
- Joachims, T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, 2002.