Title: Introduction to Automatic Email Classification
1Introduction to Automatic Email Classification
- Shih-Wen (George) Ke
- 7th Dec 2005
2Overview
- Introduction to Enron Corpus
- Traditional Text Classification vs Email Classification
- Recent Work on Enron Corpus
- Our Work on Enron Corpus
- Summary
- Future Research Directions in Information Retrieval
- Further Discussion
3Overview
- The nature of email classification is very different from that of traditional text classification tasks.
- Email is time-dependent, poorly structured and informally written, and no standard ways of preparing and evaluating email datasets have been proposed.
4Introduction
- Automatic email classification dates back to the mid-1990s
- Email classification received little attention until recently because no standard email dataset was available
- The Enron Email Corpus became available in March 2004
5Introduction Enron Corpus
- Distributed by William Cohen at Carnegie Mellon University
- Consists of 517,431 messages that belong to 150 users of Enron Corporation
- Most users use folders to categorise their emails
- Upper bound for the number of folders appears to be the log of the number of messages (Klimt & Yang, 2004)
6Email Classification Assumptions
- Categorise email into folders, a.k.a. email foldering
- Only personal and professional emails are considered here
- Assume that users use folders to organise their emails
- Other methods of organising emails, e.g. flags or labels, are not considered here, although they may provide more information for email classification
7Recent Work on Enron Corpus
Bekkerman et al. (2004) Klimt & Yang (2004)
Mono Multiple-classification Multiple-classification
Accuracy (TP/N) PR, Micro & Macro F1
- SVM performed best in most cases, but the difference was not statistically significant
- Newly created folders adversely affect performance
- Performance does not necessarily improve as the training set size grows
- Incoming emails are more related to those recently received than to those received long ago
- Enron is suitable for email classification evaluation
- Body field is the most useful feature, followed by From
- Email threads can be a valuable asset to email classification, but they are difficult to detect and evaluate
- Foldering strategies differ individually
8Our Work on Enron Corpus- Introduction
- Users sometimes forget which folders they have created or which folder they should file an email under
- So users tend to create new (duplicate) folders
- Newly created folders adversely affect performance (Bekkerman et al., 2004)
- Reduce the likelihood of users creating duplicate folders by improving the accuracy of assigning incoming emails to folders that already exist
- Compare state-of-the-art classifiers (kNN, SVM) and our own classifier, PERC, in a simulation of a real-time situation using various parameter settings
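The slides do not say how the real-time simulation was set up. One plausible arrangement, shown here purely as an assumption, is to order the emails by arrival time and repeatedly train on everything seen so far while testing on the next batch. The function name `incremental_splits`, the `timestamp` field and the `step` parameter are hypothetical.

```python
def incremental_splits(emails, step):
    """Yield (train, test) pairs in arrival order.

    Assumption: each email is a dict with a sortable "timestamp" key.
    Train on all emails seen so far, test on the next `step` emails.
    """
    ordered = sorted(emails, key=lambda e: e["timestamp"])
    for start in range(step, len(ordered), step):
        yield ordered[:start], ordered[start:start + step]


# Hypothetical toy mailbox: (timestamp, folder) pairs in arbitrary order.
emails = [
    {"timestamp": t, "folder": f}
    for t, f in [(3, "a"), (1, "a"), (2, "b"), (5, "b"), (4, "a")]
]
splits = list(incremental_splits(emails, step=2))
```

With five emails and `step=2`, this yields two rounds: train on the two oldest and test on the next two, then train on the four oldest and test on the last one. A classifier never sees an email that arrived after the ones it is tested on.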
9Our Work on Enron Corpus- The PERC
- The PERC Classifier (PERsonal email Classifier)
- Find a centroid ci for each category Ci
- For each test document x
  - Find the k nearest neighbouring training documents to x
  - The similarity between x and each neighbouring training document dj is added to the similarity between x and ci
  - Sort the similarity scores sim(x,Ci) in descending order
  - The decision to assign x to Ci can be made using various thresholding strategies
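The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes sparse term-frequency vectors stored as dicts, cosine similarity, one folder per training email, and that each of the k nearest neighbours adds its own similarity plus the similarity between x and the centroid of its folder. The names `cosine`, `centroids` and `perc_scores` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroids(train):
    """Average the term vectors of each folder: one centroid per category."""
    sums, counts = defaultdict(Counter), Counter()
    for vec, folder in train:
        sums[folder].update(vec)  # adds term weights element-wise
        counts[folder] += 1
    return {f: {t: w / counts[f] for t, w in s.items()} for f, s in sums.items()}


def perc_scores(x, train, k=3):
    """Rank folders for test document x (assumed PERC-style scoring)."""
    cents = centroids(train)
    # The k training documents most similar to x.
    neighbours = sorted(train, key=lambda d: cosine(x, d[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, folder in neighbours:
        # Assumption: neighbour similarity plus centroid similarity.
        scores[folder] += cosine(x, vec) + cosine(x, cents[folder])
    # Highest score first; a thresholding strategy then picks the folder(s).
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: (term vector, folder) pairs.
train = [
    ({"meeting": 1, "agenda": 1}, "meetings"),
    ({"meeting": 1, "room": 1}, "meetings"),
    ({"gas": 1, "price": 1}, "trading"),
]
x = {"meeting": 1, "agenda": 1, "room": 1}
ranked = perc_scores(x, train, k=2)
```

Here both nearest neighbours belong to the "meetings" folder, so it is the only folder that receives a score. Taking the top-ranked folder is the simplest thresholding strategy; others would work on the same ranked list.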
10Our Work on Enron Corpus- The PERC
- The PERC Classifier (PERsonal email Classifier)
- The score sim(x,Ci) is built from the following terms: y(dj,Ci) ∈ {0,1} is the classification for training document dj with respect to category Ci; sim(x,dj) is the similarity between test document x and training document dj; and sim(x,ci) is the similarity between test document x and the centroid ci of the category that dj belongs to.
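A scoring function consistent with these definitions and with the steps on the previous slide can be written as follows. This is a reconstruction under stated assumptions, not a confirmed transcription of the authors' equation. The notation kNN(x) for the set of the k nearest training documents to x is introduced here for illustration.

```latex
\[
\mathrm{sim}(x, C_i) \;=\; \sum_{d_j \in \mathrm{kNN}(x)} y(d_j, C_i)\,
  \bigl[\, \mathrm{sim}(x, d_j) + \mathrm{sim}(x, c_i) \,\bigr]
\]
```

Under this reading, only neighbours that belong to Ci contribute to its score, and each contributes both its own similarity to x and the similarity of x to the centroid of Ci.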
11Rationale for the Hybrid Approach
- The centroid method overcomes data sparseness: emails tend to be short.
- kNN allows the topic of a folder to drift over time. Considering the vector space locally allows matching against features which are currently dominant.
12Our Work on Enron Corpus- Results
SVM1 (c=1, j=1), SVM2 (c=0.01, j=1)
Micro-averaged and macro-averaged F1 over all users, with standard deviation, for kNN, SVM and PERC
For macro-averaged evaluations, PERC significantly outperformed kNN (t=2.786, p=0.032), SVM1 (t=2.533, p=0.044) and SVM2 (t=5.926, p=0.001)
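As a reminder of why the two averages can disagree, the sketch below computes micro- and macro-averaged F1 from per-folder counts of true positives, false positives and false negatives. The folder names and counts are hypothetical and are not taken from the experiments. Micro-averaging pools the counts, so large folders dominate; macro-averaging gives every folder equal weight, so small folders matter as much as large ones.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps each folder to a (tp, fp, fn) tuple."""
    # Macro: average the per-folder F1 scores.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts over all folders, then compute one F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Hypothetical counts: one large, well-classified folder and one small, hard one.
counts = {"big_folder": (90, 10, 10), "small_folder": (1, 3, 3)}
micro, macro = micro_macro_f1(counts)
```

In this toy case the micro-averaged score stays high because the large folder dominates, while the macro-averaged score drops because the small folder is classified poorly.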
13Our Work on Enron Corpus- Conclusions
- PERC has the highest accuracy when assigning test documents to small folders
- kNN and PERC performed better with smaller k
- Parameters of SVM can be sensitive to the number of training documents available
- Investigate various parameter settings and training/test set splits
- Use of time will be investigated
- A questionnaire-based study is being conducted to characterise the behaviour of real users in email management
14Future Research Directions in IR
- Use of time information
- Training/test set splits
- Feature extraction, selection
- Document representation
- Qualitative evaluation
- Thread detection, TDT for email
- Mining sequential patterns
- Burst of activity (Kleinberg, 2002)
15References
- Bekkerman, R., McCallum, A. and Huang, G. (2004) Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora. Technical Report IR-418, CIIR, University of Massachusetts.
- Kleinberg, J. (2002) Bursty and Hierarchical Structure in Streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Klimt, B. and Yang, Y. (2004) The Enron Corpus: A New Dataset for Email Classification Research. European Conference on Machine Learning.