Title: Ranking Users for Intelligent Message Addressing
1. Ranking Users for Intelligent Message Addressing
- Vitor R. Carvalho and William Cohen
- Carnegie Mellon University
- Glasgow, April 2nd 2008
2. Outline
- Intelligent Message Addressing
- Models
- Data Experiments
- Email Auto-completion
- Mozilla Thunderbird Extension
- Learning to Rank Results
3-11. (Demo screenshots: the recipient-suggestion interface, listing ranked contacts such as Ramesh Nallapati <ramesh@cs.cmu.edu>, William Cohen <wcohen@cs.cmu.edu>, Akiko Matsui <akiko@cs.cmu.edu>, Yifen Huang <hyfen@andrew.cmu.edu>, einat <einat@cs.cmu.edu>, Jon Elsas <jelsas@cs.cmu.edu>, Andrew Arnold <aard@andrew.cmu.edu>, Tom Mitchell <tom@cs.cmu.edu>, and Frank Lin <frank@cs.cmu.edu>, each next to an "Add" button)
12. The Task: Intelligent Message Addressing
- Predicting likely recipients of an email message, given:
  - (1) the contents of the message being composed
  - (2) other recipients already specified
  - (3) a few initial letters of the intended recipient's contact (intelligent auto-completion)
13. What For?
- Identifying people related to specific topics (or who have specific relevant skills)
  - Relation to Expert Finding [Dom et al., 2003; Campbell et al., 2003]
    - Email message → (long) query
    - Email addresses → experts
- Improved email address auto-completion
- Preventing high-cost errors, particularly in large corporations
  - People simply forget to add important recipients, leading to
    - costly misunderstandings
    - communication delays
    - missed opportunities
14. How Frequent Are These Errors?
- Grep for "forgot", "sorry" or "accident" in the Enron email corpus (half a million real email messages from a large corporation):
  - "Sorry, I forgot to CC you his final offer"
  - "Oops, I forgot to send it to Vince."
  - "Adding John to the discussion.. (sorry John)"
  - "Sorry....missed your name on the cc list!"
- More frequent than expected:
  - At least 9.27% of the users forgot to add a desired email recipient.
  - At least 20.52% of the users were not included as recipients (even though they were intended recipients) in at least one received message.
  - These are lower bounds.
15. Two Ranking Tasks
- TO+CC+BCC prediction: rank all recipients of the message being composed
- CC+BCC prediction: rank additional recipients, given the TO recipients already specified
16. Models
- Non-textual models
  - Frequency only
  - Recency only
- Expert finding models [Balog et al., 2006]
  - M1: candidate model
  - M2: document model
- Rocchio (TFIDF)
- K-Nearest Neighbors (KNN)
- Rank aggregation of the above
17. Non-Textual Models
- Frequency model: rank candidates by the total number of messages addressed to them in the training set
- Recency model: exponential decay over chronologically ordered messages
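The two baselines above can be sketched as follows (a minimal illustration; the message format and the exact decay parameterization are my assumptions, not the deck's):

```python
import math
from collections import Counter

def frequency_scores(sent_messages):
    """Frequency model: rank candidates by how many training messages
    they received. Each message is a dict with a 'recipients' list."""
    counts = Counter()
    for msg in sent_messages:
        counts.update(msg["recipients"])
    return dict(counts)

def recency_scores(sent_messages, beta=100.0):
    """Recency model: exponential decay over chronologically ordered
    messages; the j-th most recent message contributes exp(-j / beta)."""
    ordered = sorted(sent_messages, key=lambda m: m["time"], reverse=True)
    scores = {}
    for j, msg in enumerate(ordered):
        weight = math.exp(-j / beta)
        for r in msg["recipients"]:
            scores[r] = scores.get(r, 0.0) + weight
    return scores
```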
18. Expert Search Models
- M1: candidate model [Balog et al., 2006]
- M2: document model [Balog et al., 2006]
  - f(doc, ca) is estimated as user-centric (UC) or document-centric (DC)
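The document model can be sketched roughly as follows. This is only an illustration of the general M2 idea (rank candidate ca by summing query likelihood over documents, weighted by the f(doc, ca) association); the Jelinek-Mercer smoothing and the data formats are my assumptions, not necessarily the estimation used in the experiments:

```python
from collections import Counter

def m2_scores(query_terms, docs, assoc, lam=0.5):
    """Document-model (M2) sketch: score(ca) = sum_d p(q|d) * f(d, ca),
    with p(q|d) a Jelinek-Mercer-smoothed unigram language model.
    `docs` maps doc id -> token list; `assoc` maps doc id -> a dict
    {candidate: weight} playing the role of f(doc, ca)."""
    collection = Counter(t for toks in docs.values() for t in toks)
    coll_size = sum(collection.values())
    scores = {}
    for d, toks in docs.items():
        tf, n = Counter(toks), len(toks)
        p_q = 1.0
        for t in query_terms:
            p_q *= lam * tf[t] / n + (1 - lam) * collection[t] / coll_size
        for ca, f in assoc.get(d, {}).items():
            scores[ca] = scores.get(ca, 0.0) + p_q * f
    return scores
```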
19. Other Models
- Rocchio (TFIDF) [Joachims, 1997; Salton & Buckley, 1988]
- K-Nearest Neighbors [Yang & Liu, 1999]
20. Model Parameters
- Chosen from preliminary tests:
  - Recency: β = 100 (from {10, 20, 50, 100, 200, 500})
  - KNN: K = 30 (from {3, 5, 10, 20, 30, 40, 50, 100})
  - Rocchio's β = 0 (from {0, 0.1, 0.25, 0.5})
21. Data: Enron Email Collection
- Some good reasons:
  - Large: half a million messages
  - Natural work-related email, not email lists
  - Public and free
  - Different roles: managers, assistants, etc.
- Unfortunately:
  - No clear message-thread information
  - No complete address book information
  - No first/last/full names for many recipients
22. Enron Data Preprocessing
- Set up a realistic temporal split (per user):
  - For each user, the 10 most recent sent messages are held out as test
  - 36 users
- All users had their Address Books (AB) extracted
- Tasks: TO+CC+BCC and CC+BCC
23. Enron Data Preprocessing (cont.)
- Bag-of-words representation: each message is the union of the BOW of its body and the BOW of its subject
- Removed inconsistencies and repeated messages
- Disambiguated several Enron addresses
- Stop words removed; no stemming
- Self-addressed messages removed
24. Threading
- No explicit thread information in Enron, so we try to reconstruct it
- Build the Message Thread Set MTS(msg): the set of previous messages with the same subject as the current one
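Thread reconstruction by subject can be sketched as below. The slide only specifies matching subjects; stripping "Re:"/"Fw:" prefixes before comparing is my assumption:

```python
import re

def _normalize_subject(subject):
    """Lower-case and strip leading 'Re:'/'Fw:'/'Fwd:' prefixes."""
    return re.sub(r'^(?:\s*(?:re|fwd?)\s*:)+', '', subject,
                  flags=re.IGNORECASE).strip().lower()

def message_thread_set(msg, previous_messages):
    """MTS(msg): the earlier messages whose (normalized) subject
    matches the subject of the current message."""
    target = _normalize_subject(msg["subject"])
    return [m for m in previous_messages
            if _normalize_subject(m["subject"]) == target]
```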
25-27. Results (charts not transcribed)
28. Rank Aggregation
- Rankings combined by reciprocal rank
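Reciprocal-rank aggregation over the base rankings can be sketched as:

```python
def reciprocal_rank_fusion(rankings):
    """Combine several ranked candidate lists: each candidate's score
    is the sum of 1/rank over every base ranking that includes it."""
    scores = {}
    for ranking in rankings:
        for pos, cand in enumerate(ranking, start=1):
            scores[cand] = scores.get(cand, 0.0) + 1.0 / pos
    return sorted(scores, key=scores.get, reverse=True)
```

A candidate ranked near the top by several base models beats one ranked first by only a single model, which is why fusion helps when the base systems make different mistakes.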
29. Rank Aggregation Results
30. Observations
- Threading improves MAP for all models
- KNN seems the best choice overall: a document model focused on a few top documents
- The data-fusion method for rank aggregation improved performance significantly: the base systems make different types of mistakes
31. Intelligent Email Auto-completion
- (Charts for the TO+CC+BCC and CC+BCC tasks)
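The intelligent auto-completion idea can be sketched as: keep the model's recipient ranking and restrict it to contacts matching the letters typed so far, instead of sorting matches alphabetically (the contact format here is my assumption):

```python
def autocomplete(prefix, ranked_contacts):
    """Return the model's ranked contacts restricted to those whose
    address starts with the letters typed so far."""
    p = prefix.lower()
    return [c for c in ranked_contacts if c.lower().startswith(p)]
```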
32. Intelligent Email Auto-completion (cont.)
33. Mozilla Thunderbird Extension (Cut Once)
- (Screenshot: suggestions; click to add)
34. Mozilla Thunderbird Extension (Cut Once)
- Interested? Just Google "mozilla extension carnegie mellon"
- User study using Cut Once
  - Encourages a write-then-address behavior instead
35. Related Work
- Expert finding in email: Dom et al. (SIGMOD-03), Campbell et al. (CIKM-03)
- Soboroff, Craswell, de Vries (TREC Enterprise 2005-07): expert finding task on the W3C corpus
- CC prediction: a short paper with the initial idea; a single user, limited evaluation, non-public data [Pal & McCallum, 2006]
36. Can We Do Better Ranking?
- Learning to rank: machine learning to improve ranking
  - Feature-based ranking function
- Many recently proposed methods:
  - RankSVM [Joachims, KDD-02]
  - ListNet [Cao et al., ICML-07]
  - RankBoost [Freund et al., 2003]
  - Perceptron variations [Elsas, Carvalho & Carbonell, WSDM-08]: online, scalable
37. Learning to Rank Recipients
- Combine textual scores with other network features
- Ranking scores as features:
  - Textual feature: KNN scores
  - Network features:
    - Frequency score
    - Recency score
    - Co-occurrence features
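A pairwise ranking perceptron over such feature vectors might look like the following. This is a generic sketch of the online pairwise-update idea, not the exact variant of [Elsas, Carvalho & Carbonell, WSDM-08]:

```python
def score(w, x):
    """Linear ranking score: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron_rank_train(pairs, n_features, epochs=10):
    """Pairwise ranking perceptron: for each (relevant, irrelevant)
    feature-vector pair, update w whenever the relevant candidate
    fails to outscore the irrelevant one."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for pos, neg in pairs:
            if score(w, pos) - score(w, neg) <= 0:
                w = [wi + (p - n) for wi, p, n in zip(w, pos, neg)]
    return w
```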
38. Learning to Rank Recipients: Results
39. Conclusions
- Problem: predicting the recipients of email messages
  - Useful for email auto-completion, finding related people, and preventing addressing errors
- Evidence from a large email collection
- Two subtasks: TO+CC+BCC and CC+BCC
- Various models; KNN the best model in general
- Rank aggregation improved performance
- Improvements in email auto-completion
- Thunderbird extension (Cut Once)
- Promising results on learning to rank recipients
40-41. Thank you
42. Comments
(Thanks, reviewers!)
- No account taken of email structural information (body vs. subject vs. quoted text)
- Identifying named entities ("Dear Mr. X", etc.)
  - Implicitly doing this, but could be done better
  - Enron does not provide many first/last names
- Is f(doc, ca) fairly estimated on email?
  - Might explain the weaker performance of the M2 models