Title: Modeling Identity in Archival Collections of Email: A Preliminary Study
1Modeling Identity in Archival Collections of Email: A Preliminary Study
- Tamer Elsayed and Douglas W. Oard
Institute for Advanced Computer Studies
Department of Computer Science
College of Information Studies
Conference on Email and Anti-Spam (CEAS), July
28th, 2006
2Real Problem
Clinton White House
search request
Tobacco Policy
32 million emails
80,000
hired 25 persons
for 6 months
200,000
3Email Search
Searcher
- Meaning → Modeling Content
- People → Modeling Identity
4Identity
[Diagram: an email connects a sender (who sent it) with its receivers (who received it) and with the people it mentions; each of these people is described by an email address, a name, and a nickname. Relations shown: sent email to, mentioned to, mentions, mentioned.]
5Outline
- Problem
- Identity Resolution Architecture
- Evaluation
- Conclusion
6Entity Example
Nickname
Name
Robert Bruce
Bob
Main Headers (915) Quoted Headers (8)
Salutations (7) Free Signatures (9)
Email Address
robert.bruce_at_enron.com
Static Signature (140)
Robert E. Bruce Senior Counsel Enron North
America Corp. T (713) 345-7780 F (713)
646-3393 robert.bruce_at_enron.com
Signature Block
7Enron Collection
- Example of large organizational collection
- CMU version
- about half a million emails
- 133,581 unique email addresses
- 52% of emails are duplicates!
- same address, subject, body
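The duplicate criterion on this slide (same address, subject, body) can be sketched as a grouping key. This is a minimal illustration, not the authors' implementation: the message field names (`from`, `subject`, `body`), the choice of the sender address as "the address", and the whitespace and case normalization are all assumptions.

```python
import hashlib

def duplicate_key(msg):
    """Hash of (sender address, subject, body). Field names are hypothetical."""
    parts = (
        msg["from"].strip().lower(),  # assumption: "address" means the sender address
        msg["subject"].strip(),
        msg["body"].strip(),
    )
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()

def unique_emails(messages):
    """Keep the first message seen for each key, preserving input order."""
    seen = set()
    unique = []
    for msg in messages:
        key = duplicate_key(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Copies of one message stored in several folders would collapse to a single entry under this key, as long as the three fields match after normalization.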
8Typical Enron Email
Message-ID: <1494.1584620.JavaMail.evans_at_thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager_at_enron.com
To: sstack_at_reliant.com
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: 'SStack_at_reliant.com_at_ENRON'
Message Header
Hi Shari
Salutation
Message Body
Main Body
Hope all is well. Count me in for the group
present. See ya next week if not earlier
Liza
Elizabeth Sager
713-853-6349
Signature Block
-----Original Message-----
From: SStack_at_reliant.com_at_ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo_at_hess.com; wfhenze_at_jonesday.com
Cc: ntillett_at_reliant.com
Subject: Shhhh.... it's a SURPRISE !
Quoted Header
Quoted Text
Quoted Main Body
Please call me (713) 207-5233
Thanks! Shari
Quoted Signature
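The layout on this slide splits a message into a main body and quoted text. A minimal sketch of that separation follows, assuming the quoted part begins at the first "Original Message" divider, "Forwarded by" divider, or `>`-prefixed line (the marker styles that appear in this deck's examples). The function name and the marker list are illustrative, not the authors' rules.

```python
import re

# Markers that start quoted text; this list is an assumption based on the examples shown.
QUOTE_START = re.compile(
    r"^(?:-{2,}\s*Original Message\s*-{2,}|-{2,}\s*Forwarded by\b|>)",
    re.IGNORECASE | re.MULTILINE,
)

def split_main_and_quoted(body):
    """Return (main_body, quoted_text); quoted_text is empty if no marker is found."""
    match = QUOTE_START.search(body)
    if match is None:
        return body.strip(), ""
    return body[:match.start()].strip(), body[match.start():].strip()
```

Salutation and signature detection would then run on the main body only, while the quoted part goes to quoted-header extraction.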
9Identity Resolution Architecture
Entities
Clustering Associations
Address-Nickname Associations
Address-Name Associations
Address-Address Associations
Nickname Extraction
Salutation lines
Signature lines
Extraction from Quoted Header
Signature Line Detection
Salutation Line Detection
Main body
Quoted headers
Extraction from Main Header
Body and Quoted Text Separation
Unique emails
Duplicate Detection
10Extraction From Main Headers
Name-Address Association
Message-ID: <1486175.1075858665169.JavaMail.evans_at_thyme>
Date: Wed, 26 Sep 2001 09:25:19 -0700 (PDT)
From: jmathes_at_nbchamber.com
To: mark.vandini_at_enron.com, steve.urbon_at_enron.com, sapienza.tony_at_enron.com, o'rourke.tom_at_enron.com, lyons.tom_at_enron.com
Subject: New Email Address
X-From: Jim Mathes <jmathes_at_nbchamber.com>
X-To: Vandini, Mark <Mark_Vandini_at_nstaronline.com>, Urbon Steve <surbon_at_s-t.com>, Tony Sapienza <sapiena_at_gftusa.com>, Tom O'Rourke <tom_at_plymouthchamber.com>, Tom Lyons <tlyons_at_frfive.com>, Tom Hodgson <sheriff_at_BCSO-MA.org>
X-cc:
X-bcc:
We have just launched our "New Improved Website",
www.newbedfordchamber.com
and I have a new email address
jmathes_at_newbedfordchamber.com
Address-Address Association
Name-Address Association
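Name-address associations of the kind marked on this slide can be read off header values written in the angle-bracketed `Name <address>` form. The sketch below is one hedged way to do it; the function name is hypothetical, and addresses are kept in the `_at_` spelling used throughout this deck.

```python
import re

# One "display name <address>" pair; display names may themselves contain commas.
PAIR = re.compile(r"([^<>]+?)\s*<([^<>\s]+)>")

def name_address_pairs(header_value):
    """Extract (name, address) pairs from an X-From or X-To style value."""
    pairs = []
    for raw_name, address in PAIR.findall(header_value):
        name = raw_name.strip(" ,\"'")  # drop list separators and surrounding quotes
        if name:
            pairs.append((name, address.lower()))
    return pairs
```

Because the pattern keys on the angle brackets rather than on commas, a value such as `Vandini, Mark <...>, Urbon Steve <...>` still yields one pair per person.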
11Extraction From Quoted Headers
Hi Jeff,
Did you get our registration packet? If not, stop by and pick one up because you need it. Make sure you get the one for new students.
Shawn
On Wednesday, November 03, 1999 11:18 AM, Jeff Dasovich SMTP:jdasovic_at_enron.com wrote:
>
>
> ok, don't shoot me, but what's the deadline for scheduling for classes?
>
> signed,
> clueless
Name-Address Association
---------------------- Forwarded by Elizabeth Sager/HOU/ECT on 02/09/2000 12:02 PM ---------------------------
"Patricia Young" <PYoung_at_eei.org> on 02/09/2000 08:50:59 AM
To: Elizabeth Sager/HOU/ECT_at_ECT
cc:
Subject: If possible, would you forward your resume to me electronically? Thanks.
If possible, would you forward your resume to me electronically? Thanks.
Name-Address Association
12Signature Salutation Detection
From susan.scott_at_enron.com
Had another sleepless night Sun. and finally took
some Unisom and had a good night's sleep last
night. What a relief. I have really never had
this problem before. It's good to have a lot of
energy, but you have to shut down sometime. Am
sending you my travel schedule for next week.
The following week (May 29 - June 2) I'm
planning to be in SF also, but I'm not sure I'll
actually have to be there that long. Have a
good afternoon!
love, sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002
The week is going OK. All the tennis and
swimming has left me with sore muscles so this
is my night off. Am planning to do some more
house chores so I do not end up with another
weekend like the last. I'm still planning on
coming to Austin next weekend, I'm just not sure
when, but I'll let you know. Call if you get
lonely!
Love, Sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002
The kiddies are going back to school already so
now would be a good time to plan a trip to D.C.
at last. Maybe early Sept? Also I'd be game for
a girls' trip to Destin. Time to work!
Love, -Sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002
13Nickname Extraction
From susan.scott_at_enron.com
Had another sleepless night Sun. and finally took
some Unisom and had a good night's sleep last
night. What a relief. I have really never had
this problem before. It's good to have a lot of
energy, but you have to shut down sometime. Am
sending you my travel schedule for next week.
The following week (May 29 - June 2) I'm
planning to be in SF also, but I'm not sure I'll
actually have to be there that long. Have a
good afternoon!
love, sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002
nickname
- 3,151 address-nickname associations
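The slides do not spell out the extraction heuristics, so the following is only a rough sketch of how nickname candidates might be pulled from sign-off and salutation lines. The lists of closing and greeting words, the length bound on a nickname, and both function names are assumptions made for illustration.

```python
import re

# Hypothetical closing and greeting words; the slides do not list the real ones.
CLOSINGS = ("love", "thanks", "regards", "cheers")
GREETINGS = ("hi", "hello", "dear")
TOKEN = r"([A-Za-z][A-Za-z'-]{1,14})"

SIGNOFF = re.compile(
    r"\b(?:%s)[,!]\s+-?\s*%s\b" % ("|".join(CLOSINGS), TOKEN), re.IGNORECASE
)
SALUTATION = re.compile(
    r"^\s*(?:%s)\s+%s\b" % ("|".join(GREETINGS), TOKEN), re.IGNORECASE
)

def signoff_nicknames(main_body):
    """Candidate nicknames of the sender, e.g. from 'love, sooz'."""
    return [m.group(1).lower() for m in SIGNOFF.finditer(main_body)]

def salutation_nickname(main_body):
    """Candidate nickname of a receiver, e.g. from 'Hi Shari'; None if absent."""
    match = SALUTATION.match(main_body)
    return match.group(1).lower() if match else None
```

A candidate found in a sign-off would be associated with the sender's address, and one found in a salutation with a receiver's address. Counting how often the same pair recurs across messages is one plausible way to filter out accidental matches.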
14Identifying Entities
Nickname
Name
Robert Bruce
Bob
Main Headers (915) Quoted Headers (8)
Salutations (7) Free Signatures (9)
3,151 addr-nickname
82,084 addr-name
Email Address
robert.bruce_at_enron.com
19,708 addr-addr
Main Headers (7)
Static Signature (140)
Email Address
Robert E. Bruce Senior Counsel Enron North
America Corp. T (713) 345-7780 F (713)
646-3393 robert.bruce_at_enron.com
rbruce_at_hotmail.com
Quoted Headers (5)
Signature Block
Robert
66,715 entities
Name
15Outline
- Problem
- Identity Resolution Architecture
- Evaluation
- Conclusion
- Future Work
16Stratified Sampling
17Judgment Process
Incorrect
- kmpresto_at_msn.com ↔ "home email"
- terrie.james_at_enron.com ↔ "alexis james-petty"
Correct but not informative
- june-deadrick_at_reliantenergy.com ↔ "june deadrick"
- robbie.lewis_at_enron.com ↔ "robbie lewis"
Correct and somewhat informative
- terriecovarrubias_at_hotmail.com ↔ "terrie covarrubias"
- randal.maffett_at_enron.com ↔ "randy"
Correct and very informative
- lemelpe_at_nu.com ↔ "phyllis"
- piazzet_at_wharton.upenn.edu ↔ "tom"
18Evaluation Measures
Judged Associations
Correct
Very Informative
Informative
19Accuracy
- 100% accuracy with multiple sources of evidence
- Address-name association was nearly perfect
- 80% minimum accuracy in address-nickname associations
- 96.7% entity accuracy
Address-Name Associations
Address-Nickname Associations
Address-Address Associations
20Informativeness
Address-Name Associations
Address-Nickname Associations
Address-Address Associations
21Outline
- Problem
- Identity Resolution Architecture
- Evaluation
- Conclusion
22Conclusion
- Introduced a computational model of identity
- a set of simple techniques put together
- provide a useful baseline
- assessed its potential utility in the context of one fairly complex email collection
- Automatic detection of nicknames in salutations and signature lines
- Most informative results come from the weakest (least accurate) evidence
- Accuracy and informativeness are both important
23Limitations
- Email address associated with single identity
- Strength of evidence not exploited
- Heuristics hand-tuned for Enron collection
- Focus on personal attributes
- No reconciliation of multiple identities for a single person
- No attempt to classify identities as machines or groups
- Recall?
26Future Work
- extend the model to exploit temporal features and behavioral evidence
- implement machine learning techniques
- perform ablation studies
- characterize the coverage of our methods in more detail
- replicate this work in other contexts
- integrate these techniques with the ultimate applications for which computational models of identity are needed (e.g., social network analysis)
27Helping in Judgments
28Identity Framework
Person
Group
Machine
Identity
Identity
Identity
Entity
Entity
Entity
Entity
Entity
Entity
Candidates
29Modeling Identity
- Attributes (stable explicit features)
- email addresses, names, nickname, contact info
- Associations
- Link attributes together
- Based on observations
- Entities
- Representation of an identity
- Set of attributes in undirected graph
- Linked by weighted associations
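One way to picture this model is as a small graph structure whose nodes are attributes and whose edge weights accumulate observations. The sketch below is an illustration under assumptions: the class and method names are hypothetical, and treating a weight as a count of observations is one reading of "weighted associations", not something the slides state.

```python
from collections import defaultdict

class AssociationGraph:
    """Undirected graph: nodes are (kind, value) attributes, edges are associations."""

    def __init__(self):
        # frozenset({attr_a, attr_b}) -> accumulated weight (assumed: observation count)
        self.weights = defaultdict(int)

    def observe(self, attr_a, attr_b, count=1):
        """Record `count` observations linking two attributes."""
        self.weights[frozenset((attr_a, attr_b))] += count

    def weight(self, attr_a, attr_b):
        return self.weights.get(frozenset((attr_a, attr_b)), 0)

    def neighbors(self, attr):
        """Map each attribute linked to `attr` to the weight of that link."""
        result = {}
        for edge, w in self.weights.items():
            if attr in edge and len(edge) == 2:
                (other,) = edge - {attr}
                result[other] = w
        return result

graph = AssociationGraph()
address = ("address", "robert.bruce_at_enron.com")
name = ("name", "Robert Bruce")
graph.observe(address, name, 915)  # main-header evidence from the entity example
graph.observe(address, name, 8)    # quoted-header evidence from the entity example
```

Using a `frozenset` as the edge key makes the association undirected, so `weight(address, name)` and `weight(name, address)` agree. Attaching the 915 and 8 counts to the address-name link is a reading of the entity example slide.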
30Identifying Entities
- First round
- limited transitive closure
- Merging associations
- based on unique attributes
- Address-address associations
- No use of strength of evidence yet
- 66,715 entities
- Covering 77,420 unique email addresses (58% of all addresses)
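Merging through address-address associations can be sketched with a union-find structure, where each resulting connected component is one entity. Note the swap: this sketch computes a full transitive closure, whereas the slide describes a limited one whose limits are not specified here. The function name and the input shapes are assumptions.

```python
def build_entities(addr_addr, addr_attr):
    """Group addresses into entities.

    addr_addr: iterable of (address, address) associations.
    addr_attr: iterable of (address, attribute) associations (names, nicknames).
    Returns a list of {"addresses": set, "attributes": set} dicts.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in addr_addr:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    for addr, _ in addr_attr:
        find(addr)  # register addresses that have no address-address link

    entities = {}
    for addr in list(parent):
        entity = entities.setdefault(find(addr), {"addresses": set(), "attributes": set()})
        entity["addresses"].add(addr)
    for addr, attr in addr_attr:
        entities[find(addr)]["attributes"].add(attr)
    return list(entities.values())
```

With the two addresses and three names from the entity example as input, the result would be a single entity holding both addresses and all three names.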
31Related Work
- Attribute/association extraction
- Name recognition and reference resolution
- Applications
- Social network analysis
- Finding experts
32Unjudged Associations
Address-Name Associations
Address-Nickname Associations
Address-Address Associations
Only 19 ? 3