1
Modeling Identity in Archival Collections of
Email: A Preliminary Study
  • Tamer Elsayed and Douglas W. Oard

Institute for Advanced Computer Studies
Department of Computer Science
College of Information Studies
Conference on Email and Anti-Spam (CEAS), July
28th, 2006
2
Real Problem
Clinton White House
search request
Tobacco Policy
32 million emails
80,000
hired 25 persons
for 6 months
200,000
3
Email Search
Searcher
  • Meaning → Modeling Content
  • People → Modeling Identity

4
Identity
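[Figure: identity diagram. The sender and the receivers of an email are
each described by an email address, a name, and a nickname. A person
mentioned in the email is likewise described by an email address, a name,
and a nickname. Edge labels: sent email to, sent, received, mentions,
mentioned, mentioned to.]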
5
Outline
  • Problem
  • Identity Resolution Architecture
  • Evaluation
  • Conclusion

6
Entity Example
Name: Robert Bruce
Nickname: Bob
Email Address: robert.bruce@enron.com
Evidence (observation counts): Main Headers (915), Quoted Headers (8),
Salutations (7), Free Signatures (9), Static Signature (140)
Signature Block:
Robert E. Bruce, Senior Counsel, Enron North America Corp.
T (713) 345-7780, F (713) 646-3393, robert.bruce@enron.com
7
Enron Collection
  • Example of a large organizational collection
  • CMU version
  • about half a million emails
  • 133,581 unique email addresses
  • 52% of emails are duplicates (same address, subject, body)!
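The duplicate criterion on this slide (same address, subject, body) can be sketched as a hash over those three fields. This is a minimal illustration, not the authors' code: the dictionary keys `from`, `subject`, and `body`, and the case and whitespace normalization, are assumptions.

```python
import hashlib


def dedup_key(sender: str, subject: str, body: str) -> str:
    """Hash the three fields the slide names: address, subject, body."""
    # Lowercasing the address and collapsing whitespace in the body are
    # assumptions; the slides do not say how fields were compared.
    parts = [sender.strip().lower(), subject.strip(), " ".join(body.split())]
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()


def unique_emails(emails):
    """Keep the first message seen for each (address, subject, body) key."""
    seen = set()
    kept = []
    for msg in emails:
        key = dedup_key(msg["from"], msg["subject"], msg["body"])
        if key not in seen:
            seen.add(key)
            kept.append(msg)
    return kept
```

Keeping the first copy preserves the original order of the collection, which matters if later steps rely on message chronology.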

8
Typical Enron Email
Message Header
  Message-ID: <1494.1584620.JavaMail.evans@thyme>
  Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
  From: elizabeth.sager@enron.com
  To: sstack@reliant.com
  Subject: RE: Shhhh.... it's a SURPRISE !
  X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
  X-To: 'SStack@reliant.com@ENRON'
Message Body
Main Body
  Salutation
    Hi Shari
  Hope all is well. Count me in for the group
  present. See ya next week if not earlier
  Signature Block
    Liza
    Elizabeth Sager
    713-853-6349
Quoted Text
  Quoted Header
    -----Original Message-----
    From: SStack@reliant.com@ENRON
    Sent: Monday, July 30, 2001 2:24 PM
    To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com;
      wfhenze@jonesday.com
    Cc: ntillett@reliant.com
    Subject: Shhhh.... it's a SURPRISE !
  Quoted Main Body
    Please call me (713) 207-5233
  Quoted Signature
    Thanks! Shari
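Separating the main body from the quoted text, as labeled in the example message above, can be sketched by splitting at the "Original Message" marker. This is only an illustration: the single marker pattern and the function name `split_body` are assumptions, and other mail clients introduce quoted text differently.

```python
import re

# Marker that opens the quoted part in the example message above.
# Treating this one pattern as the only separator is an assumption.
QUOTE_MARKER = re.compile(r"^-+\s*Original Message\s*-+\s*$", re.MULTILINE)


def split_body(body: str):
    """Return (main_body, quoted_text); quoted_text is '' if no marker."""
    match = QUOTE_MARKER.search(body)
    if match is None:
        return body.strip(), ""
    return body[:match.start()].strip(), body[match.end():].strip()
```

The quoted part still begins with its own header lines (From, Sent, To), which is where a quoted-header extractor would look next.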
9
Identity Resolution Architecture
[Figure: processing pipeline, listed here from input to output]
  • Duplicate Detection (yields unique emails)
  • Body and Quoted Text Separation (yields main body and quoted headers)
  • Extraction from Main Header
  • Extraction from Quoted Header
  • Salutation Line Detection and Signature Line Detection (yield
    salutation lines and signature lines)
  • Nickname Extraction
  • Address-Address, Address-Name, and Address-Nickname Associations
  • Clustering Associations (yields entities)
10
Extraction From Main Headers
Message-ID: <1486175.1075858665169.JavaMail.evans@thyme>
Date: Wed, 26 Sep 2001 09:25:19 -0700 (PDT)
From: jmathes@nbchamber.com
To: mark.vandini@enron.com, steve.urbon@enron.com,
  sapienza.tony@enron.com, o'rourke.tom@enron.com,
  lyons.tom@enron.com
Subject: New Email Address
X-From: Jim Mathes <jmathes@nbchamber.com>
X-To: Vandini, Mark <Mark_Vandini@nstaronline.com>,
  Urbon Steve <surbon@s-t.com>,
  Tony Sapienza <sapiena@gftusa.com>,
  Tom O'Rourke <tom@plymouthchamber.com>,
  Tom Lyons <tlyons@frfive.com>,
  Tom Hodgson <sheriff@BCSO-MA.org>
X-cc:
X-bcc:

We have just launched our "New Improved Website",
www.newbedfordchamber.com
and I have a new email address:
jmathes@newbedfordchamber.com

Name-Address Association
Address-Address Association
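A name-address association of the kind highlighted on this slide can be read off an X-From or X-To value by pairing each display name with the address in angle brackets that follows it. The sketch below is an assumption about how such extraction might look, not the authors' implementation; the function name `name_address_pairs` and the lowercasing of addresses are illustrative choices.

```python
import re

# "display name <address>" pairs. Names may themselves contain commas
# ("Vandini, Mark"), so the name is everything up to the next "<".
PAIR = re.compile(r"([^<>]+)<([^<>\s]+)>")


def name_address_pairs(header_value: str):
    """Extract (name, address) pairs from an X-From / X-To style value."""
    pairs = []
    for raw_name, address in PAIR.findall(header_value):
        # Collapse whitespace, then drop list separators and quotes.
        name = " ".join(raw_name.split()).strip(" ,\"'")
        if name:
            pairs.append((name, address.lower()))
    return pairs
```

Names in "Last, First" order are returned as written; reordering them would be a separate normalization step.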
11
Extraction From Quoted Headers
Hi Jeff,
Did you get our registration packet? If not, stop
by and pick one up because you need it. Make sure
you get the one for new students.
Shawn

On Wednesday, November 03, 1999 11:18 AM, Jeff
Dasovich SMTP:jdasovic@enron.com wrote:
>
>
> ok, don't shoot me, but what's the deadline for
> scheduling for classes?
>
> signed,
> clueless

Name-Address Association

---------------------- Forwarded by Elizabeth
Sager/HOU/ECT on 02/09/2000 12:02 PM
---------------------------
"Patricia Young" <PYoung@eei.org> on 02/09/2000 08:50:59 AM
To: Elizabeth Sager/HOU/ECT@ECT
cc:
Subject: If possible, would you forward your resume
  to me electronically? Thanks.

If possible, would you forward your resume to me
electronically? Thanks.

Name-Address Association
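The reply attribution line in the first example ("On ... Jeff Dasovich SMTP... wrote") carries a name and an address side by side. A regular expression along the following lines could pull them out. This is a hedged sketch: the exact punctuation of the original line is not recoverable from the transcript, so the colon after SMTP and the optional square brackets are assumptions, as is the function name.

```python
import re

# Reply attribution line of the form shown on the slide. The optional
# brackets around the SMTP address are an assumption about the original.
REPLY_LINE = re.compile(
    r"On .+?(?:AM|PM),\s+(?P<name>[^\[\]<>]+?)\s+"
    r"\[?SMTP:(?P<addr>[^\]\s]+)\]?\s+wrote"
)


def quoted_name_address(line: str):
    """Return (name, address) from a reply attribution line, or None."""
    # Collapse line wraps so a folded attribution line still matches.
    match = REPLY_LINE.search(" ".join(line.split()))
    if match is None:
        return None
    return match.group("name"), match.group("addr").lower()
```

Forwarded-message banners such as the second example use a different layout and would need their own pattern.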
12
Signature and Salutation Detection
From: susan.scott@enron.com

Had another sleepless night Sun. and finally took
some Unisom and had a good night's sleep last
night. What a relief. I have really never had
this problem before. It's good to have a lot of
energy, but you have to shut down sometime. Am
sending you my travel schedule for next week.
The following week (May 29 - June 2) I'm
planning to be in SF also, but I'm not sure I'll
actually have to be there that long. Have a
good afternoon!
love, sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002

The week is going OK. All the tennis and
swimming has left me with sore muscles so this
is my night off. Am planning to do some more
house chores so I do not end up with another
weekend like the last. I'm still planning on
coming to Austin next weekend, I'm just not sure
when, but I'll let you know. Call if you get
lonely!
Love, Sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002

The kiddies are going back to school already so
now would be a good time to plan a trip to D.C.
at last. Maybe early Sept? Also I'd be game for
a girls' trip to Destin. Time to work!
Love, -Sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002
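The three messages above end with the same block of lines, which suggests one plausible way to find a static signature: take the lines that close every message from the same sender. This is an assumed heuristic offered for illustration, not necessarily the one the authors used, and it presumes the bodies still have their original line breaks.

```python
def common_trailing_lines(bodies):
    """Return the lines that end every body, in order (may be empty)."""
    line_lists = [
        [ln.strip() for ln in body.strip().splitlines() if ln.strip()]
        for body in bodies
    ]
    if not line_lists:
        return []
    suffix = []
    # Walk backwards from the last line while all messages agree.
    for lines in zip(*(reversed(ls) for ls in line_lists)):
        if len(set(lines)) != 1:
            break
        suffix.append(lines[0])
    return suffix[::-1]
```

Because the sign-off lines differ slightly between messages ("love, sooz" versus "Love, -Sooz"), they fall outside the shared suffix and remain available as free signature lines for nickname extraction.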
13
Nickname Extraction
From: susan.scott@enron.com

Had another sleepless night Sun. and finally took
some Unisom and had a good night's sleep last
night. What a relief. I have really never had
this problem before. It's good to have a lot of
energy, but you have to shut down sometime. Am
sending you my travel schedule for next week.
The following week (May 29 - June 2) I'm
planning to be in SF also, but I'm not sure I'll
actually have to be there that long. Have a
good afternoon!
love, sooz (nickname)
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002
  • 3,151 address-nickname associations
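Nickname candidates can be sketched as the name following a greeting word on the first line (a nickname for the recipient) and the name following a closing word near the end (a nickname for the sender). The keyword lists and the function name below are hypothetical; the slides do not list the actual cues used.

```python
import re

# Hypothetical cue words; the original heuristics are not given in the slides.
SALUTATION = re.compile(
    r"^(?:hi|hello|hey|dear)\s+([A-Za-z][\w'-]*)\s*[,:!]?\s*$", re.IGNORECASE
)
SIGN_OFF = re.compile(
    r"^(?:love|thanks|regards|cheers)[,!]?\s+-?\s*([A-Za-z][\w'-]*)\s*$",
    re.IGNORECASE,
)


def nickname_candidates(main_body: str):
    """Return {'salutation': name or None, 'signature': name or None}."""
    lines = [ln.strip() for ln in main_body.splitlines() if ln.strip()]
    found = {"salutation": None, "signature": None}
    if lines:
        match = SALUTATION.match(lines[0])
        if match:
            found["salutation"] = match.group(1).lower()
    # Search from the bottom so the last sign-off line wins.
    for line in reversed(lines):
        match = SIGN_OFF.match(line)
        if match:
            found["signature"] = match.group(1).lower()
            break
    return found
```

The salutation candidate would be associated with the recipient's address and the signature candidate with the sender's address.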

14
Identifying Entities
[Figure: the entity for Robert Bruce after clustering]
Names: Robert Bruce; Robert
Nickname: Bob
Email Addresses: robert.bruce@enron.com; rbruce@hotmail.com
Evidence (observation counts): Main Headers (915), Quoted Headers (8),
Salutations (7), Free Signatures (9), Main Headers (7),
Static Signature (140), Quoted Headers (5)
Signature Block:
Robert E. Bruce, Senior Counsel, Enron North America Corp.
T (713) 345-7780, F (713) 646-3393, robert.bruce@enron.com
Collection-wide totals: 3,151 addr-nickname, 82,084 addr-name, and
19,708 addr-addr associations; 66,715 entities
15
Outline
  • Problem
  • Identity Resolution Architecture
  • Evaluation
  • Conclusion
  • Future Work

16
Stratified Sampling
17
Judgment Process
Incorrect
  • kmpresto@msn.com ↔ "home email"
  • terrie.james@enron.com ↔ "alexis james-petty"

Correct but not informative
  • june-deadrick@reliantenergy.com ↔ "june deadrick"
  • robbie.lewis@enron.com ↔ "robbie lewis"
Correct and somewhat informative
  • terriecovarrubias@hotmail.com ↔ "terrie covarrubias"
  • randal.maffett@enron.com ↔ "randy"
Correct and very informative
  • lemelpe@nu.com ↔ "phyllis"
  • piazzet@wharton.upenn.edu ↔ "tom"
18
Evaluation Measures
Judged Associations
Correct
Very Informative
Informative
19
Accuracy
  • 100% accuracy with multiple sources of evidence.
  • Address-name association was nearly perfect
  • 80% minimum accuracy in address-nickname associations
  • 96.7% entity accuracy

Address-Name Associations
Address-Nickname Associations
Address-Address Associations
20
Informativeness
Address-Name Associations
Address-Nickname Associations
Address-Address Associations
21
Outline
  • Problem
  • Identity Resolution Architecture
  • Evaluation
  • Conclusion

22
Conclusion
  • Introduced a computational model of identity
  • a set of simple techniques put together
  • provide a useful baseline
  • assessed its potential utility in the context of
    one fairly complex email collection
  • Automatic detection of nicknames in salutations
    and signature lines.
  • Most informative results come from the weakest
    (least accurate) evidence
  • Accuracy and informativeness are both important

23
Limitations
  • Email address associated with single identity
  • Strength of evidence not exploited
  • Heuristics hand-tuned for Enron collection
  • Focus on personal attributes
  • No reconciliation of multiple identities for
    single person
  • No attempt to classify identities as machines or
    groups
  • Recall?

24
  • Thank You!
  • Questions?

25
  • Backup

26
Future Work
  • extend the model to exploit temporal features and
    behavioral evidence
  • implement machine learning techniques
  • perform ablation studies
  • characterize the coverage of our methods in more
    detail
  • replicate this work in other contexts
  • integrate these techniques with the ultimate
    applications for which computational models of
    identity are needed (e.g., social network
    analysis).

27
Helping in Judgments
28
Identity Framework
Person
Group
Machine
Identity
Identity
Identity
Entity
Entity
Entity
Entity
Entity
Entity
Candidates
29
Modeling Identity
  • Attributes (stable explicit features)
  • email addresses, names, nicknames, contact info
  • Associations
  • Link attributes together
  • Based on observations
  • Entities
  • Representation of an identity
  • Set of attributes in undirected graph
  • Linked by weighted associations
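The model described in this list (attributes as nodes, associations as weighted undirected links built from observations) might be represented as below. This is a sketch under stated assumptions: the class name `IdentityGraph`, the `(kind, value)` tuples for attributes, and using the sum of observation counts as the weight are all illustrative choices. The counts in the usage note come from the Robert Bruce example, but pairing them with the name-address link is an assumption.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class IdentityGraph:
    """Undirected graph: nodes are attributes, edges are associations."""

    # edges[frozenset({a, b})] counts observations per evidence source.
    edges: dict = field(default_factory=lambda: defaultdict(Counter))

    def observe(self, attr_a, attr_b, source, count=1):
        """Record that two attributes were seen together in `source`."""
        self.edges[frozenset((attr_a, attr_b))][source] += count

    def weight(self, attr_a, attr_b):
        """Total observations supporting the association (0 if none)."""
        key = frozenset((attr_a, attr_b))
        return sum(self.edges.get(key, Counter()).values())
```

For example, `g.observe(("address", "robert.bruce@enron.com"), ("name", "robert bruce"), "main_header", 915)` followed by a `"quoted_header"` observation with count 8 gives a weight of 923 for that link, regardless of argument order.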

30
Identifying Entities
  • First round
  • limited transitive closure
  • Merging associations
  • based on unique attributes
  • Address-address associations
  • No use of strength of evidence yet
  • 66,715 entities
  • Covering 77,420 unique email addresses (58% of all
    addresses)
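Merging associations into entities can be sketched with a union-find structure that places every pair of associated attributes in the same group. Note the difference from the slide: this sketch computes a full transitive closure, whereas the slide describes a limited one restricted to unique attributes and address-address links; how that limit was applied is not stated, so it is omitted here. The function name `cluster` is hypothetical.

```python
def cluster(associations):
    """Group attributes into entities by merging every associated pair."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in associations:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    entities = {}
    for attr in list(parent):
        entities.setdefault(find(attr), set()).add(attr)
    return list(entities.values())


# Coverage figure from the slide: addresses covered by entities over
# all unique addresses in the collection.
coverage = 77_420 / 133_581  # about 0.58
```

Unrestricted closure can chain unrelated people together through a shared generic attribute such as "home email", which is one reason to limit merging to attributes that are unique.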

31
Related Work
  • Attribute/association extraction
  • Name recognition and reference resolution
  • Applications
  • Social network analysis
  • Finding experts

32
Unjudged Associations
Address-Name Associations
Address-Nickname Associations
Address-Address Associations
Only 19 ? 3