The Enron and W3C Collections
1
The Enron and W3C Collections
  • Tamer Elsayed and Douglas W. Oard

University of Maryland
ICAIL 2007, DESI Workshop, June 4th, 2007
2
Variants of Email Search
                 Searcher
Collection       Participant          Non-participant
Personal         My own emails        Shneiderman's, Postel's
Organization     Help desks           White House, Enron
Public           Online communities   Usenet news, W3C
3
The (Extended) Enron Collection
  • Rich multimodal data
  • Emails
  • Phone calls
  • Databases

4
The (Extended) Enron Collection
  • Public version of Enron collection (CMU)
      • 150 sets of rescued Outlook email folders
      • 517,431 emails, 52 duplicates, 133,581 unique addresses
      • Subset annotated with genre, speech act, mentioned calls, ...
  • Extended Enron email collection (Aspen Systems)
      • Attachments, additional email (later release, redaction)
  • Phone calls from/to Enron traders (Snohomish PUD)
      • Transcribed subset from 52 DVDs of recorded audio
      • Recovered from scanned transcripts using OCR
      • 93 annotated with date, time, participants, mentioned names, mentioned emails, mentioned meetings, ...
  • Relational databases (Aspen Systems)

5
Cross-References
6
Phone Call Transcripts
Message-ID: <24-20010126-19435570-20020114-R>
Message-Type: PhoneCall
Date: Fri, 26 Jan 2001 19:43:55 -0600 (CST)
From: shari.stack@enron.com
To: greg.wolfe@enron.com
Parties: shari.stack@enron.com, greg.wolfe@enron.com
Subject: Snohomish deal, Houston Chronicle Article, Bonuses e-mail, Houston Chronicle Article, Deal, email to Jane King
Subject-TimePos: 145, 313, 713, 775, 920, 1018
InCallNames: Christian, Ken Lay, Greg, Chris Foster, Stewie, Stewie, Mike, Mike, Laverado, Mike, Kim, Shari, Greg, Forney, Stewie, Jane King, Shari
InCallNames-TimePos: 42, 81, 90, 95, 96, 143, 146, 190, 262, 266, 522, 580, 780, 1007, 1018, 1038, 1067
Keywords: CDWR, email, email
Keywords-TimePos: 55, 689, 1038
X-From: Stack, Shari <>
X-To: Wolfe, Greg <>
X-Parties: Stack, Shari <>, Wolfe, Greg <>
X-AudioFile: 24-20010126-19435570-20020114-R.wav
X-TranscriptFile: 24-20010126-19435570-20020114-R.txt

SHARI STACK: Hey.
GREG WOLFE: All right, let me get my fax machine workin'. Uh - [laughs]
SHARI: [laughs] She's like, it was so easy, I could make you a lot of money [laughs]. She's like, he said it so desperate. She goes I hate to laugh at people, but - [laughs]
GREG: Did you, um, did you, ah, ah tell her about the, ah, that voice mail?
SHARI: Yeah, I said - I said Greg [inaudible] he's got the - they got a mob connection [laughs] - his friend threw away the business card after the meeting.
[both laughing]
SHARI: But, my God - my God, and so anyway, have you talked to Christian about this 'cause Christian apparently talked to him twice today.
GREG: Oh, he sent a - Christian sent an e-mail shortly after, you know, that, and said we're not doin' business with this guy.
SHARI: [laughs]
GREG: Ah, so I still don't understand why this guy's trying to get in the middle of us and CDWR and I guess -
SHARI: [laughs]
7
Typical Enron Email
Message Header
    Message-ID: <1494.1584620.JavaMail.evans@thyme>
    Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
    From: elizabeth.sager@enron.com
    To: sstack@reliant.com
    Subject: RE: Shhhh.... it's a SURPRISE !
    X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
    X-To: 'SStack@reliant.com@ENRON'
Message Body
    Salutation
        Hi Shari
    Main Body
        Hope all is well. Count me in for the group present. See ya next week if not earlier
    Signature Block
        Liza
        Elizabeth Sager 713-853-6349
    Quoted Text
        Quoted Header
            -----Original Message-----
            From: SStack@reliant.com@ENRON
            Sent: Monday, July 30, 2001 2:24 PM
            To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
            Cc: ntillett@reliant.com
            Subject: Shhhh.... it's a SURPRISE !
        Quoted Main Body
            Please call me (713) 207-5233
        Quoted Signature
            Thanks! Shari
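As a rough illustration of the message structure on this slide, the sketch below separates the header from the new text and the quoted text using Python's standard `email` package. It is a minimal sketch, not the authors' segmentation method: the function name `split_message`, the `QUOTE_MARKER` constant, and the shortened sample message are assumptions made for the example, and it only recognizes the Outlook-style "Original Message" separator.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Shortened, cleaned-up version of the message on the slide (illustrative only).
RAW = """\
Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.... it's a SURPRISE !

Hi Shari

Hope all is well. Count me in for the group present.

-----Original Message-----
From: SStack@reliant.com@ENRON
Subject: Shhhh.... it's a SURPRISE !

Please call me (713) 207-5233
"""

# Assumption: quoted text starts at the Outlook-style separator line.
QUOTE_MARKER = "-----Original Message-----"


def split_message(raw):
    """Return header fields plus the new and quoted parts of the body."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # plain string for a non-multipart message
    new_text, _, quoted_text = body.partition(QUOTE_MARKER)
    return {
        "from": [addr for _, addr in getaddresses(msg.get_all("From", []))],
        "to": [addr for _, addr in getaddresses(msg.get_all("To", []))],
        "date": parsedate_to_datetime(msg["Date"]),
        "subject": msg["Subject"],
        "new_text": new_text.strip(),
        "quoted_text": quoted_text.strip(),
    }


if __name__ == "__main__":
    parts = split_message(RAW)
    print(parts["from"], parts["to"], parts["date"].isoformat())
    print(parts["new_text"])
```

Finer distinctions shown on the slide, such as salutation versus main body versus signature block, would need additional rules or a trained classifier and are outside this sketch.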
8
Research Problems (Enron)
  • Threading
  • Email Classification
  • Social Network Analysis
  • Mention Resolution

9
Who is that Sheila?
  • Date: Wed Dec 20 08:57:00 EST 2000
  • From: Kay Mann <kay.mann@enron.com>
  • To: Suzanne Adams <suzanne.adams@enron.com>
  • Subject: Re: GE Conference Call has be rescheduled
  • Did Sheila want Scott to participate? Looks like the call will be too late for him.

Sheila → ?
10
Rich Evidence about Identity
[Figure: identity models (66,715 models) with example name and address variants: m scott; susan m scott; m..scott@enron.com; susan scott; suebob; sue; sscott; susan; sscott5@enron.com; sscott5; friday; com members; scott susan; scott.susan@enron.com]
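The variety of ways one person can be referenced, as listed on this slide, can be illustrated with a small sketch that derives candidate variants from a display name and an email address. This is only an illustration under stated assumptions, not the identity-modeling method behind the slide: the function `name_variants` and its rules (first name, last name, dropped middle name, reversed order, address local part) are hypothetical.

```python
import re


def name_variants(display_name, address):
    """Collect simple lowercase variants of a name and an email address."""
    variants = set()
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", display_name)]
    if tokens:
        first, last = tokens[0], tokens[-1]
        variants.add(" ".join(tokens))      # full name as written
        variants.add(first)                 # first name only
        variants.add(last)                  # last name only
        variants.add(f"{first} {last}")     # middle name or initial dropped
        variants.add(f"{last} {first}")     # reversed order
    address = address.lower()
    variants.add(address)                   # full address
    variants.add(address.split("@", 1)[0])  # local part, e.g. a login name
    return variants


if __name__ == "__main__":
    for v in sorted(name_variants("Susan M Scott", "sscott5@enron.com")):
        print(v)
```

Nicknames such as "suebob" or "sue" cannot be derived this way; they have to be observed in the messages themselves, for example in salutations and signatures.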
11
Test Collection of Mention Resolution
[Figure: test collections Enron-all, Enron-subset, Sager, Shapiro]

                                                  Candidates
Collection      Emails    Identities  Queries   Min.   Avg.   Max.
Sager             1,628         627        51      1      4     11
Shapiro             974         855        49      1      8     21
Enron-subset     54,018      27,340        78      1    152    489
Enron-all       248,451     123,783        78      3    518   1785
12
Evaluation
  • Task
      • Named mention → ranked list of people
  • Measures
      • Mean Reciprocal Rank
      • Success @ K
      • Success @ 1
      • Confidence-based scoring
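The two rank-based measures named above have standard definitions, sketched below for a set of queries where each query has exactly one correct identity. The function names and the toy data are assumptions made for the example; confidence-based scoring is not covered because the slide does not define it.

```python
def reciprocal_rank(ranked, correct):
    """1/rank of the correct candidate, or 0.0 if it is not in the list."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_list, correct_answer) pairs."""
    return sum(reciprocal_rank(r, c) for r, c in runs) / len(runs)


def success_at_k(runs, k):
    """Fraction of queries whose correct answer is in the top k."""
    hits = sum(1 for ranked, correct in runs if correct in ranked[:k])
    return hits / len(runs)


if __name__ == "__main__":
    # Toy data: correct answer at rank 1, at rank 2, and missing.
    runs = [
        (["a", "b", "c"], "a"),
        (["a", "b", "c"], "b"),
        (["a", "b", "c"], "z"),
    ]
    print(mean_reciprocal_rank(runs))  # (1 + 1/2 + 0) / 3 = 0.5
    print(success_at_k(runs, 1))       # 1 of 3 queries
    print(success_at_k(runs, 2))       # 2 of 3 queries
```

Success @ 1 is the special case k = 1, which is the same as plain accuracy of the top-ranked candidate.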

13
Limitations (Mention Resolution)
  • Small number of queries
  • Only resolved by Enron employees
  • Much easier
  • Most participants are outsiders
  • Measures focus only on accuracy

14
Identity-Content Interplay
[Figure: Social Context, Search for People, Search for Content, Topical Context]
15
W3C Collection
  • Set of mailing lists
  • Public, not private
  • Topically-oriented
  • 175,000 emails
  • Introduced at TREC 2005
  • 50 topics (x 2 years)
  • relevance judgments available for ad-hoc retrieval

16
Research Problems (W3C)
  • Expert Finding
      • Topic → ranked list of experts
  • Known-item Retrieval
      • Query → ranked list of emails
  • Discussion Search (i.e., ad-hoc retrieval)
  • Pro/con Retrieval
      • Query → ranked list of emails

17
Topic Type Analysis
Find categories amenable to pro/con classification (TREC 2005 Enterprise Track)
18
Limitations (Pro/Con Retrieval)
  • Not private/personal communication
  • Mailing lists → receivers are hidden
  • Topical categories are unbalanced
  • Developed by researchers NOT users

19
Related Projects
  • Others working with CMU's Enron emails
      • Berkeley, CMU, U Mass, SIAM Workshop
  • University of Southern California ISI/ICT
      • eArchivarius, Postel collection (Anton Leuski)
  • Georgia Tech Research Institute PERPOS
      • Presidential records (Bill Underwood)

20
Conclusion
  • Two email test collections
  • Public
  • Hundreds of thousands of emails
  • Annotated emails and transcripts
  • Tasks and ground truth
  • Need for real user needs
  • Development of evaluation measures for utility

21
For More Information
  • Joint Institute for Knowledge Discovery
  • http://www.umiacs.umd.edu/jikd

22
Running System