Title: EPCA Integration
1Interactive Information Extraction and Social Network Analysis
Andrew McCallum
Information Extraction and Synthesis Laboratory, UMass Amherst
2The Application
Workplace effectiveness: the ability to leverage your network of acquaintances.
The power of your little black book.
But filling a Contacts DB by hand is tedious, and the result is incomplete.
Figure: Email Inbox and WWW feed the Contacts DB automatically.
3DEX Overview
Figure: DEX pipeline diagram (labels: Email, WWW, names, CRF).
4DEX Example
To: Andrew McCallum mccallum_at_cs.umass.edu
Subject: ...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governors Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis,
Key Words: Information extraction, social network,
Search for new people
5Summary of Results
Example keywords extracted
Person              Keywords
William Cohen       Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller       Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness  Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell        Machine learning, Cognitive states, Learning apprentice, Artificial intelligence
Contact info and name extraction performance (25 fields)
       Token Acc   Field Prec   Field Recall   Field F1
CRF    94.50       85.73        76.33          80.76
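The Field F1 column can be checked against the precision and recall columns. The sketch below assumes F1 is the usual harmonic mean of precision and recall; the function name `f1_score` is illustrative and not taken from the slides.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Field-level precision and recall reported for the CRF, in percent.
crf_field_f1 = f1_score(85.73, 76.33)
print(round(crf_field_f1, 2))
```

Rounded to two decimals, this agrees with the 80.76 reported in the table.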
- Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid stove-piping in large orgs by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
- Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.
6Outline
- Information Extraction
- Learning in the wild
- Transfer learning
- Identity Uncertainty
- Modeling Groups, Roles and Topics
80. Segmenting and labeling sequence data: Linear-chain CRFs
Lafferty, McCallum, Pereira 2001
Figure: linear-chain CRF over an email sentence.
y (named entity labels): PER O O TIME O O ORG O LOC ...
x (CALO email words):    Dave , The Friday meeting with Tembec in NY ...
Leveraging data from KnowItAll, Etzioni et al, 2004. UPenn help.
Enron email labeled by Michael Collins, et al. 1200 entities.
Field          F1
DATE           0.8483
TIME           0.7939
LOCATION       0.6476
PERSON         0.8439
ORGANIZATION   0.5987
ACRONYM        0.2804
PHONE          0.7943
MONEY          0.7143
PERCENT        0.9091
OVERALL        0.7282
From: monika.causholli_at_enron.com
Dave, The Friday meeting with Tembec in NY has been postponed until next week. Attached is the information you requested. Let me know if you need anything else. Also did Doug give you the data about consumer products? Cheers, Monica
Li, McCallum, unpublished, 2004
9Motivation
- Capture confidence of records in extracted database
- Alerts data mining to possible errors in database
First Name   Last Name   Confidence
Bill         Gates       0.96
Bill         banks       0.43
10Confidence Estimation in Linear-chain CRFs
Culotta, McCallum 2004
Finite State Lattice
Figure: lattice of FSM states (ORG, OTHER, PERSON, TITLE, ...) connecting the output sequence y(t-1), y(t), y(t+1), y(t+2), y(t+3) to the observations x(t-1), x(t), x(t+1), x(t+2), x(t+3).
Input sequence: said Arden Bement NSF Director
11Confidence Estimation in Linear-chain CRFs
Culotta, McCallum 2004
Constrained Forward-Backward
Figure: lattice of FSM states (ORG, OTHER, PERSON, TITLE, ...) connecting the output sequence y(t-1), y(t), y(t+1), y(t+2), y(t+3) to the observations x(t-1), x(t), x(t+1), x(t+2), x(t+3).
Input sequence: said Arden Bement NSF Director
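One way to read constrained forward-backward is as a ratio of two lattice sums: the total weight of all paths that pass through the labels of an extracted field, divided by the total weight of all paths. The sketch below is a minimal illustration under that assumption. The log-potentials in `EMIT` and `TRANS` are made-up toy numbers, and the function names are hypothetical rather than taken from Culotta and McCallum's implementation.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emit, trans, constraints=None):
    """Forward algorithm over the lattice.

    emit[t][s] and trans[r][s] are log-potentials. constraints maps a
    position t to the single state index allowed at that position.
    """
    constraints = constraints or {}
    n_states = len(trans)
    neg_inf = float("-inf")

    def allowed(t, s):
        return t not in constraints or constraints[t] == s

    alpha = [emit[0][s] if allowed(0, s) else neg_inf for s in range(n_states)]
    for t in range(1, len(emit)):
        alpha = [
            emit[t][s] + log_sum_exp([alpha[r] + trans[r][s] for r in range(n_states)])
            if allowed(t, s) else neg_inf
            for s in range(n_states)
        ]
    return log_sum_exp(alpha)


def field_confidence(emit, trans, constraints):
    """Weight of constrained paths divided by weight of all paths."""
    return math.exp(log_partition(emit, trans, constraints) - log_partition(emit, trans))


STATES = ["OTHER", "PERSON", "ORG", "TITLE"]
TOKENS = ["said", "Arden", "Bement", "NSF", "Director"]
# Toy log-potentials, invented for illustration only.
EMIT = [
    [2.0, 0.0, 0.0, 0.0],  # said
    [0.0, 1.5, 0.5, 0.0],  # Arden
    [0.0, 1.0, 0.8, 0.0],  # Bement
    [0.0, 0.0, 2.0, 0.0],  # NSF
    [0.0, 0.0, 0.0, 2.0],  # Director
]
TRANS = [[0.0] * 4 for _ in range(4)]
TRANS[1][1] = 1.0  # PERSON tends to continue as PERSON

# Confidence that "Arden Bement" (positions 1 and 2) is a PERSON field.
person_conf = field_confidence(EMIT, TRANS, {1: 1, 2: 1})
print(f"confidence(PERSON = 'Arden Bement') = {person_conf:.3f}")
```

Because the constrained sums over all states at one position partition the lattice, the confidences for a single position add up to one, which is a handy sanity check.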
12Forward-Backward Confidence Estimation improves accuracy/coverage
Figure: accuracy vs. coverage curves for four settings: optimal, our forward-backward confidence, traditional token-wise confidence, no use of confidence.
13Application of Confidence Estimation
- Interactive Information Extraction
- To correct predictions, direct user to least
confident field
14Interactive Information Extraction
- IE algorithm calculates confidence scores
- UI uses confidence scores to alert user to possible errors
- IE algorithm takes corrections into account and propagates correction to other fields
15User Correction
- User corrects a field, e.g. dragging Stanley to the First Name field
Figure: lattice over tokens x1..x5 (Charles, Stanley, 100, Charles, Street) with labels y1..y5 drawn from First Name, Last Name, Address Line.
16Remove Paths
- User corrects a field, e.g. dragging Stanley to the First Name field
Figure: lattice over tokens x1..x5 (Charles, Stanley, 100, Charles, Street) with labels y1..y5 drawn from First Name, Last Name, Address Line; paths inconsistent with the correction are removed.
17Constrained Viterbi
- Viterbi algorithm is constrained to pass through the designated state.
Figure: lattice over tokens x1..x5 (Charles, Stanley, 100, Charles, Street) with labels y1..y5 drawn from First Name, Last Name, Address Line.
Adjacent field changed: Correction Propagation
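Correction propagation can be illustrated with a small constrained Viterbi decoder: fixing one label forces the best path through that state, and neighbouring labels may change as a result. The sketch below uses invented log-potentials (`EMIT`, `TRANS`) for the "Charles Stanley 100 Charles Street" example; the function name and all numbers are assumptions for illustration, not the authors' model.

```python
def constrained_viterbi(emit, trans, constraints=None):
    """Best state path, optionally forced through given states.

    emit[t][s] and trans[r][s] are log-potentials. constraints maps a
    position t to the state index that the path must use at t.
    """
    constraints = constraints or {}
    n_states = len(trans)
    neg_inf = float("-inf")

    def score(t, s):
        # A state that violates a constraint gets score -inf.
        return emit[t][s] if constraints.get(t, s) == s else neg_inf

    best = [score(0, s) for s in range(n_states)]
    back = []
    for t in range(1, len(emit)):
        pointers, new_best = [], []
        for s in range(n_states):
            r_best = max(range(n_states), key=lambda r: best[r] + trans[r][s])
            pointers.append(r_best)
            new_best.append(best[r_best] + trans[r_best][s] + score(t, s))
        back.append(pointers)
        best = new_best

    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


FIRST, LAST, ADDR = 0, 1, 2
TOKENS = ["Charles", "Stanley", "100", "Charles", "Street"]
# Toy log-potentials, invented for illustration only.
EMIT = [
    [1.0, 0.8, -2.0],   # Charles
    [0.8, 1.0, -2.0],   # Stanley
    [-2.0, -2.0, 2.0],  # 100
    [-2.0, -2.0, 2.0],  # Charles
    [-2.0, -2.0, 2.0],  # Street
]
TRANS = [
    [-1.0, 1.0, 0.0],   # from First Name
    [1.0, -1.0, 0.5],   # from Last Name
    [-1.0, -1.0, 1.0],  # from Address Line
]

before = constrained_viterbi(EMIT, TRANS)
after = constrained_viterbi(EMIT, TRANS, {1: FIRST})  # user drags Stanley to First Name
print(before, after)
```

With these toy numbers the unconstrained path labels Charles as First Name and Stanley as Last Name; once Stanley is fixed to First Name, the decoder relabels Charles as Last Name without further user input.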
18Constrained Viterbi
- After fixing the least confident field, constrained Viterbi automatically reduces error by another 23%.
- Recent work reduces annotation effort further
- simplifies annotation to multiple-choice
     First Name   Last Name   City
A)   Bill         Gates       Redmond WA
B)   Bill         Gates       Redmond
19User feedback in the wild as labeling
Labeling for Classification
Seminar: How to Organize your Life, by Jane Smith, Stevenson Smith, Mezzanine Level, Papadopoulos Sq, 3:30 pm Thursday March 31. In this seminar we will learn how to use CALO to...
Choices: Seminar announcement / Todo request / Other
Easy. Often found in user interfaces, e.g. CALO IRIS, Apple Mail
20Multiple-choice Annotation for Learning Extractors in the wild
Culotta, McCallum 2005
Task: Information Extraction. Fields: NAME, COMPANY, ADDRESS (and others)
Jane Smith, Stevenson Smith, Mezzanine Level, Papadopoulos Sq.
21Multiple-choice Annotation for Learning Extractors in the wild
Culotta, McCallum 2005
Task: Information Extraction. Fields: NAME, COMPANY, ADDRESS (and others)
Jane Smith, Stevenson Smith, Mezzanine Level, Papadopoulos Sq.
Interface presents top hypothesized segmentations:
Jane Smith, Stevenson Smith Mezzanine Level, Papadopoulos Sq.
Jane Smith, Stevenson Smith Mezzanine Level, Papadopoulos Sq.
Jane Smith, Stevenson Smith Mezzanine Level, Papadopoulos Sq.
User corrects labels, not segmentations
22Multiple-choice Annotation for Learning Extractors in the wild
Culotta, McCallum 2005
Task: Information Extraction. Fields: NAME, COMPANY, ADDRESS (and others)
Jane Smith, Stevenson Smith, Mezzanine Level, Papadopoulos Sq.
Interface presents top hypothesized segmentations:
Jane Smith, Stevenson Smith Mezzanine Level, Papadopoulos Sq.
Jane Smith, Stevenson Smith Mezzanine Level, Papadopoulos Sq.
Jane Smith, Stevenson Smith Mezzanine Level, Papadopoulos Sq.
29 percent reduction in user actions needed to train
23Outline
- Information Extraction
- Learning in the wild
- Transfer learning
- Identity Uncertainty
- Modeling Groups, Roles and Topics
24Piecewise Training in Factorial CRFs for Transfer Learning
Sutton, McCallum, 2005
Emailed seminar announcement entities
Email English words
60k words training.
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell, School of Computer Science, Carnegie Mellon University
3:30 pm, 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
Too little labeled training data.
25Piecewise Training in Factorial CRFs for Transfer Learning
Sutton, McCallum, 2005
Train on related task with more data.
Newswire named entities
Newswire English words
200k words training.
CRICKET - MILLNS SIGNS FOR BOLAND CAPE TOWN
1996-08-22 South African provincial side Boland
said on Thursday they had signed Leicestershire
fast bowler David Millns on a one year contract.
Millns, who toured Australia with England A in
1992, replaces former England all-rounder Phillip
DeFreitas as Boland's overseas professional.
26Piecewise Training in Factorial CRFs for Transfer Learning
Sutton, McCallum, 2005
At test time, label email with newswire NEs...
Newswire named entities
Email English words
27Piecewise Training in Factorial CRFs for Transfer Learning
Sutton, McCallum, 2005
...then use these labels as features for final task
Emailed seminar announcement entities
Newswire named entities
Email English words
28Piecewise Training in Factorial CRFs for Transfer Learning
Sutton, McCallum, 2005
Use joint inference at test time.
Seminar Announcement entities
Newswire named entities
English words
An alternative to hierarchical Bayes. Needn't know anything about parameterization of subtask.
Accuracy: No transfer < Cascaded Transfer < Joint Inference Transfer
29CRF Transfer Learning Results
Sutton, McCallum, 2005
Seminar Announcements Dataset (Freitag 1998)
CRF                 location   speaker   stime   etime   overall
No transfer         73.7       81.0      99.1    97.3    87.8
Cascaded transfer   74.2       84.3      99.2    96.0    88.4
Joint transfer      76.3       85.3      99.1    96.0    89.2
New best published accuracy on common dataset
30A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance
- Andrew McCallum
- Kedar Bellare
- Fernando Pereira
Thanks to Charles Sutton, Xuerui Wang and Mikhail
Bilenko for helpful discussions.
31String Edit Distance
- Distance between sequences x and y
- cost of lowest-cost sequence of edit operations
that transform string x into y.
32String Edit Distance
- Distance between sequences x and y
- cost of lowest-cost sequence of edit operations that transform string x into y.
- Applications
- Database Record Deduplication
Apex International Hotel Grassmarket Street
Apex Internatl Grasmarket Street
Records are duplicates of the same hotel?
33String Edit Distance
- Distance between sequences x and y
- cost of lowest-cost sequence of edit operations that transform string x into y.
- Applications
- Database Record Deduplication
- Biological Sequences
AGCTCTTACGATAGAGGACTCCAGA
AGGTCTTACCAAAGAGGACTTCAGA
34String Edit Distance
- Distance between sequences x and y
- cost of lowest-cost sequence of edit operations that transform string x into y.
- Applications
- Database Record Deduplication
- Biological Sequences
- Machine Translation
Il a acheté une pomme
He bought an apple
35String Edit Distance
- Distance between sequences x and y
- cost of lowest-cost sequence of edit operations that transform string x into y.
- Applications
- Database Record Deduplication
- Biological Sequences
- Machine Translation
- Textual Entailment
He bought a new car last night
He purchased a brand new automobile yesterday evening
36Levenshtein Distance
1966
Edit operations
copy: Copy a character from x to y (cost 0)
insert: Insert a character into y (cost 1)
delete: Delete a character from y (cost 1)
subst: Substitute one character for another (cost 1)
37Levenshtein Distance
Edit operations
copy: Copy a character from x to y (cost 0)
insert: Insert a character into y (cost 1)
delete: Delete a character from y (cost 1)
subst: Substitute one character for another (cost 1)
Dynamic program
       W  i  l  l  l  e  a  m
    0  1  2  3  4  5  6  7  8
W   1  0  1  2  3  4  5  6  7
i   2  1  0  1  2  3  4  5  6
l   3  2  1  0  1  2  3  4  5
l   4  3  2  1  0  1  2  3  4
i   5  4  3  2  1  1  2  3  4
a   6  5  4  3  2  2  2  2  3
m   7  6  5  4  3  3  3  3  2
D(i,j) = score of best alignment from x1 ... xi to y1 ... yj.
D(i,j) = \min \{
  D(i-1,j-1) + \delta(x_i \neq y_j)   (subst; cost 0 when the characters match, i.e. copy)
  D(i-1,j) + 1                        (delete)
  D(i,j-1) + 1                        (insert)
\}
Total cost = distance (the bottom-right cell of the table).
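The dynamic program above can be sketched in a few lines. This is a minimal illustration of the standard Levenshtein recurrence with unit costs; the function name is an assumption, not something taken from the slides.

```python
def levenshtein(x: str, y: str) -> int:
    """Edit distance between x and y with unit insert/delete/subst costs."""
    rows, cols = len(x) + 1, len(y) + 1
    # D[i][j] = cost of the best alignment of x[:i] with y[:j].
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i  # delete all of x[:i]
    for j in range(cols):
        D[0][j] = j  # insert all of y[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1)  # copy / subst
            D[i][j] = min(diag, D[i - 1][j] + 1, D[i][j - 1] + 1)        # delete, insert
    return D[-1][-1]


print(levenshtein("William", "Willleam"))
```

For the William / Willleam pair in the table, the bottom-right cell and the function both give a distance of 2.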
38Levenshtein Distance with Markov Dependencies
Repeated delete is cheaper.
Edit operations, with cost after a previous c, i, d, s operation:
copy: Copy a character from x to y. Costs: 0 0 0 0
insert: Insert a character into y. Costs: 1 1 1
delete: Delete a character from y. Costs: 1 1 1
subst: Substitute one character for another. Costs: 1 1 1 1
3D DP table: one 2D table per previous operation (subst, copy, insert, delete).
       W  i  l  l  l  e  a  m
    0  1  2  3  4  5  6  7  8
W   1  0  1  2  3  4  5  6  7
i   2  1  0  1  2  3  4  5  6
l   3  2  1  0  1  2  3  4  5
l   4  3  2  1  0  1  2  3  4
i   5  4  3  2  1  1  2  3  4
a   6  5  4  3  2  2  2  2  3
m   7  6  5  4  3  3  3  3  2
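A sketch of the 3D dynamic program follows: each cell keeps one cost per previous operation, so the cost of an operation can depend on what came before it. The slide only states that a repeated delete is cheaper; the value `REPEAT_COST = 0.5` and its use for repeated inserts as well are assumptions made purely for illustration.

```python
BASE_COST = {"copy": 0.0, "insert": 1.0, "delete": 1.0, "subst": 1.0}
REPEAT_COST = 0.5  # assumed discount; the actual value is not given on the slide


def op_cost(prev: str, op: str) -> float:
    """Cost of op given the previous operation (first-order Markov dependency)."""
    if op in ("insert", "delete") and prev == op:
        return REPEAT_COST
    return BASE_COST[op]


def markov_edit_distance(x: str, y: str) -> float:
    n, m = len(x), len(y)
    # table[i][j] maps "last operation used" -> best cost of aligning x[:i], y[:j].
    table = [[{} for _ in range(m + 1)] for _ in range(n + 1)]
    table[0][0] = {"start": 0.0}

    def extend(cell, op):
        return min(cost + op_cost(prev, op) for prev, cost in cell.items())

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cell = {}
            if i > 0:  # consume a character of x only
                cell["delete"] = extend(table[i - 1][j], "delete")
            if j > 0:  # consume a character of y only
                cell["insert"] = extend(table[i][j - 1], "insert")
            if i > 0 and j > 0:  # consume one character of each
                op = "copy" if x[i - 1] == y[j - 1] else "subst"
                cell[op] = extend(table[i - 1][j - 1], op)
            table[i][j] = cell
    return min(table[n][m].values())


print(markov_edit_distance("abcd", "a"))
```

Under these assumed costs, deleting the run "bcd" costs 1 + 0.5 + 0.5 = 2 rather than 3, which is the kind of behaviour the slide describes.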
39Ristad & Yianilos (1997)
Essentially a Pair-HMM, generating an edit/state/alignment-sequence and two strings.
Learn via EM:
Expectation step: Calculate likelihood of alignment paths.
Maximization step: Make those paths more likely.
40Ristad & Yianilos Regrets
- Limited features of input strings
- Examine only single character pair at a time
- Difficult to use upcoming string context, lexicons, ...
- Example: "Senator John Green" vs. "John Green"
- Limited edit operations
- Difficult to generate arbitrary jumps in both strings
- Example: "UMass" vs. "University of Massachusetts"
- Trained only on positive match data
- Doesn't include information-rich near misses
- Example: ACM SIGIR ≠ ACM SIGCHI
So, consider model trained by conditional probability
41Conditional Probability (Sequence) Models
- We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(y|x) instead of P(y,x)
- Can examine features, but not responsible for generating them.
- Don't have to explicitly model their dependencies.
42From HMMs to Conditional Random Fields
Linear-chain
Lafferty, McCallum, Pereira 2001
Figure: joint (HMM) graphical model over states ... y(t-1), y(t), y(t+1) ... and observations ... x(t-1), x(t), x(t+1) ...
43(Linear Chain) Conditional Random Fields
Lafferty, McCallum, Pereira 2001
Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence.
Figure: finite state model and graphical model. FSM states form the output sequence y(t-1), y(t), y(t+1), y(t+2), y(t+3) over the observations x(t-1), x(t), x(t+1), x(t+2), x(t+3).
Output sequence: OTHER PERSON OTHER ORG TITLE
Input sequence:  said Jones a Microsoft VP
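For reference, the conditional probability that a linear-chain CRF maximizes is conventionally written as below. The symbols follow the usual notation of Lafferty, McCallum, Pereira 2001 (weights \lambda_k, feature functions f_k, normalizer Z) and are assumed here rather than copied from the slide.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \Phi(y_{t-1}, y_t, \mathbf{x}, t),
\qquad \text{where} \qquad
\Phi(y_{t-1}, y_t, \mathbf{x}, t)
  = \exp\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{t=1}^{T} \Phi(y'_{t-1}, y'_t, \mathbf{x}, t)
```

The normalizer Z(x) sums over all label sequences for the given input, which is why the model never has to generate the observations themselves.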
44CRF String Edit Distance
x1 (string 1): W i l l i a m _ W . _ C o h o n
x2 (string 2): W i l l l e a m _ C o h e n
Alignment a, one column per edit step:
a.i1 (position in x1): 1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16
a.e  (edit operation): copy copy copy copy insert subst copy copy delete delete delete copy copy copy copy subst copy
a.i2 (position in x2): 1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14
joint complete data likelihood
conditional complete data likelihood
45CRF String Edit Distance FSM
Figure: FSM with one state per edit operation: subst, copy, insert, delete.
46CRF String Edit Distance FSM
Figure: FSM in which Start leads to two sets of edit-operation states (subst, copy, insert, delete): one set for match (m = 1) and one set for non-match (m = 0).
47CRF String Edit Distance FSM
x1 = Tommi Jaakkola, x2 = Tommi Jakola
Figure: FSM in which Start leads to match (m = 1) and non-match (m = 0) state sets, each containing subst, copy, insert, delete.
Probability summed over all alignments in match states: 0.8
Probability summed over all alignments in non-match states: 0.2
48CRF String Edit Distance FSM
x1 = Tom Dietterich, x2 = Tom Dean
Figure: FSM in which Start leads to match (m = 1) and non-match (m = 0) state sets, each containing subst, copy, insert, delete.
Probability summed over all alignments in match states: 0.1
Probability summed over all alignments in non-match states: 0.9
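The idea of summing over all alignments in the match states versus the non-match states can be sketched with a forward-style dynamic program. The version below is heavily simplified: each class has a single weight per edit operation instead of a full state machine with transitions and input features, and the weights in `MATCH_W` and `NONMATCH_W` are invented rather than learned. It will therefore not reproduce the 0.8 and 0.1 values on the slides.

```python
import math

# Hypothetical log-weights per edit operation for each class (not learned).
MATCH_W = {"copy": 1.0, "subst": -3.0, "insert": -3.0, "delete": -3.0}
NONMATCH_W = {"copy": -1.0, "subst": -1.0, "insert": -1.0, "delete": -1.0}


def log_add(a: float, b: float) -> float:
    """log(exp(a) + exp(b)), tolerating -inf."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_alignment_sum(x1: str, x2: str, w: dict) -> float:
    """Log of the summed weight of every alignment of x1 with x2."""
    n, m = len(x1), len(x2)
    table = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                table[i][j] = log_add(table[i][j], table[i - 1][j] + w["delete"])
            if j > 0:
                table[i][j] = log_add(table[i][j], table[i][j - 1] + w["insert"])
            if i > 0 and j > 0:
                op = "copy" if x1[i - 1] == x2[j - 1] else "subst"
                table[i][j] = log_add(table[i][j], table[i - 1][j - 1] + w[op])
    return table[n][m]


def match_probability(x1: str, x2: str) -> float:
    """p(m = 1 | x1, x2): match mass divided by match plus non-match mass."""
    log_match = log_alignment_sum(x1, x2, MATCH_W)
    log_non = log_alignment_sum(x1, x2, NONMATCH_W)
    return math.exp(log_match - log_add(log_match, log_non))


print(match_probability("Tommi Jaakkola", "Tommi Jakola"))
print(match_probability("Tom Dietterich", "Tom Dean"))
```

Even with these made-up weights, the near-duplicate pair ends up with most of its mass in the match class and the dissimilar pair in the non-match class.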
49Parameter Estimation
Given a training set of string pairs and match/non-match labels, the objective fn is the incomplete log likelihood.
- Expectation Maximization
- E-step: Estimate distribution over alignments, using current parameters
- M-step: Change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS)
This is conditional EM, but it avoids the complexities of Jebara 1998, because there is no need to solve the M-step in closed form.
50Efficient Training
- Dynamic programming table is 3D: |x1| = |x2| = 100, |S| = 12, ... 120,000 entries
- Use beam search during E-step (Pal, Sutton, McCallum 2005)
- Unlike completely observed CRFs, objective function is not convex.
- Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.
51What Alignments are Learned?
x1 = Tommi Jaakkola, x2 = Tommi Jakola
T o m m i _ J a a k k o l a
T o m m i _ J a k o l a
Figure: learned alignment drawn over the match (m = 1) / non-match (m = 0) FSM, each side with states subst, copy, insert, delete.
52What Alignments are Learned?
x1 = Bruce Croft, x2 = Tom Dean
B r u c e _ C r o f t
T o m _ D e a n
Figure: learned alignment drawn over the match (m = 1) / non-match (m = 0) FSM, each side with states subst, copy, insert, delete.
53What Alignments are Learned?
x1 = Jaime Carbonell, x2 = Jamie Callan
J a i m e _ C a r b o n e l l
J a m i e _ C a l l a n
Figure: learned alignment drawn over the match (m = 1) / non-match (m = 0) FSM, each side with states subst, copy, insert, delete.
54Example Learned Alignment
55Summary of Advantages
- Arbitrary features of the input strings
- Examine past, future context
- Use lexicons, WordNet
- Extremely flexible edit operations
- Single operation may make arbitrary jumps in both strings, of size determined by input features
- Discriminative Training
- Maximize ability to predict match vs non-match
56Experimental Results: Data Sets
- Restaurant name, Restaurant address
- 864 records, 112 matches
- E.g. "Abe's Bar & Grill, E. Main St" vs. "Abe's Grill, East Main Street"
- People names, UIS DB generator
- synthetic noise
- E.g. "John Smith" vs. "Snith, John"
- CiteSeer Citations
- In four sections: Reason, Face, Reinforce, Constraint
- E.g. "Rusell & Norvig, Artificial Intelligence: A Modern..." vs. "Russell & Norvig, Artificial Intelligence: An Intro..."
57Experimental Results: Features
- same, different
- same-alphabetic, different-alphabetic
- same-numeric, different-numeric
- punctuation1, punctuation2
- alphabet-mismatch, numeric-mismatch
- end-of-1, end-of-2
- same-next-character, different-next-character
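A feature function over an aligned character pair might look like the sketch below. The feature names come from the list above, but the slides do not define them, so every condition here (for example, treating `punctuation1` as "the character from string 1 is punctuation") is an assumed interpretation for illustration only.

```python
import string


def char_pair_features(c1, c2, next1=None, next2=None):
    """Binary features for aligning character c1 (string 1) with c2 (string 2).

    next1 / next2 are the following characters, or None at the end of a string.
    """
    feats = set()
    feats.add("same" if c1 == c2 else "different")
    if c1.isalpha() and c2.isalpha():
        feats.add("same-alphabetic" if c1 == c2 else "different-alphabetic")
    if c1.isdigit() and c2.isdigit():
        feats.add("same-numeric" if c1 == c2 else "different-numeric")
    if c1 in string.punctuation:
        feats.add("punctuation1")
    if c2 in string.punctuation:
        feats.add("punctuation2")
    if c1.isalpha() != c2.isalpha():
        feats.add("alphabet-mismatch")
    if c1.isdigit() != c2.isdigit():
        feats.add("numeric-mismatch")
    if next1 is None:
        feats.add("end-of-1")
    if next2 is None:
        feats.add("end-of-2")
    if next1 is not None and next2 is not None:
        feats.add("same-next-character" if next1 == next2 else "different-next-character")
    return feats


print(sorted(char_pair_features("a", "a", "b", "b")))
print(sorted(char_pair_features("7", ".", None, "x")))
```

In a CRF these binary features would each receive a learned weight per edit operation, so the model can, for instance, penalize a substitution between a digit and a letter more than one between two letters.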
58Experimental Results: Edit Operations
- insert, delete, substitute/copy
- swap-two-characters
- skip-word-if-in-lexicon
- skip-parenthesized-words
- skip-any-word
- substitute-word-pairs-in-translation-lexicon
- skip-word-if-present-in-other-string
59Experimental Results
Bilenko & Mooney 2003
F1 (harmonic mean of precision and recall)
                   Restaurant   Restaurant   CiteSeer
Distance metric    name         address      Reason   Face    Reinf   Constraint
Levenshtein        0.290        0.686        0.927    0.952   0.893   0.924
Learned Leven.     0.354        0.712        0.938    0.966   0.907   0.941
Vector             0.365        0.380        0.897    0.922   0.903   0.923
Learned Vector     0.433        0.532        0.924    0.875   0.808   0.913
60Experimental Results
Bilenko & Mooney 2003
F1 (harmonic mean of precision and recall)
                    Restaurant   Restaurant   CiteSeer
Distance metric     name         address      Reason   Face    Reinf   Constraint
Levenshtein         0.290        0.686        0.927    0.952   0.893   0.924
Learned Leven.      0.354        0.712        0.938    0.966   0.907   0.941
Vector              0.365        0.380        0.897    0.922   0.903   0.923
Learned Vector      0.433        0.532        0.924    0.875   0.808   0.913
CRF Edit Distance   0.448        0.783        0.964    0.918   0.917   0.976
61Experimental Results
Data set: person names, with word-order noise added
                                          F1
Without skip-if-present-in-other-string   0.856
With skip-if-present-in-other-string      0.981
62Outline
- Information Extraction
- Learning in the wild
- Transfer learning
- Identity Uncertainty
- Modeling Groups, Roles and Topics
63Joint Co-reference Decisions, Discriminative Model
Culotta, McCallum 2005
Figure: People mentions "Stuart Russell", "Stuart Russell" and "S. Russel", with a Y/N co-reference decision on each pair of mentions.
64Co-reference for Multiple Entity Types
Culotta, McCallum 2005
Figure: People mentions ("Stuart Russell", "Stuart Russell", "S. Russel") and Organizations mentions ("University of California at Berkeley", "Berkeley", "Berkeley"), with a Y/N co-reference decision on each pair of mentions.
65Joint Co-reference of Multiple Entity Types
Culotta, McCallum 2005
Figure: People mentions ("Stuart Russell", "Stuart Russell", "S. Russel") and Organizations mentions ("University of California at Berkeley", "Berkeley", "Berkeley"), with a Y/N co-reference decision on each pair of mentions.
Reduces error by 22%
66Joint Co-reference Experimental Results
Culotta, McCallum 2005
CiteSeer Dataset: 1500 citations, 900 unique papers, 350 unique venues
                  Paper            Venue
                  indep   joint    indep   joint
constraint        88.9    91.0     79.4    94.1
reinforce         92.2    92.2     56.5    60.1
face              88.2    93.7     80.9    82.8
reason            97.4    97.0     75.6    79.5
Micro Average     91.7    93.4     73.1    79.1
Error reduction:  20% (Paper), 22% (Venue)
67Outline
- Information Extraction
- Learning in the wild
- Transfer learning
- Identity Uncertainty
- Modeling Groups, Roles and Topics
68Social network from my email
69Clustering words into topics with Latent Dirichlet Allocation
Blei, Ng, Jordan 2003
Generative process (example in parentheses):
For each document:
  Sample a distribution over topics, θ (70% Iraq war, 30% US election)
  For each word in doc:
    Sample a topic, z (Iraq war)
    Sample a word from the topic, w (bombing)
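The generative process above can be sketched directly with the standard library. Everything concrete here is an assumption for illustration: the two topic names and the word "bombing" come from the slide's example, while the remaining vocabulary, the word probabilities, and the symmetric Dirichlet parameter are invented.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate (topic, word) pairs following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document distribution over topics
    names = list(topics)
    words = []
    for _ in range(length):
        z = rng.choices(names, weights=theta)[0]  # sample a topic
        vocab = topics[z]
        w = rng.choices(list(vocab), weights=list(vocab.values()))[0]  # sample a word
        words.append((z, w))
    return theta, words


# Hypothetical topic-word distributions.
TOPICS = {
    "Iraq war": {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},
    "US election": {"vote": 0.5, "campaign": 0.3, "ballot": 0.2},
}

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[1.0, 1.0], length=8, rng=rng)
print([round(p, 2) for p in theta])
print(doc)
```

Inference runs this story in reverse: given only the words, it estimates the per-document topic mixtures and the per-topic word distributions.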
70Example topics induced from a large collection of text
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
Tenenbaum et al.
72From LDA to Author-Recipient-Topic (ART)
73Inference and Estimation
- Gibbs Sampling
- Easy to implement
- Reasonably fast
74Outline
- Email, motivation
- ART Graphical Model.
- Experimental Results
- Enron Email (corpus)
- Academic Email (one person)
- RART Roles for ART
- Group-Topic Model
- Experiments on voting data
- Voting data from U.S. Senate and the U.N.
75Enron Email Corpus
- 250k email messages
- 23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere_at_enron.com
To: steve.hooser_at_enron.com
Subject: Enron/TransAlta Contract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin_at_enron.com
76Topics, and prominent senders / receivers discovered by ART
Topic names, by hand
77Topics, and prominent senders / receivers discovered by ART
Beck: Chief Operations Officer
Dasovich: Government Relations Executive
Shapiro: Vice President of Regulatory Affairs
Steffes: Vice President of Government Affairs
78Comparing Role Discovery
Traditional SNA
Author-Topic
ART
connection strength (A,B)
distribution over recipients
distribution over authored topics
distribution over authored topics
79Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
Traditional SNA
Author-Topic
ART
Different roles
Different roles
Similar roles
Geaconne: Secretary; McCarty: Vice President
80Comparing Role Discovery: Tracy Geaconne vs. Rod Hayslett
Traditional SNA
Author-Topic
ART
Very similar
Not very similar
Different roles
Geaconne: Secretary; Hayslett: Vice President & CTO
81Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
Traditional SNA
Author-Topic
ART
Very different
Very similar
Different roles
Blair: Gas pipeline logistics; Watson: Pipeline facilities planning
82McCallum Email Corpus 2004
- January - October 2004
- 23k email messages
- 825 people
From: kate_at_cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum_at_cs.umass.edu
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt. Thanks, Kate
83McCallum Email Blockstructure
84Four most prominent topics in discussions with ____?
86Two most prominent topics in discussions with ____?
87Topic 37
88Topic 40
90Outline
- Email, motivation
- ART Graphical Model.
- Experimental Results
- Enron Email (corpus)
- Academic Email (one person)
- RART Roles for ART
- Group-Topic Model
- Experiments on voting data
- Voting data from U.S. Senate and the U.N.
91Role-Author-Recipient-Topic Models
92Results with RART: People in Role 3 in Academic Email
- olc: lead Linux sysadmin
- gauthier: sysadmin for CIIR group
- irsystem: mailing list for CIIR sysadmins
- system: mailing list for dept. sysadmins
- allan: Prof., chair of computing committee
- valerie: second Linux sysadmin
- tech: mailing list for dept. hardware
- steve: head of dept. I.T. support
93Roles for allan (James Allan)
- Role 3: I.T. support
- Role 2: Natural Language researcher
Roles for pereira (Fernando Pereira)
- Role 2: Natural Language researcher
- Role 4: SRI CALO project participant
- Role 6: Grant proposal writer
- Role 10: Grant proposal coordinator
- Role 8: Guests at McCallum's house
94Outline
- Email, motivation
- ART Graphical Model.
- Experimental Results
- Enron Email (corpus)
- Academic Email (one person)
- RART Roles for ART
- Group-Topic Model
- Experiments on voting data
- Voting data from U.S. Senate and the U.N.
95ART and RART: Roles but not Groups
Traditional SNA
Author-Topic
ART
Not
Not
Block structured
Enron TransWestern Division
96A Group Model: Stochastic Blockstructures Model
97Group-Topic Model
Wang, Mohanty, McCallum 2005
98U.S. Senate Data sets
- 3426 bills from 16 years of voting records from the U.S. Senate
- Yea / Nay / Abstain (absent)
- Each bill comes with an abstract (text describing the contents of the bill).
99Topics Discovered
Traditional Mixtures of Unigrams
Group-Topic Model
100Groups Discovered
Agreement Index
101Senators who change Coalition Dependent on Topic
e.g. Senator Shelby (D-AL) votes with the Republicans on Economic; with the Democrats on Education and Domestic; with a small group of maverick Republicans on Social Security and Medicaid.
102U.N. Data Set
- 931 U.N. Resolutions, voted on by 192 countries, from 1990-2003.
- Yes / No / Abstain votes
- List of keywords summarizes the content of the resolution.
- Also experiments later with resolutions from 1960-2003
103Topics Discovered
Traditional mixture of unigrams
Group-Topic Model
104Groups Discovered
105Groups and Topics, Trends over Time