EPCA Integration - PowerPoint PPT Presentation

About This Presentation
Title:

EPCA Integration

Description:

people.cs.umass.edu – PowerPoint PPT presentation

Slides: 90
Provided by: Team153
Transcript and Presenter's Notes

1
Interactive Information Extraction and Social Network Analysis
Andrew McCallum
Information Extraction and Synthesis Laboratory
UMass Amherst
2
The Application
Workplace effectiveness: ability to leverage
network of acquaintances.
The power of your little black book.
But filling the Contacts DB by hand is tedious, and incomplete.
[Diagram: Email Inbox and WWW are used to fill the Contacts DB automatically]
3
DEX Overview
[Diagram labels: Email, names, WWW, CRF]
4
DEX Example
To: Andrew McCallum mccallum_at_cs.umass.edu
Subject: ...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governors Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, ...
Key Words: Information extraction, social network, ...
Search for new people
5
Summary of Results
Example keywords extracted
Person              Keywords
William Cohen       Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller       Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness  Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell        Machine learning, Cognitive states, Learning apprentice, Artificial intelligence

Contact info and name extraction performance (25 fields)
      Token Acc   Field Prec   Field Recall   Field F1
CRF   94.50       85.73        76.33          80.76
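As a quick check on how the Field F1 column relates to the precision and recall columns, the sketch below recomputes it as the harmonic mean of the two. The function name `f1_score` is only an illustrative helper, not something from the slides.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Field-level precision and recall of the CRF row above, in percent.
crf_f1 = f1_score(85.73, 76.33)
print(round(crf_f1, 2))  # 80.76
```

The recomputed value agrees with the Field F1 reported in the table.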
  1. Expert Finding: When solving some task, find
    friends-of-friends with relevant expertise.
    Avoid stove-piping in large orgs by
    automatically suggesting collaborators. Given a
    task, automatically suggest the right team for
    the job. (Hiring aid!)
  2. Social Network Analysis: Understand the social
    structure of your organization. Suggest
    structural changes for improved efficiency.

6
Outline
  • Information Extraction
  • Learning in the wild
  • Transfer learning
  • Identity Uncertainty
  • Modeling Groups, Roles and Topics

8
0. Segmenting and labeling sequence data: Linear-chain CRFs
Lafferty, McCallum, Pereira 2001
Named entity labels (y): PER  O  O   TIME   O       O    ORG    O  LOC ...
CALO email words (x):    Dave ,  The Friday meeting with Tembec in NY  ...

Leveraging data from KnowItAll (Etzioni et al, 2004), UPenn help.
Enron email labeled by Michael Collins, et al.: 1200 entities
Field          F1
DATE           0.8483
TIME           0.7939
LOCATION       0.6476
PERSON         0.8439
ORGANIZATION   0.5987
ACRONYM        0.2804
PHONE          0.7943
MONEY          0.7143
PERCENT        0.9091
OVERALL        0.7282
From: monika.causholli_at_enron.com
Dave, The Friday meeting with Tembec in NY has been
postponed until next week. Attached is the
information you requested. Let me know if you
need anything else. Also did Doug give you the
data about consumer products? Cheers, Monica
Li, McCallum, unpublished, 2004
9
Motivation
  • Capture confidence of records in extracted
    database
  • Alerts data mining to possible errors in
    database

First Name Last Name Confidence
Bill Gates 0.96
Bill banks 0.43
10
Confidence Estimation in Linear-chain CRFs
Culotta, McCallum 2004
Finite State Lattice
[Figure: lattice of FSM states (ORG, OTHER, PERSON, TITLE, ...) connecting the output sequence y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3} to the observations x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}]
Input sequence: said Arden Bement NSF Director
11
Confidence Estimation in Linear-chain CRFs
Culotta, McCallum 2004
Constrained Forward-Backward
[Figure: lattice of FSM states (ORG, OTHER, PERSON, TITLE, ...) connecting the output sequence y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3} to the observations x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}]
Input sequence: said Arden Bement NSF Director
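A minimal sketch of the idea behind constrained forward-backward, under simplifying assumptions: the confidence of a field is the total score of all label paths that pass through the field's labels, divided by the total score of all paths. Only a forward pass is needed for these two sums. The potentials, the three-state label set and the names `path_mass` and `field_confidence` are hypothetical, and this version constrains only the labels inside the field, not the field boundaries.

```python
from typing import Dict, List, Optional


def path_mass(unary: List[List[float]], trans: List[List[float]],
              constraints: Optional[Dict[int, int]] = None) -> float:
    """Sum of unnormalized scores over all label paths that obey `constraints`.

    unary[t][s]: non-negative potential of state s at position t
    trans[r][s]: non-negative potential of moving from state r to state s
    constraints: {position: required state}; all other paths are excluded
    """
    constraints = constraints or {}
    n_states = len(trans)

    def allowed(t: int, s: int) -> bool:
        return constraints.get(t, s) == s

    # Forward recursion; disallowed states simply carry zero mass.
    alpha = [unary[0][s] if allowed(0, s) else 0.0 for s in range(n_states)]
    for t in range(1, len(unary)):
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * unary[t][s]
            if allowed(t, s) else 0.0
            for s in range(n_states)
        ]
    return sum(alpha)


def field_confidence(unary, trans, constraints) -> float:
    """Constrained path mass divided by the unconstrained normalizer."""
    return path_mass(unary, trans, constraints) / path_mass(unary, trans)


# Hypothetical potentials for "said Arden Bement NSF Director".
OTHER, PERSON, ORG = 0, 1, 2
unary = [
    [4.0, 0.5, 0.5],  # said
    [0.5, 3.0, 1.0],  # Arden
    [0.5, 2.0, 1.5],  # Bement
    [0.5, 0.5, 3.0],  # NSF
    [2.0, 1.0, 1.0],  # Director
]
trans = [
    [1.0, 1.0, 1.0],  # from OTHER
    [1.0, 2.0, 0.5],  # from PERSON
    [1.0, 0.5, 2.0],  # from ORG
]
person_conf = field_confidence(unary, trans, {1: PERSON, 2: PERSON})
print(f"confidence that 'Arden Bement' is PERSON: {person_conf:.3f}")
```

With no constraints the ratio is 1, and the confidences of the mutually exclusive labels at any single position sum to 1, which is what makes the value usable as a probability.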
12
Forward-Backward Confidence Estimation improves
accuracy/coverage
[Plot legend: our forward-backward confidence; optimal; traditional token-wise confidence; no use of confidence]
13
Application of Confidence Estimation
  • Interactive Information Extraction
  • To correct predictions, direct user to least
    confident field

14
Interactive Information Extraction
  • IE algorithm calculates confidence scores
  • UI uses confidence scores to alert user to
    possible errors
  • IE algorithm takes corrections into account and
    propagates correction to other fields

15
User Correction
  • User corrects a field, e.g. dragging Stanley to
    the First Name field

[Figure: lattice over input tokens x1..x5 (Charles, Stanley, 100, Charles, Street) and labels y1..y5, with candidate fields First Name, Last Name, Address Line]
16
Remove Paths
  • User corrects a field, e.g. dragging Stanley to
    the First Name field

[Figure: the same lattice over x1..x5 (Charles, Stanley, 100, Charles, Street) and y1..y5, with fields First Name, Last Name, Address Line]
17
Constrained Viterbi
  • The Viterbi algorithm is constrained to pass through
    the designated state.

[Figure: the same lattice over x1..x5 (Charles, Stanley, 100, Charles, Street) and y1..y5, with fields First Name, Last Name, Address Line]
Adjacent field changed: correction propagation
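The constrained decoding step can be sketched as an ordinary Viterbi pass in which every state that contradicts a user correction is given a score of minus infinity. The log-scores below are made-up toy values chosen so that the correction propagates to the neighbouring token; `constrained_viterbi` and the three-field label set are illustrative names, not the authors' code.

```python
from typing import Dict, List, Optional


def constrained_viterbi(unary: List[List[float]], trans: List[List[float]],
                        constraints: Optional[Dict[int, int]] = None) -> List[int]:
    """Best label path under log-scores, forced through `constraints`.

    unary[t][s]: log-score of state s at position t
    trans[r][s]: log-score of moving from state r to state s
    constraints: {position: state fixed by the user}
    """
    constraints = constraints or {}
    n_states = len(trans)
    neg_inf = float("-inf")

    def score(t: int, s: int) -> float:
        # States that contradict a correction can never be on the best path.
        return unary[t][s] if constraints.get(t, s) == s else neg_inf

    best = [score(0, s) for s in range(n_states)]
    back = []
    for t in range(1, len(unary)):
        new_best, pointers = [], []
        for s in range(n_states):
            r = max(range(n_states), key=lambda q: best[q] + trans[q][s])
            new_best.append(best[r] + trans[r][s] + score(t, s))
            pointers.append(r)
        best = new_best
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


FIRST, LAST, ADDRESS = 0, 1, 2
# Hypothetical log-scores for "Charles Stanley 100 Charles Street".
unary = [
    [1.0, 0.5, -2.0],   # Charles
    [0.5, 1.0, -2.0],   # Stanley
    [-3.0, -3.0, 2.0],  # 100
    [0.0, 0.0, 1.0],    # Charles
    [-3.0, -3.0, 2.0],  # Street
]
trans = [
    [-3.0, 1.0, 0.0],   # from FIRST
    [1.0, -3.0, 1.0],   # from LAST
    [-2.0, -2.0, 1.0],  # from ADDRESS
]
print(constrained_viterbi(unary, trans))              # [0, 1, 2, 2, 2]
print(constrained_viterbi(unary, trans, {1: FIRST}))  # [1, 0, 2, 2, 2]
```

In the second call only "Stanley" is fixed to First Name, yet the label of the first "Charles" also changes to Last Name, which is the correction propagation described on the slide.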
18
Constrained Viterbi
  • After fixing the least confident field, constrained
    Viterbi automatically reduces error by another
    23%.
  • Recent work reduces annotation effort further
  • simplifies annotation to multiple-choice

     First Name   Last Name   City
A)   Bill         Gates       Redmond WA
B)   Bill         Gates       Redmond
19
User feedback in the wild as labeling
Labeling for Classification
Seminar: How to Organize your Life, by Jane
Smith, Stevenson Smith, Mezzanine Level,
Papadopoulos Sq., 3:30 pm Thursday March 31. In
this seminar we will learn how to use CALO to...
Class labels: Seminar announcement, Todo request, Other
Easy: often found in user interfaces, e.g. CALO
IRIS, Apple Mail
20
Multiple-choice Annotation for Learning
Extractors in the wild
Culotta, McCallum 2005
Task: Information extraction. Fields: NAME
COMPANY ADDRESS (and others)
Jane Smith , Stevenson Smith , Mezzanine Level,
Papadopoulos Sq.
21
Multiple-choice Annotation for Learning
Extractors in the wild
Culotta, McCallum 2005
Task: Information extraction. Fields: NAME
COMPANY ADDRESS (and others)
Jane Smith , Stevenson Smith , Mezzanine Level,
Papadopoulos Sq.
Interface presents top hypothesized segmentations
[Figure: three alternative segmentations of the line above; the field highlighting is not preserved in the transcript]
user corrects labels, not segmentations
22
Multiple-choice Annotation for Learning
Extractors in the wild
Culotta, McCallum 2005
Task: Information extraction. Fields: NAME
COMPANY ADDRESS (and others)
Jane Smith , Stevenson Smith , Mezzanine Level,
Papadopoulos Sq.
Interface presents top hypothesized segmentations
[Figure: three alternative segmentations of the line above; the field highlighting is not preserved in the transcript]
29 percent reduction in user actions needed to
train
23
Outline
  • Information Extraction
  • Learning in the wild
  • Transfer learning
  • Identity Uncertainty
  • Modeling Groups, Roles and Topics

24
Piecewise Training in Factorial CRFs for Transfer
Learning
Sutton, McCallum, 2005
Emailed seminar ann'mt entities
Email English words
60k words training.
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell, School of Computer
Science, Carnegie Mellon University
3:30 pm, 7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines: inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
Too little labeled training data.
25
Piecewise Training in Factorial CRFs for Transfer
Learning
Sutton, McCallum, 2005
Train on related task with more data.
Newswire named entities
Newswire English words
200k words training.
CRICKET - MILLNS SIGNS FOR BOLAND CAPE TOWN
1996-08-22 South African provincial side Boland
said on Thursday they had signed Leicestershire
fast bowler David Millns on a one year contract.
Millns, who toured Australia with England A in
1992, replaces former England all-rounder Phillip
DeFreitas as Boland's overseas professional.
26
Piecewise Training in Factorial CRFs for Transfer
Learning
Sutton, McCallum, 2005
At test time, label email with newswire NEs...
Newswire named entities
Email English words
27
Piecewise Training in Factorial CRFs for Transfer
Learning
Sutton, McCallum, 2005
...then use these labels as features for the final task
Emailed seminar ann'mt entities
Newswire named entities
Email English words
28
Piecewise Training in Factorial CRFs for Transfer
Learning
Sutton, McCallum, 2005
Use joint inference at test time.
Seminar Announcement entities
Newswire named entities
English words
An alternative to hierarchical Bayes. Needn't
know anything about parameterization of subtask.
Accuracy: No transfer < Cascaded transfer <
Joint inference transfer
29
CRF Transfer Learning Results
Sutton, McCallum, 2005
Seminar Announcements Dataset (Freitag 1998)
CRF                 location   speaker   stime   etime   overall
No transfer         73.7       81.0      99.1    97.3    87.8
Cascaded transfer   74.2       84.3      99.2    96.0    88.4
Joint transfer      76.3       85.3      99.1    96.0    89.2
New best published accuracy on common dataset
30
A Conditional Random Field for Discriminatively-trained
Finite-state String Edit Distance
  • Andrew McCallum
  • Kedar Bellare
  • Fernando Pereira

Thanks to Charles Sutton, Xuerui Wang and Mikhail
Bilenko for helpful discussions.
31
String Edit Distance
  • Distance between sequences x and y
  • cost of lowest-cost sequence of edit operations
    that transform string x into y.

32
String Edit Distance
  • Distance between sequences x and y
  • cost of lowest-cost sequence of edit operations
    that transform string x into y.
  • Applications
  • Database Record Deduplication

Apex International Hotel Grassmarket Street
Apex Internatl Grasmarket Street
Records are duplicates of the same hotel?
33
String Edit Distance
  • Distance between sequences x and y
  • cost of lowest-cost sequence of edit operations
    that transform string x into y.
  • Applications
  • Database Record Deduplication
  • Biological Sequences

AGCTCTTACGATAGAGGACTCCAGA
AGGTCTTACCAAAGAGGACTTCAGA
34
String Edit Distance
  • Distance between sequences x and y
  • cost of lowest-cost sequence of edit operations
    that transform string x into y.
  • Applications
  • Database Record Deduplication
  • Biological Sequences
  • Machine Translation

Il a achete une pomme
He bought an apple
35
String Edit Distance
  • Distance between sequences x and y
  • cost of lowest-cost sequence of edit operations
    that transform string x into y.
  • Applications
  • Database Record Deduplication
  • Biological Sequences
  • Machine Translation
  • Textual Entailment

He bought a new car last night
He purchased a brand new automobile yesterday
evening
36
Levenshtein Distance
1966
Edit operations:
copy     Copy a character from x to y            (cost 0)
insert   Insert a character into y               (cost 1)
delete   Delete a character from y               (cost 1)
subst    Substitute one character for another    (cost 1)
37
Levenshtein Distance
Edit operations:
copy     Copy a character from x to y            (cost 0)
insert   Insert a character into y               (cost 1)
delete   Delete a character from y               (cost 1)
subst    Substitute one character for another    (cost 1)
Dynamic program (rows: x = William, columns: y = Willleam):
         W  i  l  l  l  e  a  m
      0  1  2  3  4  5  6  7  8
   W  1  0  1  2  3  4  5  6  7
   i  2  1  0  1  2  3  4  5  6
   l  3  2  1  0  1  2  3  4  5
   l  4  3  2  1  0  1  2  3  4
   i  5  4  3  2  1  1  2  3  4
   a  6  5  4  3  2  2  2  2  3
   m  7  6  5  4  3  3  3  3  2
D(i,j) = score of best alignment
from x_1 ... x_i to y_1 ... y_j.
$$D(i,j) = \min \begin{cases} D(i-1,j-1) + \delta(x_i \neq y_j) & \text{(copy / subst)} \\ D(i-1,j) + 1 & \text{(delete)} \\ D(i,j-1) + 1 & \text{(insert)} \end{cases}$$
Total cost = distance: the bottom-right cell, here 2.
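The dynamic program above can be written in a few lines. This is a minimal sketch that keeps only two rows of the table; the function name `levenshtein` is an illustrative choice.

```python
def levenshtein(x: str, y: str) -> int:
    """Unit-cost edit distance between x and y via the D(i, j) recurrence."""
    prev = list(range(len(y) + 1))  # row D(0, .): j insertions
    for i, xc in enumerate(x, start=1):
        curr = [i]                  # D(i, 0): i deletions
        for j, yc in enumerate(y, start=1):
            curr.append(min(
                prev[j - 1] + (xc != yc),  # copy (cost 0) or subst (cost 1)
                prev[j] + 1,               # delete
                curr[j - 1] + 1,           # insert
            ))
        prev = curr
    return prev[-1]


print(levenshtein("William", "Willleam"))  # 2
```

The result for the slide's example pair matches the bottom-right cell of the table: one substitution (i to l) and one insertion (e).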
38
Levenshtein Distance with Markov Dependencies
repeated delete is cheaper
Edit operations, with cost depending on the preceding operation (c, i, d, s):
                                                 Cost after
                                                 c i d s
copy     Copy a character from x to y            0 0 0 0
insert   Insert a character into y               1 1 1
delete   Delete a character from y               1 1 1
subst    Substitute one character for another    1 1 1 1
Dynamic program (rows: x = William, columns: y = Willleam):
         W  i  l  l  l  e  a  m
      0  1  2  3  4  5  6  7  8
   W  1  0  1  2  3  4  5  6  7
   i  2  1  0  1  2  3  4  5  6
   l  3  2  1  0  1  2  3  4  5
   l  4  3  2  1  0  1  2  3  4
   i  5  4  3  2  1  1  2  3  4
   a  6  5  4  3  2  2  2  2  3
   m  7  6  5  4  3  3  3  3  2
[Figure: one table layer per edit-operation state: subst, copy, insert, delete]
3D DP table
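A sketch of what the 3D table computes, assuming a cost function that may look at the previous edit operation. The third table dimension is the operation that ends the alignment. The discount of 0.5 for a repeated delete in `toy_cost` is a made-up value for illustration (the slide's actual costs are not fully preserved), and "delete" here means skipping a character of x.

```python
import math
from typing import Callable, Optional

OPS = ("copy", "subst", "insert", "delete")


def markov_edit_distance(x: str, y: str,
                         cost: Callable[[Optional[str], str], float]) -> float:
    """Cheapest edit sequence when cost(prev_op, op) depends on the previous op."""
    # D[i][j][op]: cheapest alignment of x[:i] with y[:j] whose last op is `op`.
    D = [[{op: math.inf for op in OPS} for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]

    def best_into(i: int, j: int, op: str) -> float:
        if i == 0 and j == 0:
            return cost(None, op)  # first operation has no predecessor
        return min(D[i][j][p] + cost(p, op) for p in OPS)

    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i > 0 and j > 0:
                diag = "copy" if x[i - 1] == y[j - 1] else "subst"
                D[i][j][diag] = best_into(i - 1, j - 1, diag)
            if i > 0:
                D[i][j]["delete"] = best_into(i - 1, j, "delete")
            if j > 0:
                D[i][j]["insert"] = best_into(i, j - 1, "insert")

    if not x and not y:
        return 0.0
    return min(D[len(x)][len(y)].values())


def toy_cost(prev_op: Optional[str], op: str) -> float:
    """Hypothetical costs: copy is free, a repeated delete is discounted."""
    if op == "copy":
        return 0.0
    if op == "delete" and prev_op == "delete":
        return 0.5
    return 1.0


print(markov_edit_distance("abcd", "a", toy_cost))  # 2.0
```

With costs that ignore the previous operation, the same function reduces to plain Levenshtein distance; with `toy_cost`, deleting "bcd" costs 1 + 0.5 + 0.5 instead of 3.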
39
Ristad, Yianilos (1997)
Essentially a Pair-HMM, generating an
edit/state/alignment sequence and two strings.
Learn via EM:
Expectation step: calculate likelihood of alignment paths.
Maximization step: make those paths more likely.
40
Ristad, Yianilos Regrets
  • Limited features of input strings
  • Examine only single character pair at a time
  • Difficult to use upcoming string context,
    lexicons, ...
  • Example: Senator John Green vs. John Green
  • Limited edit operations
  • Difficult to generate arbitrary jumps in both
    strings
  • Example: UMass vs. University of Massachusetts.
  • Trained only on positive match data
  • Doesn't include information-rich near misses
  • Example: ACM SIGIR ≠ ACM SIGCHI

So, consider a model trained by conditional
probability
41
Conditional Probability (Sequence) Models
  • We prefer a model that is trained to maximize a
    conditional probability rather than a joint
    probability: P(y|x) instead of P(y,x)
  • Can examine features, but not responsible for
    generating them.
  • Don't have to explicitly model their dependencies.

42
From HMMs to Conditional Random Fields
Linear-chain
Lafferty, McCallum, Pereira 2001
[Figure: graphical model over labels y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}, annotated "Joint"]
43
(Linear Chain) Conditional Random Fields
Lafferty, McCallum, Pereira 2001
Undirected graphical model, trained to
maximize the conditional probability of the output
sequence given the input sequence.
[Figure: finite state model and graphical model views of a linear-chain CRF]
Output seq y_{t-1} ... y_{t+3} (FSM states): OTHER PERSON OTHER ORG TITLE ...
Input seq x_{t-1} ... x_{t+3} (observations): said Jones a Microsoft VP ...
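The defining equations of this slide are not preserved in the transcript. For reference, the usual textbook form of a linear-chain CRF (Lafferty, McCallum, Pereira 2001) is sketched below; the symbols $\lambda_k$ (weights), $f_k$ (feature functions) and $Z(\mathbf{x})$ (normalizer) are standard notation assumed here, not copied from the slide.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T}
    \exp\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \prod_{t=1}^{T}
    \exp\Big( \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Because $Z(\mathbf{x})$ sums over label sequences only, the features may inspect the whole input $\mathbf{x}$ without the model having to generate it.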
44
CRF String Edit Distance
x1 = William_W._Cohon (string 1, positions 1-16)
x2 = Willleam_Cohen (string 2, positions 1-14)
[Figure: alignment between the two strings, shown as three rows: a.i1 (position in string 1), a.e (edit operation: copy, subst, insert or delete) and a.i2 (position in string 2)]
joint complete data likelihood
conditional complete data likelihood
45
CRF String Edit Distance FSM
[Figure: FSM with edit-operation states subst, copy, insert, delete]
46
CRF String Edit Distance FSM
[Figure: FSM with a Start state and two copies of the edit states subst, copy, insert, delete: match (m = 1) and non-match (m = 0)]
47
CRF String Edit Distance FSM
x1 = Tommi Jaakkola, x2 = Tommi Jakola
[Figure: FSM with a Start state and two copies of the edit states subst, copy, insert, delete: match (m = 1) and non-match (m = 0)]
Probability summed over all alignments in match states: 0.8
Probability summed over all alignments in non-match states: 0.2
48
CRF String Edit Distance FSM
x1 = Tom Dietterich, x2 = Tom Dean
[Figure: FSM with a Start state and two copies of the edit states subst, copy, insert, delete: match (m = 1) and non-match (m = 0)]
Probability summed over all alignments in match states: 0.1
Probability summed over all alignments in non-match states: 0.9
49
Parameter Estimation
Given a training set of string pairs and
match/non-match labels, the objective function is the
incomplete log likelihood.
  • Expectation Maximization
  • E-step: Estimate the distribution over alignments
    using current parameters
  • M-step: Change parameters to maximize the
    complete (penalized) log likelihood, with an
    iterative quasi-Newton method (BFGS)

This is conditional EM, but it avoids the complexities
of Jebara 1998, because there is no need to solve the
M-step in closed form.
50
Efficient Training
  • Dynamic programming table is 3D: |x1| = |x2| =
    100, |S| = 12, ... 120,000 entries
  • Use beam search during E-step (Pal, Sutton,
    McCallum 2005)
  • Unlike completely observed CRFs, objective
    function is not convex.
  • Initialize parameters not at zero, but so as to
    yield a reasonable initial edit distance.

51
What Alignments are Learned?
x1 = Tommi Jaakkola, x2 = Tommi Jakola
[Figure: learned alignment of "T o m m i J a a k k o l a" with "T o m m i J a k o l a", drawn on the FSM with Start state and match (m = 1) and non-match (m = 0) copies of the states subst, copy, insert, delete]
52
What Alignments are Learned?
x1 = Bruce Croft, x2 = Tom Dean
[Figure: learned alignment of "B r u c e C r o f t" with "T o m D e a n", drawn on the FSM with Start state and match (m = 1) and non-match (m = 0) copies of the states subst, copy, insert, delete]
53
What Alignments are Learned?
x1 = Jaime Carbonell, x2 = Jamie Callan
[Figure: learned alignment of "J a i m e C a r b o n e l l" with "J a m i e C a l l a n", drawn on the FSM with Start state and match (m = 1) and non-match (m = 0) copies of the states subst, copy, insert, delete]
54
Example Learned Alignment
55
Summary of Advantages
  • Arbitrary features of the input strings
  • Examine past, future context
  • Use lexicons, WordNet
  • Extremely flexible edit operations
  • Single operation may make arbitrary jumps in both
    strings, of size determined by input features
  • Discriminative Training
  • Maximize ability to predict match vs non-match

56
Experimental Results: Data Sets
  • Restaurant name, Restaurant address
  • 864 records, 112 matches
  • E.g. "Abes Bar Grill, E. Main St" vs.
    "Abes Grill, East Main Street"
  • People names, UIS DB generator
  • synthetic noise
  • E.g. "John Smith" vs. "Snith, John"
  • CiteSeer Citations
  • In four sections: Reason, Face, Reinforce,
    Constraint
  • E.g. "Rusell Norvig, Artificial Intelligence
    A Modern..." vs. "Russell Norvig,
    Artificial Intelligence An Intro..."

57
Experimental Results: Features
  • same, different
  • same-alphabetic, different-alphabetic
  • same-numeric, different-numeric
  • punctuation1, punctuation2
  • alphabet-mismatch, numeric-mismatch
  • end-of-1, end-of-2
  • same-next-character, different-next-character

58
Experimental Results: Edit Operations
  • insert, delete, substitute/copy
  • swap-two-characters
  • skip-word-if-in-lexicon
  • skip-parenthesized-words
  • skip-any-word
  • substitute-word-pairs-in-translation-lexicon
  • skip-word-if-present-in-other-string

59
Experimental Results
Bilenko, Mooney 2003
F1 (harmonic mean of precision and recall)
                   Restaurant   Restaurant   CiteSeer
Distance metric    name         address      Reason   Face    Reinf   Constraint
Levenshtein        0.290        0.686        0.927    0.952   0.893   0.924
Learned Leven.     0.354        0.712        0.938    0.966   0.907   0.941
Vector             0.365        0.380        0.897    0.922   0.903   0.923
Learned Vector     0.433        0.532        0.924    0.875   0.808   0.913
60
Experimental Results
Bilenko, Mooney 2003
F1 (harmonic mean of precision and recall)
                    Restaurant   Restaurant   CiteSeer
Distance metric     name         address      Reason   Face    Reinf   Constraint
Levenshtein         0.290        0.686        0.927    0.952   0.893   0.924
Learned Leven.      0.354        0.712        0.938    0.966   0.907   0.941
Vector              0.365        0.380        0.897    0.922   0.903   0.923
Learned Vector      0.433        0.532        0.924    0.875   0.808   0.913
CRF Edit Distance   0.448        0.783        0.964    0.918   0.917   0.976
61
Experimental Results
Data set: person names, with word-order noise
added
                                          F1
Without skip-if-present-in-other-string   0.856
With skip-if-present-in-other-string      0.981
62
Outline
  • Information Extraction
  • Learning in the wild
  • Transfer learning
  • Identity Uncertainty
  • Modeling Groups, Roles and Topics

63
Joint Co-reference Decisions, Discriminative Model
Culotta, McCallum 2005
[Figure: pairwise Y/N co-reference decisions among the People mentions "Stuart Russell", "Stuart Russell" and "S. Russel"]
64
Co-reference for Multiple Entity Types
Culotta, McCallum 2005
[Figure: pairwise Y/N co-reference decisions among the People mentions "Stuart Russell", "Stuart Russell", "S. Russel" and among the Organizations mentions "University of California at Berkeley", "Berkeley", "Berkeley"]
65
Joint Co-reference of Multiple Entity Types
Culotta, McCallum 2005
[Figure: pairwise Y/N co-reference decisions among the People mentions "Stuart Russell", "Stuart Russell", "S. Russel" and among the Organizations mentions "University of California at Berkeley", "Berkeley", "Berkeley", made jointly]
Reduces error by 22%
66
Joint Co-reference Experimental Results
Culotta, McCallum 2005
CiteSeer Dataset: 1500 citations, 900 unique
papers, 350 unique venues
                  Paper            Venue
                  indep   joint    indep   joint
constraint        88.9    91.0     79.4    94.1
reinforce         92.2    92.2     56.5    60.1
face              88.2    93.7     80.9    82.8
reason            97.4    97.0     75.6    79.5
Micro Average     91.7    93.4     73.1    79.1
                  Δerror = 20%     Δerror = 22%
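The two error-reduction figures can be reproduced from the micro averages, assuming the scores are percentages and that "error" means 100 minus the score (the slide does not name the metric). The helper name `relative_error_reduction` is illustrative.

```python
def relative_error_reduction(score_before: float, score_after: float) -> float:
    """Percent reduction in error, where error = 100 - score (assumed)."""
    err_before = 100.0 - score_before
    err_after = 100.0 - score_after
    return 100.0 * (err_before - err_after) / err_before


print(round(relative_error_reduction(91.7, 93.4)))  # paper micro average: 20
print(round(relative_error_reduction(73.1, 79.1)))  # venue micro average: 22
```

Both values match the error reductions quoted on the slide.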
67
Outline
  • Information Extraction
  • Learning in the wild
  • Transfer learning
  • Identity Uncertainty
  • Modeling Groups, Roles and Topics

68
Social network from my email
69
Clustering words into topics with Latent
Dirichlet Allocation
Blei, Ng, Jordan 2003
Generative process (with example)
For each document:
  Sample a distribution over topics, θ   (example: 70% Iraq war, 30% US election)
  For each word in doc:
    Sample a topic, z                    (example: Iraq war)
    Sample a word from the topic, w      (example: bombing)
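The generative process can be sketched with the standard library alone. The two toy topics, their word probabilities and the Dirichlet parameter are hypothetical; only the topic names and the word "bombing" come from the slide. A Dirichlet draw is obtained by normalizing independent Gamma draws.

```python
import random
from typing import Dict, List, Tuple


def sample_dirichlet(alpha: List[float], rng: random.Random) -> List[float]:
    """Draw from Dirichlet(alpha) by normalizing Gamma(alpha_k, 1) samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics: List[Dict[str, float]], alpha: List[float],
                      n_words: int, rng: random.Random
                      ) -> Tuple[List[float], List[Tuple[int, str]]]:
    """Return the document's topic mixture and its (topic, word) pairs."""
    theta = sample_dirichlet(alpha, rng)  # per-document distribution over topics
    doc = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # sample a topic
        words, probs = zip(*topics[z].items())
        w = rng.choices(words, weights=probs)[0]               # sample a word
        doc.append((z, w))
    return theta, doc


# Hypothetical topics: 0 = "Iraq war", 1 = "US election".
topics = [
    {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},
    {"vote": 0.5, "campaign": 0.3, "poll": 0.2},
]
theta, doc = generate_document(topics, [1.0, 1.0], 10, random.Random(0))
print([round(p, 2) for p in theta])
print(" ".join(word for _, word in doc))
```

Inference runs this process in reverse: given only the words, it recovers plausible values for θ and z.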
70
Example topics induced from a large collection of
text
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
Tenenbaum et al
72
From LDA to Author-Recipient-Topic
(ART)
73
Inference and Estimation
  • Gibbs Sampling
  • Easy to implement
  • Reasonably fast
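To illustrate why Gibbs sampling is considered easy to implement, here is a minimal collapsed Gibbs sampler for plain LDA, used as a stand-in because the ART model's own sampling equations are not given in this transcript. The hyperparameters `alpha` and `beta`, the function name `lda_gibbs` and the toy corpus are assumptions for illustration.

```python
import random
from typing import List, Tuple


def lda_gibbs(docs: List[List[int]], n_topics: int, vocab_size: int,
              alpha: float = 0.1, beta: float = 0.01, n_iters: int = 50,
              seed: int = 0) -> Tuple[List[List[int]], List[List[int]], List[List[int]]]:
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                              # n(k)

    def bump(d: int, z: int, w: int, delta: int) -> None:
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for z, w in zip(zs, doc):
            bump(d, z, w, +1)
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, assign[d][i], w, -1)  # remove the token's current topic
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                bump(d, z, w, +1)             # add it back under the new topic
    return assign, doc_topic, topic_word


# Toy corpus over a 6-word vocabulary (word ids 0..5).
docs = [[0, 1, 0, 2], [3, 4, 3, 4, 5]]
assign, doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=6, n_iters=10)
print(doc_topic)
```

Each update needs only the three count tables, which is why the method is both short to write and reasonably fast per sweep.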

74
Outline
  • Email, motivation
  • ART Graphical Model.
  • Experimental Results
  • Enron Email (corpus)
  • Academic Email (one person)
  • RART Roles for ART
  • Group-Topic Model
  • Experiments on voting data
  • Voting data from U.S. Senate and the U.N.

75
Enron Email Corpus
  • 250k email messages
  • 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere_at_enron.com
To: steve.hooser_at_enron.com
Subject: Enron/TransAltaContract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has
requested an electronic copy of our final draft?
Are you OK with this? If so, the only version I
have is the original draft without
revisions.

DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin_at_enron.com
76
Topics, and prominent senders /
receivers discovered by ART
Topic names, by hand
77
Topics, and prominent senders / receivers discovered
by ART
Beck: Chief Operations Officer
Dasovich: Government Relations Executive
Shapiro: Vice President of Regulatory Affairs
Steffes: Vice President of Government Affairs
78
Comparing Role Discovery
Traditional SNA
Author-Topic
ART
connection strength (A,B)
distribution over recipients
distribution over authored topics
distribution over authored topics
79
Comparing Role Discovery: Tracy Geaconne vs. Dan
McCarty
Traditional SNA
Author-Topic
ART
Different roles
Different roles
Similar roles
Geaconne: Secretary; McCarty: Vice President
80
Comparing Role Discovery: Tracy Geaconne vs. Rod
Hayslett
Traditional SNA
Author-Topic
ART
Very similar
Not very similar
Different roles
Geaconne: Secretary; Hayslett: Vice President,
CTO
81
Comparing Role Discovery: Lynn Blair vs. Kimberly
Watson
Traditional SNA
Author-Topic
ART
Very different
Very similar
Different roles
Blair: Gas pipeline logistics; Watson:
Pipeline facilities planning
82
McCallum Email Corpus 2004
  • January - October 2004
  • 23k email messages
  • 825 people

From: kate_at_cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum_at_cs.umass.edu

There is pertinent stuff
on the first yellow folder that is completed
either travel or other things, so please sign
that first folder anyway. Then, here is the
reminder of the things I'm still waiting
for: NIPS registration receipt. CALO
registration receipt.
Thanks, Kate
83
McCallum Email Blockstructure
84
Four most prominent topics in discussions with
____?
85
(No Transcript)
86
Two most prominent topics in discussions with
____?
87
Topic 37
88
Topic 40
89
(No Transcript)
90
Outline
  • Email, motivation
  • ART Graphical Model.
  • Experimental Results
  • Enron Email (corpus)
  • Academic Email (one person)
  • RART Roles for ART
  • Group-Topic Model
  • Experiments on voting data
  • Voting data from U.S. Senate and the U.N.

91
Role-Author-Recipient-Topic Models
92
Results with RART: People in Role 3 in
Academic Email
  • olc: lead Linux sysadmin
  • gauthier: sysadmin for CIIR group
  • irsystem: mailing list for CIIR sysadmins
  • system: mailing list for dept. sysadmins
  • allan: Prof., chair of computing committee
  • valerie: second Linux sysadmin
  • tech: mailing list for dept. hardware
  • steve: head of dept. I.T. support

93
Roles for allan (James Allan)
  • Role 3: I.T. support
  • Role 2: Natural Language researcher

Roles for pereira (Fernando Pereira)
  • Role 2: Natural Language researcher
  • Role 4: SRI CALO project participant
  • Role 6: Grant proposal writer
  • Role 10: Grant proposal coordinator
  • Role 8: Guests at McCallum's house

94
Outline
  • Email, motivation
  • ART Graphical Model.
  • Experimental Results
  • Enron Email (corpus)
  • Academic Email (one person)
  • RART Roles for ART
  • Group-Topic Model
  • Experiments on voting data
  • Voting data from U.S. Senate and the U.N.

95
ART and RART: Roles but not Groups
Traditional SNA
Author-Topic
ART
Not
Not
Block structured
Enron TransWestern Division
96
A Group Model: Stochastic Blockstructures Model
97
Group-Topic Model
Wang, Mohanty, McCallum 2005
98
U.S. Senate Data sets
  • 3426 bills from 16 years of voting records from
    the U.S. Senate
  • Yea / Nay / Abstain (absent)
  • Each bill comes with an abstract (text describing
    the contents of the bill).

99
Topics Discovered
Traditional Mixtures of Unigrams
Group-Topic Model
100
Groups Discovered
Agreement Index
101
Senators who change Coalition Dependent on Topic
e.g. Senator Shelby (D-AL) votes
  with the Republicans on Economic
  with the Democrats on Education + Domestic
  with a small group of maverick Republicans on Social Security + Medicaid
102
U.N. Data Set
  • 931 U.N. Resolutions, voted on by 192 countries,
    from 1990-2003.
  • Yes / No / Abstain votes
  • List of keywords summarizes the content of the
    resolution.
  • Also experiments later with resolutions from
    1960-2003

103
Topics Discovered
Traditional mixture of unigrams
Group-Topic Model
104
Groups Discovered
105
Groups and Topics, Trends over Time