Title: Group and Topic Discovery from Relations and Their Attributes
1. Group and Topic Discovery from Relations and Their Attributes
- Xuerui Shorey Wang, Natasha Mohanty, Andrew McCallum (mccallum_at_cs.umass.edu)
- Computer Science Department
- University of Massachusetts Amherst
2. Abstract
We present a probabilistic generative model of
entity relationships and their attributes that
simultaneously discovers groups among the
entities and topics among the corresponding
textual attributes. Block-models of
relationship data have been studied in social
network analysis for some time. Here we
simultaneously cluster in several modalities at
once, incorporating the attributes (here, words)
associated with certain relationships.
Significantly, joint inference allows the
discovery of topics to be guided by the emerging
groups, and vice-versa. We present
experimental results on two large data sets:
sixteen years of bills put before the U.S.
Senate, comprising their corresponding text and
voting records, and thirteen years of similar
data from the United Nations. We show that in
comparison with traditional, separate
latent-variable models for words or
Blockstructures for votes, the Group-Topic
model's joint inference discovers more cohesive
groups and improved topics.
3. Social Network in an Email Dataset
4. From LDA to Author-Recipient-Topic (ART)
5. Enron Email Corpus
- 250k email messages
- 23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere_at_enron.com
To: steve.hooser_at_enron.com
Subject: Enron/TransAlta Contract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere, Enron North America Corp., Legal Department, 1400 Smith Street, EB 3885, Houston, Texas 77002, dperlin_at_enron.com
6. Topics, and prominent senders / receivers discovered by ART
Topic names, by hand
7. Topics, and prominent senders / receivers discovered by ART
- Beck: Chief Operations Officer
- Dasovich: Government Relations Executive
- Shapiro: Vice President of Regulatory Affairs
- Steffes: Vice President of Government Affairs
8. Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
Traditional SNA
Author-Topic
ART
Different roles
Different roles
Similar roles
Geaconne: Secretary; McCarty: Vice President
9. Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
Traditional SNA
Author-Topic
ART
Very different
Very similar
Different roles
Blair: Gas pipeline logistics; Watson: Pipeline facilities planning
10. McCallum Email Corpus 2004
- January - October 2004
- 23k email messages
- 825 people
From: kate_at_cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum_at_cs.umass.edu
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:
NIPS registration receipt.
CALO registration receipt.
Thanks, Kate
11. McCallum Email Blockstructure
12. Four most prominent topics in discussions with ____?
14. Two most prominent topics in discussions with ____?
16. Pairs with highest rank difference between ART and SNA
5 other professors, 3 other ML researchers
17. Role-Author-Recipient-Topic Models
18. Results with RART: People in Role 3 in Academic Email
- olc: lead Linux sysadmin
- gauthier: sysadmin for CIIR group
- irsystem: mailing list for CIIR sysadmins
- system: mailing list for dept. sysadmins
- allan: Prof., chair of computing committee
- valerie: second Linux sysadmin
- tech: mailing list for dept. hardware
- steve: head of dept. I.T. support
19. Roles for allan (James Allan)
- Role 3: I.T. support
- Role 2: Natural Language researcher
Roles for pereira (Fernando Pereira)
- Role 2: Natural Language researcher
- Role 4: SRI CALO project participant
- Role 6: Grant proposal writer
- Role 10: Grant proposal coordinator
- Role 8: Guests at McCallum's house
20. ART: Roles but not Groups
Traditional SNA
Author-Topic
ART
Not
Not
Block structured
Enron TransWestern Division
21. Groups and Topics
- Input
- Observed relations between people
- Attributes on those relations (text, or categorical)
- Output
- Attributes clustered into topics
- Groups of people, varying depending on topic
22. Discovering Groups from Observed Set of Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
Admiration relations among six high school students.
23. Adjacency Matrix Representing Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
[Figure: adjacency matrices of the Acad relation. In roster order A B C D E F the rows and columns carry group labels G1 G2 G1 G2 G3 G3; reordered by group as A C B D E F the labels become G1 G1 G2 G2 G3 G3.]
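The reordering shown on this slide can be sketched in a few lines. The snippet below is only an illustration: it takes the Acad pairs and the group labels from the slide, builds the 0/1 adjacency matrix, and prints it with rows and columns sorted by group so the block pattern becomes visible. The helper name `adjacency` is made up for this example.

```python
# Directed "academic admiration" pairs from the slide: Acad(X, Y) means X admires Y.
students = ["A", "B", "C", "D", "E", "F"]
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

# Group labels as shown on the slide.
group = {"A": "G1", "C": "G1", "B": "G2", "D": "G2", "E": "G3", "F": "G3"}


def adjacency(order, relations):
    """Return a 0/1 matrix whose rows and columns follow `order` (hypothetical helper)."""
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for src, dst in relations:
        matrix[index[src]][index[dst]] = 1
    return matrix


# Sort entities by group label so members of one group sit next to each other.
by_group = sorted(students, key=lambda s: (group[s], s))

for name, row in zip(by_group, adjacency(by_group, acad)):
    print(name, group[name], row)
```

After sorting, every non-zero entry falls into one of three off-diagonal blocks (G1 to G2, G2 to G3, G3 to G1), which is the structure a blockmodel tries to recover without being told the labels.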
24. Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations (Nowicki, Snijders 2001)
[Figure: graphical model; node distributions: Beta, Dirichlet, Multinomial, Binomial]
S: number of entities; G: number of groups
Enhanced with arbitrary number of groups in Kemp, Griffiths, Tenenbaum 2004
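One way to read the distribution labels on this slide is as the following generative process, sketched here with the standard library only. This is an assumption-laden illustration, not the authors' code: the hyperparameter names `alpha` and `beta`, the function name, and the choice to forbid self-relations are all made up for the example, and a single-trial Binomial (a Bernoulli draw) is used for each relation.

```python
import random


def sample_blockmodel(num_entities, num_groups, alpha=1.0, beta=1.0, seed=0):
    """Draw groups and binary relations from a simple stochastic blockmodel (sketch)."""
    rng = random.Random(seed)

    # Group proportions ~ Dirichlet(alpha), via normalized Gamma draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_groups)]
    theta = [d / sum(draws) for d in draws]

    # Group of each entity ~ Multinomial(theta).
    groups = rng.choices(range(num_groups), weights=theta, k=num_entities)

    # Probability of a relation from group g to group h ~ Beta(beta, beta).
    link_prob = [[rng.betavariate(beta, beta) for _ in range(num_groups)]
                 for _ in range(num_groups)]

    # Each relation is a single-trial Binomial on the pair's group-level probability.
    relations = [
        [int(i != j and rng.random() < link_prob[groups[i]][groups[j]])
         for j in range(num_entities)]
        for i in range(num_entities)
    ]
    return groups, link_prob, relations


groups, link_prob, relations = sample_blockmodel(num_entities=6, num_groups=3)
print(groups)
for row in relations:
    print(row)
```

Inference runs this process in reverse: given only `relations`, it looks for the `groups` that make the observed matrix most probable.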
25. Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
Social Admiration: Soci(A, B), Soci(A, D), Soci(A, F), Soci(B, A), Soci(B, C), Soci(B, E), Soci(C, B), Soci(C, D), Soci(C, F), Soci(D, A), Soci(D, C), Soci(D, E), Soci(E, B), Soci(E, D), Soci(E, F), Soci(F, A), Soci(F, C), Soci(F, E)
[Figure: two adjacency matrices. One is ordered A C B D E F with group labels G1 G1 G2 G2 G3 G3; the other is ordered A C E B D F with group labels G1 G1 G1 G2 G2 G2.]
26. Simple Topic Model: Good for Single Topic Documents
Mixture of Unigrams
[Figure: graphical model; node distributions: Uniform, Dirichlet, Multinomial]
D: number of documents; T: number of topics
number of tokens in document d
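As a rough illustration of the mixture of unigrams, the sketch below draws one topic per document uniformly at random and then draws every word of that document from the topic's word distribution. It assumes the Dirichlet label on the slide is the prior over each topic's word distribution; the names `eta`, `vocab`, and `sample_mixture_of_unigrams` are invented for this example, and a fixed document length stands in for the per-document token count.

```python
import random


def sample_mixture_of_unigrams(num_docs, num_topics, vocab, doc_length,
                               eta=0.5, seed=0):
    """Generate documents that each use exactly one topic (illustrative sketch)."""
    rng = random.Random(seed)

    def dirichlet(size, concentration):
        # Symmetric Dirichlet draw via normalized Gamma variates.
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
        total = sum(draws)
        return [d / total for d in draws]

    # One word distribution per topic ~ Dirichlet(eta).
    phi = [dirichlet(len(vocab), eta) for _ in range(num_topics)]

    docs = []
    for _ in range(num_docs):
        topic = rng.randrange(num_topics)                            # Uniform over topics
        words = rng.choices(vocab, weights=phi[topic], k=doc_length)  # Multinomial
        docs.append((topic, words))
    return phi, docs


vocab = ["education", "school", "energy", "nuclear", "tax", "insurance"]
phi, docs = sample_mixture_of_unigrams(num_docs=4, num_topics=2,
                                       vocab=vocab, doc_length=5)
for topic, words in docs:
    print(topic, words)
```

Because a document never mixes topics, the model suits short single-subject texts such as a bill's index terms or a resolution title.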
27. Goal: Model relations and their (textual) attributes simultaneously to obtain better groups and more meaningful topics.
28. The Group-Topic Model: Discovering Groups and Topics Simultaneously
[Figure: graphical model; node distributions: Beta, Uniform, Dirichlet, Multinomial, Dirichlet, Binomial, Multinomial]
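The figure itself is lost, so the sketch below is only one plausible reading of the distribution labels on this slide, not a definitive statement of the model. Its assumptions: each event (a bill or resolution) gets one topic drawn uniformly; its words come from that topic's multinomial; every entity has a separate group assignment per topic; and each pair of entities agrees or disagrees on the event according to a Beta-distributed probability attached to their two groups. All names and hyperparameters (`alpha`, `beta`, `eta`) are invented for illustration.

```python
import random


def sample_group_topic(num_entities, num_groups, num_topics, num_events,
                       vocab, words_per_event, alpha=1.0, beta=1.0, eta=0.5,
                       seed=0):
    """Sample events with words and pairwise agreements (illustrative sketch)."""
    rng = random.Random(seed)

    def dirichlet(size, concentration):
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
        total = sum(draws)
        return [d / total for d in draws]

    phi = [dirichlet(len(vocab), eta) for _ in range(num_topics)]     # topic -> words
    theta = [dirichlet(num_groups, alpha) for _ in range(num_topics)]  # topic -> groups

    # group[t][i]: group of entity i when the event is about topic t.
    group = [rng.choices(range(num_groups), weights=theta[t], k=num_entities)
             for t in range(num_topics)]

    events = []
    for _ in range(num_events):
        t = rng.randrange(num_topics)                               # Uniform topic
        words = rng.choices(vocab, weights=phi[t], k=words_per_event)

        # Symmetric group-pair agreement probabilities for this event ~ Beta.
        gamma = [[0.0] * num_groups for _ in range(num_groups)]
        for g in range(num_groups):
            for h in range(g, num_groups):
                gamma[g][h] = gamma[h][g] = rng.betavariate(beta, beta)

        # One single-trial Binomial per unordered pair (symmetric relation).
        agree = {(i, j): int(rng.random() < gamma[group[t][i]][group[t][j]])
                 for i in range(num_entities)
                 for j in range(i + 1, num_entities)}
        events.append({"topic": t, "words": words, "agree": agree})
    return group, events


group, events = sample_group_topic(
    num_entities=5, num_groups=2, num_topics=3, num_events=4,
    vocab=["tax", "school", "nuclear", "trade"], words_per_event=6)
print(events[0]["topic"], events[0]["words"])
```

The point of the sketch is the coupling: the topic of an event selects which grouping of entities governs the relations on that event, so words and relations inform each other during inference.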
29. Inference and Estimation
- Gibbs Sampling
- Many random variables can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.
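To make "integrated out" concrete, here is a small collapsed Gibbs sampler for the simpler mixture-of-unigrams model rather than for the full Group-Topic model. It is a sketch under stated assumptions: a uniform prior over topics, a symmetric Dirichlet prior with parameter `eta` on each topic's word distribution, and function and variable names chosen for this example. The word distributions never appear explicitly; only counts are kept, and each document's topic is resampled from its Dirichlet-multinomial predictive probability.

```python
import math
import random
from collections import Counter


def gibbs_mixture_of_unigrams(docs, num_topics, eta=0.1, sweeps=50, seed=0):
    """Collapsed Gibbs sampling of one topic per document (illustrative sketch)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]

    # Random initial assignment plus the count tables it implies.
    assign = [rng.randrange(num_topics) for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for c, t in zip(counts, assign):
        topic_word[t].update(c)
        topic_total[t] += sum(c.values())

    def log_predictive(c, t):
        # log p(document | topic t, all other documents), word distribution integrated out.
        n_d = sum(c.values())
        score = (math.lgamma(topic_total[t] + vocab_size * eta)
                 - math.lgamma(topic_total[t] + n_d + vocab_size * eta))
        for w, k in c.items():
            score += (math.lgamma(topic_word[t][w] + k + eta)
                      - math.lgamma(topic_word[t][w] + eta))
        return score

    for _ in range(sweeps):
        for d, c in enumerate(counts):
            # Remove document d from the counts of its current topic.
            old = assign[d]
            topic_word[old].subtract(c)
            topic_total[old] -= sum(c.values())

            # Sample a new topic; the uniform prior contributes only a constant.
            logp = [log_predictive(c, t) for t in range(num_topics)]
            peak = max(logp)
            weights = [math.exp(x - peak) for x in logp]
            new = rng.choices(range(num_topics), weights=weights)[0]

            assign[d] = new
            topic_word[new].update(c)
            topic_total[new] += sum(c.values())
    return assign


docs = [["tax", "budget", "tax"], ["school", "students", "school"],
        ["budget", "tax", "income"], ["students", "education", "school"]]
assign = gibbs_mixture_of_unigrams(docs, num_topics=2)
print(assign)
```

A sampler for the Group-Topic model would follow the same remove, score, resample, restore pattern, with additional count tables for the group assignments and the relations.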
30. Dataset 1: U.S. Senate
- 16 years of voting records in the US Senate (1989-2005)
- a Senator may respond Yea or Nay to a resolution
- 3423 resolutions with text attributes (index terms)
- 191 Senators in total across 16 years
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. MI (introduced 3/5/1991)
Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and other 110 terms
Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay
31. Topics Discovered (U.S. Senate)
Mixture of Unigrams
Education     Energy      Military Misc.   Economic
education     energy      government       federal
school        power       military         labor
aid           water       foreign          insurance
children      nuclear     tax              aid
drug          gas         congress         tax
students      petrol      aid              business
elementary    research    law              employee
prevention    pollution   policy           care
Group-Topic Model
Education + Domestic   Foreign        Economic    Social Security + Medicare
education              foreign        labor       social
school                 trade          insurance   security
federal                chemicals      tax         insurance
aid                    tariff         congress    medical
government             congress       income      care
tax                    drugs          minimum     medicare
energy                 communicable   wage        disability
research               diseases       business    assistance
32. Senators in the four groups corresponding to Topic Education + Domestic
Group 1: 73 Republicans, Krueger(D-TX)
Group 2: 90 Democrats, Chafee(R-RI), Jeffords(I-VT)
Group 3: Cohen(R-ME), Danforth(R-MO), Durenberger(R-MN), Hatfield(R-OR), Heinz(R-PA), Kassebaum(R-KS), Packwood(R-OR), Specter(R-PA), Snowe(R-ME), Collins(R-ME)
Group 4: Armstrong(R-CO), Garn(R-UT), Humphrey(R-NH), McCain(R-AZ), McClure(R-ID), Roth(R-DE), Symms(R-ID), Wallop(R-WY), Brown(R-CO), DeWine(R-OH), Thompson(R-TN), Fitzgerald(R-IL), Voinovich(R-OH), Miller(D-GA), Coleman(R-MN)
33. Senators in the four groups corresponding to Topic Economic
Group 1: 65 Democrats, Jeffords(I-VT)
Group 2: 101 Republicans, Shelby(D-AL), Miller(D-GA)
Group 3: Baucus(D-MT), Boren(D-OK), Breaux(D-LA), Conrad(D-ND), Dixon(D-IL), Exon(D-NE), Ford(D-KY), Heflin(D-AL), Hollings(D-SC), Johnston(D-LA), Nunn(D-GA), Dorgan(D-ND), Mathews(D-TN), Campbell(D-CO), Landrieu(D-LA), Lincoln(D-AR), Bayh(D-IN), Carper(D-DE), Nelson(D-NE)
Group 4: Byrd(D-WV), DeConcini(D-AZ), Burdick, Jocelyn Birch(D-ND), Feingold(D-WI), Obama(D-IL), Salazar(D-CO)
34. Senators Who Change Coalition the Most, Dependent on Topic
e.g. Senator Shelby (D-AL) votes with the Republicans on Economic, with the Democrats on Education + Domestic, and with a small group of maverick Republicans on Social Security + Medicare
35. Dataset 2: The UN General Assembly
- Voting records of the UN General Assembly (1990-2003)
- A country may choose to vote Yes, No or Abstain
- 931 resolutions with text attributes (titles)
- 192 countries in total
- Also experiments later with resolutions from 1960-2003
Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions.
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
36. Topics Discovered (UN)
Mixture of Unigrams
Everything Nuclear   Human Rights   Security in Middle East
nuclear              rights         occupied
weapons              human          israel
use                  palestine      syria
implementation       situation      security
countries            israel         calls
Group-Topic Model
Nuclear Non-proliferation   Nuclear Arms Race   Human Rights
nuclear                     nuclear             rights
states                      arms                human
united                      prevention          palestine
weapons                     race                occupied
nations                     space               israel
37. Groups Discovered (UN)
The country list for each group is ordered by 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.
38. Do We Get Better Groups with the GT Model?
Baseline Model
- Cluster bills into topics using mixture of unigrams
- Apply group model on topic-specific subsets of bills.
GT Model
- Jointly cluster topics and groups at the same time using the GT model.
Datasets   Avg. AI for Baseline   Avg. AI for GT   p-value
Senate     0.8198                 0.8294           < .01
UN         0.8548                 0.8664           < .01
Agreement Index (AI) measures group cohesion. Higher is better.
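The slide does not give the formula for the Agreement Index. The sketch below assumes it is the index commonly used for legislative voting (due to Hix, Noury and Roland), computed per group and per vote from the counts of Yes, No and Abstain; the function names are chosen for this example and the numbers in the usage lines are made up, not taken from the datasets.

```python
def agreement_index(yes, no, abstain=0):
    """Agreement Index of one group on one vote (assumed definition, see lead-in).

    Returns 1.0 when every member votes the same way and falls as the
    group splits across the available options.
    """
    total = yes + no + abstain
    if total == 0:
        raise ValueError("no votes cast")
    largest = max(yes, no, abstain)
    return (largest - 0.5 * (total - largest)) / total


def average_agreement(vote_counts):
    """Mean Agreement Index over a list of (yes, no, abstain) tuples."""
    return sum(agreement_index(*c) for c in vote_counts) / len(vote_counts)


print(agreement_index(10, 0, 0))   # unanimous group
print(agreement_index(5, 5, 5))    # evenly split three ways
print(average_agreement([(8, 2, 0), (10, 0, 0)]))
```

Under this definition a unanimous group scores 1 and a group split evenly three ways scores 0; with only Yea and Nay available, as in the Senate data, an even split scores 0.25.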
39. Groups and Topics, Trends over Time (UN)
40. Summary
- Traditionally, SNA examines links, but not the language content on those links.
- Presented the Group-Topic (GT) model, a graphical model augmenting Stochastic Blockstructures with words and a latent topic model.
- Attributes on relations could also be categorical, or real-valued.
- GT finds the topics that most help predict relations.