Group and Topic Discovery from Relations and Their Attributes

1
Group and Topic Discovery from Relations and
Their Attributes
  • Xuerui Shorey Wang, Natasha Mohanty, Andrew McCallum
    (mccallum_at_cs.umass.edu)
  • Computer Science Department
  • University of Massachusetts Amherst

2
Abstract
We present a probabilistic generative model of
entity relationships and their attributes that
simultaneously discovers groups among the
entities and topics among the corresponding
textual attributes. Block-models of
relationship data have been studied in social
network analysis for some time. Here we
simultaneously cluster in several modalities at
once, incorporating the attributes (here, words)
associated with certain relationships.
Significantly, joint inference allows the
discovery of topics to be guided by the emerging
groups, and vice-versa. We present
experimental results on two large data sets:
sixteen years of bills put before the U.S.
Senate, comprising their corresponding text and
voting records, and thirteen years of similar
data from the United Nations. We show that in
comparison with traditional, separate
latent-variable models for words or
blockstructures for votes, the Group-Topic
model's joint inference discovers more cohesive
groups and improved topics.
3
Social Network in an Email Dataset
4
From LDA to Author-Recipient-Topic
(ART)
5
Enron Email Corpus
  • 250k email messages
  • 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere_at_enron.com
To: steve.hooser_at_enron.com
Subject: Enron/TransAlta Contract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin_at_enron.com
6
Topics, and prominent senders /
receivers discovered by ART
Topic names, by hand
7
Topics, and prominent senders /
receivers discovered by ART
  • Beck: Chief Operations Officer
  • Dasovich: Government Relations Executive
  • Shapiro: Vice President of Regulatory Affairs
  • Steffes: Vice President of Government Affairs
8
Comparing Role Discovery: Tracy Geaconne vs. Dan
McCarty
Traditional SNA
Author-Topic
ART
Different roles
Different roles
Similar roles
Geaconne: Secretary; McCarty: Vice President
9
Comparing Role Discovery: Lynn Blair vs. Kimberly
Watson
Traditional SNA
Author-Topic
ART
Very different
Very similar
Different roles
Blair: Gas pipeline logistics; Watson:
Pipeline facilities planning
10
McCallum Email Corpus 2004
  • January - October 2004
  • 23k email messages
  • 825 people

From: kate_at_cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum_at_cs.umass.edu

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:

NIPS registration receipt.
CALO registration receipt.

Thanks, Kate
11
McCallum Email Blockstructure
12
Four most prominent topics in discussions with
____?
14
Two most prominent topics in discussions with
____?
16
Pairs with highest rank difference between ART
and SNA
  • 5 other professors
  • 3 other ML researchers
17
Role-Author-Recipient-Topic Models
18
Results with RART: People in Role 3 in
Academic Email
  • olc: lead Linux sysadmin
  • gauthier: sysadmin for CIIR group
  • irsystem: mailing list for CIIR sysadmins
  • system: mailing list for dept. sysadmins
  • allan: Prof., chair of computing committee
  • valerie: second Linux sysadmin
  • tech: mailing list for dept. hardware
  • steve: head of dept. I.T. support

19
Roles for allan (James Allan)
  • Role 3: I.T. support
  • Role 2: Natural Language researcher

Roles for pereira (Fernando Pereira)
  • Role 2: Natural Language researcher
  • Role 4: SRI CALO project participant
  • Role 6: Grant proposal writer
  • Role 10: Grant proposal coordinator
  • Role 8: Guests at McCallum's house

20
ART: Roles but not Groups
Traditional SNA
Author-Topic
ART
Not
Not
Block structured
Enron TransWestern Division
21
Groups and Topics
  • Input
    • Observed relations between people
    • Attributes on those relations (text, or categorical)
  • Output
    • Attributes clustered into topics
    • Groups of people, varying depending on topic

22
Discovering Groups from Observed Set of Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
Admiration relations among six high school
students.
23
Adjacency Matrix Representing Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
[Figure: adjacency matrix of the Acad relation, shown three ways. In roster order A B C D E F without group labels; in roster order with group labels G1 G2 G1 G2 G3 G3; and reordered by group as A C B D E F with group labels G1 G1 G2 G2 G3 G3.]
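The reordering shown on this slide can be reproduced with a short sketch. The function name `adjacency` and the use of NumPy are illustrative choices, not taken from the slides; the relation pairs and the two orderings are copied from the slide.

```python
import numpy as np

# Acad(x, y) pairs exactly as listed on the slide.
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

def adjacency(order, pairs):
    """Build a 0/1 adjacency matrix with rows and columns in the given order.

    `adjacency` is a hypothetical helper name used only for this sketch.
    """
    index = {name: i for i, name in enumerate(order)}
    matrix = np.zeros((len(order), len(order)), dtype=int)
    for source, target in pairs:
        matrix[index[source], index[target]] = 1
    return matrix

# Roster order: the block structure is hard to see.
roster_view = adjacency(["A", "B", "C", "D", "E", "F"], acad)

# Group order (G1 = A, C; G2 = B, D; G3 = E, F): solid 2x2 blocks appear.
grouped_view = adjacency(["A", "C", "B", "D", "E", "F"], acad)

print(roster_view)
print(grouped_view)
```

In the grouped view every 2x2 block is either all ones or all zeros, which is the pattern a blockstructure model tries to recover when the grouping is not known in advance.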
24
Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations (Nowicki, Snijders 2001)
[Figure: graphical model with Dirichlet, Multinomial, Beta, and Binomial distributions. S: number of entities; G: number of groups.]
Enhanced with arbitrary number of groups in
Kemp, Griffiths, Tenenbaum 2004
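A minimal generative sketch of a blockstructure model may help make the diagram labels concrete. The slide names only the distributions (Dirichlet, Multinomial, Beta, Binomial) and the sizes S and G; which variable each distribution attaches to, the function name, and all hyperparameter values below are assumptions made for illustration.

```python
import numpy as np

def sample_blockstructure(S, G, alpha=1.0, beta_a=1.0, beta_b=1.0, seed=0):
    """Sample group assignments and relations (hypothetical helper).

    Assumed wiring of the slide's distribution labels:
      group proportions  ~ Dirichlet
      group assignment   ~ Multinomial (one draw per entity)
      block probability  ~ Beta (one per ordered pair of groups)
      relation           ~ Binomial with one trial (one per ordered pair of entities)
    """
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(G, alpha))         # group proportions
    groups = rng.choice(G, size=S, p=theta)          # group of each entity
    block_prob = rng.beta(beta_a, beta_b, size=(G, G))
    # Look up the block probability for every ordered pair of entities.
    pair_prob = block_prob[groups[:, None], groups[None, :]]
    relations = rng.binomial(1, pair_prob)           # S x S matrix of 0/1
    return groups, block_prob, relations

groups, block_prob, relations = sample_blockstructure(S=6, G=3)
print(groups)
print(relations)
```

Inference runs this process in reverse: given only `relations`, it estimates `groups`.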
25
Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
Social Admiration: Soci(A, B), Soci(A, D), Soci(A, F), Soci(B, A), Soci(B, C), Soci(B, E), Soci(C, B), Soci(C, D), Soci(C, F), Soci(D, A), Soci(D, C), Soci(D, E), Soci(E, B), Soci(E, D), Soci(E, F), Soci(F, A), Soci(F, C), Soci(F, E)
[Figure: two adjacency matrices. Academic admiration, ordered A C B D E F, with groups G1 G1 G2 G2 G3 G3; social admiration, ordered A C E B D F, with groups G1 G1 G1 G2 G2 G2.]
26
Simple Topic Model: Good for Single-Topic
Documents
Mixture of Unigrams
[Figure: graphical model with Uniform, Dirichlet, and Multinomial distributions. D: number of documents; T: number of topics; also marked: the number of tokens in document d.]
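The mixture-of-unigrams model can be sketched as a generative process in which each document gets exactly one topic. The slide gives only the distribution names and the sizes D and T; the vocabulary size `V`, the fixed document length, the hyperparameter `eta`, and the function name are assumptions for this sketch.

```python
import numpy as np

def sample_mixture_of_unigrams(D, T, V, doc_len, eta=0.5, seed=0):
    """Sample D single-topic documents (hypothetical helper).

    Assumed wiring of the slide's labels:
      word distribution of each topic ~ Dirichlet
      topic of each document          ~ Uniform over T topics
      each token in a document        ~ Multinomial over the vocabulary
    """
    rng = np.random.default_rng(seed)
    topic_words = rng.dirichlet(np.full(V, eta), size=T)   # T x V
    doc_topics = rng.integers(0, T, size=D)                # one topic per document
    docs = [rng.choice(V, size=doc_len, p=topic_words[t]) for t in doc_topics]
    return doc_topics, topic_words, docs

doc_topics, topic_words, docs = sample_mixture_of_unigrams(D=4, T=2, V=10, doc_len=8)
print(doc_topics)
print(docs[0])
```

Because every token in a document shares one topic, the model suits short, focused texts such as bill index terms or resolution titles.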
27
Goal: Model relations and their (textual)
attributes simultaneously to obtain better groups
and more meaningful topics.
28
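The Group-Topic Model: Discovering Groups and
Topics Simultaneously
[Figure: graphical model combining the group model and the mixture of unigrams, with Beta, Uniform, Dirichlet, Multinomial, Dirichlet, Binomial, and Multinomial distributions.]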
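One way to read the combined diagram is as the generative sketch below, which joins a mixture of unigrams over each event's words with a blockstructure over entity pairs whose grouping depends on the event's topic. The slide lists only distribution names, so the exact wiring, the index sets (for example, drawing a fresh block-probability matrix per event), the symmetric 0/1 relation, all hyperparameters, and the function name are assumptions.

```python
import numpy as np

def sample_group_topic(S, G, T, B, V, words_per_event, seed=0):
    """Sample B events with words and pairwise relations (hypothetical helper).

    S entities, G groups, T topics, B events (e.g. bills), V vocabulary size.
    """
    rng = np.random.default_rng(seed)
    topic_words = rng.dirichlet(np.ones(V), size=T)        # Dirichlet: T x V
    group_props = rng.dirichlet(np.ones(G), size=T)        # Dirichlet: T x G
    # Multinomial: each entity gets one group per topic (T x S).
    groups = np.stack([rng.choice(G, size=S, p=group_props[t]) for t in range(T)])
    topics = rng.integers(0, T, size=B)                    # Uniform topic per event
    words, relations = [], []
    for b in range(B):
        t = topics[b]
        # Multinomial: tokens of the event drawn from its topic.
        words.append(rng.choice(V, size=words_per_event, p=topic_words[t]))
        # Beta: block probabilities, made symmetric for a symmetric relation.
        gamma = rng.beta(1.0, 1.0, size=(G, G))
        gamma = np.triu(gamma) + np.triu(gamma, 1).T
        g = groups[t]
        pair_prob = gamma[g[:, None], g[None, :]]
        # Binomial (one trial): sample the upper triangle, then mirror it.
        upper = rng.binomial(1, np.triu(pair_prob, 1))
        relations.append(upper + upper.T)
    return topics, groups, words, relations

topics, groups, words, relations = sample_group_topic(
    S=6, G=3, T=2, B=4, V=10, words_per_event=5)
print(topics)
print(relations[0])
```

The point of the joint model is visible in the `groups[t]` lookup: the same entities can fall into different groups for different topics.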
29
Inference and Estimation
  • Gibbs Sampling
    • Many random variables can be integrated out
    • Easy to implement
    • Reasonably fast

We assume the relationship is symmetric.
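As an illustration of what "integrated out" means here, consider a single block probability with a Beta prior and a set of 0/1 relations drawn from it. The symbols below ($\gamma$ for the block probability, $a$ and $b$ for the Beta parameters, $n_1$ and $n_0$ for the counts of ones and zeros in the block) are chosen for this sketch and are not taken from the slides.

```latex
\int_0^1 \gamma^{n_1} (1-\gamma)^{n_0}
  \,\frac{\gamma^{a-1} (1-\gamma)^{b-1}}{\mathrm{B}(a, b)} \, d\gamma
  \;=\; \frac{\mathrm{B}(a + n_1,\; b + n_0)}{\mathrm{B}(a, b)}
```

Because the integral has this closed form, a sampler only needs to track the counts $n_1$ and $n_0$ per block rather than sampling $\gamma$ itself. The same conjugacy argument applies to a Dirichlet prior paired with a multinomial.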
30
Dataset 1: U.S. Senate
  • 16 years of voting records in the US Senate (1989-2005)
  • A Senator may respond Yea or Nay to a resolution
  • 3423 resolutions with text attributes (index terms)
  • 191 Senators in total across 16 years

S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991)
Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and other 110 terms

Adams (D-WA), Nay
Akaka (D-HI), Yea
Bentsen (D-TX), Yea
Biden (D-DE), Yea
Bond (R-MO), Yea
Bradley (D-NJ), Nay
Conrad (D-ND), Nay
31
Topics Discovered (U.S. Senate)
Mixture of Unigrams
Education    Energy     Military Misc.  Economic
education    energy     government      federal
school       power      military        labor
aid          water      foreign         insurance
children     nuclear    tax             aid
drug         gas        congress        tax
students     petrol     aid             business
elementary   research   law             employee
prevention   pollution  policy          care

Group-Topic Model
Education Domestic  Foreign       Economic   Social Security Medicare
education           foreign       labor      social
school              trade         insurance  security
federal             chemicals     tax        insurance
aid                 tariff        congress   medical
government          congress      income     care
tax                 drugs         minimum    medicare
energy              communicable  wage       disability
research            diseases      business   assistance
32
Senators in the four groups corresponding to
Topic Education Domestic
Group 1: 73 Republicans, Krueger(D-TX)
Group 2: 90 Democrats, Chafee(R-RI), Jeffords(I-VT)
Group 3: Cohen(R-ME), Danforth(R-MO), Durenberger(R-MN), Hatfield(R-OR), Heinz(R-PA), Kassebaum(R-KS), Packwood(R-OR), Specter(R-PA), Snowe(R-ME), Collins(R-ME)
Group 4: Armstrong(R-CO), Garn(R-UT), Humphrey(R-NH), McCain(R-AZ), McClure(R-ID), Roth(R-DE), Symms(R-ID), Wallop(R-WY), Brown(R-CO), DeWine(R-OH), Thompson(R-TN), Fitzgerald(R-IL), Voinovich(R-OH), Miller(D-GA), Coleman(R-MN)
33
Senators in the four groups corresponding to
Topic Economic
Group 1: 65 Democrats, Jeffords(I-VT)
Group 2: 101 Republicans, Shelby(D-AL), Miller(D-GA)
Group 3: Baucus(D-MT), Boren(D-OK), Breaux(D-LA), Conrad(D-ND), Dixon(D-IL), Exon(D-NE), Ford(D-KY), Heflin(D-AL), Hollings(D-SC), Johnston(D-LA), Nunn(D-GA), Dorgan(D-ND), Mathews(D-TN), Campbell(D-CO), Landrieu(D-LA), Lincoln(D-AR), Bayh(D-IN), Carper(D-DE), Nelson(D-NE)
Group 4: Byrd(D-WV), DeConcini(D-AZ), Burdick, Jocelyn Birch(D-ND), Feingold(D-WI), Obama(D-IL), Salazar(D-CO)
34
Senators Who Change Coalition the Most Depending
on Topic
e.g. Senator Shelby (D-AL) votes
  • with the Republicans on Economic
  • with the Democrats on Education Domestic
  • with a small group of maverick Republicans on Social Security Medicare
35
Dataset 2: The UN General Assembly
  • Voting records of the UN General Assembly (1990 -
    2003)
  • A country may choose to vote Yes, No or Abstain
  • 931 resolutions with text attributes (titles)
  • 192 countries in total
  • Also experiments later with resolutions from
    1960-2003

Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
36
Topics Discovered (UN)
Mixture of Unigrams
Everything Nuclear  Human Rights  Security in Middle East
nuclear             rights        occupied
weapons             human         israel
use                 palestine     syria
implementation      situation     security
countries           israel        calls

Group-Topic Model
Nuclear Non-proliferation  Nuclear Arms Race  Human Rights
nuclear                    nuclear            rights
states                     arms               human
united                     prevention         palestine
weapons                    race               occupied
nations                    space              israel
37
Groups Discovered (UN)
The country lists for each group are ordered by
2005 GDP (PPP), and only 5 countries are
shown for groups that have more than 5 members.
38
Do We Get Better Groups with the GT Model?
Baseline Model
  1. Cluster bills into topics using mixture of
    unigrams.
  2. Apply group model on topic-specific subsets of
    bills.
GT Model
  1. Jointly cluster topics and groups at the same time
    using the GT model.

Datasets  Avg. AI for Baseline  Avg. AI for GT  p-value
Senate    0.8198                0.8294          < .01
UN        0.8548                0.8664          < .01
Agreement Index (AI) measures group cohesion.
Higher is better.
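The slides do not give the formula behind the Agreement Index. The sketch below uses one widely cited roll-call definition, in which the largest vote bloc is rewarded and the remaining votes are penalized at half weight; whether this is the exact formula behind the table above is an assumption. The function name and argument names are illustrative.

```python
def agreement_index(yes, no, abstain=0):
    """Agreement among a group's members on a single vote.

    Assumed definition (not confirmed by the slides):
        AI = (max - 0.5 * (total - max)) / total
    where max is the size of the largest of the yes/no/abstain blocs.
    Returns 1.0 when the group is unanimous.
    """
    total = yes + no + abstain
    if total == 0:
        raise ValueError("group cast no votes")
    largest = max(yes, no, abstain)
    return (largest - 0.5 * (total - largest)) / total

print(agreement_index(10, 0))      # unanimous group
print(agreement_index(5, 5))       # evenly split two-way vote
print(agreement_index(1, 1, 1))    # evenly split three-way vote
```

Averaging this value over all votes and groups would give a single cohesion score per model, comparable in spirit to the table's "Avg. AI" columns.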
39
Groups and Topics, Trends over Time (UN)
40
Summary
  • Traditionally, SNA examines links, but not the
    language content on those links.
  • Presented the Group-Topic (GT) model, a graphical
    model augmenting Stochastic Blockstructures with
    words and a latent topic model.
  • Attributes on relations could also be
    categorical, or real-valued.
  • GT finds the topics that most help predict
    relations.