Title: Latent Semantic Analysis, Probabilistic Topic Models
1 Latent Semantic Analysis, Probabilistic Topic Models, Associative Memory
2 The Psychological Problem
- How do we learn semantic structure?
  - Covariation between words and the contexts they appear in (e.g. LSA)
- How do we represent semantic structure?
  - Semantic spaces (e.g. LSA)
  - Probabilistic topics
3 Latent Semantic Analysis (Landauer & Dumais, 1997)
[Figure: word-document counts are mapped by SVD into a high-dimensional space; example word points STREAM, RIVER, BANK, MONEY]
- Each word is a single point in semantic space
- Similarity is measured by the cosine of the angle between word vectors
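The pipeline on this slide (counts, then SVD, then cosine between word vectors) can be sketched as follows. This is a minimal illustration, not the procedure of Landauer & Dumais: the 4×4 count matrix, the choice of two dimensions, and the helper names `lsa_word_vectors` and `cosine` are all assumptions made for the example, and real LSA typically reweights the counts before the SVD.

```python
import numpy as np

# Hypothetical toy word-document counts (rows: words, columns: documents).
words = ["stream", "river", "bank", "money"]
counts = np.array([
    [2, 3, 0, 0],   # stream
    [3, 2, 0, 0],   # river
    [1, 2, 3, 2],   # bank
    [0, 0, 3, 4],   # money
], dtype=float)

def lsa_word_vectors(counts, k):
    """Return k-dimensional word vectors from a truncated SVD of the counts."""
    u, s, _vt = np.linalg.svd(counts, full_matrices=False)
    # Keep the k largest singular values; scale each kept dimension by its value.
    return u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = lsa_word_vectors(counts, k=2)
vec = dict(zip(words, vectors))

sim_river_stream = cosine(vec["river"], vec["stream"])
sim_river_money = cosine(vec["river"], vec["money"])
print(f"river-stream: {sim_river_stream:.2f}, river-money: {sim_river_money:.2f}")
```

In this toy matrix RIVER and STREAM occur in the same documents, so their vectors point in nearly the same direction, while RIVER and MONEY share no documents and end up far apart in angle.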
4 Critical Assumptions of Semantic Spaces (e.g. LSA)
- Psychological distance should obey three axioms
  - Minimality
  - Symmetry
  - Triangle inequality
5 For conceptual relations, violations of distance axioms are often found
- Similarities can often be asymmetric
  - North Korea is more similar to China than vice versa
  - Pomegranate is more similar to Apple than vice versa
- Violations of triangle inequality
[Figure: triangle with sides AB, BC, and AC]
Euclidean distance: $AC \le AB + BC$
6 Triangle Inequality in Semantic Spaces might not always hold
[Figure: three word vectors w1, w2, w3, labeled with the words THEATER, SOCCER, and PLAY]
Euclidean distance: $AC \le AB + BC$
Cosine similarity: $\cos(w_1,w_3) \ge \cos(w_1,w_2)\cos(w_2,w_3) - \sin(w_1,w_2)\sin(w_2,w_3)$
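To make the cosine form of the constraint concrete, the sketch below computes the lower bound on cos(w1,w3) implied by the other two cosines and checks it numerically on random vectors. The similarity values of 0.8, the assumption that PLAY is the middle word w2, and the function names are illustrative only.

```python
import math
import random

def cosine(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_lower_bound(cos12, cos23):
    """Smallest cos(w1, w3) a vector space allows, given the other two cosines."""
    sin12 = math.sqrt(max(0.0, 1.0 - cos12 ** 2))
    sin23 = math.sqrt(max(0.0, 1.0 - cos23 ** 2))
    return cos12 * cos23 - sin12 * sin23

# Hypothetical values: THEATER-PLAY and PLAY-SOCCER both have cosine 0.8.
# The space then forces cos(THEATER, SOCCER) >= 0.64 - 0.36 = 0.28.
bound = cosine_lower_bound(0.8, 0.8)

def count_violations(n_triples=1000, dim=5, seed=0, tol=1e-9):
    """Count random vector triples that break the bound (up to rounding)."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(n_triples):
        w1, w2, w3 = ([rng.gauss(0, 1) for _ in range(dim)] for _ in range(3))
        limit = cosine_lower_bound(cosine(w1, w2), cosine(w2, w3))
        if cosine(w1, w3) < limit - tol:
            violations += 1
    return violations

violations = count_violations()
print(f"bound = {bound:.2f}, violations = {violations}")
```

The point for the psychological argument is that a spatial model cannot make THEATER and SOCCER unrelated once both are close to PLAY, whereas human judgments can.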
7 Nearest neighbor problem (Tversky & Hutchinson, 1986)
- In similarity data, Fruit is the nearest neighbor of 18 out of 20 fruit words
- In a 2D solution, Fruit can be the nearest neighbor of at most 5 items
- High-dimensional solutions might solve this, but these are less appealing
8 Probabilistic Topic Models
- A probabilistic version of LSA: no spatial constraints
- Originated in the domain of statistics & machine learning
  - (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003)
- Extracts topics from large collections of text
- Topics are interpretable, unlike the arbitrary dimensions of LSA
9 Model is Generative
Find parameters that reconstruct the data
DATA: corpus of text (word counts for each document)
[Figure: DATA and Topic Model boxes]
10 Probabilistic Topic Models
- Each document is a probability distribution over topics (distribution over topics = gist)
- Each topic is a probability distribution over words
11 Document generation as a probabilistic process
- For each document, choose a mixture of topics
- For every word slot, sample a topic 1..T from the mixture
- Sample a word from the topic
[Figure: TOPICS MIXTURE → TOPIC ... TOPIC → WORD ... WORD]
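The three steps above can be sketched in a few lines. The vocabulary and the two topic-word distributions below are assumptions chosen to resemble a money/river example; the mixture weights, document length, and the name `generate_document` are likewise illustrative.

```python
import numpy as np

vocab = ["money", "loan", "bank", "river", "stream"]

# Hypothetical topic-word distributions: each row is P(word | topic).
topics = np.array([
    [1/3, 1/3, 1/3, 0.0, 0.0],   # topic 1: money, loan, bank
    [0.0, 0.0, 1/3, 1/3, 1/3],   # topic 2: bank, river, stream
])

def generate_document(topic_mixture, n_words, rng):
    """Sample (word, topic) pairs: a topic per word slot, then a word from it."""
    doc = []
    for _ in range(n_words):
        z = rng.choice(len(topic_mixture), p=topic_mixture)   # sample a topic
        w = rng.choice(len(vocab), p=topics[z])               # sample a word
        doc.append((vocab[w], int(z) + 1))                    # topics numbered from 1
    return doc

rng = np.random.default_rng(0)
doc = generate_document(np.array([0.8, 0.2]), n_words=32, rng=rng)
print(" ".join(f"{word}{z}" for word, z in doc))
```

Note that "bank" can be produced by either topic, so the same word form carries different topic labels in different slots.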
12 Example
[Figure: two topics (mixture components) generate two documents through mixture weights; the digit after each word is its topic assignment]
TOPIC 1: money, loan, bank
TOPIC 2: river, stream, bank
DOCUMENT 1 (mixture weights: .8 topic 1, .2 topic 2):
money1 bank1 bank1 loan1 river2
stream2 bank1 money1 river2 bank1 money1 bank1
loan1 money1 stream2 bank1 money1 bank1 bank1
loan1 river2 stream2 bank1 money1 river2 bank1
money1 bank1 loan1 bank1 money1 stream2
DOCUMENT 2 (mixture weights: .3 topic 1, .7 topic 2):
river2 stream2 bank2 stream2 bank2
money1 loan1 river2 stream2 loan1 bank2 river2
bank2 bank1 stream2 river2 loan1 bank2
stream2 bank2 money1 loan1 river2 stream2 bank2
stream2 bank2 money1 river2 stream2 loan1
bank2 river2 bank2 money1 bank1 stream2 river2
bank2 stream2 bank2 money1
Bayesian approach: use priors
- Mixture weights ~ Dirichlet(α)
- Mixture components ~ Dirichlet(β)
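As a small illustration of the priors, the sketch below draws mixture weights and mixture components from symmetric Dirichlet distributions. The hyperparameter values, the sizes, and the variable names `theta` and `phi` are assumptions for the example, not values given on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

n_docs, n_topics, vocab_size = 3, 2, 5
alpha, beta = 1.0, 0.5   # hypothetical symmetric Dirichlet hyperparameters

# Mixture weights: one distribution over topics per document.
theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

# Mixture components: one distribution over words per topic.
phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

print(theta.round(2))
print(phi.round(2))
```

Smaller hyperparameter values tend to give sparser draws, so each document leans on few topics and each topic on few words.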
13 Inverting (fitting) the model
[Figure: the same two documents, with the topic assignments, the topics (mixture components), and the mixture weights all unknown, each marked "?"]
DOCUMENT 1:
money? bank? bank? loan? river?
stream? bank? money? river? bank? money? bank?
loan? money? stream? bank? money? bank? bank?
loan? river? stream? bank? money? river? bank?
money? bank? loan? bank? money? stream?
DOCUMENT 2:
river? stream? bank? stream? bank?
money? loan? river? stream? loan? bank? river?
bank? bank? stream? river? loan? bank?
stream? bank? money? loan? river? stream? bank?
stream? bank? money? river? stream? loan?
bank? river? bank? money? bank? stream? river?
bank? stream? bank? money?
TOPIC 1: ?
TOPIC 2: ?
Mixture weights: ?
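The slide does not say which inference algorithm is used. One common choice for this kind of model is collapsed Gibbs sampling, sketched below under that assumption: each token's topic is resampled given all other assignments, and the mixture weights and topics are then estimated from the counts. The function name `gibbs_lda`, the hyperparameter defaults, and the three toy documents are all illustrative.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.5, beta=0.1, n_iter=100, seed=0):
    """Collapsed Gibbs sampling sketch for a topic model.

    docs is a list of documents, each a list of integer word ids.
    Returns (theta, phi, z): document-topic weights, topic-word
    distributions, and the final topic assignment of every token.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))    # tokens in document d assigned to topic k
    n_kw = np.zeros((n_topics, vocab_size))   # times word w is assigned to topic k
    n_k = np.zeros(n_topics)                  # total tokens assigned to topic k

    # Start from random topic assignments for every token.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from the counts.
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Unnormalized probability of each topic for this token.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment and restore the counts.
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1

    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta, phi, z

vocab = ["money", "loan", "bank", "river", "stream"]
word_id = {w: i for i, w in enumerate(vocab)}
docs = [
    [word_id[w] for w in "money bank loan bank money loan bank".split()],
    [word_id[w] for w in "river stream bank river stream bank".split()],
    [word_id[w] for w in "money bank river stream loan bank".split()],
]
theta, phi, z = gibbs_lda(docs, n_topics=2, vocab_size=len(vocab))
print(theta.round(2))
print(phi.round(2))
```

Topic numbering is arbitrary across runs, so which recovered topic corresponds to "money" and which to "river" depends on the random seed.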
14 Application to corpus data
- TASA corpus: text from first grade to college
  - representative sample of text
- 26,000 word types (stop words removed)
- 37,000 documents
- 6,000,000 word tokens
15 Example topics from an educational corpus (TASA)
- 37K docs, 26K words
- 1700 topics, e.g.
  - PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
  - PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
  - TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
  - JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
  - HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
  - STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
16 Polysemy
- The same word can appear in several topics, e.g. CHARACTERS, PLAY, COURT, EVIDENCE, and TEST in the topics below
  - PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
  - PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
  - TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
  - JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
  - HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
  - STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
17 Three documents with the word play (numbers and colors indicate topic assignments)
18 No Problem of Triangle Inequality
[Figure: TOPIC 1 and TOPIC 2, with the words SOCCER, MAGNETIC, and FIELD]
Topic structure easily explains violations of the triangle inequality
19 Applications
20 Enron email data
500,000 emails, 5000 authors, 1999-2002
21 Enron topics
- TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
- GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
- ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
- FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
- POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
- STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
TIMELINE
May 22, 2000: Start of California energy crisis
22 Applying Model to Psychological Data
23 Network of Word Associations
[Figure: association network linking BAT, BALL, BASEBALL, GAME, PLAY, STAGE, THEATER]
(Association norms by Doug Nelson et al., 1998)
24 Explaining structure with topics
[Figure: the same words (BAT, BALL, BASEBALL, GAME, PLAY, STAGE, THEATER) grouped under topic 1 and topic 2]
25 Modeling Word Association
- Word association modeled as prediction
- Given that a single word is observed, what other words might occur in the future?
- Under a single topic assumption:
[Equation: predictive probability of the Response word given the Cue word]
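The slide's equation did not survive extraction. One standard way to write prediction under a single-topic assumption is P(response | cue) = Σ_z P(response | z) P(z | cue), with P(z | cue) ∝ P(cue | z) P(z); the sketch below assumes that form. The topic-word probabilities, the uniform topic prior, the grouping of words into a sports and a theater topic, and the name `associate_probs` are hypothetical.

```python
import numpy as np

vocab = ["bat", "ball", "baseball", "game", "play", "stage", "theater"]

# Hypothetical P(word | topic); PLAY has probability under both topics.
phi = np.array([
    [0.2, 0.2, 0.2, 0.2, 0.2, 0.0, 0.0],   # topic 1: sports
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # topic 2: theater
])
topic_prior = np.array([0.5, 0.5])         # assumed uniform P(topic)

def associate_probs(cue, phi, topic_prior, vocab):
    """Return P(response | cue) for every word in the vocabulary."""
    w1 = vocab.index(cue)
    posterior = phi[:, w1] * topic_prior    # P(cue | z) P(z), unnormalized
    posterior = posterior / posterior.sum() # P(z | cue)
    return posterior @ phi                  # sum over z of P(z | cue) P(w | z)

for cue in ("play", "bat"):
    probs = associate_probs(cue, phi, topic_prior, vocab)
    ranked = sorted(zip(vocab, probs), key=lambda pair: -pair[1])
    print(cue, [(w, round(float(p), 3)) for w, p in ranked[:4]])
```

Because PLAY belongs to both topics it predicts associates from both, whereas BAT predicts only sports words. The cue itself is not excluded from the responses here; a model of association data would presumably drop it.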
26 Observed associates for the cue play
27 Model predictions
[Figure annotation: RANK 9]
28 Median rank of first associate
[Figure: axis label Median Rank]
29 Recall example: study list
- STUDY: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
- FALSE RECALL: Sleep 61%
30 Recall as a reconstructive process
- Reconstruct study list based on the stored gist
- The gist can be represented by a distribution over topics
- Under a single topic assumption:
[Equation: probability of the Retrieved word given the Study list]
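As a hedged illustration of reconstruction from gist, the sketch below infers a distribution over topics from a whole study list and then scores every word, studied or not, by its probability under that gist. It assumes the form P(w | list) = Σ_z P(w | z) P(z | list) with P(z | list) ∝ P(z) Π_i P(w_i | z). The seven-word vocabulary, both topic distributions, and the name `recall_probs` are invented for the example.

```python
import numpy as np

vocab = ["bed", "rest", "tired", "dream", "sleep", "peace", "war"]

# Hypothetical P(word | topic) for a sleep-related topic and an unrelated one.
phi = np.array([
    [0.20, 0.20, 0.15, 0.15, 0.25, 0.04, 0.01],   # topic 1: sleep
    [0.02, 0.10, 0.02, 0.02, 0.02, 0.40, 0.42],   # topic 2: other
])
topic_prior = np.array([0.5, 0.5])                # assumed uniform P(topic)

def recall_probs(study_list, phi, topic_prior, vocab):
    """Return P(word | study list) assuming one topic generated the list."""
    ids = [vocab.index(w) for w in study_list]
    likelihood = phi[:, ids].prod(axis=1)         # product of P(w_i | z) over the list
    gist = likelihood * topic_prior
    gist = gist / gist.sum()                      # P(z | study list)
    return gist @ phi                             # P(word | gist)

study_list = ["bed", "rest", "tired", "dream", "peace"]
probs = recall_probs(study_list, phi, topic_prior, vocab)
for word, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    tag = "studied" if word in study_list else "extralist"
    print(f"{word:6s} {p:.3f} {tag}")
```

With these made-up numbers the unstudied word "sleep" receives the highest probability, which is the kind of intrusion the false-recall example describes.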
31 Predictions for the Sleep list
[Figure: panels STUDY LIST and EXTRA LIST (top 8)]