Title: Analyzing unstructured text with topic models
1Analyzing unstructured text with topic models
Mark Steyvers
Dept. of Cognitive Sciences, Dept. of Computer Science
University of California, Irvine
Collaborators: Padhraic Smyth (UC Irvine), Tom Griffiths (UC Berkeley)
2Analyzing Unstructured Text
- Pennsylvania Gazette (1728-1800): 80,000 articles
- Enron: 250,000 emails
- NYT: 330,000 articles
- NSF/NIH: 100,000 grants
- AOL queries: 20,000,000 queries, 650,000 users
- Medline: 16 million articles
3Topic Models and Text Analysis
- Can answer a number of questions
- What is in this corpus?
- What is in this document, paragraph, or sentence?
- What does this person/group of people write about?
- What tags are appropriate for this document?
- What are the topical trends over time?
4Topic Models
- Automatic and unsupervised extraction of semantic themes from large text collections
- Widely used model in machine learning and text mining
  - pLSI model: Hofmann (1999)
  - LDA model: Blei, Ng, and Jordan (2001, 2003)
  - LDA with Gibbs sampling: Griffiths and Steyvers (2003, 2004)
5Basic Assumptions
- Each topic is a distribution over words
- Each document is a mixture of topics
- Each word in a document originates from a single
topic
6Model
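$$P(\text{word} \mid \text{document}) = \sum_{\text{topic}} P(\text{word} \mid \text{topic})\, P(\text{topic} \mid \text{document})$$
- P(word | topic): each topic is a probability distribution over words
- P(topic | document): topic weights for each document
- Both are learned automatically from the text corpus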
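A minimal sketch of the mixture formula on this slide, using made-up numbers. The two topics, their word probabilities, and the document weights below are illustrative assumptions, not values from the talk.

```python
# Hypothetical toy values, chosen only to illustrate the formula.
# phi[t][w] = P(word w | topic t); theta[t] = P(topic t | document)
phi = [
    {"money": 0.4, "loan": 0.3, "bank": 0.3},    # topic 0
    {"river": 0.4, "stream": 0.3, "bank": 0.3},  # topic 1
]
theta = [0.6, 0.4]


def word_prob(word, phi, theta):
    """P(word | document) = sum over topics of P(word | topic) * P(topic | document)."""
    return sum(theta[t] * phi[t].get(word, 0.0) for t in range(len(theta)))


vocab = sorted({w for topic in phi for w in topic})
probs = {w: word_prob(w, phi, theta) for w in vocab}
for w in vocab:
    print(f"P({w} | document) = {probs[w]:.3f}")
```

Because each topic and the document weights are proper distributions, the resulting word probabilities also sum to one.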
7Toy Example
Documents and topic assignments (the digit after each word is the topic that generated it):
- Document 1 (topic weight 1.0): MONEY1 BANK1 BANK1 LOAN1 BANK1 MONEY1 BANK1 MONEY1 BANK1 LOAN1 LOAN1 BANK1 MONEY1 ....
- Document 2 (topic weights .6 and .4): RIVER2 MONEY1 BANK2 STREAM2 BANK2 BANK1 MONEY1 RIVER2 MONEY1 BANK2 LOAN1 MONEY1 ....
- Document 3 (topic weight 1.0): RIVER2 BANK2 STREAM2 BANK2 RIVER2 BANK2 ....
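The toy example can be read as a generative process: for each word, draw a topic from the document's topic weights, then draw a word from that topic. The sketch below assumes uniform word probabilities within each topic and assumes that .6 belongs to topic 1 in the mixed document; neither detail is stated on the slide.

```python
import random

# Assumed topics: equal word probabilities within each topic (illustrative only).
TOPICS = [
    {"MONEY": 1 / 3, "LOAN": 1 / 3, "BANK": 1 / 3},    # topic 1
    {"RIVER": 1 / 3, "STREAM": 1 / 3, "BANK": 1 / 3},  # topic 2
]


def generate_document(theta, topics, n_words, rng):
    """Return a list of (word, topic_number) pairs sampled from the mixture."""
    doc = []
    for _ in range(n_words):
        # Pick a topic according to the document's topic weights.
        t = rng.choices(range(len(theta)), weights=theta)[0]
        # Pick a word according to that topic's word distribution.
        words = list(topics[t])
        w = rng.choices(words, weights=[topics[t][x] for x in words])[0]
        doc.append((w, t + 1))
    return doc


rng = random.Random(0)
doc1 = generate_document([1.0, 0.0], TOPICS, 12, rng)  # only topic 1
doc2 = generate_document([0.6, 0.4], TOPICS, 12, rng)  # assumed .6 / .4 split
doc3 = generate_document([0.0, 1.0], TOPICS, 12, rng)  # only topic 2
for doc in (doc1, doc2, doc3):
    print(" ".join(f"{w}{t}" for w, t in doc))
```

Note that BANK can come from either topic, which is what makes the inference problem on the next slide non-trivial.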
8Statistical Inference
Documents and topic assignments (all assignments unknown):
- Document 1: MONEY? BANK? BANK? LOAN? BANK? MONEY? BANK? MONEY? BANK? LOAN? LOAN? BANK? MONEY? ....
- Document 2: RIVER? MONEY? BANK? STREAM? BANK? BANK? MONEY? RIVER? MONEY? BANK? LOAN? MONEY? ....
- Document 3: RIVER? BANK? STREAM? BANK? RIVER? BANK? ....
Topics: ?
Topic weights: ?
9Statistical Inference
- Exact inference is intractable
- Markov chain Monte Carlo (MCMC) with Gibbs sampling
  - scalable to large document collections (e.g. all of Wikipedia)
  - parallelizable
- Form of dimensionality reduction
- Number of topics T: 50 to 2000
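A compact sketch of collapsed Gibbs sampling for LDA, in the spirit of the approach cited earlier (Griffiths and Steyvers). The hyperparameter values `alpha` and `beta`, the iteration count, and the tiny corpus are assumptions for illustration; this is not the toolbox implementation.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampler. docs is a list of lists of integer word ids."""
    rng = random.Random(seed)
    nwt = [[0] * n_topics for _ in range(vocab_size)]  # word-topic counts
    ndt = [[0] * n_topics for _ in docs]               # document-topic counts
    nt = [0] * n_topics                                # tokens per topic
    z = []                                             # topic of every token

    # Random initial assignment of each token to a topic.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(n_topics)
            z[d].append(t)
            nwt[w][t] += 1
            ndt[d][t] += 1
            nt[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                t = z[d][i]
                nwt[w][t] -= 1
                ndt[d][t] -= 1
                nt[t] -= 1
                # Unnormalized conditional P(z = k | all other assignments).
                weights = [
                    (nwt[w][k] + beta) / (nt[k] + vocab_size * beta)
                    * (ndt[d][k] + alpha)
                    for k in range(n_topics)
                ]
                # Resample the topic and restore the counts.
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                nwt[w][t] += 1
                ndt[d][t] += 1
                nt[t] += 1
    return z, nwt, ndt


# Word ids: 0=MONEY, 1=LOAN, 2=BANK, 3=RIVER, 4=STREAM (toy corpus).
docs = [[0, 1, 2, 0, 2, 1], [3, 4, 2, 3, 4, 2], [0, 2, 3, 4, 2, 1]]
z, nwt, ndt = gibbs_lda(docs, n_topics=2, vocab_size=5, n_iter=50)
print(ndt)  # per-document topic counts after sampling
```

Only count tables are kept between sweeps, which is one reason the method scales and can be split across machines.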
10Examples Topics from New York Times
Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
11Learning multiple meanings of words
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
12Demographic Analysis of Search Queries
13AOL dataset
- Dataset
  - 20,000,000 web queries
  - 650,000 users
- Each user was given an anonymous user ID
- No demographics in this dataset
14Example query log from user 2178
ID    Query                    Date/Time            URL clicked
2178  dog eats uncooked pasta  2006-05-26 15:31:56
2178  inducing dog vomiting    2006-05-26 15:32:46  http://www.twodogpress.com
2178  inducing dog vomiting    2006-05-26 15:32:46  http://www.canismajor.com
2178  inducing dog vomiting    2006-05-26 15:32:46  http://kitchen.robbiehaf.com
2178  inducing dog vomiting    2006-05-26 15:32:46  http://www.dog-first-aid-101.com
2178  inducing dog vomiting    2006-05-26 15:38:36
2178  walmart                  2006-05-12 12:39:52  http://www.walmart.com
2178  sears                    2006-05-12 12:44:22  http://www.sears.com
2178  target                   2006-05-12 17:05:36  http://www.target.com
2178  babycenter.com           2006-05-12 17:43:59  http://www.babycenter.com
2178  google                   2006-05-16 10:54:39  http://www.google.com
2178  fit pregnancy            2006-05-16 15:34:23
2178  baby center              2006-05-16 15:37:22
2178  yahoo.com                2006-05-18 17:11:05  http://www.yahoo.com
2178  applebee's carside       2006-05-19 19:21:08  http://www.applebees.com
2178  baby names               2006-05-20 15:02:38  http://www.babynames.com
2178  baby names               2006-05-20 15:02:38  http://www.babynamesworld.com
2178  baby names               2006-05-20 15:02:38  http://www.thinkbabynames.com
2178  mortgage calculator      2006-05-24 14:39:05  http://www.bankrate.com
2178  us zip codes             2006-05-25 21:26:47  http://www.usps.com
2178  us zip codes             2006-05-25 21:26:47  http://www.usps.com
15Another Query Database
- Not publicly available
- Dataset
- 250,000 users
- 411,000 queries
- Age and gender of users are known
- Age brackets: 0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+
16Topic modeling of queries
- Each user searches for a mixture of topics
- Each topic is a probability distribution over query words
17Four example topics (out of 200)
auto car parts cars used ford honda truck toyota
party store wedding birthday jewelry ideas cards cake gifts
webmd cymbalta xanax gout vicodin effexor prednisone lexapro ambien
hannah montana zac efron disney high school musical miley cyrus hilary duff
Each topic is a probability distribution over words. Most likely words are listed first.
18User mixture of topics
auto car parts cars used ford honda truck toyota
party store wedding birthday jewelry ideas cards cake gifts
hannah montana zac efron disney high school musical miley cyrus hilary duff
webmd cymbalta xanax gout vicodin effexor prednisone lexapro ambien
Figure: User 7654 and User 246 linked to topics with weights 80, 20, and 100
19Topic Analysis
- Find likely topics for each demographic bucket
- Find likely demographics given topics
- What's on the mind of people in different age groups?
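One simple way to relate topics and demographics, sketched under assumptions: average the per-user topic weights within each age bracket to estimate P(topic | bracket), then apply Bayes' rule for P(bracket | topic). The slides do not say how the analysis was actually done, and the users below are hypothetical.

```python
from collections import defaultdict


def topics_given_bucket(users):
    """users: list of (bucket, topic_weights). Returns an estimate of P(topic | bucket)."""
    sums = {}
    counts = defaultdict(int)
    for bucket, theta in users:
        acc = sums.setdefault(bucket, [0.0] * len(theta))
        for t, p in enumerate(theta):
            acc[t] += p
        counts[bucket] += 1
    return {b: [s / counts[b] for s in acc] for b, acc in sums.items()}


def buckets_given_topic(p_topic_bucket, p_bucket, topic):
    """Bayes' rule: P(bucket | topic) is proportional to P(topic | bucket) * P(bucket)."""
    joint = {b: p_topic_bucket[b][topic] * p_bucket[b] for b in p_topic_bucket}
    total = sum(joint.values())
    return {b: v / total for b, v in joint.items()}


# Hypothetical users: (age bracket, weights over two topics).
users = [("13-17", [0.8, 0.2]), ("13-17", [0.6, 0.4]), ("45-54", [0.1, 0.9])]
p_topic_bucket = topics_given_bucket(users)
p_bucket = {"13-17": 2 / 3, "45-54": 1 / 3}  # share of users per bracket
posterior = buckets_given_topic(p_topic_bucket, p_bucket, topic=0)
print(p_topic_bucket)
print(posterior)
```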
20poems topic
21myspace topic
22sports topic
23MTV topic
24Clothing Stores topic
25Hairstyles topic
26recipes topic
27Results
- Topic models give quick summaries of demographic trends in query datasets
- Other potential applications
  - e.g. blogs, social networking sites, email, etc.
  - clinical data, e.g. therapy discussions
28Analyzing Emails: who writes on what topics?
29Enron email data
250,000 emails, 5,000 authors, 1999-2002
30Author-topic models
- We can learn the association between authors of documents and topics
- Assume each author works on a mixture of topics
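A sketch of one common way to formalize this assumption: for each word, pick one of the document's authors, draw a topic from that author's topic mixture, then draw a word from the topic. Choosing the author uniformly, and all names and probabilities below, are assumptions made for illustration.

```python
import random


def generate_email(authors, author_topics, topic_words, n_words, rng):
    """Return a list of (word, author, topic) triples for one document."""
    out = []
    for _ in range(n_words):
        a = rng.choice(authors)  # assumed: uniform over the document's authors
        topics = list(author_topics[a])
        t = rng.choices(topics, weights=[author_topics[a][k] for k in topics])[0]
        words = list(topic_words[t])
        w = rng.choices(words, weights=[topic_words[t][x] for x in words])[0]
        out.append((w, a, t))
    return out


# Hypothetical authors and topics.
author_topics = {
    "author_a": {"energy": 0.9, "legal": 0.1},
    "author_b": {"energy": 0.2, "legal": 0.8},
}
topic_words = {
    "energy": {"power": 0.5, "gas": 0.3, "market": 0.2},
    "legal": {"contract": 0.5, "agreement": 0.3, "court": 0.2},
}
email = generate_email(["author_a", "author_b"], author_topics, topic_words,
                       10, random.Random(1))
for word, author, topic in email:
    print(word, author, topic)
```

Inference runs this process in reverse: given only the words and the author list, it estimates which author and topic produced each word.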
31ENRON Email: who writes on certain topics?
... But also over senders (authors) of email.
Most likely authors listed at the top
32Enron email: two example topics (T = 100)
33Detecting Papers on Unusual Topics for Authors
- We can calculate perplexity (unusualness) for
words in a document given an author
Papers ranked by perplexity for M. Jordan
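A minimal sketch of the perplexity score, assuming the usual definition exp(-mean log P(word | author)) with P(word | author) computed by mixing the author's topic weights over the topics. The topics, weights, and the probability floor are hypothetical.

```python
import math


def perplexity(words, author_theta, phi, floor=1e-12):
    """Perplexity of a word sequence under one author's topic mixture."""
    log_sum = 0.0
    for w in words:
        # P(word | author) = sum over topics of P(word | topic) * P(topic | author)
        p = sum(author_theta[t] * phi[t].get(w, 0.0) for t in range(len(author_theta)))
        log_sum += math.log(max(p, floor))  # floor avoids log(0) for unseen words
    return math.exp(-log_sum / len(words))


# Hypothetical topics and an author who mostly writes about topic 0.
phi = [
    {"kernel": 0.5, "svm": 0.5},    # topic 0
    {"neuron": 0.5, "spike": 0.5},  # topic 1
]
author_theta = [0.9, 0.1]

usual = perplexity(["kernel", "svm", "kernel"], author_theta, phi)
unusual = perplexity(["neuron", "spike", "neuron"], author_theta, phi)
print(usual, unusual)
```

Higher perplexity means the words are less expected for that author, so ranking papers by this score surfaces the unusual ones.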
34Author Separation
Can the model attribute words to the correct authors within a document?
35Application: Faculty Browser
36Faculty Browser
- Automatically analyzes computer science papers by UC San Diego and UC Irvine researchers
- Finds topically related researchers
37one topic
most prolific researchers in this topic
38one researcher
topics this researcher is interested in
other researchers with similar topical interests
39Inferred network of researchers connected through
topics
40Modeling Extensions
41Entity-topic modeling
330,000 articles 2000-2002
Who is mentioned in what context?
42Extracted Named Entities
Three investigations began Thursday into the
securities and exchange_commission's choice of
william_webster to head a new board overseeing
the accounting profession. house and
senate_democrats called for the resignations of
both judge_webster and harvey_pitt, the
commission's chairman. The white_house expressed
support for judge_webster as well as for
harvey_pitt, who was harshly criticized Thursday
for failing to inform other commissioners before
they approved the choice of judge_webster that he
had led the audit committee of a company facing
fraud accusations. The president still has
confidence in harvey_pitt, said dan_bartlett,
bush's communications director
- Used standard algorithms to extract named entities
  - People
  - Places
  - Organizations
43Standard Topic Model with Entities
45Example of Extracted Entity-Topic Network
46Topic Trends
Example topics plotted: Tour-de-France, Quarterly Earnings, Anthrax
Vertical axis: proportion of words assigned to the topic in each time slice
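The quantity on the vertical axis can be computed directly from per-word topic assignments. A small sketch, assuming each word token is stored as a (time slice, topic) pair; the data and labels below are invented.

```python
from collections import Counter


def topic_trend(assignments, topic):
    """assignments: iterable of (time_slice, topic), one entry per word token.

    Returns {time_slice: proportion of that slice's tokens assigned to topic}.
    """
    total = Counter()
    hits = Counter()
    for time_slice, t in assignments:
        total[time_slice] += 1
        if t == topic:
            hits[time_slice] += 1
    return {s: hits[s] / total[s] for s in sorted(total)}


# Hypothetical token-level assignments.
assignments = [
    ("2001-Q3", "anthrax"), ("2001-Q3", "earnings"), ("2001-Q3", "earnings"),
    ("2001-Q4", "anthrax"), ("2001-Q4", "anthrax"), ("2001-Q4", "anthrax"),
    ("2001-Q4", "earnings"),
]
trend = topic_trend(assignments, "anthrax")
print(trend)
```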
47Learning Topic Hierarchies (example: Psych Review abstracts)
THE OF AND TO IN A IS
A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT
SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING
MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES
DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN
RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING
SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC
ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING
GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS
SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS
REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL
IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC
CONDITIONING STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES
49Hidden Markov Topics Model
- Syntactic dependencies → short-range dependencies
- Semantic dependencies → long-range dependencies
Semantic state: generates words from the topic model
Syntactic states: generate words from the HMM
Figure: graphical model with nodes θ, z1-z4, w1-w4, and s1-s4
(Griffiths, Steyvers, Blei, Tenenbaum, 2004)
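A generative sketch of the idea on this slide: an HMM moves between word classes, and one designated class hands word generation over to the document's topic model. The state names, transition table, and vocabularies below are invented for illustration and are not taken from the cited paper.

```python
import random


def generate_sentence(transitions, class_words, theta, phi, n_words, rng, start="START"):
    """Sample words from an HMM in which the SEMANTIC state emits from topics."""
    state, out = start, []
    for _ in range(n_words):
        # Short-range (syntactic) structure: move to the next HMM state.
        nxt = list(transitions[state])
        state = rng.choices(nxt, weights=[transitions[state][s] for s in nxt])[0]
        if state == "SEMANTIC":
            # Long-range (semantic) structure: word comes from a document topic.
            t = rng.choices(range(len(theta)), weights=theta)[0]
            dist = phi[t]
        else:
            # Syntactic state: word comes from the state's own distribution.
            dist = class_words[state]
        words = list(dist)
        out.append(rng.choices(words, weights=[dist[w] for w in words])[0])
    return out


# Hypothetical HMM over three classes.
transitions = {
    "START": {"DET": 1.0},
    "DET": {"SEMANTIC": 1.0},
    "SEMANTIC": {"VERB": 0.5, "DET": 0.5},
    "VERB": {"DET": 1.0},
}
class_words = {"DET": {"the": 0.7, "a": 0.3}, "VERB": {"is": 0.5, "shows": 0.5}}
theta = [0.8, 0.2]  # document topic weights
phi = [{"network": 0.5, "neural": 0.5}, {"kernel": 0.5, "svm": 0.5}]

sentence = generate_sentence(transitions, class_words, theta, phi, 6, random.Random(3))
print(" ".join(sentence))
```

Output from a sketch like this is locally grammatical but globally meaningless, much like the random sentences shown two slides later.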
50NIPS Semantics
IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS PIXEL VISUAL
KERNEL SUPPORT VECTOR SVM KERNELS SPACE FUNCTION MACHINES SET
NETWORK NEURAL NETWORKS OUTPUT INPUT TRAINING INPUTS WEIGHTS OUTPUTS
EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
MEMBRANE SYNAPTIC CELL CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL
NIPS Syntax
IN WITH FOR ON FROM AT USING INTO OVER WITHIN
I X T N - C F P
IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN
51Random sentence generation
LANGUAGE
S RESEARCHERS GIVE THE SPEECH
S THE SOUND FEEL NO LISTENERS
S WHICH WAS TO BE MEANING
S HER VOCABULARIES STOPPED WORDS
S HE EXPRESSLY WANTED THAT BETTER VOWEL
52Software
- Public-domain MATLAB toolbox for topic modeling on the Web
  - http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm