Predictive Profiling from Massive Transactional Data Sets - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Statistical Modeling of Large Text Collections
Padhraic Smyth
Department of Computer Science
University of California, Irvine
MURI Project Kick-off Meeting, November 18th 2008
2
The Text Revolution
  • Widespread availability of text in digital form
    is driving many new applications based on
    automated text analysis
  • Categorization/classification
  • Automated summarization
  • Machine translation
  • Information extraction
  • And so on.

3
The Text Revolution
  • Widespread availability of text in digital form
    is driving many new applications based on
    automated text analysis
  • Categorization/classification
  • Automated summarization
  • Machine translation
  • Information extraction
  • And so on.
  • Most of this work is happening in computing, but
    many of the underlying techniques are statistical

4
Motivation
Pennsylvania Gazette: 80,000 articles, 1728-1800
MEDLINE: 16 million articles
New York Times: 1.5 million articles
5
Problems of Interest
  • What topics do these documents span?
  • Which documents are about a particular topic?
  • How have topics changed over time?
  • What does author X write about?
  • and so on..

6
Problems of Interest
  • What topics do these documents span?
  • Which documents are about a particular topic?
  • How have topics changed over time?
  • What does author X write about?
  • and so on..
  • Key Ideas
  • Learn a probabilistic model over words and docs
  • Treat query-answering as computation of
    appropriate conditional probabilities

7
Topic Models for Documents
  • P( word | document ) = Σ_topics P( word | topic ) P( topic | document )

Each topic is a probability distribution over words; the mixing coefficients P(topic | document) are specific to each document; both are learned automatically from the text corpus.
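A worked toy illustration of this decomposition follows; the two topics, the four-word vocabulary, and all probabilities are invented for the example, not taken from the talk.

```python
# Worked example of P(word | document) = sum over topics of
# P(word | topic) * P(topic | document). All values are illustrative.
import numpy as np

vocab = ["stock", "market", "game", "team"]
# phi: one probability distribution over words per topic (each row sums to 1)
phi = np.array([[0.50, 0.40, 0.05, 0.05],   # a "finance"-like topic
                [0.05, 0.05, 0.50, 0.40]])  # a "sports"-like topic
# theta_d: this document's mixing coefficients over the two topics
theta_d = np.array([0.8, 0.2])

p_word_given_doc = theta_d @ phi            # marginalize over topics
for w, p in zip(vocab, p_word_given_doc):
    print(f"P({w} | doc) = {p:.2f}")
# P(stock | doc) = 0.41, P(market | doc) = 0.33,
# P(game | doc) = 0.14, P(team | doc) = 0.12
```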
8
Topics = Multinomials over Words
9
Topics = Multinomials over Words
10
Basic Concepts
  • Topics = distributions over words
  • Unknown a priori, learned from data
  • Documents represented as mixtures of topics
  • Learning algorithm
  • Gibbs sampling (stochastic search)
  • Linear time per iteration
  • Provides a full probabilistic model over words,
    documents, and topics
  • Query answering = computation of appropriate
    conditional probabilities (see the sketch below)
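A minimal sketch of that query answering, assuming a document-topic matrix theta has already been learned; the matrix below is a made-up stand-in for one estimated from data.

```python
# "Which documents are about topic k?" answered by ranking documents on
# P(topic k | document d), i.e. column k of the learned theta matrix.
import numpy as np

theta = np.array([[0.70, 0.20, 0.10],   # doc 0: mostly topic 0
                  [0.05, 0.15, 0.80],   # doc 1: mostly topic 2
                  [0.30, 0.60, 0.10]])  # doc 2: mostly topic 1

def docs_about_topic(theta, k, top_n=2):
    """Indices of the documents with the highest P(topic k | doc)."""
    return np.argsort(-theta[:, k])[:top_n]

print(docs_about_topic(theta, k=2))      # [1 0] for this toy matrix
```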

11
Enron email data
250,000 emails, 28,000 individuals, 1999-2002
12
Enron email business topics
13
Enron non-work topics
14
Enron public-interest topics...
15
Examples of Topics from New York Times
Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
16
Topic trends from New York Times
330,000 articles, 2000-2002
Tour-de-France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE
Quarterly Earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING
Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING
17
What does an author write about?
  • Author: Jerry Friedman, Stanford

18
What does an author write about?
  • Author: Jerry Friedman, Stanford
  • Topic 1: regression, estimate, variance, data,
    series, ...
  • Topic 2: classification, training, accuracy,
    decision, data, ...
  • Topic 3: distance, metric, similarity, measure,
    nearest, ...

19
What does an author write about?
  • Author: Jerry Friedman, Stanford
  • Topic 1: regression, estimate, variance, data,
    series, ...
  • Topic 2: classification, training, accuracy,
    decision, data, ...
  • Topic 3: distance, metric, similarity, measure,
    nearest, ...
  • Author: Rakesh Agrawal, IBM

20
What does an author write about?
  • Author: Jerry Friedman, Stanford
  • Topic 1: regression, estimate, variance, data,
    series, ...
  • Topic 2: classification, training, accuracy,
    decision, data, ...
  • Topic 3: distance, metric, similarity, measure,
    nearest, ...
  • Author: Rakesh Agrawal, IBM
  • Topic 1: index, data, update, join,
    efficient, ...
  • Topic 2: query, database, relational,
    optimization, answer, ...
  • Topic 3: data, mining, association, discovery,
    attributes, ...
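One simple way to answer "what does author X write about?" from a fitted topic model is to average the topic proportions of the author's documents. This is only an illustrative shortcut; the author-topic model cited later in the slides handles authorship inside the generative model. All values and the author index below are invented.

```python
# Rough author profile: mean of the document-topic distributions of the
# documents an author wrote. Hypothetical theta and author index.
import numpy as np

theta = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.1, 0.8]])
docs_by_author = {"friedman": [0, 1], "agrawal": [2]}   # hypothetical index

profile = theta[docs_by_author["friedman"]].mean(axis=0)
print(profile)   # [0.65 0.25 0.1 ]: the author's dominant topics
```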

21
Examples of Data Sets Modeled
  • 1,200 Bible chapters (KJV)
  • 4,000 Blog entries
  • 20,000 PNAS abstracts
  • 80,000 Pennsylvania Gazette articles
  • 250,000 Enron emails
  • 300,000 North Carolina vehicle accident police
    reports
  • 500,000 New York Times articles
  • 650,000 CiteSeer abstracts
  • 8 million MEDLINE abstracts
  • Books by Austen, Dickens, and Melville
  • ...
  • Exactly the same algorithm was used in all cases,
    and in all cases interpretable topics were
    produced automatically (a usage sketch follows)
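To illustrate how little changes from corpus to corpus, here is a hedged usage sketch with an off-the-shelf implementation. Note that scikit-learn's LatentDirichletAllocation fits the model with variational Bayes rather than the collapsed Gibbs sampler described in these slides, and the four toy documents are placeholders.

```python
# Fit a topic model to raw text documents and print the top words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stocks fell as the market closed lower",
        "the team won the game in the final minute",
        "investors sold shares after the earnings report",
        "the coach praised the players after the match"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                      # word-count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):             # pseudo-counts per word
    top_words = vocab[topic.argsort()[::-1][:5]]
    print(f"topic {k}:", ", ".join(top_words))
```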

22
Related Work
  • Statistical origins
  • Latent class models in statistics (late 60s)
  • Admixture models in genetics
  • LDA model: Blei, Ng, and Jordan (2003)
  • Variational EM
  • Topic model: Griffiths and Steyvers (2004)
  • Collapsed Gibbs sampler
  • Alternative approaches
  • Latent semantic indexing (LSI/LSA)
  • less interpretable, not appropriate for count
    data
  • Document clustering
  • simpler but less powerful

23
Clusters v. Topics
Hidden Markov Models in Molecular Biology: New Algorithms and Applications. Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure. Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.
24
Clusters v. Topics
One Cluster
Hidden Markov Models in Molecular Biology: New Algorithms and Applications. Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure. Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.
Cluster 88: model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov
25
Clusters v. Topics
Multiple Topics
One Cluster
Hidden Markov Models in Molecular Biology: New Algorithms and Applications. Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure. Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.
Cluster 88: model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov
Topic 10: state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling
Topic 37: genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes
26
Extensions
  • Author-topic models
  • Authors = mixtures over topics
  • (Steyvers, Smyth, Rosen-Zvi, Griffiths, 2004)
  • Special-words model
  • Documents = mixtures of topics + idiosyncratic words
  • (Chemudugunta, Smyth, Steyvers, 2006)
  • Entity-topic models
  • Topic models that can reason about entities
  • (Newman, Chemudugunta, Smyth, Steyvers, 2006)
  • See also work by McCallum, Blei, Buntine,
    Welling, Fienberg, Xing, etc.
  • Probabilistic basis allows for a wide range of
    generalizations

27
Combining Models for Networks and Text
28
Combining Models for Networks and Text
29
Combining Models for Networks and Text
30
Combining Models for Networks and Text
31
Technical Approach and Challenges
  • Develop flexible probabilistic network models
    that can incorporate textual information
  • e.g., ERGMs with text as node or edge covariates
  • e.g., latent space models with text-based
    covariates
  • e.g., dynamic relational models with text as edge
    covariates
  • Research challenges
  • Computational scalability
  • ERGMs not directly applicable to large text data
    sets
  • What text representation to use?
  • High-dimensional bag of words?
  • Low-dimensional latent topics? (see the sketch below)
  • Utility of text
  • Does the incorporation of textual information
    produce more accurate models or predictions? How
    can this be quantified?
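One illustrative way to realize the "low-dimensional latent topics" option: each node (e.g. an author or email account) carries its topic-proportion vector as a covariate, and an edge covariate can be the similarity of the two endpoints' vectors. The node names and values below are invented, and the ERGM or latent-space model that would consume these covariates is not shown.

```python
# Text-derived covariates for a network model: node covariates are topic
# proportions; an edge covariate is cosine similarity of the endpoints' vectors.
import numpy as np

node_topics = {                               # node -> topic proportions
    "alice": np.array([0.7, 0.2, 0.1]),
    "bob":   np.array([0.6, 0.3, 0.1]),
    "carol": np.array([0.1, 0.1, 0.8]),
}

def edge_text_covariate(u, v):
    """Cosine similarity of the two nodes' topic vectors."""
    a, b = node_topics[u], node_topics[v]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(edge_text_covariate("alice", "bob"))    # high: similar topic profiles
print(edge_text_covariate("alice", "carol"))  # low: dissimilar profiles
```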

32
Graphical Model
[Figure: graphical model with a group variable z generating observed words Word 1, Word 2, ..., Word n]
33
Graphical Model
[Figure: plate notation. Group variable z generates word w; plate over the n words in a document]
34
Graphical Model
[Figure: the same model with an outer plate over D documents. Group variable z generates word w; plates over n words and D documents]
35
Mixture Model for Documents
[Figure: mixture model in plate notation. Group probabilities α, group variable z, group-word distributions φ, observed word w; plates over n words and D documents]
36
Clustering with a Mixture Model
[Figure: the same mixture model read as clustering. Cluster probabilities α, cluster variable z, cluster-word distributions φ, observed word w; plates over n words and D documents]
37
Graphical Model for Topics
[Figure: topic model in plate notation. Document-topic distributions θ, topic assignment z, topic-word distributions φ, observed word w; plates over n words and D documents]
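A minimal sketch of the generative process behind this graphical model, under standard LDA assumptions (symmetric Dirichlet priors); the topic count, vocabulary size, and document length below are arbitrary choices for the example.

```python
# Generative process matching the plate diagram:
#   theta_d ~ Dirichlet(alpha)        document-topic distribution (theta above)
#   phi_t   ~ Dirichlet(beta)         topic-word distribution (phi above)
#   z_i     ~ Multinomial(theta_d)    topic for word position i
#   w_i     ~ Multinomial(phi[z_i])   observed word
import numpy as np

rng = np.random.default_rng(0)
T, V, D, N = 3, 20, 5, 50                     # topics, vocab size, docs, words/doc
alpha, beta = 0.5, 0.1

phi = rng.dirichlet(np.full(V, beta), size=T)          # T topic-word distributions
docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(T, alpha))         # this document's topic mix
    z = rng.choice(T, size=N, p=theta_d)               # topic for each position
    words = [int(rng.choice(V, p=phi[t])) for t in z]  # word drawn from its topic
    docs.append(words)
print(docs[0][:10])                                     # first ten word ids of doc 0
```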
38
Learning via Gibbs sampling
[Figure: the same topic model, with document-topic distributions θ and topic-word distributions φ. A Gibbs sampler estimates z for each word occurrence, marginalizing over the other parameters]
39
More Details on Learning
  • Gibbs sampling for word-topic assignments (z)
    (see the sampler sketch below)
  • 1 iteration = a full pass through all words in
    all documents
  • Typically run a few hundred Gibbs iterations
  • Estimating θ and φ
  • Use z samples to get point estimates
  • Non-informative Dirichlet priors for θ and φ
  • Computational efficiency
  • Learning is linear in the number of word tokens
  • Can still take on the order of a day for 100k or
    more documents
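A minimal collapsed Gibbs sampler corresponding to these bullets: one iteration is a full pass over every word token, and θ and φ are recovered as point estimates from the count tables at the end. The corpus format, hyperparameter values, and iteration count are illustrative assumptions rather than the presenter's exact implementation.

```python
# Collapsed Gibbs sampling for the topic model: resample the topic of each
# word token conditioned on all other assignments, with theta and phi
# integrated out; then form smoothed point estimates from the counts.
import numpy as np

def gibbs_lda(docs, vocab_size, num_topics, alpha=0.1, beta=0.01,
              iterations=200, seed=0):
    rng = np.random.default_rng(seed)
    T, V = num_topics, vocab_size
    ndk = np.zeros((len(docs), T))       # document-topic counts
    nkw = np.zeros((T, V))               # topic-word counts
    nk = np.zeros(T)                     # total tokens assigned to each topic
    z = []                               # topic assignment for every token

    # Random initialization of the word-topic assignments
    for d, doc in enumerate(docs):
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    for _ in range(iterations):          # each iteration = one pass over all tokens
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]              # remove this token's current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # p(z = t | all other assignments), theta and phi integrated out
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t              # record and add back the new assignment
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    # Point estimates with Dirichlet smoothing
    phi = (nkw + beta) / (nk[:, None] + V * beta)
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + T * alpha)
    return phi, theta

# Example: two tiny documents over a 4-word vocabulary
phi, theta = gibbs_lda([[0, 1, 0, 2], [2, 3, 3, 1]], vocab_size=4, num_topics=2)
```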

40
Gibbs Sampler Stability