Title: 1 Introduction
1 Introduction
- ICA has been proposed as a useful technique for finding meaningful directions in multivariate data
- The objective function affects the form of the potential structure discovered
- Here, the problem is the partitioning and analysis of sparse multivariate data
- Prior knowledge is used to derive a computationally inexpensive ICA
2 Introduction, continued
- Two complementary architectures
- Skewness (asymmetry) is the right objective to optimize
- The two tasks will be unified in a single algorithm
- Result: fast convergence; computational cost is linear in the number of training points
[Diagram: the two architectures. One separates observed documents into document prototypes; the other separates observed words into topic-features.]
3 Data Representation
- Vector space representation: a document is a vector (t1, t2, ..., tT)^T
- T = number of words in the dictionary (tens of thousands)
- Elements are binary indicators or frequencies; a sparse representation
- D = term × document matrix (T × N, with N the number of documents); a construction sketch follows below
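As a concrete illustration of this representation (a minimal sketch, not code from the paper; the toy corpus and the use of scikit-learn's CountVectorizer are my own choices):

    # Sketch: build a sparse T x N term-document matrix D (rows = terms, columns = documents).
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the clipper chip escrow encryption debate",   # placeholder documents; a real run
            "government policy on encryption keys",        # would use the newsgroup corpus
            "nasa shuttle launch to orbit",
            "the space station mission",
            "patient treatment and doctor advice",
            "church sermon on the bible"]

    vectorizer = CountVectorizer()                    # frequencies; binary=True gives indicator elements
    D = vectorizer.fit_transform(docs).T              # scipy sparse matrix of shape (T, N)
    print(D.shape, "with", D.nnz, "nonzero entries")  # only a few nonzeros per column: sparse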
4 Preprocessing
- Assumption: the observations are a noisy expansion of some denser group of latent topics
- The number of clusters or topics is set a priori
- The K-dimensional LSA space is used as the topic-concepts subspace
- PCA may lose important data components (with sparsity, infrequent but meaningful correlations), though this is less of a concern here
- Reconstruction: D ≈ D_K = U E V^T (a sketch of this step follows)
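A minimal sketch of this LSA step, using scipy's sparse svds (a Lanczos-type routine) on the matrix D from the previous sketch; K = 4 is just an example value:

    # Sketch: rank-K LSA decomposition of the sparse T x N matrix D, so that D ~ D_K = U E V^T.
    import numpy as np
    from scipy.sparse.linalg import svds

    K = 4                                     # number of topics/clusters, fixed a priori
    U, e, Vt = svds(D.asfptype(), k=K)        # U: T x K, e: K singular values, Vt: K x N
    order = np.argsort(e)[::-1]               # svds returns singular values in ascending order
    U, e, Vt = U[:, order], e[order], Vt[order, :]
    E, E_inv = np.diag(e), np.diag(1.0 / e)

    D_K = U @ E @ Vt                          # rank-K reconstruction D_K = U E V^T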
5 Prototype Documents from a Corpus
- Assumption: documents are a noisy linear mixture of (independent) document prototypes
- Number of prototypes = number of topics; the prototypes reside in the LSA space (K dimensions)
- Data projection onto the right eigenvectors with variance normalization: X(1) = E^-1 V^T D^T = U^T (a K × T matrix; checked in the sketch below)
- Task: find the mixing matrix W(1) and source documents S(1) such that X(1) = W(1)^T S(1) (S(1) is a K × T matrix)
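Because D^T = V E U^T, the normalized projection above collapses to the left singular vectors; a short numerical check (assuming D, U, Vt, E_inv from the sketches above):

    # Sketch: project the transposed data onto the right eigenvectors and normalize the variance.
    # X(1) = E^-1 V^T D^T reduces to U^T, a K x T matrix.
    import numpy as np

    X1 = (D @ Vt.T @ E_inv).T                 # equals E^-1 V^T D^T without densifying D
    print(np.max(np.abs(X1 - U.T)))           # ~0: the projection is just U^T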
6 Prototype Documents from a Corpus, continued
- The basis vectors of the topic space are assumed to be different; to separate the prototypes, find independent components
- Words in documents are distributed in a positively skewed way, so the search is restricted to skewed (asymmetric) distributions
- After LSA the unmixing matrix must be orthogonal (W(1)^-1 = W(1)^T)
- Separating weights on the original data: W(1) E^-1 V^T
7 Prototype Documents from a Corpus, continued
- Objective: a skewness measure (Fisher skewness)
- Prior knowledge: the component means are small and the projection variance is restricted to unity, so the Fisher skewness E[(s - mean)^3] / sigma^3 simplifies to G(s) = E[s^3] (the third-order moment)
- Prevent degenerate solutions: restrict w^T w = 1 at the stationary points
- Solve with gradient methods or iteratively (a gradient sketch follows)
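To make the simplified objective concrete, here is a plain projected-gradient sketch for a single direction w (an illustration under the stated assumptions, not the paper's exact update; the step size and iteration count are arbitrary):

    # Sketch: maximize G(w) = E[(w^T x)^3] over unit-norm w by gradient ascent with renormalization.
    import numpy as np

    def skewness_direction(X, n_iter=100, step=1.0, seed=0):
        """X is K x M variance-normalized data (e.g. X1 above); returns one projection direction."""
        K, M = X.shape
        w = np.random.default_rng(seed).standard_normal(K)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            s = w @ X                               # projections s = w^T x
            w = w + step * 3.0 * (X @ s**2) / M     # gradient of the third moment E[s^3]
            w /= np.linalg.norm(w)                  # enforce w^T w = 1 (prevents degenerate solutions)
        return w

    # w1 = skewness_direction(X1)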
8 Prototype Documents from a Corpus, continued
- The sources are positive, so the skewness is positive (the output sign is relevant!)
- K orthonormal projection directions: a matrix iteration (sketched below)
- Similar to an approximate Newton-Raphson optimization (FastICA-type derivation with a small additional term)
- Computational complexity: O(2K^2 T + KT + 4K^3)
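A sketch of the matrix iteration in the spirit of a FastICA-style fixed point for the third-moment contrast, with symmetric orthonormalization of the K directions and a sign convention that keeps the estimated sources positively skewed (the 'small additional term' of the exact derivation is omitted; this is an assumed reimplementation, not the published rule):

    # Sketch: estimate all K orthonormal skewness-maximizing directions with a matrix iteration.
    import numpy as np

    def skew_ica(X, n_iter=20, seed=0):
        """X is K x M variance-normalized data; returns an orthonormal K x K unmixing matrix W."""
        K, M = X.shape
        W = np.random.default_rng(seed).standard_normal((K, K))
        for _ in range(n_iter):
            S = W @ X                                      # current source estimates, K x M
            W = (S**2 @ X.T) / M                           # fixed-point-style update, row k = E[s_k^2 x^T]
            vals, vecs = np.linalg.eigh(W @ W.T)           # symmetric orthonormalization:
            W = vecs @ np.diag(vals**-0.5) @ vecs.T @ W    # W <- (W W^T)^(-1/2) W
            W *= np.sign(np.mean((W @ X)**3, axis=1, keepdims=True))  # keep sources positively skewed
        return W

    # W1 = skew_ica(X1); S1 = W1 @ X1                      # S1: the K x T document prototypes

The dominant costs per step are the K × M products (matching the stated 2K^2 T term) and the K × K eigendecomposition (the K^3 term).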
9 Topic Features from Word Features
- Assumption: terms are a noisy linear expansion of (independent) concepts (topics)
- Data compression: X(2) = E^-1 U^T D = V^T (a K × N matrix; checked in the sketch below)
- Task: find the unmixing matrix W(2) and topic features S(2) such that X(2) = W(2)^T S(2) (S(2) is a K × N matrix)
- This time, use a clustering criterion
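By the same argument as for X(1), the compressed term-side data reduces to the right singular vectors; a short check (again assuming D, U, E_inv, Vt from the earlier sketches):

    # Sketch: compress D along the K principal directions; X(2) = E^-1 U^T D equals V^T (K x N).
    import numpy as np

    X2 = (D.T @ U @ E_inv).T                  # equals E^-1 U^T D without densifying D
    print(np.max(np.abs(X2 - Vt)))            # ~0: the compression is just V^T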
10 Topic Features from Word Features, continued
Separating weights on the original data D: W(2) E^-1 U^T
- Objective function (z_kn indicates the class of x_n)
- Stochastic minimization: an EM-type algorithm
11 Topic Features from Word Features, continued
- Comparison: the approach is a set of binary classifiers; the algorithm maximizes a skewed, monotonically increasing function of the topic s_k, so a skewed prior is appropriate
- Variance is normalized after LSA; independent topics mean the source components are aligned to orthonormal axes
- Similar to the previous architecture
12 Combining the Tasks
- Joint optimization problem
- Information from the linear outputs and from the weights is complementary:
  - Topic clustering: weight peaks give representative words; projections give clustering information
  - Document prototype search: weight peaks give clustering information; projections give index terms
- Review the separating weights on D: W(2)^T E^-1 U^T
13 Combining the Tasks, continued
- Whitening allows inspection but isn't practical: instead, normalize the variance along the K principal directions! D' = U E^-1 U^T D
- Find the new unmixing matrix W(2') to maximize G(W(2')^T U^T D') = ... = G(W(2')^T X(2)), hence W(2') = W(2)
- Solve the relation W(2)^T U^T S(1) = W(1)^T U^T S(1)
- Rewrite the objective: concatenate the data [U^T, V^T] (sketched below)
- Result: W(1) = W(2) = W
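Since the same orthonormal unmixing matrix serves both tasks, the two projected data sets can simply be concatenated and a single W estimated; a sketch using the illustrative skew_ica routine from above:

    # Sketch: joint ('spatio-temporal'-style) problem on the concatenated K x (T + N) data.
    import numpy as np

    T, N = D.shape
    X = np.hstack([U.T, Vt])                  # X = [U^T, V^T], shape K x (T + N)
    W = skew_ica(X)                           # one unmixing matrix: W(1) = W(2) = W
    S = W @ X
    S1, S2 = S[:, :T], S[:, T:]               # document prototypes (K x T), topic features (K x N)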
14 Combining the Tasks, continued
- Resultant algorithm: O(2K^2(T+N) + K(T+N) + 4K^3)
  Inputs: D, K
  1. Decompose D with the Lanczos algorithm. Retain the K first singular values. Obtain U, E, V.
  2. Let X = [U^T, V^T]
  3. Iterate until convergence
  Outputs: S ∈ R^(K × (T+N)), W ∈ R^(K × K)
- S holds the T document prototypes and the N topic-features; W holds structure information of the identified topics in the corpus (a code sketch follows)
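Putting the listed steps together as one function (an end-to-end sketch under the same assumptions: svds stands in for the Lanczos decomposition, and the illustrative skew_ica above stands in for the convergence loop of step 3):

    # Sketch of the overall algorithm.
    # Inputs:  D (sparse T x N term-document matrix), K (number of topics).
    # Outputs: S in R^(K x (T+N)) (T document prototypes and N topic-features), W in R^(K x K).
    import numpy as np
    from scipy.sparse.linalg import svds

    def ica_topics(D, K):
        U, e, Vt = svds(D.asfptype(), k=K)    # 1. Lanczos-type decomposition, K largest singular values
        order = np.argsort(e)[::-1]
        U, Vt = U[:, order], Vt[order, :]
        X = np.hstack([U.T, Vt])              # 2. X = [U^T, V^T]
        W = skew_ica(X)                       # 3. iterate until convergence (illustrative stand-in)
        return W @ X, W                       # S and W

    # S, W = ica_topics(D, K=4)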
15 Simulations
Simulation 1: Newsgroup data ('sci.crypt', 'sci.med', 'sci.space', 'soc.religion.christian')
[Table: for each of the four topics, the 10 most representative words selected by the algorithm alongside the 10 most frequent words; the algorithm's selection is conformal with the human labeling. The table contains stemmed entries such as kei, encrypt, chip, clipper, escrow, secur, govern, medic, patient, doctor, diseas, physician, space, nasa, orbit, launch, shuttl, mission, god, christian, church, bibl, christ.]
Simulation 2: the 10 most representative words per topic, using 5 topics (I-V) and 2 document classes ('sci.space', 'soc.religion.christian')
[Table: the 10 most representative words for each of the five topics I-V, with stemmed entries such as space, shuttl, nasa, jpl, moon, launch, orbit, venu, station, god, christian, church, jesu, bibl, christ, faith, sex, sexual, homosexu, fornic.]
16 Conclusions
[Diagram: dependency structure of the splitting in simulation 2. sci.space splits into 'space shuttle design' (IV) and 'space shuttle mission' (III); soc.religion.christian splits into 'christian church' (I), 'christian religion' (II), and 'christian morality' (V).]
- Clustering and keyword identification by an ICA variant that maximizes skewness
- Key assumption: an asymmetrical latent prior
- The joint problem (D and D^T) is solved as a 'spatio-temporal' ICA
- The algorithm is linear in the number of documents, O(K^2 N)
- Fast convergence (3 - 8 steps)
- The potential number of topics can be greater than indicated by a human labeler: subtopics are discovered
- Hierarchical partitioning is possible (recursive binary splits)
17 Further Work
[Plot: markers x = 'sci.crypt', o = 'sci.space', plus two further markers for 'sci.med' and 'soc.religion.christian'; panels or axes labelled 1, 2, 3.]
- Study links with other methods to improve flexibility
- Or develop a mechanism to allow a more structured representation, in a mixed or hierarchical manner
- For example, build model estimation into the algorithm
- Relax the equal w_k norm assumption