Semi-supervised Learning of Compact Document Representations with Deep Networks
- Marc'Aurelio Ranzato (Courant Institute of Mathematical Sciences)
- Martin Szummer (Microsoft Research Cambridge)
Learning compact representations of documents
- Goals
- compact representation → efficient computation and storage
- capture document topics while handling synonymous and polysemous words (unlike traditional vector-space models such as tf-idf; see the sketch after this list)
- semi-supervised learning: preserve given class information while learning from large unlabelled collections
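The point about synonymy can be made concrete with a small illustrative sketch (not from the poster): under a bag-of-words representation such as tf-idf, two documents that describe the same topic with different words share no terms and therefore appear completely unrelated.

    # Illustrative sketch: tf-idf treats synonyms as unrelated dimensions, so
    # two documents about the same topic can have zero cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["car engine needs repair",
            "automobile motor requires fixing"]   # same topic, no shared terms
    vectors = TfidfVectorizer().fit_transform(docs)
    print(cosine_similarity(vectors[0], vectors[1]))  # [[0.]]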
- Applications
- document classification / clustering
- information retrieval
- Semi-supervised vs Unsupervised
- 20 Newsgroups dataset with 2000 words in the dictionary
- Architecture: 2000-200-100-20
- Classifier: Gaussian-kernel SVM (see the sketch after this list)
- Vary the number of labeled samples per class
- Deep networks learn nonlinear representations that capture high-order correlations between words
- Simple layer-wise training [1], fast feed-forward inference
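A minimal sketch of this evaluation protocol, assuming a trained encoder exposed as a hypothetical encode function that maps documents to their compact codes; the Gaussian-kernel SVM and the restriction to a few labeled samples per class follow the setup listed above.

    # Sketch: train a Gaussian-kernel SVM on the learned codes of a limited
    # number of labeled samples per class (`encode` is a hypothetical helper).
    import numpy as np
    from sklearn.svm import SVC

    def svm_accuracy(encode, X_tr, y_tr, X_te, y_te, per_class):
        # keep only `per_class` labeled examples from each class
        idx = np.concatenate([np.where(y_tr == c)[0][:per_class]
                              for c in np.unique(y_tr)])
        svm = SVC(kernel="rbf")                  # Gaussian-kernel SVM
        svm.fit(encode(X_tr[idx]), y_tr[idx])    # train on compact codes
        return svm.score(encode(X_te), y_te)     # classification accuracy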
Architecture of the model
Semi-supervised autoencoders stacked to form a
deep network (example with 3 layers). The system
is trained layer-by-layer. A layer is trained by
coupling the encoder with a decoder
(reconstruction) and a classifier
(classification).
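A minimal PyTorch sketch of this layer structure (sizes, nonlinearity and names are illustrative, not the authors' exact parameterization): each stage couples an encoder with a reconstruction decoder and a linear classifier, and the codes of a trained stage become the inputs of the next.

    # Illustrative sketch of one stage; per the poster, the first stage decodes
    # with a Poisson regressor and the upper stages with a Gaussian regressor.
    import torch.nn as nn

    class SemiSupStage(nn.Module):
        def __init__(self, n_in, n_code, n_classes):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_in, n_code), nn.Tanh())
            self.decoder = nn.Linear(n_code, n_in)          # reconstruction head
            self.classifier = nn.Linear(n_code, n_classes)  # classification head

        def forward(self, x):
            code = self.encoder(x)
            return code, self.decoder(code), self.classifier(code)

    # e.g. the 2000-200-100-20 network is three stages trained greedily,
    # each on the codes produced by the previous one:
    stages = [SemiSupStage(2000, 200, 20),
              SemiSupStage(200, 100, 20),
              SemiSupStage(100, 20, 20)]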
- The model is able to exploit even very few labeled samples
- The top-level, very compact representation with 20 units achieves accuracy similar to the first-layer representation with 200 units
- The unsupervised representation is more regularized than tf-idf, but it has lost some information
- The top-level classifier performs as well as the SVM classifier
- Exploit labeled data and leverage a corpus of unlabeled data
- The training objective takes into account both an unsupervised loss (reconstruction of the input) and a supervised loss (classification error) [2]
- Training Algorithm
- train the first stage by attaching a Poisson
regressor and a classifier to the encoder.
Minimize the sum of reconstruction error
(negative log-likelihood of the data under the
Poisson model) and classification error
(cross-entropy). For unlabeled samples, employ
only the reconstruction objective.
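A sketch of this per-stage objective, reusing the hypothetical SemiSupStage module sketched above: the decoder output is read as the log of a Poisson rate over word counts, the classifier contributes a cross-entropy term on labeled rows only, and the weighting alpha is an assumed hyperparameter.

    # Sketch of the first-stage loss: Poisson reconstruction NLL plus
    # cross-entropy classification, applied only to the labeled samples.
    import torch
    import torch.nn.functional as F

    def stage_loss(stage, x_counts, y, labeled_mask, alpha=1.0):
        code, log_rate, logits = stage(x_counts)
        # reconstruction: word counts modeled as Poisson with rate exp(log_rate)
        rec = F.poisson_nll_loss(log_rate, x_counts, log_input=True)
        # classification: cross-entropy on the labeled subset only
        if labeled_mask.any():
            cls = F.cross_entropy(logits[labeled_mask], y[labeled_mask])
        else:
            cls = torch.zeros((), device=x_counts.device)
        return rec + alpha * cls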
- Deep vs linear (LSI, tf-idf)
- Reuters-21578 dataset with 12317 words and 91 topics
- Shallow: LSI (SVD on the tf-idf matrix)
- Deep: our model with the same number of units in the final layer (2 or 3 layers)
- Retrieval of the top 1, 3, 7, ..., 4095 documents using cosine similarity on the representation (see the sketch after this list)
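A sketch of the retrieval protocol and the shallow LSI baseline (variable names are illustrative): documents are ranked by cosine similarity of their representations, and the baseline reduces the tf-idf matrix with a truncated SVD to the same code size as the deep model.

    # Sketch: cosine-similarity retrieval on a document-representation matrix,
    # plus the shallow LSI baseline (truncated SVD of the tf-idf matrix).
    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    def precision_at_k(reps, labels, query, k):
        sims = cosine_similarity(reps[query:query + 1], reps)[0]
        sims[query] = -np.inf                    # exclude the query document
        topk = np.argsort(-sims)[:k]
        return np.mean(labels[topk] == labels[query])

    def lsi_codes(tfidf_matrix, n_components=20):
        # shallow baseline matched to the code size of the deep model's top layer
        return TruncatedSVD(n_components=n_components).fit_transform(tfidf_matrix)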
Figure: first-stage model (encoder coupled with a Poisson-regressor decoder).
- The deep model greatly outperforms the linear model when the representation is extremely compact
- The deep representation gives better precision and recall than the baseline tf-idf
Figure: upper-stage model (encoder coupled with a linear classifier and a Gaussian-regressor decoder).
- Deep vs Shallow
- Experiment as above, but compared with a shallow
1-layer semi-supervised autoencoder.
- The deep model greatly outperforms the shallow model, while the representation is extremely compact
- Schedule for the number of hidden units in the deep network: gradual decrease
- Deep vs DBN vs SESM
- Reuters-21578 dataset with 2000 words and 91 topics
- Deep architecture: 2000-200-100-20, with a 7-unit variant of the top layer
- Retrieval as above
- Other models: a DBN pre-trained with RBMs [3], and a deep network trained with SESMs [4] (binary, high-dimensional representations)
Example of neighboring word stems to a given word in the 7-dimensional feature space to which the Reuters documents are mapped after learning.
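One plausible way to reproduce such neighborhoods (an assumption, not the authors' stated procedure): encode a one-word count vector for every vocabulary stem with the trained encoder, hypothetically exposed as encode, and rank stems by cosine similarity in the resulting code space.

    # Hypothetical sketch: embed each vocabulary stem as the code of a one-hot
    # count vector, then list the nearest stems to a query in that space.
    import numpy as np

    def nearest_stems(encode, vocab, query, k=5):
        one_hot = np.eye(len(vocab), dtype=np.float32)  # one count vector per stem
        codes = encode(one_hot)                         # map stems into code space
        q = codes[vocab.index(query)]
        sims = codes @ q / (np.linalg.norm(codes, axis=1) * np.linalg.norm(q) + 1e-8)
        order = np.argsort(-sims)
        return [vocab[i] for i in order if vocab[i] != query][:k]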
Ohsumed dataset: architecture 30689-100-10-5-2.
- With just 7 units we achieve the same precision (at k = 1) as the 1000-bit binary representation from a sparse-encoding symmetric machine (SESM) with a 2000-1000-1000 architecture
- The compact representation yields better precision than the binary one, and is more efficient in terms of computational cost and memory usage
- The deep autoencoder achieves accuracy similar to the DBN
- Fine-tuning is crucial for a DBN pre-trained with RBMs, but not necessary in our model
- Our model can be trained more efficiently than a DBN pre-trained with RBMs using Contrastive Divergence [1]
- Varying words in the dictionary
- 20 Newsgroups with 1K, 2K, 5K and 10K words
- Just a single-layer (shallow) model with 200 units
- Retrieval task as above
- The more words that are used at the input, the better the performance the model achieves
[1] G. Hinton, S. Osindero, Y.W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[2] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle. Greedy layer-wise training of deep networks. NIPS 2006.
[3] G. Hinton, R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
[4] M. Ranzato, Y. Boureau, Y. LeCun. Sparse feature learning for deep belief networks. NIPS 2007.