Semi-supervised Learning of Compact Document Representations with Deep Networks
- Marc'Aurelio Ranzato (Courant Institute of Mathematical Sciences)
- Martin Szummer (Microsoft Research Cambridge)
Learning compact representations of documents
- Goals
- compact representation → efficient computation and storage
- capture document topics while handling synonymous and polysemous words (unlike traditional vector-space models such as tf-idf; see the sketch after this list)
- semi-supervised learning: preserve given class information while learning from large unlabelled collections
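The point about synonymy can be made concrete with a small illustrative sketch (not from the poster): under a bag-of-words representation such as tf-idf, two documents that describe the same topic with different words share no terms and therefore appear completely unrelated.

    # Illustrative sketch: tf-idf treats synonyms as unrelated dimensions, so
    # two documents about the same topic can have zero cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["car engine needs repair",
            "automobile motor requires fixing"]   # same topic, no shared terms
    vectors = TfidfVectorizer().fit_transform(docs)
    print(cosine_similarity(vectors[0], vectors[1]))  # [[0.]]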
- Applications
- document classification / clustering
- information retrieval
- Semi-supervised vs Unsupervised
- 20 Newsgroups dataset with 2000 words in the dictionary
- Architecture: 2000-200-100-20
- Classifier: Gaussian-kernel SVM (see the sketch after this list)
- Vary the number of labeled samples per class
- Deep networks learn nonlinear representations that capture high-order correlations between words
- Simple layer-wise training [1], fast feed-forward inference
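A minimal sketch of this evaluation protocol, assuming a trained encoder exposed as a hypothetical encode function that maps documents to their compact codes; the Gaussian-kernel SVM and the restriction to a few labeled samples per class follow the setup listed above.

    # Sketch: train a Gaussian-kernel SVM on the learned codes of a limited
    # number of labeled samples per class (`encode` is a hypothetical helper).
    import numpy as np
    from sklearn.svm import SVC

    def svm_accuracy(encode, X_tr, y_tr, X_te, y_te, per_class):
        # keep only `per_class` labeled examples from each class
        idx = np.concatenate([np.where(y_tr == c)[0][:per_class]
                              for c in np.unique(y_tr)])
        svm = SVC(kernel="rbf")                  # Gaussian-kernel SVM
        svm.fit(encode(X_tr[idx]), y_tr[idx])    # train on compact codes
        return svm.score(encode(X_te), y_te)     # classification accuracy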
Architecture of the model
Semi-supervised autoencoders stacked to form a
deep network (example with 3 layers). The system
is trained layer-by-layer. A layer is trained by
coupling the encoder with a decoder
(reconstruction) and a classifier
(classification).
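A minimal PyTorch sketch of this layer structure (sizes, nonlinearity and names are illustrative, not the authors' exact parameterization): each stage couples an encoder with a reconstruction decoder and a linear classifier, and the codes of a trained stage become the inputs of the next.

    # Illustrative sketch of one stage; per the poster, the first stage decodes
    # with a Poisson regressor and the upper stages with a Gaussian regressor.
    import torch.nn as nn

    class SemiSupStage(nn.Module):
        def __init__(self, n_in, n_code, n_classes):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_in, n_code), nn.Tanh())
            self.decoder = nn.Linear(n_code, n_in)          # reconstruction head
            self.classifier = nn.Linear(n_code, n_classes)  # classification head

        def forward(self, x):
            code = self.encoder(x)
            return code, self.decoder(code), self.classifier(code)

    # e.g. the 2000-200-100-20 network is three stages trained greedily,
    # each on the codes produced by the previous one:
    stages = [SemiSupStage(2000, 200, 20),
              SemiSupStage(200, 100, 20),
              SemiSupStage(100, 20, 20)]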
- The model is able to exploit even very few labeled samples
- The top-level, very compact representation with 20 units achieves accuracy similar to the first-layer representation with 200 units
- The unsupervised representation is more regularized than tf-idf, but it has lost some information
- The top-level classifier performs as well as the SVM classifier
- Exploit labeled data and leverage a corpus of unlabeled data
- The training objective takes into account both an unsupervised loss (reconstruction of the input) and a supervised loss (classification error) [2]
- Training Algorithm
- train the first stage by attaching a Poisson
regressor and a classifier to the encoder.
Minimize the sum of reconstruction error
(negative log-likelihood of the data under the
Poisson model) and classification error
(cross-entropy). For unlabeled samples, employ
only the reconstruction objective.
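A sketch of this per-stage objective, reusing the hypothetical SemiSupStage module sketched above: the decoder output is read as the log of a Poisson rate over word counts, the classifier contributes a cross-entropy term on labeled rows only, and the weighting alpha is an assumed hyperparameter.

    # Sketch of the first-stage loss: Poisson reconstruction NLL plus
    # cross-entropy classification, applied only to the labeled samples.
    import torch
    import torch.nn.functional as F

    def stage_loss(stage, x_counts, y, labeled_mask, alpha=1.0):
        code, log_rate, logits = stage(x_counts)
        # reconstruction: word counts modeled as Poisson with rate exp(log_rate)
        rec = F.poisson_nll_loss(log_rate, x_counts, log_input=True)
        # classification: cross-entropy on the labeled subset only
        if labeled_mask.any():
            cls = F.cross_entropy(logits[labeled_mask], y[labeled_mask])
        else:
            cls = torch.zeros((), device=x_counts.device)
        return rec + alpha * cls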
- Deep vs linear (LSI, tf-idf)
- Reuters-21578 dataset with 12317 words and 91 topics
- Shallow: LSI (SVD on the tf-idf matrix)
- Deep: our model with the same number of units in the final layer (2 or 3 layers)
- Retrieval of the top 1, 3, 7, ..., 4095 documents using cosine similarity on the representation (see the sketch after this list)
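A sketch of the retrieval protocol and the shallow LSI baseline (variable names are illustrative): documents are ranked by cosine similarity of their representations, and the baseline reduces the tf-idf matrix with a truncated SVD to the same code size as the deep model.

    # Sketch: cosine-similarity retrieval on a document-representation matrix,
    # plus the shallow LSI baseline (truncated SVD of the tf-idf matrix).
    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    def precision_at_k(reps, labels, query, k):
        sims = cosine_similarity(reps[query:query + 1], reps)[0]
        sims[query] = -np.inf                    # exclude the query document
        topk = np.argsort(-sims)[:k]
        return np.mean(labels[topk] == labels[query])

    def lsi_codes(tfidf_matrix, n_components=20):
        # shallow baseline matched to the code size of the deep model's top layer
        return TruncatedSVD(n_components=n_components).fit_transform(tfidf_matrix)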
Figure: first-stage model (encoder coupled with a Poisson-regressor decoder).
- The deep model greatly outperforms the linear model when the representation is extremely compact
- The deep representation gives better precision and recall than the baseline tf-idf
Figure: upper-stage model (encoder coupled with a linear classifier and a Gaussian-regressor decoder).
- Deep vs Shallow
- Experiment as above, but compared with a shallow
1-layer semi-supervised autoencoder.
- The deep model greatly outperforms the shallow model, while the representation is extremely compact
- Schedule for the number of hidden units in the deep network: gradual decrease
- Deep vs DBN vs SESM
- Reuters-21578 dataset with 2000 words and 91 topics
- Deep architecture: 2000-200-100-20, with a 7-unit variant of the top layer
- Retrieval as above
- Other models: a DBN pre-trained with RBMs [3], and a deep network trained with SESMs [4] (binary, high-dimensional representations)
Example of neighboring word stems to a given word in the 7-dimensional feature space to which the Reuters documents are mapped after learning.
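One plausible way to reproduce such neighborhoods (an assumption, not the authors' stated procedure): encode a one-word count vector for every vocabulary stem with the trained encoder, hypothetically exposed as encode, and rank stems by cosine similarity in the resulting code space.

    # Hypothetical sketch: embed each vocabulary stem as the code of a one-hot
    # count vector, then list the nearest stems to a query in that space.
    import numpy as np

    def nearest_stems(encode, vocab, query, k=5):
        one_hot = np.eye(len(vocab), dtype=np.float32)  # one count vector per stem
        codes = encode(one_hot)                         # map stems into code space
        q = codes[vocab.index(query)]
        sims = codes @ q / (np.linalg.norm(codes, axis=1) * np.linalg.norm(q) + 1e-8)
        order = np.argsort(-sims)
        return [vocab[i] for i in order if vocab[i] != query][:k]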
Ohsumed dataset: architecture 30689-100-10-5-2.
- With just 7 units we achieve the same precision (at k = 1) as the 1000-bit binary representation from a sparse-encoding symmetric machine (SESM) with a 2000-1000-1000 architecture
- The compact representation yields better precision than the binary one, and is more efficient in terms of computational cost and memory usage
- The deep autoencoder achieves accuracy similar to the DBN
- Fine-tuning is crucial for a DBN pre-trained with RBMs, but not necessary in our model
- Our model can be trained more efficiently than a DBN pre-trained with RBMs using Contrastive Divergence [1]
- Varying words in the dictionary
- 20 Newsgroups with 1K, 2K, 5K and 10K words
- Just a single-layer (shallow) model with 200 units
- Retrieval task as above
- The more words that are used at the input, the better the performance the model achieves
[1] G. Hinton, S. Osindero, Y.W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[2] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle. Greedy layer-wise training of deep networks. NIPS 2006.
[3] G. Hinton, R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
[4] M. Ranzato, Y. Boureau, Y. LeCun. Sparse feature learning for deep belief networks. NIPS 2007.