Title: An introduction to self-taught learning
1 An introduction to self-taught learning
Raina et al., 2007, Self-taught Learning: Transfer Learning from Unlabeled Data
- Presented by Zenglin Xu
- 10-09-2007
2 Outline
- Related learning paradigms
- A self-taught learning algorithm
3 Related learning paradigms
- Semi-supervised learning
- Transfer learning
- Multi-task learning
- Domain adaptation
- Biased sample selection
- Self-taught learning
4 Semi-supervised learning
- In addition to the labeled training data, a large set of unlabeled test data is available
- The training data and test data are drawn from the same distribution
- The unlabeled data can be assigned the class labels of the supervised learning task (a self-training sketch follows this slide)
- Reference
- Chapelle et al., 2006, Semi-Supervised Learning
- Zhu, 2005, Semi-Supervised Learning Literature Survey
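
A minimal self-training sketch of this setting, using scikit-learn's SelfTrainingClassifier; the toy data, threshold, and base classifier here are assumptions for illustration only:

# Semi-supervised sketch: unlabeled points are marked with label -1 and a
# base classifier is iteratively retrained on its own confident predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

y_semi = y.copy()
y_semi[50:] = -1  # pretend only the first 50 examples are labeled

clf = SelfTrainingClassifier(LogisticRegression(), threshold=0.8)
clf.fit(X, y_semi)
print(clf.predict(X[:5]))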
5 Transfer learning
- The theory of transfer of learning was introduced by Thorndike and Woodworth (1901). They explored how individuals transfer learning from one context to another context that shares similar characteristics
- Transfer of knowledge from one supervised task to another; it requires labeled data from a different but related task
- E.g., transferring knowledge from Newsgroups data to Reuters data
- Related work in computer science
- Thrun and Mitchell, 1995, Learning one more thing
- Ando and Zhang, 2005, A framework for learning predictive structures from multiple tasks and unlabeled data
6 Multi-task learning
- Learns a problem together with other related problems at the same time, using a shared representation
- This often leads to a better model for the main task, because it allows the learner to use the commonality among the tasks
- Multi-task learning is a kind of inductive transfer: by learning tasks in parallel with a shared representation, what is learned for each task can help the other tasks be learned better (a minimal sketch follows this slide)
- Reference
- Caruana, 1997, Multitask Learning
- Ben-David and Schuller, 2003, Exploiting task relatedness for multiple task learning
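
One concrete instantiation of a shared representation is joint feature selection across tasks, e.g. scikit-learn's MultiTaskLasso; the toy data below is an assumption for illustration:

# Multi-task sketch: MultiTaskLasso fits several regression tasks jointly,
# forcing them to select the same features -- a simple shared representation.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
W = np.zeros((20, 3))                    # 3 related tasks
W[:5, :] = rng.normal(size=(5, 3))       # all tasks depend on the same 5 features
Y = X @ W + 0.1 * rng.normal(size=(100, 3))

model = MultiTaskLasso(alpha=0.1).fit(X, Y)
print((model.coef_ != 0).sum(axis=1))    # nonzero features, identical across tasks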
7 Domain adaptation
- A term currently popular in natural language processing
- Indeed, it can be viewed as a form of transfer learning
- The supervised setting usually involves:
- A large pool of out-of-domain labeled data
- A small pool of in-domain labeled data (a feature-augmentation sketch follows this slide)
- Reference
- Daume III, 2007, Frustratingly Easy Domain Adaptation
- Daume III and Marcu, 2006, Domain Adaptation for Statistical Classifiers
- Ben-David et al., 2006, Analysis of Representations for Domain Adaptation
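
The "frustratingly easy" approach can be illustrated by feature augmentation: each input is copied into a shared block plus a domain-specific block, and any standard classifier is then trained on the augmented features. A rough sketch, with toy pools and shapes that are assumptions:

# Feature augmentation in the spirit of Daume III (2007):
# phi(x) = [x, x, 0] for out-of-domain points, [x, 0, x] for in-domain points.
import numpy as np

def augment(X, domain):
    zeros = np.zeros_like(X)
    return np.hstack([X, X, zeros]) if domain == "source" else np.hstack([X, zeros, X])

X_src = np.random.randn(500, 10)         # assumed large out-of-domain pool
y_src = np.random.randint(0, 2, 500)
X_tgt = np.random.randn(30, 10)          # assumed small in-domain pool
y_tgt = np.random.randint(0, 2, 30)

X_all = np.vstack([augment(X_src, "source"), augment(X_tgt, "target")])
y_all = np.concatenate([y_src, y_tgt])
# X_all / y_all can now be fed to an ordinary classifier, e.g. a linear SVM.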
8 Biased sample selection
- Also called covariate shift
- It deals with the case where the training data and test data are drawn from different distributions over the same domain
- The objective is to correct the bias (an importance-weighting sketch follows this slide)
- Reference
- Shimodaira, 2000, Improving predictive inference under covariate shift
- Zadrozny, 2004, Learning and evaluating classifiers under sample selection bias
- Bickel et al., 2007, Discriminative learning for differing training and test distributions
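
A common way to correct the bias is importance weighting: estimate how test-like each training input is and reweight the training loss by the density ratio. A rough sketch, loosely in the spirit of the discriminative approach of Bickel et al.; the logistic-regression ratio estimate and toy data are assumptions:

# Covariate-shift sketch: estimate w(x) ~ p_test(x) / p_train(x) with a
# domain classifier, then pass the weights to the task classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, size=(300, 5))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(loc=0.5, size=(300, 5))     # shifted inputs, same labeling rule

X_dom = np.vstack([X_train, X_test])            # domain classifier: 0 = train, 1 = test
y_dom = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
dom = LogisticRegression().fit(X_dom, y_dom)

p = dom.predict_proba(X_train)[:, 1]
weights = p / (1 - p)                           # density-ratio estimate

clf = LogisticRegression().fit(X_train, y_train, sample_weight=weights)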
9 Self-taught learning
- Self-taught learning
- Uses unlabeled data
- Does not require the unlabeled data to come from the same generative distribution as the labeled data
- The unlabeled data may have labels different from those of the supervised learning task
- Reference
- Raina et al., 2007, Self-taught Learning: Transfer Learning from Unlabeled Data
11 Outline
- Related learning paradigms
- A self-taught learning algorithm
- Algorithm
- Experiment
12 Sparse coding: a self-taught learning algorithm
- Learn a high-level feature representation using unlabeled data
- E.g., random unlabeled images usually contain basic visual patterns (such as edges) that are similar to those in the images to be classified (such as images of elephants)
- Apply the representation to the labeled data and use it for classification
13 Step 1: learning higher-level representations
Given unlabeled data $x_u^{(1)}, \dots, x_u^{(k)}$, optimize the following:

$$\min_{b,\,a} \;\; \sum_{i=1}^{k} \Bigl\| x_u^{(i)} - \sum_{j} a_j^{(i)} b_j \Bigr\|_2^2 + \beta \bigl\| a^{(i)} \bigr\|_1
\quad \text{subject to} \quad \| b_j \|_2 \le 1, \;\; \forall j$$

where $b_1, \dots, b_s$ are the basis vectors and $a^{(1)}, \dots, a^{(k)}$ are the activations.
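
A minimal sketch of this step, using scikit-learn's mini-batch dictionary learning as a stand-in for the paper's sparse-coding solver; the random patch data and all parameter values are assumptions:

# Step 1 sketch: learn basis vectors b_j from unlabeled patches by solving an
# L1-regularized reconstruction problem; dictionary learning plays the role of
# the sparse-coding optimization above.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

X_unlabeled = np.random.randn(10000, 196)   # assumed 14x14 unlabeled patches
dico = MiniBatchDictionaryLearning(
    n_components=512,                       # number of bases, matching the 512-basis figure
    alpha=1.0,                              # sparsity penalty (beta above)
    transform_algorithm="lasso_lars",
)
dico.fit(X_unlabeled)
bases = dico.components_                    # rows are the learned basis vectors b_j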
14 Bases learned from image patches and speech data
15 Step 2: apply the representation to the labeled data and use it for classification
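
Continuing the same sketch, Step 2 encodes each labeled example as its sparse activations over the fixed bases and trains an ordinary classifier on those activations; the bases, labeled data, and the linear-SVM choice below are assumptions:

# Step 2 sketch: with the bases fixed, the sparse activations a(x) become the
# features of the labeled examples, fed to a standard classifier.
import numpy as np
from sklearn.decomposition import SparseCoder
from sklearn.svm import LinearSVC

bases = np.random.randn(512, 196)           # stands in for the bases learned in Step 1
bases /= np.linalg.norm(bases, axis=1, keepdims=True)

X_labeled = np.random.randn(200, 196)       # assumed labeled examples
y_labeled = np.random.randint(0, 2, 200)    # assumed binary labels

coder = SparseCoder(dictionary=bases,
                    transform_algorithm="lasso_lars",
                    transform_alpha=1.0)
features = coder.transform(X_labeled)       # activations a(x) over the bases
clf = LinearSVC().fit(features, y_labeled)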
16 High-level features computed
Using a set of 512 learned image bases (Fig. 2, left), Figure 3 illustrates a solution to the previous optimization problem
17 High-level features computed
18 High-level features computed
20 Connection to PCA
21 Connection to PCA
- PCA results in linear feature extraction, in that the features a_j^(i) are simply a linear function of the input
- The bases b_j must be orthogonal, so the number of PCA features cannot exceed the dimension n of the input
- Sparse coding has neither of these limitations (a short restatement in symbols follows this slide)
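
In symbols, using the notation of Step 1, where $B = [b_1, \dots, b_s]$ and $s$ is the number of bases (a brief restatement of the contrast, added for clarity):

$$\text{PCA:}\quad \hat{a}(x) = B^{\top} x, \qquad B^{\top} B = I, \qquad s \le n$$

$$\text{Sparse coding:}\quad \hat{a}(x) = \arg\min_{a} \Bigl\| x - \sum_{j} a_j b_j \Bigr\|_2^2 + \beta \| a \|_1, \qquad s \text{ may exceed } n$$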
22 Outline
- Related learning paradigms
- A self-taught learning algorithm
- Algorithm
- Experiment
23 Experiment setting
24 Experiment setting
25 Experimental results on images
26 Experimental results on characters
27 Experimental results on music data
28 Experimental results on text data
29 Comparison with results using features trained on labeled data
Table 7. Accuracy on the self-taught learning tasks when sparse coding bases are learned on unlabeled data (third column), or when principal components / sparse coding bases are learned on the labeled training set (fourth/fifth column).
30 Discussion
- Is it useful to learn a high-level feature representation in a unified process using both the labeled data and the unlabeled data?
- How does the similarity between the labeled data and the unlabeled data affect the performance?
- And more?