Title: Transfer Learning with Applications to Text Classification
1. Transfer Learning with Applications to Text Classification
- Jing Peng
- Computer Science Department
2. Machine Learning
- Machine learning: the study of algorithms that improve performance P on some task T using experience E
- A well-defined learning task: <P, T, E>
3. Learning to recognize targets in images
4. Learning to classify text documents
5. Learning to build forecasting models
6. Growth of Machine Learning
- Machine learning is the preferred approach to:
- Speech processing
- Computer vision
- Medical diagnosis
- Robot control
- News article processing
- This machine learning niche is growing because of:
- Improved machine learning algorithms
- Lots of available data
- Software too complex to code by hand
7. Learning
- Given: labeled training examples
- Least squares methods
- Learning focuses on minimizing the approximation error over a hypothesis space H (a minimal sketch follows below)
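A minimal sketch of the least-squares idea on this slide, assuming a linear hypothesis class; the feature matrix X and labels y below are synthetic placeholders, not data from the talk.

    import numpy as np

    # Synthetic data: 100 examples, 5 features (placeholder values).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = X @ true_w + 0.1 * rng.normal(size=100)

    # Least squares: pick the hypothesis h(x) = w.x in H that minimizes
    # the squared approximation error ||Xw - y||^2.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("estimated weights:", np.round(w, 2))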
8. Transfer Learning with Applications to Text Classification
- Main challenge: transfer learning with
- High dimensional data (more than 4,000 features)
- Overlapping feature sets (fewer than 80 features are shared)
- Solution with performance bounds
9. Standard Supervised Learning
Diagram: labeled training data and unlabeled test data both come from the New York Times; the classifier achieves 85.5.
10. In Reality
Diagram: labeled training data come from Reuters, while the unlabeled test data come from the New York Times; the classifier achieves only 64.1. Labeled New York Times data are not available!
11. Domain Difference → Performance Drop
- Ideal setting: train on NYT, test on NYT (New York Times) → 85.5
- Realistic setting: train on Reuters, test on NYT → 64.1
12. High Dimensional Data Transfer
- High dimensional data
- Text categorization
- Image classification
- The number of features in our experiments is more than 4,000
- Challenges
- High dimensionality: more features than training examples, so Euclidean distance becomes meaningless
13. Why Dimension Reduction?
Figure: the maximum pairwise distance DMAX and the minimum pairwise distance DMIN.
14-15. Curse of Dimensionality
Figures: how the distances behave as the number of dimensions grows (a small simulation follows below).
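A small simulation (using NumPy and SciPy) of the effect these figures illustrate, assuming points drawn uniformly at random: as the dimensionality grows, the largest and smallest pairwise Euclidean distances become nearly indistinguishable, which is why plain Euclidean distance loses its discriminative power.

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for dim in (2, 10, 100, 1000, 4000):
        points = rng.uniform(size=(200, dim))
        dists = pdist(points)  # all pairwise Euclidean distances
        print(f"dim={dim:5d}  DMAX/DMIN = {dists.max() / dists.min():.2f}")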
16. High Dimensional Data Transfer
- High dimensional data
- Text categorization
- Image classification
- The number of features in our experiments is more than 4,000
- Challenges
- High dimensionality: more features than training examples, so Euclidean distance becomes meaningless
- Are the feature sets completely overlapping? No: fewer than 80 features are the same
- Are the marginal distributions closely related? Not necessarily, which makes it harder to find transferable structures
- A proper similarity definition is needed
17. PAC (Probably Approximately Correct) Learning Requirement
- Training and test distributions must be the same (the standard guarantee is recalled below)
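For reference, a standard textbook statement of the PAC guarantee this requirement rests on (not taken from the slides): for every target concept and every distribution $\mathcal{D}$, given enough i.i.d. training examples drawn from $\mathcal{D}$, the learner outputs a hypothesis $h$ with

    \Pr\big[\operatorname{err}_{\mathcal{D}}(h) \le \varepsilon\big] \ge 1 - \delta,
    \qquad \operatorname{err}_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}\big[h(x) \ne c(x)\big],

where the error is measured under the same distribution $\mathcal{D}$ that generated the training data. It is exactly this same-distribution assumption that fails in the transfer setting.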
18. Transfer between high dimensional overlapping distributions
- Overlapping distributions
- Data from the two domains may not lie in exactly the same space, but at most in an overlapping one
19-22. Transfer between high dimensional overlapping distributions
Data from the two domains may not lie in exactly the same space, but at most in an overlapping one.

         x       y       z      label
    A    ?       1       0.2      1
    B    0.09    ?       0.1      1
    C    0.01    ?       0.3     -1
23-26. Transfer between high dimensional overlapping distributions
- Problems with overlapping distributions
- Overlapping features alone may not provide sufficient predictive power

         f1      f2      f3     label
    A    ?       1       0.2      1
    B    0.09    ?       0.1      1
    C    0.01    ?       0.3     -1

Hard to predict correctly: using only the shared feature f3, A is equally far from B and C (|0.2 - 0.1| = |0.2 - 0.3| = 0.1), so its label cannot be decided.
27-29. Transfer between high dimensional overlapping distributions
- Overlapping distributions
- Use the union of all features and fill in the missing values with zeros?
- Does it help?

         f1      f2      f3     label
    A    0       1       0.2      1
    B    0.09    0       0.1      1
    C    0.01    0       0.3     -1
30. Transfer between high dimensional overlapping distributions
31-32. Transfer between high dimensional overlapping distributions
D^2(A, B) = 0.0181 > D^2(A, C) = 0.0101
A is misclassified into the class of C instead of the class of B (a numeric check follows below).
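A quick numeric check of this zero-fill example, a sketch assuming plain squared Euclidean distance on the vectors above. The zero-filled non-overlapping feature f2 adds the same dominant term to both distances (the point of the next slide), and the ordering still puts A closer to C than to B.

    import numpy as np

    # Zero-filled feature vectors from the table above (f1, f2, f3).
    A = np.array([0.00, 1.0, 0.2])   # true f1 unknown, filled with 0
    B = np.array([0.09, 0.0, 0.1])   # true f2 unknown, filled with 0
    C = np.array([0.01, 0.0, 0.3])   # true f2 unknown, filled with 0

    def sq_dist(u, v):
        """Squared Euclidean distance."""
        return float(np.sum((u - v) ** 2))

    print("D^2(A, B) =", round(sq_dist(A, B), 4))  # 1.0181
    print("D^2(A, C) =", round(sq_dist(A, C), 4))  # 1.0101
    # A ends up closer to C (label -1) than to B (label 1), so a
    # nearest-neighbour rule misclassifies A.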
33. Transfer between high dimensional overlapping distributions
- When one uses the union of overlapping and non-overlapping features and replaces missing values with zeros:
- the distance between the two marginal distributions p(x) can become asymptotically very large as a function of the non-overlapping features
- the non-overlapping features become the dominant factor in the similarity measure
34. Transfer between high dimensional overlapping distributions
- High dimensionality can obscure important features
35-36. Transfer between high dimensional overlapping distributions
Figure: the blue points are closer to the green points than to the red points.
37. LatentMap: a two-step correction
- Step 1: Missing value regression
- Brings the marginal distributions closer
- Step 2: Latent space dimensionality reduction
- Further brings the marginal distributions closer
- Ignores unimportant features and those that are noisy or carry imputation errors
- Identifies transferable substructures across the two domains
38-44. Missing Value Regression
- Predict the missing values (recall the previous example)
- Step 1: project onto the overlapping feature z
- Step 2: map from z back to x, using a relationship found by regression
With the imputed values, D(img(A), B) = 0.0109 < D(img(A), C) = 0.0125, so A is correctly classified into the same class as B (a sketch of the regression step follows below).
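A minimal sketch of this imputation step, assuming an ordinary linear regression from the shared feature z to the missing feature x, fitted on the domain where x is observed. The toy numbers come from the table above; the sketch does not try to reproduce the slide's exact distance values.

    import numpy as np

    # Columns: x, y, z; z is the feature shared by both domains.
    # np.nan marks a value the domain never observes.
    A = np.array([np.nan, 1.0, 0.2])   # target-domain point, x missing
    B = np.array([0.09, np.nan, 0.1])  # source-domain point, y missing
    C = np.array([0.01, np.nan, 0.3])  # source-domain point, y missing

    # Step 1: project onto the shared feature z.
    # Step 2: regress x on z where x is observed (points B and C),
    # then use the fitted line to fill in A's missing x.
    a, b = np.polyfit([B[2], C[2]], [B[0], C[0]], deg=1)
    A_imputed = A.copy()
    A_imputed[0] = a * A[2] + b

    print("imputed A:", np.round(A_imputed, 3))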
45-49. Dimensionality Reduction
Diagram: the word-vector matrix, built from the overlapping features together with the filled-in missing values.
50-52. Dimensionality Reduction
- Project the word-vector matrix onto its most important, inherent sub-space
- This yields a low dimensional representation (a sketch follows below)
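A sketch of this projection using a truncated SVD, assuming the filled-in word-vector matrix is stacked with documents as rows; the matrix X and the target dimensionality k below are placeholders for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    # Placeholder word-vector matrix: rows are documents from both
    # domains, columns are the union of features (missing values filled).
    X = rng.random((300, 4000))

    # Truncated SVD: keep the top-k singular directions and project the
    # documents into that low dimensional latent sub-space.
    k = 50
    U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    X_low = U[:, :k] * s[:k]  # low dimensional representation

    print("original shape:", X.shape)    # (300, 4000)
    print("reduced shape:", X_low.shape)  # (300, 50)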
53-55. Solution (high dimensionality)
- Recall the previous example
Before the correction, the blue points are closer to the green points than to the red points.
56-57. Solution (high dimensionality)
After the correction, the blue points are closer to the red points than to the green points.
58. Properties
- It brings the marginal distributions of the two domains closer
- Marginal distributions are brought closer in the high dimensional space (Section 3.2)
- The distance between the two marginal distributions is further reduced in the low dimensional space (Theorem 3.2)
- It brings the two domains' conditional distributions closer
- Nearby instances from the two domains have similar conditional distributions (Section 3.3)
- It can reduce the domain-transfer risk
- The risk of the nearest neighbor classifier can be bounded in the transfer learning setting (Theorem 3.3)
59-62. Experiment (I)
- Data sets
- 20 Newsgroups: 20,000 newsgroup articles
- SRAA (simulated/real auto/aviation): 73,128 articles from 4 discussion groups (simulated auto racing, simulated aviation, real autos, and real aviation)
- Reuters: 21,758 Reuters news articles (1987)
- Baseline methods
- naïve Bayes, logistic regression, SVMs
- Knn-Reg: missing values filled, but no SVD
- pLatentMap: SVD, but missing values left as 0
The last two baselines try to justify the two steps in our framework.
63. Learning Tasks
64. Experiment (II)
Overall performance: 10 wins, 1 loss.
65. Experiment (III)
66. Conclusion
- Problem: high dimensional, overlapping domain transfer
- Text and image categorization
- Step 1: fill in the missing values
- Brings the two domains' marginal distributions closer
- Step 2: SVD dimension reduction
- Further brings the two marginal distributions closer (Theorem 3.2)
- Clusters points from the two domains, making the conditional distribution transferable (Theorem 3.3)