Transfer Learning with Applications to Text Classification

Transcript and Presenter's Notes
1
Transfer Learning with Applications to Text
Classification
  • Jing Peng
  • Computer Science Department

2
  • Machine learning:
  • the study of algorithms that
  • improve performance P
  • on some task T
  • using experience E
  • Well-defined learning task: <P, T, E>

3
Learning to recognize targets in images
4
Learning to classify text documents
5
Learning to build forecasting models
6
Growth of Machine Learning
  • Machine learning is the preferred approach to
  • Speech processing
  • Computer vision
  • Medical diagnosis
  • Robot control
  • News article processing
  • This machine learning niche is growing thanks to
  • Improved machine learning algorithms
  • Lots of available data
  • Software too complex to code by hand

7
Learning
  • Given training data
  • Least squares methods
  • Learning focuses on minimizing the approximation
    error over the hypothesis space H
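
The transcript drops the slide's formulas; a hedged reconstruction of the standard least-squares objective it presumably showed, where H is the hypothesis space and (x_i, y_i) are the training examples:

    \hat{f} = \arg\min_{f \in H} \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2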
8
Transfer Learning with Applications to Text
Classification
  • Main Challenge
  • Transfer learning with
  • high-dimensional data (4,000 features)
  • overlapping feature sets (<80 features are the same)
  • Solution with performance bounds

9
Standard Supervised Learning
[Diagram: a classifier is trained on labeled New York
Times documents and tested on unlabeled New York Times
documents; accuracy 85.5%.]
10
In Reality
[Diagram: labeled New York Times training data are not
available, so the classifier is trained on labeled
Reuters documents and tested on unlabeled New York
Times documents; accuracy 64.1%.]
11
Domain Difference → Performance Drop
  • Ideal setting: train on New York Times, test on
    New York Times → accuracy 85.5%
  • Realistic setting: train on Reuters, test on
    New York Times → accuracy 64.1%
12
High Dimensional Data Transfer
  • High Dimensional Data
  • Text Categorization
  • Image Classification

The number of features in our experiments is more
than 4,000.
  • Challenges
  • High dimensionality:
  • more features than training examples
  • Euclidean distance becomes meaningless

13
Why Dimension Reduction?
[Figure: the maximum and minimum pairwise distances,
D_MAX and D_MIN.]
14
Curse of Dimensionality
[Plots: pairwise-distance behavior as the number of
Dimensions grows.]
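
The plots are lost in the transcript, but the effect they illustrate is the standard one: as dimensionality grows, the largest and smallest pairwise distances converge, so the ratio D_MAX/D_MIN approaches 1 and Euclidean distance loses contrast. A small Python sketch (my illustration, not the slides' data):

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000, 4000):
        X = rng.random((200, d))   # 200 random points in the unit cube
        D = pdist(X)               # all pairwise Euclidean distances
        print(f"d={d:5d}  D_MAX/D_MIN = {D.max() / D.min():.2f}")  # shrinks toward 1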
16
High Dimensional Data Transfer
  • High Dimensional Data
  • Text Categorization
  • Image Classification

The number of features in our experiments is more
than 4,000.
  • Challenges
  • High dimensionality:
  • more features than training examples
  • Euclidean distance becomes meaningless
  • Feature sets completely overlapping?
  • No: fewer than 80 features are the same.
  • Marginal distributions closely related?
  • Not necessarily: it is harder to find transferable
    structures, and a proper similarity definition is
    needed.

17
PAC (Probably Approximately Correct) learning
requirement
  • Training and test distributions must be the same

18
Transfer between high dimensional overlapping
distributions
  • Overlapping Distributions

Data from the two domains may not lie in exactly
the same space; at best they overlap.

      x      y      z      label
A     ?      1      0.2    1
B     0.09   ?      0.1    1
C     0.01   ?      0.3    -1
23
Transfer between high dimensional overlapping
distributions
  • Problems with overlapping distributions
  • Overlapping features alone may not provide
    sufficient predictive power

Hard to predict correctly:

      f1     f2     f3     label
A     ?      1      0.2    1
B     0.09   ?      0.1    1
C     0.01   ?      0.3    -1
27
Transfer between high dimensional overlapping
distributions
  • Overlapping Distributions
  • Use the union of all features and fill in the
    missing values with zeros?

Does it help?

      f1     f2     f3     label
A     0      1      0.2    1
B     0.09   0      0.1    1
C     0.01   0      0.3    -1
30
Transfer between high dimensional overlapping
distributions

D²(A, B) = 0.0181 > D²(A, C) = 0.0101

A is misclassified into the class of C instead of B.
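
To make the arithmetic concrete, a minimal NumPy sketch of the zero-filling failure (the slide reports the distance terms excluding the shared f2 contribution of 1.0, which is identical for both pairs and does not affect the ordering):

    import numpy as np

    # The slide's table with the "?" entries filled in with zeros (f1, f2, f3).
    A = np.array([0.00, 1.00, 0.2])   # f1 was missing
    B = np.array([0.09, 0.00, 0.1])   # f2 was missing
    C = np.array([0.01, 0.00, 0.3])   # f2 was missing

    d2_AB = np.sum((A - B) ** 2)      # 1.0181 = 0.0181 + the shared 1.0
    d2_AC = np.sum((A - C) ** 2)      # 1.0101 = 0.0101 + the shared 1.0

    # Zero-filling makes A look closer to C than to B, so a nearest-neighbor
    # rule assigns A the label of C (-1) instead of B (1).
    print(d2_AB > d2_AC)              # True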
33
Transfer between high dimensional overlapping
distributions
  • When one uses the union of overlapping and
    non-overlapping features and replaces missing
    values with zeros,
  • the distance between the two marginal distributions
    p(x) can become asymptotically very large as a
    function of the non-overlapping features,
  • which then become the dominant factor in the
    similarity measure.

34
Transfer between high dimensional overlapping
distributions
  • High dimensionality can undermine important
    features

35
Transfer between high dimensional overlapping
distributions
[Scatter plot: the blue points are closer to the
greens than to the reds.]
37
LatentMap: a two-step correction
  • Step 1: Missing-value regression
  • Brings the marginal distributions closer
  • Step 2: Latent-space dimensionality reduction
  • Brings the marginal distributions still closer
  • Ignores unimportant, noisy, or error-carrying
    features
  • Identifies transferable substructures across the
    two domains

38
Missing Value Regression
  • Predict missing values (recall the previous
    example)
  • 1. Project onto the overlapping feature z
  • 2. Map from z to x, using the relationship found
    by regression

D(img(A), B) = 0.0109 < D(img(A), C) = 0.0125

A is correctly classified into the same class as B.
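
A minimal sketch of this regression step under illustrative assumptions (synthetic data; LinearRegression stands in for whatever regressor the method actually uses; all names are mine):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    k, m = 50, 30                          # counts of overlapping / source-only features
    Z_src = rng.random((200, k))           # source domain, overlapping features z
    X_src = Z_src @ rng.random((k, m))     # source domain, source-only features x
    Z_tgt = rng.random((100, k))           # target domain, overlapping features z
                                           # (its x-block is missing)

    # Step 1: project both domains onto the overlapping features (Z_src, Z_tgt).
    # Step 2: learn the map z -> x where x is observed, and use it to fill in
    # the target's missing values instead of padding them with zeros.
    reg = LinearRegression().fit(Z_src, X_src)
    X_tgt = reg.predict(Z_tgt)

    # Both domains now live in the same completed feature space.
    S_full = np.hstack([Z_src, X_src])
    T_full = np.hstack([Z_tgt, X_tgt])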
45
Dimensionality Reduction
[Diagram: the word-vector matrix, assembled from the
overlapping features and the previously missing
values, now filled in.]
50
Dimensionality Reduction
  • Project the word-vector matrix onto its most
    important, inherent sub-space

Low-dimensional representation
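
A minimal sketch of the projection step, assuming a completed documents-by-words matrix; TruncatedSVD stands in for the exact SVD procedure, and r = 100 latent dimensions is an arbitrary illustrative choice:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    rng = np.random.default_rng(0)
    M = rng.random((300, 4000))        # stacked word-vector matrix (both domains),
                                       # missing values already filled in
    svd = TruncatedSVD(n_components=100, random_state=0)
    M_low = svd.fit_transform(M)       # low-dimensional representation, (300, 100)

    # Nearest-neighbor distances are now computed in this latent space, where
    # noisy, non-transferable directions have been projected away.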
53
Solution (high dimensionality)
  • Recall the previous example

[Scatter plots: before the correction, the blue
points are closer to the greens than to the reds;
after the correction, the blues are closer to the
reds than to the greens.]
58
Properties
  • It brings the marginal distributions of the two
    domains closer.
  • - Marginal distributions are brought closer in the
    high-dimensional space (Section 3.2).
  • - The two marginal distributions are brought still
    closer in the low-dimensional space (Theorem 3.2).
  • It brings the two domains' conditional
    distributions closer.
  • - Nearby instances from the two domains have similar
    conditional distributions (Section 3.3).
  • It can reduce the domain-transfer risk.
  • - The risk of the nearest-neighbor classifier can be
    bounded in transfer-learning settings (Theorem 3.3).
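
For concreteness, a hypothetical end-to-end use (all names and shapes illustrative, not the paper's code): after the two LatentMap steps, the nearest-neighbor rule whose transfer risk Theorem 3.3 bounds is applied in the latent space.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    S_low = rng.random((200, 100))     # latent representation of source documents
    y_src = rng.integers(0, 2, 200)    # source labels (illustrative)
    T_low = rng.random((100, 100))     # latent representation of target documents

    # 1-NN trained on the labeled source domain, applied to the target domain.
    knn = KNeighborsClassifier(n_neighbors=1).fit(S_low, y_src)
    y_tgt_pred = knn.predict(T_low)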

59
Experiment (I)
  • Data Sets
  • 20 Newsgroups
  • 20,000 newsgroup articles
  • SRAA (Simulated/Real Auto/Aviation)
  • 73,128 articles from 4 discussion groups
    (simulated auto racing, simulated aviation, real
    autos, and real aviation)
  • Reuters
  • 21,578 Reuters news articles (1987)
  • Baseline methods
  • naïve Bayes, logistic regression, SVM
  • Knn-Reg: missing values filled in, but without SVD
  • pLatentMap: SVD, but with missing values set to 0

The baselines are chosen to justify the two steps in
our framework.
63
Learning Tasks
64
Experiment (II)
Overall performance: 10 wins, 1 loss
65
Experiment (III)
66
Conclusion
  • Problem: high-dimensional overlapping domain
    transfer
  • - text and image categorization
  • Step 1: fill in missing values
  • --- brings the two domains' marginal distributions
    closer
  • Step 2: SVD dimension reduction
  • --- further brings the two marginal distributions
    closer (Theorem 3.2)
  • --- clusters points from the two domains, making
    the conditional distribution transferable (Theorem
    3.3)