Title: Domain Adaptation with Multiple Sources
1. Domain Adaptation with Multiple Sources
- Yishay Mansour, Tel Aviv Univ. & Google
- Mehryar Mohri, NYU & Google
- Afshin Rostamizadeh, NYU
2. Adaptation
3. Adaptation motivation
- High level
- The ability to generalize from one domain to another
- Significance
- Basic human property
- Essential in most learning environments
- Implicit in many applications.
4. Adaptation - examples
- Sentiment analysis
- Users leave reviews
- products, sellers, movies, ...
- Goal: score reviews as positive or negative.
- Adaptation example
- Learn for restaurants and airlines
- Generalize to hotels
5. Adaptation - examples
- Speech recognition
- Adaptation
- Learn a few accents
- Generalize to new accents
- think foreign accents.
6. Adaptation and generalization
- Machine Learning prediction
- Learn from examples drawn from distribution D
- predict the label of unseen examples
- drawn from the same distribution D
- generalization within a distribution
- Adaptation
- predict the label of unseen examples
- drawn from a different distribution D'
- Generalization across distributions
7. Adaptation - Related Work
- Learn from D and test on D'
- relating the increase in error to dist(D, D')
- Ben-David et al. (2006), Blitzer et al. (2007), ...
- Single distribution, varying label quality
- Crammer et al. (2005, 2006)
8. Our Model
9. Our Model - input
Typical loss function: L(a, b) = |a - b| and L(D, h, f) = E_{x~D} |f(x) - h(x)|
10. Our Model - target distribution
[Figure: the target distribution Dλ is a mixture of the source distributions D1, ..., Dk with mixture weights λ1, ..., λk]
11. Our model - Combination Rule
- Combine h1, ..., hk into a hypothesis h
- Low expected loss
- hopefully at most ε
- Combining rules (both are sketched in code below)
- let z: Σi zi = 1 and zi ≥ 0
- linear: h(x) = Σi zi hi(x)
- distribution weighted: hz(x) = Σi [zi Di(x) / Σj zj Dj(x)] hi(x)
[Figure: hypotheses h1, ..., hk feeding into a combining rule]
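As a concrete illustration of the two rules, here is a minimal pointwise sketch (my own code, not from the talk; the function names are illustrative):

```python
import numpy as np

def linear_rule(z, h_vals):
    """Linear combining rule: h(x) = sum_i z_i * h_i(x).
    z: weights on the simplex; h_vals: the values h_1(x), ..., h_k(x)."""
    return float(np.dot(z, h_vals))

def distribution_weighted_rule(z, d_vals, h_vals):
    """Distribution weighted rule:
    h_z(x) = sum_i [z_i * D_i(x) / sum_j z_j * D_j(x)] * h_i(x).
    d_vals: the densities D_1(x), ..., D_k(x) at the same point x.
    Undefined when sum_j z_j * D_j(x) = 0; that is the continuity
    problem the talk addresses later with an eta-smoothed version."""
    w = np.asarray(z, dtype=float) * np.asarray(d_vals, dtype=float)
    return float(np.dot(w, h_vals) / w.sum())
```

On the two-point example used later in the talk, the linear rule is stuck at loss ½ while the distribution weighted rule recovers the target exactly.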
12. Combining Rules - Pros
- Alternative: build a data set for the mixture
- Learning the mixture parameters is non-trivial
- The combined data set might be huge
- Domain-dependent data may be unavailable
- Sometimes only the classifiers are given/exist
- privacy
- MOST IMPORTANT
- FUNDAMENTAL THEORY QUESTION
13. Our Results
- Linear combining rule
- Seems like the first thing to try
- Can be very bad
- There are simple settings where any linear combining rule performs badly.
14. Our Results
- Distribution weighted combining rules
- Given the mixture parameter λ,
- there is a good distribution weighted combining rule
- expected loss at most ε
- For any target function f,
- there is a good distribution weighted combining rule hz
- expected loss at most ε
- Extension to multiple consistent target functions
- expected loss at most 3ε
- OUTCOME: this is the right hypothesis class
15. Known Distribution
16. Linear combining rules
A bad example on two points, X = {a, b}:

x | h0(x) | h1(x) | f(x)
a |   0   |   1   |   1
b |   0   |   1   |   0

x | Da(x) | Db(x) | D(x) = ½Da(x) + ½Db(x)
a |   1   |   0   |   ½
b |   0   |   1   |   ½

Original loss ε = 0 !!! (h1 is perfect on Da, h0 is perfect on Db)
Any linear combining rule h has expected absolute loss ½ on D, as worked out below.
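Spelling the claim out with the table values: any linear rule h = z·h0 + (1-z)·h1 is the constant function 1-z, so its loss on the mixture D is

$$
L(D, h, f) = \tfrac{1}{2}\bigl|f(a)-(1-z)\bigr| + \tfrac{1}{2}\bigl|f(b)-(1-z)\bigr|
= \tfrac{1}{2}\,z + \tfrac{1}{2}\,(1-z) = \tfrac{1}{2}.
$$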
17. Distribution weighted combining rule
- Target distribution: a mixture
- Dλ(x) = Σi λi Di(x)
- Set z = λ
- Claim: L(Dλ, hλ, f) ≤ ε
18. Distribution weighted combining rule
PROOF
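The proof itself is not in the transcript; the standard argument for the claim on the previous slide (distribution weighted rule with z = λ) runs as follows, using the triangle inequality over the convex weights λi Di(x)/Dλ(x):

$$
L(D_\lambda, h_\lambda, f)
= \sum_x D_\lambda(x)\,\Bigl|\sum_i \frac{\lambda_i D_i(x)}{D_\lambda(x)}\,h_i(x) - f(x)\Bigr|
\;\le\; \sum_x \sum_i \lambda_i D_i(x)\,\bigl|h_i(x) - f(x)\bigr|
= \sum_i \lambda_i\, L(D_i, h_i, f)
\;\le\; \varepsilon.
$$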
19. Back to the bad example

x | h0(x) | h1(x) | f(x)
a |   0   |   1   |   1
b |   0   |   1   |   0

x | Da(x) | Db(x) | D(x)
a |   1   |   0   |   ½
b |   0   |   1   |   ½

Original loss ε = 0 !!!
The distribution weighted rule with z = (½, ½) gets both points right:
for x = a, h(x) = h1(x) = 1; for x = b, h(x) = h0(x) = 0.
20. Unknown Distribution
21. Unknown mixture distribution
- Zero-sum game
- NATURE selects a distribution Di
- LEARNER selects a z
- hypothesis hz
- Payoff: L(Di, hz, f)
- Restating the previous result
- For any mixed action λ of NATURE
- LEARNER has a pure action z = λ
- such that the expected loss is at most ε
22. Unknown mixture distribution
- Consequence
- LEARNER has a mixed action (over z's)
- for any mixed action λ of NATURE
- a mixture distribution Dλ
- the loss is at most ε
- Challenge
- show a specific hypothesis hz
- a pure, not mixed, action
23. Searching for a good hypothesis
- Uniformly good hypothesis hz
- for any Di we have L(Di, hz, f) ≤ ε
- Assume all the hi are identical
- an extremely lucky and unlikely case
- If we have a uniformly good hypothesis we are done!
- L(Dλ, hz, f) = Σi λi L(Di, hz, f) ≤ Σi λi ε = ε
- We need to show that, in general, a good hz exists!
24. Proof Outline
- Balancing the losses
- Show that some hz has identical loss on every Di
- uses Brouwer's Fixed Point Theorem
- holds very generally
- Bounding the losses
- Show this hz has low loss for some mixture
- specifically Dz
25. Brouwer Fixed Point Theorem
For any convex and compact set A and any continuous mapping f: A → A, there exists a point x in A such that f(x) = x.
[Figure: a compact, convex set A and a continuous mapping f: A → A]
26. Balancing Losses
Problem 1: we need the mapping f to be continuous
27. Balancing Losses
Fixed point: z = f(z)
Problem 2: requires that zi > 0
28. Bounding the losses
- We can guarantee balanced losses even for the linear combining rule!

x | h0(x) | h1(x) | f(x)
a |   0   |   1   |   1
b |   0   |   1   |   0

x | Da(x) | Db(x) | D(x)
a |   1   |   0   |   ½
b |   0   |   1   |   ½

For z = (½, ½) we have L(Da, hz, f) = ½ and L(Db, hz, f) = ½
- balanced, but bad: balance alone is not enough, the common loss must also be bounded.
29. Bounding Losses
- Consider the previous z
- from Brouwer's fixed point theorem
- Consider the mixture Dz
- Expected loss is at most ε
- Also L(Dz, hz, f) = Σj zj L(Dj, hz, f) = γ, the common balanced loss
- Conclusion
- For any mixture, expected loss at most γ ≤ ε
30. Solving the problems
- Redefine the distribution weighted rule
- hz,η(x) = Σi [(zi Di(x) + η U(x)/k) / (Σj zj Dj(x) + η U(x))] hi(x), with U the uniform distribution
- Claim: for any distribution D, L(D, hz,η, f) is continuous in z (see the sketch below)
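A minimal pointwise sketch of the redefined rule (variable names and the explicit uniform density U are my own notation, following the η-smoothed form above):

```python
import numpy as np

def smoothed_dw_rule(z, d_vals, h_vals, eta, u_val):
    """eta-smoothed distribution weighted rule at a point x:
    h_{z,eta}(x) = sum_i [(z_i D_i(x) + eta U(x)/k) /
                          (sum_j z_j D_j(x) + eta U(x))] * h_i(x).
    u_val is U(x) for a fixed uniform density U.  The eta term keeps the
    denominator strictly positive, so h_{z,eta}(x), and hence
    L(D, h_{z,eta}, f), is continuous in z even when some z_i = 0."""
    k = len(z)
    w = np.asarray(z, dtype=float) * np.asarray(d_vals, dtype=float) + eta * u_val / k
    return float(np.dot(w, h_vals) / w.sum())
```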
31. Main Theorem
- For any target function f and any δ > 0,
- there exist η > 0 and z such that
- for any λ we have L(Dλ, hz,η, f) ≤ ε + δ
32. Balancing Losses
- The set A: Σi zi = 1 and zi ≥ 0
- the simplex
- The mapping f with parameters η and η′
- f(z)i = (zi Li,z + η′/k) / (Σj zj Lj,z + η′)
- where Li,z = L(Di, hz,η, f)
- For some z in A we have f(z) = z (Brouwer)
- zi = (zi Li,z + η′/k) / (Σj zj Lj,z + η′) > 0
- Li,z = (Σj zj Lj,z) + η′ - η′/(zi k) < (Σj zj Lj,z) + η′
33. Bounding Losses
- Consider the previous z
- from Brouwer's fixed point theorem
- Consider the mixture Dz
- Expected loss is at most ε + η
- By definition Σj zj Lj,z = L(Dz, hz,η, f)
- Conclusion: Σj zj Lj,z ≤ ε + η
34. Putting it together
- There exists (z, η) such that
- Expected loss of hz,η is approximately balanced
- L(Di, hz,η, f) ≤ γ + η′ for every i, where γ = Σj zj Lj,z
- Bounding γ using Dz
- γ = L(Dz, hz,η, f) ≤ ε + η
- For any mixture Dλ
- L(Dλ, hz,η, f) ≤ ε + η + η′ (the full chain is written out below)
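Written out, the chain of bounds for an arbitrary mixture λ is

$$
L(D_\lambda, h_{z,\eta}, f) = \sum_i \lambda_i\, L(D_i, h_{z,\eta}, f)
\;\le\; \gamma + \eta' = L(D_z, h_{z,\eta}, f) + \eta'
\;\le\; \varepsilon + \eta + \eta',
$$

which is at most ε + δ once η and η′ are chosen small enough, giving the Main Theorem.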
35. A more general model
- So far NATURE first fixes the target function f
- consistent target functions f:
- the expected loss w.r.t. Di is at most ε
- for each of the k distributions
- Function class F = { f : f is consistent }
- New Model
- LEARNER picks a hypothesis h
- NATURE picks f in F and a mixture Dλ
- Loss L(Dλ, h, f)
- RESULT: L(Dλ, h, f) ≤ 3ε
36. Simple Algorithms
37. Uniform Algorithm
- Hypothesis: set z = (1/k, ..., 1/k)
- Performance
- For any mixture, expected error at most kε (a derivation is sketched below)
- There exists a mixture with expected error Ω(kε)
- For k = 2, there exists a mixture with error 2ε - ε²
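One way to see the kε upper bound (this derivation is my own, assuming the uniform algorithm is the distribution weighted rule with z = (1/k, ..., 1/k)): for any mixture λ, since Dλ(x) ≤ Σj Dj(x) at every point x,

$$
L(D_\lambda, h_z, f)
\;\le\; \sum_x D_\lambda(x) \sum_i \frac{D_i(x)}{\sum_j D_j(x)}\,\bigl|h_i(x)-f(x)\bigr|
\;\le\; \sum_i \sum_x D_i(x)\,\bigl|h_i(x)-f(x)\bigr|
= \sum_i L(D_i, h_i, f) \;\le\; k\varepsilon.
$$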
38. Open Problem
- Find a uniformly good hypothesis
- efficiently !!!
- algorithmic issues
- Search over the z's (a heuristic sketch follows below)
- Multiple local minima.
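As a purely heuristic illustration of that search (not an algorithm from the talk or the paper), one could iterate the fixed-point mapping from the balancing step numerically, estimating each per-domain loss on held-out data; all names below are hypothetical:

```python
import numpy as np

def heuristic_balance(per_domain_loss, k, eta_prime=1e-3, iters=200):
    """Iterate  z_i <- (z_i L_i(z) + eta'/k) / (sum_j z_j L_j(z) + eta'),
    the mapping used in the existence proof, hoping to reach approximately
    balanced per-domain losses.

    per_domain_loss(z): returns estimates of
    (L(D_1, h_z, f), ..., L(D_k, h_z, f)), e.g. measured on held-out
    samples from each source domain.  There is no convergence or
    optimality guarantee: as noted above, the search over z may have
    multiple local minima."""
    z = np.full(k, 1.0 / k)
    for _ in range(iters):
        losses = np.asarray(per_domain_loss(z), dtype=float)
        z = (z * losses + eta_prime / k) / (z @ losses + eta_prime)
    return z
```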
39. Empirical Results
40. Empirical Results
- Data set: sentiment analysis (raw review snippets)
- good product takes a little time to start operating very good for the price a little trouble using it inside ca
- it rocks man this is the rockinest think i've ever seen or buyed dudes check it ou
- does not retract agree with the prior reviewers i can not get it to retract any longer and that was only after 3 uses
- dont buy not worth a cent got it at walmart can't even remove a scuff i give it 100 good thing i could return it
- flash drive excelent hard drive good price and good time for seller thanks
41. Empirical analysis
- Multiple domains
- dvd, books, electronics, kitchen appliances
- Language model
- build a model for each domain
- unlike the theory, this is an additional source of error
- Tested on mixture distributions
- known mixture parameters
- Target: score (1-5)
- Error measure: mean squared error (MSE); see the sketch below
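A rough sketch of the kind of evaluation described above (all function and variable names are hypothetical; in the actual experiments the per-domain language models provide the density estimates, and the mixture weights are known):

```python
import numpy as np

def dw_score(lam, densities, scores):
    """Distribution weighted prediction for one review:
    densities[i] ~ D_i(x), domain i's language-model probability of the text,
    scores[i]    = h_i(x), the score (1-5) predicted by domain i's model,
    lam          = the known mixture weights over the k domains."""
    w = np.asarray(lam, dtype=float) * np.asarray(densities, dtype=float)
    return float(np.dot(w, scores) / w.sum())

def mse(predictions, targets):
    """Mean squared error between predicted and gold scores."""
    p = np.asarray(predictions, dtype=float)
    t = np.asarray(targets, dtype=float)
    return float(np.mean((p - t) ** 2))
```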
42. Distribution weighted
[Plot: MSE of the distribution weighted and linear combining rules on mixtures of the dvd, books, electronics, and kitchen domains]
43-44. (No transcript)
45. Summary
46. Summary
- Adaptation model
- combining rules
- linear
- distribution weighted
- Theoretical analysis
- mixture distribution
- Future research
- algorithms for combining rules
- beyond mixtures
47. Thank You!
48. Adaptation - Our Model
- Input
- target function f
- k distributions D1, ..., Dk
- k hypotheses h1, ..., hk
- For every i: L(Di, hi, f) ≤ ε
- where L(D, h, f) is the expected loss
- think L(D, h, f) = E_{x~D} |f(x) - h(x)|