1. Minimum Sample Risk Methods for Language Modeling
- Jianfeng Gao (1), Hao Yu (2), Wei Yuan (2), Peng Xu (3)
- (1) Microsoft Research
- (2) Shanghai Jiaotong Univ., China
- (3) Johns Hopkins Univ., USA
- HLT/EMNLP 2005, October 6, 2005
2. Outline
- Task of Asian language text input and language modeling (LM)
- Discriminative training for LM
- Minimum Sample Risk (MSR)
- Experiments
- Conclusions
3. An example of Chinese text input
mafangnitryyixoazegefanfa
4. Text input test bed for LM
- A direct test bed for LM
- No acoustic ambiguity
- P(A|W) ≈ 1, so it is easy to convert W to A
- Easy to obtain large training data for discriminative learning
- Microsoft Research will soon make the Japanese text input test data available for research purposes
5. LM fundamentals
- Estimate P(W) using a trigram model
- P(W) = ∏ P(w_i | w_1, ..., w_{i-1}) ≈ ∏ P(w_i | w_{i-2}, w_{i-1})
- Estimate P(w_i | w_{i-2}, w_{i-1}) using maximum likelihood estimation (MLE) with smoothing
- Issues
- Maximum likelihood ≠ minimum error rate
- Difficult to integrate arbitrary linguistic features into the generation process
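A minimal sketch of the baseline ingredient above: a trigram model estimated from MLE counts. The `TrigramLM` class, the padding symbols, and the add-one smoothing are illustrative assumptions, not the smoothing scheme used in the talk's baseline.

```python
from collections import defaultdict
import math

class TrigramLM:
    """Trigram LM with MLE counts; add-one smoothing here is purely illustrative."""

    def __init__(self, vocab_size):
        self.tri = defaultdict(int)   # counts of (w_{i-2}, w_{i-1}, w_i)
        self.bi = defaultdict(int)    # counts of (w_{i-2}, w_{i-1})
        self.vocab_size = vocab_size

    def train(self, sentences):
        for words in sentences:
            padded = ["<s>", "<s>"] + list(words) + ["</s>"]
            for i in range(2, len(padded)):
                self.tri[tuple(padded[i - 2:i + 1])] += 1
                self.bi[tuple(padded[i - 2:i])] += 1

    def log_prob(self, w, u, v):
        # P(w | u, v): MLE count ratio with add-one smoothing
        num = self.tri[(u, v, w)] + 1
        den = self.bi[(u, v)] + self.vocab_size
        return math.log(num / den)

    def sentence_log_prob(self, words):
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        return sum(self.log_prob(padded[i], padded[i - 2], padded[i - 1])
                   for i in range(2, len(padded)))
```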
6. Discriminative training for LM
- Linear model form
- A set of D+1 arbitrary features
- f(W, A) ∈ R^{D+1}, f(W, A) = (f_0(W, A), ..., f_D(W, A))
- e.g. f_0 = log trigram probability
- e.g. f_i (i = 1, ..., D) = counts of word n-grams
- A set of D+1 parameters, one for each feature
- λ = (λ_0, λ_1, ..., λ_D)
- Conversion score assigned by the linear model: Score(W, A) = λ · f(W, A)
- Research focus: how to optimize the parameters λ
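As a rough illustration of the linear model above, the sketch below scores each candidate conversion by the dot product λ · f(W, A) and selects the highest-scoring one. The function names and the (W, feature_vector) candidate layout are hypothetical.

```python
def linear_score(features, weights):
    """Score(W, A) = sum_d weights[d] * features[d]; features[0] is the log trigram
    probability, the remaining entries are word n-gram counts."""
    return sum(l * f for l, f in zip(weights, features))

def convert(candidates, weights):
    """Return the candidate W in GEN(A) with the highest linear score.
    `candidates` is a list of (W, feature_vector) pairs (hypothetical layout)."""
    return max(candidates, key=lambda cand: linear_score(cand[1], weights))[0]
```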
7. Optimizing parameters
- Training criterion (objective function)
- Easy to optimize
- Tightly coupled with character error rate (CER)
- Optimization algorithm: search for the optimum
- Numerical methods
- Heuristic methods
- Error rate is a step function and cannot be optimized easily
- Two strategies for optimizing parameters
- Existing methods minimize an easily-optimized loss function, an approximation (upper bound) of CER
- Our method directly minimizes CER using heuristic methods
8. Outline
- Language modeling (LM) and the task of Asian language text input (IME)
- Discriminative training for LM
- Minimum Sample Risk
- Experiments
- Conclusions
9. MSR (Minimum Sample Risk)
- Assume each training sample is (A, W_R), where W_R is the reference transcript
- Assume the converted W is determined by ranking the candidates in GEN(A) by the linear model score
- Sample risk is defined as the edit distance between W and W_R, denoted Er(W_R, W)
- The goal of MSR is to minimize the sample risk over all training samples
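The sample risk follows directly from the definitions on this slide: decode each training sample with the current weights and sum the character edit distances to the references. A minimal sketch, reusing the hypothetical `convert` helper from the earlier sketch and assuming a (W_R, candidates) layout for the training data.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance, i.e. Er(W_R, W)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                # deletion
                          d[i][j - 1] + 1,                                # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))   # substitution
    return d[len(ref)][len(hyp)]

def sample_risk(training_set, weights):
    """Sum of edit distances between the selected candidates and the references.
    `training_set` is a list of (W_R, candidates) pairs."""
    return sum(edit_distance(ref, convert(candidates, weights))
               for ref, candidates in training_set)
```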
10. MSR algorithm
- Multi-dimensional function optimization algorithm
- Take λ_1, λ_2, ..., λ_D as a set of search directions.
- Using line search, move along the first direction to minimize the sample risk.
- Move from there along the second direction to its minimum.
- Cycle through the whole set of directions as many times as necessary, until the sample risk stops decreasing.
- Challenges
- Line search on a step function: efficient grid line search
- Very large feature set: feature subset selection
11. Grid line search: overview
- One-dimensional optimization algorithm
- Grid search
- A grid is an interval of parameter values that map to the same sample risk
- Improve stability via a smoothed sample risk (Quirk et al. 05)
12. Find all grids of a feature f_d
- The score of any candidate in GEN(A) is linear in λ_d: λ_d · f_d(W, A) plus the contribution of the other features
- Group candidates with the same value of f_d(W)
- For each group, define the active candidate as the one with the highest value of the remaining score (the sum of the other weighted features)
- Only active candidates can be selected
- This reduces GEN(A) to a list of active candidates
- Find a set of grids for λ_d, within each of which a particular active candidate is selected as W
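A simplified sketch of the grid construction and line search described on the last two slides: for one feature f_d, every candidate's score is a line in λ_d, only the active candidate per f_d value matters, and the top-ranked candidate can only change where two such lines cross; the (piecewise constant) risk is then evaluated once per resulting interval. The O(n^2) pairwise intersection and the midpoint probing are illustrative simplifications, and the code reuses `sample_risk` from the earlier sketch.

```python
def grid_boundaries_for_sample(candidates, weights, d):
    """For one sample, return the values of lambda_d where the top-ranked candidate
    can change.  Each candidate's score is a line in lambda_d:
        score(lambda_d) = lambda_d * f_d(W, A) + rest,
    so the winner only changes where two such lines cross.  Candidates sharing the
    same f_d value are collapsed to the active one (largest rest)."""
    active = {}
    for w, feats in candidates:
        rest = sum(weights[i] * feats[i] for i in range(len(feats)) if i != d)
        slope = feats[d]
        if slope not in active or rest > active[slope]:
            active[slope] = rest
    lines = list(active.items())
    boundaries = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            (s1, r1), (s2, r2) = lines[i], lines[j]
            if s1 != s2:                      # parallel lines never swap rank
                boundaries.append((r2 - r1) / (s1 - s2))
    return boundaries

def line_search(training_set, weights, d):
    """Grid line search on lambda_d: the sample risk is piecewise constant, so it is
    enough to evaluate it once per interval (here at interval midpoints)."""
    points = sorted(set(p for _, cands in training_set
                        for p in grid_boundaries_for_sample(cands, weights, d)))
    if not points:                            # ranking never changes; keep the value
        return weights[d]
    probes = ([points[0] - 1.0] +
              [(a + b) / 2 for a, b in zip(points, points[1:])] +
              [points[-1] + 1.0])
    best, best_risk = weights[d], sample_risk(training_set, weights)
    for value in probes:
        trial = list(weights)
        trial[d] = value
        risk = sample_risk(training_set, trial)
        if risk < best_risk:
            best, best_risk = value, risk
    return best
```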
13. Feature subset selection
- Select a small subset of highly effective features for MSR learning
- Reduces computational complexity
- Ensures the generalization of the model (less likely to overfit)
- Algorithm
- Rank all candidate features by effectiveness
- Select the top N features
- How do we measure the effectiveness of a feature?
14. Measure the effectiveness of a feature
- Expected reduction of the sample risk
- Penalize highly correlated features
- Cross correlation between two features
- Select the k-th feature according to a criterion combining its expected risk reduction with a penalty for correlation with the features already selected
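A hedged sketch of greedy feature subset selection in the spirit of this slide: rank features by expected risk reduction and discount those that correlate highly with features already chosen. The exact way gain and correlation are combined below is an assumption, not the selection formula from the talk.

```python
def select_features(gains, correlations, k):
    """Greedy feature subset selection.  `gains[f]` is the expected reduction of the
    sample risk if only feature f's weight is tuned; `correlations[frozenset((f, g))]`
    is the cross correlation of f and g in [0, 1].  Discounting the gain by the
    maximum correlation with already-selected features is an assumed combination."""
    selected = []
    remaining = set(gains)
    while remaining and len(selected) < k:
        def discounted_gain(f):
            penalty = max((correlations.get(frozenset((f, g)), 0.0) for g in selected),
                          default=0.0)
            return gains[f] * (1.0 - penalty)
        best = max(remaining, key=discounted_gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```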
15. Putting it all together: the MSR algorithm
- Init: set λ_0 = 1 and λ_d = 0 for d = 1 ... D
- Feature selection: rank all features and select the top K features
- For t = 1 ... T (T = total number of iterations)
- For each k = 1 ... K
- Update the parameter of f_k using line search.
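Putting the sketches together, an illustrative end-to-end MSR loop under the same assumptions as above: initialize λ_0 = 1 and all other weights to 0, select the top K features, then cycle coordinate-wise grid line searches until the sample risk stops decreasing. It reuses the hypothetical `select_features`, `line_search`, and `sample_risk` helpers; `gains` and `correlations` are assumed precomputed.

```python
def train_msr(training_set, num_features, gains, correlations,
              top_k=2000, max_iters=10):
    """Illustrative MSR training loop (not the authors' exact implementation)."""
    weights = [0.0] * num_features
    weights[0] = 1.0                              # lambda_0 = 1, all others 0
    features = select_features(gains, correlations, top_k)
    prev_risk = sample_risk(training_set, weights)
    for _ in range(max_iters):
        for d in features:
            weights[d] = line_search(training_set, weights, d)
        risk = sample_risk(training_set, weights)
        if risk >= prev_risk:                     # no further reduction: stop
            break
        prev_risk = risk
    return weights
```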
16. Outline
- Language modeling (LM) and the task of Asian language text input (IME)
- Discriminative training for LM
- Minimum Sample Risk
- Experiments
- Conclusions
17. Settings
- Task: Japanese IME
- Baseline: word trigram model trained on a 400,000-sentence Nikkei Newspaper corpus
- Corpora for discriminative training: newspaper text
- Training: Nikkei (80,000 sentences)
- GEN(A) contains the 20-best candidates ranked by the baseline
- W_R: the oracle best hypothesis
- Dev: Yomiuri (5,000 sentences)
- Test: Yomiuri (5,000 sentences)
- Metric: character error rate (CER)
18. Main results
- MSR selects 2,000 features out of 860K candidates (word unigrams and bigrams)
- Boosting, following (Collins, 2000)
- Perceptron, averaged variant (Collins, 2002)
19. Robustness
- Robustness: convergence and generalization
- Feature selection leads to lower CER (better generalization)
- Considering feature correlation leads to a better feature subset.
20. Domain adaptation
- Discriminative training as a domain adaptation method
- Baseline: linear interpolation
- Described in detail in (Suzuki and Gao 05), to be presented here tomorrow
21. Conclusion
- MSR is a successful discriminative training algorithm for LM and performs comparably with other state-of-the-art methods on the text input task
- MSR directly minimizes the count of training errors without resorting to any approximate loss function
- MSR can handle a large number of features
- Apply MSR to other tasks
- Maximize average precision in IR (Gao et al. 05)
- Maximize the BLEU score in MT (Och 03)
- Parsing and tagging
22. Thanks
23. Optimize an upper-bounded loss function, e.g. maxent, boosting