1. Minimum Sample Risk Methods for Language Modeling
- Jianfeng Gao (1), Hao Yu (2), Wei Yuan (2), Peng Xu (3)
- (1) Microsoft Research
- (2) Shanghai Jiaotong Univ., China
- (3) Johns Hopkins Univ., USA
- HLT/EMNLP 2005, October 6, 2005
2. Outline
- Task of Asian language text input and language modeling (LM)
- Discriminative training for LM
- Minimum Sample Risk (MSR)
- Experiments
- Conclusions
3. An example of Chinese text input
mafangnitryyixoazegefanfa
4. Text input test bed for LM
- A direct test bed for LM
- No acoustic ambiguity
- P(A|W) ≈ 1, so it is easy to convert W to A
- Easy to obtain large training data for discriminative learning
- Microsoft Research will soon make the Japanese text input test data available for research purposes
5. LM fundamentals
- Estimate P(W) using a trigram model
- P(W) = ∏ P(w_i | w_1, ..., w_{i-1}) ≈ ∏ P(w_i | w_{i-2}, w_{i-1})
- Estimate P(w_i | w_{i-2}, w_{i-1}) using maximum likelihood estimation (MLE) with smoothing
- Issues
- Maximum likelihood ≠ minimum error rate
- Difficult to integrate arbitrary linguistic features into the generation process
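A minimal sketch of the baseline ingredient above: a trigram model estimated from MLE counts. The `TrigramLM` class, the padding symbols, and the add-one smoothing are illustrative assumptions, not the smoothing scheme used in the talk's baseline.

```python
from collections import defaultdict
import math

class TrigramLM:
    """Trigram LM with MLE counts; add-one smoothing here is purely illustrative."""

    def __init__(self, vocab_size):
        self.tri = defaultdict(int)   # counts of (w_{i-2}, w_{i-1}, w_i)
        self.bi = defaultdict(int)    # counts of (w_{i-2}, w_{i-1})
        self.vocab_size = vocab_size

    def train(self, sentences):
        for words in sentences:
            padded = ["<s>", "<s>"] + list(words) + ["</s>"]
            for i in range(2, len(padded)):
                self.tri[tuple(padded[i - 2:i + 1])] += 1
                self.bi[tuple(padded[i - 2:i])] += 1

    def log_prob(self, w, u, v):
        # P(w | u, v): MLE count ratio with add-one smoothing
        num = self.tri[(u, v, w)] + 1
        den = self.bi[(u, v)] + self.vocab_size
        return math.log(num / den)

    def sentence_log_prob(self, words):
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        return sum(self.log_prob(padded[i], padded[i - 2], padded[i - 1])
                   for i in range(2, len(padded)))
```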
6. Discriminative training for LM
- Linear model form
- A set of D+1 arbitrary features
- f(W, A) ∈ R^{D+1}, f(W, A) = (f_0(W, A), ..., f_D(W, A))
- e.g. f_0 = log trigram probability
- e.g. f_i (i = 1, ..., D) = counts of word n-grams
- A set of D+1 parameters, one for each feature
- λ = (λ_0, λ_1, ..., λ_D)
- Conversion score assigned by the linear model: Score(W, A) = λ · f(W, A)
- Research focus: how to optimize the parameters λ
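As a rough illustration of the linear model above, the sketch below scores each candidate conversion by the dot product λ · f(W, A) and selects the highest-scoring one. The function names and the (W, feature_vector) candidate layout are hypothetical.

```python
def linear_score(features, weights):
    """Score(W, A) = sum_d weights[d] * features[d]; features[0] is the log trigram
    probability, the remaining entries are word n-gram counts."""
    return sum(l * f for l, f in zip(weights, features))

def convert(candidates, weights):
    """Return the candidate W in GEN(A) with the highest linear score.
    `candidates` is a list of (W, feature_vector) pairs (hypothetical layout)."""
    return max(candidates, key=lambda cand: linear_score(cand[1], weights))[0]
```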
7. Optimizing parameters
- Training criterion (objective function)
- Easy to optimize
- Tightly coupled with character error rate (CER)
- Optimization algorithm: search for the optimum
- Numerical methods
- Heuristic methods
- Error rate is a step function and cannot be optimized easily
- Two strategies for optimizing parameters
- Existing methods minimize an easily-optimized loss function, an approximation (upper bound) of CER
- Our method directly minimizes CER using heuristic methods
8. Outline
- Language modeling (LM) and the task of Asian language text input (IME)
- Discriminative training for LM
- Minimum Sample Risk
- Experiments
- Conclusions
9. MSR (Minimum Sample Risk)
- Assume each training sample is (A, W_R), where W_R is the reference transcript
- Assume the converted W is determined by ranking the candidates in GEN(A) by the linear model score
- Sample risk is defined as the edit distance between W and W_R, denoted Er(W_R, W)
- The goal of MSR is to minimize the sample risk over all training samples
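The sample risk follows directly from the definitions on this slide: decode each training sample with the current weights and sum the character edit distances to the references. A minimal sketch, reusing the hypothetical `convert` helper from the earlier sketch and assuming a (W_R, candidates) layout for the training data.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance, i.e. Er(W_R, W)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                # deletion
                          d[i][j - 1] + 1,                                # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))   # substitution
    return d[len(ref)][len(hyp)]

def sample_risk(training_set, weights):
    """Sum of edit distances between the selected candidates and the references.
    `training_set` is a list of (W_R, candidates) pairs."""
    return sum(edit_distance(ref, convert(candidates, weights))
               for ref, candidates in training_set)
```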
10. MSR algorithm
- Multi-dimensional function optimization algorithm
- Take λ_1, λ_2, ..., λ_D as a set of search directions.
- Using line search, move along the first direction to minimize the sample risk.
- Move from there along the second direction to its minimum.
- Cycle through the whole set of directions as many times as necessary, until the sample risk stops decreasing.
- Challenges
- Line search on a step function: efficient grid line search
- Very large feature set: feature subset selection
11. Grid line search: overview
- One-dimensional optimization algorithm
- Grid search
- A grid is an interval of parameter values that map to the same sample risk
- Improve stability via a smoothed sample risk (Quirk et al. 05)
12. Find all grids of a feature f_d
- The score of any candidate in GEN(A) is linear in λ_d: λ_d · f_d(W, A) plus the contribution of the other features
- Group candidates with the same value of f_d(W)
- For each group, define the active candidate as the one with the highest value of the remaining score (the sum of the other weighted features)
- Only active candidates can be selected
- This reduces GEN(A) to a list of active candidates
- Find a set of grids for λ_d, within each of which a particular active candidate is selected as W
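A simplified sketch of the grid construction and line search described on the last two slides: for one feature f_d, every candidate's score is a line in λ_d, only the active candidate per f_d value matters, and the top-ranked candidate can only change where two such lines cross; the (piecewise constant) risk is then evaluated once per resulting interval. The O(n^2) pairwise intersection and the midpoint probing are illustrative simplifications, and the code reuses `sample_risk` from the earlier sketch.

```python
def grid_boundaries_for_sample(candidates, weights, d):
    """For one sample, return the values of lambda_d where the top-ranked candidate
    can change.  Each candidate's score is a line in lambda_d:
        score(lambda_d) = lambda_d * f_d(W, A) + rest,
    so the winner only changes where two such lines cross.  Candidates sharing the
    same f_d value are collapsed to the active one (largest rest)."""
    active = {}
    for w, feats in candidates:
        rest = sum(weights[i] * feats[i] for i in range(len(feats)) if i != d)
        slope = feats[d]
        if slope not in active or rest > active[slope]:
            active[slope] = rest
    lines = list(active.items())
    boundaries = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            (s1, r1), (s2, r2) = lines[i], lines[j]
            if s1 != s2:                      # parallel lines never swap rank
                boundaries.append((r2 - r1) / (s1 - s2))
    return boundaries

def line_search(training_set, weights, d):
    """Grid line search on lambda_d: the sample risk is piecewise constant, so it is
    enough to evaluate it once per interval (here at interval midpoints)."""
    points = sorted(set(p for _, cands in training_set
                        for p in grid_boundaries_for_sample(cands, weights, d)))
    if not points:                            # ranking never changes; keep the value
        return weights[d]
    probes = ([points[0] - 1.0] +
              [(a + b) / 2 for a, b in zip(points, points[1:])] +
              [points[-1] + 1.0])
    best, best_risk = weights[d], sample_risk(training_set, weights)
    for value in probes:
        trial = list(weights)
        trial[d] = value
        risk = sample_risk(training_set, trial)
        if risk < best_risk:
            best, best_risk = value, risk
    return best
```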
13. Feature subset selection
- Select a small subset of highly effective features for MSR learning
- Reduces computational complexity
- Ensures the generalization of the model (less likely to overfit)
- Algorithm
- Rank all candidate features by effectiveness
- Select the top N features
- How do we measure the effectiveness of a feature?
14. Measure the effectiveness of a feature
- Expected reduction of the sample risk
- Penalize highly correlated features
- Cross correlation between two features
- Select the k-th feature according to a criterion combining its expected risk reduction with a penalty for correlation with the features already selected
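A hedged sketch of greedy feature subset selection in the spirit of this slide: rank features by expected risk reduction and discount those that correlate highly with features already chosen. The exact way gain and correlation are combined below is an assumption, not the selection formula from the talk.

```python
def select_features(gains, correlations, k):
    """Greedy feature subset selection.  `gains[f]` is the expected reduction of the
    sample risk if only feature f's weight is tuned; `correlations[frozenset((f, g))]`
    is the cross correlation of f and g in [0, 1].  Discounting the gain by the
    maximum correlation with already-selected features is an assumed combination."""
    selected = []
    remaining = set(gains)
    while remaining and len(selected) < k:
        def discounted_gain(f):
            penalty = max((correlations.get(frozenset((f, g)), 0.0) for g in selected),
                          default=0.0)
            return gains[f] * (1.0 - penalty)
        best = max(remaining, key=discounted_gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```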
15. Putting it all together: the MSR algorithm
- Init: set λ_0 = 1 and λ_d = 0 for d = 1 ... D
- Feature selection: rank all features and select the top K features
- For t = 1 ... T (T = total number of iterations)
- For each k = 1 ... K
- Update the parameter of f_k using line search.
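Putting the sketches together, an illustrative end-to-end MSR loop under the same assumptions as above: initialize λ_0 = 1 and all other weights to 0, select the top K features, then cycle coordinate-wise grid line searches until the sample risk stops decreasing. It reuses the hypothetical `select_features`, `line_search`, and `sample_risk` helpers; `gains` and `correlations` are assumed precomputed.

```python
def train_msr(training_set, num_features, gains, correlations,
              top_k=2000, max_iters=10):
    """Illustrative MSR training loop (not the authors' exact implementation)."""
    weights = [0.0] * num_features
    weights[0] = 1.0                              # lambda_0 = 1, all others 0
    features = select_features(gains, correlations, top_k)
    prev_risk = sample_risk(training_set, weights)
    for _ in range(max_iters):
        for d in features:
            weights[d] = line_search(training_set, weights, d)
        risk = sample_risk(training_set, weights)
        if risk >= prev_risk:                     # no further reduction: stop
            break
        prev_risk = risk
    return weights
```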
16. Outline
- Language modeling (LM) and the task of Asian language text input (IME)
- Discriminative training for LM
- Minimum Sample Risk
- Experiments
- Conclusions
17. Settings
- Task: Japanese IME
- Baseline: word trigram model trained on a 400,000-sentence Nikkei Newspaper corpus
- Corpora for discriminative training: newspaper text
- Training: Nikkei (80,000 sentences)
- GEN(A) contains the 20-best candidates ranked by the baseline
- W_R: the oracle best hypothesis
- Dev: Yomiuri (5,000 sentences)
- Test: Yomiuri (5,000 sentences)
- Metric: character error rate (CER)
18. Main results
- MSR selects 2,000 features out of 860K candidates (word unigrams and bigrams)
- Boosting, following (Collins, 2000)
- Perceptron, averaged variant (Collins, 2002)
19. Robustness
- Robustness: convergence and generalization
- Feature selection leads to lower CER (better generalization)
- Considering feature correlation leads to a better feature subset.
20. Domain adaptation
- Discriminative training as a domain adaptation method
- Baseline: linear interpolation
- Described in detail in (Suzuki and Gao 05), to be presented here tomorrow
21. Conclusion
- MSR is a successful discriminative training algorithm for LM and performs comparably with other state-of-the-art methods on the text input task
- MSR directly minimizes the count of training errors without resorting to any approximate loss function
- MSR can handle a large number of features
- Apply MSR to other tasks
- Maximize average precision in IR (Gao et al. 05)
- Maximize the BLEU score in MT (Och 03)
- Parsing and tagging
22. Thanks
23. Optimize an upper-bounded loss function, e.g. maxent, boosting