1
Minimum Sample Risk Methods for Language Modeling
  • Jianfeng Gao¹, Hao Yu²,
  • Wei Yuan², Peng Xu³
  • ¹Microsoft Research
  • ²Shanghai Jiaotong Univ., China
  • ³Johns Hopkins Univ., USA
  • HLT/EMNLP 2005, October 6, 2005

2
Outline
  • Task of Asian language text input and language
    modeling (LM)
  • Discriminative training for LM
  • Minimum Sample Risk (MSR)
  • Experiments
  • Conclusions

3
An example of Chinese text input
mafangnitryyixoazegefanfa
4
Text input test bed for LM
  • Similar to speech recognition
  • A direct test bed for LM
  • no acoustic ambiguity
  • P(A|W) = 1 → easy to convert W to A
  • Easy to obtain large training data for
    discriminative learning
  • Microsoft Research will soon make the Japanese
    text input test data available for research
    purposes

5
LM fundamentals
  • Estimate P(W) using a trigram model
  • P(W) = ∏ P(wn | w1, …, wn-1) ≈ ∏ P(wi | wi-2, wi-1)
  • Estimate P(wi | wi-2, wi-1) using maximum
    likelihood estimation (MLE) with smoothing
  • Issues
  • Maximum likelihood ≠ minimum error rate
  • Difficult to integrate arbitrary linguistic
    features in the generation process
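
A minimal Python sketch (not the presenters' code) of the baseline just described: a word trigram model estimated by MLE. Simple linear interpolation stands in for whichever smoothing method the baseline actually uses, and the class name and interpolation weights are illustrative assumptions.

```python
from collections import defaultdict

class InterpolatedTrigramLM:
    """Word trigram LM: MLE counts, smoothed here by linear interpolation (assumed)."""

    def __init__(self, weights=(0.7, 0.2, 0.1)):
        self.w3, self.w2, self.w1 = weights          # assumed interpolation weights
        self.tri = defaultdict(int)                  # count(w_{i-2}, w_{i-1}, w_i)
        self.ctx = defaultdict(int)                  # count(w_{i-2}, w_{i-1})
        self.bi = defaultdict(int)                   # count(w_{i-1}, w_i)
        self.uni = defaultdict(int)                  # count(w_i)
        self.total = 0

    def train(self, sentences):
        for words in sentences:
            seq = ["<s>", "<s>"] + list(words) + ["</s>"]
            for i in range(2, len(seq)):
                h2, h1, w = seq[i - 2], seq[i - 1], seq[i]
                self.tri[(h2, h1, w)] += 1
                self.ctx[(h2, h1)] += 1
                self.bi[(h1, w)] += 1
                self.uni[w] += 1
                self.total += 1

    def prob(self, w, h1, h2):
        """Interpolated estimate of P(w | h2, h1), where h1 is the previous word."""
        p3 = self.tri[(h2, h1, w)] / self.ctx[(h2, h1)] if self.ctx[(h2, h1)] else 0.0
        p2 = self.bi[(h1, w)] / self.uni[h1] if self.uni[h1] else 0.0
        p1 = self.uni[w] / self.total if self.total else 0.0
        return self.w3 * p3 + self.w2 * p2 + self.w1 * p1
```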

6
Discriminative training for LM
  • Linear model form
  • A set of D+1 arbitrary features
    f(W, A) ∈ R^(D+1), f(W, A) = (f0(W, A), …, fD(W, A))
  • e.g. f0 = log trigram probability
  • e.g. fi (i = 1, …, D) = counts of word n-grams
  • A set of D+1 parameters, one for each feature
    λ = (λ0, λ1, …, λD)
  • Conversion score assigned by the linear model
    Score(W, A) = λ · f(W, A) = Σd λd fd(W, A)
  • Research focus: how to optimize the parameters λ
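
A concrete reading of the linear model form as a short sketch. The function names and the representation of a candidate as (string, feature-vector) pairs are illustrative assumptions, not from the talk.

```python
def linear_score(lambdas, features):
    """Score(W, A) = sum_d lambda_d * f_d(W, A); features[0] is the log trigram
    probability, features[1..D] are e.g. word n-gram counts."""
    return sum(l * f for l, f in zip(lambdas, features))

def convert(lambdas, candidates):
    """Pick the highest-scoring conversion among GEN(A).
    candidates: list of (W_string, feature_vector) pairs."""
    return max(candidates, key=lambda wf: linear_score(lambdas, wf[1]))[0]
```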

7
Optimizing parameters
  • Training criterion (objective function)
  • Easy to optimize
  • Tightly coupled with character error rate (CER)
  • Optimization algorithm: search for the optimum
  • Numerical methods
  • Heuristic methods
  • Error rate is a step function and cannot be
    optimized easily
  • Two strategies for optimizing parameters
  • Existing methods: minimize an easily-optimized
    loss function
  • an approximation (upper bound) of CER
  • Our method: directly minimize CER using heuristic
    methods

8
Outline
  • Language modeling (LM) and the task of Asian
    language text input (IME)
  • Discriminative training for LM
  • Minimum Sample Risk
  • Experiments
  • Conclusions

9
MSR (Minimum Sample Risk)
  • Each training sample is a pair (A, WR): an input A
    and its reference conversion WR
  • The converted W is determined by ranking the
    candidates in GEN(A) with the linear model score
  • Sample risk is defined as the edit distance
    between W and WR, denoted Er(WR, W)
  • The goal of MSR is to minimize the total sample
    risk over all training samples
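
A sketch of the two quantities defined here: character edit distance Er(WR, W) and the total sample risk over a training set. The function names and the (WR, candidates) sample representation are illustrative.

```python
def edit_distance(hyp, ref):
    """Character-level Levenshtein distance Er(ref, hyp)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def sample_risk(lambdas, samples):
    """Total edit distance over the training set; for each sample the converted W
    is the top-scoring candidate in GEN(A) under the current lambdas.
    samples: list of (W_R, candidates) with candidates = [(W, feature_vector), ...]."""
    total = 0
    for w_ref, candidates in samples:
        best, _ = max(candidates,
                      key=lambda wf: sum(l * f for l, f in zip(lambdas, wf[1])))
        total += edit_distance(best, w_ref)
    return total
```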

10
MSR algorithm
  • Multi-dimensional function optimization algorithm
  • Take λ1, λ2, …, λD as a set of directions.
  • Using line search, move along the first direction
    to minimize sample risk.
  • Move from there along the second direction to its
    minimum.
  • Cycle through the whole set of directions as many
    times as necessary, until the sample risk stops
    decreasing.
  • Challenges
  • Line search on a step function → efficient grid
    line search
  • Very large feature set → feature subset selection

11
Grid line search overview
  • One-dimensional optimization algorithm
  • Grid search
  • a grid is an interval of parameter values that
    map to the same sample risk

Improving stability via smoothed sample risk
(Quirk et al., 2005)
12
Find all grids of a feature fd
  • Score of any candidate W in GEN(A):
    Score(W, A) = λd fd(W) + G(W), where G(W) is the
    contribution of all other features
  • Group candidates with the same value of fd(W)
  • For each group, define the active candidate as the
    one with the highest value of G(W)
  • Only active candidates can be selected
  • reduces GEN(A) to a list of active candidates
  • Find a set of grids for λd, within each of which a
    particular active candidate will be selected as W
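
A sketch in the spirit of this slide, under simplifying assumptions: for each sample, every candidate's score is a line λd·fd(W) + G(W) in λd, the crossing points of these lines are collected as grid boundaries, and the sample risk is probed once per resulting interval. It skips the grouping and active-candidate pruning described above, so it is illustrative rather than the paper's exact procedure; all names are made up for the example.

```python
import itertools

def grid_line_search(d, lambdas, samples, risk_at):
    """One-dimensional grid search over lambda_d.
    samples: list of (W_R, candidates), candidates = [(W, feature_vector), ...]
    risk_at: callable returning the total sample risk with lambda_d set to a value."""
    boundaries = set()
    for _w_ref, candidates in samples:
        # Each candidate's score is a line in lambda_d: slope f_d(W), intercept G(W).
        lines = []
        for _w, feats in candidates:
            rest = sum(l * f for i, (l, f) in enumerate(zip(lambdas, feats)) if i != d)
            lines.append((feats[d], rest))
        for (a1, b1), (a2, b2) in itertools.combinations(lines, 2):
            if a1 != a2:                      # parallel lines never change rank order
                boundaries.add((b2 - b1) / (a1 - a2))
    xs = sorted(boundaries)
    if not xs:                                # feature never discriminates: keep current value
        return lambdas[d]
    # Probe one lambda_d value inside every interval between consecutive boundaries.
    probes = [xs[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return min(probes, key=risk_at)
```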

13
Feature subset selection
  • Select a small subset of highly effective
    features for MSR learning
  • Reduce computational complexity
  • Ensure the generalization of the model (less
    likely to overfit)
  • Algorithm
  • Rank all candidate features by effectiveness
  • Select top N features
  • How to measure the effectiveness of a feature?

14
Measure the effectiveness of a feature
  • Expected reduction of sample risk
  • Penalize highly correlated features
  • Cross correlation between two features
  • Select the k-th feature by its expected risk
    reduction, discounted by its correlation with the
    features already selected
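
A hedged sketch of the greedy selection this slide describes. The exact combination rule and the beta discount are assumptions; the talk only states that a feature's expected risk reduction is penalized by its correlation with features already chosen.

```python
def select_features(risk_reduction, correlation, num_features, beta=0.5):
    """Greedily pick features by expected sample-risk reduction, discounted by
    correlation with the features selected so far (beta is an assumed knob).
    risk_reduction: dict feature_id -> expected risk reduction when optimized alone
    correlation:    callable (f, g) -> cross correlation in [0, 1]"""
    remaining = set(risk_reduction)
    selected = []
    while remaining and len(selected) < num_features:
        def gain(f):
            penalty = max((correlation(f, g) for g in selected), default=0.0)
            return risk_reduction[f] * (1.0 - beta * penalty)
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```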

15
Putting it all together: the MSR algorithm
  • Init: set λ0 = 1 and λd = 0 for d = 1…D
  • Feature selection: rank all features and select
    the top K features
  • For t = 1…T (T = total number of iterations)
  • For each k = 1…K
  • Update the parameter of fk using line search.
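
Putting the pieces together, a sketch of the top-level loop on this slide. It reuses the sample_risk, grid_line_search, and select_features sketches from the earlier slides, and assumes the risk_reduction and correlation inputs for feature selection are precomputed.

```python
def train_msr(samples, num_params, risk_reduction, correlation, K, T):
    """MSR training loop: lambda_0 = 1 (weight of the log trigram feature f_0),
    all other weights 0; select the top K features, then make T passes of
    coordinate-wise grid line search over the selected features."""
    lambdas = [0.0] * num_params
    lambdas[0] = 1.0
    selected = select_features(risk_reduction, correlation, K)
    for _t in range(T):
        for d in selected:
            def risk_at(v, d=d):
                trial = list(lambdas)
                trial[d] = v
                return sample_risk(trial, samples)   # sample risk with lambda_d = v
            lambdas[d] = grid_line_search(d, lambdas, samples, risk_at)
    return lambdas
```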

16
Outline
  • Language modeling (LM) and the task of Asian
    language text input (IME)
  • Discriminative training for LM
  • Minimum Sample Risk
  • Experiments
  • Conclusions

17
Settings
  • Task: Japanese IME
  • Baseline: word trigram model trained on a
    400,000-sentence Nikkei newspaper corpus
  • Corpora for discriminative training: newspaper
  • Training: Nikkei (80,000 sentences)
  • GEN(A) contains the 20-best candidates ranked by
    the baseline
  • WR = oracle best hypothesis
  • Dev: Yomiuri (5,000 sentences)
  • Test: Yomiuri (5,000 sentences)
  • Metric: character error rate (CER)

18
Main results
  • MSR selects 2,000 features out of 860K candidate
    features (word unigrams/bigrams)
  • Boosting, according to (Collins, 2000)
  • Perceptron, averaged variant (Collins, 2002)

19
Robustness
  • Robustness = convergence + generalization
  • MSR converges
  • Feature selection leads to lower CER (better
    generalization)
  • Considering feature correlation leads to a better
    feature subset.

20
Domain adaptation
  • Discriminative training as a domain adaptation
    method
  • Baseline: linear interpolation
  • Described in detail in (Suzuki and Gao, 2005), to
    be presented here tomorrow

21
Conclusion
  • MSR is a successful discriminative training
    algorithm for LM and performs comparably with
    other state-of-the-art methods on the task of
    text input
  • MSR directly minimizes the count of training
    errors without resorting to any approximate loss
    function
  • MSR can handle a large number of features
  • Apply MSR to other tasks
  • Maximize average precision in IR (Gao et al., 2005)
  • Maximize BLEU score in MT (Och, 2003)
  • Parsing and tagging

22
Thanks
23
Optimize an upper-bounded loss function, e.g.
maxent, boosting