1
Final review
  • LING 572
  • Fei Xia
  • 03/07/06

2
Misc
  • Parts 3 and 4 were due at 6am today.
  • Presentation: email me the slides by 6am on 3/9.
  • Final report: email me by 6am on 3/14.
  • Group meetings: 1:30-4:00pm on 3/16.

3
Outline
  • Main topics
  • Applying to NLP tasks
  • Tricks

4
Main topics
5
Main topics
  • Supervised learning
  • Decision tree
  • Decision list
  • TBL
  • MaxEnt
  • Boosting
  • Semi-supervised learning
  • Self-training
  • Co-training
  • EM
  • Co-EM

6
Main topics (cont)
  • Unsupervised learning
  • The EM algorithm
  • The EM algorithm for PM models
  • Forward-backward
  • Inside-outside
  • IBM models for MT
  • Others
  • Two dynamic models: FSA and HMM
  • Re-sampling: bootstrap
  • System combination
  • Bagging

7
Main topics (cont)
  • Homework
  • Hw1: FSA and HMM
  • Hw2: DT, DL, CNF, DNF, and TBL
  • Hw3: Boosting
  • Project
  • P1: Trigram (learn to use Carmel, relation
    between HMM and FSA)
  • P2: TBL
  • P3: MaxEnt
  • P4: Bagging, boosting, system combination, SSL

8
Supervised learning
9
A classification problem
District  House type     Income  Previous customer  Outcome
Suburban  Detached       High    No                 Nothing
Suburban  Semi-detached  High    Yes                Respond
Rural     Semi-detached  Low     No                 Respond
Urban     Detached       Low     Yes                Nothing

10
Classification and estimation problems
  • Given
  • x: input attributes
  • y: the goal
  • training data: a set of (x, y) pairs
  • Predict y given a new x
  • y is a discrete variable → classification problem
  • y is a continuous variable → estimation problem

11
Five ML methods
  • Decision tree
  • Decision list
  • TBL
  • Boosting
  • MaxEnt

12
Decision tree
  • Modeling: tree representation
  • Training: top-down induction, greedy algorithm
  • Decoding: find the path from the root to a leaf node,
    where the tests along the path are satisfied.

13
Decision tree (cont)
  • Main algorithms: ID3, C4.5, CART
  • Strengths
  • Ability to generate understandable rules
  • Ability to clearly indicate the best attributes
  • Weaknesses
  • Data splitting
  • Trouble with non-rectangular regions
  • The instability of top-down induction → bagging

14
Decision list
  • Modeling: a list of decision rules
  • Training: greedy, iterative algorithm
  • Decoding: find the 1st rule that applies
  • Each decision is based on a single piece of
    evidence, in contrast to MaxEnt, boosting, and TBL
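
Decoding a decision list is just a scan for the first matching rule; a minimal sketch, with hypothetical single-evidence tests:

```python
def dl_classify(rules, default, x):
    """Decision-list decoding: return the label of the first rule whose
    single test matches x; otherwise return the default label."""
    for (attr, value), label in rules:  # each test checks one piece of evidence
        if x.get(attr) == value:
            return label
    return default

# Hypothetical rules, ordered from most to least reliable.
rules = [(("income", "Low"), "Respond"),
         (("district", "Suburban"), "Respond")]
print(dl_classify(rules, "Nothing", {"district": "Urban", "income": "Low"}))   # Respond
print(dl_classify(rules, "Nothing", {"district": "Urban", "income": "High"}))  # Nothing
```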

15
TBL
  • Modeling: a list of transformations (similar to
    decision rules)
  • Training
  • Greedy, iterative algorithm
  • The concept of a current state
  • Decoding: apply every transformation to the data
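
TBL decoding can be sketched as below, assuming hypothetical word-triggered tag transformations; note that, unlike a decision list, every rule is applied, in order, over the whole current state:

```python
def tbl_decode(transformations, tokens, initial_tag):
    """TBL decoding: start from an initial labeling (the 'current state')
    and apply every transformation, in order, to the whole sequence."""
    tags = [initial_tag] * len(tokens)
    for trigger_word, from_tag, to_tag in transformations:
        for i, w in enumerate(tokens):
            if w == trigger_word and tags[i] == from_tag:
                tags[i] = to_tag
    return tags

# Hypothetical POS-style transformations: (trigger word, old tag, new tag).
rules = [("run", "N", "V"), ("the", "N", "DET")]
print(tbl_decode(rules, ["the", "run"], "N"))  # ['DET', 'V']
```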

16
TBL (cont)
  • Strengths
  • Minimizes the error rate directly
  • Ability to handle non-classification problems
  • Dynamic problems: POS tagging
  • Non-classification problems: parsing
  • Weaknesses
  • Transformations are hard to interpret, as they
    interact with one another
  • Probabilistic TBL: TBL-DT

17
Boosting
[Diagram: the training sample feeds the learner (ML) to produce f1; a weighted
sample feeds ML to produce f2, and so on through fT; the weak classifiers
f1, …, fT are combined into f.]

18
Boosting (cont)
  • Modeling: combining a set of weak classifiers to
    produce a powerful committee
  • Training: learn one weak classifier at each iteration
  • Decoding: use the weighted majority vote of the
    weak classifiers

19
Boosting (cont)
  • Strengths
  • It comes with a set of theoretical guarantees
    (e.g., bounds on training error and test error).
  • It only needs to find weak classifiers.
  • Weaknesses
  • It is susceptible to noise.
  • The actual performance depends on the data and
    the base learner.

20
MaxEnt
The task: find p s.t.
  p = argmax_{p' ∈ P} H(p'),
where
  P = { p' : E_{p'} f_j = E_{p̃} f_j for all j }
If p exists, it has the form shown on the next slide.
21
MaxEnt (cont)
  • If p exists, then

where
22
MaxEnt (cont)
  • Training GIS, IIS
  • Feature selection
  • Greedy algorithm
  • Select one (or more) at a time
  • In general, MaxEnt achieves good performance on
    many NLP tasks.
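
The conditional form p(y|x) = exp(Σ_j λ_j f_j(x, y)) / Z(x) is straightforward to compute once the weights are trained; a sketch with hypothetical binary features and weights:

```python
import math

def maxent_prob(weights, features, x, labels):
    """Conditional MaxEnt: p(y|x) = exp(sum_j lambda_j f_j(x, y)) / Z(x).
    `features` maps feature names to binary functions f_j(x, y)."""
    scores = {y: math.exp(sum(weights[j] * f(x, y) for j, f in features.items()))
              for y in labels}
    z = sum(scores.values())  # normalizer Z(x)
    return {y: s / z for y, s in scores.items()}

# Hypothetical features for a two-label toy task; x is a set of observed cues.
features = {"f1": lambda x, y: 1.0 if ("low" in x and y == "Respond") else 0.0,
            "f2": lambda x, y: 1.0 if y == "Nothing" else 0.0}
weights = {"f1": 1.2, "f2": 0.4}
p = maxent_prob(weights, features, {"low"}, ["Respond", "Nothing"])
print(p)  # probabilities sum to 1; "Respond" gets the higher mass
```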

23
Common issues
  • Objective function / Quality measure
  • DT, DL e.g., information gain
  • TBL, Boosting minimize training errors
  • MaxEnt maximize entropy while satisfying
    constraints

24
Common issues (cont)
  • Avoiding overfitting
  • Use development data
  • Two strategies
  • stop early
  • post-pruning

25
Common issues (cont)
  • Missing attribute values
  • Assume a blank value
  • Assign the most common value among all similar
    examples in the training data
  • (DL, DT) Assign a fraction of the example to each
    possible class
  • Continuous-valued attributes
  • Choose thresholds by checking the training data

26
Common issues (cont)
  • Attributes with different costs
  • DT: change the quality measure to include the
    costs
  • Continuous-valued goal attribute
  • DT, DL: each leaf node is marked with a real
    value or a linear function
  • TBL, MaxEnt, boosting: ??

27
Comparison of supervised learners
                 DT          DL                     TBL                              Boosting                      MaxEnt
Probabilistic?   PDT         PDL                    TBL-DT                           Confidence                    Yes
Parametric?      No          No                     No                               No                            Yes
Representation   Tree        Ordered list of rules  Ordered list of transformations  List of weighted classifiers  List of weighted features
Each iteration   Attribute   Rule                   Transformation                   Classifier weight             Feature weight
Data processing  Split data  Split data             Change cur_y                     Reweight (x, y)               None
Decoding         Path        1st rule               Sequence of rules                Calc f(x)                     Calc f(x)
28
Semi-supervised Learning
29
Semi-supervised learning
  • Each learning method makes some assumptions about
    the problem.
  • SSL works when those assumptions are satisfied.
  • SSL could degrade the performance when mistakes
    reinforce themselves.

30
SSL (cont)
  • We have covered four methods: self-training,
    co-training, EM, and co-EM
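
Self-training, the simplest of the four, can be sketched as a loop that promotes confident predictions into the labeled pool and retrains; the base learner and confidence scheme below are toy assumptions:

```python
def self_train(fit, labeled, unlabeled, threshold=0.8, rounds=10):
    """Self-training sketch: train on the labeled pool, move confidently
    predicted unlabeled examples in, and retrain until nothing qualifies.
    `fit(data)` returns predict(x) -> (label, confidence)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        predict = fit(labeled)
        scored = [(x, *predict(x)) for x in unlabeled]
        confident = [(x, y) for x, y, c in scored if c >= threshold]
        if not confident:
            break
        labeled += confident          # mistakes here can reinforce themselves
        moved = {x for x, _ in confident}
        unlabeled = [x for x in unlabeled if x not in moved]
    return fit(labeled)

# Toy base learner: nearest labeled point on a line; confidence decays with distance.
def fit(data):
    def predict(x):
        nx, ny = min(data, key=lambda p: abs(p[0] - x))
        return ny, 1.0 / (1.0 + abs(nx - x))
    return predict

model = self_train(fit, [(0.0, "A"), (10.0, "B")], [2.0, 4.0, 6.0, 8.0], threshold=0.3)
print(model(4.5)[0])  # "A": labels propagated inward from the two seeds
```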

31
Co-training
  • The original paper (Blum and Mitchell, 1998)
  • Two independent views: split the features into
    two sets
  • Train a classifier on each view.
  • Each classifier labels data that can be used to
    train the other classifier.
  • Extensions
  • Relax the conditional independence assumptions
  • Instead of using two views, use two or more
    classifiers trained on the whole feature set.

32
Unsupervised learning
33
Unsupervised learning
  • EM is a method of estimating parameters in the
    MLE framework.
  • It finds a sequence of parameters that improve
    the likelihood of the training data.

34
The EM algorithm
  • Start with an initial estimate θ^0
  • Repeat until convergence
  • E-step: calculate
      Q(θ | θ^t) = E_{z | x, θ^t} [ log P(x, z | θ) ]
  • M-step: find
      θ^{t+1} = argmax_θ Q(θ | θ^t)
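
A concrete instance of the two steps, for a toy mixture of two biased coins (each sequence of flips comes from one coin, chosen by a hidden uniform draw; the data and starting estimates are made up):

```python
def em_two_coins(flips, p0=(0.6, 0.5), iters=20):
    """EM sketch for a two-coin mixture: estimate both biases from
    sequences of flips whose generating coin is hidden."""
    pa, pb = p0
    for _ in range(iters):
        counts_a = counts_b = heads_a = heads_b = 0.0
        for seq in flips:
            h, n = sum(seq), len(seq)
            like_a = (pa ** h) * ((1 - pa) ** (n - h))
            like_b = (pb ** h) * ((1 - pb) ** (n - h))
            ra = like_a / (like_a + like_b)  # E-step: responsibility of coin A
            heads_a += ra * h
            counts_a += ra * n
            heads_b += (1 - ra) * h
            counts_b += (1 - ra) * n
        # M-step: re-estimate each bias from the expected counts.
        pa, pb = heads_a / counts_a, heads_b / counts_b
    return pa, pb

data = [[1, 1, 1, 1, 0], [1, 1, 1, 0, 1], [0, 0, 1, 0, 0], [0, 1, 0, 0, 0]]
pa, pb = em_two_coins(data)
print(round(pa, 2), round(pb, 2))  # one coin ends up near 0.8, the other near 0.2
```

Each iteration provably does not decrease the likelihood of the training data, which is exactly the guarantee on slide 33.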

35
The EM algorithm (cont)
  • The optimal solution for the M-step exists for
    many classes of problems.
  • ⇒ A number of well-known methods are special
    cases of EM:
  • The EM algorithm for PM models
  • Forward-backward algorithm
  • Inside-outside algorithm

36
Other topics
37
FSA and HMM
  • Two types of HMMs
  • State-emission and arc-emission HMMs
  • They are equivalent
  • We can convert an HMM into a WFA
  • Modeling: Markov assumption
  • Training
  • Supervised: counting
  • Unsupervised: forward-backward algorithm
  • Decoding: Viterbi algorithm
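
Viterbi decoding for a toy state-emission HMM can be sketched as follows (the states, probabilities, and words are hypothetical):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding: best state sequence for an observation sequence,
    by dynamic programming over path probabilities."""
    # delta[s] = probability of the best path so far that ends in state s
    delta = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] * trans_p[r][s])
            delta[s] = prev[best_prev] * trans_p[best_prev][s] * emit_p[s][o]
            ptr[s] = best_prev
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

# Hypothetical two-state POS-ish toy model.
states = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"dog": 0.6, "runs": 0.1}, "V": {"dog": 0.1, "runs": 0.7}}
print(viterbi(["dog", "runs"], states, start, trans, emit))  # ['N', 'V']
```

In practice the products are computed as sums of log probabilities to avoid underflow.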

38
Bootstrap
[Diagram: each of B bootstrap samples feeds the learner (ML) to produce
f1, f2, …, fB, which are combined into f.]
39
Bootstrap (cont)
  • A method of re-sampling
  • One original sample → B bootstrap samples
  • It has a strong mathematical background.
  • It is a method for estimating standard errors,
    bias, and so on.
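
Estimating a standard error with B bootstrap samples can be sketched as below (the data values are made up):

```python
import random
import statistics

def bootstrap_se(sample, stat, b=1000, seed=0):
    """Bootstrap estimate of the standard error of a statistic: draw B
    resamples with replacement, compute the statistic on each, and take
    the standard deviation of those B replicate values."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = [stat([sample[rng.randrange(n)] for _ in range(n)])
                  for _ in range(b)]
    return statistics.stdev(replicates)

data = [2.1, 2.5, 2.2, 3.0, 2.8, 2.4, 2.6, 2.9]
print(bootstrap_se(data, statistics.mean))  # close to sample sd / sqrt(n)
```

The same resampling scheme also underlies bagging (slide 42) and significance testing of system differences.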

40
System combination
41
System combination (cont)
  • Hybridization: combine substructures to produce a
    new one
  • Voting
  • Naïve Bayes
  • Switching: choose one of the f_i(x)
  • Similarity switching
  • Naïve Bayes

42
Bagging
[Diagram: bagging = bootstrap + system combination; each of B bootstrap
samples feeds the learner (ML) to produce f1, f2, …, fB, which are
combined into f.]
43
Bagging (cont)
  • It is effective for unstable learning methods
  • Decision trees
  • Regression trees
  • Neural networks
  • It does not help stable learning methods
  • k-nearest neighbors
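
Bagging as bootstrap plus majority-vote combination, with a deliberately unstable toy "stump" learner (all names and data are illustrative):

```python
import random
from collections import Counter

def bag(train, data, x, b=25, seed=0):
    """Bagging sketch: train one classifier per bootstrap sample and
    combine them by unweighted majority vote.
    `train(sample)` returns a function label = f(x)."""
    rng = random.Random(seed)
    n = len(data)
    models = [train([data[rng.randrange(n)] for _ in range(n)])
              for _ in range(b)]
    votes = Counter(f(x) for f in models)
    return votes.most_common(1)[0][0]

# Toy unstable learner: a one-threshold stump whose threshold is the
# smallest positive example it happens to see, so it varies per resample.
def train(sample):
    pos = [xi for xi, y in sample if y == 1]
    threshold = min(pos) if pos else float("inf")
    return lambda x: 1 if x >= threshold else 0

data = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]
print(bag(train, data, 8))  # 1
print(bag(train, data, 0))  # 0
```

Averaging over the bootstrap-trained stumps smooths out the instability of any single resample, which is why bagging helps unstable learners and not stable ones.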

44
Relations
45
Relations
  • WFSA and HMM
  • DL, DT, TBL
  • EM, EM for PM

46
WFSA and HMM
[Diagram: converting an HMM into a WFSA.]
Add a Start state and a transition from Start
to every state in the HMM. Add a Finish state and a
transition from every state in the HMM to Finish.
47
DL, CNF, DNF, DT, TBL
[Diagram: relations among k-DL, k-CNF, k-DNF, k-DT, and k-TBL.]
48
The EM algorithm
[Diagram: the generalized EM algorithm subsumes the EM algorithm, which in
turn subsumes the EM algorithm for PM models (inside-outside,
forward-backward, the IBM models) and Gaussian mixtures.]
49
Solving an NLP problem
50
Issues
  • Modeling: represent the problem as a formula and
    decompose the formula into a function of
    parameters
  • Training: estimate the model parameters
  • Decoding: find the best answer given the
    parameters
  • Other issues
  • Preprocessing
  • Postprocessing
  • Evaluation
  • Evaluation

51
Modeling
  • Generative vs. discriminative models
  • Introducing hidden variables
  • The order of decomposition

52
Modeling (cont)
  • Approximation / assumptions
  • Final formulae and types of parameters

53
Modeling (cont)
  • Using classifiers for non-classification problems
  • POS tagging
  • Chunking
  • Parsing

54
Training
  • Objective functions
  • Maximize likelihood: EM
  • Minimize error rate: TBL
  • Maximum entropy: MaxEnt
  • …
  • Supervised, semi-supervised, unsupervised
  • Ex: maximize likelihood
  • Supervised: simple counting
  • Unsupervised: EM

55
Training (cont)
  • At each iteration
  • Choose one attribute / rule / weight / … at a
    time, and never change it later: DT, DL,
    TBL, …
  • Update all the parameters at each iteration: EM
  • Choose untrained parameters (e.g., thresholds)
    using development data.
  • Stop iterating when the gain falls below a minimum

56
Decoding
  • Dynamic programming
  • CYK for PCFG
  • Viterbi for HMM
  • Dynamic problems
  • Decode from left to right
  • Features only look at the left context
  • Keep the top-N hypotheses at each position
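
The keep-top-N idea can be sketched as a left-to-right beam search; the scorer, lexicon, and tags below are hypothetical:

```python
def beam_decode(tokens, tags, score, beam=3):
    """Left-to-right beam search: keep the top-N tag-sequence hypotheses
    at each position. `score(history, tag, token)` looks only at the
    left context, so extending hypotheses left-to-right is valid."""
    hyps = [((), 0.0)]  # (tag sequence so far, accumulated score)
    for tok in tokens:
        extended = [(seq + (t,), s + score(seq, t, tok))
                    for seq, s in hyps for t in tags]
        hyps = sorted(extended, key=lambda h: h[1], reverse=True)[:beam]
    return hyps[0][0]

# Hypothetical additive scorer: reward matching a toy lexicon, plus a
# small bonus for an N -> V transition (left context only).
lexicon = {"dog": "N", "runs": "V"}
def score(history, tag, token):
    s = 1.0 if lexicon.get(token) == tag else 0.0
    if history and history[-1] == "N" and tag == "V":
        s += 0.5
    return s

print(beam_decode(["dog", "runs"], ["N", "V"], score))  # ('N', 'V')
```

Unlike Viterbi, the beam is an approximation: the best sequence can fall out of the top-N early, which is the price paid for handling rich feature sets.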

57
Preprocessing
  • Sentence segmentation
  • Sentence alignment (for MT)
  • Tokenization
  • Morphing
  • POS tagging

58
Post-processing
  • System combination
  • Casing (MT)

59
Evaluation
  • Use standard training/test data if possible.
  • Choose appropriate evaluation measures
  • WSD: for what applications?
  • Word alignment: F-measure vs. AER. How does it
    affect the MT result?
  • Parsing: F-measure vs. dependency link accuracy

60
Tricks
61
Tricks
  • Algebra
  • Probability
  • Optimization
  • Programming

62
Algebra
The order of sums:
  Σ_i Σ_j a_{ij} = Σ_j Σ_i a_{ij}
Pulling out constants:
  Σ_i c · a_i = c · Σ_i a_i
63
Algebra (cont)
The order of sums and products:
  Σ_{y_1} … Σ_{y_n} Π_i f(y_i) = Π_i Σ_y f(y)
The order of log and product / sum:
  log Π_i a_i = Σ_i log a_i
64
Probability
Introducing a new random variable:
  P(x) = Σ_y P(x, y)
The order of decomposition:
  P(x, y) = P(x) P(y | x) = P(y) P(x | y)
65
More general cases (the chain rule):
  P(x_1, …, x_n) = Π_i P(x_i | x_1, …, x_{i-1})
66
Probability (cont)
Bayes' rule:
  P(y | x) = P(x | y) P(y) / P(x)
Source-channel model:
  y* = argmax_y P(y | x) = argmax_y P(x | y) P(y)
67
Probability (cont)
Normalization:
  p_i = w_i / Σ_j w_j,  so that Σ_i p_i = 1
Jensen's inequality (for a concave function such as log):
  f(E[X]) ≥ E[f(X)],  e.g., log Σ_i p_i x_i ≥ Σ_i p_i log x_i
68
Optimization
  • When there is no analytical solution, use
    iterative approach.
  • If the optimal solution to g(x) is hard to find,
    look for the optimal solution to a (tight) lower
    bound of g(x).

69
Optimization (cont)
  • Using Lagrange multipliers
  • Constrained problem:
    maximize f(x) with the constraint that
    g(x) = 0
  • Unconstrained problem: maximize f(x) − λ g(x)
  • Take first derivatives to find the stationary
    points.
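
As a worked example of this recipe (not from the slides, but it underlies the MaxEnt derivation): maximizing entropy subject to the probabilities summing to one yields the uniform distribution.

```latex
\text{Maximize } f(p) = -\sum_{i=1}^{n} p_i \log p_i
\quad \text{s.t.} \quad g(p) = \sum_{i=1}^{n} p_i - 1 = 0.

\Lambda(p, \lambda) = -\sum_i p_i \log p_i - \lambda \Big( \sum_i p_i - 1 \Big)

\frac{\partial \Lambda}{\partial p_i} = -\log p_i - 1 - \lambda = 0
\;\Rightarrow\; p_i = e^{-1-\lambda}.

\text{All } p_i \text{ are equal and sum to } 1
\;\Rightarrow\; p_i = \tfrac{1}{n}.
```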

70
Programming
  • Using/creating a good package
  • Tutorial, sample data, well-written code
  • Multiple levels of code
  • Core ML algorithm: e.g., TBL
  • Wrapper for a task: e.g., a POS tagger
  • Wrapper to deal with input, output, etc.

71
Programming (cont)
  • Good practice
  • Write notes and create wrappers (all the commands
    should be stored in the notes, or even better in
    a wrapper script)
  • Use standard directory structures
  • src/, include/, exec/, bin/, obj/, docs/,
    sample/, data/, result/
  • Give meaningful filenames, at least to important code:
    e.g., build_trigram_tagger.pl, not aaa100.exec
  • Give meaningful function and variable names
  • Don't use global variables

72
Final words
  • We have covered a lot of topics.
  • It takes time to digest, but at least we
    understand the basic concepts.
  • The next step: applying them to real
    applications.