Title: Final review
1. Final review
- LING 572
- Fei Xia
- 03/07/06
2. Misc
- Parts 3 and 4 were due at 6am today.
- Presentation: email me the slides by 6am on 3/9.
- Final report: email me by 6am on 3/14.
- Group meetings: 1:30-4:00pm on 3/16.
3. Outline
- Main topics
- Applying to NLP tasks
- Tricks
4. Main topics
5. Main topics
- Supervised learning
- Decision tree
- Decision list
- TBL
- MaxEnt
- Boosting
- Semi-supervised learning
- Self-training
- Co-training
- EM
- Co-EM
6. Main topics (cont)
- Unsupervised learning
- The EM algorithm
- The EM algorithm for PM models
- Forward-backward
- Inside-outside
- IBM models for MT
- Others
- Two dynamic models: FSA and HMM
- Re-sampling: bootstrap
- System combination
- Bagging
7. Main topics (cont)
- Homework
- Hw1: FSA and HMM
- Hw2: DT, DL, CNF, DNF, and TBL
- Hw3: Boosting
- Project
- P1: Trigram (learn to use Carmel, relation between HMM and FSA)
- P2: TBL
- P3: MaxEnt
- P4: Bagging, boosting, system combination, SSL
8. Supervised learning
9. A classification problem

District | House type    | Income | Previous customer | Outcome
Suburban | Detached      | High   | No                | Nothing
Suburban | Semi-detached | High   | Yes               | Respond
Rural    | Semi-detached | Low    | No                | Respond
Urban    | Detached      | Low    | Yes               | Nothing
10. Classification and estimation problems
- Given
- x: input attributes
- y: the goal
- training data: a set of (x, y) pairs
- Predict y given a new x
- y is a discrete variable → classification problem
- y is a continuous variable → estimation problem
11. Five ML methods
- Decision tree
- Decision list
- TBL
- Boosting
- MaxEnt
12. Decision tree
- Modeling: tree representation
- Training: top-down induction, greedy algorithm
- Decoding: find the path from the root to a leaf node where the tests along the path are satisfied
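A minimal runnable sketch of these three steps, using the toy customer table from slide 9 (the attribute indexing, function names, and the use of information gain as the split criterion are illustrative assumptions):

```python
import math
from collections import Counter

# Toy data from the slide-9 table:
# (district, house_type, income, prev_customer) -> outcome
DATA = [
    (("Suburban", "Detached", "High", "No"), "Nothing"),
    (("Suburban", "Semi-detached", "High", "Yes"), "Respond"),
    (("Rural", "Semi-detached", "Low", "No"), "Respond"),
    (("Urban", "Detached", "Low", "Yes"), "Nothing"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, attr):
    base = entropy([y for _, y in data])
    rem = 0.0
    for v in {x[attr] for x, _ in data}:
        subset = [y for x, y in data if x[attr] == v]
        rem += len(subset) / len(data) * entropy(subset)
    return base - rem

def grow(data, attrs):
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or not attrs:       # pure leaf, or no attributes left
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(data, a))
    node = {}
    for v in {x[best] for x, _ in data}:         # greedy split on the best attribute
        subset = [(x, y) for x, y in data if x[best] == v]
        node[(best, v)] = grow(subset, [a for a in attrs if a != best])
    return node

def decode(tree, x):
    while isinstance(tree, dict):                # follow the satisfied test at each node
        for (attr, v), child in tree.items():
            if x[attr] == v:
                tree = child
                break
        else:
            return None                          # unseen attribute value
    return tree

tree = grow(DATA, [0, 1, 2, 3])
```

On this consistent toy data the greedy induction happens to find a single perfect split (the house-type attribute), so the tree classifies all four training rows correctly.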
13. Decision tree (cont)
- Main algorithms: ID3, C4.5, CART
- Strengths
- Ability to generate understandable rules
- Ability to clearly indicate the best attributes
- Weaknesses
- Data splitting
- Trouble with non-rectangular regions
- The instability of top-down induction → bagging
14. Decision list
- Modeling: a list of decision rules
- Training: greedy, iterative algorithm
- Decoding: find the first rule that applies
- Each decision is based on a single piece of evidence, in contrast to MaxEnt, boosting, and TBL
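The "first rule that applies" decoding can be sketched as follows; the rules and the word-sense flavor are hypothetical:

```python
# A decision list is an ordered list of (test, label) rules; decoding
# returns the label of the FIRST rule whose test fires. Each test here
# looks at a single piece of evidence (one feature).
rules = [
    (lambda feats: "river" in feats, "bank/shore"),
    (lambda feats: "money" in feats, "bank/finance"),
    (lambda feats: True, "other"),            # default rule always fires last
]

def decode(features, rules):
    for test, label in rules:
        if test(features):                    # first rule that applies wins
            return label

label = decode({"bank", "river"}, rules)      # -> "bank/shore"
```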
15. TBL
- Modeling: a list of transformations (similar to decision rules)
- Training
- Greedy, iterative algorithm
- The concept of a current state
- Decoding: apply every transformation to the data, in order
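A runnable sketch of the greedy training loop and the "current state"; the toy tagging data and the rule template are hypothetical, not from the homework:

```python
# TBL sketch: each transformation rewrites the current labels; training
# greedily picks the transformation with the largest error reduction,
# then updates the current state and repeats.
def errors(cur, gold):
    return sum(c != g for c, g in zip(cur, gold))

def apply_rule(rule, words, cur):
    frm, to, trigger = rule                     # "change tag frm to to on word trigger"
    return [to if c == frm and w == trigger else c
            for w, c in zip(words, cur)]

def tbl_train(words, gold, init, candidates, max_iters=10):
    cur, learned = list(init), []
    for _ in range(max_iters):
        best = min(candidates, key=lambda r: errors(apply_rule(r, words, cur), gold))
        if errors(apply_rule(best, words, cur), gold) >= errors(cur, gold):
            break                               # no transformation reduces errors: stop
        cur = apply_rule(best, words, cur)      # update the current state
        learned.append(best)
    return learned, cur

words = ["the", "dog", "runs"]
gold = ["DET", "N", "V"]
init = ["N", "N", "N"]                          # initial-state annotator: everything is N
candidates = [("N", t, w) for t in ("DET", "N", "V") for w in words]
learned, final = tbl_train(words, gold, init, candidates)
```

Decoding then replays the learned transformations, in order, on fresh data.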
16. TBL (cont)
- Strengths
- Minimizes the error rate directly
- Ability to handle non-classification problems
- Dynamic problems: POS tagging
- Non-classification problems: parsing
- Weaknesses
- Transformations are hard to interpret, as they interact with one another
- Probabilistic TBL: TBL-DT
17. Boosting

[Diagram: the training sample is fed to the learner (ML) to produce f1; re-weighted samples are fed to ML to produce f2, ..., fT; the weak classifiers f1, ..., fT are combined into the final classifier f.]
18. Boosting (cont)
- Modeling: combine a set of weak classifiers to produce a powerful committee
- Training: learn one classifier at each iteration
- Decoding: use the weighted majority vote of the weak classifiers
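These three bullets can be sketched as AdaBoost with threshold "stumps" as the weak classifiers; the 1-D data and the stump family are hypothetical:

```python
import math

# AdaBoost sketch: binary labels in {-1, +1}, weak learners are
# 1-D threshold stumps (hypothetical toy data).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [+1, +1, -1, -1, +1, -1]

def stump(theta, sign):
    return lambda x: sign if x < theta else -sign

CANDIDATES = [stump(t, s) for t in (1.5, 2.5, 3.5, 4.5, 5.5) for s in (+1, -1)]

def weighted_error(h, w):
    return sum(wi for xi, yi, wi in zip(X, Y, w) if h(xi) != yi)

def adaboost(T):
    w = [1.0 / len(X)] * len(X)                  # uniform initial weights
    ensemble = []
    for _ in range(T):
        h = min(CANDIDATES, key=lambda c: weighted_error(c, w))
        err = weighted_error(h, w)
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak classifier
        ensemble.append((alpha, h))
        # reweighting: increase the weight of examples h got wrong
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(X, Y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def decode(ensemble, x):                         # weighted majority vote
    return +1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```

No single stump separates this data, but three rounds of reweighting produce a committee that classifies every training point correctly.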
19. Boosting (cont)
- Strengths
- It comes with a set of theoretical guarantees (e.g., on training error and test error).
- It only needs to find weak classifiers.
- Weaknesses
- It is susceptible to noise.
- The actual performance depends on the data and the base learner.
20. MaxEnt

The task: find p s.t.

  p* = argmax_{p ∈ P} H(p), where P = { p : E_p[f_j] = E_p̃[f_j] for all j }

(E_p̃[f_j] is the empirical expectation of feature f_j). If p* exists, it has the form

  p*(y | x) = (1/Z(x)) exp(Σ_j λ_j f_j(x, y))

21. MaxEnt (cont)

where

  Z(x) = Σ_y exp(Σ_j λ_j f_j(x, y)) is the normalizing factor.
22. MaxEnt (cont)
- Training: GIS, IIS
- Feature selection
- Greedy algorithm
- Select one (or more) feature at a time
- In general, MaxEnt achieves good performance on many NLP tasks.
23. Common issues
- Objective function / quality measure
- DT, DL: e.g., information gain
- TBL, Boosting: minimize training errors
- MaxEnt: maximize entropy while satisfying the constraints
24. Common issues (cont)
- Avoiding overfitting
- Use development data
- Two strategies
- stop early
- post-pruning
25. Common issues (cont)
- Missing attribute values
- Assume a blank value
- Assign the most common value among all similar examples in the training data
- (DL, DT) Assign a fraction of the example to each possible class
- Continuous-valued attributes
- Choose thresholds by checking the training data
26. Common issues (cont)
- Attributes with different costs
- DT: change the quality measure to include the costs
- Continuous-valued goal attribute
- DT, DL: each leaf node is marked with a real value or a linear function
- TBL, MaxEnt, Boosting: ??
27. Comparison of supervised learners

                | DT         | DL                    | TBL                             | Boosting                     | MaxEnt
Probabilistic   | PDT        | PDL                   | TBL-DT                          | Confidence                   | Y
Parametric      | N          | N                     | N                               | N                            | Y
Representation  | Tree       | Ordered list of rules | Ordered list of transformations | List of weighted classifiers | List of weighted features
Each iteration  | Attribute  | Rule                  | Transformation                  | Classifier weight            | Feature weight
Data processing | Split data | Split data            | Change cur_y                    | Reweight (x,y)               | None
Decoding        | Path       | 1st rule              | Sequence of rules               | Calc f(x)                    | Calc f(x)
28. Semi-supervised learning
29. Semi-supervised learning
- Each learning method makes some assumptions about the problem.
- SSL works when those assumptions are satisfied.
- SSL can degrade performance when mistakes reinforce themselves.
30. SSL (cont)
- We have covered four methods: self-training, co-training, EM, and co-EM.
31. Co-training
- The original paper (Blum and Mitchell, 1998):
- Two independent views: split the features into two sets.
- Train a classifier on each view.
- Each classifier labels data that can be used to train the other classifier.
- Extensions:
- Relax the conditional independence assumptions.
- Instead of using two views, use two or more classifiers trained on the whole feature set.
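A schematic co-training loop. The per-view "classifier" is deliberately trivial (the majority label for the view's feature value), and the weather-style data is hypothetical; the point is the structure: two views, each labeling data that trains the other:

```python
from collections import Counter, defaultdict

def train_view(labeled, view):
    # trivial per-view classifier: majority label for each feature value
    counts = defaultdict(Counter)
    for x, y in labeled:
        counts[x[view]][y] += 1
    return {v: c.most_common(1)[0][0] for v, c in counts.items()}

def cotrain(labeled, unlabeled, rounds=5):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            model = train_view(labeled, view)
            # this view confidently labels pool items whose value it has seen;
            # those newly labeled items will train the other view's classifier
            newly = [(x, model[x[view]]) for x in pool if x[view] in model]
            pool = [x for x in pool if x[view] not in model]
            labeled += newly
        if not pool:
            break
    return labeled

labeled = [(("sunny", "warm"), "play"), (("rainy", "cold"), "stay")]
unlabeled = [("sunny", "cold"), ("rainy", "warm")]
result = cotrain(labeled, unlabeled)
```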
32. Unsupervised learning
33. Unsupervised learning
- EM is a method for estimating parameters in the MLE framework.
- It finds a sequence of parameter estimates that keeps improving the likelihood of the training data.
34. The EM algorithm
- Start with an initial estimate θ0.
- Repeat until convergence:
- E-step: calculate Q(θ | θt) = E_{Y|X,θt}[ log P(X, Y | θ) ]
- M-step: find θt+1 = argmax_θ Q(θ | θt)
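The two steps above can be sketched with a two-coin mixture (a textbook-style toy problem, not from the slides): each trial reports heads out of n flips of one of two coins with unknown biases, and which coin was flipped is the hidden variable:

```python
from math import comb

# (heads, flips) per trial; the coin identity is hidden
trials = [(9, 10), (8, 10), (2, 10), (9, 10), (1, 10)]

def binom(h, n, p):
    return comb(n, h) * p**h * (1 - p)**(n - h)

def em(pA=0.6, pB=0.5, iters=20):
    for _ in range(iters):
        hA = tA = hB = tB = 0.0
        for h, n in trials:
            # E-step: posterior probability this trial used coin A
            la, lb = binom(h, n, pA), binom(h, n, pB)
            wA = la / (la + lb)
            hA += wA * h;        tA += wA * (n - h)
            hB += (1 - wA) * h;  tB += (1 - wA) * (n - h)
        # M-step: re-estimate the biases from the expected counts
        pA, pB = hA / (hA + tA), hB / (hB + tB)
    return pA, pB
```

Each iteration cannot decrease the likelihood of the observed data; here the two estimates separate toward the two apparent clusters of trials (high-heads vs. low-heads).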
35. The EM algorithm (cont)
- The optimal solution for the M-step exists for many classes of problems.
- → A number of well-known methods are special cases of EM:
- The EM algorithm for PM models
- The forward-backward algorithm
- The inside-outside algorithm
- ...
36. Other topics
37. FSA and HMM
- Two types of HMMs
- State-emission and arc-emission HMMs
- They are equivalent.
- We can convert an HMM into a WFSA.
- Modeling: Markov assumption
- Training
- Supervised: counting
- Unsupervised: the forward-backward algorithm
- Decoding: the Viterbi algorithm
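Viterbi decoding for a small state-emission HMM can be sketched as follows; the weather/activity probabilities are the usual textbook toy values, not from the homework:

```python
states = ("Rainy", "Sunny")
start = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    # delta[s] = probability of the best path ending in state s;
    # psi stores one backpointer table per time step
    delta = {s: start[s] * emit[s][obs[0]] for s in states}
    psi = []
    for o in obs[1:]:
        new, back = {}, {}
        for s in states:
            prev = max(states, key=lambda r: delta[r] * trans[r][s])
            new[s] = delta[prev] * trans[prev][s] * emit[s][o]
            back[s] = prev
        delta, psi = new, psi + [back]
    # follow the backpointers from the best final state
    best = max(states, key=lambda s: delta[s])
    path = [best]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path))
```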
38. Bootstrap

[Diagram: B bootstrap samples are drawn from one original sample; running the learner (ML) on each yields f1, f2, ..., fB, which can be combined into f.]
39. Bootstrap (cont)
- A method of re-sampling
- One original sample → B bootstrap samples
- It has a strong mathematical background.
- It is a method for estimating standard errors, bias, and so on.
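A sketch of bootstrap re-sampling used to estimate a standard error; the function name and toy data are hypothetical:

```python
import random
import statistics

def bootstrap_se(sample, B=2000, stat=statistics.mean, seed=0):
    # estimate the standard error of `stat` from one original sample
    rng = random.Random(seed)
    replicates = []
    for _ in range(B):
        # one bootstrap sample: draw n items with replacement
        resample = [rng.choice(sample) for _ in sample]
        replicates.append(stat(resample))
    return statistics.stdev(replicates)   # spread of the replicates estimates the SE

data = [2, 4, 4, 4, 5, 5, 7, 9]
```

For the mean, the bootstrap estimate should come out close to the plug-in value s/√n, roughly 0.7 for this data.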
40. System combination
41. System combination (cont)
- Hybridization: combine substructures to produce a new output
- Voting
- Naïve Bayes
- Switching: choose one of the fi(x)
- Similarity switching
- Naïve Bayes
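Simple (unweighted) voting can be sketched as follows; breaking ties in favor of the earlier system is an assumed policy, not from the slides:

```python
from collections import Counter

def vote(outputs):
    # outputs: one proposed label per system, in system order
    counts = Counter(outputs)
    top = counts.most_common(1)[0][1]
    for o in outputs:              # first system wins among tied outputs
        if counts[o] == top:
            return o
```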
42. Bagging

[Diagram: bootstrap re-sampling produces B samples; the learner (ML) trained on each yields f1, f2, ..., fB, which are combined into f (bootstrap plus system combination).]
43. Bagging (cont)
- It is effective for unstable learning methods:
- Decision tree
- Regression tree
- Neural network
- It does not help stable learning methods:
- k-nearest neighbors
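Putting bootstrap and voting together, a bagging sketch with a toy unstable base learner (a 1-D threshold stump picked to minimize error on its bootstrap sample; data and labels hypothetical):

```python
import random
from collections import Counter

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = ["a", "a", "a", "b", "b", "b"]

def train_stump(pairs):
    # pick the (threshold, low-label, high-label) stump with fewest errors
    best, best_err = None, None
    for theta in (1.5, 2.5, 3.5, 4.5, 5.5):
        for lo, hi in (("a", "b"), ("b", "a")):
            err = sum((lo if x < theta else hi) != y for x, y in pairs)
            if best_err is None or err < best_err:
                best, best_err = (theta, lo, hi), err
    return best

def bag(B=11, seed=0):
    rng = random.Random(seed)
    data = list(zip(X, Y))
    models = []
    for _ in range(B):
        sample = [rng.choice(data) for _ in data]   # one bootstrap sample
        models.append(train_stump(sample))
    return models

def decode(models, x):                              # majority vote of the stumps
    votes = [lo if x < theta else hi for theta, lo, hi in models]
    return Counter(votes).most_common(1)[0][0]

models = bag()
```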
44. Relations
45. Relations
- WFSA and HMM
- DL, DT, TBL
- EM, EM for PM
46. WFSA and HMM

To convert an HMM into a WFSA: add a Start state with a transition from Start to every state in the HMM, and add a Finish state with a transition from every state in the HMM to Finish.
47. DL, CNF, DNF, DT, TBL

[Diagram: the relations among k-DL, k-CNF, k-DNF, k-DT, and k-TBL.]
48. The EM algorithm

[Diagram: the generalized EM contains the EM algorithm, whose special cases include the EM algorithm for PM models (inside-outside, forward-backward, the IBM models) and Gaussian mixtures.]
49. Solving an NLP problem
50. Issues
- Modeling: represent the problem as a formula and decompose the formula into a function of parameters
- Training: estimate the model parameters
- Decoding: find the best answer given the parameters
- Other issues:
- Preprocessing
- Post-processing
- Evaluation
- ...
51. Modeling
- Generative vs. discriminative models
- Introducing hidden variables
- The order of decomposition
52. Modeling (cont)
- Approximation / assumptions
- Final formulae and types of parameters
53. Modeling (cont)
- Using classifiers for non-classification problems
- POS tagging
- Chunking
- Parsing
54. Training
- Objective functions
- Maximize likelihood: EM
- Minimize error rate: TBL
- Maximum entropy: MaxEnt
- ...
- Supervised, semi-supervised, unsupervised
- Ex: maximize likelihood
- Supervised: simple counting
- Unsupervised: EM
55. Training (cont)
- At each iteration:
- Choose one attribute / rule / weight at a time, and never change it later: DT, DL, TBL, ...
- Update all the parameters at each iteration: EM
- Choose untrained parameters (e.g., thresholds) using development data.
- Set a minimal gain as the criterion for continuing the iteration.
56. Decoding
- Dynamic programming
- CYK for PCFG
- Viterbi for HMM
- Dynamic problems
- Decode from left to right
- Features only look at the left context
- Keep the top-N hypotheses at each position
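The left-to-right strategy above can be sketched as beam decoding; the score function and tiny lexicon are hypothetical, and the features look only at the left context (the tag sequence built so far):

```python
import math

def beam_decode(words, tags, score, N=3):
    beam = [(0.0, [])]                        # (log-probability, tags so far)
    for w in words:
        expanded = [(logp + score(w, t, seq), seq + [t])
                    for logp, seq in beam for t in tags]
        beam = sorted(expanded, reverse=True)[:N]   # keep only the top N
    return beam[0][1]

lexicon = {"the": {"DET": 0.9, "NN": 0.1},
           "dog": {"NN": 0.8, "VB": 0.2}}

def score(w, t, left_context):                # toy lexical score only
    return math.log(lexicon.get(w, {}).get(t, 1e-6))
```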
57. Preprocessing
- Sentence segmentation
- Sentence alignment (for MT)
- Tokenization
- Morphing
- POS tagging
58. Post-processing
- System combination
- Casing (MT)
- ...
59. Evaluation
- Use standard training/test data if possible.
- Choose appropriate evaluation measures
- WSD: for what applications?
- Word alignment: F-measure vs. AER. How does it affect the MT result?
- Parsing: F-measure vs. dependency link accuracy
60. Tricks
61. Tricks
- Algebra
- Probability
- Optimization
- Programming
62. Algebra

The order of sums: Σ_i Σ_j f(i, j) = Σ_j Σ_i f(i, j)

Pulling out constants: Σ_i c · f(i) = c · Σ_i f(i)

63. Algebra (cont)

The order of sums and products: Σ_{j1} ... Σ_{jm} Π_i f(i, j_i) = Π_i Σ_j f(i, j)

The order of log and product / sum: log Π_i f(i) = Σ_i log f(i)
64. Probability

Introducing a new random variable: P(x) = Σ_y P(x, y)

The order of decomposition: P(x, y) = P(x) P(y | x) = P(y) P(x | y)

65. More general cases

P(x1, ..., xn) = Π_i P(xi | x1, ..., x(i-1)), for any ordering of the variables

66. Probability (cont)

Bayes Rule: P(y | x) = P(x | y) P(y) / P(x)

Source-channel model: argmax_y P(y | x) = argmax_y P(y) P(x | y)

67. Probability (cont)

Normalization: p(x) = f(x) / Σ_x' f(x')

Jensen's inequality (log is concave): log Σ_i p_i x_i ≥ Σ_i p_i log x_i
68. Optimization
- When there is no analytical solution, use an iterative approach.
- If the optimal solution to g(x) is hard to find, look for the optimal solution to a (tight) lower bound of g(x).
69. Optimization (cont)
- Using Lagrange multipliers
- Constrained problem: maximize f(x) subject to the constraint g(x) = 0
- Unconstrained problem: maximize f(x) − λ g(x)
- Take first derivatives to find the stationary points.
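A worked instance of this recipe (a standard example, assumed here rather than taken from the slides): maximizing entropy subject only to normalization, which connects the trick back to MaxEnt.

```latex
% Maximize H(p) = -\sum_i p_i \log p_i subject to \sum_i p_i = 1.
% Form the unconstrained objective with a Lagrange multiplier \lambda:
\Lambda(p, \lambda) = -\sum_i p_i \log p_i \;-\; \lambda \Big( \sum_i p_i - 1 \Big)
% Take the first derivative w.r.t. each p_i and set it to zero:
\frac{\partial \Lambda}{\partial p_i} = -\log p_i - 1 - \lambda = 0
\quad\Rightarrow\quad p_i = e^{-1-\lambda}
% Every p_i is equal, so the constraint \sum_i p_i = 1 gives p_i = 1/n:
% with no constraints beyond normalization, the maximum-entropy
% distribution is the uniform distribution.
```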
70. Programming
- Using/creating a good package
- Tutorial, sample data, well-written code
- Multiple levels of code
- Core ML algorithm: e.g., TBL
- Wrapper for a task: e.g., a POS tagger
- Wrapper to deal with input, output, etc.
71. Programming (cont)
- Good practice
- Write notes and create wrappers (all the commands should be stored in the notes, or even better, in a wrapper script)
- Use standard directory structures
- src/, include/, exec/, bin/, obj/, docs/, sample/, data/, result/
- Give meaningful filenames, not only to important code: e.g., build_trigram_tagger.pl rather than aaa100.exec
- Give meaningful function and variable names
- Don't use global variables
72. Final words
- We have covered a lot of topics.
- It takes time to digest, but at least we understand the basic concepts.
- The next step: applying them to real applications.