Title: Final review
1. Final review
- LING 572
- Fei Xia
- 03/07/06
2. Misc
- Parts 3 and 4 were due at 6am today.
- Presentation: email me the slides by 6am on 3/9.
- Final report: email me by 6am on 3/14.
- Group meetings: 1:30-4:00pm on 3/16.
3. Outline
- Main topics
- Applying to NLP tasks
- Tricks
4. Main topics
5. Main topics
- Supervised learning
- Decision tree
- Decision list
- TBL
- MaxEnt
- Boosting
- Semi-supervised learning
- Self-training
- Co-training
- EM
- Co-EM
6. Main topics (cont)
- Unsupervised learning
- The EM algorithm
- The EM algorithm for PM models
- Forward-backward
- Inside-outside
- IBM models for MT
- Others
- Two dynamic models: FSA and HMM
- Re-sampling: bootstrap
- System combination
- Bagging
7. Main topics (cont)
- Homework
- Hw1: FSA and HMM
- Hw2: DT, DL, CNF, DNF, and TBL
- Hw3: Boosting
- Project
- P1: Trigram (learn to use Carmel, relation between HMM and FSA)
- P2: TBL
- P3: MaxEnt
- P4: Bagging, boosting, system combination, SSL
8. Supervised learning
9. A classification problem

District | House type    | Income | Previous customer | Outcome
Suburban | Detached      | High   | No                | Nothing
Suburban | Semi-detached | High   | Yes               | Respond
Rural    | Semi-detached | Low    | No                | Respond
Urban    | Detached      | Low    | Yes               | Nothing
10. Classification and estimation problems
- Given
- x: input attributes
- y: the goal
- training data: a set of (x, y) pairs
- Predict y given a new x
- y is a discrete variable → classification problem
- y is a continuous variable → estimation problem
11. Five ML methods
- Decision tree
- Decision list
- TBL
- Boosting
- MaxEnt
12. Decision tree
- Modeling: tree representation
- Training: top-down induction, greedy algorithm
- Decoding: find the path from the root to a leaf node where the tests along the path are satisfied
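A minimal runnable sketch of these three steps, using the toy customer table from slide 9 (the attribute indexing, function names, and the use of information gain as the split criterion are illustrative assumptions):

```python
import math
from collections import Counter

# Toy data from the slide-9 table:
# (district, house_type, income, prev_customer) -> outcome
DATA = [
    (("Suburban", "Detached", "High", "No"), "Nothing"),
    (("Suburban", "Semi-detached", "High", "Yes"), "Respond"),
    (("Rural", "Semi-detached", "Low", "No"), "Respond"),
    (("Urban", "Detached", "Low", "Yes"), "Nothing"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, attr):
    base = entropy([y for _, y in data])
    rem = 0.0
    for v in {x[attr] for x, _ in data}:
        subset = [y for x, y in data if x[attr] == v]
        rem += len(subset) / len(data) * entropy(subset)
    return base - rem

def grow(data, attrs):
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or not attrs:       # pure leaf, or no attributes left
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(data, a))
    node = {}
    for v in {x[best] for x, _ in data}:         # greedy split on the best attribute
        subset = [(x, y) for x, y in data if x[best] == v]
        node[(best, v)] = grow(subset, [a for a in attrs if a != best])
    return node

def decode(tree, x):
    while isinstance(tree, dict):                # follow the satisfied test at each node
        for (attr, v), child in tree.items():
            if x[attr] == v:
                tree = child
                break
        else:
            return None                          # unseen attribute value
    return tree

tree = grow(DATA, [0, 1, 2, 3])
```

On this consistent toy data the greedy induction happens to find a single perfect split (the house-type attribute), so the tree classifies all four training rows correctly.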
13. Decision tree (cont)
- Main algorithms: ID3, C4.5, CART
- Strengths
- Ability to generate understandable rules
- Ability to clearly indicate the best attributes
- Weaknesses
- Data splitting
- Trouble with non-rectangular regions
- The instability of top-down induction → bagging
14. Decision list
- Modeling: a list of decision rules
- Training: greedy, iterative algorithm
- Decoding: find the first rule that applies
- Each decision is based on a single piece of evidence, in contrast to MaxEnt, boosting, and TBL
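The "first rule that applies" decoding can be sketched as follows; the rules and the word-sense flavor are hypothetical:

```python
# A decision list is an ordered list of (test, label) rules; decoding
# returns the label of the FIRST rule whose test fires. Each test here
# looks at a single piece of evidence (one feature).
rules = [
    (lambda feats: "river" in feats, "bank/shore"),
    (lambda feats: "money" in feats, "bank/finance"),
    (lambda feats: True, "other"),            # default rule always fires last
]

def decode(features, rules):
    for test, label in rules:
        if test(features):                    # first rule that applies wins
            return label

label = decode({"bank", "river"}, rules)      # -> "bank/shore"
```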
15. TBL
- Modeling: a list of transformations (similar to decision rules)
- Training
- Greedy, iterative algorithm
- The concept of a current state
- Decoding: apply every transformation to the data, in order
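A runnable sketch of the greedy training loop and the "current state"; the toy tagging data and the rule template are hypothetical, not from the homework:

```python
# TBL sketch: each transformation rewrites the current labels; training
# greedily picks the transformation with the largest error reduction,
# then updates the current state and repeats.
def errors(cur, gold):
    return sum(c != g for c, g in zip(cur, gold))

def apply_rule(rule, words, cur):
    frm, to, trigger = rule                     # "change tag frm to to on word trigger"
    return [to if c == frm and w == trigger else c
            for w, c in zip(words, cur)]

def tbl_train(words, gold, init, candidates, max_iters=10):
    cur, learned = list(init), []
    for _ in range(max_iters):
        best = min(candidates, key=lambda r: errors(apply_rule(r, words, cur), gold))
        if errors(apply_rule(best, words, cur), gold) >= errors(cur, gold):
            break                               # no transformation reduces errors: stop
        cur = apply_rule(best, words, cur)      # update the current state
        learned.append(best)
    return learned, cur

words = ["the", "dog", "runs"]
gold = ["DET", "N", "V"]
init = ["N", "N", "N"]                          # initial-state annotator: everything is N
candidates = [("N", t, w) for t in ("DET", "N", "V") for w in words]
learned, final = tbl_train(words, gold, init, candidates)
```

Decoding then replays the learned transformations, in order, on fresh data.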
16. TBL (cont)
- Strengths
- Minimizes the error rate directly
- Ability to handle non-classification problems
- Dynamic problems: POS tagging
- Non-classification problems: parsing
- Weaknesses
- Transformations are hard to interpret, as they interact with one another
- Probabilistic TBL: TBL-DT
17. Boosting

[Diagram: the training sample is fed to the learner (ML) to produce f1; re-weighted samples are fed to ML to produce f2, ..., fT; the weak classifiers f1, ..., fT are combined into the final classifier f.]
18. Boosting (cont)
- Modeling: combine a set of weak classifiers to produce a powerful committee
- Training: learn one classifier at each iteration
- Decoding: use the weighted majority vote of the weak classifiers
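These three bullets can be sketched as AdaBoost with threshold "stumps" as the weak classifiers; the 1-D data and the stump family are hypothetical:

```python
import math

# AdaBoost sketch: binary labels in {-1, +1}, weak learners are
# 1-D threshold stumps (hypothetical toy data).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [+1, +1, -1, -1, +1, -1]

def stump(theta, sign):
    return lambda x: sign if x < theta else -sign

CANDIDATES = [stump(t, s) for t in (1.5, 2.5, 3.5, 4.5, 5.5) for s in (+1, -1)]

def weighted_error(h, w):
    return sum(wi for xi, yi, wi in zip(X, Y, w) if h(xi) != yi)

def adaboost(T):
    w = [1.0 / len(X)] * len(X)                  # uniform initial weights
    ensemble = []
    for _ in range(T):
        h = min(CANDIDATES, key=lambda c: weighted_error(c, w))
        err = weighted_error(h, w)
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak classifier
        ensemble.append((alpha, h))
        # reweighting: increase the weight of examples h got wrong
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(X, Y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def decode(ensemble, x):                         # weighted majority vote
    return +1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```

No single stump separates this data, but three rounds of reweighting produce a committee that classifies every training point correctly.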
19. Boosting (cont)
- Strengths
- It comes with a set of theoretical guarantees (e.g., on training error and test error).
- It only needs to find weak classifiers.
- Weaknesses
- It is susceptible to noise.
- The actual performance depends on the data and the base learner.
20. MaxEnt

The task: find p s.t.

  p* = argmax_{p ∈ P} H(p), where P = { p : E_p[f_j] = E_p̃[f_j] for all j }

(E_p̃[f_j] is the empirical expectation of feature f_j). If p* exists, it has the form

  p*(y | x) = (1/Z(x)) exp(Σ_j λ_j f_j(x, y))

21. MaxEnt (cont)

where

  Z(x) = Σ_y exp(Σ_j λ_j f_j(x, y)) is the normalizing factor.
22. MaxEnt (cont)
- Training: GIS, IIS
- Feature selection
- Greedy algorithm
- Select one (or more) feature at a time
- In general, MaxEnt achieves good performance on many NLP tasks.
23. Common issues
- Objective function / quality measure
- DT, DL: e.g., information gain
- TBL, Boosting: minimize training errors
- MaxEnt: maximize entropy while satisfying the constraints
24. Common issues (cont)
- Avoiding overfitting
- Use development data
- Two strategies
- stop early
- post-pruning
25. Common issues (cont)
- Missing attribute values
- Assume a blank value
- Assign the most common value among all similar examples in the training data
- (DL, DT) Assign a fraction of the example to each possible class
- Continuous-valued attributes
- Choose thresholds by checking the training data
26. Common issues (cont)
- Attributes with different costs
- DT: change the quality measure to include the costs
- Continuous-valued goal attribute
- DT, DL: each leaf node is marked with a real value or a linear function
- TBL, MaxEnt, Boosting: ??
27. Comparison of supervised learners

                | DT         | DL                    | TBL                             | Boosting                     | MaxEnt
Probabilistic   | PDT        | PDL                   | TBL-DT                          | Confidence                   | Y
Parametric      | N          | N                     | N                               | N                            | Y
Representation  | Tree       | Ordered list of rules | Ordered list of transformations | List of weighted classifiers | List of weighted features
Each iteration  | Attribute  | Rule                  | Transformation                  | Classifier weight            | Feature weight
Data processing | Split data | Split data            | Change cur_y                    | Reweight (x,y)               | None
Decoding        | Path       | 1st rule              | Sequence of rules               | Calc f(x)                    | Calc f(x)
28. Semi-supervised learning
29. Semi-supervised learning
- Each learning method makes some assumptions about the problem.
- SSL works when those assumptions are satisfied.
- SSL can degrade performance when mistakes reinforce themselves.
30. SSL (cont)
- We have covered four methods: self-training, co-training, EM, and co-EM.
31. Co-training
- The original paper (Blum and Mitchell, 1998):
- Two independent views: split the features into two sets.
- Train a classifier on each view.
- Each classifier labels data that can be used to train the other classifier.
- Extensions:
- Relax the conditional independence assumptions.
- Instead of using two views, use two or more classifiers trained on the whole feature set.
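A schematic co-training loop. The per-view "classifier" is deliberately trivial (the majority label for the view's feature value), and the weather-style data is hypothetical; the point is the structure: two views, each labeling data that trains the other:

```python
from collections import Counter, defaultdict

def train_view(labeled, view):
    # trivial per-view classifier: majority label for each feature value
    counts = defaultdict(Counter)
    for x, y in labeled:
        counts[x[view]][y] += 1
    return {v: c.most_common(1)[0][0] for v, c in counts.items()}

def cotrain(labeled, unlabeled, rounds=5):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            model = train_view(labeled, view)
            # this view confidently labels pool items whose value it has seen;
            # those newly labeled items will train the other view's classifier
            newly = [(x, model[x[view]]) for x in pool if x[view] in model]
            pool = [x for x in pool if x[view] not in model]
            labeled += newly
        if not pool:
            break
    return labeled

labeled = [(("sunny", "warm"), "play"), (("rainy", "cold"), "stay")]
unlabeled = [("sunny", "cold"), ("rainy", "warm")]
result = cotrain(labeled, unlabeled)
```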
32. Unsupervised learning
33. Unsupervised learning
- EM is a method for estimating parameters in the MLE framework.
- It finds a sequence of parameter estimates that keeps improving the likelihood of the training data.
34. The EM algorithm
- Start with an initial estimate θ0.
- Repeat until convergence:
- E-step: calculate Q(θ | θt) = E_{Y|X,θt}[ log P(X, Y | θ) ]
- M-step: find θt+1 = argmax_θ Q(θ | θt)
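The two steps above can be sketched with a two-coin mixture (a textbook-style toy problem, not from the slides): each trial reports heads out of n flips of one of two coins with unknown biases, and which coin was flipped is the hidden variable:

```python
from math import comb

# (heads, flips) per trial; the coin identity is hidden
trials = [(9, 10), (8, 10), (2, 10), (9, 10), (1, 10)]

def binom(h, n, p):
    return comb(n, h) * p**h * (1 - p)**(n - h)

def em(pA=0.6, pB=0.5, iters=20):
    for _ in range(iters):
        hA = tA = hB = tB = 0.0
        for h, n in trials:
            # E-step: posterior probability this trial used coin A
            la, lb = binom(h, n, pA), binom(h, n, pB)
            wA = la / (la + lb)
            hA += wA * h;        tA += wA * (n - h)
            hB += (1 - wA) * h;  tB += (1 - wA) * (n - h)
        # M-step: re-estimate the biases from the expected counts
        pA, pB = hA / (hA + tA), hB / (hB + tB)
    return pA, pB
```

Each iteration cannot decrease the likelihood of the observed data; here the two estimates separate toward the two apparent clusters of trials (high-heads vs. low-heads).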
35. The EM algorithm (cont)
- The optimal solution for the M-step exists for many classes of problems.
- → A number of well-known methods are special cases of EM:
- The EM algorithm for PM models
- The forward-backward algorithm
- The inside-outside algorithm
- ...
36. Other topics
37. FSA and HMM
- Two types of HMMs
- State-emission and arc-emission HMMs
- They are equivalent.
- We can convert an HMM into a WFSA.
- Modeling: Markov assumption
- Training
- Supervised: counting
- Unsupervised: the forward-backward algorithm
- Decoding: the Viterbi algorithm
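Viterbi decoding for a small state-emission HMM can be sketched as follows; the weather/activity probabilities are the usual textbook toy values, not from the homework:

```python
states = ("Rainy", "Sunny")
start = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    # delta[s] = probability of the best path ending in state s;
    # psi stores one backpointer table per time step
    delta = {s: start[s] * emit[s][obs[0]] for s in states}
    psi = []
    for o in obs[1:]:
        new, back = {}, {}
        for s in states:
            prev = max(states, key=lambda r: delta[r] * trans[r][s])
            new[s] = delta[prev] * trans[prev][s] * emit[s][o]
            back[s] = prev
        delta, psi = new, psi + [back]
    # follow the backpointers from the best final state
    best = max(states, key=lambda s: delta[s])
    path = [best]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path))
```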
38. Bootstrap

[Diagram: B bootstrap samples are drawn from one original sample; running the learner (ML) on each yields f1, f2, ..., fB, which can be combined into f.]
39. Bootstrap (cont)
- A method of re-sampling
- One original sample → B bootstrap samples
- It has a strong mathematical background.
- It is a method for estimating standard errors, bias, and so on.
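A sketch of bootstrap re-sampling used to estimate a standard error; the function name and toy data are hypothetical:

```python
import random
import statistics

def bootstrap_se(sample, B=2000, stat=statistics.mean, seed=0):
    # estimate the standard error of `stat` from one original sample
    rng = random.Random(seed)
    replicates = []
    for _ in range(B):
        # one bootstrap sample: draw n items with replacement
        resample = [rng.choice(sample) for _ in sample]
        replicates.append(stat(resample))
    return statistics.stdev(replicates)   # spread of the replicates estimates the SE

data = [2, 4, 4, 4, 5, 5, 7, 9]
```

For the mean, the bootstrap estimate should come out close to the plug-in value s/√n, roughly 0.7 for this data.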
40. System combination
41. System combination (cont)
- Hybridization: combine substructures to produce a new output
- Voting
- Naïve Bayes
- Switching: choose one of the fi(x)
- Similarity switching
- Naïve Bayes
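Simple (unweighted) voting can be sketched as follows; breaking ties in favor of the earlier system is an assumed policy, not from the slides:

```python
from collections import Counter

def vote(outputs):
    # outputs: one proposed label per system, in system order
    counts = Counter(outputs)
    top = counts.most_common(1)[0][1]
    for o in outputs:              # first system wins among tied outputs
        if counts[o] == top:
            return o
```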
42. Bagging

[Diagram: bootstrap re-sampling produces B samples; the learner (ML) trained on each yields f1, f2, ..., fB, which are combined into f (bootstrap plus system combination).]
43. Bagging (cont)
- It is effective for unstable learning methods:
- Decision tree
- Regression tree
- Neural network
- It does not help stable learning methods:
- k-nearest neighbors
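Putting bootstrap and voting together, a bagging sketch with a toy unstable base learner (a 1-D threshold stump picked to minimize error on its bootstrap sample; data and labels hypothetical):

```python
import random
from collections import Counter

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = ["a", "a", "a", "b", "b", "b"]

def train_stump(pairs):
    # pick the (threshold, low-label, high-label) stump with fewest errors
    best, best_err = None, None
    for theta in (1.5, 2.5, 3.5, 4.5, 5.5):
        for lo, hi in (("a", "b"), ("b", "a")):
            err = sum((lo if x < theta else hi) != y for x, y in pairs)
            if best_err is None or err < best_err:
                best, best_err = (theta, lo, hi), err
    return best

def bag(B=11, seed=0):
    rng = random.Random(seed)
    data = list(zip(X, Y))
    models = []
    for _ in range(B):
        sample = [rng.choice(data) for _ in data]   # one bootstrap sample
        models.append(train_stump(sample))
    return models

def decode(models, x):                              # majority vote of the stumps
    votes = [lo if x < theta else hi for theta, lo, hi in models]
    return Counter(votes).most_common(1)[0][0]

models = bag()
```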
44. Relations
45. Relations
- WFSA and HMM
- DL, DT, TBL
- EM, EM for PM
46. WFSA and HMM

To convert an HMM into a WFSA: add a Start state with a transition from Start to every state in the HMM, and add a Finish state with a transition from every state in the HMM to Finish.
47. DL, CNF, DNF, DT, TBL

[Diagram: the relations among k-DL, k-CNF, k-DNF, k-DT, and k-TBL.]
48. The EM algorithm

[Diagram: the generalized EM contains the EM algorithm, whose special cases include the EM algorithm for PM models (inside-outside, forward-backward, the IBM models) and Gaussian mixtures.]
49. Solving an NLP problem
50. Issues
- Modeling: represent the problem as a formula and decompose the formula into a function of parameters
- Training: estimate the model parameters
- Decoding: find the best answer given the parameters
- Other issues:
- Preprocessing
- Post-processing
- Evaluation
- ...
51. Modeling
- Generative vs. discriminative models
- Introducing hidden variables
- The order of decomposition
52. Modeling (cont)
- Approximation / assumptions
- Final formulae and types of parameters
53. Modeling (cont)
- Using classifiers for non-classification problems
- POS tagging
- Chunking
- Parsing
54. Training
- Objective functions
- Maximize likelihood: EM
- Minimize error rate: TBL
- Maximum entropy: MaxEnt
- ...
- Supervised, semi-supervised, unsupervised
- Ex: maximize likelihood
- Supervised: simple counting
- Unsupervised: EM
55. Training (cont)
- At each iteration:
- Choose one attribute / rule / weight at a time, and never change it later: DT, DL, TBL, ...
- Update all the parameters at each iteration: EM
- Choose untrained parameters (e.g., thresholds) using development data.
- Set a minimal gain as the criterion for continuing the iteration.
56. Decoding
- Dynamic programming
- CYK for PCFG
- Viterbi for HMM
- Dynamic problems
- Decode from left to right
- Features only look at the left context
- Keep the top-N hypotheses at each position
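The left-to-right strategy above can be sketched as beam decoding; the score function and tiny lexicon are hypothetical, and the features look only at the left context (the tag sequence built so far):

```python
import math

def beam_decode(words, tags, score, N=3):
    beam = [(0.0, [])]                        # (log-probability, tags so far)
    for w in words:
        expanded = [(logp + score(w, t, seq), seq + [t])
                    for logp, seq in beam for t in tags]
        beam = sorted(expanded, reverse=True)[:N]   # keep only the top N
    return beam[0][1]

lexicon = {"the": {"DET": 0.9, "NN": 0.1},
           "dog": {"NN": 0.8, "VB": 0.2}}

def score(w, t, left_context):                # toy lexical score only
    return math.log(lexicon.get(w, {}).get(t, 1e-6))
```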
57. Preprocessing
- Sentence segmentation
- Sentence alignment (for MT)
- Tokenization
- Morphing
- POS tagging
58. Post-processing
- System combination
- Casing (MT)
- ...
59. Evaluation
- Use standard training/test data if possible.
- Choose appropriate evaluation measures
- WSD: for what applications?
- Word alignment: F-measure vs. AER. How does it affect the MT result?
- Parsing: F-measure vs. dependency link accuracy
60. Tricks
61. Tricks
- Algebra
- Probability
- Optimization
- Programming
62. Algebra

The order of sums: Σ_i Σ_j f(i, j) = Σ_j Σ_i f(i, j)

Pulling out constants: Σ_i c · f(i) = c · Σ_i f(i)

63. Algebra (cont)

The order of sums and products: Σ_{j1} ... Σ_{jm} Π_i f(i, j_i) = Π_i Σ_j f(i, j)

The order of log and product / sum: log Π_i f(i) = Σ_i log f(i)
64. Probability

Introducing a new random variable: P(x) = Σ_y P(x, y)

The order of decomposition: P(x, y) = P(x) P(y | x) = P(y) P(x | y)

65. More general cases

P(x1, ..., xn) = Π_i P(xi | x1, ..., x(i-1)), for any ordering of the variables

66. Probability (cont)

Bayes Rule: P(y | x) = P(x | y) P(y) / P(x)

Source-channel model: argmax_y P(y | x) = argmax_y P(y) P(x | y)

67. Probability (cont)

Normalization: p(x) = f(x) / Σ_x' f(x')

Jensen's inequality (log is concave): log Σ_i p_i x_i ≥ Σ_i p_i log x_i
68. Optimization
- When there is no analytical solution, use an iterative approach.
- If the optimal solution to g(x) is hard to find, look for the optimal solution to a (tight) lower bound of g(x).
69. Optimization (cont)
- Using Lagrange multipliers
- Constrained problem: maximize f(x) subject to the constraint g(x) = 0
- Unconstrained problem: maximize f(x) − λ g(x)
- Take first derivatives to find the stationary points.
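A worked instance of this recipe (a standard example, assumed here rather than taken from the slides): maximizing entropy subject only to normalization, which connects the trick back to MaxEnt.

```latex
% Maximize H(p) = -\sum_i p_i \log p_i subject to \sum_i p_i = 1.
% Form the unconstrained objective with a Lagrange multiplier \lambda:
\Lambda(p, \lambda) = -\sum_i p_i \log p_i \;-\; \lambda \Big( \sum_i p_i - 1 \Big)
% Take the first derivative w.r.t. each p_i and set it to zero:
\frac{\partial \Lambda}{\partial p_i} = -\log p_i - 1 - \lambda = 0
\quad\Rightarrow\quad p_i = e^{-1-\lambda}
% Every p_i is equal, so the constraint \sum_i p_i = 1 gives p_i = 1/n:
% with no constraints beyond normalization, the maximum-entropy
% distribution is the uniform distribution.
```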
70. Programming
- Using/creating a good package
- Tutorial, sample data, well-written code
- Multiple levels of code
- Core ML algorithm: e.g., TBL
- Wrapper for a task: e.g., a POS tagger
- Wrapper to deal with input, output, etc.
71. Programming (cont)
- Good practice
- Write notes and create wrappers (all the commands should be stored in the notes, or even better, in a wrapper script)
- Use standard directory structures
- src/, include/, exec/, bin/, obj/, docs/, sample/, data/, result/
- Give meaningful filenames, not only to important code: e.g., build_trigram_tagger.pl rather than aaa100.exec
- Give meaningful function and variable names
- Don't use global variables
72. Final words
- We have covered a lot of topics.
- It takes time to digest, but at least we understand the basic concepts.
- The next step: applying them to real applications.