Title: LING 572
1Introduction
- LING 572
- Fei Xia
- Week 1 1/4/06
2Outline
- Course overview
- Mathematical foundation (Prereq)
- Probability theory
- Information theory
- Basic concepts in the classification task
3Course overview
4General info
- Course URL: http://courses.washington.edu/ling572
- Syllabus (incl. slides, assignments, and papers): updated every week.
- Message board
- ESubmit
- Slides
- I will try to put the slides online before class.
- Additional slides are not required and are not covered in class.
5Office hour
- Fei
- Email
- Email address: fxia_at_u
- Subject line should include "ling572"
- The 48-hour rule
- Office hour
- Time: Fri 10:00-11:20am
- Location: Padelford A-210G
6Lab session
- Bill McNeil
- Email: billmcn_at_u
- Lab session: what time is good for you?
- Explaining homework and solutions
- Mallet-related questions
- Reviewing class material
- → I highly recommend that you attend the lab sessions, especially the first few sessions.
7Time for Lab Session
- Time
- Monday 10:00am - 12:20pm, or
- Tuesday 10:30am - 11:30am, or
- ??
- Location: ??
- → Thursday 3-4pm, MGH 271?
8Misc
- Ling572 mailing list: ling572a_wi07_at_u
- EPost
- Mallet developer mailing list
- mallet-dev_at_cs.umass.edu
9Prerequisites
- Ling570
- Some basic algorithms: FSA, HMM, ...
- NLP tasks: tokenization, POS tagging, ...
- Programming: If you don't know Java well, talk to me.
- Java: Mallet
- Basic concepts in probability and statistics
- Ex: random variables, chain rule, Gaussian distribution, ...
- Basic concepts in information theory
- Ex: entropy, relative entropy, ...
10Expectations
- Reading
- Papers are online
- Reference book: Manning & Schütze (M&S)
- Finish reading papers before class
- → I will ask you questions.
11Grades
- Assignments (9 parts): 90%
- Programming language: Java
- Class participation: 10%
- No quizzes, no final exams
- No incompletes unless you can prove your case.
12Course objectives
- Covering basic statistical methods that produce state-of-the-art results
- Focusing on classification algorithms
- Touching on unsupervised and semi-supervised algorithms
- Some material is not easy. We will focus on applications, not theoretical proofs.
13Course layout
- Supervised methods
- Classification algorithms
- Individual classifiers
- Naïve Bayes
- kNN and Rocchio
- Decision tree
- Decision list ??
- Maximum Entropy (MaxEnt)
- Classifier ensemble
- Bagging
- Boosting
- System combination
14Course layout (cont)
- Supervised algorithms (cont)
- Sequence labeling algorithms
- Transformation-based learning (TBL)
- FST, HMM, ...
- Semi-supervised methods
- Self-training
- Co-training
15Course layout (cont)
- Unsupervised methods
- EM algorithm
- Forward-backward algorithm
- Inside-outside algorithm
- ...
16Questions for each method
- Modeling
- What is the model?
- How does the decomposition work?
- What kind of assumption is made?
- How many types of model parameters?
- How many internal (or non-model) parameters?
- How to handle multi-class problems?
- How to handle non-binary features?
- ...
17Questions for each method (cont)
- Training: how to estimate parameters?
- Decoding: how to find the best solution?
- Weaknesses and strengths?
- Is the algorithm
- robust? (e.g., handling outliers)
- scalable?
- prone to overfitting?
- efficient in training time? In test time?
- How much data is needed?
- Labeled data
- Unlabeled data
18Relation between 570/571 and 572
- 570/571 are organized by tasks; 572 is organized by learning methods.
- 572 focuses on statistical methods.
19NLP tasks covered in Ling570
- Tokenization
- Morphological analysis
- POS tagging
- Shallow parsing
- WSD
- NE tagging
20NLP tasks covered in Ling571
- Parsing
- Semantics
- Discourse
- Dialogue
- Natural language generation (NLG)
21A ML method for multiple NLP tasks
- Task (570/571)
- Tokenization
- POS tagging
- Parsing
- Reference resolution
- ...
- Method (572)
- MaxEnt
22Multiple methods for one NLP task
- Task (570/571): POS tagging
- Method (572)
- Decision tree
- MaxEnt
- Boosting
- Bagging
- ...
23Projects Task 1
- Text classification task: 20 groups
- P1: First look at the Mallet package
- P2: Your first tui class
- Naïve Bayes
- P3: Feature selection
- Decision tree
- P4: Bagging
- Boosting
- Individual project
- ...
24Projects Task 2
- Sequence labeling task: IGT detection
- P5: MaxEnt
- P6: Beam search
- P7: TBA
- P8: Presentation (final class)
- P9: Final report
- Group project (?)
25Both projects
- Use Mallet, a Java package
- Two types of work
- Reading code to understand ML methods
- Writing code to solve problems
26Feedback on assignments
- Misc section in each assignment
- How long did it take to finish the homework?
- Which part was difficult?
- ...
27Mallet overview
- It is a Java package that includes many
- classifiers,
- sequence labeling algorithms,
- optimization algorithms,
- useful data classes,
- ...
- You should
- read Mallet Guides
- attend the Mallet tutorial next Tuesday, 10:30-11:30am, LLC 109
- start on Hw1
- I will use Mallet class/method names if possible.
28Questions for course overview?
29Outline
- Course overview
- Mathematical foundation
- Probability theory
- Information theory
- Basic concepts in the classification task
30Probability Theory
31Basic concepts
- Sample space, event, event space
- Random variable and random vector
- Conditional probability, joint probability,
marginal probability (prior)
32Sample space, event, event space
- Sample space (Ω): a collection of basic outcomes.
- Ex: toss a coin twice: {HH, HT, TH, TT}
- Event: an event is a subset of Ω.
- Ex: {HT, TH}
- Event space (2^Ω): the set of all possible events.
33Random variable
- The outcome of an experiment need not be a number.
- We often want to represent outcomes as numbers.
- A random variable X is a function X: Ω → R.
- Ex: toss a coin twice: X(HH)=0, X(HT)=1, ...
34Two types of random variables
- Discrete: X takes on only a countable number of possible values.
- Ex: toss a coin 10 times. X is the number of tails that are noted.
- Continuous: X takes on an uncountable number of possible values.
- Ex: X is the lifetime (in hours) of a light bulb.
35Probability function
- The probability function of a discrete variable X is a function that gives the probability p(xi) that X equals xi, i.e., p(xi) = P(X = xi).
36Random vector
- A random vector is a finite-dimensional vector of random variables: X = (X1, ..., Xk).
- P(x) = P(x1, x2, ..., xn) = P(X1=x1, ..., Xn=xn)
- Ex: P(w1, ..., wn, t1, ..., tn)
37Three types of probability
- Joint prob P(x,y): prob of x and y happening together
- Conditional prob P(x|y): prob of x given a specific value of y
- Marginal prob P(x): prob of x, summed over all possible values of y
38Common tricks (I): Marginal prob → joint prob
39Common tricks (II): Chain rule
40Common tricks (III): Bayes rule
41Common tricks (IV): Independence assumption
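The formulas on slides 38-41 were lost in the conversion to text; the standard forms they refer to are (in LaTeX notation):

    (I)   P(x) = \sum_y P(x, y)
    (II)  P(x_1, \ldots, x_n) = P(x_1) \, P(x_2 \mid x_1) \cdots P(x_n \mid x_1, \ldots, x_{n-1})
    (III) P(y \mid x) = \frac{P(x \mid y) \, P(y)}{P(x)}
    (IV)  P(x_1, \ldots, x_n \mid y) \approx \prod_{i=1}^{n} P(x_i \mid y)    (e.g., the Naive Bayes assumption)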
42Prior and Posterior distribution
- Prior distribution P(θ)
- a distribution over parameter values θ, set prior to observing any data.
- Posterior distribution P(θ | data)
- It represents our belief that θ is true after observing the data.
- Likelihood of the model θ: P(data | θ)
- Relation among the three: Bayes rule
- P(θ | data) = P(data | θ) P(θ) / P(data)
43Two ways of estimating θ
- Maximum likelihood (ML)
- θ* = arg max_θ P(data | θ)
- Maximum a posteriori (MAP)
- θ* = arg max_θ P(θ | data)
44Information Theory
45Information theory
- It is the use of probability theory to quantify and measure information.
- Basic concepts:
- Entropy
- Joint entropy and conditional entropy
- Cross entropy and relative entropy
- Mutual information and perplexity
46Entropy
- Entropy is a measure of the uncertainty associated with a distribution.
- The lower bound on the number of bits it takes to transmit messages.
- An example:
- Display the results of horse races.
- Goal: minimize the number of bits to encode the results.
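The entropy formula itself was lost in the conversion; for a discrete random variable X it is:

    H(X) = -\sum_x p(x) \log_2 p(x)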
47An example
- Uniform distribution: pi = 1/8.
- Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
- Corresponding code: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)
- Uniform distribution has higher entropy.
- MaxEnt: make the distribution as uniform as possible.
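Working the example out (the slide shows only the distributions and the code): for the uniform distribution, H = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = 3 bits per race; for the non-uniform distribution, H = \frac{1}{2}(1) + \frac{1}{4}(2) + \frac{1}{8}(3) + \frac{1}{16}(4) + 4 \cdot \frac{1}{64}(6) = 2 bits, which equals the average codeword length of the variable-length code above.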
48Joint and conditional entropy
- Joint entropy
- Conditional entropy
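The definitions on this slide were not carried over; the standard forms are:

    H(X, Y) = -\sum_x \sum_y p(x, y) \log_2 p(x, y)
    H(Y \mid X) = -\sum_x \sum_y p(x, y) \log_2 p(y \mid x) = H(X, Y) - H(X)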
49Cross Entropy
- Entropy
- Cross entropy
- Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).
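The cross entropy formula itself is missing from this text version; the usual definition is:

    H(p, q) = -\sum_x p(x) \log_2 q(x)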
50Relative Entropy
- Also called Kullback-Leibler divergence
- Another distance measure between prob functions p and q.
- KL divergence is asymmetric (not a true distance).
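The missing definition, in the same notation:

    KL(p \,\|\, q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = H(p, q) - H(p)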
51Mutual information
- It measures how much is in common between X and Y.
- I(X;Y) = KL(p(x,y) || p(x)p(y))
52Perplexity
- Perplexity is 2^H.
- Perplexity is the weighted average number of choices a random variable has to make.
53Questions for Mathematical foundation?
54Outline
- Course overview
- Mathematical foundation
- Probability theory
- Information theory
- Basic concepts in the classification task
55Types of ML problems
- Classification problem
- Estimation problem
- Clustering
- Discovery
- ...
- A learning method can be applied to one or more types of ML problems.
- We will focus on the classification problem.
56Definition of classification problem
- Task
- C = {c1, c2, ..., cm} is a set of pre-defined classes (a.k.a. labels, categories).
- D = {d1, d2, ...} is a set of input that needs to be classified.
- A classifier is a function D × C → {0, 1}.
- Multi-label vs. single-label
- Single-label: for each di, only one class is assigned to it.
- Multi-class vs. binary classification problem
- Binary: |C| = 2.
57Conversion to single-label binary problem
- Multi-label → single-label
- We will focus on the single-label problem.
- A classifier D × C → {0, 1}
- becomes D → C
- More general definition: D × C → [0, 1]
- Multi-class → binary problem
- Positive examples vs. negative examples
58Examples of classification problems
- Text classification
- Document filtering
- Language/Author/Speaker id
- WSD
- PP attachment
- Automatic essay grading
- ...
59Problems that can be treated as a classification problem
- Tokenization / Word segmentation
- POS tagging
- NE detection
- NP chunking
- Parsing
- Reference resolution
- ...
60Labeled vs. unlabeled data
- Labeled data
- {(xi, yi)} is a set of labeled data.
- xi ∈ D: data/input, often represented as a feature vector.
- yi ∈ C: target/label
- Unlabeled data
- xi without yi.
61Instance, training and test data
- xi with or without yi is called an instance.
- Training data: a set of (labeled) instances.
- Test data: a set of unlabeled instances.
- In Mallet, the training data is stored in an InstanceList, and so is the test data.
62Attribute-value table
- Each row corresponds to an instance.
- Each column corresponds to a feature.
- A feature type (a.k.a. a feature template): w-1
- A feature: w-1=book
- Binary feature vs. non-binary feature
63Attribute-value table
      f1     f2    ...   fK      Target
d1    yes    1     no    -1000   c2
d2
d3
...
dn
64Feature sequence vs. Feature vector
- Feature sequence: a (featName, featValue) list for features that are present.
- Feature vector: a (featName, featValue) list for all the features.
- Representing data x as a feature vector.
65Data/Input → a feature vector
- Example
- Task: text classification
- Original x: a document
- Feature vector: bag-of-words approach
- In Mallet, the process is handled by a sequence of pipes (see the sketch after this list):
- Tokenization
- Lowercasing
- Merging the counts
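A minimal sketch of such a pipe sequence, assuming the class names of a recent Mallet release (cc.mallet.*); the 2007-era release used different package names (edu.umass.cs.mallet.base.*), and the document text and labels here are made up for illustration:

import java.util.ArrayList;
import cc.mallet.pipe.*;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class BagOfWordsExample {
    public static InstanceList buildInstances() {
        // Pipe sequence: map the label string to a Label, tokenize the text,
        // lowercase the tokens, count them, and emit a feature vector.
        ArrayList<Pipe> pipes = new ArrayList<Pipe>();
        pipes.add(new Target2Label());
        pipes.add(new CharSequence2TokenSequence());
        pipes.add(new TokenSequenceLowercase());
        pipes.add(new TokenSequence2FeatureSequence());
        pipes.add(new FeatureSequence2FeatureVector());

        InstanceList instances = new InstanceList(new SerialPipes(pipes));
        // Hypothetical document "d1" with class "c1".
        instances.addThruPipe(new Instance("The cat sat on the mat .", "c1", "d1", null));
        return instances;
    }
}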
66Classifier and decision matrix
- A classifier is a function f: f(x) = {(ci, scorei)}. It fills out a decision matrix.
- {(ci, scorei)} is called a Classification in Mallet.

      d1     d2     d3     ...
c1    0.1    0.4    0
c2    0.9    0.1    0
c3
67Trainer (a.k.a Learner)
- A trainer is a function that takes an InstanceList as input and outputs a classifier (see the sketch below).
- Training stage
- Classifier = train(instanceList)
- Test stage
- Classification = classify(instance)
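A minimal sketch of the two stages, again assuming recent Mallet class names; NaiveBayesTrainer is just one example trainer, and the data is assumed to come from a pipeline like the one sketched above:

import cc.mallet.classify.Classification;
import cc.mallet.classify.Classifier;
import cc.mallet.classify.NaiveBayesTrainer;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class TrainAndClassify {
    public static void run(InstanceList trainingData, Instance testInstance) {
        // Training stage: Classifier = train(instanceList)
        Classifier classifier = new NaiveBayesTrainer().train(trainingData);

        // Test stage: Classification = classify(instance)
        Classification c = classifier.classify(testInstance);
        System.out.println(c.getLabeling().getBestLabel());  // predicted class
    }
}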
68Important concepts (summary)
- Instance, InstanceList
- Labeled data, unlabeled data
- Training data, test data
- Feature, feature template
- Feature vector
- Attribute-value table
- Trainer, classifier
- Training stage, test stage
69Steps for solving an NLP task with classifiers
- Convert the task into a classification problem (optional)
- Split the data into training/test/validation
- Convert the data into an attribute-value table
- Training
- Decoding
- Evaluation
70Important subtasks (for you)
- Converting the data into an attribute-value table
- Define feature types
- Feature selection
- Convert an instance into a feature vector
- Understanding the training/decoding algorithms for the various methods.
71Notation
                  Classification in general    Text categorization
Input/data        xi                           di
Target/label      yi                           ci
Features          fk                           tk (term)
72Questions for Concepts in a classification task?
73Summary
- Course overview
- Mathematical foundation
- Probability theory
- Information theory
- M&S Ch 2
- Basic concepts in the classification task
74Downloading
- Hw1
- Mallet Guide
- Homework Guide
75Coming up
- Next Tuesday
- Mallet tutorial on 1/8 (Tues), 10:30-11:30am, at LLC 109.
- Classification algorithm overview and Naïve Bayes: read the paper beforehand.
- Next Thursday
- kNN and Rocchio: read the other paper.
- Hw1 is due at 11pm on 1/13.
76Additional slides
77An example
- 570/571
- POS tagging: HMM
- Parsing: PCFG
- MT: Model 1-4 training
- 572
- HMM: forward-backward algorithm
- PCFG: inside-outside algorithm
- MT: EM algorithm
- → All are special cases of the EM algorithm, one method of unsupervised learning.
78Proof: Relative entropy is always non-negative
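The proof itself was not preserved in this text version; a standard argument uses Jensen's inequality (log is concave, so E[log Z] <= log E[Z]):

    KL(p \,\|\, q) = -\sum_x p(x) \log_2 \frac{q(x)}{p(x)}
                  \ge -\log_2 \sum_x p(x) \frac{q(x)}{p(x)}
                   = -\log_2 \sum_x q(x) \ge -\log_2 1 = 0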
79Entropy of a language
- The entropy of a language L:
- If we make certain assumptions that the language is "nice", then the entropy can be calculated as:
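The formulas were lost in the conversion; the standard definitions (cf. M&S Ch 2, with x_{1n} a sequence of n symbols) are:

    H(L) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 p(x_{1n})

and, under the "nice" (stationary, ergodic) assumptions,

    H(L) = -\lim_{n \to \infty} \frac{1}{n} \log_2 p(x_{1n})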
80Cross entropy of a language
- The cross entropy of a language L:
- If we make certain assumptions that the language is "nice", then the cross entropy can be calculated as:
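Likewise, for a model m of the true distribution p (cf. M&S Ch 2):

    H(L, m) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 m(x_{1n})

and, under the same assumptions,

    H(L, m) = -\lim_{n \to \infty} \frac{1}{n} \log_2 m(x_{1n})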
81Conditional Entropy