Title: LING 572
1Introduction
- LING 572
- Fei Xia
- Week 1 1/4/06
2Outline
- Course overview
- Mathematical foundation (Prereq)
- Probability theory
- Information theory
- Basic concepts in the classification task
3Course overview
4General info
- Course URL: http://courses.washington.edu/ling572
- Syllabus (incl. slides, assignments, and papers): updated every week.
- Message board
- ESubmit
- Slides
- I will try to put the slides online before class.
- Additional slides are not required and are not covered in class.
5Office hour
- Fei
- Email
- Email address: fxia_at_u
- Subject line should include "ling572"
- The 48-hour rule
- Office hour
- Time: Fri 10:00-11:20am
- Location: Padelford A-210G
6Lab session
- Bill McNeil
- Email: billmcn_at_u
- Lab session: what time is good for you?
- Explaining homework and solutions
- Mallet-related questions
- Reviewing class material
- → I highly recommend that you attend the lab sessions, especially the first few sessions.
7Time for Lab Session
- Time
- Monday 10:00am - 12:20pm, or
- Tuesday 10:30am - 11:30am, or
- ??
- Location: ??
- → Thursday 3-4pm, MGH 271?
8Misc
- Ling572 mailing list: ling572a_wi07_at_u
- EPost
- Mallet developer mailing list
- mallet-dev_at_cs.umass.edu
9Prerequisites
- Ling570
- Some basic algorithms: FSA, HMM, ...
- NLP tasks: tokenization, POS tagging, ...
- Programming: If you don't know Java well, talk to me.
- Java: Mallet
- Basic concepts in probability and statistics
- Ex: random variables, chain rule, Gaussian distribution, ...
- Basic concepts in information theory
- Ex: entropy, relative entropy, ...
10Expectations
- Reading
- Papers are online
- Reference book: Manning & Schütze (M&S)
- Finish reading papers before class
- → I will ask you questions.
11Grades
- Assignments (9 parts): 90%
- Programming language: Java
- Class participation: 10%
- No quizzes, no final exams
- No incompletes unless you can prove your case.
12Course objectives
- Covering basic statistical methods that produce state-of-the-art results
- Focusing on classification algorithms
- Touching on unsupervised and semi-supervised algorithms
- Some material is not easy. We will focus on applications, not theoretical proofs.
13Course layout
- Supervised methods
- Classification algorithms
- Individual classifiers
- Naïve Bayes
- kNN and Rocchio
- Decision tree
- Decision list ??
- Maximum Entropy (MaxEnt)
- Classifier ensemble
- Bagging
- Boosting
- System combination
14Course layout (cont)
- Supervised algorithms (cont)
- Sequence labeling algorithms
- Transformation-based learning (TBL)
- FST, HMM, ...
- Semi-supervised methods
- Self-training
- Co-training
15Course layout (cont)
- Unsupervised methods
- EM algorithm
- Forward-backward algorithm
- Inside-outside algorithm
- ...
16Questions for each method
- Modeling
- What is the model?
- How does the decomposition work?
- What kind of assumption is made?
- How many types of model parameters?
- How many internal (or non-model) parameters?
- How to handle multi-class problems?
- How to handle non-binary features?
- ...
17Questions for each method (cont)
- Training: how to estimate parameters?
- Decoding: how to find the best solution?
- Weaknesses and strengths?
- Is the algorithm
- robust? (e.g., handling outliers)
- scalable?
- prone to overfitting?
- efficient in training time? In test time?
- How much data is needed?
- Labeled data
- Unlabeled data
18Relation between 570/571 and 572
- 570/571 are organized by tasks; 572 is organized by learning methods.
- 572 focuses on statistical methods.
19NLP tasks covered in Ling570
- Tokenization
- Morphological analysis
- POS tagging
- Shallow parsing
- WSD
- NE tagging
20NLP tasks covered in Ling571
- Parsing
- Semantics
- Discourse
- Dialogue
- Natural language generation (NLG)
21A ML method for multiple NLP tasks
- Task (570/571)
- Tokenization
- POS tagging
- Parsing
- Reference resolution
- ...
- Method (572)
- MaxEnt
22Multiple methods for one NLP task
- Task (570/571): POS tagging
- Method (572)
- Decision tree
- MaxEnt
- Boosting
- Bagging
- ...
23Projects Task 1
- Text classification task: 20 groups
- P1: First look at the Mallet package
- P2: Your first tui class
- Naïve Bayes
- P3: Feature selection
- Decision tree
- P4: Bagging
- Boosting
- Individual project
- ...
24Projects Task 2
- Sequence labeling task: IGT detection
- P5: MaxEnt
- P6: Beam search
- P7: TBA
- P8: Presentation (final class)
- P9: Final report
- Group project (?)
25Both projects
- Use Mallet, a Java package
- Two types of work
- Reading code to understand ML methods
- Writing code to solve problems
26Feedback on assignments
- Misc section in each assignment
- How long did it take to finish the homework?
- Which part was difficult?
- ...
27Mallet overview
- It is a Java package that includes many
- classifiers,
- sequence labeling algorithms,
- optimization algorithms,
- useful data classes,
- ...
- You should
- read Mallet Guides
- attend the Mallet tutorial next Tuesday, 10:30-11:30am, LLC 109
- start on Hw1
- I will use Mallet class/method names if possible.
28Questions for course overview?
29Outline
- Course overview
- Mathematical foundation
- Probability theory
- Information theory
- Basic concepts in the classification task
30Probability Theory
31Basic concepts
- Sample space, event, event space
- Random variable and random vector
- Conditional probability, joint probability,
marginal probability (prior)
32Sample space, event, event space
- Sample space (Ω): a collection of basic outcomes.
- Ex: toss a coin twice: {HH, HT, TH, TT}
- Event: an event is a subset of Ω.
- Ex: {HT, TH}
- Event space (2^Ω): the set of all possible events.
33Random variable
- The outcome of an experiment need not be a number.
- We often want to represent outcomes as numbers.
- A random variable X is a function X: Ω → R.
- Ex: toss a coin twice: X(HH)=0, X(HT)=1, ...
34Two types of random variables
- Discrete: X takes on only a countable number of possible values.
- Ex: toss a coin 10 times. X is the number of tails that are noted.
- Continuous: X takes on an uncountable number of possible values.
- Ex: X is the lifetime (in hours) of a light bulb.
35Probability function
- The probability function of a discrete variable X is a function that gives the probability p(xi) that X equals xi, i.e., p(xi) = P(X = xi).
36Random vector
- A random vector is a finite-dimensional vector of random variables: X = (X1, ..., Xk).
- P(x) = P(x1, x2, ..., xn) = P(X1=x1, ..., Xn=xn)
- Ex: P(w1, ..., wn, t1, ..., tn)
37Three types of probability
- Joint prob P(x,y): prob of x and y happening together
- Conditional prob P(x|y): prob of x given a specific value of y
- Marginal prob P(x): prob of x, summed over all possible values of y
38Common tricks (I): Marginal prob → joint prob
39Common tricks (II): Chain rule
40Common tricks (III): Bayes rule
41Common tricks (IV): Independence assumption
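The formulas on slides 38-41 were lost in the conversion to text; the standard forms they refer to are (in LaTeX notation):

    (I)   P(x) = \sum_y P(x, y)
    (II)  P(x_1, \ldots, x_n) = P(x_1) \, P(x_2 \mid x_1) \cdots P(x_n \mid x_1, \ldots, x_{n-1})
    (III) P(y \mid x) = \frac{P(x \mid y) \, P(y)}{P(x)}
    (IV)  P(x_1, \ldots, x_n \mid y) \approx \prod_{i=1}^{n} P(x_i \mid y)    (e.g., the Naive Bayes assumption)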
42Prior and Posterior distribution
- Prior distribution P(θ)
- a distribution over parameter values θ, set prior to observing any data.
- Posterior distribution P(θ | data)
- It represents our belief that θ is true after observing the data.
- Likelihood of the model θ: P(data | θ)
- Relation among the three: Bayes rule
- P(θ | data) = P(data | θ) P(θ) / P(data)
43Two ways of estimating θ
- Maximum likelihood (ML)
- θ* = arg max_θ P(data | θ)
- Maximum a posteriori (MAP)
- θ* = arg max_θ P(θ | data)
44Information Theory
45Information theory
- It is the use of probability theory to quantify and measure information.
- Basic concepts:
- Entropy
- Joint entropy and conditional entropy
- Cross entropy and relative entropy
- Mutual information and perplexity
46Entropy
- Entropy is a measure of the uncertainty associated with a distribution.
- The lower bound on the number of bits it takes to transmit messages.
- An example:
- Display the results of horse races.
- Goal: minimize the number of bits to encode the results.
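The entropy formula itself was lost in the conversion; for a discrete random variable X it is:

    H(X) = -\sum_x p(x) \log_2 p(x)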
47An example
- Uniform distribution: pi = 1/8.
- Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
- Corresponding code: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)
- Uniform distribution has higher entropy.
- MaxEnt: make the distribution as uniform as possible.
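Working the example out (the slide shows only the distributions and the code): for the uniform distribution, H = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = 3 bits per race; for the non-uniform distribution, H = \frac{1}{2}(1) + \frac{1}{4}(2) + \frac{1}{8}(3) + \frac{1}{16}(4) + 4 \cdot \frac{1}{64}(6) = 2 bits, which equals the average codeword length of the variable-length code above.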
48Joint and conditional entropy
- Joint entropy
- Conditional entropy
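The definitions on this slide were not carried over; the standard forms are:

    H(X, Y) = -\sum_x \sum_y p(x, y) \log_2 p(x, y)
    H(Y \mid X) = -\sum_x \sum_y p(x, y) \log_2 p(y \mid x) = H(X, Y) - H(X)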
49Cross Entropy
- Entropy
- Cross entropy
- Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).
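The cross entropy formula itself is missing from this text version; the usual definition is:

    H(p, q) = -\sum_x p(x) \log_2 q(x)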
50Relative Entropy
- Also called Kullback-Leibler divergence
- Another distance measure between prob functions p and q.
- KL divergence is asymmetric (not a true distance).
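The missing definition, in the same notation:

    KL(p \,\|\, q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = H(p, q) - H(p)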
51Mutual information
- It measures how much is in common between X and Y.
- I(X;Y) = KL(p(x,y) || p(x)p(y))
52Perplexity
- Perplexity is 2^H.
- Perplexity is the weighted average number of choices a random variable has to make.
53Questions for Mathematical foundation?
54Outline
- Course overview
- Mathematical foundation
- Probability theory
- Information theory
- Basic concepts in the classification task
55Types of ML problems
- Classification problem
- Estimation problem
- Clustering
- Discovery
- ...
- A learning method can be applied to one or more types of ML problems.
- We will focus on the classification problem.
56Definition of classification problem
- Task
- C = {c1, c2, ..., cm} is a set of pre-defined classes (a.k.a. labels, categories).
- D = {d1, d2, ...} is a set of input that needs to be classified.
- A classifier is a function D × C → {0, 1}.
- Multi-label vs. single-label
- Single-label: for each di, only one class is assigned to it.
- Multi-class vs. binary classification problem
- Binary: |C| = 2.
57Conversion to single-label binary problem
- Multi-label → single-label
- We will focus on the single-label problem.
- A classifier D × C → {0, 1}
- becomes D → C
- More general definition: D × C → [0, 1]
- Multi-class → binary problem
- Positive examples vs. negative examples
58Examples of classification problems
- Text classification
- Document filtering
- Language/Author/Speaker id
- WSD
- PP attachment
- Automatic essay grading
- ...
59Problems that can be treated as a classification problem
- Tokenization / Word segmentation
- POS tagging
- NE detection
- NP chunking
- Parsing
- Reference resolution
- ...
60Labeled vs. unlabeled data
- Labeled data
- {(xi, yi)} is a set of labeled data.
- xi ∈ D: data/input, often represented as a feature vector.
- yi ∈ C: target/label
- Unlabeled data
- xi without yi.
61Instance, training and test data
- xi with or without yi is called an instance.
- Training data: a set of (labeled) instances.
- Test data: a set of unlabeled instances.
- In Mallet, the training data is stored in an InstanceList, and so is the test data.
62Attribute-value table
- Each row corresponds to an instance.
- Each column corresponds to a feature.
- A feature type (a.k.a. a feature template): w-1
- A feature: w-1=book
- Binary feature vs. non-binary feature
63Attribute-value table
      f1     f2    ...   fK      Target
d1    yes    1     no    -1000   c2
d2
d3
...
dn
64Feature sequence vs. Feature vector
- Feature sequence: a (featName, featValue) list for features that are present.
- Feature vector: a (featName, featValue) list for all the features.
- Representing data x as a feature vector.
65Data/Input → a feature vector
- Example
- Task: text classification
- Original x: a document
- Feature vector: bag-of-words approach
- In Mallet, the process is handled by a sequence of pipes (see the sketch after this list):
- Tokenization
- Lowercasing
- Merging the counts
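A minimal sketch of such a pipe sequence, assuming the class names of a recent Mallet release (cc.mallet.*); the 2007-era release used different package names (edu.umass.cs.mallet.base.*), and the document text and labels here are made up for illustration:

import java.util.ArrayList;
import cc.mallet.pipe.*;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class BagOfWordsExample {
    public static InstanceList buildInstances() {
        // Pipe sequence: map the label string to a Label, tokenize the text,
        // lowercase the tokens, count them, and emit a feature vector.
        ArrayList<Pipe> pipes = new ArrayList<Pipe>();
        pipes.add(new Target2Label());
        pipes.add(new CharSequence2TokenSequence());
        pipes.add(new TokenSequenceLowercase());
        pipes.add(new TokenSequence2FeatureSequence());
        pipes.add(new FeatureSequence2FeatureVector());

        InstanceList instances = new InstanceList(new SerialPipes(pipes));
        // Hypothetical document "d1" with class "c1".
        instances.addThruPipe(new Instance("The cat sat on the mat .", "c1", "d1", null));
        return instances;
    }
}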
66Classifier and decision matrix
- A classifier is a function f: f(x) = {(ci, scorei)}. It fills out a decision matrix.
- {(ci, scorei)} is called a Classification in Mallet.

      d1     d2     d3     ...
c1    0.1    0.4    0
c2    0.9    0.1    0
c3
67Trainer (a.k.a Learner)
- A trainer is a function that takes an InstanceList as input and outputs a classifier (see the sketch below).
- Training stage
- Classifier = train(instanceList)
- Test stage
- Classification = classify(instance)
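A minimal sketch of the two stages, again assuming recent Mallet class names; NaiveBayesTrainer is just one example trainer, and the data is assumed to come from a pipeline like the one sketched above:

import cc.mallet.classify.Classification;
import cc.mallet.classify.Classifier;
import cc.mallet.classify.NaiveBayesTrainer;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class TrainAndClassify {
    public static void run(InstanceList trainingData, Instance testInstance) {
        // Training stage: Classifier = train(instanceList)
        Classifier classifier = new NaiveBayesTrainer().train(trainingData);

        // Test stage: Classification = classify(instance)
        Classification c = classifier.classify(testInstance);
        System.out.println(c.getLabeling().getBestLabel());  // predicted class
    }
}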
68Important concepts (summary)
- Instance, InstanceList
- Labeled data, unlabeled data
- Training data, test data
- Feature, feature template
- Feature vector
- Attribute-value table
- Trainer, classifier
- Training stage, test stage
69Steps for solving an NLP task with classifiers
- Convert the task into a classification problem (optional)
- Split the data into training/test/validation
- Convert the data into an attribute-value table
- Training
- Decoding
- Evaluation
70Important subtasks (for you)
- Converting the data into an attribute-value table
- Define feature types
- Feature selection
- Convert an instance into a feature vector
- Understanding the training/decoding algorithms for the various methods.
71Notation
                  Classification in general    Text categorization
Input/data        xi                           di
Target/label      yi                           ci
Features          fk                           tk (term)
72Questions for Concepts in a classification task?
73Summary
- Course overview
- Mathematical foundation
- Probability theory
- Information theory
- M&S Ch 2
- Basic concepts in the classification task
74Downloading
- Hw1
- Mallet Guide
- Homework Guide
75Coming up
- Next Tuesday
- Mallet tutorial on 1/8 (Tues), 10:30-11:30am, at LLC 109.
- Classification algorithm overview and Naïve Bayes: read the paper beforehand.
- Next Thursday
- kNN and Rocchio: read the other paper.
- Hw1 is due at 11pm on 1/13.
76Additional slides
77An example
- 570/571
- POS tagging: HMM
- Parsing: PCFG
- MT: Model 1-4 training
- 572
- HMM: forward-backward algorithm
- PCFG: inside-outside algorithm
- MT: EM algorithm
- → All are special cases of the EM algorithm, one method of unsupervised learning.
78Proof: Relative entropy is always non-negative
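The proof itself was not preserved in this text version; a standard argument uses Jensen's inequality (log is concave, so E[log Z] <= log E[Z]):

    KL(p \,\|\, q) = -\sum_x p(x) \log_2 \frac{q(x)}{p(x)}
                  \ge -\log_2 \sum_x p(x) \frac{q(x)}{p(x)}
                   = -\log_2 \sum_x q(x) \ge -\log_2 1 = 0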
79Entropy of a language
- The entropy of a language L:
- If we make certain assumptions that the language is "nice", then the entropy can be calculated as:
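The formulas were lost in the conversion; the standard definitions (cf. M&S Ch 2, with x_{1n} a sequence of n symbols) are:

    H(L) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 p(x_{1n})

and, under the "nice" (stationary, ergodic) assumptions,

    H(L) = -\lim_{n \to \infty} \frac{1}{n} \log_2 p(x_{1n})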
80Cross entropy of a language
- The cross entropy of a language L:
- If we make certain assumptions that the language is "nice", then the cross entropy can be calculated as:
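Likewise, for a model m of the true distribution p (cf. M&S Ch 2):

    H(L, m) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 m(x_{1n})

and, under the same assumptions,

    H(L, m) = -\lim_{n \to \infty} \frac{1}{n} \log_2 m(x_{1n})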
81Conditional Entropy