Title: CS 124/LINGUIST 180: From Languages to Information
1. CS 124/LINGUIST 180: From Languages to Information
- Lecture 2
- Tokenization/Segmentation
- Minimum Edit Distance
Thanks to Chris Manning and Serafim Batzoglou for slides!
2. Outline
- Tokenization
- Word Tokenization
- Normalization
- Lemmatization and stemming
- Sentence Tokenization
- Minimum Edit Distance
- Levenshtein distance
- Needleman-Wunsch
- Smith-Waterman
3. Tokenization
- For:
- Information retrieval
- Information extraction
- Spell-checking
- Text-to-speech synthesis
- 3 tasks:
- Segmenting/tokenizing words in running text
- Normalizing word formats
- Segmenting sentences in running text
- Why not just periods and white-space?
- "Mr. Sherwood said reaction to Sea Containers' proposal has been 'very positive.' In New York Stock Exchange composite trading yesterday, Sea Containers closed at 62.625, up 62.5 cents."
- "'I said, what're you? Crazy?' said Sadowsky. 'I can't afford to do that.'"
4. What's a word?
- I do uh main- mainly business data processing
- Fragments
- Filled pauses
- Are cat and cats the same word?
- Some terminology:
- Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense
- Cat and cats: same lemma
- Wordform: the full inflected surface form
- Cat and cats: different wordforms
- Token/Type
5. How many words?
- they picnicked by the pool then lay back on the grass and looked at the stars
- 16 tokens
- 14 types
- SWBD (Switchboard corpus):
- 2.4 million wordform tokens
- 20,000 wordform types
- Brown et al. (1992), large corpus:
- 583 million wordform tokens
- 293,181 wordform types
- Shakespeare:
- 884,647 wordform tokens
- 31,534 wordform types
- Let N = number of tokens, V = vocabulary = set of types
- General wisdom: |V| > O(sqrt(N))
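The token/type counts above can be reproduced in a few lines of Python (a minimal sketch; whitespace splitting stands in for a real tokenizer):

```python
# Count wordform tokens and types, using the sentence from this slide.
# Splitting on whitespace is a stand-in for real tokenization.
text = ("they picnicked by the pool then lay back on the "
        "grass and looked at the stars")
tokens = text.split()         # every running occurrence is a token
types = set(tokens)           # distinct wordforms are types

print(len(tokens), "tokens")  # 16 tokens
print(len(types), "types")    # 14 types ("the" occurs three times)
```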
6. Issues in Tokenization
- Finland's capital →
- Finland? Finlands? Finland's?
- what're, I'm, isn't →
- what are, I am, is not
- Hewlett-Packard →
- Hewlett and Packard as two tokens?
- state-of-the-art:
- break up?
- lowercase, lower-case, lower case?
- San Francisco, New York: one token or two?
- Words with punctuation:
- m.p.h., PhD.
Slide from Chris Manning
7. Tokenization: language issues
- French:
- L'ensemble → one token or two?
- L? L'? Le?
- Want l'ensemble to match with un ensemble
- German noun compounds are not segmented
- Lebensversicherungsgesellschaftsangestellter
- life insurance company employee
- German retrieval systems benefit greatly from a
compound splitter module
Slide from Chris Manning
8. Tokenization: language issues
- Chinese and Japanese: no spaces between words
- 莎拉波娃现在居住在美国东南部的佛罗里达。
- 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
- Sharapova now lives in US southeastern Florida
- Further complicated in Japanese, with multiple alphabets intermingled
- Dates/amounts in multiple formats: フォーチュン500社は情報不足のため時間あたり＄500K(約6,000万円)
- End-user can express query entirely in hiragana!
Slide from Chris Manning
9. Word Segmentation in Chinese
- Words composed of characters
- Characters are generally 1 syllable and 1 morpheme
- Average word is 2.4 characters long
- Standard segmentation algorithm:
- Maximum Matching
- (also called Greedy)
10. Maximum Matching Word Segmentation
- Given a wordlist of Chinese, and a string:
- 1) Start a pointer at the beginning of the string
- 2) Find the longest word in the dictionary that matches the string starting at the pointer
- 3) Move the pointer past that word in the string
- 4) Go to 2
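The loop above can be sketched in a few lines of Python (the single-character fallback for out-of-vocabulary characters and the tiny lexicon are illustrative assumptions):

```python
def max_match(string, wordlist):
    """Greedy longest-match-first segmentation (MaxMatch)."""
    words = []
    i = 0
    while i < len(string):
        # Step 2: try the longest dictionary match starting at the pointer;
        # fall back to a single character if nothing matches.
        for j in range(len(string), i, -1):
            if string[i:j] in wordlist or j == i + 1:
                words.append(string[i:j])
                i = j  # Step 3: move the pointer past the matched word
                break
    return words

# The English failure case from the next slide: greedy matching grabs
# "theta" first, so the segmentation goes wrong.
lexicon = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", lexicon))  # ['theta', 'bled', 'own', 'there']
```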
11. English failure example (Palmer 00)
- the table down there
- thetabledownthere
- Theta bled own there
- But works astonishingly well in Chinese
- 莎拉波娃现在居住在美国东南部的佛罗里达。
- 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
- Modern algorithms better still:
- probabilistic segmentation
- using sequence models like HMMs
12. Normalization
- Need to normalize terms
- For IR, indexed text and query terms must have the same form
- We want to match U.S.A. and USA
- We most commonly implicitly define equivalence classes of terms
- e.g., by deleting periods in a term
- Alternative is to do asymmetric expansion:
- Enter: window → Search: window, windows
- Enter: windows → Search: Windows, windows, window
- Enter: Windows → Search: Windows
- Potentially more powerful, but less efficient
Slide from Chris Manning
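The period-deleting equivalence class above might be implemented as follows (a minimal sketch; the function name is ours, and case folding is included as a second common normalization):

```python
def normalize(term):
    """Map a term to its equivalence class by deleting periods
    and folding case (one common implicit normalization)."""
    return term.replace(".", "").lower()

# U.S.A. and USA now fall into the same equivalence class:
print(normalize("U.S.A.") == normalize("USA"))  # True
```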
13. Case folding
- For IR: reduce all letters to lower case
- exception: upper case in mid-sentence?
- e.g., General Motors
- Fed vs. fed
- SAIL vs. sail
- Often best to lower case everything, since users will use lowercase regardless of correct capitalization
- For TTS:
- We keep case (US versus us is important)
- For sentiment analysis, MT, info extraction:
- Case is helpful
Slide from Chris Manning
14. Lemmatization
- Reduce inflectional/variant forms to base form
- E.g.,
- am, are, is → be
- car, cars, car's, cars' → car
- the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing proper reduction to dictionary headword form
Slide from Chris Manning
15. Stemming
- Reduce terms to their roots before indexing
- "Stemming" suggests crude affix chopping
- language dependent
- e.g., automate(s), automatic, automation all reduced to automat
- Stemmed: for exampl compress and compress ar both accept as equival to compress
- Original: for example compressed and compression are both accepted as equivalent to compress
Slide from Chris Manning
16. Porter's algorithm
- Commonest algorithm for stemming English
- Results suggest it's at least as good as other stemming options
- Conventions: 5 phases of reductions
- phases applied sequentially
- each phase consists of a set of commands
- sample convention: of the rules in a compound command, select the one that applies to the longest suffix
Slide from Chris Manning
17. Typical rules in Porter
- sses → ss
- ies → i
- ational → ate
- tional → tion
- Weight-of-word-sensitive rules:
- (m > 1) EMENT → ∅
- replacement → replac
- cement → cement
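The rules above translate directly into code. This is only a fragment of the full Porter stemmer, and the crude `measure` approximation of m (counting vowel-consonant sequences) is our simplification:

```python
import re

def measure(stem):
    # Rough m: the number of vowel-consonant sequences in the stem.
    return len(re.findall(r"[aeiou]+[^aeiou]+", stem))

def porter_step(word):
    """Apply a few illustrative Porter-style rules (not the full stemmer)."""
    for suffix, repl in [("sses", "ss"), ("ies", "i"),
                         ("ational", "ate"), ("tional", "tion")]:
        if word.endswith(suffix):
            return word[:-len(suffix)] + repl
    # Weight-of-word-sensitive rule: (m > 1) EMENT -> null
    if word.endswith("ement") and measure(word[:-5]) > 1:
        return word[:-5]
    return word

print(porter_step("replacement"))  # replac (m of "replac" is 2)
print(porter_step("cement"))       # cement (m of "c" is 0, rule blocked)
```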
18. Stemming/Morphology
- Outside of IR:
- Stemming isn't done
- But morphological analysis can be useful
19. English Morphology
- Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes
- We can usefully divide morphemes into two classes:
- Stems: the core meaning-bearing units
- Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
20. Nouns and Verbs (English)
- Nouns are simple (not really)
- Markers for plural and possessive
- Verbs are only slightly more complex
- Markers appropriate to the tense of the verb
21. Regulars and Irregulars
- OK, so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules)
- Mouse/mice, goose/geese, ox/oxen
- Go/went, fly/flew
- The terms regular and irregular will be used to refer to words that follow the rules and those that don't
22. Regular and Irregular Nouns and Verbs
- Regulars
- Walk, walks, walking, walked, walked
- Table, tables
- Irregulars
- Eat, eats, eating, ate, eaten
- Catch, catches, catching, caught, caught
- Cut, cuts, cutting, cut, cut
- Goose, geese
23. Compute
- Many paths are possible
- Start with compute:
- Computer → computerize → computerization
- Computation → computational
- Computer → computerize → computerizable
- Compute → computee
24. Uses for morphological analysis
- Machine translation
- Need to know that the Spanish words quiero and quieres are both related to querer 'want'
- Other languages:
- Turkish
- Uygarlastiramadiklarimizdanmissinizcasina
- (behaving) as if you are among those whom we could not civilize
- Uygar: civilized; las: become
- tir: cause; ama: not able
- dik: past; lar: plural
- imiz: p1pl; dan: abl
- mis: past; siniz: 2pl; casina: as if
25. What we want
- Something to automatically do the following kinds of mappings:
- Cats → cat +N +PL
- Cat → cat +N +SG
- Cities → city +N +PL
- Merging → merge +V +Present-participle
- Caught → catch +V +Past-participle
26. Morphological Parsing: Goal
27. Sentence Segmentation
- "!" and "?" are relatively unambiguous
- Period "." is quite ambiguous:
- Sentence boundary
- Abbreviations like Inc. or Dr.
- General idea:
- Build a binary classifier
- Looks at a "."
- Decides EndOfSentence/NotEndOfSentence
- Could be hand-written rules, sequences of regular expressions, or machine learning
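A hand-written-rules version of such a classifier might look like this sketch (the abbreviation list and the lowercase-next-word heuristic are illustrative assumptions, not a real system):

```python
# Tiny illustrative abbreviation list; a real system needs a much larger one.
ABBREVIATIONS = {"mr.", "dr.", "inc.", "st.", "vs.", "e.g.", "i.e."}

def is_eos(word_with_period, next_word):
    """Classify the "." ending word_with_period: EndOfSentence or not."""
    if word_with_period.lower() in ABBREVIATIONS:
        return False                      # known abbreviation
    if next_word and next_word[0].islower():
        return False                      # sentence probably continues
    return True

print(is_eos("Dr.", "Sherwood"))   # False: abbreviation
print(is_eos("positive.", "In"))   # True: capitalized next word
```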
28. Determining if a word is end-of-utterance: a Decision Tree
29. More sophisticated decision tree features
- Prob(word with "." occurs at end-of-sentence)
- Prob(word after "." occurs at beginning-of-sentence)
- Length of word with "."
- Length of word after "."
- Case of word with ".": Upper, Lower, Cap, Number
- Case of word after ".": Upper, Lower, Cap, Number
- Punctuation after "." (if any)
- Abbreviation class of word with "." (month name, unit-of-measure, title, address name, etc.)
From Richard Sproat slides
30. Learning Decision Trees
- DTs are rarely built by hand
- Hand-building only possible for very simple features/domains
- Lots of algorithms for DT induction
31. II. Minimum Edit Distance
- Spell-checking:
- Non-word error detection
- detecting "graffe"
- Non-word error correction
- figuring out that "graffe" should be "giraffe"
- Context-dependent error detection and correction
- figuring out that "piece" in "war and piece" should be "peace"
32. Non-word error detection
- Any word not in a dictionary
- Assume it's a spelling error
- Need a big dictionary!
33. Isolated word error correction
- How do I fix "graffe"?
- Search through all words:
- graf
- craft
- grail
- giraffe
- Pick the one that's closest to "graffe"
- What does "closest" mean?
- We need a distance metric
- The simplest one: edit distance
- (More sophisticated probabilistic ones: noisy channel)
34. Edit Distance
- The minimum edit distance between two strings
- Is the minimum number of editing operations
- Insertion
- Deletion
- Substitution
- Needed to transform one into the other
35. Minimum Edit Distance
36. Minimum Edit Distance
- If each operation has cost of 1:
- Distance between intention and execution is 5
- If substitutions cost 2 (Levenshtein):
- Distance between them is 8
37. Edit transcript
38. Defining Min Edit Distance
- For two strings S1 of length n, S2 of length m
- distance(i,j) or D(i,j)
- means the edit distance of S1[1..i] and S2[1..j]
- i.e., the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2
- The edit distance of S1, S2 is D(n,m)
- We compute D(n,m) by computing D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
39. Defining Min Edit Distance
- Base conditions:
- D(i,0) = i
- D(0,j) = j
- Recurrence relation:
- D(i,j) = min of:
- D(i-1,j) + 1
- D(i,j-1) + 1
- D(i-1,j-1) + (1 if S1(i) ≠ S2(j), else 0)
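The base conditions and recurrence translate line-for-line into a table computation (a sketch; the `sub_cost` parameter is our addition, letting you switch to the Levenshtein variant where substitutions cost 2):

```python
def min_edit_distance(s1, s2, sub_cost=1):
    """Tabular computation of D(n,m) from the recurrence above."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                            # base: D(i,0) = i
    for j in range(1, m + 1):
        D[0][j] = j                            # base: D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,     # deletion
                          D[i][j - 1] + 1,     # insertion
                          D[i - 1][j - 1] + sub)
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```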
40. Dynamic Programming
- A tabular computation of D(n,m)
- Bottom-up:
- We compute D(i,j) for small i,j
- And compute larger D(i,j) based on previously computed smaller values
41. The Edit Distance Table
44. Suppose we want the alignment too
- We can keep a backtrace
- Every time we enter a cell, remember where we came from
- Then when we reach the end, we can trace back from the upper right corner to get an alignment
45. Backtrace
46. Adding Backtrace to MinEdit
- Base conditions:
- D(i,0) = i
- D(0,j) = j
- Recurrence relation:
- D(i,j) = min of:
- D(i-1,j) + 1 (case 1)
- D(i,j-1) + 1 (case 2)
- D(i-1,j-1) + (1 if S1(i) ≠ S2(j), else 0) (case 3)
- ptr(i,j) =
- DOWN (case 1)
- LEFT (case 2)
- DIAG (case 3)
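Putting the pointers and the traceback together, a sketch of MinEdit with backtrace (how ties among equal-cost pointers are broken is our choice; here tuple comparison prefers DIAG):

```python
def min_edit_alignment(s1, s2, sub_cost=1):
    """Min edit distance with a backtrace; returns one optimal alignment."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else sub_cost
            # min over (cost, pointer) pairs; ties break alphabetically,
            # so DIAG is preferred.
            D[i][j], ptr[i][j] = min(
                (D[i - 1][j - 1] + sub, "DIAG"),  # case 3: substitute/copy
                (D[i - 1][j] + 1, "DOWN"),        # case 1: delete from s1
                (D[i][j - 1] + 1, "LEFT"))        # case 2: insert into s1
    # Trace back from (n, m) to (0, 0), emitting alignment columns.
    a1, a2, i, j = [], [], n, m
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "DOWN":
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2)), D[n][m]
```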
47. MinEdit with Backtrace
48. Performance
- Time: O(nm)
- Space: O(nm)
- Backtrace: O(n+m)
49. Weighted Edit Distance
- Why would we add weights to the computation?
- How?
50. Confusion matrix
52. Weighted Minimum Edit Distance
53. Why "Dynamic Programming"?
- "I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I'm not using the term lightly; I'm using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning is not a good word for various reasons. I decided therefore to use the word, programming. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."
- Richard Bellman, Eye of the Hurricane: An Autobiography, 1984
54. Evolution at the DNA level
(Figure: sequence edits (mutation, deletion): ACGGTGCAGTTACCA → AC----CAGTCCACCA; rearrangements: inversion, translocation, duplication)
55. Evolutionary Rates
(Figure: mutations passed to the next generation are either tolerated (OK) or deleterious (X); "Still OK?")
56. Sequence conservation implies function
- Alignment is the key to
- Finding important regions
- Determining function
- Uncovering the evolutionary forces
57. Sequence Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2...xM, y = y1y2...yN, an alignment is an assignment of gaps to positions 0, ..., M in x, and 0, ..., N in y, so as to line up each letter in one sequence with either a letter, or a gap, in the other sequence
58. What is a good alignment?
- AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-     6 matches, 3 mismatches, 1 gap
AGCGAAGTTT

AGGCTA-GTT-    7 matches, 1 mismatch, 3 gaps
AG-CGAAGTTT

AGGC-TA-GTT-   7 matches, 0 mismatches, 5 gaps
AG-CG-AAGTTT
59. Alignments in two fields
- In Natural Language Processing
- We generally talk about distance (minimized)
- And weights
- In Computational Biology
- We generally talk about similarity (maximized)
- And scores
60. Scoring Alignments
- Rough intuition:
- Similar sequences evolved from a common ancestor
- Evolution changed the sequences from this ancestral sequence by mutations:
- Replacement: one letter replaced by another
- Deletion: deletion of a letter
- Insertion: insertion of a letter
- Scoring of sequence similarity should examine how many operations took place
61. Scoring Function
- Sequence edits: AGGCCTC
- Mutations: AGGACTC
- Insertions: AGGGCCTC
- Deletions: AGG.CTC
- Scoring function:
- Match: +m
- Mismatch: -s
- Gap: -d
- Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
62. Example
- x = AGTA, m = 1
- y = ATA, s = -1, d = -1

F(1,1) = max{F(0,0) + s(A,A), F(0,1) + d, F(1,0) + d}
       = max{0 + 1, -1 - 1, -1 - 1} = 1

(Figure: the DP matrix F(i,j) for i = 0..4, j = 0..3; the optimal alignment columns are A/A, G/-, T/T, A/A)
63. The Needleman-Wunsch Matrix
(Figure: matrix with x1...xM along one axis and y1...yN along the other)
- Every nondecreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences
- An optimal alignment is composed of optimal subalignments
64. The Needleman-Wunsch Algorithm
- Initialization:
- F(0,0) = 0
- F(0,j) = -j × d
- F(i,0) = -i × d
- Main iteration (filling in partial alignments):
- For each i = 1..M, for each j = 1..N:
- F(i,j) = max of:
- F(i-1,j-1) + s(xi,yj) (case 1)
- F(i-1,j) - d (case 2)
- F(i,j-1) - d (case 3)
- Ptr(i,j) = DIAG if case 1; LEFT if case 2; UP if case 3
- Termination: F(M,N) is the optimal score, and from Ptr(M,N) we can trace back the optimal alignment
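The initialization, iteration, and termination steps above can be sketched as follows, returning only the optimal score F(M,N) (the pointer bookkeeping for traceback is omitted; the m, s, d parameters follow the scoring-function slide):

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment score: match +m, mismatch -s, gap -d."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d                        # initialization: F(i,0) = -i*d
    for j in range(1, N + 1):
        F[0][j] = -j * d                        # initialization: F(0,j) = -j*d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            score = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + score,  # case 1: (mis)match
                          F[i - 1][j] - d,          # case 2: gap
                          F[i][j - 1] - d)          # case 3: gap
    return F[M][N]                              # termination: optimal score

print(needleman_wunsch("AGTA", "ATA"))  # 2, e.g. AGTA aligned with A-TA
```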
65. A variant of the basic algorithm
- Maybe it is OK to have an unlimited number of gaps at the beginning and end:

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

- Then, we don't want to penalize gaps at the ends
66. Different types of overlaps
- Example: 2 overlapping "reads" from a sequencing project (recall Lecture 1)
- Example: search for a mouse gene within a human chromosome
67. The Overlap Detection variant
(Figure: the DP matrix with x1...xM and y1...yN on the axes)
- Changes:
- Initialization: for all i, j:
- F(i,0) = 0
- F(0,j) = 0
- Termination:
- FOPT = max{ maxi F(i,N), maxj F(M,j) }
68. The local alignment problem
- Given two strings x = x1...xM, y = y1...yN
- Find substrings x', y' whose similarity (optimal global alignment value) is maximum
- x = aaaacccccggggtta
- y = ttcccgggaaccaacc
69. Why local alignment? Examples
- Genes are shuffled between genomes
- Portions of proteins (domains) are often conserved
70. Cross-species genome similarity
- 98% of genes are conserved between any two mammals
- >70% average similarity in protein sequence

hum_a  GTTGACAATAGAGGGTCTGGCAGAGGCTC---------------------  @ 57331/400001
mus_a  GCTGACAATAGAGGGGCTGGCAGAGGCTC---------------------  @ 78560/400001
rat_a  GCTGACAATAGAGGGGCTGGCAGAGACTC---------------------  @ 112658/369938
fug_a  TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG  @ 36008/68174

hum_a  CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 57381/400001
mus_a  CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 78610/400001
rat_a  CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 112708/369938
fug_a  TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG  @ 36058/68174

hum_a  AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT  @ 57431/400001
mus_a  AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT  @ 78659/400001
rat_a  AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT  @ 112757/369938
fug_a  AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC  @ 36084/68174

hum_a  AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 57481/400001
mus_a  AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 78708/400001
rat_a  AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 112806/369938
fug_a  CCGAGGACCCTGA-------------------------------------  @ 36097/68174

"atoh" enhancer in human, mouse, rat, fugu fish
71. The Smith-Waterman algorithm
- Idea: ignore badly aligning regions
- Modifications to Needleman-Wunsch:
- Initialization: F(0,j) = F(i,0) = 0
- Iteration: F(i,j) = max of:
- 0
- F(i-1,j) - d
- F(i,j-1) - d
- F(i-1,j-1) + s(xi,yj)
72. The Smith-Waterman algorithm
- Termination:
- If we want the best local alignment:
- FOPT = maxi,j F(i,j)
- Find FOPT and trace back
- If we want all local alignments scoring > t:
- For all i, j find F(i,j) > t, and trace back
- Complicated by overlapping local alignments
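A sketch of the scoring half of Smith-Waterman: the only changes from the Needleman-Wunsch code are the zero in the max (a local alignment can restart for free) and taking the best cell anywhere in the matrix as FOPT (traceback omitted; the test strings are our own, not the slide example):

```python
def smith_waterman(x, y, m=1, s=1, d=1):
    """Best local alignment score: match +m, mismatch -s, gap -d."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]  # F(i,0) = F(0,j) = 0
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            score = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(0,                       # restart: drop a bad prefix
                          F[i - 1][j - 1] + score,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            best = max(best, F[i][j])              # FOPT = max over all i, j
    return best

print(smith_waterman("ACGT", "TACG"))  # 3: local alignment of ACG
print(smith_waterman("AAAA", "TTTT"))  # 0: nothing aligns locally
```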
73. Local Alignment Example
s = TAATA, t = ATCTAA
74. Local Alignment Example
s = TAATA, t = TACTAA
75. Local Alignment Example
s = TAATA, t = TACTAA
Slide from Hasan Ogul
76. Local Alignment Example
s = TAATA, t = TACTAA
77. Summary
- Tokenization
- Word Tokenization
- Normalization
- Lemmatization and stemming
- Sentence Tokenization
- Minimum Edit Distance
- Levenshtein distance
- Needleman-Wunsch (weighted global alignment)
- Smith-Waterman (local alignment)