Title: CS 124/LINGUIST 180: From Languages to Information
1. CS 124/LINGUIST 180: From Languages to Information
- Lecture 2
- Tokenization/Segmentation
- Minimum Edit Distance
Thanks to Chris Manning and Serafim Batzoglou for slides!
2. Outline
- Tokenization
- Word Tokenization
- Normalization
- Lemmatization and stemming
- Sentence Tokenization
- Minimum Edit Distance
- Levenshtein distance
- Needleman-Wunsch
- Smith-Waterman
3. Tokenization
- For:
- Information retrieval
- Information extraction
- Spell-checking
- Text-to-speech synthesis
- 3 tasks:
- Segmenting/tokenizing words in running text
- Normalizing word formats
- Segmenting sentences in running text
- Why not just periods and white-space?
- "Mr. Sherwood said reaction to Sea Containers' proposal has been 'very positive.' In New York Stock Exchange composite trading yesterday, Sea Containers closed at 62.625, up 62.5 cents."
- "'I said, what're you? Crazy?' said Sadowsky. 'I can't afford to do that.'"
4. What's a word?
- I do uh main- mainly business data processing
- Fragments
- Filled pauses
- Are cat and cats the same word?
- Some terminology:
- Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense
- Cat and cats: same lemma
- Wordform: the full inflected surface form
- Cat and cats: different wordforms
- Token/Type
5. How many words?
- they picnicked by the pool then lay back on the grass and looked at the stars
- 16 tokens
- 14 types
- SWBD (Switchboard corpus):
- 2.4 million wordform tokens
- 20,000 wordform types
- Brown et al. (1992), large corpus:
- 583 million wordform tokens
- 293,181 wordform types
- Shakespeare:
- 884,647 wordform tokens
- 31,534 wordform types
- Let N = number of tokens, V = vocabulary = set of types
- General wisdom: |V| > O(sqrt(N))
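The token/type counts above can be reproduced in a few lines of Python (a minimal sketch; whitespace splitting stands in for a real tokenizer):

```python
# Count wordform tokens and types, using the sentence from this slide.
# Splitting on whitespace is a stand-in for real tokenization.
text = ("they picnicked by the pool then lay back on the "
        "grass and looked at the stars")
tokens = text.split()         # every running occurrence is a token
types = set(tokens)           # distinct wordforms are types

print(len(tokens), "tokens")  # 16 tokens
print(len(types), "types")    # 14 types ("the" occurs three times)
```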
6. Issues in Tokenization
- Finland's capital →
- Finland? Finlands? Finland's?
- what're, I'm, isn't →
- what are, I am, is not
- Hewlett-Packard →
- Hewlett and Packard as two tokens?
- state-of-the-art:
- break up?
- lowercase, lower-case, lower case?
- San Francisco, New York: one token or two?
- Words with punctuation:
- m.p.h., PhD.
Slide from Chris Manning
7. Tokenization: language issues
- French:
- L'ensemble → one token or two?
- L? L'? Le?
- Want l'ensemble to match with un ensemble
- German noun compounds are not segmented
- Lebensversicherungsgesellschaftsangestellter
- life insurance company employee
- German retrieval systems benefit greatly from a
compound splitter module
Slide from Chris Manning
8. Tokenization: language issues
- Chinese and Japanese: no spaces between words
- 莎拉波娃现在居住在美国东南部的佛罗里达。
- 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
- Sharapova now lives in US southeastern Florida
- Further complicated in Japanese, with multiple alphabets intermingled
- Dates/amounts in multiple formats: フォーチュン500社は情報不足のため時間あたり＄500K(約6,000万円)
- End-user can express query entirely in hiragana!
Slide from Chris Manning
9. Word Segmentation in Chinese
- Words composed of characters
- Characters are generally 1 syllable and 1 morpheme
- Average word is 2.4 characters long
- Standard segmentation algorithm:
- Maximum Matching
- (also called Greedy)
10. Maximum Matching Word Segmentation
- Given a wordlist of Chinese, and a string:
- 1) Start a pointer at the beginning of the string
- 2) Find the longest word in the dictionary that matches the string starting at the pointer
- 3) Move the pointer past that word in the string
- 4) Go to 2
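The loop above can be sketched in a few lines of Python (the single-character fallback for out-of-vocabulary characters and the tiny lexicon are illustrative assumptions):

```python
def max_match(string, wordlist):
    """Greedy longest-match-first segmentation (MaxMatch)."""
    words = []
    i = 0
    while i < len(string):
        # Step 2: try the longest dictionary match starting at the pointer;
        # fall back to a single character if nothing matches.
        for j in range(len(string), i, -1):
            if string[i:j] in wordlist or j == i + 1:
                words.append(string[i:j])
                i = j  # Step 3: move the pointer past the matched word
                break
    return words

# The English failure case from the next slide: greedy matching grabs
# "theta" first, so the segmentation goes wrong.
lexicon = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", lexicon))  # ['theta', 'bled', 'own', 'there']
```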
11. English failure example (Palmer 00)
- the table down there
- thetabledownthere
- Theta bled own there
- But works astonishingly well in Chinese
- 莎拉波娃现在居住在美国东南部的佛罗里达。
- 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
- Modern algorithms better still:
- probabilistic segmentation
- using sequence models like HMMs
12. Normalization
- Need to normalize terms
- For IR, indexed text and query terms must have the same form
- We want to match U.S.A. and USA
- We most commonly implicitly define equivalence classes of terms
- e.g., by deleting periods in a term
- Alternative is to do asymmetric expansion:
- Enter: window → Search: window, windows
- Enter: windows → Search: Windows, windows, window
- Enter: Windows → Search: Windows
- Potentially more powerful, but less efficient
Slide from Chris Manning
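The period-deleting equivalence class above might be implemented as follows (a minimal sketch; the function name is ours, and case folding is included as a second common normalization):

```python
def normalize(term):
    """Map a term to its equivalence class by deleting periods
    and folding case (one common implicit normalization)."""
    return term.replace(".", "").lower()

# U.S.A. and USA now fall into the same equivalence class:
print(normalize("U.S.A.") == normalize("USA"))  # True
```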
13. Case folding
- For IR: reduce all letters to lower case
- exception: upper case in mid-sentence?
- e.g., General Motors
- Fed vs. fed
- SAIL vs. sail
- Often best to lower case everything, since users will use lowercase regardless of correct capitalization
- For TTS:
- We keep case (US versus us is important)
- For sentiment analysis, MT, info extraction:
- Case is helpful
Slide from Chris Manning
14. Lemmatization
- Reduce inflectional/variant forms to base form
- E.g.,
- am, are, is → be
- car, cars, car's, cars' → car
- the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing proper reduction to dictionary headword form
Slide from Chris Manning
15. Stemming
- Reduce terms to their roots before indexing
- "Stemming" suggests crude affix chopping
- language dependent
- e.g., automate(s), automatic, automation all reduced to automat
- Stemmed: for exampl compress and compress ar both accept as equival to compress
- Original: for example compressed and compression are both accepted as equivalent to compress
Slide from Chris Manning
16. Porter's algorithm
- Commonest algorithm for stemming English
- Results suggest it's at least as good as other stemming options
- Conventions: 5 phases of reductions
- phases applied sequentially
- each phase consists of a set of commands
- sample convention: of the rules in a compound command, select the one that applies to the longest suffix
Slide from Chris Manning
17. Typical rules in Porter
- sses → ss
- ies → i
- ational → ate
- tional → tion
- Weight-of-word-sensitive rules:
- (m > 1) EMENT → ∅
- replacement → replac
- cement → cement
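The rules above translate directly into code. This is only a fragment of the full Porter stemmer, and the crude `measure` approximation of m (counting vowel-consonant sequences) is our simplification:

```python
import re

def measure(stem):
    # Rough m: the number of vowel-consonant sequences in the stem.
    return len(re.findall(r"[aeiou]+[^aeiou]+", stem))

def porter_step(word):
    """Apply a few illustrative Porter-style rules (not the full stemmer)."""
    for suffix, repl in [("sses", "ss"), ("ies", "i"),
                         ("ational", "ate"), ("tional", "tion")]:
        if word.endswith(suffix):
            return word[:-len(suffix)] + repl
    # Weight-of-word-sensitive rule: (m > 1) EMENT -> null
    if word.endswith("ement") and measure(word[:-5]) > 1:
        return word[:-5]
    return word

print(porter_step("replacement"))  # replac (m of "replac" is 2)
print(porter_step("cement"))       # cement (m of "c" is 0, rule blocked)
```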
18. Stemming/Morphology
- Outside of IR:
- Stemming isn't done
- But morphological analysis can be useful
19. English Morphology
- Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes
- We can usefully divide morphemes into two classes:
- Stems: the core meaning-bearing units
- Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
20. Nouns and Verbs (English)
- Nouns are simple (not really)
- Markers for plural and possessive
- Verbs are only slightly more complex
- Markers appropriate to the tense of the verb
21. Regulars and Irregulars
- OK, so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules)
- Mouse/mice, goose/geese, ox/oxen
- Go/went, fly/flew
- The terms regular and irregular will be used to refer to words that follow the rules and those that don't
22. Regular and Irregular Nouns and Verbs
- Regulars
- Walk, walks, walking, walked, walked
- Table, tables
- Irregulars
- Eat, eats, eating, ate, eaten
- Catch, catches, catching, caught, caught
- Cut, cuts, cutting, cut, cut
- Goose, geese
23. Compute
- Many paths are possible
- Start with compute:
- Computer → computerize → computerization
- Computation → computational
- Computer → computerize → computerizable
- Compute → computee
24. Uses for morphological analysis
- Machine translation
- Need to know that the Spanish words quiero and quieres are both related to querer 'want'
- Other languages:
- Turkish
- Uygarlastiramadiklarimizdanmissinizcasina
- (behaving) as if you are among those whom we could not civilize
- Uygar: civilized; las: become
- tir: cause; ama: not able
- dik: past; lar: plural
- imiz: p1pl; dan: abl
- mis: past; siniz: 2pl; casina: as if
25. What we want
- Something to automatically do the following kinds of mappings:
- Cats → cat +N +PL
- Cat → cat +N +SG
- Cities → city +N +PL
- Merging → merge +V +Present-participle
- Caught → catch +V +Past-participle
26. Morphological Parsing: Goal
27. Sentence Segmentation
- "!" and "?" are relatively unambiguous
- Period "." is quite ambiguous:
- Sentence boundary
- Abbreviations like Inc. or Dr.
- General idea:
- Build a binary classifier
- Looks at a "."
- Decides EndOfSentence/NotEndOfSentence
- Could be hand-written rules, sequences of regular expressions, or machine learning
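A hand-written-rules version of such a classifier might look like this sketch (the abbreviation list and the lowercase-next-word heuristic are illustrative assumptions, not a real system):

```python
# Tiny illustrative abbreviation list; a real system needs a much larger one.
ABBREVIATIONS = {"mr.", "dr.", "inc.", "st.", "vs.", "e.g.", "i.e."}

def is_eos(word_with_period, next_word):
    """Classify the "." ending word_with_period: EndOfSentence or not."""
    if word_with_period.lower() in ABBREVIATIONS:
        return False                      # known abbreviation
    if next_word and next_word[0].islower():
        return False                      # sentence probably continues
    return True

print(is_eos("Dr.", "Sherwood"))   # False: abbreviation
print(is_eos("positive.", "In"))   # True: capitalized next word
```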
28. Determining if a word is end-of-utterance: a Decision Tree
29. More sophisticated decision tree features
- Prob(word with "." occurs at end-of-sentence)
- Prob(word after "." occurs at beginning-of-sentence)
- Length of word with "."
- Length of word after "."
- Case of word with ".": Upper, Lower, Cap, Number
- Case of word after ".": Upper, Lower, Cap, Number
- Punctuation after "." (if any)
- Abbreviation class of word with "." (month name, unit-of-measure, title, address name, etc.)
From Richard Sproat slides
30. Learning Decision Trees
- DTs are rarely built by hand
- Hand-building only possible for very simple features/domains
- Lots of algorithms for DT induction
31. II. Minimum Edit Distance
- Spell-checking:
- Non-word error detection
- detecting "graffe"
- Non-word error correction
- figuring out that "graffe" should be "giraffe"
- Context-dependent error detection and correction
- figuring out that "piece" in "war and piece" should be "peace"
32. Non-word error detection
- Any word not in a dictionary
- Assume it's a spelling error
- Need a big dictionary!
33. Isolated word error correction
- How do I fix "graffe"?
- Search through all words:
- graf
- craft
- grail
- giraffe
- Pick the one that's closest to "graffe"
- What does "closest" mean?
- We need a distance metric
- The simplest one: edit distance
- (More sophisticated probabilistic ones: noisy channel)
34. Edit Distance
- The minimum edit distance between two strings
- Is the minimum number of editing operations
- Insertion
- Deletion
- Substitution
- Needed to transform one into the other
35. Minimum Edit Distance
36. Minimum Edit Distance
- If each operation has cost of 1:
- Distance between intention and execution is 5
- If substitutions cost 2 (Levenshtein):
- Distance between them is 8
37. Edit transcript
38. Defining Min Edit Distance
- For two strings S1 of length n, S2 of length m
- distance(i,j) or D(i,j)
- means the edit distance of S1[1..i] and S2[1..j]
- i.e., the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2
- The edit distance of S1, S2 is D(n,m)
- We compute D(n,m) by computing D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
39. Defining Min Edit Distance
- Base conditions:
- D(i,0) = i
- D(0,j) = j
- Recurrence relation:
- D(i,j) = min of:
- D(i-1,j) + 1
- D(i,j-1) + 1
- D(i-1,j-1) + (1 if S1(i) ≠ S2(j), else 0)
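The base conditions and recurrence translate line-for-line into a table computation (a sketch; the `sub_cost` parameter is our addition, letting you switch to the Levenshtein variant where substitutions cost 2):

```python
def min_edit_distance(s1, s2, sub_cost=1):
    """Tabular computation of D(n,m) from the recurrence above."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                            # base: D(i,0) = i
    for j in range(1, m + 1):
        D[0][j] = j                            # base: D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,     # deletion
                          D[i][j - 1] + 1,     # insertion
                          D[i - 1][j - 1] + sub)
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```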
40. Dynamic Programming
- A tabular computation of D(n,m)
- Bottom-up:
- We compute D(i,j) for small i,j
- And compute larger D(i,j) based on previously computed smaller values
41. The Edit Distance Table
44. Suppose we want the alignment too
- We can keep a backtrace
- Every time we enter a cell, remember where we came from
- Then when we reach the end, we can trace back from the upper right corner to get an alignment
45. Backtrace
46. Adding Backtrace to MinEdit
- Base conditions:
- D(i,0) = i
- D(0,j) = j
- Recurrence relation:
- D(i,j) = min of:
- D(i-1,j) + 1 (case 1)
- D(i,j-1) + 1 (case 2)
- D(i-1,j-1) + (1 if S1(i) ≠ S2(j), else 0) (case 3)
- ptr(i,j) =
- DOWN (case 1)
- LEFT (case 2)
- DIAG (case 3)
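Putting the pointers and the traceback together, a sketch of MinEdit with backtrace (how ties among equal-cost pointers are broken is our choice; here tuple comparison prefers DIAG):

```python
def min_edit_alignment(s1, s2, sub_cost=1):
    """Min edit distance with a backtrace; returns one optimal alignment."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else sub_cost
            # min over (cost, pointer) pairs; ties break alphabetically,
            # so DIAG is preferred.
            D[i][j], ptr[i][j] = min(
                (D[i - 1][j - 1] + sub, "DIAG"),  # case 3: substitute/copy
                (D[i - 1][j] + 1, "DOWN"),        # case 1: delete from s1
                (D[i][j - 1] + 1, "LEFT"))        # case 2: insert into s1
    # Trace back from (n, m) to (0, 0), emitting alignment columns.
    a1, a2, i, j = [], [], n, m
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "DOWN":
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2)), D[n][m]
```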
47. MinEdit with Backtrace
48. Performance
- Time: O(nm)
- Space: O(nm)
- Backtrace: O(n+m)
49. Weighted Edit Distance
- Why would we add weights to the computation?
- How?
50. Confusion matrix
52. Weighted Minimum Edit Distance
53. Why "Dynamic Programming"?
- "I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I'm not using the term lightly; I'm using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning is not a good word for various reasons. I decided therefore to use the word, programming. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."
- Richard Bellman, Eye of the Hurricane: An Autobiography, 1984
54. Evolution at the DNA level
(Figure: sequence edits (mutation, deletion): ACGGTGCAGTTACCA → AC----CAGTCCACCA; rearrangements: inversion, translocation, duplication)
55. Evolutionary Rates
(Figure: mutations passed to the next generation are either tolerated (OK) or deleterious (X); "Still OK?")
56. Sequence conservation implies function
- Alignment is the key to
- Finding important regions
- Determining function
- Uncovering the evolutionary forces
57. Sequence Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2...xM, y = y1y2...yN, an alignment is an assignment of gaps to positions 0, ..., M in x, and 0, ..., N in y, so as to line up each letter in one sequence with either a letter, or a gap, in the other sequence
58. What is a good alignment?
- AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-     6 matches, 3 mismatches, 1 gap
AGCGAAGTTT

AGGCTA-GTT-    7 matches, 1 mismatch, 3 gaps
AG-CGAAGTTT

AGGC-TA-GTT-   7 matches, 0 mismatches, 5 gaps
AG-CG-AAGTTT
59. Alignments in two fields
- In Natural Language Processing
- We generally talk about distance (minimized)
- And weights
- In Computational Biology
- We generally talk about similarity (maximized)
- And scores
60. Scoring Alignments
- Rough intuition:
- Similar sequences evolved from a common ancestor
- Evolution changed the sequences from this ancestral sequence by mutations:
- Replacement: one letter replaced by another
- Deletion: deletion of a letter
- Insertion: insertion of a letter
- Scoring of sequence similarity should examine how many operations took place
61. Scoring Function
- Sequence edits: AGGCCTC
- Mutations: AGGACTC
- Insertions: AGGGCCTC
- Deletions: AGG.CTC
- Scoring function:
- Match: +m
- Mismatch: -s
- Gap: -d
- Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
62. Example
- x = AGTA, m = 1
- y = ATA, s = -1, d = -1

F(1,1) = max{F(0,0) + s(A,A), F(0,1) + d, F(1,0) + d}
       = max{0 + 1, -1 - 1, -1 - 1} = 1

(Figure: the DP matrix F(i,j) for i = 0..4, j = 0..3; the optimal alignment columns are A/A, G/-, T/T, A/A)
63. The Needleman-Wunsch Matrix
(Figure: matrix with x1...xM along one axis and y1...yN along the other)
- Every nondecreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences
- An optimal alignment is composed of optimal subalignments
64. The Needleman-Wunsch Algorithm
- Initialization:
- F(0,0) = 0
- F(0,j) = -j × d
- F(i,0) = -i × d
- Main iteration (filling in partial alignments):
- For each i = 1..M, for each j = 1..N:
- F(i,j) = max of:
- F(i-1,j-1) + s(xi,yj) (case 1)
- F(i-1,j) - d (case 2)
- F(i,j-1) - d (case 3)
- Ptr(i,j) = DIAG if case 1; LEFT if case 2; UP if case 3
- Termination: F(M,N) is the optimal score, and from Ptr(M,N) we can trace back the optimal alignment
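The initialization, iteration, and termination steps above can be sketched as follows, returning only the optimal score F(M,N) (the pointer bookkeeping for traceback is omitted; the m, s, d parameters follow the scoring-function slide):

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment score: match +m, mismatch -s, gap -d."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d                        # initialization: F(i,0) = -i*d
    for j in range(1, N + 1):
        F[0][j] = -j * d                        # initialization: F(0,j) = -j*d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            score = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + score,  # case 1: (mis)match
                          F[i - 1][j] - d,          # case 2: gap
                          F[i][j - 1] - d)          # case 3: gap
    return F[M][N]                              # termination: optimal score

print(needleman_wunsch("AGTA", "ATA"))  # 2, e.g. AGTA aligned with A-TA
```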
65. A variant of the basic algorithm
- Maybe it is OK to have an unlimited number of gaps at the beginning and end:

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

- Then, we don't want to penalize gaps at the ends
66. Different types of overlaps
- Example: 2 overlapping "reads" from a sequencing project (recall Lecture 1)
- Example: search for a mouse gene within a human chromosome
67. The Overlap Detection variant
(Figure: the DP matrix with x1...xM and y1...yN on the axes)
- Changes:
- Initialization: for all i, j:
- F(i,0) = 0
- F(0,j) = 0
- Termination:
- FOPT = max{ maxi F(i,N), maxj F(M,j) }
68. The local alignment problem
- Given two strings x = x1...xM, y = y1...yN
- Find substrings x', y' whose similarity (optimal global alignment value) is maximum
- x = aaaacccccggggtta
- y = ttcccgggaaccaacc
69. Why local alignment? Examples
- Genes are shuffled between genomes
- Portions of proteins (domains) are often conserved
70. Cross-species genome similarity
- 98% of genes are conserved between any two mammals
- >70% average similarity in protein sequence

hum_a  GTTGACAATAGAGGGTCTGGCAGAGGCTC---------------------  @ 57331/400001
mus_a  GCTGACAATAGAGGGGCTGGCAGAGGCTC---------------------  @ 78560/400001
rat_a  GCTGACAATAGAGGGGCTGGCAGAGACTC---------------------  @ 112658/369938
fug_a  TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG  @ 36008/68174

hum_a  CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 57381/400001
mus_a  CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 78610/400001
rat_a  CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 112708/369938
fug_a  TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG  @ 36058/68174

hum_a  AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT  @ 57431/400001
mus_a  AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT  @ 78659/400001
rat_a  AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT  @ 112757/369938
fug_a  AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC  @ 36084/68174

hum_a  AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 57481/400001
mus_a  AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 78708/400001
rat_a  AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 112806/369938
fug_a  CCGAGGACCCTGA-------------------------------------  @ 36097/68174

"atoh" enhancer in human, mouse, rat, fugu fish
71. The Smith-Waterman algorithm
- Idea: ignore badly aligning regions
- Modifications to Needleman-Wunsch:
- Initialization: F(0,j) = F(i,0) = 0
- Iteration: F(i,j) = max of:
- 0
- F(i-1,j) - d
- F(i,j-1) - d
- F(i-1,j-1) + s(xi,yj)
72. The Smith-Waterman algorithm
- Termination:
- If we want the best local alignment:
- FOPT = maxi,j F(i,j)
- Find FOPT and trace back
- If we want all local alignments scoring > t:
- For all i, j find F(i,j) > t, and trace back
- Complicated by overlapping local alignments
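A sketch of the scoring half of Smith-Waterman: the only changes from the Needleman-Wunsch code are the zero in the max (a local alignment can restart for free) and taking the best cell anywhere in the matrix as FOPT (traceback omitted; the test strings are our own, not the slide example):

```python
def smith_waterman(x, y, m=1, s=1, d=1):
    """Best local alignment score: match +m, mismatch -s, gap -d."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]  # F(i,0) = F(0,j) = 0
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            score = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(0,                       # restart: drop a bad prefix
                          F[i - 1][j - 1] + score,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            best = max(best, F[i][j])              # FOPT = max over all i, j
    return best

print(smith_waterman("ACGT", "TACG"))  # 3: local alignment of ACG
print(smith_waterman("AAAA", "TTTT"))  # 0: nothing aligns locally
```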
73. Local Alignment Example
s = TAATA, t = ATCTAA
74. Local Alignment Example
s = TAATA, t = TACTAA
75. Local Alignment Example
s = TAATA, t = TACTAA
Slide from Hasan Ogul
76. Local Alignment Example
s = TAATA, t = TACTAA
77. Summary
- Tokenization
- Word Tokenization
- Normalization
- Lemmatization and stemming
- Sentence Tokenization
- Minimum Edit Distance
- Levenshtein distance
- Needleman-Wunsch (weighted global alignment)
- Smith-Waterman (local alignment)