1
CS 124/LINGUIST 180: From Languages to Information
  • Lecture 2
  • Tokenization/Segmentation
  • Minimum Edit Distance

Thanks to Chris Manning and Serafim Batzoglou for
slides!
2
Outline
  • Tokenization
  • Word Tokenization
  • Normalization
  • Lemmatization and stemming
  • Sentence Tokenization
  • Minimum Edit Distance
  • Levenshtein distance
  • Needleman-Wunsch
  • Smith-Waterman

3
Tokenization
  • For
  • Information retrieval
  • Information extraction
  • Spell-checking
  • Text-to-speech synthesis
  • 3 tasks
  • Segmenting/tokenizing words in running text
  • Normalizing word formats
  • Segmenting sentences in running text
  • Why not just periods and white-space?
  • Mr. Sherwood said reaction to Sea Containers'
    proposal has been "very positive." In New York
    Stock Exchange composite trading yesterday, Sea
    Containers closed at $62.625, up 62.5 cents.
  • "I said, 'what're you? Crazy?'" said Sadowsky.
    "I can't afford to do that."
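To see concretely why splitting on whitespace and stripping periods fails on text like the above, here is a minimal sketch (illustrative code, not part of the original slides):

    # Naive whitespace tokenization leaves punctuation glued to words,
    # and stripping periods blindly destroys abbreviations like "Mr."
    text = 'Mr. Sherwood said reaction to Sea Containers\' proposal has been "very positive."'

    naive = text.split()
    print(naive[-1])    # 'positive."' : quote and period stuck to the token

    stripped = [t.strip('."') for t in naive]
    print(stripped[0])  # 'Mr' : the abbreviation period is lost too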

4
What's a word?
  • I do uh main- mainly business data processing
  • Fragments
  • Filled pauses
  • Are cat and cats the same word?
  • Some terminology
  • Lemma: a set of lexical forms having the same
    stem, major part of speech, and rough word sense
  • Cat and cats: same lemma
  • Wordform: the full inflected surface form
  • Cat and cats: different wordforms
  • Token/Type

5
How many words?
  • they picnicked by the pool then lay back on the
    grass and looked at the stars
  • 16 tokens
  • 14 types
  • Switchboard corpus (SWBD)
  • 2.4 million wordform tokens
  • 20,000 wordform types,
  • Brown et al (1992) large corpus
  • 583 million wordform tokens
  • 293,181 wordform types
  • Shakespeare
  • 884,647 wordform tokens
  • 31,534 wordform types
  • Let N = number of tokens, V = vocabulary =
    number of types
  • General wisdom: V > O(sqrt(N))
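The token/type distinction is easy to check in code; a minimal sketch for the picnic sentence above:

    # Tokens are running words; types are distinct wordforms.
    sentence = ("they picnicked by the pool then lay back on the grass "
                "and looked at the stars")
    tokens = sentence.split()
    print(len(tokens), len(set(tokens)))  # 16 tokens, 14 types ("the" occurs 3 times)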

6
Issues in Tokenization
  • Finland's capital →
  • Finland? Finlands? Finland's?
  • what're, I'm, isn't →
  • what are, I am, is not
  • Hewlett-Packard →
  • Hewlett and Packard as two tokens?
  • state-of-the-art
  • Break up?
  • lowercase, lower-case, lower case?
  • San Francisco, New York: one token or two?
  • Words with punctuation
  • m.p.h., Ph.D.

Slide from Chris Manning
7
Tokenization language issues
  • French
  • L'ensemble → one token or two?
  • L? L'? Le?
  • Want l'ensemble to match with un ensemble
  • German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
  • life insurance company employee
  • German retrieval systems benefit greatly from a
    compound splitter module

Slide from Chris Manning
8
Tokenization language issues
  • Chinese and Japanese: no spaces between words
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  • Sharapova now lives in US southeastern
    Florida
  • Further complicated in Japanese, with multiple
    alphabets intermingled
  • Dates/amounts in multiple formats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
End-user can express query entirely in hiragana!
Slide from Chris Manning
9
Word Segmentation in Chinese
  • Words composed of characters
  • Characters are generally 1 syllable and 1
    morpheme.
  • Average word is 2.4 characters long.
  • Standard segmentation algorithm
  • Maximum Matching
  • (also called Greedy)

10
Maximum Matching Word Segmentation
  • Given a wordlist of Chinese, and a string:
  1. Start a pointer at the beginning of the string
  2. Find the longest word in the dictionary that
     matches the string starting at the pointer
  3. Move the pointer over the word in the string
  4. Go to 2
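A minimal sketch of the algorithm in code (the wordlist is assumed to be a Python set of known words; illustrative, not from the slides):

    def max_match(text, wordlist):
        """Greedy (maximum matching) segmentation."""
        words, pointer = [], 0
        while pointer < len(text):
            # Find the longest dictionary word starting at `pointer`
            for end in range(len(text), pointer, -1):
                if text[pointer:end] in wordlist:
                    words.append(text[pointer:end])
                    pointer = end
                    break
            else:
                # No dictionary match: emit a single character and move on
                words.append(text[pointer])
                pointer += 1
        return words

    # The English failure case from the next slide:
    print(max_match("thetabledownthere",
                    {"theta", "the", "table", "bled", "own", "down", "there"}))
    # ['theta', 'bled', 'own', 'there']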

11
English failure example (Palmer 00)
  • the table down there
  • thetabledownthere
  • Theta bled own there
  • But works astonishingly well in Chinese
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  • Modern algorithms better still
  • probabilistic segmentation
  • Using sequence models like HMMs

12
Normalization
  • Need to normalize terms
  • For IR, indexed text and query terms must have
    the same form.
  • We want to match U.S.A. and USA
  • We most commonly implicitly define equivalence
    classes of terms
  • e.g., by deleting periods in a term
  • Alternative is to do asymmetric expansion
  • Enter: window   Search: window, windows
  • Enter: windows  Search: Windows, windows, window
  • Enter: Windows  Search: Windows
  • Potentially more powerful, but less efficient

Slide from Chris Manning
13
Case folding
  • For IR: Reduce all letters to lower case
  • exception: upper case in mid-sentence?
  • e.g., General Motors
  • Fed vs. fed
  • SAIL vs. sail
  • Often best to lower case everything, since users
    will use lowercase regardless of correct
    capitalization
  • For TTS
  • We keep case (US versus us is important)
  • For sentiment analysis, MT, Info extraction
  • Case is helpful

Slide from Chris Manning
14
Lemmatization
  • Reduce inflectional/variant forms to base form
  • E.g.,
  • am, are, is → be
  • car, cars, car's, cars' → car
  • the boy's cars are different colors → the boy car
    be different color
  • Lemmatization implies doing proper reduction to
    dictionary headword form

Slide from Chris Manning
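In practice, one convenient tool for headword reduction is NLTK's WordNet lemmatizer, shown here purely as an illustration (requires the 'wordnet' data package via nltk.download('wordnet')):

    # WordNet-based lemmatization; pos="v" asks for the verb lemma.
    from nltk.stem import WordNetLemmatizer

    lem = WordNetLemmatizer()
    print(lem.lemmatize("cars"))          # car
    print(lem.lemmatize("are", pos="v"))  # be
    print(lem.lemmatize("is", pos="v"))   # be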
15
Stemming
  • Reduce terms to their roots before indexing
  • "Stemming" suggests crude affix chopping
  • language dependent
  • e.g., automate(s), automatic, automation all
    reduced to automat.

The classic demo, a stemmed passage followed by its
source text:

    for exampl compress and compress ar both accept
    as equival to compress

    for example compressed and compression are both
    accepted as equivalent to compress.
Slide from Chris Manning
16
Porter's algorithm
  • Commonest algorithm for stemming English
  • Results suggest it's at least as good as other
    stemming options
  • Conventions + 5 phases of reductions
  • phases applied sequentially
  • each phase consists of a set of commands
  • sample convention: of the rules in a compound
    command, select the one that applies to the
    longest suffix

Slide from Chris Manning
17
Typical rules in Porter
  • sses → ss
  • ies → i
  • ational → ate
  • tional → tion
  • Weight-of-word-sensitive rules:
  • (m>1) EMENT → (empty)
  • replacement → replac
  • cement → cement
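A toy rendering of these rules (a sketch only: the real Porter stemmer has five ordered phases, longest-suffix selection, and a fuller measure computation):

    import re

    def measure(stem):
        # Porter's m: the number of vowel-consonant (VC) sequences in the stem
        return len(re.findall(r"[aeiou]+[^aeiou]+", stem))

    def step(word):
        for suffix, repl in [("sses", "ss"), ("ies", "i"),
                             ("ational", "ate"), ("tional", "tion")]:
            if word.endswith(suffix):
                return word[:-len(suffix)] + repl
        if word.endswith("ement") and measure(word[:-5]) > 1:
            return word[:-5]   # (m>1) EMENT -> (empty)
        return word

    print(step("replacement"))  # replac
    print(step("cement"))       # cement ("c" has m=0, so the rule is blocked)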

18
Stemming/Morphology
  • Outside of IR
  • Stemming isn't done
  • But morphological analysis can be useful

19
English Morphology
  • Morphology is the study of the ways that words
    are built up from smaller meaningful units called
    morphemes
  • We can usefully divide morphemes into two classes
  • Stems: the core meaning-bearing units
  • Affixes: bits and pieces that adhere to stems to
    change their meanings and grammatical functions

20
Nouns and Verbs (English)
  • Nouns are simple (not really)
  • Markers for plural and possessive
  • Verbs are only slightly more complex
  • Markers appropriate to the tense of the verb

21
Regulars and Irregulars
  • Ok so it gets a little complicated by the fact
    that some words misbehave (refuse to follow the
    rules)
  • Mouse/mice, goose/geese, ox/oxen
  • Go/went, fly/flew
  • The terms regular and irregular will be used to
    refer to words that follow the rules and those
    that don't.

22
Regular and Irregular Nouns and Verbs
  • Regulars
  • Walk, walks, walking, walked, walked
  • Table, tables
  • Irregulars
  • Eat, eats, eating, ate, eaten
  • Catch, catches, catching, caught, caught
  • Cut, cuts, cutting, cut, cut
  • Goose, geese

23
Compute
  • Many paths are possible
  • Start with compute
  • Computer → computerize → computerization
  • Computation → computational
  • Computer → computerize → computerizable
  • Compute → computee

24
Uses for morphological analysis
  • Machine translation
  • Need to know that the Spanish words quiero and
    quieres are both related to querer 'want'
  • Other languages
  • Turkish
  • Uygarlastiramadiklarimizdanmissinizcasina
  • "(behaving) as if you are among those whom we
    could not civilize"
  • Uygar 'civilized' + las 'become'
  • + tir 'cause' + ama 'not able'
  • + dik 'past' + lar 'plural'
  • + imiz 'p1pl' + dan 'abl'
  • + mis 'past' + siniz '2pl' + casina 'as if'

25
What we want
  • Something to automatically do the following kinds
    of mappings
  • cats → cat +N +PL
  • cat → cat +N +SG
  • cities → city +N +PL
  • merging → merge +V +Present-participle
  • caught → catch +V +Past-participle

26
Morphological Parsing Goal
27
Sentence Segmentation
  • "!" and "?" are relatively unambiguous
  • Period "." is quite ambiguous:
  • Sentence boundary
  • Abbreviations like Inc. or Dr.
  • General idea:
  • Build a binary classifier
  • Looks at a "."
  • Decides EndOfSentence/NotEOS
  • Could be hand-written rules, sequences of regular
    expressions, or machine learning (a toy sketch of
    the rule-based flavor follows)
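A hand-written-rules sketch of such a classifier (the abbreviation list and features are illustrative assumptions, not from the slides):

    # Decide EndOfSentence/NotEOS for each "." from two simple features.
    ABBREVIATIONS = {"mr.", "dr.", "inc.", "st.", "e.g.", "i.e."}

    def is_eos(word_with_period, next_word):
        if word_with_period.lower() in ABBREVIATIONS:
            return False                      # "Dr." rarely ends a sentence
        if next_word and next_word[0].isupper():
            return True                       # capitalized continuation
        return False

    print(is_eos("Inc.", "The"))   # False
    print(is_eos("cents.", "In"))  # True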

28
Determining if a word is end-of-utterance: a
Decision Tree
29
More sophisticated decision tree features
  • Prob(word with "." occurs at end-of-s)
  • Prob(word after "." occurs at begin-of-s)
  • Length of word with "."
  • Length of word after "."
  • Case of word with ".": Upper, Lower, Cap, Number
  • Case of word after ".": Upper, Lower, Cap, Number
  • Punctuation after "." (if any)
  • Abbreviation class of word with "." (month name,
    unit-of-measure, title, address name, etc.)

From Richard Sproat's slides
30
Learning Decision Trees
  • DTs are rarely built by hand
  • Hand-building only possible for very simple
    features, domains
  • Lots of algorithms for DT induction

31
II. Minimum Edit Distance
  • Spell-checking
  • Non-word error detection
  • detecting "graffe"
  • Non-word error correction
  • figuring out that "graffe" should be "giraffe"
  • Context-dependent error detection and correction
  • Figuring out that "piece" in "war and piece"
    should be "peace"

32
Non-word error detection
  • Any word not in a dictionary
  • Assume it's a spelling error
  • Need a big dictionary!

33
Isolated word error correction
  • How do I fix "graffe"?
  • Search through all words:
  • graf
  • craft
  • grail
  • giraffe
  • Pick the one that's closest to "graffe"
  • What does "closest" mean?
  • We need a distance metric.
  • The simplest one: edit distance.
  • (More sophisticated probabilistic ones: noisy
    channel)

34
Edit Distance
  • The minimum edit distance between two strings
  • Is the minimum number of editing operations
  • Insertion
  • Deletion
  • Substitution
  • Needed to transform one into the other

35
Minimum Edit Distance
36
Minimum Edit Distance
  • If each operation has cost of 1
  • Distance between these is 5
  • If substitutions cost 2 (Levenshtein)
  • Distance between them is 8

37
Edit transcript
38
Defining Min Edit Distance
  • For two strings S1 of length n, S2 of length m
  • distance(i,j) or D(i,j)
  • means the edit distance of S1[1..i] and S2[1..j]
  • i.e., the minimum number of edit operations
    needed to transform the first i characters of S1
    into the first j characters of S2
  • The edit distance of S1, S2 is D(n,m)
  • We compute D(n,m) by computing D(i,j) for all i
    (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)

39
Defining Min Edit Distance
  • Base conditions:
      D(i,0) = i
      D(0,j) = j
  • Recurrence relation:
                       D(i-1,j) + 1
      D(i,j) = min  {  D(i,j-1) + 1
                       D(i-1,j-1) + (1 if S1(i) ≠ S2(j); 0 if S1(i) = S2(j))
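This recurrence translates directly into the bottom-up table fill described on the next slide; a minimal sketch (sub_cost=2 gives Levenshtein distance as defined earlier, sub_cost=1 the unit-cost version above):

    def min_edit_distance(s1, s2, sub_cost=2):
        n, m = len(s1), len(s2)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            D[i][0] = i                       # base: delete i characters
        for j in range(m + 1):
            D[0][j] = j                       # base: insert j characters
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if s1[i - 1] == s2[j - 1] else sub_cost
                D[i][j] = min(D[i - 1][j] + 1,        # deletion
                              D[i][j - 1] + 1,        # insertion
                              D[i - 1][j - 1] + sub)  # substitution / match
        return D[n][m]

    print(min_edit_distance("intention", "execution"))  # 8 (Levenshtein)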

40
Dynamic Programming
  • A tabular computation of D(n,m)
  • Bottom-up
  • We compute D(i,j) for small i,j
  • And compute larger D(i,j) based on previously
    computed smaller values

41
The Edit Distance Table
42
(No Transcript)
43
(No Transcript)
44
Suppose we want the alignment too
  • We can keep a backtrace
  • Every time we enter a cell, remember where we
    came from
  • Then when we reach the end, we can trace back
    from the upper right corner to get an alignment

45
Backtrace
46
Adding Backtrace to MinEdit
  • Base conditions:
      D(i,0) = i
      D(0,j) = j
  • Recurrence relation:
                       D(i-1,j) + 1      (case 1: deletion)
      D(i,j) = min  {  D(i,j-1) + 1      (case 2: insertion)
                       D(i-1,j-1) + (1 if S1(i) ≠ S2(j); 0 if S1(i) = S2(j))
                                         (case 3: substitution/match)
  • Pointers:
                     LEFT  (case 2: insertion)
      ptr(i,j)  = {  DOWN  (case 1: deletion)
                     DIAG  (case 3: substitution/match)
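The same fill with stored pointers, plus the walk back from (n, m), as a sketch:

    def min_edit_alignment(s1, s2, sub_cost=2):
        n, m = len(s1), len(s2)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        ptr = [[None] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            D[i][0], ptr[i][0] = i, "DOWN"    # deletions down column 0
        for j in range(1, m + 1):
            D[0][j], ptr[0][j] = j, "LEFT"    # insertions along row 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if s1[i - 1] == s2[j - 1] else sub_cost
                D[i][j], ptr[i][j] = min(
                    (D[i - 1][j - 1] + sub, "DIAG"),  # case 3
                    (D[i - 1][j] + 1, "DOWN"),        # case 1
                    (D[i][j - 1] + 1, "LEFT"))        # case 2
        ops, i, j = [], n, m                  # trace back from (n, m)
        while i > 0 or j > 0:
            ops.append(ptr[i][j])
            if ptr[i][j] == "DIAG":
                i, j = i - 1, j - 1
            elif ptr[i][j] == "DOWN":
                i -= 1
            else:
                j -= 1
        return D[n][m], list(reversed(ops))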
47
MinEdit with Backtrace
48
Performance
  • Time
  • O(nm)
  • Space
  • O(nm)
  • Backtrace
  • O(n+m)

49
Weighted Edit Distance
  • Why would we add weights to the computation?
  • How?

50
Confusion matrix
51
(No Transcript)
52
Weighted Minimum Edit Distance
53
Why Dynamic Programming
  • "I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, 'Where did the name, dynamic programming, come from?' The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I'm not using the term lightly; I'm using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning is not a good word for various reasons. I decided therefore to use the word, programming. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."
  • Richard Bellman, "Eye of the Hurricane: An Autobiography," 1984.

54
Evolution at the DNA level
  • Sequence edits (mutation, deletion):

      ACGGTGCAGTTACCA
      AC----CAGTCCACCA

  • Rearrangements: inversion, translocation,
    duplication
55
Evolutionary Rates
(figure: mutated sequences passed to the next
generation, marked OK, OK, OK, X, X, "Still OK?")
56
Sequence conservation implies function
  • Alignment is the key to
  • Finding important regions
  • Determining function
  • Uncovering the evolutionary forces

57
Sequence Alignment
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC

    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2...xM, y =
y1y2...yN, an alignment is an assignment of gaps to
positions 0, ..., M in x and 0, ..., N in y, so as
to line up each letter in one sequence with either
a letter, or a gap, in the other sequence
58
What is a good alignment?
  • AGGCTAGTT, AGCGAAGTTT

      AGGCTAGTT-
      AGCGAAGTTT     6 matches, 3 mismatches, 1 gap

      AGGCTA-GTT-
      AG-CGAAGTTT    7 matches, 1 mismatch, 3 gaps

      AGGC-TA-GTT-
      AG-CG-AAGTTT   7 matches, 0 mismatches, 5 gaps

59
Alignments in two fields
  • In Natural Language Processing
  • We generally talk about distance (minimized)
  • And weights
  • In Computational Biology
  • We generally talk about similarity (maximized)
  • And scores

60
Scoring Alignments
  • Rough intuition
  • Similar sequences evolved from a common ancestor
  • Evolution changed the sequences from this
    ancestral sequence by mutations
  • Replacements: one letter replaced by another
  • Deletion: deletion of a letter
  • Insertion: insertion of a letter
  • Scoring of sequence similarity should examine how
    many operations took place

61
Scoring Function
  • Sequence edits:
  • AGGCCTC
  • Mutations: AGGACTC
  • Insertions: AGGGCCTC
  • Deletions: AGG.CTC
  • Scoring function:
  • Match: +m
  • Mismatch: -s
  • Gap: -d
  • Score F = (# matches) × m - (# mismatches) × s
    - (# gaps) × d

62
Example
  • x = AGTA    m = 1
  • y = ATA     s = -1
  •             d = -1

(figure: the DP table F(i,j) for i = 0..4, j = 0..3)

    F(1,1) = max{ F(0,0) + s(A,A), F(0,1) + d, F(1,0) + d }
           = max{ 0 + 1, -1 - 1, -1 - 1 } = 1

Optimal alignment:
    A G T A
    A - T A
63
The Needleman-Wunsch Matrix
(figure: the alignment matrix, x1...xM across the
top, y1...yN down the side)
Every nondecreasing path from (0,0) to (M, N)
corresponds to an alignment of the two sequences.
An optimal alignment is composed of optimal
subalignments.
64
The Needleman-Wunsch Algorithm
  • Initialization:
      F(0, 0) = 0
      F(0, j) = -j × d
      F(i, 0) = -i × d
  • Main iteration (filling in partial alignments):
    For each i = 1..M, for each j = 1..N:
                         F(i-1, j-1) + s(xi, yj)   (case 1)
      F(i, j)   = max {  F(i-1, j) - d             (case 2)
                         F(i, j-1) - d             (case 3)
      Ptr(i, j) = DIAG if case 1; LEFT if case 2;
                  UP if case 3
  • Termination: F(M, N) is the optimal score, and
  • from Ptr(M, N) we can trace back the optimal
    alignment
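A sketch of the algorithm in code (s is a scoring function over symbol pairs and d the per-symbol gap penalty, following the slide's conventions):

    def needleman_wunsch(x, y, s, d):
        M, N = len(x), len(y)
        F = [[0.0] * (N + 1) for _ in range(M + 1)]
        for i in range(1, M + 1):
            F[i][0] = -i * d                  # leading gaps in y
        for j in range(1, N + 1):
            F[0][j] = -j * d                  # leading gaps in x
        for i in range(1, M + 1):
            for j in range(1, N + 1):
                F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # case 1
                              F[i - 1][j] - d,                          # case 2
                              F[i][j - 1] - d)                          # case 3
        return F[M][N]

    # The toy scoring from the earlier example: match +1, mismatch -1, gap 1
    print(needleman_wunsch("AGTA", "ATA",
                           s=lambda a, b: 1 if a == b else -1, d=1))  # 2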

65
A variant of the basic algorithm
  • Maybe it is OK to have an unlimited number of
    gaps in the beginning and end:

    ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
    GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

  • Then, we don't want to penalize gaps at the ends

66
Different types of overlaps
Example: 2 overlapping "reads" from a sequencing
project (recall Lecture 1)
Example: Search for a mouse gene within a human
chromosome
67
The Overlap Detection variant
  • Changes:
  • Initialization: For all i, j:
      F(i, 0) = 0
      F(0, j) = 0
  • Termination:
      F_OPT = max { max_i F(i, N), max_j F(M, j) }

(figure: the matrix with x1...xM and y1...yN; the
optimum is taken over the last row and column)
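In code, the variant changes only the initialization and the final maximization; a sketch reusing the Needleman-Wunsch fill from above:

    def overlap_align(x, y, s, d):
        M, N = len(x), len(y)
        F = [[0.0] * (N + 1) for _ in range(M + 1)]   # F(i,0) = F(0,j) = 0
        for i in range(1, M + 1):
            for j in range(1, N + 1):
                F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                              F[i - 1][j] - d,
                              F[i][j - 1] - d)
        # F_OPT: best score anywhere in the last row or last column
        return max(max(F[i][N] for i in range(M + 1)),
                   max(F[M][j] for j in range(N + 1)))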
68
The local alignment problem
  • Given two strings x = x1...xM,
  • y = y1...yN
  • Find substrings x', y' whose similarity
  • (optimal global alignment value)
  • is maximum
  • x = aaaacccccggggtta
  • y = ttcccgggaaccaacc

69
Why local alignment? Examples
  • Genes are shuffled between genomes
  • Portions of proteins (domains) are often conserved

70
Cross-species genome similarity
  • 98% of genes are conserved between any two
    mammals
  • >70% average similarity in protein sequence

hum_a  GTTGACAATAGAGGGTCTGGCAGAGGCTC---------------------  @ 57331/400001
mus_a  GCTGACAATAGAGGGGCTGGCAGAGGCTC---------------------  @ 78560/400001
rat_a  GCTGACAATAGAGGGGCTGGCAGAGACTC---------------------  @ 112658/369938
fug_a  TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG  @ 36008/68174

hum_a  CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 57381/400001
mus_a  CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 78610/400001
rat_a  CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 112708/369938
fug_a  TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG  @ 36058/68174

hum_a  AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT  @ 57431/400001
mus_a  AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT  @ 78659/400001
rat_a  AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT  @ 112757/369938
fug_a  AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC  @ 36084/68174

hum_a  AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 57481/400001
mus_a  AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 78708/400001
rat_a  AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 112806/369938
fug_a  CCGAGGACCCTGA-------------------------------------  @ 36097/68174

"atoh" enhancer in human, mouse, rat, and fugu fish
71
The Smith-Waterman algorithm
  • Idea: Ignore badly aligning regions
  • Modifications to Needleman-Wunsch:
  • Initialization: F(0, j) = F(i, 0) = 0
  • Iteration:
                       0
      F(i, j) = max {  F(i-1, j) - d
                       F(i, j-1) - d
                       F(i-1, j-1) + s(xi, yj)

72
The Smith-Waterman algorithm
  • Termination:
  • If we want the best local alignment:
      F_OPT = max_{i,j} F(i, j)
      Find F_OPT and trace back
  • If we want all local alignments scoring > t:
      For all i, j find F(i, j) > t, and trace
      back?
  • Complicated by overlapping local alignments
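A sketch of the best-local-alignment case in code:

    def smith_waterman(x, y, s, d):
        M, N = len(x), len(y)
        F = [[0.0] * (N + 1) for _ in range(M + 1)]   # first row/column stay 0
        best = 0.0
        for i in range(1, M + 1):
            for j in range(1, N + 1):
                F[i][j] = max(0.0,                    # restart the alignment here
                              F[i - 1][j] - d,
                              F[i][j - 1] - d,
                              F[i - 1][j - 1] + s(x[i - 1], y[j - 1]))
                best = max(best, F[i][j])             # F_OPT = max over all i, j
        return best

    # The example from the following slides: s = TAATA, t = TACTAA
    print(smith_waterman("TAATA", "TACTAA",
                         s=lambda a, b: 1 if a == b else -1, d=1))  # 3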

73
Local Alignment Example
s = TAATA   t = TACTAA
74
Local Alignment Example
s = TAATA   t = TACTAA
75
Local Alignment Example
s = TAATA   t = TACTAA
Slide from Hasan Ogul
76
Local Alignment Example
s = TAATA   t = TACTAA
77
Summary
  • Tokenization
  • Word Tokenization
  • Normalization
  • Lemmatization and stemming
  • Sentence Tokenization
  • Minimum Edit Distance
  • Levenshtein distance
  • Needleman-Wunsch (weighted global alignment)
  • Smith-Waterman (local alignment)