Title: Morphology from a computational point of view
1Morphology from a computational point of view
2Today
- Minimal Edit Distance, and Viterbi more
generally - Letter to Sound
- What is morphology?
- Finite-state automata
- Finite-state phonological rules
31. What is morphology?
- Study of the internal structure of words
- morph-ology word-s jump-ing
- Why?
- For some purposes, we need to know what the
internal pieces are. - Knowledge of the words of a language cant be
summarized in a finite list we need to know the
principles of word-formation
4Some resources
- Richard Sproat Morphology and Computation (MIT
Press, 1992) - Excellent overview of computational morphology
and phonoloy by Harald Trost at - http//www.ai.univie.ac.at/harald/handbook.html
52. What applications need knowledge of words?
- Any high-level linguistic analysis syntactic
parser - machine translation
- speech recognition, text-to-speech (TTS)
- information retrieval (IR)
- dictionary, spell-checker
63. A list is not enough
- An empirical fact
- AP newswire mid-Feb Dec 30 1988
- Nearly 300,000 words.
- New words that appeared on Dec 31 1988
- compounds prenatal-care, publicly-funded,
channel-switching, owner-president, logic-loving,
part-Vulcan, signal-emitting, landsite,
government-aligned, armhole, signal-emitting...
7...new words...
- dumbbells, groveled, fuzzier, oxidized
- ex-presidency, puppetry, boulderlike,
over-emphasized, hydrosulfite, outclassing,
non-passengers, racialist, counterprograms,
antiprejudice, re-unification, traumatological,
refinancings, instrumenting, ex-critters,
mega-lizard
8- ex-presidency prefix ex-
- boulder-like suffix like
- over-emphasized prefix over-
- antiprejudice prefix anti
- This is often called the OOV problem (out of
vocabulary).
9- If we work out the principles of word-formation,
we will simultaneously - compress the size of our internalized list of
words - become able to deal with new words on the fly.
104. Overview of applications that need knowledge
of words
- Speech generation text to speech (TTS)
- Speech recognition
11Text to speech
- Problem take text, in standard spelling, and
produce a sequence of phonemes which can be
synthesized by the backend. - Severe problems Proper names (persons, places),
OOV words - boathouse B OW1 T H AU2 S
12Speech recognition
- Take a sound file (e.g., .wav) and produce a
list of words in standard orthography. - Bill Clinton is a recent ex-president.
- If someone says it, we need to figure out what
the word was.
13Do we know what a word is?
- This is actually not an easy question!
especially if we turn to Asian languages, without
a tradition of putting in white space between
words, as we do in the West. - German writes more compounds without white space
than English does.
14Basic principles of morphology
- For some purposes, we need to think about
phonemes, while for others its more convenient
to talk about letters. - For our purposes, Ill talk about letters
whenever we dont need to specifically focus on
phonemes.
15Morpheme
- It is convenient to be able to talk about the
pieces into which words may be broken, and
linguists call these pieces morphemes the
smallest parts of a language that can be
regularly assigned a meaning.
16Morphemes
- Uncontroversial morphemes
- door, dog, jump, -ing, -s, to
- More controversial morphemes
- sing/sang s-ng i/a
- cut/cut cut PAST
17Classic distinctions in morphology
- Analytic (isolating) languages
- no morphology of derivational or inflectional
sort. - Synthetic (inflecting) languages
- Agglutinative 1 function per morpheme
- Fusional gt 1 function per morpheme
18AgglutinativeFinnish Nominal Declension
talo 'the-house' kaup-pa 'the-shop' talo-ni 'my
house' kaup-pa-ni 'my shop' talo-ssa 'in
the-house' kaup-a-ssa 'in the-shop' talo-ssa-ni
'in my house kaup-a-ssa-ni 'in my
shop' talo-i-ssa 'in the-houses kaup-o-i-ssa 'in
the-shops' talo-i-ssa-ni 'in my
houses kaup-o-i-ssa-ni 'in my shops'
Courtesy of Bucknell Univ. web page
19Fusional LatinLatin Declension of hortus
'garden'
Singular Plural Nominative (Subject)
hort-us hort-i Genitive (of)
hort-i hort-rum Dative (for/to) hort-o
hort-is Accusative (Direct Obj) hort-um
hort-us Vocative (Call) hort-e hort-i Ablative
(from/with) hort-o hort-is
20Morphemes vs. morphs
- Some analysts distinguish between morphemes and
morphs. - Morphemes are motivated by an analysis, and
include plural and past - Morphs are strings of letters or phones that
realize or manifest a morpheme.
21Free and bound morphemes
- Free morphemes can form (free-standing) words
bound morphemes are only found in combination
with other morphemes. - Examples?
22Functions of morphology
- Derivational morphology creates one lexeme from
another - compute gt computer gt computerize gt
computerization - Inflectional morphology creates the form of a
lexeme thats right for a sentence - the nominative singular form of a noun or the
past 3rd person singular form of a verb.
23- Word an identifiable string of letters (or
phonemes) sing - Word-form a word with a specific set of
syntactic and morphological features. The sing in
I sing is 1st person sg, and is a distinct
word-from from the sing in you sing. - Lexeme a complete set of inflectionally related
word-forms, including sing, sings, and sang - Lemma a complete set of morphologically related
lexemes sing, sings, song, sang.
24A lexemes stem
- In many languages (unlike English),
constellations of word-forms forming a lexeme
demand the recognition of a basic stem which does
not stand freely as a word - Italian ragazzo, ragazzi (boy, girl)
- ragazzi, ragazze (boys, girls)
- ragazz-
25Compounds
- Compounds are composed of 2 (or more) words or
stems - Compounds hot dog, White House, bookstore,
cherry-covered
26Languages vary in the amount of morphology they
have and use
- English has a lot of derivational morphology and
relatively little inflectional morphology - English verbs inflectional forms
- bare stem, -s, -ed, -ing
27European languages
- Not uncommon for a verb to have 30 to 50 forms
- marking tense, person and number of the subject
28Derivation
- Derivational morphology usually consists of
adding a prefix or suffix to a base ( stem). - The base has a lexical category (it is a noun,
verb, adjective), and the suffix typically
assigns a different category to the whole word.
Noun
-ness suffix that takes an adjective, makes a
noun.
Adj
sad ness
29Adj
Adj
Verb
un interest ing
30Distinct from contractions
- English (and some other languages) permit the
collapsing together of common words. In some
extremely rare cases, only the collapsed form
exists (English possessive s). - He will arrive tonight gt hell arrive
- The King of Englands children
31Some basics of English morphology
- Inflectional morphology
- Nouns -NULL, -s, -s
- Verbs -NULL, s, -ed, -ing
- (so-called weak verbs)
- Strong verbs 3 major groups
- a. Internal verb change (sing/sang,
drive/drove/driven, dive/dove) - b. t suffix, typically with vowel-shortening
dream/dreamt, sleep/slept - c. aught replacement catch, teach, seek,
32Derivational morphology in complex
- This morphology creates new words, by adding
prefixes or suffixes. - It is helpful to divide them into two groups,
depending on whether they leave the pronunciation
of the base unchanged or not. - There are, as always, some fuzzy cases.
33- Level 1
- ize, ization, al, ity, al, ic, al, ity, ion, y
(nominaliz-ing), al, ate, ous, ive, ation - Can attach to non-word stems (fratern-al,
paternal parent-al) - Typically change stress and vowel quality of stem
- Level 2
- Never precede Level 1 suffixes
- Never change stress pattern or vowel quality
- Almost always attach to words that already exist
- hood, ness, ly, s, ing, ish, ful, ly, ize, less,
y (adj.)
34Combinations of Class 1,2
- Class 1 Class 1 histor-ic-al,
illumina-at-tion, indetermin-at-y - Class 1 Class 2 frantern-al-ly,
transform-ate-ion-less - Class 2 Class 2 weight-less-ness
- ?? Class 2 Class 1 weight-less-ity,
fatal-ism-al
35Signature
- Set of suffixes (or prefixes) that occurs in a
corpus with a set of stems.
36- NULL.ed.ing.s
- look interest add claim mark extend demand
remain want succeed record offer represent cover
return end explain follow help belong attempt
talk fear happen assault account point award
appeal train contract result request staff view
fail kick visit confront attack comment sponsor
37- NULL.s
- paper retain improvement missile song truth
doctor indictment window conductor dick
misunderstanding struggle stake tank belief
cafeteria material mind operator bassi lot
movement chain notion marriage dancer scholarship
reservoir sweet right battalion hold mr shot
cardinal athletic revenue duel confrontation solo
talent guest shoe russian commitment average monk
election street roger rifle worker area plane
pinch-hitter dozen browning conclusion teacher
narcotic appearance alternative dealer producer
mile stock shrine sometime bag successor career
mistake ankle weapon model front spotlight rhode
pace debate payment requirement fairway
consultation chip dollar employer thank mustang
rocket-bomb hat string precinct robert employee
action detective pressure measure spirit forbid
hitter breast yankee partner floor member
38- NULL.d.s
- increase tie hole associate reserve price fire
receive challenge rate purchase propose feature
celebrate decide suite single change sculpture
combine privilege pledge issue frame indicate
believe damage include use aide graduate surprise
intervene practice trouble serve oppose promise
charge note schedule continue raise decline cause
operate emphasize relieve hope share judge birdie
produce exchange -
39- NULL.ed.er.ing.s report turn walk park pick flow
- NULL.d.ment enforce announce engage arrange
replace improve encourage - NULL.n.s rose low take law drive rise undertake
- NULL.al intern profession logic fat tradition
extern margin jurisdiction historic education
promotion constitution addition sensation roy
ration origin classic convention - NULL.man sand news police states gross sun fresh
sports boss sales 3- patrol bonds - ed.er.ing slugg manag crush publish robb
- NULL.ity.s major senior moral hospital
- NULL.ry hung mason ave summit scene surge rival
forest - NULL.a.s indian kind american
-
-
40Finite state morphology
41FSA finite-state automata
- Consists of
- a set of states
- a starting state
- a set of final or accepting states
- a finite set of symbols
- a set of transitions each is defined by a
from-state, a to-state, and a symbol
42- Its natural to think of this as describing an
annotated directed graph.
a
b
a
a
a
!
!
a
q0
q1
q2
q3
q3
43- An FSA can be thought of as judging (accepting) a
string, or as generating one. - How does it judge? Find a start to finish path
that matches the string. - How does it generate? Walk through any
start-to-finish path.
44Deterministic and Non-deterministic FSAs
- Just a little difference
- Deterministic case For every state, there is a
maximum of one transition associated with any
given symbol. You can say that theres a function
from statesXsymbols ? states - Nondeterministic case There is no such
restriction hence, given a state and a symbol,
it is not necessarily certain which transition is
to be taken.
45a
b
a
!
a
q0
q1
q2
q3
q3
deterministic
46a
b
a
!
a
q0
q1
q2
q3
q3
non-deterministic
The best things in life are non-deterministic.
47Figure 3.4 p. 68
un-
adj-root
-er est -ly
q0
q1
q2
q3
e
Alternate notation
48Yet a third way rows in an array(to-column can
consist of pointers)
From To Output
0 1 un
0 1 NULL
1 2 adj-root-list
2 3 erestly
Stop states 2,3
49adj-root-1
un-
-er est -ly
q1
q2
q0
q5
adj-root-1
q3
q4
-er est
e
adj-root-2
Figure 3.5 p. 69
50Yet a third way rows in an array(to-column can
consist of pointers)
Stop states 2,4,5
From To Output
0 1 un
0 3 NULL
1 2 adj-root-list-1
2 5 erestly
3 2 adj-root-list-1
3 4 adj-root-list-2
4 5 erest
510 1 noun 1
1 2 ize
2 3 ation
2 4 er
2 5 able
0 5 -al adj
0 1 adj-al
5 6 ityness
0 7 verbs1
7 8 ive adj
8 6 ness
8 9 ly
0 10 verb2
10 8 ative
0 11 noun2
11 8 ful
Figure 3.6, p. 70
52We can simplify greatly (generating a bit more.)
53Finite-State Transducers (FST)
- The symbols of the FST are complex theyre
really pairs of symbols, one for each of two
tapes or levels. - Recognizer decides if a given pair of
representations fits together OK - Generator generates pairs of representations
that fit together - Translator takes a representation on one level
and produces the appropriate representation on
the other level
54Finite state transducers
- can be inverted, or
- composed
- and you get another FST.
55Complex symbols
- Usually of the form ab, which means a can appear
on the upper tape when b appears on the lower
tape. - So ab means thats a permissible pairing up of
symbols. - a along means aa, etc.
- epsilon e means null character.
- Remember, other means any feasible pair that
is not in this transducer (p. 78)
56Using FSTs for orthographic rules
57Using FSTs for orthographic rules
foxswe get to q1 with x
58Using FSTs for orthographic rules
foxswe get to q2 with
59Using FSTs for orthographic rules
foxswe can get to q3 with NULL
60Using FSTs for orthographic rules
foxswe also get to q5 with s but we dont
want to!
61So why is this transition there? ?friendship,
?foxss ( foxess)
foxswe also get to q5 with s but we dont
want to!
62foxsq4 with s
63foxsq0 with (accepting state)
64Other transitions
arizona we leave q0 but return
65Other transitions
m i s s s