Title: Acquisition of Morphology by Computer: Unsupervised learning
1Acquisition of Morphology by Computer
Unsupervised learning
- John Goldsmith
- The University of Chicago
2The goal
- To produce a morphological analysis of a corpus
from an unknown language automatically, that is,
with no knowledge of the structure of that
language built in
- To produce both generalizations about the
language and a correct analysis of each word in
the corpus.
3Raw data -> Linguistica -> Analyzed data
4- Implemented in Linguistica, a program that runs
under Windows and that you can download at
humanities.uchicago.edu/faculty/goldsmith
5- The goal is not to eliminate either linguists or
linguistics
- The goal is to understand what the goal of a
linguistic analysis is so well that we can state
it explicitly and algorithmically.
6Other work in this area
- Derrick Higgins on Thursday
- Michael Brent 1993
- Zellig Harris 1955 and 1967, follow-up Hafer
and Weiss 1974
7Zellig Harris: Right-branching count
- Right-branching count of jum: 2
  - p (jump, jumping, jumps, jumped, jumpy)
  - b (jumble)
- Right-branching count of jump: 5
  - e (jumped)
  - i (jumping)
  - s (jumps)
  - y (jumpy)
  - end of word (jump)
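The count on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not Harris's own procedure: the function name `right_branching_counts` and the end-of-word marker `#` are assumptions made for illustration.

```python
from collections import defaultdict

def right_branching_counts(words):
    """Map each prefix to the set of symbols that can follow it.

    The end of a word is recorded with the marker '#' (an assumption of this
    sketch), so the bare word 'jump' counts as one continuation of 'jump'.
    """
    successors = defaultdict(set)
    for word in words:
        marked = word + "#"
        for i in range(1, len(marked)):
            successors[marked[:i]].add(marked[i])
    return successors

words = ["jump", "jumping", "jumps", "jumped", "jumpy", "jumble"]
succ = right_branching_counts(words)
print(len(succ["jum"]), sorted(succ["jum"]))    # 2 ['b', 'p']
print(len(succ["jump"]), sorted(succ["jump"]))  # 5 ['#', 'e', 'i', 's', 'y']
```

A jump in the count from one prefix to the next (here from 2 to 5) is what the method treats as evidence for a morpheme boundary.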
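8Zellig Harris: Right-branching count
Right-branching count after each letter of accepting:
a  c  c  e  p  t  i  n  g
19 9  6  3  1  3  1  1
Predicted break: accept-ing
- After accept: able, ing
- After acce: lerate (accelerate), nted (accented)
- After acc: ident (accident), laim (acclaim), ommodate
(accommodate), redited (accredited), used (accused)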
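9Zellig Harris: Right-branching count
[Figure: tree of right-branching counts for words beginning with de; the counts shown are 18, 9, 5 and 3, and branches are marked as good predictions or bad predictions.]
- After dea: d (dead), f (deaf), l (deal), n (dean), t (death)
- After de: b (debate, debuting), c (decade, december,
decide), d (dedicate, deduce, deduct), e (deep), f (see def)
- After def: e (defeat, defend, defer), i (deficit, deficiency),
r (defraud)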
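10Zellig Harris: Right-branching count
Right-branching count after each letter of conservatives:
c  o  n  s  e  r  v  a  t  i  v  e  s
9  18 11 6  4  1  2  1  1  2  1  1
Predicted breaks: co-nservatives (wrong), conserv-atives (right),
conservati-ves (wrong)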
11The problem with Harris's approach
- It cannot distinguish between
  - phonological freedom due to phonological patterns
  (C after V, V after C)
  - phonological freedom due to morphological patterns
  (...any morpheme after a ...)
- But that's the problem it's supposed to solve.
12Global approach
- Focus on devising a method for evaluating a
hypothesis, given the data.
- Finding explicit methods of discovery is
important, but those methods play no role in
evaluating the analysis for a given corpus.
- (Very similar in conception to Chomsky's notion
of an evaluation metric.)
13Framework for evaluation
- Jorma Rissanen's Minimum Description Length
(MDL).
- Quite intricate, but we can get a very good feel
for the general idea with a naïve version of
MDL...
14Naive description length
Count the total number of letters in the list of
stems and affixes: the fewer, the better.
15Intuition
A word which is morphologically complex reveals
that composite character by virtue of being
composed of (one or more) strings of letters
which have a relatively high frequency throughout
the corpus.
16Naive description length 2
- Lexicographers know what they are doing when they
indicate the entry for the verb laugh as laugh,
~s, ~ed, ~ing.
- They recognize that the tilde allows them
to utilize the regularities of the language in
order to save space and specification, and
implicitly to underscore the regularity of the
pattern that the stem possesses.
17- Morphological analysis is not merely a matter of
frequency.
- Not every word that ends in ing is
morphologically complex: string, sing, etc.
18Frequencies are important but far from the whole
story
- Every word that ends in ity also ends in ty.
- Hence ty's frequency > ity's frequency.
- Yet -ty is a suffix only in a few words (like
six-ty).
- ity is a suffix in far more words, despite its
lower frequency (insan-ity, precoc-ity, etc.).
- frequency(y) > frequency(ty) > frequency(ity)
- y is a suffix in some words (dirt-y, runn-y,
etc.), but not in insan-ity, precoc-ity, etc.
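The containment argument is easy to check by counting word-final strings. The sketch below is illustrative only: the word list is a made-up toy sample (some words are not from the slides) and `final_string_counts` is a hypothetical helper name.

```python
from collections import Counter

def final_string_counts(words, max_len=4):
    """Count how many words end in each string of 1 to max_len letters."""
    counts = Counter()
    for word in words:
        for n in range(1, min(max_len, len(word)) + 1):
            counts[word[-n:]] += 1
    return counts

# Toy sample, assumed for illustration.
words = ["insanity", "precocity", "sixty", "dirty", "runny", "city", "party"]
counts = final_string_counts(words)
print(counts["y"], counts["ty"], counts["ity"])  # 7 6 3
```

Every word counted under ity is also counted under ty and under y, so the counts can never increase as the string gets longer. Frequency alone therefore cannot say which of the three is the suffix in a given word.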
19Naive Minimum Description Length
- Analyze the words of a corpus into stem + suffix
with the requirement that every stem and every
suffix must be used in at least 2 distinct words.
- Tally up the total number of letters in (a) each
of the proposed stems, (b) each of the proposed
suffixes, and (c) each of the unanalyzed words,
and call that total the naive description
length.
20Naive Minimum Description Length
- Corpus
  - jump, jumps, jumping
  - laugh, laughed, laughing
  - sing, sang, singing
  - the, dog, dogs
  - total: 61 letters
- Analysis
  - Stems: jump, laugh, sing, sang, dog (20 letters)
  - Suffixes: s, ing, ed (6 letters)
  - Unanalyzed: the (3 letters)
  - total: 29 letters
Notice that the description length goes UP if we
analyze sing into s + ing.
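The tally on this slide can be written out as a short Python sketch. The function name `naive_description_length` is an assumption, and the alternative analysis at the end is one possible reading of the closing remark: it assumes that splitting sing as s + ing adds a stem s while sing is still needed as the stem of singing.

```python
def naive_description_length(stems, suffixes, unanalyzed):
    """Total number of letters in the stems, suffixes and unanalyzed words."""
    return (sum(len(s) for s in stems)
            + sum(len(s) for s in suffixes)
            + sum(len(w) for w in unanalyzed))

corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

print(sum(len(w) for w in corpus))  # letters in the raw word list
print(naive_description_length(stems, suffixes, unanalyzed))  # 29

# Assumed reading of the last remark: 's' becomes an extra stem.
print(naive_description_length(stems + ["s"], suffixes, unanalyzed))  # 30
```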
21- Frequencies matter, but only in the overarching
context of a total morphological analysis of all
of the words of the language.
22Let's look at how the work is done, step by
step...
23Corpus
Pick a large corpus from a language -- 5,000 to
1,000,000 words.
24Feed it into the bootstrapping heuristic...
[Diagram: Corpus -> Bootstrap heuristic]
25Out of which comes a preliminary
morphology, which need not be superb.
[Diagram: Corpus -> Bootstrap heuristic -> Morphology]
26Feed it to the incremental heuristics...
[Diagram: Corpus -> Bootstrap heuristic -> Morphology -> incremental heuristics]
27Out comes a modified morphology.
[Diagram: Corpus -> Bootstrap heuristic -> Morphology -> incremental heuristics -> modified morphology]
28Is the modification an improvement? Ask MDL!
[Diagram: Corpus -> Bootstrap heuristic -> Morphology -> incremental heuristics -> modified morphology]
29If it is an improvement, replace the morphology...
[Diagram: the modified morphology takes the place of Morphology, which goes to Garbage]
30Send it back to the incremental heuristics
again...
[Diagram: Corpus -> Bootstrap heuristic -> modified morphology -> incremental heuristics]
31Continue until there are no improvements to try.
[Diagram: Morphology -> incremental heuristics -> modified morphology]
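The cycle walked through on these slides (bootstrap, propose a change, ask MDL, keep or discard, repeat) can be sketched as a simple hill-climbing loop. Everything below is a toy stand-in and not Linguistica's implementation: a morphology is reduced to a set of suffixes, `naive_cost` is a letter count rather than a real description length, and `mdl_search` and `try_adding` are hypothetical names.

```python
def mdl_search(corpus, bootstrap, heuristics, cost):
    """Keep a proposed change only if it lowers the cost."""
    morphology = bootstrap(corpus)
    best = cost(morphology, corpus)
    improved = True
    while improved:                      # stop when no heuristic helps
        improved = False
        for propose in heuristics:
            candidate = propose(morphology, corpus)
            if candidate is None:
                continue
            candidate_cost = cost(candidate, corpus)
            if candidate_cost < best:    # "Ask MDL": is it an improvement?
                morphology, best = candidate, candidate_cost
                improved = True
    return morphology, best

# Toy stand-ins (assumptions): a morphology is just a frozenset of suffixes.
def naive_cost(suffixes, corpus):
    stems = set()
    for word in corpus:
        usable = [s for s in suffixes if word.endswith(s) and len(word) > len(s)]
        longest = max(usable, key=len, default="")
        stems.add(word[:len(word) - len(longest)])
    return sum(map(len, stems)) + sum(map(len, suffixes))

def try_adding(suffix):
    def propose(suffixes, corpus):
        return None if suffix in suffixes else suffixes | {suffix}
    return propose

corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing", "dog", "dogs"]
heuristics = [try_adding(s) for s in ("s", "ing", "ed", "g")]
morphology, best = mdl_search(corpus, lambda c: frozenset(), heuristics, naive_cost)
print(sorted(morphology), best)  # ['ed', 'ing', 's'] 18
```

In this toy run the proposal to add g is rejected because it raises the letter count, which is the role the evaluation step plays in the slides.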
32Bootstrapping: initial hypothesis = initial
morphology of the corpus
33First, a set of candidate suffixes for the
language
- Using some interesting statistics.
34
1. Observed frequency of a string (e.g., ing)
2. Predicted frequency of the same string if
there were no morphemes in the language
3. The computed stickiness of that string
4. Weight the stickiness (3) by how often
the string shows up in the corpus
35- Rank all word-final sequences of letters (of
length 1-4 letters)
- This gives us an excellent first guess of the
suffixes of the language.
- See Handout for English, French, Spanish, and
Latin.
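The slides do not give the formulas behind these four steps, so the following Python sketch fills them in with assumptions: the predicted frequency treats letters as independent (a product of letter probabilities), stickiness is the base-2 log of observed over predicted, and the weight is the observed frequency. The function name `rank_final_strings` and the word list are likewise made up for illustration.

```python
import math
from collections import Counter

def rank_final_strings(words, max_len=4):
    """Rank word-final strings by frequency-weighted stickiness (assumed formula)."""
    letters = Counter("".join(words))
    total_letters = sum(letters.values())
    finals = Counter()
    for word in words:
        # Leave at least one letter for the stem.
        for n in range(1, min(max_len, len(word) - 1) + 1):
            finals[word[-n:]] += 1
    scored = []
    for string, count in finals.items():
        observed = count / len(words)                       # step 1
        predicted = math.prod(letters[ch] / total_letters   # step 2
                              for ch in string)
        stickiness = math.log2(observed / predicted)        # step 3
        scored.append((observed * stickiness, string))      # step 4
    return sorted(scored, reverse=True)

words = ["jumping", "laughing", "singing", "walking", "jumped", "laughed",
         "walked", "jumps", "laughs", "walks", "dog", "the"]
ranked = rank_final_strings(words)
for score, string in ranked[:5]:
    print(f"{string:>4} {score:.2f}")
```

On this toy list ing scores well above g, even though more words end in g, because the three letters co-occur far more often than independent letters would.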
37Given a candidate set of 100 suffixes...
- It is not difficult to find the set of stems that
gives us the largest number of analyses employing
only those suffixes.
- We use these to find the major signatures present
in the corpus ...
38Discovery of signatures
The first 8 stems in the largest signature in
a 500,000 word corpus of English.
Set of suffixes that appears with all of these
stems
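A signature, as used here, is the set of suffixes that appears with a given group of stems. The sketch below groups stems by their suffix sets; it is a simplification under stated assumptions (every possible stem + suffix split is accepted, the empty string stands for the null suffix, and a stem needs at least two suffixes), and `find_signatures` is a hypothetical name.

```python
from collections import defaultdict

def find_signatures(words, suffixes):
    """Group stems by the set of suffixes they occur with."""
    stem_to_suffixes = defaultdict(set)
    for word in set(words):
        stem_to_suffixes[word].add("")          # null suffix: the bare word
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                stem_to_suffixes[word[:-len(suffix)]].add(suffix)
    signatures = defaultdict(list)
    for stem, found in stem_to_suffixes.items():
        if len(found) >= 2:                     # ignore stems seen only bare
            signatures[tuple(sorted(found))].append(stem)
    return {sig: sorted(stems) for sig, stems in signatures.items()}

words = ["jump", "jumps", "jumped", "jumping",
         "laugh", "laughs", "laughed", "laughing",
         "walk", "walks", "walked", "walking", "dog", "dogs"]
sigs = find_signatures(words, ["s", "ed", "ing"])
for signature, stems in sigs.items():
    print(signature, stems)
```

On this toy list the largest signature is ('', 'ed', 'ing', 's') with the stems jump, laugh and walk, and dog falls under ('', 's').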
39Minimum Description Length
The real thing, this time (Rissanen 1989).
Evaluate a morphology by
1. How well the morphology extracts generalizations
present in the data: how well it describes the data.
2. How concise the morphology is.
The naïve MDL we just looked at only covered the
second point, and only crudely.
40Measure how well the morphology fits the data
- 1. Compute the log of the inverse of the predicted
frequency of each word in the corpus (that is, the
negative log of its probability), and sum
This is a well-understood quantity in information
theory, called the optimal compressed length
of the corpus based on the probability
distribution defined by the morphology.
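As a minimal sketch of that sum: the code below adds up the negative base-2 log probability of every word token, giving a length in bits. The probabilities here are plain relative frequencies, used only as a stand-in; in the approach described on the slide they would come from the distribution defined by the morphology. The name `compressed_length_bits` is an assumption.

```python
import math
from collections import Counter

def compressed_length_bits(tokens, probability):
    """Sum of -log2 p(word) over every word token in the corpus."""
    return sum(-math.log2(probability[word]) for word in tokens)

tokens = ["the", "dog", "the", "dogs"]
counts = Counter(tokens)
# Stand-in distribution (assumption): relative frequency of each word.
probability = {word: n / len(tokens) for word, n in counts.items()}
print(compressed_length_bits(tokens, probability))  # 6.0
```

A morphology that assigns higher probability to the words actually observed yields a smaller sum, which is the sense in which it fits the data better.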
41Conciseness
- Sum all the letters, plus all the structure
inherent in the description, using information
theory.
42structure
Number of letters
Signatures, which we'll get to shortly
43Information contained in the Signature component
list of pointers to signatures
<X> indicates the number of distinct elements in X
44Results
46French
47Spanish
48Latin
49Future directions
- Develop it to work with languages with greater
complexity, and
- Use it as an aid in the task of learning syntax
in the same unsupervised fashion.