Acquisition of Morphology by Computer: Unsupervised learning


1
Acquisition of Morphology by Computer
Unsupervised learning

John Goldsmith
The University of Chicago

2
The goal

To produce a morphological analysis of a corpus
from an unknown language automatically, that is,
with no knowledge of the structure of that
language built in.
To produce both generalizations about the
language, and a correct analysis of each word in
the corpus.

3
[Diagram: raw data, Linguistica, analyzed data]
4

Implemented in Linguistica, a program that runs
under Windows that you can download at
humanities.uchicago.edu/faculty/goldsmith

The goal is not to eliminate either linguists or
linguistics
The goal is to understand what the goal of a
linguistic analysis is so well that we can state
it explicitly and algorithmically.

6
Other work in this area

Derrick Higgins on Thursday
Michael Brent 1993
Zellig Harris 1955 and 1967, follow-up Hafer
and Weiss 1974

7
Zellig Harris: Right-branching count

Right-branching count of jum: 2
jum + p (jump, jumping, jumps, jumped, jumpy)
jum + b (jumble)
Right-branching count of jump: 5
jump + e (jumped)
jump + i (jumping)
jump + s (jumps)
jump + y (jumpy)
jump + end of word (jump)
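
The counts on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not Harris's or Linguistica's code: the function name `right_branching_count` is hypothetical, and it assumes that a word ending exactly at the prefix counts as one continuation, as the slide does for jump.

```python
def right_branching_count(prefix, words):
    """Number of distinct continuations of `prefix` among `words`.

    A word equal to the prefix itself contributes the empty
    continuation (end of word), as on the slide for "jump".
    """
    continuations = set()
    for word in words:
        if word.startswith(prefix):
            # Next letter after the prefix, or "" if the word ends here.
            continuations.add(word[len(prefix):len(prefix) + 1])
    return len(continuations)


words = ["jump", "jumping", "jumps", "jumped", "jumpy", "jumble"]
print(right_branching_count("jum", words))   # 2
print(right_branching_count("jump", words))  # 5
```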

8
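Zellig Harris: Right-branching count
Right-branching counts along "accepting" (one count after each successive prefix):
a c c e p t i n g
19 9 6 3 1 3 1 1
Predicted break: after accept, where the count rises to 3
Continuations shown after acc: ident (accident), laim (acclaim), ommodate (accommodate), redited (accredited), used (accused)
Continuations shown after acce: lerate (accelerate), nted (accented)
Continuations shown after accept: able, ing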
9
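Zellig Harris: Right-branching count
[Figure: tree of words beginning with de, dea and def, marked with the right-branching counts 18, 5, 9 and 3 and with the labels "Good predictions" and "Bad predictions"]
After dea: d (dead), f (deaf), l (deal), n (dean), t (death)
After de: b (debate, debuting), c (decade, december, decide), d (dedicate, deduce, deduct), e (deep), f (see next line)
After def: e (defeat, defend, defer), i (deficit, deficiency), r (defraud)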
10
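Zellig Harris: Right-branching count
Right-branching counts along "conservatives" (one count after each successive prefix):
c o n s e r v a t i v e s
9 18 11 6 4 1 2 1 1 2 1 1
Predicted breaks, from left to right: wrong, right, wrong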
11
The problem with Harris's approach

It cannot distinguish between:
phonological freedom due to phonological patterns
(C after V, V after C)
phonological freedom due to morphological patterns
(...any morpheme after a ...)
But that's the problem it's supposed to solve.

12
Global approach

Focus on devising a method for evaluating a
hypothesis, given the data.
Finding explicit methods of discovery is
important, but those methods play no role in
evaluating the analysis for a given corpus.
(Very similar in conception to Chomsky's notion
of an evaluation metric.)

13
Framework for evaluation

Jorma Rissanen's Minimum Description Length
(MDL).
Quite intricate but we can get a very good feel
for the general idea with a naïve version of
MDL...

14
Naive description length
Count the total number of letters in the list of
stems and affixes: the fewer, the better.
15
Intuition
A word which is morphologically complex reveals
that composite character by virtue of being
composed of (one or more) strings of letters
which have a relatively high frequency throughout
the corpus.
16
Naive description length 2

Lexicographers know what they are doing when they
indicate the entry for the verb laugh as laugh,
~s, ~ed, ~ing.
They recognize that the tilde allows them
to utilize the regularities of the language in
order to save space and specification, and
implicitly to underscore the regularity of the
pattern that the stem possesses.

Morphological analysis is not merely a matter of
frequency.
Not every word that ends in ing is
morphologically complex: string, sing, etc.

18
Frequencies are important but far from the whole
story

Every word that ends in ity also ends in ty.
Hence ty's frequency > ity's frequency.
Yet -ty is a suffix only in a few words (like
six-ty).
-ity is a suffix in far more words, despite its
lower frequency (insan-ity, precoc-ity, etc.).
frequency(y) > frequency(ty) > frequency(ity)
-y is a suffix in some words (dirt-y, runn-y,
etc.), but not in insan-ity, precoc-ity, etc.
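
The nesting of these frequencies is easy to check on a toy word list. The list below is an assumption made up from the slide's own examples, and `ending_count` is a hypothetical helper; the point is only that a longer ending can never be more frequent than the shorter ending it contains.

```python
def ending_count(ending, words):
    """Count the words that end in the given letter sequence."""
    return sum(word.endswith(ending) for word in words)


# Toy list built from the examples on the slide (an assumption, not a corpus).
words = ["insanity", "precocity", "sixty", "dirty", "runny"]

for ending in ("y", "ty", "ity"):
    print(ending, ending_count(ending, words))
# y 5
# ty 4
# ity 2
```

Raw frequency ranks y above ty above ity here, even though ity is the ending that behaves most consistently as a suffix in this list.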

19
Naive Minimum Description Length

Analyze the words of a corpus into stem + suffix,
with the requirement that every stem and every
suffix must be used in at least 2 distinct words.
Tally up the total number of letters in (a) each
of the proposed stems, (b) each of the proposed
suffixes, and (c) each of the unanalyzed words,
and call that total the naive description
length.

20
Naive Minimum Description Length

Corpus
jump, jumps, jumping
laugh, laughed, laughing
sing, sang, singing
the, dog, dogs
total 61 letters

Analysis
Stems: jump laugh sing sang dog (20 letters)
Suffixes: s ing ed (6 letters)
Unanalyzed: the (3 letters)
Total: 29 letters.

Notice that the description length goes UP if we
analyze sing into s + ing.
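
The tally for this analysis can be checked mechanically. The sketch below is illustrative only; `naive_description_length` is a hypothetical name. The second call rests on an assumption about what splitting sing would cost: a new stem s is added, while sing is still needed as the stem of singing.

```python
def naive_description_length(stems, suffixes, unanalyzed):
    """Total letters across the stem list, suffix list and unanalyzed words."""
    return sum(len(item) for item in (*stems, *suffixes, *unanalyzed))


stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

baseline = naive_description_length(stems, suffixes, unanalyzed)
print(baseline)  # 29

# Assumption: analyzing "sing" as s + ing adds the stem "s", and "sing"
# stays in the stem list because "singing" still needs it.
with_split = naive_description_length(stems + ["s"], suffixes, unanalyzed)
print(with_split)  # 30
```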
21

Frequencies matter, but only in the overarching
context of a total morphological analysis of all
of the words of the language.

22
Let's look at how the work is done, step by
step...
23
Corpus
Pick a large corpus from a language -- 5,000 to
1,000,000 words.
24
Feed it into the bootstrapping heuristic...
[Diagram: Corpus, Bootstrap heuristic]
25
Out of which comes a preliminary
morphology, which need not be superb.
[Diagram: Corpus, Bootstrap heuristic, Morphology]
26
Feed it to the incremental heuristics...
[Diagram: Corpus, Bootstrap heuristic, Morphology, incremental heuristics]
27
Out comes a modified morphology.
[Diagram: Corpus, Bootstrap heuristic, Morphology, incremental heuristics, modified morphology]
28
Is the modification an improvement? Ask MDL!
[Diagram: Corpus, Bootstrap heuristic, Morphology, incremental heuristics, modified morphology]
29
If it is an improvement, replace the morphology...
[Diagram: Corpus, Bootstrap heuristic, modified morphology, Morphology, Garbage]
30
Send it back to the incremental heuristics
again...
[Diagram: Corpus, Bootstrap heuristic, modified morphology, incremental heuristics]
31
Continue until there are no improvements to try.
[Diagram: Morphology, modified morphology, incremental heuristics]
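
The loop walked through on the preceding slides (propose a modification, ask the description length whether it is an improvement, keep it if so, repeat) can be sketched as a simple hill climb. Everything here is a toy under stated assumptions: `mdl_search`, `letters` and `resplit_one_word` are hypothetical names, a morphology is just a map from each word to a split position, the cost is the naive letter count rather than real MDL, and the only heuristic is re-splitting one word at a time. It is not Linguistica's actual set of heuristics.

```python
def mdl_search(morphology, propose_modifications, description_length):
    """Accept modifications as long as one lowers the description length."""
    best_cost = description_length(morphology)
    improved = True
    while improved:
        improved = False
        for candidate in propose_modifications(morphology):
            cost = description_length(candidate)
            if cost < best_cost:
                # An improvement: the old morphology is discarded.
                morphology, best_cost = candidate, cost
                improved = True
                break  # send the new morphology back to the heuristics
    return morphology, best_cost


def letters(morphology):
    """Naive cost: letters in the distinct stems plus the distinct suffixes."""
    stems = {word[:i] for word, i in morphology.items()}
    suffixes = {word[i:] for word, i in morphology.items()}
    return sum(map(len, stems)) + sum(map(len, suffixes))


def resplit_one_word(morphology):
    """Toy incremental heuristic: move the split point of a single word."""
    for word, i in morphology.items():
        for j in range(1, len(word) + 1):
            if j != i:
                yield {**morphology, word: j}


# Bootstrap stand-in: every word starts out unanalyzed (split at its end).
start = {w: len(w) for w in ["jump", "jumps", "jumping", "laugh", "laughing"]}
final, cost = mdl_search(start, resplit_one_word, letters)
print(letters(start), cost)  # 29 13
print(final)
```

On this toy input the search ends with the stems jump and laugh and the suffixes s and ing, but a greedy loop of this kind can stop at a local optimum on other inputs.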
32
Bootstrapping: initial hypothesis = initial
morphology of the corpus
33
First, a set of candidate suffixes for the
language

Using some interesting statistics.

34
1. Observed frequency of a string (e.g., ing)
2. Predicted frequency of the same string if
there were no morphemes in the language
3. The computed stickiness of that string
4. Weight the stickiness (3) by how often
the string shows up in the corpus
35

Rank all word-final sequences of letters (of
length 1-4 letters)
This gives us an excellent first guess of the
suffixes of the language.
See Handout for English, French, Spanish, and
Latin.
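
One way to make the four quantities and the ranking concrete is sketched below. The slides do not give the formulas, so the specifics are assumptions: the predicted frequency is taken to be the product of the individual letter frequencies, stickiness is the log ratio of observed to predicted frequency, and the weighting multiplies stickiness by the observed frequency. The name `rank_final_sequences` is hypothetical.

```python
import math
from collections import Counter


def rank_final_sequences(words, max_len=4):
    """Rank word-final letter sequences by frequency-weighted stickiness."""
    letter_counts = Counter("".join(words))
    total_letters = sum(letter_counts.values())

    # Count each word-final sequence of 1..max_len letters (proper endings only).
    ending_counts = Counter(
        w[-n:] for w in words for n in range(1, max_len + 1) if len(w) > n
    )

    scores = {}
    for seq, count in ending_counts.items():
        # 1. Observed frequency: share of words ending in this sequence.
        observed = count / len(words)
        # 2. Predicted frequency if letters combined independently (assumption).
        predicted = math.prod(letter_counts[c] / total_letters for c in seq)
        # 3. Stickiness: log ratio of observed to predicted (assumption).
        stickiness = math.log2(observed / predicted)
        # 4. Weight the stickiness by the observed frequency.
        scores[seq] = observed * stickiness
    return sorted(scores, key=scores.get, reverse=True)


words = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
         "sing", "sang", "singing", "the", "dog", "dogs"]
ranking = rank_final_sequences(words)
print(ranking[:3])
```

On this twelve-word list the top-ranked ending is ing; a real run would use a corpus of thousands of words.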

36
(No Transcript)
37
Given a candidate set of 100 suffixes...

It is not difficult to find the set of stems that
gives us the largest number of analyses employing
only those suffixes.
We use these to find the major signatures present
in the corpus ...
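
A small sketch of this step, assuming a signature is the exact set of suffixes that a stem occurs with. The function name `find_signatures` and the label "NULL" for a stem that also occurs as a bare word are choices made for this example, and the three suffixes stand in for the candidate set of 100.

```python
from collections import defaultdict


def find_signatures(words, suffixes):
    """Group stems by the exact set of suffixes they occur with."""
    vocabulary = set(words)
    stem_suffixes = defaultdict(set)
    for word in vocabulary:
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                stem_suffixes[word[: -len(suffix)]].add(suffix)

    signatures = defaultdict(list)
    for stem, found in stem_suffixes.items():
        if stem in vocabulary:
            found = found | {"NULL"}  # the bare stem is itself a word
        if len(found) >= 2:  # a stem must appear in at least 2 distinct words
            signatures[tuple(sorted(found))].append(stem)
    return {sig: sorted(stems) for sig, stems in signatures.items()}


words = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
         "sing", "sang", "singing", "the", "dog", "dogs"]
signatures = find_signatures(words, ["s", "ing", "ed"])
for signature, stems in signatures.items():
    print(".".join(signature), stems)
```

Note that the spurious stem s (from sing read as s + ing) is dropped, because it occurs with only one suffix and is not a word on its own.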

38
Discovery of signatures
The first 8 stems in the largest signature in
a 500,000 word corpus of English.
Set of suffixes that appears with all of these
stems
39
Minimum Description Length
The real thing, this time: Rissanen 1989.
Evaluate a morphology by:
1. How well the morphology extracts
generalizations present in the data, that is,
how well it describes the data.
2. How concise the morphology is.
The naïve MDL we just looked at only covered the
second point, and only crudely.
40
Measure how well the morphology fits the data

1. Compute the predicted inverse log frequency of
each word in the corpus, and sum

This is a well-understood quantity in information
theory, called the optimal compressed length
of the corpus based on the probability
distribution defined by the morphology.
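
In code, the compressed length is the sum of -log2 p(w) over the word tokens of the corpus. The sketch below uses a plain unigram word distribution as a stand-in, which is an assumption: the slides say the probabilities come from the morphology but do not spell out how. The name `compressed_length_bits` is hypothetical.

```python
import math
from collections import Counter


def compressed_length_bits(corpus, probability):
    """Sum of -log2 p(word) over every word token in the corpus."""
    return sum(-math.log2(probability[word]) for word in corpus)


corpus = ["the", "dog", "the", "dogs"]
counts = Counter(corpus)
# Assumption: empirical word frequencies stand in for the distribution
# that a morphology would define.
probability = {word: n / len(corpus) for word, n in counts.items()}

print(compressed_length_bits(corpus, probability))  # 6.0
```

A morphology that assigns higher probability to the words actually observed yields a smaller total, which is the sense in which it fits the data better.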
41
Conciseness

Sum all the letters, plus all the structure
inherent in the description, using information
theory.

42
structure
Number of letters
Signatures, which we'll get to shortly
43
Information contained in the Signature component
list of pointers to signatures
<X> indicates the number of distinct elements in X
44
Results
45
(No Transcript)
46
French
47
Spanish
48
Latin
49
Future directions

Develop it to work with languages with greater
complexity, and
use it as an aid in the task of learning syntax
in the same unsupervised fashion.
