Title: ParaMor
1 ParaMor
Across Morphology
Christian Monson
3 Turkish Morphology: Beads on a String
götür-ül-m-üyor-sun
- götür: take
- ül: passive
- m: negative
- üyor: present progressive
- sun: 2nd person singular
You are not being taken
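The beads-on-a-string view can be sketched in a few lines of Python: the word is the concatenation of its morphemes in surface order, each carrying one gloss. The names `MORPHEMES`, `surface_form`, and `gloss_line` are illustrative and not part of ParaMor.

```python
# Morphemes and glosses from the slide, in Turkish surface order.
MORPHEMES = [
    ("götür", "take"),
    ("ül", "passive"),
    ("m", "negative"),
    ("üyor", "present progressive"),
    ("sun", "2nd person singular"),
]


def surface_form(beads):
    """Concatenate the morphemes left to right, like beads on a string."""
    return "".join(morpheme for morpheme, _ in beads)


def gloss_line(beads):
    """Join the glosses in the same order as the morphemes."""
    return " + ".join(gloss for _, gloss in beads)


word = surface_form(MORPHEMES)
print(word)                   # götürülmüyorsun
print(gloss_line(MORPHEMES))
```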
4 Applications of Computational Morphology
- Machine Translation
  - Turkish-English (Oflazer, 2007)
  - Czech-English (Goldwater and McClosky, 2005)
- Speech Recognition
  - Finnish (Creutz, 2006)
- Information Retrieval
5 Challenges of Computational Morphology
- Time Consuming for a New Language
  - Kemal Oflazer estimates
    - 3-4 months to build basic Turkish analyzer
    - Plus lexicon development and maintenance
- Expertise Needed
  - Greenlandic
    - Official language of Greenland
    - Agglutinative Inuit language
    - 50,000 speakers
    - Per Langaard
6 The Solution
Raw Text
Unsupervised Morphology Induction
7 ParaMor: Paradigm Morphology
- ParaMor
  - Unsupervised morphology induction system
- Paradigm
  - The natural structure of morphology
8-18 Paradigms: The Structure of Morphology
Slots of the Turkish verb and the fillers shown for each:
- Stem: götür (take)
- Voice: ül (passive)
- Polarity: m (negative)
- Tense Mood: üyor (present progressive), yecek (future)
- Person Number: sun (2nd person singular), um (1st person singular), Ø (3rd person singular), uz (1st person plural)
Paradigms: the fillers of a single slot, e.g. üyor / yecek and um / Ø / uz
- Paradigm
  - Set of mutually replaceable strings
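As a small illustration of "mutually replaceable strings", the sketch below swaps the person/number suffixes from the slides onto one fixed base. It is restricted to the present progressive base because the slides' segmentation sets morphophonology aside; the names `PERSON_NUMBER`, `BASE`, and `replace_suffix` are hypothetical.

```python
# Person/number suffixes and glosses taken from the slides; "" stands for Ø.
PERSON_NUMBER = {
    "sun": "2nd person singular",
    "um": "1st person singular",
    "": "3rd person singular",
    "uz": "1st person plural",
}

# götür + ül + m + üyor: take, passive, negative, present progressive.
BASE = "götürülmüyor"


def replace_suffix(word, old, new):
    """Swap one member of a paradigm for another at the end of a word."""
    if not word.endswith(old):
        raise ValueError(f"{word!r} does not end in {old!r}")
    return word[: len(word) - len(old)] + new


# Every suffix in the paradigm can stand in for every other one.
forms = {BASE + suffix: gloss for suffix, gloss in PERSON_NUMBER.items()}
swapped = replace_suffix("götürülmüyorsun", "sun", "uz")
```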
23 The ParaMor Algorithm
- Identify suffix paradigms in 3 steps
  - Search for candidate paradigms
  - Cluster candidates modeling the same paradigm
  - Filter
- Segment words
  - Using the discovered paradigms
24 Search for Candidate Paradigms
- All character boundaries are candidate morpheme boundaries
25 Search for Candidate Paradigms
- Begin search with the most frequent word-final string
Spanish
autorizaciones buscabamos costas importadoras vallas
s 10662
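A minimal sketch of these first two ideas: split every word at every character boundary, then count how many word types end in each string. The toy vocabulary reuses the slide's example words plus one extra type; skipping Ø when choosing the seed is an assumption made here, since Ø trivially matches every word.

```python
from collections import Counter


def candidate_splits(word):
    """Return (stem, suffix) for every character boundary, including Ø."""
    return [(word[:i], word[i:]) for i in range(1, len(word) + 1)]


def word_final_counts(vocabulary):
    """Count, per word-final string, the word types (stems) it attaches to."""
    counts = Counter()
    for word in set(vocabulary):
        for _stem, suffix in candidate_splits(word):
            if suffix:  # assumption: Ø is not used as a seed
                counts[suffix] += 1
    return counts


vocabulary = ["autorizaciones", "buscabamos", "costas",
              "importadoras", "vallas", "costa"]
counts = word_final_counts(vocabulary)
seed, seed_count = counts.most_common(1)[0]
```

On this toy vocabulary the seed is `s`, matching the slide, though the count is of course far smaller than the corpus figure of 10662.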
26 Search for Candidate Paradigms
- Identify the most frequent mutually replaceable string
  - Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm
Spanish
autorizaciones buscabamos costas importadoras vallas
Ø s 5501
s 10662
27 Search for Candidate Paradigms
- Stop adding suffixes
  - When the most frequent mutually replaceable string severely decreases the stem count.
Ø r s 287
autorizaciones buscabamos costas importadoras vallas
Ø s 5501
s 10662
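The grow-then-stop loop described on these slides might look roughly like the following. This is a sketch under stated assumptions, not ParaMor's implementation: the stopping test is a simple ratio of stem counts, and the threshold `min_ratio=0.25` is a hypothetical value chosen only for illustration.

```python
from collections import defaultdict


def suffix_index(vocabulary):
    """Map every word-final string (including Ø as "") to its stems."""
    stems_by_suffix = defaultdict(set)
    for word in vocabulary:
        for i in range(1, len(word) + 1):
            stems_by_suffix[word[i:]].add(word[:i])
    return stems_by_suffix


def grow_candidate(seed, stems_by_suffix, min_ratio=0.25):
    """Greedily add the suffix that keeps the most stems; stop on a sharp drop."""
    suffixes = {seed}
    stems = set(stems_by_suffix[seed])
    while True:
        best_suffix, best_stems = None, set()
        for suffix, suffix_stems in stems_by_suffix.items():
            if suffix in suffixes:
                continue
            shared = stems & suffix_stems
            if len(shared) > len(best_stems):
                best_suffix, best_stems = suffix, shared
        # Stop when nothing is left to add or the stem count drops too far.
        if best_suffix is None or len(best_stems) < min_ratio * len(stems):
            return suffixes, stems
        suffixes.add(best_suffix)
        stems = best_stems


vocabulary = ["costa", "costas", "valla", "vallas", "casa", "casas",
              "autorizaciones", "buscabamos"]
index = suffix_index(vocabulary)
suffixes, stems = grow_candidate("s", index)
```

With the counts on the slide, moving from `s` to `Ø s` keeps 5501 of 10662 stems (about 0.52), while adding `r` keeps only 287 of 5501 (about 0.05), which any threshold in between would reject.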
28 Search for Candidate Paradigms
- Move on to the next most frequent word-final string
Ø r s 287
Ø s 5501
s 10662
a 8981
29-33 Search for Candidate Paradigms
Search paths by seed suffix (suffix set, then stem count):
- s 10662 → Ø s 5501 → Ø r s 287
- a 8981 → a o 2304 → a o os 1410 → a as o os 892
- n 6051 → Ø n 1874 → Ø n r 509 → Ø do n r 354 → Ø da das do dos n ndo r ron 118
- es 2751 → Ø es 874
- an 1786 → a an 1049 → a an ar 413 → a an ar ó 353 → a ada adas ado ados an ar aron ó 149
- rado 167 → rada rado 89 → rada rado rados 67 → rada radas rado rados 53 → ra rada radas rado rados ran rar raron ró 23
- strado 15 → strada strado 12 → strada strado stró 9 → strada strado strar stró 8 → strada stradas strado strar stró 7
- ...
34-37 Cluster Candidates per Paradigm
Candidates:
- 15 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
  - 23 Stems: anunci, apoy, confirm, consider, declar, ...
  - 345 Covered Types
- 15 suffixes: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
  - 22 Stems: anunci, aplic, apoy, celebr, concentr, ...
  - 330 Covered Types
- 15 suffixes: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
  - 25 Stems: anunci, aplic, apoy, celebr, consider, ...
  - 375 Covered Types
Clusters:
- 16 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
  - Cosine Similarity 0.664
  - 451 Covered Types
- 17 suffixes: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó
  - Cosine Similarity 0.715
  - 532 Covered Types
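The slides do not say how the cosine similarity is computed. The reported numbers are, however, consistent with cosine over binary covered-type vectors, assuming a merged cluster covers the union of its members' types: 345 + 330 - 451 = 224 shared types gives 224 / sqrt(345 * 330) ≈ 0.664, and 451 + 375 - 532 = 294 gives 294 / sqrt(451 * 375) ≈ 0.715. The sketch below reproduces that arithmetic with synthetic integer sets standing in for the real word types, which are not listed on the slides.

```python
import math


def cosine_similarity(types_a, types_b):
    """Cosine between two sets treated as binary vectors over word types."""
    if not types_a or not types_b:
        return 0.0
    shared = len(types_a & types_b)
    return shared / math.sqrt(len(types_a) * len(types_b))


# Synthetic stand-ins sized to match the slide: 345 and 330 covered types
# whose union has 451 types; then a third candidate with 375 types whose
# union with the merged cluster has 532 types.
types_a = set(range(0, 345))
types_b = set(range(121, 451))
merged_ab = types_a | types_b
types_c = set(range(157, 532))
merged_abc = merged_ab | types_c

first_merge = cosine_similarity(types_a, types_b)
second_merge = cosine_similarity(merged_ab, types_c)
```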
38 Filter Candidate Paradigms
- 2 types of filtering
  - Remove small unclustered candidate paradigms
  - Remove candidates modeling unlikely morpheme boundaries (Harris, 1955)
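The slide cites Harris (1955) without detail. One common reading of that work is letter successor variety: a morpheme boundary is more plausible after a prefix that can be followed by many different letters. The sketch below counts successor variety over a toy vocabulary; treating this as the filter's criterion is an assumption, and the end-of-word marker `#` and the function name are illustrative.

```python
from collections import defaultdict


def successor_variety(vocabulary, end_marker="#"):
    """For each word-initial string, count the distinct symbols that follow it."""
    followers = defaultdict(set)
    for word in set(vocabulary):
        for i in range(len(word)):
            followers[word[:i]].add(word[i])
        # The end of a word counts as one more possible continuation.
        followers[word].add(end_marker)
    return {prefix: len(symbols) for prefix, symbols in followers.items()}


vocabulary = ["administra", "administrada", "administradas",
              "administrado", "administrar", "administró"]
variety = successor_variety(vocabulary)
```

In this toy vocabulary, `admin` is always followed by `i`, while `administra` can end the word or continue with `d` or `r`, so a boundary after `administra` is better supported than one after `admin`.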
39-46 Segment Words Using Paradigms
Word to segment: administradas
- Paradigm a ada adas ado ados an ar aron ó ...
  - Replacing adas with ada gives administrada
  - Segmentation: administr adas
- Paradigm a as o os
  - Segmentations: administr adas, administrad as
  - Old way: Separate alternative analysis
  - New way: Augment the current segmentation: administr ad as
- Paradigm Ø s
  - Replacing s with Ø gives administrada (administradaØ)
  - Segmentations: administr adas, administrad as, administrada s
  - Augmented segmentation: administr ad a s
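The "new way" of combining analyses can be sketched as collecting one boundary per matching paradigm and cutting the word at all of them. The rule used here, that a suffix is accepted only if the same stem plus another suffix of the paradigm is a known word, is an assumption drawn from the example; `segment` and the two-word vocabulary are illustrative.

```python
def segment(word, paradigms, vocabulary):
    """Cut a word at every boundary licensed by some paradigm."""
    boundaries = set()
    for paradigm in paradigms:
        for suffix in paradigm:
            if not suffix or not word.endswith(suffix):
                continue  # Ø adds no boundary; other suffixes must match
            cut = len(word) - len(suffix)
            stem = word[:cut]
            # Accept the cut if the stem also occurs with a sibling suffix.
            if stem and any(stem + other in vocabulary
                            for other in paradigm if other != suffix):
                boundaries.add(cut)
    cuts = sorted(boundaries)
    return [word[start:end]
            for start, end in zip([0] + cuts, cuts + [len(word)])]


# Paradigms from the slides; "" stands for Ø.
paradigms = [
    {"a", "ada", "adas", "ado", "ados", "an", "ar", "aron", "ó"},
    {"a", "as", "o", "os"},
    {"", "s"},
]
vocabulary = {"administradas", "administrada"}
pieces = segment("administradas", paradigms, vocabulary)
```

Each paradigm contributes one boundary (after `administr`, `administrad`, and `administrada`), and merging them gives the four-piece segmentation shown on the last build.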
47 Morpho Challenge 2007
- Peer operated competition
  - For unsupervised morphology induction algorithms
- 4 languages
  - English
  - German
  - Finnish
  - Turkish
48 ParaMor in Morpho Challenge 2007
- Developed on Spanish
- ParaMor's free parameters were frozen
49 Two Methods of Evaluation
- Linguistic
  - Segmentations compared to a morphologically analyzed lexicon
51 Two Methods of Evaluation
- Task based
  - Information retrieval
    - Short two-sentence queries
    - About international news topics
    - Binary relevance assessments
    - About 50 queries and 20K relevance judgements for each language
52-60 Linguistic Evaluation
[Bar chart built up over slides 52-60. Axis: F1. Systems: Bernhard 2, Morfessor, ParaMor, ParaMor + Morfessor. Bar labels: 60.8, 56.3, 53.4, 52.9, 52.0, 50.7, 50.6, 48.5, 48.2, 47.2, 24.7.]
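For reference, F1 is conventionally the harmonic mean of precision and recall. The slides do not show how precision and recall are derived from the morphologically analyzed lexicon, so the inputs below are placeholder values, not results from the talk.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Placeholder inputs for illustration only.
balanced = f1_score(0.5, 0.5)
skewed = f1_score(0.6, 0.4)
```

Because the harmonic mean penalizes imbalance, the skewed case scores lower than the balanced one even though both average to 0.5.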
61-65 IR Evaluation (TF/IDF)
[Bar chart built up over slides 61-65. Axis: Average Precision. Systems: ParaMor + Morfessor (P M), Morfessor Baseline, Morfessor, ParaMor, McNamee. No Morphological Analysis reference values: 27.0, 30.7, 32.0. Other bar labels: 41.2, 38.8, 38.3, 38.2, 37.2, 32.1, 29.3, 28.9, 26.4.]
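With binary relevance assessments, average precision for one query can be computed as in the sketch below; averaging over all queries gives the per-language figure. This is the standard uninterpolated definition and whether Morpho Challenge used exactly this variant is an assumption. The document identifiers are made up.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Mean of precision at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """Average the per-query scores over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranking: relevant documents at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```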
66 ParaMor: State-of-the-Art Unsupervised Morphology Induction System
- Combined system among the best in Morpho Challenge 2007
  - Consistent across languages
- Better than no morphology
  - Task based (IR) measure
67 Many Future Directions
- Improve Performance
  - F1 of 50-60 is state-of-the-art!
- Inflection classes
- Morphophonology
- Beyond beads-on-a-string
68 Thank You!