ParaMor - PowerPoint PPT Presentation

Provided by: carnegieme
1
ParaMor: Finding Paradigms Across Morphology
Christian Monson
2
Turkish Morphology: Beads on a String
  • take
  • passive
  • negative
  • present progressive
  • 2nd person singular
"You are not being taken"
3
Turkish Morphology: Beads on a String
götür + ül + m + üyor + sun
  • götür: take
  • ül: passive
  • m: negative
  • üyor: present progressive
  • sun: 2nd person singular
"You are not being taken"
4
Applications of Computational Morphology
  • Machine Translation
      • Turkish-English (Oflazer, 2007)
      • Czech-English (Goldwater and McClosky, 2005)
  • Speech Recognition
      • Finnish (Creutz, 2006)
  • Information Retrieval

5
Challenges of Computational Morphology
  • Time Consuming for a New Language
      • Kemal Oflazer estimates 3-4 months to build basic Turkish analyzer
      • Plus lexicon development and maintenance
  • Expertise Needed
      • Greenlandic
      • Official language of Greenland
      • Agglutinative Inuit language
      • 50,000 speakers
      • Per Langaard

6
The Solution
Raw Text
Unsupervised Morphology Induction
7
ParaMor: Paradigm Morphology
  • ParaMor
      • Unsupervised morphology induction system
  • Paradigm
      • The natural structure of morphology

8
Paradigms: The Structure of Morphology
  • Stem: götür (take)
  • Voice: ül (passive)
  • Polarity: m (negative)
  • Tense and Mood: üyor (present progressive)
  • Person and Number: sun (2nd person singular)
9
Paradigms: The Structure of Morphology
  • Stem: götür (take)
  • Voice: ül (passive)
  • Polarity: m (negative)
  • Tense and Mood: üyor (present progressive)
  • Person and Number: um (1st person singular)
10
Paradigms: The Structure of Morphology
  • Stem: götür (take)
  • Voice: ül (passive)
  • Polarity: m (negative)
  • Tense and Mood: üyor (present progressive)
  • Person and Number: um, Ø (Ø: 3rd person singular)
11
Paradigms: The Structure of Morphology
  • Stem: götür (take)
  • Voice: ül (passive)
  • Polarity: m (negative)
  • Tense and Mood: üyor (present progressive)
  • Person and Number: um, Ø, uz (uz: 1st person plural)
12
Paradigms: The Structure of Morphology
  • Stem: götür (take)
  • Voice: ül (passive)
  • Polarity: m (negative)
  • Tense and Mood: üyor (present progressive)
  • Person and Number: um, Ø, uz
13
Paradigms: The Structure of Morphology
  • Stem: götür (take)
  • Voice: ül (passive)
  • Polarity: m (negative)
  • Tense and Mood: üyor, yecek (yecek: future)
  • Person and Number: um, Ø, uz
14
Paradigms: The Structure of Morphology
  • Stem: götür (take)
  • Voice: ül (passive)
  • Polarity: m (negative)
  • Tense and Mood: üyor, yecek
  • Person and Number: um, Ø, uz
15
Paradigms: The Structure of Morphology
  • Voice: ül
  • Polarity: m
  • Tense and Mood: üyor, yecek
  • Person and Number: um, Ø, uz
16
Paradigms: The Structure of Morphology
Paradigms
  • ül
  • m
  • üyor, yecek
  • um, Ø, uz
17
Paradigms: The Structure of Morphology
Paradigms
  • ül
  • m
  • üyor, yecek
  • um, Ø, uz
  • Paradigm
  • Set of mutually replaceable strings

18
Paradigms: The Structure of Morphology
Paradigm
  • ül
  • m
  • üyor, yecek
  • um, Ø, uz
  • Paradigm
  • Set of mutually replaceable strings
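
"Mutually replaceable" can be read as: pick any one filler per slot and the result is still a word form. The sketch below illustrates that reading with the segments shown on the slides. The names `STEM`, `SLOTS`, and `generate_forms` are made up for this example, the empty string stands in for Ø, and the output strings are plain concatenations of the slide's simplified segments, not verified Turkish spellings.

```python
from itertools import product

# Slot fillers copied from the slides; "" stands in for the null suffix Ø.
STEM = "götür"
SLOTS = [
    ["ül"],              # Voice
    ["m"],               # Polarity
    ["üyor", "yecek"],   # Tense and Mood
    ["um", "", "uz"],    # Person and Number
]


def generate_forms(stem, slots):
    """Concatenate the stem with one filler from each slot, in slot order."""
    return [stem + "".join(choice) for choice in product(*slots)]


forms = generate_forms(STEM, SLOTS)
for form in forms:
    print(form)
```

Two tense fillers times three person fillers give six forms, which is the sense in which the fillers of one slot replace each other.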

19
The ParaMor Algorithm
  • Identify suffix paradigms in 3 steps

20
The ParaMor Algorithm
  • Identify suffix paradigms in 3 steps
  • Search for candidate paradigms

21
The ParaMor Algorithm
  • Identify suffix paradigms in 3 steps
  • Search for candidate paradigms
  • Cluster candidates modeling the same paradigm

22
The ParaMor Algorithm
  • Identify suffix paradigms in 3 steps
  • Search for candidate paradigms
  • Cluster candidates modeling the same paradigm
  • Filter

23
The ParaMor Algorithm
  • Identify suffix paradigms in 3 steps
      • Search for candidate paradigms
      • Cluster candidates modeling the same paradigm
      • Filter
  • Segment words
      • Using the discovered paradigms

24
Search for Candidate Paradigms
  • All character boundaries are candidate morpheme
    boundaries
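
Treating every character boundary as a candidate morpheme boundary amounts to splitting each word at every position and recording which stems each word-final string follows. A minimal sketch, assuming a plain set of word types as input; the function name `suffix_stem_sets` and the toy vocabulary are illustrative, and the empty string plays the role of Ø.

```python
from collections import defaultdict


def suffix_stem_sets(vocabulary):
    """Map each word-final string to the set of stems it attaches to."""
    stems_by_suffix = defaultdict(set)
    for word in vocabulary:
        # Split after every character; i == len(word) yields the null suffix.
        for i in range(1, len(word) + 1):
            stems_by_suffix[word[i:]].add(word[:i])
    return stems_by_suffix


vocab = {"costa", "costas", "valla", "vallas", "importadoras"}
table = suffix_stem_sets(vocab)
print(sorted(table["s"]))   # stems that occur before a final "s"
print(sorted(table["as"]))  # stems that occur before a final "as"
```

The size of each stem set is the kind of count the search slides attach to a suffix, such as the 10662 next to "s".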

25
Search for Candidate Paradigms
  • Begin search with the most frequent word-final
    string

Spanish
autorizaciones buscabamos costas importadoras vallas
s 10662
26
Search for Candidate Paradigms
  • Identify the most frequent mutually replaceable
    string
  • Stems that occur with one suffix in a paradigm
    will likely occur with other suffixes in that
    paradigm

Spanish
autorizaciones buscabamos costas importadoras vallas
Ø s 5501
s 10662
27
Search for Candidate Paradigms
  • Stop adding suffixes
  • When the most frequent mutually replaceable
    string severely decreases the stem count

Ø r s 287
autorizaciones buscabamos costas importadoras vallas
Ø s 5501
s 10662
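
The growth from "s" (10662 stems) to "Ø s" (5501) and the rejected step to "Ø r s" (287) can be sketched as a greedy loop. This is only an illustration under stated assumptions: the slides say "severely decreases" without giving a threshold, so `min_ratio` is a hypothetical parameter, and seeding from the non-null suffix with the most stems is also an assumption. The toy vocabulary is invented.

```python
from collections import defaultdict


def suffix_stem_sets(vocabulary):
    """Map every word-final string to the set of stems it follows."""
    table = defaultdict(set)
    for word in vocabulary:
        for i in range(1, len(word) + 1):  # i == len(word) gives Ø ("")
            table[word[i:]].add(word[:i])
    return table


def grow_candidate(vocabulary, min_ratio=0.25):
    """Greedily grow one candidate paradigm from the most frequent suffix."""
    table = suffix_stem_sets(vocabulary)
    # Assumption: the seed is the non-null suffix with the most stems.
    seed = max(sorted(s for s in table if s), key=lambda s: len(table[s]))
    suffixes, stems = [seed], set(table[seed])
    while True:
        best, best_stems = None, set()
        for suffix in sorted(table):
            if suffix in suffixes:
                continue
            shared = stems & table[suffix]
            if len(shared) > len(best_stems):
                best, best_stems = suffix, shared
        # Stop when the best addition keeps too few of the current stems.
        if best is None or len(best_stems) < min_ratio * len(stems):
            break
        suffixes.append(best)
        stems = best_stems
    return suffixes, stems


vocab = {"costa", "costas", "valla", "vallas", "casa", "casas", "mesa",
         "mesas", "pera", "peras", "autorizaciones", "costar"}
suffixes, stems = grow_candidate(vocab)
print(suffixes, sorted(stems))
```

On this toy vocabulary the loop seeds with "s", adds Ø because five of the six stems survive, and then refuses "r" because only one stem would remain.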
28
Search for Candidate Paradigms
  • Move on to the next most frequent word-final
    string

Ø r s 287
Ø s 5501
s 10662
a 8981
29
Search for Candidate Paradigms
a as o os 892
a o os 1410
Ø r s 287
a o 2304
Ø s 5501
a 8981
s 10662
30
Search for Candidate Paradigms
Ø da das do dos n ndo r ron 118
Ø do n r 354
a as o os 892
Ø n r 509
a o os 1410
Ø r s 287
Ø n 1874
a o 2304
Ø s 5501
n 6051
a 8981
s 10662
31
Search for Candidate Paradigms
Ø da das do dos n ndo r ron 118
Ø do n r 354
a as o os 892
Ø n r 509
a o os 1410
Ø r s 287
Ø n 1874
a o 2304
Ø s 5501
Ø es 874
n 6051
a 8981
s 10662
es 2751
32
Search for Candidate Paradigms
a ada adas ado ados an ar aron ó 149
Ø da das do dos n ndo r ron 118
a an ar ó 353
Ø do n r 354
a as o os 892
a an ar 413
Ø n r 509
a o os 1410
Ø r s 287
a an 1049
Ø n 1874
a o 2304
Ø s 5501
Ø es 874
an 1786
n 6051
a 8981
s 10662
es 2751
33
Search for Candidate Paradigms
a ada adas ado ados an ar aron ó 149
ra rada radas rado rados ran rar raron ró 23
Ø da das do dos n ndo r ron 118
strada stradas strado strar stró 7
a an ar ó 353
rada radas rado rados 53
Ø do n r 354
a as o os 892
strada strado strar stró 8
a an ar 413
rada rado rados 67
Ø n r 509
a o os 1410
strada strado stró 9
Ø r s 287
a an 1049
rada rado 89
Ø n 1874
a o 2304
Ø s 5501
strada strado 12
Ø es 874
strado 15
rado 167
an 1786
n 6051
a 8981
s 10662
es 2751
...
...
34
Cluster Candidates per Paradigm
  • Candidate: 15 suffixes, 23 stems, 345 covered types
    Suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
    Stems: anunci, apoy, confirm, consider, declar, ...
  • Candidate: 15 suffixes, 22 stems, 330 covered types
    Suffixes: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
    Stems: anunci, aplic, apoy, celebr, concentr, ...
35
Cluster Candidates per Paradigm
  • Merged cluster: 16 suffixes, cosine similarity 0.664, 451 covered types
    Suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
  • Candidate: 15 suffixes, 23 stems, 345 covered types
    Suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
    Stems: anunci, apoy, confirm, consider, declar, ...
  • Candidate: 15 suffixes, 22 stems, 330 covered types
    Suffixes: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
    Stems: anunci, aplic, apoy, celebr, concentr, ...
36
Cluster Candidates per Paradigm
  • Merged cluster: 16 suffixes, cosine similarity 0.664, 451 covered types
    Suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
  • Candidate: 15 suffixes, 25 stems, 375 covered types
    Suffixes: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
    Stems: anunci, aplic, apoy, celebr, consider, ...
  • Candidate: 15 suffixes, 23 stems, 345 covered types
    Suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
    Stems: anunci, apoy, confirm, consider, declar, ...
  • Candidate: 15 suffixes, 22 stems, 330 covered types
    Suffixes: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
    Stems: anunci, aplic, apoy, celebr, concentr, ...
37
Cluster Candidates per Paradigm
  • Merged cluster: 17 suffixes, cosine similarity 0.715, 532 covered types
    Suffixes: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó
  • Merged cluster: 16 suffixes, cosine similarity 0.664, 451 covered types
    Suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
  • Candidate: 15 suffixes, 25 stems, 375 covered types
    Suffixes: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
    Stems: anunci, aplic, apoy, celebr, consider, ...
  • Candidate: 15 suffixes, 23 stems, 345 covered types
    Suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
    Stems: anunci, apoy, confirm, consider, declar, ...
  • Candidate: 15 suffixes, 22 stems, 330 covered types
    Suffixes: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
    Stems: anunci, aplic, apoy, celebr, concentr, ...
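
The numbers on these slides fit together if the cosine similarity is taken over the sets of covered types (each stem-suffix word the candidate accounts for). That reading is an assumption, but it reproduces both reported values: the overlap of two sets follows from their sizes and the size of their union, and the cosine of two binary vectors is the overlap divided by the geometric mean of the sizes. The function name `cosine_from_sizes` is made up for this check.

```python
from math import sqrt


def cosine_from_sizes(size_a, size_b, size_union):
    """Cosine similarity of two sets viewed as binary membership vectors."""
    overlap = size_a + size_b - size_union  # inclusion-exclusion
    return overlap / sqrt(size_a * size_b)


# An unmerged candidate covers suffixes x stems types: 15 * 23 = 345.
first_merge = cosine_from_sizes(345, 330, 451)   # the two 15-suffix candidates
second_merge = cosine_from_sizes(451, 375, 532)  # merged cluster + third candidate
print(round(first_merge, 3), round(second_merge, 3))
```

Under this assumption the two merges share 224 and 294 covered types respectively.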
38
Filter Candidate Paradigms
  • 2 types of filtering
  • Remove small unclustered candidate paradigms
  • Remove candidates modeling unlikely morpheme
    boundaries (Harris, 1955)
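
One way to picture the second filter, in the spirit of Harris-style letter variety: at a genuine morpheme boundary the characters next to the boundary tend to vary, so a candidate whose stems all end in the same letter probably cut the words in the wrong place. The sketch below is a simplified stand-in and not ParaMor's actual filter; `stem_final_entropy`, `plausible_boundary`, and the `min_entropy` threshold are all hypothetical.

```python
from collections import Counter
from math import log2


def stem_final_entropy(stems):
    """Shannon entropy (in bits) of the last character across the stems."""
    counts = Counter(stem[-1] for stem in stems if stem)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


def plausible_boundary(stems, min_entropy=1.0):
    """Keep a candidate only if its stem-final letters are varied enough."""
    return stem_final_entropy(stems) >= min_entropy


# Stems for a candidate like "ada ado": final letters r, r, y, i.
good = ["administr", "declar", "apoy", "anunci"]
# Stems for a candidate like "da do": every stem ends in "a".
bad = ["administra", "declara", "apoya", "anuncia"]
print(plausible_boundary(good), plausible_boundary(bad))
```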

39
Segment Words Using Paradigms
administradas
40
Segment Words Using Paradigms
administradas
a ada adas ado ados an ar aron ó ...
41
Segment Words Using Paradigms
administradas
administrada
a ada adas ado ados an ar aron ó ...
42
Segment Words Using Paradigms
administradas
administrada
administr adas
a ada adas ado ados an ar aron ó ...
43
Segment Words Using Paradigms
administradas
administrada
administr adas
a as o os
44
Segment Words Using Paradigms
administradas
administrada
administr adas, administrad as
Old way: Separate alternative analysis
a as o os
45
Segment Words Using Paradigms
administradas
administrada
administr adas, administrad as
New way: Augment the current segmentation
administr ad as
a as o os
46
Segment Words Using Paradigms
administradas
administrada Ø
administr adas, administrad as, administrada s
administr ad a s
Ø s
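
The "augment the current segmentation" idea can be sketched as collecting boundaries from every paradigm into one set of cut positions. The rule used here is an assumption drawn from the slides' example: a suffix licenses a cut when the remaining stem also occurs in the vocabulary with a different suffix of the same paradigm. The function name `segment`, the " +" output format, and the four-word vocabulary are illustrative, and "" stands in for Ø.

```python
def segment(word, paradigms, vocabulary):
    """Return the word with ' +' at every boundary licensed by a paradigm."""
    cuts = set()
    for suffixes in paradigms:
        for suffix in suffixes:
            if not suffix or not word.endswith(suffix):
                continue
            stem = word[: len(word) - len(suffix)]
            # Keep the cut only if the stem also occurs with a different
            # suffix from the same paradigm.
            if stem and any(stem + other in vocabulary
                            for other in suffixes if other != suffix):
                cuts.add(len(stem))
    pieces, start = [], 0
    for cut in sorted(cuts):
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return " +".join(pieces)


paradigms = [
    ["a", "ada", "adas", "ado", "ados", "an", "ar", "aron", "ó"],
    ["a", "as", "o", "os"],
    ["", "s"],
]
vocabulary = {"administradas", "administrada", "administrado", "administrar"}
result = segment("administradas", paradigms, vocabulary)
print(result)
```

Each paradigm contributes one cut here (before "adas", before "as", before "s"), and merging the cuts gives the four-piece segmentation shown on the slide.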
47
Morpho Challenge 2007
  • Peer operated competition
  • For unsupervised morphology induction algorithms
  • 4 languages
  • English
  • German
  • Finnish
  • Turkish

48
ParaMor in Morpho Challenge 2007
  • Developed on Spanish
  • ParaMor's free parameters were frozen

49
2 Methods of Evaluation
  • Linguistic
  • Segmentations compared to a morphologically
    analyzed lexicon
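
The linguistic results are reported as F1, the harmonic mean of precision and recall. The slides do not spell out how segmentations are matched against the lexicon, so the sketch below uses a deliberately simple stand-in: precision and recall over boundary positions. The " +" notation, the function names, and the gold analysis are all hypothetical.

```python
def boundaries(segmented):
    """Character offsets of the cuts in a string like 'administr +ad +as'."""
    cuts, position = set(), 0
    for piece in segmented.split(" +")[:-1]:
        position += len(piece)
        cuts.add(position)
    return cuts


def boundary_f1(predicted, gold):
    """F1 over boundary positions; a simplification, not the official metric."""
    p, g = boundaries(predicted), boundaries(gold)
    hits = len(p & g)
    precision = hits / len(p) if p else 0.0
    recall = hits / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold analysis with two boundaries; the prediction has three.
score = boundary_f1("administr +ad +a +s", "administr +ad +as")
print(round(score, 3))
```

With two of three predicted cuts correct and both gold cuts found, precision is 2/3 and recall is 1.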

51
2 Methods of Evaluation
  • Task based
  • Information retrieval
  • Short two-sentence queries
  • About international news topics
  • Binary relevance assessments
  • About 50 queries and 20K relevance judgements for
    each language
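
With binary relevance assessments, retrieval quality per query is commonly summarized as average precision: the mean of the precision values measured at each rank where a relevant document appears. A minimal sketch of that standard definition; the function name and the example ranking are illustrative, and the slides do not say which variant the organizers used.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average precision for one query from a ranked list of 0/1 judgements."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    # Divide by all known relevant documents if given, else by those retrieved.
    denominator = total_relevant if total_relevant is not None else hits
    return sum(precisions) / denominator if denominator else 0.0


# Relevant documents at ranks 1 and 3: (1/1 + 2/3) / 2.
ap = average_precision([1, 0, 1, 0])
print(round(ap, 4))
```

Averaging this value over all queries of a language gives a single score per system.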

52
Linguistic Evaluation
[Bar chart of F1 by system. Labels shown: Bernhard 2, Morfessor. Value shown: 47.2]
53
Linguistic Evaluation
[Bar chart of F1 by system. Labels shown: Bernhard 2, Morfessor, ParaMor. Values shown: 50.6, 47.2]
54
Linguistic Evaluation
[Bar chart of F1 by system. Labels shown: ParaMor Morfessor, Bernhard 2, Morfessor, ParaMor. Values shown: 50.6, 50.7, 47.2]
55
Linguistic Evaluation
[Bar chart of F1 by system. Labels shown: ParaMor Morfessor, Bernhard 2, Morfessor, ParaMor. Values shown: 60.8, 50.7]
56
Linguistic Evaluation
[Bar chart of F1 by system. Labels shown: ParaMor Morfessor, Bernhard 2, Morfessor, ParaMor. Values shown: 60.8, 56.3]
57
Linguistic Evaluation
[Bar chart of F1 by system. Labels shown: ParaMor Morfessor, Bernhard 2, Morfessor, ParaMor. Values shown: 60.8, 56.3, 52.9, 53.4]
58
Linguistic Evaluation
[Bar chart of F1 by system. Labels shown: ParaMor Morfessor, Bernhard 2, Morfessor, ParaMor. Values shown: 60.8, 56.3, 53.4, 52.9]
59
Linguistic Evaluation
[Bar chart of F1 by system. Labels shown: ParaMor Morfessor, Bernhard 2, Morfessor, ParaMor. Values shown: 60.8, 56.3, 53.4, 52.9, 48.2, 48.5]
60
Linguistic Evaluation
[Bar chart of F1 by system. Labels shown: ParaMor Morfessor, Bernhard 2, Morfessor, ParaMor. Values shown: 60.8, 56.3, 53.4, 52.9, 52.0, 48.2, 48.5, 24.7]
61
IR Evaluation (TF/IDF)
[Bar chart of average precision by system. Labels shown: P M, Morf., Par., McNamee. No Morphological Analysis: 27.0. Other values shown: 28.9, 26.4]
62
IR Evaluation (TF/IDF)
[Bar chart of average precision by system. Labels shown: P M, Morf., ParaMor, McNamee. No Morphological Analysis: 27.0. Other values shown: 29.3, 28.9]
63
IR Evaluation (TF/IDF)
[Bar chart of average precision by system. Labels shown: P M, ParaMor M., Morfessor Baseline, Morf., Morfessor, ParaMor, McNamee. No Morphological Analysis: 30.7. Other values shown: 38.3, 32.1, 29.3, 28.9]
64
IR Evaluation (TF/IDF)
[Bar chart of average precision by system. Labels shown: P M, ParaMor M., Morfessor Baseline, Morf., Morfessor, ParaMor, McNamee. No Morphological Analysis: 30.7. Other values shown: 38.3, 38.2, 29.3, 28.9]
65
IR Evaluation (TF/IDF)
[Bar chart of average precision by system. Labels shown: P M, ParaMor Morfessor, Morfessor Baseline, Morf., Morfessor, ParaMor, McNamee. No Morphological Analysis: 32.0. Other values shown: 41.2, 38.8, 38.2, 37.2, 29.3, 28.9]
66
ParaMor: State-of-the-Art Unsupervised Morphology Induction System
  • Combined system among the best in Morpho
    Challenge 2007
  • Consistent across languages
  • Better than no morphology
  • Task based (IR) measure

67
Many Future Directions
  • Improve Performance
  • F1 of 50-60 is state-of-the-art!
  • Inflection classes
  • Morphophonology
  • Beyond beads-on-a-string

68
Thank You!