1
Morphology-2
  • Sudeshna Sarkar
  • Professor
  • Computer Science & Engineering Department
  • Indian Institute of Technology Kharagpur

2
Morphology in NLP
  • Analysis vs synthesis
  • what does "dogs" mean? vs what is the plural of
    "dog"?
  • Analysis
  • Need to identify the lexeme
  • Tokenization
  • To access lexical information
  • Inflections (etc.) carry information that will be
    needed by other processes (e.g. agreement is
    useful in parsing; inflections can carry meaning,
    e.g. tense, number)
  • Morphology can be ambiguous
  • May need another process to disambiguate (e.g.
    German -en)
  • Synthesis
  • Need to generate appropriate inflections from
    underlying representation

3
Morphological processing
  • Stemming
  • String-handling approaches
  • Regular expressions
  • Mapping onto finite-state automata
  • 2-level morphology
  • Mapping between surface form and lexical
    representation

4
Stemming
  • Stemming is a particular case of tokenization
    which reduces inflected forms to a single base
    form or stem
  • Stemming algorithms are basic string-handling
    algorithms, which depend on rules that identify
    affixes that can be stripped

5
Surface and Lexical Forms
  • The surface level of a word represents the actual
    spelling of that word.
  • geliyorum   eats   cats   kitabim
  • The lexical level of a word represents a simple
    concatenation of the morphemes making up that word.
  • gel +PROG +1SG
  • eat +AOR
  • cat +PLU
  • kitap +P1SG
  • Morphological processors try to find
    correspondences between lexical and surface forms
    of words.
  • Morphological recognition / analysis: surface
    to lexical
  • Morphological generation / synthesis: lexical
    to surface

6
Morphological Parsing
  • Morphological parsing is to find the lexical form
    of a word from its surface form.
  • cats → cat +N +PLU
  • cat → cat +N +SG
  • goose → goose +N +SG or goose +V
  • geese → goose +N +PLU
  • gooses → goose +V +3SG
  • catch → catch +V
  • caught → catch +V +PAST or catch +V +PP
  • AsachhilAma → AsA +PROG +PAST +1st (I/We was/were
    coming)
  • There can be more than one lexical level
    representation for a given word. (ambiguity)
  • flies → fly +VERB +PROG
    or fly +NOUN +PLU
  • mAtAla
  • kare
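  As a toy illustration of analysis returning one or
  more lexical forms, a minimal lookup-based sketch in
  Python; the table is a hypothetical mini-lexicon built
  from the examples above, not a real analyser:

```python
# Minimal lookup-based morphological analysis (illustrative sketch only;
# the table is a toy lexicon built from the examples on this slide).
ANALYSES = {
    "cats":   [("cat", "N", "PLU")],
    "geese":  [("goose", "N", "PLU")],
    "caught": [("catch", "V", "PAST"), ("catch", "V", "PP")],
    "flies":  [("fly", "VERB", "PROG"), ("fly", "NOUN", "PLU")],  # ambiguous
}

def analyse(surface):
    """Return all lexical-level analyses of a surface form (may be several)."""
    return ANALYSES.get(surface, [])

print(analyse("flies"))  # [('fly', 'VERB', 'PROG'), ('fly', 'NOUN', 'PLU')]
```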

7
  • The history of morphological analysis dates back
    to the ancient Indian linguist Pāṇini, who
    formulated the 3,959 rules of Sanskrit morphology
    in the text Aṣṭādhyāyī by using a Constituency
    Grammar.

8
Formal definition of the problem
  • Surface form: the word (ws) as it occurs in the
    text, e.g. sings
  • ws ∈ L ⊆ Σ*
  • Lexical form: the root word(s) (r1, r2, …) and
    other grammatical features (F), e.g. ⟨sing, v, sg, 3rd⟩
  • wl ∈ ⟨Σ*, F⟩
  • wl ∈ Λ

9
Analysis Synthesis
  • Morphological Analysis: maps a string from
    surface form to the corresponding lexical form.
  • fMA : Σ* → Λ
  • Morphological Synthesis: maps a string from
    lexical form to surface form.
  • fMS : Λ → Σ*

10
Relationship between MA MS
  • fMS(fMA(ws)) = ws
  • fMA(fMS(wl)) = wl
  • fMS = fMA⁻¹, fMA = fMS⁻¹
  • But is that really the case?

11
  • Fly + s → flys → flies (y → i rule)
  • Duckling
  • Go-getter → get + er
  • Doer → do + er
  • Beer → ?
  • What knowledge do we need?
  • How do we represent it?
  • How do we compute with it?

12
Knowledge needed
  • Knowledge of stems or roots
  • Duck is a possible root, not duckl
  • We need a dictionary (lexicon)
  • Only some endings go on some words
  • Do + er: OK
  • Be + er: not OK
  • In addition, spelling change rules that adjust
    the surface form
  • Get + er: double the t → getter
  • Fox + s: insert e → foxes
  • Fly + s: y to i, insert e → flies
  • Chase + ed: drop e → chased

13
Put all this in a big dictionary (lexicon)
  • Turkish: approx. 600 × 10⁶ forms
  • Finnish: 10⁷
  • Hindi, Bengali, Telugu, Tamil?
  • Besides, novel forms can always be constructed
  • Anti-missile
  • Anti-anti-missile
  • Anti-anti-anti-missile
  • ..
  • Compounding of words: Sanskrit, German

14
Dictionary
  • Lemma: lexical unit, pointer to the lexicon
  • typically represented as the base form, or
    dictionary headword
  • possibly indexed when ambiguous/polysemous
  • state1 (verb), state2 (state-of-the-art), state3
    (government)
  • formed from one or more morphemes (root, stem,
    root+derivation, ...)
  • Categories: non-lexical
  • small number of possible values (< 100, often
    5-10)

15
Morphological Analyzer
  • Relatively simple for English.
  • But for many Indian languages, it may be more
    difficult.
  • Examples
  • Inflectional and Derivational Morphology.
  • Common tools: finite-state transducers
  • A transducer maps a set/string of symbols to
    another set/string of symbols

16
A simpler problem
  • Linear concatenation of morphemes with possible
    spelling changes at the boundary and a few
    irregular cases.
  • Quite practical assumptions for
  • English, Hindi, Bengali, Telugu, Tamil, French,
    Turkish
  • Exceptions: Semitic languages, Sanskrit

17
Computational Morphology
  • Approaches
  • Lexicon only
  • Rules only
  • Lexicon and Rules
  • Finite-state Automata
  • Finite-state Transducers

18
Computational Morphology
  • Systems
  • WordNet's morphy
  • PCKimmo
  • Named after Kimmo Koskenniemi; much work done by
    Lauri Karttunen, Ron Kaplan, and Martin Kay
  • Accurate but complex
  • http://www.sil.org/pckimmo/
  • Two-level morphology
  • Commercial version available from InXight Corp.
  • Background
  • Chapter 3 of Jurafsky and Martin
  • A short history of Two-Level Morphology:
  • http://www.ling.helsinki.fi/~koskenni/esslli-2001-
    karttunen/

19
Morphological Analyser
  • To build a morphological analyser we need
  • lexicon: the list of stems and affixes, together
    with basic information about them
  • morphotactics: the model of morpheme ordering (e.g.
    the English plural morpheme follows the noun rather
    than a verb)
  • orthographic rules: these spelling rules are used
    to model the changes that occur in a word,
    usually when two morphemes combine (e.g., fly+s →
    flies)

20
Finite State Machines
  • FSAs are equivalent to regular languages
  • FSTs are equivalent to regular relations (over
    pairs of regular languages)
  • FSTs are like FSAs but with complex labels.
  • We can use FSTs to transduce between surface and
    lexical levels.

21
Can FSAs help?
[FSA diagram: Q0 →reg-noun→ Q1 →plural (-s)→ Q2,
with separate arcs for irreg-sg-noun and irreg-pl-noun]
22
What's this for?
[FSA diagram: Q0 →un→ Q1 →adj-root→ Q2 →(-er | -est | -ly)→ Q3,
with an ε arc allowing the affixes to be skipped]
un? ADJ-ROOT (er | est | ly)?
23
Morphotactics
  • The last two examples basically model some parts
    of the English morphotactics
  • But where is the information about regular and
    irregular roots? In the LEXICON
  • Can we include the lexicon in the FSA?

24
The English Pluralization FSA
25
After adding a mini-lexicon
[FSA diagram: the pluralization FSA with the stems
spelled out letter by letter (e.g. d-o-g, m-a-n / m-e-n)
between Q0 and Q1, and an s arc from Q1 to Q2 for the
regular plural]
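A minimal Python sketch of such a letter-by-letter FSA,
assuming a reconstruction of the diagram with the stems
dog, man and men plus a regular plural -s arc:

```python
# A minimal sketch of a letter-by-letter FSA over a mini-lexicon
# (assumed reconstruction: accepts dog/dogs, man, men; rejects "mans").
TRANS = {
    (0, "d"): 1, (1, "o"): 2, (2, "g"): 3,   # d-o-g  -> reg-noun state 3
    (0, "m"): 4, (4, "a"): 5, (4, "e"): 6,   # m-a-n / m-e-n
    (5, "n"): 7,                             # man    -> irreg-sg state 7
    (6, "n"): 8,                             # men    -> irreg-pl state 8
    (3, "s"): 9,                             # dog+s  -> plural state 9
}
ACCEPT = {3, 7, 8, 9}

def accepts(word):
    state = 0
    for ch in word:
        if (state, ch) not in TRANS:
            return False                     # no arc: reject
        state = TRANS[(state, ch)]
    return state in ACCEPT

assert accepts("dogs") and accepts("men") and not accepts("mans")
```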
26
Elegance & Power
  • FSAs are elegant because
  • NFA ⇔ DFA (equivalent in expressive power)
  • Closed under Union, Intersection, Concatenation,
    Complementation
  • Traversal is always linear on input size
  • Well-known algorithms for minimization,
    determinization, compilation etc.
  • They are powerful because they can capture
  • Linear morphology
  • Irregularities

27
But
  • FSAs are language recognizers/generators.
  • We need transducers to build
  • Morphological Analyzers (fMA)
  • Morphological Synthesizers (fMS)

28
Finite State Transducers
[FST diagram: surface form "s i n g s" mapped to lexical
form "s i n g +v +sg" by a finite state machine]
29
Formal Definition
  • A 6-tuple ⟨Σ, Δ, Q, δ, q0, F⟩
  • Σ is the (finite) set of input symbols
  • Δ is the (finite) set of output symbols
  • Q is the (finite) set of states
  • δ is the transition function: Q × Σ → Q × Δ
  • q0 ∈ Q is the start state
  • F ⊆ Q is the set of accepting states
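  A minimal sketch of running an FST per this definition,
  assuming a deterministic transition function stored as a
  Python dict:

```python
# A minimal sketch of the definition above, with a deterministic delta
# stored as a dict: (state, input symbol) -> (next state, output symbol).
def transduce(delta, q0, F, inp):
    """Run the FST on `inp`; return the output list, or None if rejected."""
    q, out = q0, []
    for sym in inp:
        if (q, sym) not in delta:
            return None              # no transition: reject
        q, o = delta[(q, sym)]
        out.append(o)
    return out if q in F else None
```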

30
An example FST
[FST diagram: the mini-lexicon automaton with identity
pairs on each arc (a:a, b:b, d:d, g:g, m:m, n:n, o:o,
s:s, u:u) and one non-identity pair e:a for the man/men
alternation]
31
The Lexicon FST
[FST diagram: the lexicon FST; stem letters map to
themselves, while arcs at the end of each stem emit
morphological features (e.g. s:+Pl for the regular
plural, and +Sg / +Pl arcs for the irregular nouns)]
32
Ways to look at FSTs
  • Recognizer of a pair of strings
  • Generator of a pair of strings
  • Translator from one regular language to another
  • Computer of a relation: a regular relation

33
Invertibility
  • Given T = ⟨Σ, Δ, Q, δ, q0, F⟩
  • construct T⁻¹ = ⟨Δ, Σ, Q, δ⁻¹, q0, F⟩
  • such that if δ(q, x) = (q′, y)
  • then δ⁻¹(q, y) = (q′, x)
  • where x ∈ Σ and y ∈ Δ
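  Sketched in Python over a hypothetical arc-set encoding
  (arcs as (state, input, output, next-state) tuples),
  inversion is just swapping the two labels:

```python
# Inversion under an arc encoding: an FST is a set of arcs
# (state, input, output, next_state); T^{-1} swaps input and output.
def invert(arcs):
    return {(q, y, x, q2) for (q, x, y, q2) in arcs}
```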

34
Compositionality
  • T1 = ⟨Σ, X, Q1, δ1, q1, F1⟩ and T2 = ⟨X, Δ, Q2, δ2, q2, F2⟩
  • Define T3 = ⟨Σ, Δ, Q3, δ3, q3, F3⟩
  • such that Q3 = Q1 × Q2
  • q3 = (q1, q2)
  • δ3((q, s), i) = ((q′, s′), o) if
  • ∃c s.t. δ1(q, i) = (q′, c) and δ2(s, c) = (s′, o)
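  Under the same hypothetical arc-set encoding, composition
  pairs up arcs whose intermediate symbols match:

```python
# Composition under the arc encoding: T3 runs T1 and T2 in lockstep,
# feeding T1's output into T2's input. (Epsilon arcs would need extra
# handling, omitted in this sketch.)
def compose(arcs1, arcs2):
    return {((q1, q2), i, o, (q1b, q2b))
            for (q1, i, c, q1b) in arcs1
            for (q2, c2, o, q2b) in arcs2
            if c == c2}  # intermediate symbols must match
```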

35
Modeling Orthographic Rules
  • Spelling changes at morpheme boundaries
  • bus+s → buses, watch+s → watches
  • fly+s → flies
  • make+ing → making
  • Rules
  • E-insertion takes place if the stem ends in s, z,
    ch, sh, etc.
  • y maps to ie when the pluralization marker s is
    added
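  A rough regex-based sketch of these two rules, assuming
  ^ marks the morpheme boundary (the marker and the rule
  set are illustrative, not the FST formulation):

```python
import re

# Rough sketch of the two spelling rules above, applied at a morpheme
# boundary written as "^" (boundary marker and rules are illustrative).
def apply_spelling(form):
    # E-insertion: bus^s -> buses, watch^s -> watches
    form = re.sub(r"(s|z|x|ch|sh)\^s\b", r"\1es", form)
    # y -> ie before the plural -s: fly^s -> flies
    form = re.sub(r"y\^s\b", "ies", form)
    return form.replace("^", "")  # elsewhere the boundary just disappears

assert apply_spelling("watch^s") == "watches"
assert apply_spelling("fly^s") == "flies"
assert apply_spelling("dog^s") == "dogs"
```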

36
Incorporating Spelling Rules
  • Spelling rules, each corresponding to an FST, can
    be run in parallel provided that they are
    "aligned".
  • The set of spelling rules is positioned between
    the surface level and the intermediate level.
  • Parallel execution of FSTs can be carried out
  • by simulation: in this case the FSTs must first be
    aligned, or
  • by first constructing a single FST
    corresponding to their intersection.

37
Rewrite Rules
  • Chomsky and Halle (1968)
  • General form:
  • a → b / α __ β
  • E-insertion:
  • ε → e / {x, s, z, ch, sh} __ s
  • Kaplan and Kay (1994) showed that FSTs can be
    compiled from general rewrite rules

38
Two-level Morphology (Koskenniemi, 1983)
lexical:       b u s +N +Pl
                 ↓ LEXICON FST
intermediate:  b u s ^ s
                 ↓ FST1 … FSTn (orthographic rules)
surface:       b u s e s
39
A Single FST for MA and MS
[Diagram: the LEXICON FST and the orthographic-rule
FSTs (FST1 … FSTn) composed into a single Morphology
FST that maps lexical "b u s +N +Pl" directly to
surface "b u s e s" and back]
40
Can we do without the lexicon?
  • Not really!
  • But for some applications we might need to know
    the stem only
  • Surface form → Stem: stemming
  • The Porter stemming algorithm (1980) is a very
    popular technique that does not use a lexicon
    (a rough sketch follows below)
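  A tiny Porter-flavoured sketch: the rules below follow
  step 1a of Porter (1980), but the full algorithm has
  several phases and measure-based conditions on the stem
  that are omitted here:

```python
# Suffix-stripping sketch in the spirit of Porter (1980), step 1a only.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def stem(word):
    for suffix, repl in RULES:          # first matching rule wins
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

print(stem("caresses"), stem("ponies"), stem("cats"))  # caress poni cat
```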

41
Derivational Rules
42
Lexicon & Morphotactics
  • Typically the list of word parts (lexicon) and the
    model of ordering can be combined together into
    an FSA which will recognise all the valid word
    forms.
  • For this to be possible the word parts must first
    be classified into sublexicons.
  • The FSA defines the morphotactics (ordering
    constraints).

43
Sublexicons to classify the list of word parts
reg-noun   irreg-pl-noun   irreg-sg-noun   plural
cat        mice            mouse           -s
fox        sheep           sheep
           geese           goose
44
Towards the Analyser
  • We can use lexc or xfst to build such an FSA (see
    lex1.lexc)
  • To augment this to produce an analysis we must
    create a transducer Tnum which maps between the
    lexical level and an "intermediate" level that is
    needed to handle the spelling rules of English.

45
Ambiguity
  • Recall that in non-deterministic recognition
    multiple paths through a machine may lead to an
    accept state.
  • It didn't matter which path was actually traversed
  • In FSTs the path to an accept state does matter,
    since different paths represent different parses
    and different outputs will result

46
Ambiguity
  • What's the right parse for
  • Unionizable
  • Union-ize-able
  • Un-ion-ize-able
  • Each represents a valid path through the
    derivational morphology machine.

47
Ambiguity
  • There are a number of ways to deal with this
    problem
  • Simply take the first output found
  • Find all the possible outputs (all paths) and
    return them all (without choosing)
  • Bias the search so that only one or a few likely
    paths are explored

48
Generativity
  • Nothing really privileged about the directions.
  • We can write from one and read from the other or
    vice-versa.
  • One way is generation, the other way is analysis

49
Multi-Level Tape Machines
  • We use one machine to transduce between the
    lexical and the intermediate level, and another
    to handle the spelling changes to the surface
    tape

50
Note
  • A key feature of this machine is that it doesn't
    do anything to inputs to which it doesn't apply,
    meaning that they are written out unchanged to
    the output tape.
  • It turns out the multiple tapes aren't really
    needed; they can be compiled away.

51
Overall Scheme
  • We now have one FST that has explicit information
    about the lexicon (actual words, their spelling,
    facts about word classes and regularity).
  • Lexical level to intermediate forms
  • We have a larger set of machines that capture
    orthographic/spelling rules.
  • Intermediate forms to surface forms

52
Other Issues
  • How to formulate the rewrite rules?
  • How to ensure coverage?
  • What to do for unknown roots?
  • Is it possible to learn morphology of a language
    in supervised/unsupervised manner?
  • What about non-linear morphology?

53
References
  • Chapter 3, pp. 57-89,
  • Speech and Language Processing by D. Jurafsky &
    J. H. Martin, Pearson Education Asia, 2002 (2000)
  • Slides based on the chapter
  • Chapter 2, p. 70,
  • Natural Language Understanding by J. Allen,
    Pearson Education, 2003 (1995)
  • Slides by Monojit Choudhury

54
Spelling errors
55
Non-word error detection
  • Any word not in a dictionary
  • Assume it's a spelling error
  • Need a big dictionary!
  • What to use?
  • FST dictionary!!

56
Isolated word error correction
  • How do I fix "graffe"?
  • Search through all words
  • graf
  • craft
  • grail
  • giraffe
  • Pick the one that's closest to "graffe"
  • What does "closest" mean?
  • We need a distance metric.
  • The simplest one: edit distance.
  • (More sophisticated probabilistic ones: noisy
    channel)

57
Edit Distance
  • The minimum edit distance between two strings
  • Is the minimum number of editing operations
  • Insertion
  • Deletion
  • Substitution
  • Needed to transform one into the other

58
Minimum Edit Distance
  • If each operation has a cost of 1:
  • distance between intention and execution is 5
  • If substitutions cost 2 (Levenshtein):
  • distance between them is 8
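  The standard dynamic-programming computation, sketched
  in Python; sub_cost=1 treats all operations equally,
  while sub_cost=2 gives the Levenshtein variant above:

```python
# Standard dynamic-programming minimum edit distance.
def min_edit_distance(s, t, sub_cost=1):
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                          # deletions
    for j in range(n + 1):
        d[0][j] = j                          # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1,       # delete
                          d[i][j - 1] + 1,       # insert
                          d[i - 1][j - 1] + sub) # substitute / match
    return d[m][n]

print(min_edit_distance("intention", "execution"))     # 5
print(min_edit_distance("intention", "execution", 2))  # 8
```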

59
Part of Speech Tagging
  • Task
  • assign the right part-of-speech tag, e.g. noun,
    verb, conjunction, to a word in context
  • POS taggers
  • need to be fast in order to process large corpora
  • should take no more than time linear in the size
    of the corpora
  • full parsing is slow
  • e.g. context-free grammar parsing takes O(n³),
    n = length of the sentence
  • POS taggers try to assign the correct tag without
    actually parsing the sentence

60
Part-of-Speech (POS)
  • Categories to which words are assigned according
    to their function.
  • Noun, verb, adjective, preposition, adverb,
    article, pronoun, conjunction, etc.

61
POS Tagging
  • The process of assigning a part-of-speech to each
    word in a sentence, e.g. with candidate tags:

    Keep   the   book   on            the   top      shelf   .
    N, V   DET   N, V   ADV, ADJ, P   DET   ADJ, N   N       .
62
Techniques for POS tagging
  • Linguistic approaches
  • Statistical approaches
  • Hidden Markov Model
  • Maximum Entropy
  • CRF
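  As a toy illustration of the HMM approach above, a
  minimal Viterbi decoder; all probability tables below
  are made-up illustrative numbers, not trained values:

```python
# Toy HMM POS tagging with Viterbi decoding (illustrative sketch).
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a simple HMM."""
    v = {t: start_p.get(t, 0.0) * emit_p.get((t, words[0]), 0.0) for t in tags}
    backptrs = []
    for w in words[1:]:
        nv, bp = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: v[p] * trans_p.get((p, t), 0.0))
            nv[t] = v[prev] * trans_p.get((prev, t), 0.0) * emit_p.get((t, w), 0.0)
            bp[t] = prev
        v = nv
        backptrs.append(bp)
    seq = [max(tags, key=lambda t: v[t])]        # best final tag
    for bp in reversed(backptrs):                # follow backpointers
        seq.append(bp[seq[-1]])
    return list(reversed(seq))

tags = ["DET", "N", "V"]
start_p = {"DET": 0.6, "N": 0.3, "V": 0.1}
trans_p = {("DET", "N"): 0.9, ("N", "V"): 0.5}
emit_p = {("DET", "the"): 0.7, ("N", "dog"): 0.4, ("V", "barked"): 0.3}
print(viterbi(["the", "dog", "barked"], tags, start_p, trans_p, emit_p))
# -> ['DET', 'N', 'V']
```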

63
Named Entity Recognition
  • Named Entity Recognition (NER) Locate and
    Classify the Names in Text
  • Example
  • Jawaharlal [Per-beg] Nehru [Per-end] was the first
    prime [Title-beg] minister [Title-end] of
    India [Loc].
  • Importance
  • Information Extraction, Question-Answering
  • Can help Summarization, ASR and MT
  • Intelligent document access
  • etc

64
Syntax
  • Order and group words together in a sentence
  • The dog barked at the visitor
  • vs.
  • Barked dog the at visitor the

65
Semantics
  • Understand word meanings and combine meanings in
    larger units
  • Lexical semantics
  • Compositional semantics

66
Discourse & Pragmatics
  • Interpret utterances in context
  • Resolve references
  • I'm afraid I can't do that
  • that ?
  • Speech act interpretation
  • Open the pod bay doors
  • Command

67
Phonology
  • The study of the sound patterns of languages.

68
Computational phonology
  • Automatic Speech Recognition (ASR)
  • take an acoustic waveform as input and produce
    as output a string of words.
  • Text-To-Speech (TTS)
  • take a sequence of text words and produce as
    output an acoustic waveform.
  • ⇒ Both need to know how words are pronounced in
    terms of individual speech units called phones.

69
Speech sounds and phonetic transcription
  • A phone: a speech sound, represented with IPA or
    ARPAbet symbols.
  • IPA: an evolving standard with the goal of
    transcribing the sounds of all human languages.
  • ARPAbet: a phonetic alphabet designed for
    American English using only ASCII symbols.

70
Why phonology?
  • Text-to-speech (TTS) applications include a
    component which converts spelled words to
    sequences of phonemes (sound representations):
    G2P, grapheme-to-phoneme conversion
  • E.g., sight → S AY1 T
  • John → JH AA1 N
  • Phoneme-to-grapheme: for speech recognition
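  In the simplest case G2P is a dictionary lookup; a toy
  sketch (the entries are just the slide's examples; real
  systems back off to learned letter-to-sound rules):

```python
# Toy G2P lookup (entries from the slide, in ARPAbet; illustrative only).
LEXICON = {"sight": ["S", "AY1", "T"], "John": ["JH", "AA1", "N"]}

def g2p(word):
    # Real systems back off to learned letter-to-sound rules for unknowns.
    return LEXICON.get(word)

print(g2p("sight"))  # ['S', 'AY1', 'T']
```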

71
Varieties of sounds in people's speech
  • Most phonemes have several different
    pronunciations (called their allophones),
    determined by nearby sounds, most usually by the
    following sound.
  • A striking instance of such variation is in the
    realization of the phoneme /T/ in American
    English.

72
Grapheme-phoneme relationships
  • LTS: letter-to-sound, or G2P, relationships.
  • In some languages this is simple, e.g., Sanskrit
  • But in English and in French, it's very messy.
  • Why? Because the spelling system is based on how
    the language used to be pronounced, and the
    pronunciation has since changed.
  • Schwa deletion in Hindi

73
References
  • Chapter 3, pp. 57-89,
  • Speech and Language Processing by D. Jurafsky &
    J. H. Martin, Pearson Education Asia, 2002 (2000)
  • Slides based on the chapter
  • Chapter 2, p. 70,
  • Natural Language Understanding by J. Allen,
    Pearson Education, 2003 (1995)
  • Slides by Monojit Choudhury