Title: Dictionaries and Grammar
1Dictionaries and Grammar
Questions to Address
- Do we include all forms of a particular word, or do we include only the base word and derive its forms?
- How are the grammatical rules of a language represented?
- How do we represent the parts of speech that go with particular grammatical rules?
2Language Modeling Integration of Natural Language
3Morphology
- How phonemes combine to make words
- Important for speech recognition and synthesis
- Example: singular to plural
- Run to runs: z sound (voiced)
- Hit to hits: s sound (unvoiced)
- One approach: Devise language-specific sets of pronunciation rules
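One such language-specific rule, the pronunciation of the English plural, might be sketched as follows. This is only a minimal sketch: the phoneme labels and the two sets are assumptions chosen for illustration, not a complete phoneme inventory.

```python
# Illustrative phoneme classes (ARPAbet-style labels assumed for this sketch).
SIBILANTS = {"S", "Z", "SH", "ZH", "CH", "JH"}
VOICELESS = {"P", "T", "K", "F", "TH"}

def plural_sound(final_phoneme: str) -> str:
    """Return how the plural -s is pronounced after the given final phoneme."""
    if final_phoneme in SIBILANTS:
        return "IH Z"   # bus -> buses (extra vowel inserted)
    if final_phoneme in VOICELESS:
        return "S"      # hit -> hits (unvoiced)
    return "Z"          # run -> runs (voiced)
```

The sibilant check comes first because S, SH and CH are also unvoiced and would otherwise fall into the second branch.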
4Morphology
Definitions
- Morphology
- The study of the patterns used to form words
- E.g. inflection, derivation, and compounds
- Morpheme: Minimal meaning-bearing unit
- Could be a stem or an affix
- Stem: unthinkable, realization, distrust
- The part of a word that contains the root meaning (E.g. cat)
- Affixes: -s, un-, de-, -en, -able, -ize, -hood
- A linguistic element added to a word to modify the meaning
- E.g. prefix (unbuckle), suffix (buckled), infix (absobloodylutely), and circumfix (gesagt in German for said).
- Affixes can attach to other affixes (boyishness)
5Knowing Words
- When we know a word, we know its
- Phonological sound sequences
- Semantic meanings
- Morphological relationships
- Syntactic categories and proper structure of a sentence
- Morphological relationships adjust word meanings
- Person: Jill waits.
- Number: Jill carried two buckets.
- Case: The chair's leg is broken.
- Tense: Jill is waiting there now.
- Degree: Jill ran faster than Jack.
- Gender: Jill is female.
- Part of Speech: Jill is a proper noun.
These are the kinds of things we want our computers to figure out.
6Units of Meaning
- How many morphemes does each of the following sentences have?
- I have two cats
- She wants to leave soon
- He walked across the room
- Her behavior was unbelievable
- Free Morphemes: eye, think, run, apple
- Bound Morphemes: -able, un-, -s, -tion, -ly
7Affix Examples
- Prefixes from Karuk, a Hokan language of California
- pasip: Shoot!
- nipasip: I shoot
- /upasip: She/he shoots
- Suffixes from Mende, spoken in Liberia and Sierra Leone
- pElE: house
- pElEi: the house
- mEmE: glass
- mEmEi: the glass
- Infixes from Bontoc, spoken in the Philippines
- fikas: strong
- fumikas: she is becoming strong
- fusul: enemy
- fumusul: she is becoming an enemy
8Turkish Morphology
- Uygarlastiramadiklarimizdanmissinizcasina
- Meaning: behaving as if you are among those whom we could not civilize
- Uygar (civilized) + las (become) + tir (cause)
- ama (not able) + dik (past)
- lar (plural) + imiz (p1pl) + dan (abl)
- mis (past) + siniz (2pl) + casina (as if)
9How does the Mind Store Meanings?
- Hypotheses
- Full listing: We store all words individually
- Minimum redundancy: We store morphemes and how they relate
- Analysis
- Determine if people understand new words based on root meanings
- Observe whether children have difficulty learning exceptions
- Regular form: government/govern; irregular form: department/depart
- Evidence suggests
- The mind represents words and affix meanings separately
- Linguists observe that affixes were originally separate words that speakers slurred together over time
10General Observations about Lexicons
- Meanings are continually changing
- Roots and morphemes do not have to occur in a fixed position in relation to other elements.
- How many words do people know?
- Shakespeare uses 15,000 words
- A typical high school student knows 60,000 (learning 10 words a day from 12 months to 18 years)
- How many English words are there?
- Over 300,000 words, not counting morphological variants (1988)
11Computational Morphology
Speech recognition requires a language dictionary. How many words would it contain?
- Consider all of the forms built on the word true
- true, truer, truest, truly, untrue, truth, truthful, truthfully, untruthfully, untruthfulness
- Untruthfulness: un- + true + -th + -ful + -ness
- Productive morphemes
- An affix that at some point in time spread rapidly through the language
- Consider goose and geese versus cat and cats
- The former was an older way to indicate plurals
- The latter is a more recent way that spread throughout the language
- If we store morpheme rules, not all words, we can
- Reduce storage requirements and simplify creating entire dictionaries
- More closely mimic how the mind does it
- Be able to automatically understand newly encountered word forms
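The storage argument can be illustrated with a small sketch. The lexicon, the category names, and the suffix lists below are assumptions made up for the example; spelling changes (true to truly) are deliberately ignored.

```python
# Hypothetical mini-lexicon: each stem is stored once with its category.
LEXICON = {"cat": "noun", "dog": "noun", "walk": "verb", "email": "verb"}

# Hypothetical productive suffix rules per category ("" keeps the bare stem).
SUFFIXES = {"noun": ["", "s"], "verb": ["", "s", "ed", "ing"]}

def all_forms(lexicon):
    """Expand every stem with the suffixes allowed for its category."""
    return {
        stem + suffix
        for stem, category in lexicon.items()
        for suffix in SUFFIXES[category]
    }
```

Four stored stems plus two short rule lists yield twelve word forms here, and adding one new stem such as email makes all of its regular forms available without listing them.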
12Morphology Rules
- There are rules used to form complex words from their roots
- re- only precedes verbs (rerun, release, return)
- -s indicates plurals
- -ed indicates past tense
- Affix Rules
- Regular: follow productive affix rules
- Irregular: don't follow productive affix rules
- Nouns
- Regular: (cat, thrush), (cats, thrushes), (cat's, thrush's)
- Irregular: (mouse, ox), (mice, oxen)
Observation: More frequent words resist changes that result from productive affixes and take irregular forms (E.g. am, is, are).
Exceptions: A singer sings, and a writer writes. Why doesn't a whisker whisk, a spider spid, or a finger fing?
13Parsing
Identify components and underlying structure
- Morphological parsing
- Identifies stem and affixes and how they relate
- Example
- fish → fish Noun Singular or goose Verb
- fish → fish Noun Plural
- fish → fish Verb Singular
- Bracketing: indecipherable → in + de + cipher + able
- Why do we parse?
- Spell-checking: Is muncheble a real word?
- Identify a word's part-of-speech (pos)
- Sentence parsing and machine translation
- Identify word stems for data mining search operations
- Speech recognition and text to speech
14Parsing Applications
- Lexicon
- Create a word list
- Include both stems and affixes (with the part of speech)
- Morphotactics
- Models how morphemes can be affixed to a stem.
- E.g., plural morpheme follows noun in English
- Orthographic rules
- Define spelling modifications during affixation
- E.g. true → tru in the context of true → truthfully
15Grammatical Morphemes
- New forms are rarely added to closed morpheme classes
- Examples
- prepositions: at, for, by
- articles: a, the
- conjunctions: and, but, or
16Morphological Parsing (stemming)
- Goal: Break the surface input into morphemes
- foxes
- Fox is a noun stem
- It has -es as a plural suffix
- rewrites
- Write is the verb stem
- It has re- as a prefix meaning to do again
- It has -s as a suffix marking the third person singular present
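The two analyses above can be reproduced with a simple affix-stripping sketch. The tiny stem, prefix, and suffix tables are assumptions for illustration, and the function name parse is hypothetical; no spelling rules are applied, so the sketch over-accepts forms such as foxs.

```python
# Hypothetical mini-lexicon for the two examples above.
STEMS = {"fox": "noun", "write": "verb"}
PREFIXES = {"re": "to do again"}
SUFFIXES = {
    "noun": {"s": "plural", "es": "plural"},
    "verb": {"s": "3rd person singular present"},
}

def parse(word):
    """Return every (prefix, stem, suffix) split licensed by the tables."""
    analyses = []
    for prefix in [""] + list(PREFIXES):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for stem, category in STEMS.items():
            if not rest.startswith(stem):
                continue
            suffix = rest[len(stem):]
            # Accept a bare stem or a suffix allowed for the stem's category.
            if suffix == "" or suffix in SUFFIXES[category]:
                analyses.append((prefix, stem, suffix))
    return analyses
```

For example, parse("foxes") yields the stem fox with suffix es, and parse("rewrites") yields prefix re, stem write, suffix s.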
17Inflectional Morphology
Does not change the grammatical category
- Nouns
- plural marker -s (dog + -s → dogs)
- possessive marker -'s (dog + -'s → dog's)
- Verbs
- 3rd person present singular -s (walk + -s → walks)
- past tense -ed (walk + -ed → walked)
- progressive -ing (walk + -ing → walking)
- past participle -en or -ed (eat + -en → eaten)
- Adjectives
- comparative -er (fast + -er → faster)
- superlative -est (fast + -est → fastest)
- In English
- Meaning transformations are predictable
- All inflectional affixes are suffixes
- Inflectional affixes are attached after any derivational (next slide) affixes
- E.g. modern + -ize + -s → modernizes, not modern + -s + -ize
18Concatenative and Non-concatenative
- Concatenative morphology combines by concatenation
- prefixes and suffixes
- Non-concatenative morphology combines in complex ways
- circumfixes and infixes
- templatic morphology
- words change by internal changes to the root
- E.g. (Arabic, Hebrew) ktb (write), kuttib (will have been written)
Templatic Example
19Verbal Inflective Morphology
- Verbal inflection
- Main verbs (sleep, like, fear) are relatively regular
- Standard morphemes: -s, -ing, -ed
- These morphemes are productive: emails, emailing, emailed
- Combination with nouns for syntactic agreement
- I am, we are, they were
- There are exceptions
- Eat (will eat, eats, eating, ate)
- Catch (will catch, catches, catching, caught)
- Be (will be, is, being, was)
- Have (will have, has, having, had)
- General Observations about English
- There are approximately 250 irregular verbs
- Other languages have more complex verbal inflection rules
20Nominal Inflective Morphology
- Plural forms (-s or -es)
- Possessives (cat's or cats')
- Regular Nouns
- Singular (cat, bush)
- Plural (cats, bushes)
- Possessive (cat's, bush's)
- Irregular Nouns
- Singular (mouse, ox)
- Plural (mice, oxen)
21Derivational Morphology
- Word stem combines with grammatical morpheme
- Usually produces a word of a different class
- Complex rules that are less productive, with many exceptions
- Sometimes meanings of derived terms are hard to predict (E.g. hapless)
- Examples: verbs to nouns
- generalize, realize → generalization, realization
- murder, spell → murderer, speller
- Examples: verbs and nouns to adjectives
- embrace, pity → embraceable, pitiable
- care, wit → careless, witless
- Example: adjectives → adverbs
- happy → happily
- More complicated to model than inflection
- Less productive: science-less, concern-less, go-able, sleep-able
22Derivational Morphology Examples
- Level 2
- Examples: hood, ness, ly, s, ing, ish, ful, less, y (adj.)
- Observations
- Never precede Level 1 suffixes
- Never change stress or vowel quality
- Almost always attach to words that exist
- Level 1
- Examples: ize, ization, ity, ic, al, ion, y, ate, ous, ive, ation
- Observations
- Can attach to non-words (e.g. fratern-al, patern-al)
- Often change the stem's stress and vowel quality
Level 1 + Level 1: histor-ic-al, illumin-at-ion, indetermin-ac-y
Level 1 + Level 2: fratern-al-ly, transform-at-ion-less
Level 2 + Level 2: weight-less-ness
Big one: antidisestablishmentarianism (if I spelled it right)
23Adjective Morphology
- Standard Forms
- Big, bigger, biggest
- Cool, cooler, coolest, coolly
- Red, redder, reddest
- Clear, clearer, clearest, clearly, unclear, unclearly
- Happy, happier, happiest, happily
- Unhappy, unhappier, unhappiest, unhappily
- Real, unreal, really
- Exceptions: unbig, redly, realest
24Identify and Classify Morphemes
- In each group
- Two words have the same morphological structure
- One word has a different type of suffix
- One word has no suffix at all
- Perform the following tasks
- 1. Isolate the suffix that two of the words share.
- 2. Identify whether it is (i) free or bound (ii) prefix, infix, suffix (iii) inflectional or derivational.
- 3. Give its function/meaning.
- 4. Identify the word that has no suffix.
- 5. Identify the word that has a suffix which is different from the others in each group.
a.        b.          c.          d.
rider     tresses     running     tables
colder    melodies    foundling   lens
silver    Bess's      handling    witches
actor     guess       fling       calculates
25Computational Techniques
- Regular Grammars
- Finite State Automata
- Finite State Transducer
- Parsing: Top down and bottom up
26Regular Grammars
- Grammar: Rules that define legal character strings
- A regular grammar generates exactly the strings that some regular expression describes
- Regular grammars are built up as follows
- The grammar with no strings is regular
- The grammar that accepts only the empty string is regular
- A single character is a regular grammar
- If r1 and r2 are regular grammars, then r1 union r2, and r1 concatenated with r2 are regular grammars
- If r is a regular grammar, then r* (where * means zero or more occurrences) is regular
27Notations to Express Regular Expressions
- Conjunction: abc
- Disjunction: [a-zA-Z], gupp(y|ies)
- Counters: a*, a+, a?, a{5}, a{5,8}, a{5,}
- Any character: a.b
- Not: [^0-9]
- Anchors: /^The dog\.$/
- Note: the backslash before the period is an escape character
- Other escape characters include \*, \?, \n, \t, \\, etc.
- Operators
- \d equivalent to [0-9], \D equivalent to [^0-9]
- \w equivalent to [a-zA-Z0-9_], \W equivalent to [^\w]
- \s equivalent to [ \r\t\n\f], \S equivalent to [^\s]
- Substitute one regular expression for another: s/regExp1/regExp2/
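Several of these notations can be tried out with Python's standard re module. This is only an illustrative sketch: the sample strings are made up, and re.sub stands in for the Perl-style s/regExp1/regExp2/ substitution.

```python
import re

guppy = re.compile(r"gupp(y|ies)")      # disjunction inside a group
counter = re.compile(r"a{5,8}")         # counter: five to eight a's
anchored = re.compile(r"^The dog\.$")   # anchors and an escaped period

# \d is shorthand for [0-9]; findall returns every run of digits.
digits = re.findall(r"\d+", "room 42, floor 7")

# Substitution: replace every match of the first pattern with the second text.
swapped = re.sub(r"colou?r", "hue", "color and colour")
```

Here fullmatch requires the whole string to match, while search (used with the anchored pattern) looks for a match anywhere unless anchors restrict it.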
28Examples of Regular Expressions
- All strings ending with two zeroes
- All strings containing three consecutive zeroes
- All strings in which every block of five consecutive symbols has at least two zeroes
- All strings in which the tenth symbol from the right is a one
- The set of all modular five numbers
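Three of these exercises can be sketched in Python, assuming the alphabet is the binary set {0, 1} (the slide does not state the alphabet, so this is an assumption). The constant names and the helper in_language are hypothetical.

```python
import re

# Assumed alphabet: the binary symbols 0 and 1.
ENDS_WITH_00 = re.compile(r"[01]*00")
HAS_000 = re.compile(r"[01]*000[01]*")
TENTH_FROM_RIGHT_IS_1 = re.compile(r"[01]*1[01]{9}")

def in_language(pattern, string):
    """True when the whole string is matched by the pattern."""
    return pattern.fullmatch(string) is not None
```

The remaining exercises are harder to write as a single short expression and are left as posed.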
29Finite State Automata (FSA)
FSAs recognize grammars that are regular
- Definition: An FSA (Σ, q0, F, Q, δ) consists of
- a set of states (Q)
- a starting state (q0)
- a set of final or accepting states (F ⊆ Q)
- a finite set of symbols (Σ)
- a transition function δ(q,i) that maps Q × Σ to Q. It switches from a from-state to a to-state, based on one of the valid symbols
Synonyms: Finite Automata, Finite State Machine
30Finite-state Automata
- Equivalent to
- Finite-state automata (FSA)
- Regular languages
- Regular expressions
31Recognition
Determine if the machine accepts a particular string, i.e. is a string in the language?
- Traditionally, following Turing, the process is depicted with a tape reader
- Algorithm
- Begin in the start state
- Examine the current input character
- Consult the table
- Go to a new state and update the tape pointer
- Repeat until you run out of tape
- The machine accepts the string if processing stops in a final state
32Graphs and State Transition Tables
- What can we say about this machine?
- It has 5 states
- At least b, a, and ! are in its alphabet
- q0 is the start state
- q4 is an accept state
- It has 5 transitions
- Questions
- Which strings does it accept? baaaa, aaabaaa, ba
- Is this the only FSA that can accept this language?
State Transition Table
Annotated Directed Graph
An FSA can accept only regular languages.
Question: Can you think of a language that is not regular?
33Finite-state Automata (Machines)
34Input Tape
REJECT
35Input Tape
ACCEPT
36State-transition Tables
        Input
State   b   a   !
0       1   ∅   ∅
1       ∅   2   ∅
2       ∅   3   ∅
3       ∅   3   4
4       ∅   ∅   ∅
(∅ marks a missing transition; state 4 is the accept state.)
37Recognizer Implementation
- index = beginning of tape
- state = start state
- DO
- IF transition[state, tape[index]] is empty
- RETURN false
- state = transition[state, tape[index]]
- index = index + 1
- UNTIL end of tape is reached
- IF state is a final state
- RETURN true
- ELSE RETURN false
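The recognizer can be sketched in Python using the state-transition table for the b, a, ! machine. The sketch assumes that the table entries which lead nowhere are missing transitions, so only five arcs are stored; the names TRANSITIONS, FINAL_STATES and recognize are chosen for illustration.

```python
# Transition table stored as a dictionary keyed by (state, input symbol).
# Any pair that is absent is treated as an empty table cell.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
FINAL_STATES = {4}

def recognize(tape, start=0):
    """Return True when the machine ends in a final state after reading tape."""
    state = start
    for symbol in tape:
        if (state, symbol) not in TRANSITIONS:
            return False          # empty cell: reject immediately
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL_STATES
```

Iterating over the tape with a for loop replaces the explicit index, and it also handles an empty tape, which the DO ... UNTIL form does not.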
38Other FSA Examples
Dollars and Cents
Exercise: Create an FSA for each of the following regular expressions: (01)*, [a-f][1-9], abc{5}
39FSAs and Morphology
- Apply an FSA to each word in the dictionary to capture the morphological forms.
- Groups of words with common morphology can share FSAs
40Building a Lexicon with a FSA
41Derivational Rules
42Simple Morphology Example
From  To  Output
0     1   un
0     1   NULL
1     2   adj-root-list
2     3   -er, -est, -ly
Stop states: q2 and q3
43An Extended Example
From  To  Output
0     1   un
0     3   NULL
1     2   adj-root-list-1
2     5   -er, -est, -ly
3     2   adj-root-list-1
3     4   adj-root-list-2
4     5   -er, -est
Adj-root1: clear, happy, real
Adj-root2: big, red
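The extended table can be turned into a small recognizer. This is a sketch under stated assumptions: the stop states are not listed for this machine, so states 2, 4 and 5 are assumed to be final, and plain concatenation is used with no spelling rules, so forms such as bigger or happier are not handled.

```python
ROOTS1 = {"clear", "happy", "real"}
ROOTS2 = {"big", "red"}

# Arcs follow the table above; "" plays the role of the NULL output.
ARCS = {
    0: [({"un"}, 1), ({""}, 3)],
    1: [(ROOTS1, 2)],
    2: [({"er", "est", "ly"}, 5)],
    3: [(ROOTS1, 2), (ROOTS2, 4)],
    4: [({"er", "est"}, 5)],
}
FINAL = {2, 4, 5}   # assumed stop states

def accepts(word, state=0):
    """Try every arc whose label is a prefix of the remaining word."""
    if word == "" and state in FINAL:
        return True
    for labels, target in ARCS.get(state, []):
        for label in labels:
            if word.startswith(label) and accepts(word[len(label):], target):
                return True
    return False
```

Because un- only leads to the first root list, the machine accepts unclearly but rejects unbig, and because the second root list has no -ly arc it rejects bigly.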
44Representing Derivational Rules
45Key Points Regarding FSAs
- This algorithm is a state-space search algorithm
- Implementation uses simple table lookups
- Success occurs when, at the end of a string, we reach a final state
- The results are always deterministic
- There is one unique choice at each step
- The algorithm recognizes all regular languages
- Perl, Java, etc. use a regular expression algorithm
- Create a state transition table from the expression
- Pass the table to the FSA interpreter
- FSA algorithms
- Recognizer: determines if a string is in the language
- Generator: generates all strings in the language
46Non-Deterministic FSA
- Deterministic: Given a state and symbol, only one transition is possible
- Nondeterministic (δ: Q × Σ → P(Q), where P(Q) is the set of all subsets of Q)
- Given a state and a symbol, multiple transitions are possible
- Epsilon transitions: those which DO NOT examine or advance the tape
- The Nondeterministic FSA recognizes a string if
- At least one transition sequence ends at a final state
- Note: all sequences DO NOT have to end at a final state
- Note: String rejection occurs only when NO sequence ends at a final state
Examples
e
47FSA vs Non Deterministic FSA
a b b a a b a b
0 1 1 0 1 1 0 1
48Concatenation
49Closure Closure
50Union
51Using NFSAs
        Input
State   b   a     !   ε
0       1   ∅     ∅   ∅
1       ∅   2     ∅   ∅
2       ∅   2,3   ∅   ∅
3       ∅   ∅     4   ∅
4       ∅   ∅     ∅   ∅
(∅ marks a missing transition.)
52NFSA Recognition of baaa!
53Breadth-first Recognition of baaa!
54Nondeterministic FSA Example
55Non Deterministic FSA Recognizer
- Recognizer (index, state)
- LOOP
- IF end of tape THEN
- IF state is final RETURN true ELSE RETURN false
- IF no possible transitions RETURN false
- IF there is only one transition
- state = transition[state, tape[index]]
- IF not an epsilon transition THEN index = index + 1
- ELSE
- FOR each possible transition not considered
- result = CALL recognizer(nextIndex, nextState)
- IF result is true RETURN true
- RETURN false
- END LOOP
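A recursive Python sketch of the nondeterministic recognizer follows, using the NFSA table for the b, a, ! language. It assumes that table cells which lead nowhere are missing transitions, and it would loop forever on a cycle of epsilon transitions (the table above has none). The names NFSA, EPSILON and nd_recognize are chosen for illustration.

```python
# Each (state, symbol) pair maps to a set of possible next states.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # the nondeterministic choice
    (3, "!"): {4},
}
FINALS = {4}
EPSILON = ""            # label for transitions that do not advance the tape

def nd_recognize(tape, index=0, state=0):
    """Depth-first search over all transition sequences."""
    if index == len(tape) and state in FINALS:
        return True
    # Epsilon moves keep the tape position unchanged.
    for nxt in NFSA.get((state, EPSILON), ()):
        if nd_recognize(tape, index, nxt):
            return True
    # Ordinary moves consume one symbol.
    if index < len(tape):
        for nxt in NFSA.get((state, tape[index]), ()):
            if nd_recognize(tape, index + 1, nxt):
                return True
    return False        # no sequence from here reaches a final state
```

Recursion provides the backtracking: when one choice at state 2 fails, the loop simply tries the other.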
56Finite State Transducer (FST)
- Definition: An FST is a 7-tuple (Q, Σ, Γ, I, F, δ, σ)
- Q is a finite set of states
- Σ is a finite set of symbols (the input alphabet)
- Γ is a finite set of symbols (the output alphabet)
- I is a subset of Q (the initial states)
- F is a subset of Q (the final states)
- δ is a function, δ(q,i), that maps Q × Σ to Q (the next state)
- σ is a function, σ(q,i), that maps Q × Σ to Γ ∪ {ε} (the symbol written)
- Together δ and σ define the transition relation (where ε is the empty string)
- Concept: Translates and writes to a second tape
a:o
57Transition Example
- c:c means read a c on one tape and write a c on the other
- N:ε means read an N symbol on one tape and write nothing on the other
- PL:s means read PL and write an s
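These three kinds of arc are enough for a toy transducer that maps a lexical tape such as c a t N PL to the surface string cats. This is a hedged sketch: the state numbering, the SG symbol, and the function name transduce are assumptions for the example, and no spelling rules (fox to foxes) are modelled.

```python
EPS = ""   # nothing is written on the output tape

# (state, input symbol) -> (next state, output symbol)
TRANSITIONS = {}
for letter in "abcdefghijklmnopqrstuvwxyz":
    TRANSITIONS[(0, letter)] = (0, letter)   # c:c style arcs copy the letter
TRANSITIONS[(0, "N")] = (1, EPS)             # N:ε  read N, write nothing
TRANSITIONS[(1, "SG")] = (2, EPS)            # SG:ε (assumed singular marker)
TRANSITIONS[(1, "PL")] = (2, "s")            # PL:s read PL, write s
FINAL = {2}

def transduce(symbols):
    """Return the output string, or None if the input tape is rejected."""
    state, output = 0, []
    for symbol in symbols:
        if (state, symbol) not in TRANSITIONS:
            return None
        state, written = TRANSITIONS[(state, symbol)]
        output.append(written)
    return "".join(output) if state in FINAL else None
```

For example, transduce(list("cat") + ["N", "PL"]) writes cats on the second tape.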
58Finite State Transducer
A Finite State Automaton that produces an output string
Input: Features from a sequence of frames
Processing: Find the most likely path through the sequence using hidden Markov models or Neural Networks
Output: The most likely word, phoneme, or syllable
O is a set of output states, with an output function mapping S → O
59On-line demos
- Finite state automata demos
- http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html
- Finite state morphology
- http://www.xrce.xerox.com/competencies/content-analysis/demos/english
- Some other downloadable FSA tools
- http://www.research.att.com/sw/tools/fsm/
60PFSA (Probabilistic Finite State Automata)
- A PFSA is a type of Probabilistic Context Free Grammar
- The states are the non-terminals in a production rule
- The output symbols are the observed outputs
- The arcs represent a context-free rule
- The path through the automaton represents a parse tree
- A PCFG considers state transitions and the transition path
S1
61Probabilistic Finite State Machines
- Probabilistic models determine the weights of the transitions
- The weights leaving a state sum to unity
- Operations
- Consider the weights to compute the probability of a given string or the most likely path.
- The machine can learn the weights over time
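Computing the probability of a string can be sketched by multiplying the weights along its path. The machine below reuses a b, a, ! automaton with invented weights (0.6 and 0.4 are assumptions for illustration); it is deterministic, so each string has at most one path.

```python
# (state, symbol) -> (next state, weight). Weights leaving a state sum to 1.
ARCS = {
    (0, "b"): (1, 1.0),
    (1, "a"): (2, 1.0),
    (2, "a"): (3, 1.0),
    (3, "a"): (3, 0.6),   # hypothetical weight: stay and read another a
    (3, "!"): (4, 0.4),   # hypothetical weight: finish
}
FINALS = {4}

def string_probability(tape):
    """Multiply the arc weights along the path; 0.0 if the string is rejected."""
    state, probability = 0, 1.0
    for symbol in tape:
        if (state, symbol) not in ARCS:
            return 0.0
        state, weight = ARCS[(state, symbol)]
        probability *= weight
    return probability if state in FINALS else 0.0
```

For a nondeterministic machine the same product would be taken per path, summing over paths for the string probability or taking the maximum for the most likely path.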
62Another Example
63Pronunciation decoding
n iy
64Merging the machines together
n iy
65Another Example
66Beam Search
Graph Searching Approach
- Beam search is a breadth-first approach
- At each level it keeps only the most likely subset of edges (the beam width). If the beam width is infinite, the search reduces to breadth-first search
- Beam width can be fixed or it can vary depending on search parameters.
- At each level of the graph, it generates a list of successors sorted according to a heuristic cost metric.
- Because beam search returns the first solution found, there is no guarantee that the algorithm will find the best solution.
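A minimal beam search sketch over a small weighted graph shows why the first solution found need not be the best. The graph, its edge costs, and the use of accumulated path cost as the ranking metric are all assumptions invented for this example.

```python
import heapq

# Hypothetical graph: EDGES[a][b] is the cost of the edge from a to b.
EDGES = {
    "S": {"A": 1, "B": 2},
    "A": {"G": 10},
    "B": {"G": 1},
    "G": {},
}

def path_cost(path):
    """Sum of edge costs along the path (the ranking metric used here)."""
    return sum(EDGES[a][b] for a, b in zip(path, path[1:]))

def beam_search(start, goal, beam_width):
    """Expand level by level, keeping only the beam_width cheapest paths."""
    beam = [[start]]
    while beam:
        candidates = []
        for path in beam:
            if path[-1] == goal:
                return path            # first solution found is returned
            for successor in EDGES[path[-1]]:
                candidates.append(path + [successor])
        # nsmallest returns the kept paths sorted from cheapest to dearest.
        beam = heapq.nsmallest(beam_width, candidates, key=path_cost)
    return None
```

With a beam width of 1 the search commits to the cheap first edge S to A and returns a path of cost 11; with a width of 2 it keeps both branches and returns the path through B of cost 3.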
67Dynamic Programming-Based Search
68Syllables
- Organizational phonological unit
- Vowel between two consonants
- Ambiguous positioning of consonants into syllables
- Tree structured representation
- Basic unit of prosody
- Lexical stress: inherent property of a word
- Sentential stress: speaker choice to emphasize or clarify
69Representing Stress
- There have been unsuccessful attempts to automatically assign stress to phonemes
- Notations for representing stress
- IPA (International Phonetic Alphabet) has a diacritic symbol for stress
- Numeric representation
- 0 reduced, 1 normal, 2 stressed
- Relative
- Reduced (R) or Stressed (S)
- No notation means undistinguished