Dictionaries and Grammar - PowerPoint PPT Presentation

About This Presentation
Title:

Dictionaries and Grammar


Slides: 70
Provided by: souEdu
Learn more at: http://cs.sou.edu

Transcript and Presenter's Notes

Title: Dictionaries and Grammar


1
Dictionaries and Grammar
Questions to Address
  • Do we include all forms of a particular word, or
    do we include only the base word and derive its
    forms?
  • How are the grammatical rules of a language
    represented?
  • How do we represent the parts of speech that go
    with particular grammatical rules?

2
Language Modeling Integration of Natural Language
3
Morphology
  • How phonemes combine to make words
  • Important for speech recognition and synthesis
  • Example: singular to plural
  • run → runs: /z/ sound (voiced)
  • hit → hits: /s/ sound (unvoiced)
  • One approach: devise language-specific sets of
    pronunciation rules
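
A pronunciation rule of this kind can be sketched in Python. This is a minimal sketch, assuming the stem's final sound is supplied as an ARPAbet-style symbol. The sound sets and the function name plural_suffix_sound are illustrative assumptions, and the sibilant case (bus to buses) is added for completeness even though the slide does not mention it.

```python
# Hypothetical sound classes (ARPAbet-style symbols); not from the slides.
SIBILANTS = {"s", "z", "sh", "zh", "ch", "jh"}
VOICELESS = {"p", "t", "k", "f", "th"}


def plural_suffix_sound(final_sound: str) -> str:
    """Return how the plural suffix is pronounced after a stem
    that ends in final_sound."""
    if final_sound in SIBILANTS:
        return "ih z"   # bus -> buses (extra vowel before z)
    if final_sound in VOICELESS:
        return "s"      # hit -> hits (unvoiced)
    return "z"          # run -> runs (voiced, including vowels)
```

For example, plural_suffix_sound("n") gives "z" for run, and plural_suffix_sound("t") gives "s" for hit.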

4
Morphology
Definitions
  • Morphology
  • The study of the patterns used to form words
  • E.g. inflection, derivation, and compounds
  • Morpheme - Minimal meaning-bearing unit
  • Could be a stem or an affix
  • Stem: as in unthinkable, realization, distrust
    (stems think, real, trust)
  • The part of a word that contains the root meaning
    (e.g. cat)
  • Affixes: -s, un-, de-, -en, -able, -ize, -hood
  • A linguistic element added to a word to modify its
    meaning
  • E.g. prefix (unbuckle), suffix (buckled), infix
    (absobloodylutely), and circumfix (gesagt,
    German for "said")
  • Affixes can attach to other affixes (boyishness)

5
Knowing Words
  • When we know a word, we know its
  • Phonological sound sequences
  • Semantic meanings
  • Morphological relationships
  • Syntactic categories and proper structure of a
    sentence
  • Morphological relationships adjust word meanings
  • Person: Jill waits.
  • Number: Jill carried two buckets.
  • Case: The chair's leg is broken.
  • Tense: Jill is waiting there now.
  • Degree: Jill ran faster than Jack.
  • Gender: Jill is female.
  • Part of speech: Jill is a proper noun.

These are the kinds of things we want our
computers to figure out.
6
Units of Meaning
  • How many morphemes does each of the following
    sentences have?
  • I have two cats
  • She wants to leave soon
  • He walked across the room
  • Her behavior was unbelievable
  • Free morphemes: eye, think, run, apple
  • Bound morphemes: -able, un-, -s, -tion, -ly

7
Affix Examples
  • Prefixes from Karuk, a Hokan language of
    California
  • pasip: Shoot!
  • nipasip: I shoot
  • ʔupasip: She/he shoots
  • Suffixes from Mende, spoken in Liberia and Sierra
    Leone
  • pɛlɛ: house
  • pɛlɛi: the house
  • mɛmɛ: glass
  • mɛmɛi: the glass
  • Infixes from Bontoc, spoken in the Philippines
  • fikas: strong
  • fumikas: she is becoming strong
  • fusul: enemy
  • fumusul: she is becoming an enemy

8
Turkish Morphology
  • Uygarlastiramadiklarimizdanmissinizcasina
  • Meaning: behaving as if you are among those whom
    we could not civilize
  • uygar: civilized, las: become, tir: cause
  • ama: not able, dik: past
  • lar: plural, imiz: p1pl, dan: abl
  • mis: past, siniz: 2pl, casina: as if

9
How does the Mind Store Meanings?
  • Hypotheses
  • Full listing: we store all words individually
  • Minimum redundancy: we store morphemes and how
    they relate
  • Analysis
  • Determine if people understand new words based on
    root meanings
  • Observe whether children have difficulty learning
    exceptions
  • Regular form: government/govern. Irregular form:
    department/depart
  • Evidence suggests
  • The mind represents words and affix meanings
    separately
  • Linguists observe that affixes were originally
    separate words that speakers slur together over
    time

10
General Observations about Lexicons
  • Meanings are continually changing
  • Roots and Morphemes do not have to occur in a
    fixed position in relation to other elements.
  • How many words do people know?
  • Shakespeare uses 15,000 words
  • A typical high school student knows 60,000
    (learning 10 words a day from 12 months to
    18 years)
  • How many English words are there?
  • Over 300,000 words without Morphemes in 1988

11
Computational Morphology
Speech recognition requires a language
dictionary. How many words would it contain?
  • Consider all of the words built from the word true
  • true, truer, truest, truly, untrue, truth,
    truthful, truthfully, untruthfully,
    untruthfulness
  • untruthfulness = un- + true + -th + -ful + -ness
  • Productive morphemes
  • An affix that at a point in time spread rapidly
    through the language
  • Consider goose and geese versus cat and cats
  • The former was an older way to indicate plurals
  • The latter is a more recent way that spread
    throughout
  • If we store morpheme rules, not all words, we can
  • Reduce storage requirements and simplify creating
    entire dictionaries
  • More closely mimic how the mind does it
  • Be able to automatically understand newly
    encountered word forms

12
Morphology Rules
  • There are rules used to form complex words from
    their roots
  • re- only precedes verbs (rerun, release,
    return)
  • -s indicates plurals
  • -ed indicates past tense
  • Affix Rules
  • Regular: follow productive affix rules
  • Irregular: don't follow productive affix rules
  • Nouns
  • Regular: (cat, thrush), (cats, thrushes), (cat's,
    thrush's)
  • Irregular: (mouse, ox), (mice, oxen)

Observation: More frequent words resist changes
that result from productive affixes and take
irregular forms (e.g. am, is, are). Exceptions:
A singer sings, and a writer writes. Why doesn't
a whisker whisk, a spider spid, or a finger fing?
13
Parsing
Identify components and underlying structure
  • Morphological parsing
  • Identifies stem and affixes and how they relate
  • Example
  • fish → fish Noun Singular or goose Verb
  • fish → fish Noun Plural
  • fish → fish Verb Singular
  • Bracketing: indecipherable → [in [[de [cipher]]
    able]]
  • Why do we parse?
  • Spell-checking: is muncheble a real word?
  • Identify a word's part-of-speech (pos)
  • Sentence parsing and machine translation
  • Identify word stems for data mining search
    operations
  • Speech recognition and text to speech

14
Parsing Applications
  • Lexicon
  • Create a word list
  • Include both stems and affixes (with the part of
    speech)
  • Morphotactics
  • Models how morphemes can be affixed to a stem.
  • E.g., plural morpheme follows noun in English
  • Orthographic rules
  • Defines spelling modifications during affixation
  • E.g. true → tru in the context of true → truthfully

15
Grammatical Morphemes
  • New forms are rarely added to closed morpheme
    classes
  • Examples
  • prepositions: at, for, by
  • articles: a, the
  • conjunctions: and, but, or

16
Morphological Parsing (stemming)
  • Goal: break the surface input into morphemes
  • foxes
  • Fox is a noun stem
  • It has -es as a plural suffix
  • rewrites
  • Write is the verb stem
  • It has re- as a prefix meaning to do again
  • It has an -s suffix marking the third person
    singular present
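
The breakdown above can be sketched as a small affix-stripping parser. This is a minimal sketch, not a full stemmer: the lexicon, the affix tables, and the function name parse are illustrative assumptions, and spelling changes during affixation are ignored.

```python
# Toy lexicon and affix tables; every entry is an illustrative assumption.
STEMS = {"fox": "noun", "write": "verb", "cat": "noun"}
PREFIXES = ["re"]
SUFFIXES = {"noun": ["es", "s"], "verb": ["s"]}


def parse(word):
    """Split word into (prefix, stem, category, suffix), or return None
    when no analysis is found. Missing affixes are reported as None."""
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for stem, category in STEMS.items():
            if not rest.startswith(stem):
                continue
            if prefix and category != "verb":
                continue  # re- only precedes verbs
            ending = rest[len(stem):]
            if ending == "":
                return (prefix or None, stem, category, None)
            if ending in SUFFIXES[category]:
                return (prefix or None, stem, category, ending)
    return None
```

With these tables, parse("foxes") yields (None, "fox", "noun", "es") and parse("rewrites") yields ("re", "write", "verb", "s").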

17
Inflectional Morphology
Does not change the grammatical category
  • Nouns
  • plural marker -s (dog + -s → dogs)
  • possessive marker -'s (dog + -'s → dog's)
  • Verbs
  • 3rd person present singular -s (walk + -s →
    walks)
  • past tense -ed (walk + -ed → walked)
  • progressive -ing (walk + -ing → walking)
  • past participle -en or -ed (eat + -en → eaten)
  • Adjectives
  • comparative -er (fast + -er → faster)
  • superlative -est (fast + -est → fastest)
  • In English
  • Meaning transformations are predictable
  • All inflectional affixes are suffixes
  • Inflectional affixes are attached after any
    derivational (next slide) affixes
  • E.g. modern + -ize + -s → modernizes, not modern +
    -s + -ize

18
Concatenative and Non-concatenative
  • Concatenative morphology combines morphemes by
    concatenation
  • prefixes and suffixes
  • Non-concatenative morphology combines morphemes
    in more complex ways
  • circumfixes and infixes
  • templatic morphology
  • words change by internal changes to the root
  • E.g. (Arabic, Hebrew) ktb (write), kuttib (will
    have been written)

Templatic Example
19
Verbal Inflective Morphology
  • Verbal inflection
  • Main verbs (sleep, like, fear) are relatively
    regular
  • Standard morphemes: -s, -ing, -ed
  • These morphemes are productive: emails, emailing,
    emailed
  • Combination with nouns for syntactical agreement
  • I am, we are, they were
  • There are exceptions
  • Eat (will eat, eats, eating, ate)
  • Catch (will catch, catches, catching, caught)
  • Be (will be, is, being, was)
  • Have (will have, has, having, had)
  • General Observations about English
  • There are approximately 250 Irregular verbs that
    occur
  • Other languages have more complex verbal
    inflection rules

20
Nominal Inflective Morphology
  • Plural forms (-s or -es)
  • Possessives (cat's or cats')
  • Regular Nouns
  • Singular (cat, bush)
  • Plural (cats, bushes)
  • Possessive (cat's, bush's)
  • Irregular Nouns
  • Singular (mouse, ox)
  • Plural (mice, oxen)

21
Derivational Morphology
  • Word stem combines with grammatical morpheme
  • Usually produces word of different class
  • Complex rules that are less productive with many
    exceptions
  • Sometimes meanings of derived terms are hard to
    predict (E.g. hapless)
  • Examples: verbs to nouns
  • generalize, realize → generalization, realization
  • murder, spell → murderer, speller
  • Examples: verbs and nouns to adjectives
  • embrace, pity → embraceable, pitiable
  • care, wit → careless, witless
  • Example: adjectives → adverbs
  • happy → happily
  • More complicated to model than inflection
  • Less productive: science-less, concern-less,
    go-able, sleep-able

22
Derivational Morphology Examples
  • Level 2
  • Examples: -hood, -ness, -ly, -s, -ing, -ish, -ful,
    -less, -y (adj.)
  • Observations
  • Never precede Level 1 suffixes
  • Never change stress or vowel quality
  • Almost always attach to words that exist
  • Level 1
  • Examples: -ize, -ization, -ity, -ic, -al, -ion, -y,
    -ate, -ous, -ive, -ation
  • Observations
  • Can attach to non-words (e.g. fratern-al,
    patern-al)
  • Often change the stem's stress and vowel quality

Level 1 + Level 1: histor-ic-al,
illumin-at-ion, indetermin-at-y
Level 1 + Level 2: fratern-al-ly,
transform-at-ion-less
Level 2 + Level 2: weight-less-ness
Big one: antidisestablishmentarianism
23
Adjective Morphology
  • Standard Forms
  • Big, bigger, biggest
  • Cool, cooler, coolest, coolly
  • Red, redder, reddest
  • Clear, clearer, clearest, clearly, unclear,
    unclearly
  • Happy, happier, happiest, happily
  • Unhappy, unhappier, unhappiest, unhappily
  • Real, unreal, really
  • Exceptions: unbig, redly, realest

24
Identify and Classify Morphemes
  • In each group
  • Two words have a different morphological
    structure
  • One word has a different type of suffix
  • One word has no suffix at all
  • Perform the following tasks
  • 1. Isolate the suffix that two of the words share.
  • 2. Identify whether it is (i) free or bound, (ii) a
    prefix, infix, or suffix, (iii) inflectional or
    derivational.
  • 3. Give its function/meaning.
  • 4. Identify the word that has no suffix.
  • 5. Identify the word that has a suffix which is
    different from the others in each group.

    a.        b.          c.           d.
    rider     tresses     running      tables
    colder    melodies    foundling    lens
    silver    Bess's      handling     witches
    actor     guess       fling        calculates

25
Computational Techniques
  • Regular Grammars
  • Finite State Automata
  • Finite State Transducer
  • Parsing Top down and bottom up

26
Regular Grammars
  • Grammar: rules that define legal character
    strings
  • A regular grammar generates a regular language,
    which a regular expression can describe
  • Regular languages are built up by the following
    rules
  • The language with no strings is regular
  • The language containing only the empty string is
    regular
  • A language containing a single one-character
    string is regular
  • If r1 and r2 are regular, then r1 union r2 and
    r1 concatenated with r2 are regular
  • If r is regular, then r* (where * means zero or
    more occurrences) is regular

27
Notations to Express Regular Expressions
  • Conjunction (concatenation): abc
  • Disjunction: [a-zA-Z], gupp(y|ies)
  • Counters: a*, a+, a?, a{5}, a{5,8}, a{5,}
  • Any character: a.b
  • Not: [^0-9]
  • Anchors: /^The dog\.$/
  • Note: the backslash before the period is an
    escape character
  • Other escape characters include \*, \?, \n, \t,
    \\, \[, \], etc.
  • Operators
  • \d is equivalent to [0-9], \D is equivalent to [^0-9]
  • \w is equivalent to [a-zA-Z0-9_], \W is equivalent to
    [^\w]
  • \s is equivalent to [ \r\t\n\f], \S is equivalent to
    [^\s]
  • Substitute one regular expression for another:
    s/regExp1/regExp2/
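
Most of these notations can be tried directly with Python's re module. The following is a minimal sketch. The sample strings are made up for illustration, and re.sub stands in for the Perl-style s/regExp1/regExp2/ form.

```python
import re

# Each pair is (pattern, sample text the pattern should match somewhere).
EXAMPLES = [
    (r"abc", "xabcx"),               # concatenation
    (r"[a-zA-Z]", "7k"),             # disjunction with a character class
    (r"gupp(y|ies)", "guppies"),     # disjunction with alternation
    (r"a{5,8}", "aaaaaa"),           # counter: five to eight a's
    (r"a.b", "a-b"),                 # any single character
    (r"[^0-9]", "12x"),              # negated character class
    (r"^The dog\.$", "The dog."),    # anchors plus an escaped period
    (r"\d\s\w", "4 u"),              # shorthand classes
]

all_match = all(re.search(pattern, text) for pattern, text in EXAMPLES)

# Python's counterpart of a substitution is re.sub(pattern, replacement, text).
swapped = re.sub(r"colou?r", "shade", "color and colour")
```

Here all_match is True and swapped is "shade and shade".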

28
Examples of Regular Expressions
  • All strings ending with two zeroes
  • All strings containing three consecutive zeroes
  • All strings in which every block of five
    consecutive symbols has at least two zeroes
  • All strings in which the tenth symbol from the
    right is a one
  • The set of all modular five numbers
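
Three of these exercises can be written as Python regular expressions. This sketch assumes the alphabet is the binary symbols 0 and 1, which the slide does not state. The constant names and the helper accepts are illustrative.

```python
import re

# Assumed alphabet: the binary symbols 0 and 1.
ENDS_WITH_00 = re.compile(r"[01]*00")
HAS_000 = re.compile(r"[01]*000[01]*")
TENTH_FROM_RIGHT_IS_1 = re.compile(r"[01]*1[01]{9}")


def accepts(pattern, s):
    """True when the whole string s is in the language of pattern."""
    return pattern.fullmatch(s) is not None
```

For instance, accepts(ENDS_WITH_00, "10100") is True, while accepts(ENDS_WITH_00, "1001") is False.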

29
Finite State Automata (FSA)
FSAs recognize grammars that are regular
  • Definition: An FSA (Σ, q0, F, Q, δ) consists of
  • a set of states (Q)
  • a starting state (q0)
  • a set of final or accepting states (F ⊆ Q)
  • a finite set of symbols (Σ)
  • a transition function δ(q,i) that maps Q × Σ to
    Q. It switches from a from-state to a to-state,
    based on one of the valid symbols

Synonyms: Finite Automata, Finite State Machine
30
Finite-state Automata
  • Equivalent to
  • Finite-state automata (FSA)
  • Regular languages
  • Regular expressions

31
Recognition
Determine if the machine accepts a particular
string, i.e. is the string in the language?
  • Traditionally, Turing used a tape reader to
    depict a FSA
  • Algorithm
  • Begin in the start state
  • Examine the current input character
  • Consult the table
  • Go to a new state and update the tape pointer
  • Repeat until you run out of tape
  • The machine accepts the string if processing stops
    in a final state

32
Graphs and State Transition Tables
  • What can we say about this machine?
  • It has 5 states
  • At least b,a, and ! are in its alphabet
  • q0 is the start state
  • q4 is an accept state
  • It has 5 transitions
  • Questions
  • Which strings does it accept? baaaa, aaabaaa, ba
  • Is this the only FSA that can accept this
    language?

State Transition Table
Annotated Directed Graph
An FSA can accept only regular languages.
Question: Can you think of a language that is not
regular?
33
Finite-state Automata (Machines)
34
Input Tape
REJECT
35
Input Tape
ACCEPT
36
State-transition Tables
State   Input
        b    a    !
0       1    ∅    ∅
1       ∅    2    ∅
2       ∅    3    ∅
3       ∅    3    4
4       ∅    ∅    ∅
37
Recognizer Implementation
  • index ← beginning of tape
  • state ← start state
  • DO
  • IF transition[state, tape[index]] is empty
  • RETURN false
  • state ← transition[state, tape[index]]
  • index ← index + 1
  • UNTIL end of tape is reached
  • IF state is a final state
  • RETURN true
  • ELSE RETURN false
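
The recognizer above can be sketched in Python using the machine from the state-transition table on slide 36. This is a minimal sketch: the dictionary encoding, where a missing key plays the role of an empty table cell, and the names TRANSITIONS and recognize are assumptions made for illustration.

```python
# (state, symbol) -> next state; a missing entry means "no transition".
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
START_STATE = 0
FINAL_STATES = {4}


def recognize(tape):
    """Return True when the machine ends in a final state after
    consuming every symbol on the tape."""
    state = START_STATE
    for symbol in tape:
        if (state, symbol) not in TRANSITIONS:
            return False          # empty table cell: reject
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL_STATES
```

With this table, recognize("baa!") and recognize("baaaa!") are True, while recognize("ba!") and recognize("baa") are False.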

38
Other FSA Examples
Dollars and Cents
Exercise: Create an FSA for each of the following
regular expressions: (01), [a-f][1-9], abc{5}
39
FSAs and Morphology
  • Apply an FSA to each word in the dictionary to
    capture the morphological forms.
  • Groups of words with common morphology can share
    FSAs

40
Building a Lexicon with a FSA
41
Derivational Rules
42
Simple Morphology Example
From   To   Output
0      1    un
0      1    NULL
1      2    adj-root-list
2      3    -er, -est, -ly
Stop states: q2 and q3
43
An Extended Example
From   To   Output
0      1    un
0      3    NULL
1      2    adj-root-list-1
2      5    -er, -est, -ly
3      2    adj-root-list-1
3      4    adj-root-list-2
4      5    -er, -est
Adj-root-list-1: clear, happy, real
Adj-root-list-2: big, red
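
The extended table can be turned into a small recognizer that works on a word already split into morphemes. This is a minimal sketch with two assumptions that are not on the slide: the stop states are taken to be 2, 4, and 5, and spelling changes such as happy to happier are ignored.

```python
ROOTS_1 = {"clear", "happy", "real"}
ROOTS_2 = {"big", "red"}

# (from state, to state, allowed morphemes); None marks the NULL arc.
ARCS = [
    (0, 1, {"un"}),
    (0, 3, None),
    (1, 2, ROOTS_1),
    (2, 5, {"er", "est", "ly"}),
    (3, 2, ROOTS_1),
    (3, 4, ROOTS_2),
    (4, 5, {"er", "est"}),
]
# Assumption: the slide does not list stop states for this machine.
STOP_STATES = {2, 4, 5}


def accepts(morphemes, state=0):
    """True when the morpheme sequence leads from state to a stop state."""
    if not morphemes and state in STOP_STATES:
        return True
    for source, target, labels in ARCS:
        if source != state:
            continue
        if labels is None:
            # NULL arc: move without consuming a morpheme.
            if accepts(morphemes, target):
                return True
        elif morphemes and morphemes[0] in labels:
            if accepts(morphemes[1:], target):
                return True
    return False
```

Under these assumptions, accepts(["un", "clear", "ly"]) and accepts(["big", "est"]) are True, while accepts(["un", "big"]) and accepts(["red", "ly"]) are False.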
44
Representing Derivational Rules
45
Key Points Regarding FSAs
  • This algorithm is a state-space search algorithm
  • Implementation uses simple table lookups
  • Success occurs when at the end of a string, we
    reach a final state
  • The results are always deterministic
  • There is one unique choice at each step
  • The algorithm recognizes all regular languages
  • Perl, Java, etc. use a regular expression
    algorithm
  • Create a state transition table from the
    expression
  • pass the table to the FSA interpreter
  • FSA algorithms
  • Recognizer: determines if a string is in the
    language
  • Generator: generates all strings in the language

46
Non-Deterministic FSA
  • Deterministic: given a state and symbol, only one
    transition is possible
  • Nondeterministic (δ: Q × Σ → P(Q), where P(Q) is
    the set of all subsets of Q)
  • Given a state and a symbol, multiple transitions
    are possible
  • Epsilon transitions: those which DO NOT examine
    or advance the tape
  • The Nondeterministic FSA recognizes a string if
  • At least one transition sequence ends at a final
    state
  • Note: not all sequences have to end at a final
    state
  • Note: String rejection occurs only when NO
    sequence ends at a final state

Examples

47
FSA vs Non Deterministic FSA
  • Non deterministic
  • Deterministic

a b b a a b a b
0 1 1 0 1 1 0 1
48
Concatenation
49
Closure
50
Union
51
Using NFSAs
State   Input
        b    a      !    ε
0       1    ∅      ∅    ∅
1       ∅    2      ∅    ∅
2       ∅    2,3    ∅    ∅
3       ∅    ∅      4    ∅
4       ∅    ∅      ∅    ∅
52
NFSA Recognition of baaa!
53
Breadth-first Recognition of baaa!
54
Nondeterministic FSA Example
55
Non Deterministic FSA Recognizer
  • Recognizer(index, state)
  • LOOP
  • IF end of tape THEN
  • IF state is final RETURN true ELSE
    RETURN false
  • IF no possible transitions RETURN false
  • IF there is only one transition
  • state ← transition[state, tape[index]]
  • IF not an epsilon transition THEN index ← index + 1
  • ELSE
  • FOR each possible transition not yet considered
  • result ← CALL Recognizer(nextIndex, nextState)
  • IF result is true RETURN true
  • END LOOP
  • RETURN false
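
A depth-first version of this recognizer can be sketched in Python using the table from the "Using NFSAs" slide. This is a minimal sketch, not the slide's exact procedure: it always recurses rather than special-casing a single transition, and the seen set that guards against epsilon loops is an addition. The names NFSA and nd_recognize are illustrative.

```python
EPSILON = ""

# (state, symbol) -> set of possible next states.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
FINAL_STATES = {4}


def nd_recognize(tape, state=0, index=0, seen=None):
    """Depth-first search over (state, index) pairs; True as soon as
    one transition sequence ends in a final state at the end of tape."""
    seen = set() if seen is None else seen
    if (state, index) in seen:      # already explored; guards epsilon loops
        return False
    seen.add((state, index))
    if index == len(tape) and state in FINAL_STATES:
        return True
    # Epsilon arcs change state without advancing the tape.
    for nxt in NFSA.get((state, EPSILON), ()):
        if nd_recognize(tape, nxt, index, seen):
            return True
    if index < len(tape):
        for nxt in NFSA.get((state, tape[index]), ()):
            if nd_recognize(tape, nxt, index + 1, seen):
                return True
    return False
```

For the input baaa!, the search backtracks out of the branch that moves to state 3 too early and still returns True.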

56
Finite State Transducer (FST)
  • Definition: An FST is a 7-tuple (Q, Σ, Γ, I, F,
    δ, σ)
  • Q is a finite set of states
  • Σ is a finite set of symbols (the input alphabet)
  • Γ is a finite set of symbols (the output
    alphabet)
  • I is a subset of Q (the initial states)
  • F is a subset of Q (the final states)
  • δ is a function, δ(q,i), that maps Q × Σ to Q
  • σ is a function, σ(q,i), that maps Q × Σ to an
    output string over Γ (where ε is the empty
    string)
  • Concept: translates and writes to a second tape

57
Transition Example
  • c:c means read a c on one tape and write a c on
    the other
  • N:ε means read an N symbol on one tape and write
    nothing on the other
  • PL:s means read PL and write an s
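
These input:output pairs can be sketched as a tiny deterministic transducer in Python. This is a toy machine for the single stem cat. The state numbers, the SG arc, and the function name transduce are assumptions made for illustration.

```python
EPSILON = ""

# (state, input symbol) -> (next state, output symbol).
ARCS = {
    (0, "c"): (1, "c"),         # c:c
    (1, "a"): (2, "a"),         # a:a
    (2, "t"): (3, "t"),         # t:t
    (3, "N"): (4, EPSILON),     # N:ε  read N, write nothing
    (4, "PL"): (5, "s"),        # PL:s read PL, write s
    (4, "SG"): (5, EPSILON),    # SG:ε (assumed; not on the slide)
}
FINAL_STATES = {5}


def transduce(symbols):
    """Read input symbols from one tape and return the string written
    to the other, or None if the input is rejected."""
    state, output = 0, []
    for symbol in symbols:
        if (state, symbol) not in ARCS:
            return None
        state, out = ARCS[(state, symbol)]
        output.append(out)
    return "".join(output) if state in FINAL_STATES else None
```

Here transduce(["c", "a", "t", "N", "PL"]) writes "cats".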

58
Finite State Transducer
A finite state automaton that produces an output
string
Input: Features from a sequence of frames
Processing: Find the most likely path
through the sequence using hidden Markov models
or neural networks
Output: The most likely word, phoneme, or syllable
O is a set of output states, with an output mapping S → O
59
On-line demos
  • Finite state automata demos
  • http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html
  • Finite state morphology
  • http://www.xrce.xerox.com/competencies/content-analysis/demos/english
  • Some other downloadable FSA tools
  • http://www.research.att.com/sw/tools/fsm/

60
PFSA (Probabilistic Finite State Automata)
  • A PFSA is a type of Probabilistic Context Free
    Grammar
  • The states are the non-terminals in a production
    rule
  • The output symbols are the observed outputs
  • The arcs represent a context-free rule
  • The path through the automaton represents a parse
    tree
  • A PCFG considers state transitions and the
    transition path

61
Probabilistic Finite State Machines
  • Probabilistic models determine weights of the
    transitions
  • The weights leaving a state sum to one
  • Operations
  • Consider the weights to compute the probability
    of a given string or most likely path.
  • The machine can learn the weights over time
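
Computing the probability of a string from the transition weights can be sketched as follows. This is a minimal sketch with made-up weights. The machine is deterministic, so each string has a single path and the string probability is just the product of the weights along it.

```python
# (state, symbol) -> (next state, probability). Hypothetical weights;
# the probabilities leaving each state sum to one.
ARCS = {
    (0, "b"): (1, 1.0),
    (1, "a"): (2, 1.0),
    (2, "a"): (2, 0.6),
    (2, "!"): (3, 0.4),
}
FINAL_STATES = {3}


def string_probability(tape):
    """Multiply the weights along the single path for tape; return 0.0
    when the string is not accepted."""
    state, prob = 0, 1.0
    for symbol in tape:
        if (state, symbol) not in ARCS:
            return 0.0
        state, weight = ARCS[(state, symbol)]
        prob *= weight
    return prob if state in FINAL_STATES else 0.0
```

With these weights, the string ba! has probability 0.4, and each additional a multiplies that by 0.6.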

62
Another Example
63
Pronunciation decoding
n iy
64
Merging the machines together
n iy
65
Another Example
66
Beam Search
Graph Searching Approach
  • Beam search is a breadth-first approach
  • At each level it considers only a subset of the
    most likely edges (the beam width). If the beam
    width is infinite, the search reduces to
    breadth-first search
  • Beam width can be fixed or it can vary depending
    on search parameters
  • At each level of the graph, it generates a list
    of successors sorted according to a heuristic
    cost metric
  • Because beam search returns the first solution
    found, there is no guarantee that the algorithm
    will find the best solution
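
A level-by-level beam search can be sketched in Python as follows. This is a minimal sketch with a fixed integer beam width. The graph, the heuristic costs, and the function signature are hypothetical and chosen only to show the effect of the beam width.

```python
import heapq


def beam_search(start, is_goal, successors, cost, beam_width):
    """Expand the graph level by level, keeping only the beam_width
    lowest-cost paths at each level. Returns the first goal path
    found, or None if the beam runs empty."""
    if is_goal(start):
        return [start]
    beam = [[start]]
    visited = {start}
    while beam:
        candidates = []
        for path in beam:
            for node in successors(path[-1]):
                if node in visited:
                    continue
                visited.add(node)
                new_path = path + [node]
                if is_goal(node):
                    return new_path
                candidates.append(new_path)
        # Keep the beam_width candidates with the lowest heuristic cost.
        beam = heapq.nsmallest(beam_width, candidates,
                               key=lambda p: cost(p[-1]))
    return None


# Hypothetical graph and heuristic costs, for illustration only.
GRAPH = {"S": ["A", "B"], "A": ["C"], "B": ["G"], "C": [], "G": []}
COST = {"S": 3, "A": 1, "B": 2, "C": 5, "G": 0}

narrow = beam_search("S", lambda n: n == "G",
                     GRAPH.__getitem__, COST.__getitem__, 1)
wide = beam_search("S", lambda n: n == "G",
                   GRAPH.__getitem__, COST.__getitem__, 2)
```

In this example narrow is None, because a beam of width 1 keeps only the path through A and prunes the path through B that leads to the goal. With width 2, wide is ["S", "B", "G"].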

67
Dynamic Programming-Based Search
68
Syllables
  • Organizational phonological unit
  • Vowel between two consonants
  • Ambiguous positioning of consonants into
    syllables
  • Tree structured representation
  • Basic unit of prosody
  • Lexical stress: inherent property of a word
  • Sentential stress: speaker choice to emphasize or
    clarify

69
Representing Stress
  • There have been unsuccessful attempts to
    automatically assign stress to phonemes
  • Notations for representing stress
  • IPA (International Phonetic Alphabet) has a
    diacritic symbol for stress
  • Numeric representation
  • 0: reduced, 1: normal, 2: stressed
  • Relative
  • Reduced (R) or Stressed (S)
  • No notation means undistinguished