1
CS60057 Speech & Natural Language Processing
  • Autumn 2007

Lecture 4: 1 August 2007
2
MORPHOLOGY
3
Finite State Machines
  • FSAs are equivalent to regular languages
  • FSTs are equivalent to regular relations (over
    pairs of regular languages)
  • FSTs are like FSAs but with complex labels.
  • We can use FSTs to transduce between surface and
    lexical levels.

4
Simple Rules
5
Adding in the Words
6
Derivational Rules
7
Parsing/Generation vs. Recognition
  • Recognition is usually not quite what we need.
  • Usually if we find some string in the language we
    need to find the structure in it (parsing)
  • Or we have some structure and we want to produce
    a surface form (production/generation)
  • Example
  • From cats to cat N PL and back

8
Morphological Parsing
  • Given the input cats, we'd like to output cat N
    Pl, telling us that cat is a plural noun.
  • Given the Spanish input bebo, we'd like to
    output beber V PInd 1P Sg, telling us that
    bebo is the present indicative first-person
    singular form of the Spanish verb beber, "to
    drink".

9
Morphological Analyser
  • To build a morphological analyser we need:
  • lexicon: the list of stems and affixes, together
    with basic information about them
  • morphotactics: the model of morpheme ordering (e.g.
    the English plural morpheme follows a noun rather
    than a verb)
  • orthographic rules: these spelling rules are used
    to model the changes that occur in a word,
    usually when two morphemes combine (e.g., fly + s
    → flies)

10
Lexicon and Morphotactics
  • Typically the list of word parts (lexicon) and the
    models of ordering can be combined into
    an FSA which will recognise all the valid
    word forms.
  • For this to be possible the word parts must first
    be classified into sublexicons.
  • The FSA defines the morphotactics (ordering
    constraints).

11
Sublexicons to classify the list of word parts
reg-noun   irreg-pl-noun   irreg-sg-noun   plural
cat        mice            mouse           -s
fox        sheep           sheep
           geese           goose
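
A minimal sketch of how these sublexicons could be combined with an ordering model into a recogniser. The sublexicon contents come from the table; the state names (q0, q1, q2) and the arc layout are assumptions made for illustration, since the slide's diagram is not reproduced here.

```python
# Sublexicons taken from the table; "s" stands for the plural suffix -s.
SUBLEXICONS = {
    "reg-noun": {"cat", "fox"},
    "irreg-pl-noun": {"mice", "sheep", "geese"},
    "irreg-sg-noun": {"mouse", "sheep", "goose"},
    "plural": {"s"},
}

# Morphotactics as an FSA whose arcs are labelled with sublexicon names.
# State names are hypothetical.
ARCS = {
    "q0": [("reg-noun", "q1"), ("irreg-sg-noun", "q2"), ("irreg-pl-noun", "q2")],
    "q1": [("plural", "q2")],
    "q2": [],
}
FINAL = {"q1", "q2"}


def recognise(word, state="q0"):
    """Return True if some path through the FSA consumes the whole word."""
    if not word:
        return state in FINAL
    for sublexicon, target in ARCS[state]:
        for morph in SUBLEXICONS[sublexicon]:
            if word.startswith(morph) and recognise(word[len(morph):], target):
                return True
    return False


print(recognise("cats"))    # True
print(recognise("geeses"))  # False: no plural arc after an irregular noun
print(recognise("foxs"))    # True: spelling rules are not handled at this level
```

Note that this recogniser accepts foxs, because it encodes only the ordering constraints; turning foxs into foxes is the job of the spelling rules.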
12
FSA Expresses Morphotactics (ordering model)
13
Towards the Analyser
  • We can use lexc or xfst to build such an FSA
  • To augment this to produce an analysis we must
    create a transducer Tnum which maps between the
    lexical level and an "intermediate" level that is
    needed to handle the spelling rules of English.

14
Three Levels of Analysis
15
1. Tnum: Noun Number Inflection
  • multi-character symbols
  • morpheme boundary
  • word boundary

16
Intermediate Form to Surface
  • The reason we need to have an intermediate form
    is that funny things happen at morpheme
    boundaries, e.g.
  • cat^s → cats
  • fox^s → foxes
  • fly^s → flies
  • The rules which describe these changes are called
    orthographic rules or "spelling rules".

17
More English Spelling Rules
  • consonant doubling beg / begging
  • y replacement try/tries
  • k insertion panic/panicked
  • e deletion make/making
  • e insertion watch/watches
  • Each rule can be stated in more detail ...

18
Spelling Rules
  • Chomsky and Halle (1968) invented a special
    notation for spelling rules.
  • A very similar notation is embodied in the
    "conditional replacement" rules of xfst.
  • E -> F || L _ R
  • which means: replace E with F when it appears
    between left context L and right context R
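
One way to get a feel for this kind of conditional replacement is to emulate it with a regular-expression substitution. The sketch below is not xfst; it is a plain Python approximation in which E, F, L and R are all literal strings, and the helper name replace_rule is made up for this example. The y-to-ie rule used to exercise it is a deliberately narrow illustration, not a full statement of English y replacement.

```python
import re


def replace_rule(e, f, left="", right=""):
    """Build a function that replaces e with f between left and right.

    All arguments are treated as literal strings (an assumption of this
    sketch; real rule compilers accept regular expressions as contexts).
    """
    parts = []
    if left:
        parts.append("(?<=" + re.escape(left) + ")")   # left context L
    parts.append(re.escape(e))                         # the part to replace, E
    if right:
        parts.append("(?=" + re.escape(right) + ")")   # right context R
    pattern = re.compile("".join(parts))
    # A function is used as the replacement so that f is inserted literally.
    return lambda text: pattern.sub(lambda match: f, text)


# Illustrative rule: y -> ie || tr _ s
y_rule = replace_rule("y", "ie", left="tr", right="s")
print(y_rule("trys"))  # tries
print(y_rule("try"))   # try  (right context missing, so nothing changes)
```

Because the contexts are lookarounds, they are checked but not consumed, which matches the idea that only E is rewritten.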

19
A Particular Spelling Rule
  • This rule does e-insertion
  • ε -> e || x ^ _ s
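
As a rough sketch, the e-insertion rule can be applied to an intermediate form with a regular expression. The symbols ^ (morpheme boundary) and # (word boundary) are assumptions about the intermediate-level notation, and the function name is hypothetical. Only the x context is covered here, so forms such as fly^s are not handled.

```python
import re


def intermediate_to_surface(form):
    """Apply e-insertion, then strip boundary symbols.

    Assumes '^' marks a morpheme boundary and '#' a word boundary.
    """
    # Insert an e at the empty position right after "x^" when an s follows.
    form = re.sub(r"(?<=x\^)(?=s)", "e", form)
    # Boundary symbols do not appear on the surface level.
    return form.replace("^", "").replace("#", "")


print(intermediate_to_surface("fox^s#"))  # foxes
print(intermediate_to_surface("cat^s#"))  # cats
```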

20
e insertion over 3 levels
The rule corresponds to the mapping
between surface and intermediate levels
21
e insertion as an FST
22
Incorporating Spelling Rules
  • Spelling rules, each corresponding to an FST, can
    be run in parallel provided that they are
    "aligned".
  • The set of spelling rules is positioned between
    the surface level and the intermediate level.
  • Parallel execution of FSTs can be carried out
  • by simulation, in which case the FSTs must first be
    aligned, or
  • by first constructing a single FST
    corresponding to their intersection.

23
Putting it all together
execution of FSTi takes place in parallel
24
Kaplan and Kay: The Xerox View
FSTi are aligned but separate
FSTi intersected together
25
Finite State Transducers
  • The simple story
  • Add another tape
  • Add extra symbols to the transitions
  • On one tape we read cats, on the other we write
    cat N PL, or the other way around.

26
FSTs
27
English Plural
surface   lexical
cat       cat N Sg
cats      cat N Pl
foxes     fox N Pl
mice      mouse N Pl
sheep     sheep N Pl, sheep N Sg
28
Transitions
  • c:c means read a c on one tape and write a c on
    the other
  • N:ε means read an N symbol on one tape and write
    nothing on the other
  • PL:s means read PL and write an s
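
The transition labels above can be written down directly as a small arc table. The following is a minimal sketch for the single word cat; the state numbers and the SG arc are assumptions added for illustration, and the machine is kept deterministic so that a simple loop suffices.

```python
EPS = ""  # stands for ε, the empty string

# Each arc is (source state, input symbol, output symbol, target state).
ARCS = [
    (0, "c", "c", 1),    # c:c
    (1, "a", "a", 2),    # a:a
    (2, "t", "t", 3),    # t:t
    (3, "N", EPS, 4),    # N:ε
    (4, "PL", "s", 5),   # PL:s
    (4, "SG", EPS, 5),   # SG:ε (hypothetical arc, not on the slide)
]
START, FINAL = 0, {5}


def transduce(symbols):
    """Read symbols from the input tape; return the output, or None if rejected."""
    state, output = START, []
    for symbol in symbols:
        for source, inp, out, target in ARCS:
            if source == state and inp == symbol:
                output.append(out)
                state = target
                break
        else:
            return None  # no arc matches the current symbol
    return "".join(output) if state in FINAL else None


print(transduce(["c", "a", "t", "N", "PL"]))  # cats
print(transduce(["c", "a", "t", "N", "SG"]))  # cat
```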

29
Typical Uses
  • Typically, we'll read from one tape using the
    first symbol on the machine transitions (just as
    in a simple FSA).
  • And we'll write to the second tape using the
    other symbols on the transitions.

30
Ambiguity
  • Recall that in non-deterministic recognition
    multiple paths through a machine may lead to an
    accept state.
  • It didn't matter which path was actually traversed.
  • In FSTs the path to an accept state does matter,
    since different paths represent different parses and
    different outputs will result.

31
Ambiguity
  • What's the right parse for unionizable?
  • union-ize-able
  • un-ion-ize-able
  • Each represents a valid path through the
    derivational morphology machine.

32
Ambiguity
  • There are a number of ways to deal with this
    problem
  • Simply take the first output found
  • Find all the possible outputs (all paths) and
    return them all (without choosing)
  • Bias the search so that only one or a few likely
    paths are explored
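
The first two strategies can be sketched with a small non-deterministic analyser. This is an illustrative toy: whole stems are used as single arc labels to keep the machine small, the state numbers are made up, and the lexicon is limited to cat and sheep. A generator yields one output per accepting path, so the caller can either collect all outputs or stop after the first. The third strategy (biasing the search) is not shown.

```python
# Arcs: (source state, surface input, lexical output, target state).
# An empty input string is an epsilon arc.
ARCS = [
    (0, "cat", "cat", 1),      # regular noun stem
    (0, "sheep", "sheep", 3),  # irregular noun: same form in both numbers
    (1, "", " N Sg", 2),
    (1, "s", " N Pl", 2),
    (3, "", " N Sg", 2),
    (3, "", " N Pl", 2),
]
FINAL = {2}


def analyses(surface, state=0, output=""):
    """Yield the output of every accepting path (depth-first search)."""
    if not surface and state in FINAL:
        yield output
    for source, inp, out, target in ARCS:
        if source == state and surface.startswith(inp):
            yield from analyses(surface[len(inp):], target, output + out)


print(list(analyses("sheep")))        # all paths: ['sheep N Sg', 'sheep N Pl']
print(next(analyses("sheep"), None))  # first output found: sheep N Sg
```

Which analysis comes first depends only on the order of the arcs in the table, which is one reason "take the first output" is a fragile strategy.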

33
The Gory Details
  • Of course, it's not as easy as
  • cat N PL <-> cats
  • As we saw earlier there are geese, mice and oxen
  • But there are also a whole host of
    spelling/pronunciation changes that go along with
    inflectional changes
  • Cats vs Dogs
  • Fox and Foxes

34
Multi-Tape Machines
  • To deal with this we can simply add more tapes
    and use the output of one tape machine as the
    input to the next
  • So to handle irregular spelling changes we'll add
    intermediate tapes with intermediate symbols

35
Generativity
  • Nothing really privileged about the directions.
  • We can write from one and read from the other or
    vice-versa.
  • One way is generation, the other way is analysis
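
To illustrate that neither direction is privileged, the sketch below swaps the two symbols on every arc of a toy transducer and runs the same search procedure over both versions. The arc table, state numbers and the function names invert and run are assumptions made for this example.

```python
# Arcs read lexical symbols and write surface symbols: (source, in, out, target).
ARCS = [
    (0, "c", "c", 1), (1, "a", "a", 2), (2, "t", "t", 3),
    (3, "N", "", 4),    # N:ε
    (4, "PL", "s", 5),  # PL:s
    (4, "SG", "", 5),   # SG:ε (hypothetical arc)
]
FINAL = {5}


def invert(arcs):
    """Swap the two tapes on every arc."""
    return [(source, out, inp, target) for source, inp, out, target in arcs]


def run(arcs, tape, state=0, output=()):
    """Yield the output tuple of every accepting path; tape is a tuple of symbols."""
    if not tape and state in FINAL:
        yield output
    for source, inp, out, target in arcs:
        if source != state:
            continue
        emitted = output + ((out,) if out else ())
        if inp == "":                    # epsilon on the input side
            yield from run(arcs, tape, target, emitted)
        elif tape and tape[0] == inp:    # consume one input symbol
            yield from run(arcs, tape[1:], target, emitted)


# Generation: lexical -> surface
print(list(run(ARCS, ("c", "a", "t", "N", "PL"))))   # [('c', 'a', 't', 's')]
# Analysis: surface -> lexical, using the inverted machine
print(list(run(invert(ARCS), ("c", "a", "t", "s"))))  # [('c', 'a', 't', 'N', 'PL')]
```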

36
Multi-Level Tape Machines
  • We use one machine to transduce between the
    lexical and the intermediate level, and another
    to handle the spelling changes to the surface
    tape
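
A rough sketch of such a cascade, written as two ordinary functions rather than real transducers. The boundary symbols ^ and #, the function names, and the small irregular-plural table are assumptions for illustration; the first function plays the role of the lexical-to-intermediate machine and the second applies a single spelling rule (e-insertion after x).

```python
import re

# Irregular plurals are listed in the lexicon, so they bypass the spelling rules.
IRREGULAR_PLURALS = {"goose": "geese", "mouse": "mice", "sheep": "sheep"}


def lexical_to_intermediate(stem, number):
    """First machine: ('fox', 'PL') -> 'fox^s#'."""
    if number == "PL":
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"
        return stem + "^s#"
    return stem + "#"


def intermediate_to_surface(form):
    """Second machine: apply e-insertion, then drop the boundary symbols."""
    form = re.sub(r"(?<=x\^)(?=s)", "e", form)
    return form.replace("^", "").replace("#", "")


def generate(stem, number):
    """Feed the output of the first machine into the second."""
    return intermediate_to_surface(lexical_to_intermediate(stem, number))


print(generate("fox", "PL"))    # foxes
print(generate("cat", "PL"))    # cats
print(generate("goose", "PL"))  # geese
```

Forms that the spelling rule does not apply to, such as cat^s#, pass through the second step unchanged apart from boundary removal.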

37
Lexical to Intermediate Level
38
Intermediate to Surface
  • The "add an e" rule, as in fox^s <-> foxes

39
Foxes
40
Note
  • A key feature of this machine is that it doesn't
    do anything to inputs to which it doesn't apply.
  • Meaning that they are written out unchanged to
    the output tape.
  • It turns out the multiple tapes aren't really
    needed; they can be compiled away.

41
Overall Scheme
  • We now have one FST that has explicit information
    about the lexicon (actual words, their spelling,
    facts about word classes and regularity).
  • Lexical level to intermediate forms
  • We have a larger set of machines that capture
    orthographic/spelling rules.
  • Intermediate forms to surface forms

42
Overall Scheme
43
  • http://nltk.sourceforge.net/index.php/Documentation