Finite State Machinery

Slides: 33
Provided by: MikeR2

1
Advanced Topics in NLP
  • Finite State Machinery
  • Xerox Tools

2
Finite State Methods
  • Many Domains of Application
  • Tokenization
  • Sentence breaking
  • Spelling correction
  • Morphology (analysis/generation)
  • Phonological disambiguation (Speech Recognition)
  • Morphological disambiguation (Tagging)
  • Pattern matching (Named Entity Recognition)
  • Shallow Parsing

3
The Xerox Approach
  • Lauri Karttunen, Martin Kay, Ronald Kaplan, Kimmo
    Koskenniemi.
  • Meta-languages for describing regular languages
    and regular relations.
  • Compiler for mapping meta-language "programs"
    into efficient FS machinery
  • Several tools and applications

4
Xerox Tools
  • xfst: Xerox Finite-State Tool
  • lexc: Finite-State Lexicon Compiler
  • twolc: Two-Level Rule Compiler

5
Xerox Tools
  • All of these applications are built around a
    central library, now written in C, called c-fsm.
  • The library defines the data structures, provides
    the input/output routines, and implements the
    fundamental operations on finite-state networks.
  • All based on long-term Xerox research, originated
    by Ronald M. Kaplan and Martin Kay at PARC in the
    early 1980s.

6
Textbook
CSLI Publications, Studies in Computational
Linguistics series. See also the www.fsmbook.com
website.

7
xfst
  • xfst is a general tool for creating and
    manipulating finite-state networks, both simple
    automata and transducers.
  • xfst and other Xerox tools employ a special "xfst
    notation" (more powerful than the regular
    expression notations used in Unix, Perl, C, etc.)

8
Simple Regular Expressions
  • Atomic Expressions
  • Complex Expressions

9
Atomic Expressions
  • The simplest kind of RE is a symbol. Typically, a
    symbol is the sort of item that can appear on the
    arc of a network.
  • For example, the symbol a is an RE that
    designates the language containing the string "a"
    and nothing else.
  • Multicharacter symbols such as +Plur are also
    symbols, but they happen to have multicharacter
    print names.

10
Special Atomic Expressions
  • The epsilon (ε) symbol 0 denotes the empty-string
    language {""}.
  • The ANY symbol ? denotes the language of all
    single-symbol strings.
  • The empty string is not included in ?.

11
Complex REs Union
  • If A and B are arbitrary REs, A | B is the
    union of A and B, which denotes the union of the
    languages denoted by A and B respectively.
  • If A is an arbitrarily complex RE, [A] is
    equivalent to A.
  • Checkpoint: Write down the strings in the
    language denoted by a | b | ab.

12
Complex REs Intersection
  • If A and B are arbitrary REs, A & B is the
    intersection of A and B, which denotes the
    intersection of the languages denoted by A and B
    respectively.
  • Checkpoint: Write down the strings in the
    language denoted by
  • [a | b | c | d | e] & [d | e | f | g]
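
Union and intersection of regular languages behave like the corresponding set operations on their strings. The following is only an illustrative sketch, not xfst itself: it models two small finite languages as Python sets (the names X and Y and their contents are made up for the example), and Python's set operators happen to use the same | and & symbols.

```python
# Two small finite languages, modelled as sets of strings (hypothetical data).
X = {"cat", "dog"}
Y = {"dog", "fish"}

union = X | Y          # strings that are in X or in Y
intersection = X & Y   # strings that are in both X and Y

print(sorted(union))         # ['cat', 'dog', 'fish']
print(sorted(intersection))  # ['dog']
```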

13
Complex REs Concatenation
  • If A and B are arbitrary REs, A B is the
    concatenation of A and B.
  • Checkpoint: note the difference between
  • d o g
  • dog
  • d og

14
Concatenation over Reg. Expression and Language
  • Regular Expression
  • E1 = [a | b]
  • E2 = [c | d]
  • E1 E2 =
  • [a | b] [c | d]
  • Language
  • L1 = {"a", "b"}
  • L2 = {"c", "d"}
  • L1 L2 =
  • {"ac", "ad", "bc", "bd"}
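
The language-level view of concatenation can be checked with a few lines of Python. This is a minimal sketch that treats each language as a finite set of strings; the function name concat is an assumption made for the example and is not part of any Xerox tool.

```python
def concat(l1, l2):
    """Concatenate two finite languages: each string of l1 followed by each string of l2."""
    return {x + y for x in l1 for y in l2}

L1 = {"a", "b"}
L2 = {"c", "d"}

result = concat(L1, L2)
print(sorted(result))  # ['ac', 'ad', 'bc', 'bd']
```

Concatenating with the empty-string language {""} leaves a language unchanged, which is why the symbol 0 acts as the identity for concatenation.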

15
Concatenation over FS Automata
[Figure: FS automata with arcs labelled a, b and c, d; diagram not reproduced.]

16
Complex REs Closures
  • A+ denotes the concatenation of A with itself
    one or more times.
  • A* (Kleene Star) denotes A+ | 0, that is, zero or
    more times.
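
The closure of a language is infinite, so it can only be enumerated up to a bound. The sketch below, with hypothetical helper names power, plus and star, lists the strings obtained by repeating a finite language up to max_reps times; it illustrates the definitions and is not how a finite-state compiler represents closures (a network simply gains a loop).

```python
def concat(l1, l2):
    """Concatenate two finite languages."""
    return {x + y for x in l1 for y in l2}

def power(lang, n):
    """lang concatenated with itself n times; power(lang, 0) is {''}."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result

def plus(lang, max_reps):
    """One or more repetitions, truncated at max_reps repetitions."""
    out = set()
    for n in range(1, max_reps + 1):
        out |= power(lang, n)
    return out

def star(lang, max_reps):
    """Zero or more repetitions: the bounded plus together with the empty string."""
    return plus(lang, max_reps) | {""}

A = {"ab"}
print(sorted(plus(A, 3)))  # ['ab', 'abab', 'ababab']
print(sorted(star(A, 3)))  # ['', 'ab', 'abab', 'ababab']
```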

17
Other Operations
  • Minus: A - B denotes the set difference of the
    languages denoted by A and B. (A - B = A & ~B)
  • Checkpoint: What is the language denoted by
  • [dog | cat | elephant] -
  • [elephant | horse | cow]
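
Set difference can be expressed as intersection with a complement. The following sketch checks this on a small case; because the true complement is taken relative to all possible strings, the example assumes a finite stand-in universe (all strings over {a, b} up to length 2), and the helper strings_up_to is hypothetical.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """All strings over `alphabet` of length 0..max_len (a finite stand-in for the universal language)."""
    return {"".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n)}

universe = strings_up_to("ab", 2)   # '', a, b, aa, ab, ba, bb
A = {"a", "ab", "bb"}
B = {"ab", "ba"}

complement_B = universe - B         # plays the role of the complement of B
difference = A - B

print(sorted(difference))                    # ['a', 'bb']
print(difference == (A & complement_B))      # True
```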

18
Some Other Conventions
  • A*  Closure (Kleene Star)
  • (A)  Optional Element
  • ?  Any symbol
  • \b  Any symbol other than b
  • ~A  Complement ( ?* - A )
  • 0  Empty string language
  • $A  Contains A ( ?* A ?* )

19
Simple Commands
  • In addition to the regular expression language
    there are also commands:
  • define: give a name to an RE
  • print: print information
  • read: read information
  • various stack operations
  • file interaction
  • various command line options

20
define command
  • define name regexp
  • xfst[0]: define foo d o g | c a t ;
  • xfst[0]: define R1 a | b | c | d ;
  • xfst[0]: define R2 d | e | f | g ;
  • xfst[0]: define R3 f | g | h | i | j ;

21
print command
  • print words name - see the words in the language
    called name.
  • print net name - see detailed information about
    the network name.
  • xfst[0]: define baz R1 R2 ;
  • xfst[0]: print words foo
  • xfst[0]: print net baz

22
Exercise
  • Compute the words in
  • R1 minus R2.
  • R2 intersect R1
  • Define a network that contains the words "eeny",
    "meeny", "miny", "mo".
  • Determine how many states there are in each
    result.

23
Basic Stack Operations
  • read regex: push network onto stack
  • print stack: list items on stack
  • print net: detailed info on top stack item
  • pop stack: remove top item from stack
  • define name: set name to value of top stack item

24
Stack Operations
  • Normally the stack is first loaded with suitable
    arguments.
  • A command requiring N arguments is then issued.
  • The arguments are popped from the stack, the
    operation is performed, and the result is written
    back onto the stack.
  • For order-sensitive operations such as
    concatenation, items should be pushed onto the
    stack in reverse order, because the top item is
    taken as the first argument.
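
The stack discipline can be mimicked with a toy model in which each "network" is just the finite set of words it accepts. This is a rough sketch under that assumption; the class NetStack and its method names are hypothetical stand-ins for the xfst commands and do not reproduce xfst's implementation. As in the demos on the following slides, the operations here consume every item currently on the stack.

```python
from functools import reduce

class NetStack:
    """Toy model of the xfst stack: each 'network' is a finite set of words."""

    def __init__(self):
        self.items = []  # the last element is the top of the stack

    def read_regex(self, words):
        """Push a language (given directly as a collection of words)."""
        self.items.append(set(words))

    def _pop_all(self):
        """Remove all items and return them with the top of the stack first."""
        nets = self.items[::-1]
        self.items = []
        return nets

    def intersect_net(self):
        nets = self._pop_all()
        self.items.append(reduce(lambda a, b: a & b, nets))

    def concatenate_net(self):
        # The top item is the first operand, hence the reverse push order.
        nets = self._pop_all()
        self.items.append(reduce(lambda a, b: {x + y for x in a for y in b}, nets))

    def print_words(self):
        return sorted(self.items[-1])

stack = NetStack()
stack.read_regex({"s", "ed"})        # pushed first, so it ends up second
stack.read_regex({"walk", "jump"})   # pushed last, so it is the first operand
stack.concatenate_net()
print(stack.print_words())  # ['jumped', 'jumps', 'walked', 'walks']
```

Pushing the suffixes before the stems is what makes the stems come first in the result; swapping the two read_regex calls would produce strings such as "swalk" instead.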

25
Stack Demo 1
  • xfst[0]: clear stack
  • xfst[0]: read regex d | c | e | b | w ;
  • xfst[1]: read regex b | s | h | w ;
  • xfst[2]: read regex s | d | c | f | w ;
  • xfst[3]: print stack
  • xfst[3]: intersect net
  • xfst[1]: print stack
  • xfst[1]: print net
  • xfst[1]: print words

26
Stack Exercise 2
  • xfst[0]: clear stack
  • xfst[0]: read regex e d | i n g | s ;
  • xfst[1]: read regex t a l k | k i c k ;
  • xfst[2]: print stack
  • xfst[2]: print net
  • xfst[2]: print words
  • xfst[2]: concatenate net
  • xfst[1]: print words

27
lexc
[Figure: Source File -> lexc -> Compiled Network]
  • lexc is a high-level programming language and
    compiler that is well suited for defining
    natural-language lexicons.
  • The output is a compiled form of FS network in a
    format identical to other Xerox tools (xfst,
    twolc).

28
lexc source file
  • !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  • ! ex0-lex.txt
  • LEXICON Root
  • dine # ;
  • dines # ;
  • dined # ;
  • line # ;
  • lines # ;
  • lined # ;
  • END
  • !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

29
lexc
  • ! ex1-lex.txt
  • LEXICON Root
  • Noun ;
  • Verb ;
  • LEXICON Noun
  • line NounSuffix ;
  • LEXICON Verb
  • dine VerbSuffix ;
  • line VerbSuffix ;
  • LEXICON NounSuffix
  • # ;
  • s # ;
  • LEXICON VerbSuffix
  • # ;
  • s # ;
  • d # ;
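
The way sublexicons chain through continuation classes can be sketched as a recursive expansion. The Python below is an illustration only, not lexc's algorithm: it encodes the ex1-lex.txt lexicons as a dictionary, and it assumes that NounSuffix and VerbSuffix each also contain an empty entry, which is what the entry counts reported by the compiler (NounSuffix...2, VerbSuffix...3) suggest. The names LEXICONS and expand are hypothetical, and "#" marks the end of a word.

```python
# Each sublexicon maps to a list of (form, continuation) entries.
LEXICONS = {
    "Root": [("", "Noun"), ("", "Verb")],
    "Noun": [("line", "NounSuffix")],
    "Verb": [("dine", "VerbSuffix"), ("line", "VerbSuffix")],
    "NounSuffix": [("", "#"), ("s", "#")],              # empty entry is an assumption
    "VerbSuffix": [("", "#"), ("s", "#"), ("d", "#")],  # empty entry is an assumption
}

def expand(lexicon="Root", prefix=""):
    """Yield every word reachable from `lexicon` by following continuation classes."""
    if lexicon == "#":
        yield prefix
        return
    for form, continuation in LEXICONS[lexicon]:
        yield from expand(continuation, prefix + form)

paths = list(expand())
words = sorted(set(paths))
print(len(paths))  # 8
print(words)       # ['dine', 'dined', 'dines', 'line', 'lined', 'lines']
```

There are eight paths but only six distinct words, because "line" and "lines" are each reachable once through Noun and once through Verb.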

30
Running lexc
  • lexc> compile-source ex1-lex.txt
  • Opening 'ex1-lex.txt'...
  • Root...2, Noun...1, Verb...2, NounSuffix...2,
    VerbSuffix...3
  • Building lexicon...Minimizing...Done!
  • SOURCE: 6 states, 7 arcs, 6 words
  • lexc>

31
lexc
  • The resulting lexicon contains the same six words
    as ex0-lex.txt.
  • The form "lines" actually gets constructed twice,
    once as a verb and once as a noun.
  • After minimization, only one copy remains.
  • The compiler first processes each sublexicon
    separately, keeping track of continuation
    pointers, and then joins the structures into a
    single network, which is determinized and
    minimized.
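
The effect of determinization and minimization on this small lexicon can be reproduced with a short sketch. It is not the lexc compiler's procedure: it builds a trie from the six words (a trie is already deterministic) and then merges states that have identical finality and identical outgoing arcs, which is sufficient for an acyclic automaton. The function names build_trie and minimize are assumptions made for the example.

```python
WORDS = ["dine", "dines", "dined", "line", "lines", "lined"]

def build_trie(words):
    """Return (final, arcs): final[s] is True for accepting states, arcs[s] maps symbol -> target."""
    final = [False]
    arcs = [{}]          # state 0 is the start state
    for word in words:
        state = 0
        for ch in word:
            if ch not in arcs[state]:
                final.append(False)
                arcs.append({})
                arcs[state][ch] = len(arcs) - 1
            state = arcs[state][ch]
        final[state] = True
    return final, arcs

def minimize(final, arcs):
    """Merge equivalent states of an acyclic automaton; return (number of states, number of arcs)."""
    registry = {}  # signature -> id of the merged state

    def canon(state):
        # A state's signature is its finality plus its arcs to already-merged targets.
        sig = (final[state],
               tuple(sorted((ch, canon(target)) for ch, target in arcs[state].items())))
        if sig not in registry:
            registry[sig] = len(registry)
        return registry[sig]

    canon(0)
    n_states = len(registry)
    n_arcs = sum(len(sig[1]) for sig in registry)
    return n_states, n_arcs

final, arcs = build_trie(WORDS)
print(len(arcs))              # 13 states in the unminimized trie
print(minimize(final, arcs))  # (6, 7)
```

The result agrees with the figures the compiler reports for this lexicon: 6 states and 7 arcs.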

32
Resulting FSA
[Figure: the resulting FSA; arc labels d, l, i, n, e, s, d]