Title: Computational Linguistics Introduction
1 Computational Linguistics Introduction
- Finite State Machinery and Language Description
2 Acknowledgement
- The material for this lecture is derived from a
series of talks given by Dr. Ken Beesley (Xerox
European Research Centre, Grenoble) in Malta, 2001.
3 Today's Topics
- Finite State Technology
- Regular Languages and Relations
- Review of Set Theory
- Understand the mathematical operations that can
be performed on such Languages.
- Understand how Languages, Relations, Regular
Expressions, and Networks are interrelated.
4 What is Finite State Technology?
- Finite State Technology refers to a collection of
techniques for applying Finite State Automata
(FSA) to a range of linguistically motivated
problems.
- Such techniques include:
- Design of user languages for specifying FSA
- Compilation of such languages into efficient
transition networks
- Development environments and runtime systems
5 What is Finite-State Technology Good For?
- Finite-state techniques cannot handle center
embedding, e.g. "the man the dog the cat bit
followed ate."
- They are well suited to lower-level natural
language processing such as:
- Tokenization: what is the next word?
- Spelling error detection: does the next word
belong to a list?
- Morphological/phonological analysis/generation
- Shallow syntactic parsing and chunking
6 Tokenisation Problems
- VfB Stuttgart scored twice in quick success-ion
early in the second half on their way to a
deserved 2-1 victory over Manchester United in
the Champions League on Wednesday. (example from
Mary Dalrymple, University of London)
- VfB Stuttgart, Manchester United
- succession
- 2-1
- Wednesday
- Finite state techniques provide a means to
specify the language of words, thus defining what
it means to be the next token.
- There are three ways to specify such languages
7 Languages, Notations and Machines
LANGUAGE (set of strings)
NOTATION
8 Languages, Notations and Machines
FINITE STATE LANGUAGE
FINITE STATE NOTATION
9 FINITE STATE AUTOMATA: preliminary definition
- A finite state automaton includes
- A finite set of states
- A finite set of labelled transitions between
states
10 Physical Machines with Finite States
[Figure: a switch with two states, OFF and ON, connected by transitions labelled UP and DOWN.]
11 Physical Machines with Finite States
- The Lightswitch Toggle Machine
[Figure: two states, OFF and ON, with a PUSH transition in each direction.]
12 The Five Cent Machine
- Problem
- Assume you have one, two, and five cent pieces
- Design a finite state automaton which accepts
exactly 5 cents.
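One possible approach to this problem is to treat each running total as a state and each coin as a transition label. The sketch below is only an illustration, not the lecture's own solution; the names COINS, TARGET, TRANSITIONS and accepts are assumptions introduced here.

```python
# Hypothetical sketch of the five-cent machine; all names are illustrative.
COINS = (1, 2, 5)   # coin values in cents (the alphabet)
TARGET = 5          # the only final state

# States are the running totals 0..5. An arc exists only when adding the
# coin does not overshoot the target.
TRANSITIONS = {
    (total, coin): total + coin
    for total in range(TARGET + 1)
    for coin in COINS
    if total + coin <= TARGET
}

def accepts(coins):
    """Return True if the coin sequence leads from state 0 to state 5."""
    state = 0
    for coin in coins:
        if (state, coin) not in TRANSITIONS:
            return False                  # no arc for this coin: reject
        state = TRANSITIONS[(state, coin)]
    return state == TARGET

print(accepts([2, 2, 1]))   # True
print(accepts([5]))         # True
print(accepts([2, 2]))      # False: input consumed, state not final
print(accepts([2, 2, 2]))   # False: no arc for the third coin
```

Because no arc leaves state 5, any coin inserted after the total is reached causes rejection, which is what "exactly 5 cents" requires.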
13 The Cola Machine
- Need to enter 25 cents (USA) to get a drink
- Accepts the following coins:
- Nickel: 5 cents
- Dime: 10 cents
- Quarter: 25 cents
- For simplicity, our machine needs exact change
- We will model only the coin-accepting mechanism
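14 Physical Machines with Finite States
[Figure: state diagram of the cola machine. The states are the running totals 0, 5, 10, 15, 20 and 25; 0 is the Start State and 25 is the Final State. Five N arcs each add 5, four D arcs each add 10, and a single Q arc leads from 0 to 25.]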
15 The Cola Machine Language
- List of all the sequences of coins accepted:
- Q, DDN, DND, NDD, DNNN, NDNN,
- NNDN, NNND, NNNNN
- Think of the coins as SYMBOLS or CHARACTERS
- The set of symbols accepted is the ALPHABET of
the machine
- Think of sequences of coins as WORDS or strings
- The set of words accepted by the machine is its
LANGUAGE
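The language of the cola machine can also be generated mechanically by following every path from the start state to the final state. The following is a minimal sketch under the assumption that states are running totals in cents; the VALUE table and the language function are names invented for this illustration.

```python
# Illustrative encoding of the cola machine; names are assumptions.
VALUE = {"N": 5, "D": 10, "Q": 25}   # alphabet with coin values in cents
START, FINAL = 0, 25

def language(state=START, prefix=""):
    """Yield every word (coin sequence) that leads from `state` to FINAL."""
    if state == FINAL:
        yield prefix
        return
    for symbol, cents in VALUE.items():
        if state + cents <= FINAL:       # an arc exists for this coin
            yield from language(state + cents, prefix + symbol)

# Sort by length, then alphabetically, to match the order on the slide.
words = sorted(language(), key=lambda w: (len(w), w))
print(words)
print(len(words))   # 9
```

The enumeration terminates because every arc increases the total, so the network has no cycles and its language is finite.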
16 FINITE STATE AUTOMATA: better definition
- A finite state automaton includes
- A finite set of states
- Initial State
- Final State(s)
- A finite set of labelled transitions between
states
- Labels are symbols from an alphabet
- Recognises a language
- Generates a language as well!
17 A Network that Accepts a One Word Language
[Figure: a linear network whose arcs spell the single word "canto".]
18 A Network that Accepts a Three Word Language
[Figure: a network with three paths from the start state, spelling "canto", "tigre" and "mesa".]
19 Scaling Up the Network
- Imagine the same network expanded to handle three
million words, all of them corresponding to valid
words of a given language.
- We supply a word and apply it to the network.
If it is accepted by the network, then it is a
valid word. Otherwise it does not belong to the
language.
- This is the basis for a Spanish spelling error
detector.
20 Looking Up a Word
[Figure: the same three-word network ("canto", "tigre", "mesa") with the input "m e s a" applied to it.]
21 Lookup Failure
- Lookup succeeds when all input is consumed and a
final state is reached. Lookup can fail because:
- Not all input is consumed ("libro", "tigra")
- Input is fully consumed but state is not final
("cant")
- Final state is reached but there is still
unconsumed input ("mesas")
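The success and failure cases can be traced with a small program. This is a hedged sketch only: it stores the three words "canto", "tigre" and "mesa" as a simple letter tree rather than a minimised network, and the function names and failure messages are assumptions made for the example.

```python
WORDS = ["canto", "tigre", "mesa"]

def build_network(words):
    """Build a letter tree: state -> {symbol: next state}, plus final states."""
    arcs = {0: {}}          # state 0 is the start state
    finals = set()
    for word in words:
        state = 0
        for symbol in word:
            if symbol not in arcs[state]:
                new_state = len(arcs)
                arcs[new_state] = {}
                arcs[state][symbol] = new_state
            state = arcs[state][symbol]
        finals.add(state)
    return arcs, finals

def lookup(word, arcs, finals):
    """Apply a word to the network and report success or the failure reason."""
    state = 0
    for symbol in word:
        if symbol not in arcs[state]:
            if state in finals:
                return "fail: final state reached with unconsumed input"
            return "fail: not all input consumed"
        state = arcs[state][symbol]
    if state in finals:
        return "success"
    return "fail: input consumed but state is not final"

ARCS, FINALS = build_network(WORDS)
for word in ["mesa", "libro", "tigra", "cant", "mesas"]:
    print(word, "->", lookup(word, ARCS, FINALS))
```

A production spelling checker would use a minimised network so that common prefixes and suffixes share states, but the acceptance test itself stays the same.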
22 Shared Structure
[Figure: a compact network over the letters c, l, e, a, r, v in which different words share states and arcs.]
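23 Transducers
[Figure: a transducer path whose upper-side symbols m e s a Noun Fem Pl are paired with the lower-side symbols m e s a 0 0 s, where 0 is the empty symbol. Lookdown applies the analysis mesa Noun Fem Pl to the upper side and yields "mesas"; Lookup applies the surface string "m e s a s" to the lower side and yields the analysis.]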
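To make the two directions concrete, the single path pairing mesa Noun Fem Pl with "mesas" can be written as a list of symbol pairs and applied either way. This is a minimal sketch, not the Xerox implementation; PAIRS, ARCS and apply_transducer are hypothetical names, and the empty string stands in for the 0 symbol.

```python
EPS = ""   # plays the role of the 0 (empty) symbol

# One path: state i --(upper symbol, lower symbol)--> state i + 1
PAIRS = [("m", "m"), ("e", "e"), ("s", "s"), ("a", "a"),
         ("Noun", EPS), ("Fem", EPS), ("Pl", "s")]
ARCS = {i: [(up, low, i + 1)] for i, (up, low) in enumerate(PAIRS)}
FINAL = len(PAIRS)

def apply_transducer(symbols, direction, state=0, output=()):
    """Yield every output symbol tuple for the given input symbols.

    direction="down" matches the upper side and emits the lower side
    (lookdown); direction="up" does the reverse (lookup).
    """
    if state == FINAL and not symbols:
        yield output
    for upper, lower, target in ARCS.get(state, []):
        inp, out = (upper, lower) if direction == "down" else (lower, upper)
        emitted = output + ((out,) if out != EPS else ())
        if inp == EPS:                        # arc consumes no input
            yield from apply_transducer(symbols, direction, target, emitted)
        elif symbols and symbols[0] == inp:   # arc consumes one symbol
            yield from apply_transducer(symbols[1:], direction, target, emitted)

print(list(apply_transducer(tuple("mesas"), "up")))
print(list(apply_transducer(("m", "e", "s", "a", "Noun", "Fem", "Pl"), "down")))
```

The same ARCS table serves both calls; only the side that is matched against the input changes, which is the sense in which such a system is bidirectional.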
24 A Morphological Analyzer
dog n pl
Transducer
dogs
25 A Morphological Analyzer
Lexical Language
Transducer
Surface Language
26 A Quick Review of Set Theory
- A set is a collection of objects.
[Figure: a set containing the elements B, A, E and D.]
We can enumerate the members or elements of
finite sets: {A, D, B, E}. There is no
significant order in a set, so {A, D, B, E} is
the same set as {E, A, D, B}, etc.
27 Uniqueness of Elements
- You cannot have two or more "A" elements in the
same set.
[Figure: a set containing the elements B, A, D and E.]
{A, A, D, B, E} is just a redundant
specification of the set {A, D, B, E}.
28 Cardinality of Sets
- The Empty Set
- A Finite Set, e.g. {Norway, Denmark, Sweden}
- An Infinite Set, e.g. the set of all positive
integers
29 Simple Operations on Sets: Union
[Figure: Set 1 and Set 2, which between them contain the elements A, B, C, D and E.]
Union of Set 1 and Set 2: {A, B, C, D, E}
30 Simple Operations on Sets (2): Union
Set 1: {A, B, C}
Set 2: {C, D}
Union of Set 1 and Set 2: {A, B, C, D}
31 Simple Operations on Sets (3): Intersection
Set 1: {A, B, C}
Set 2: {C, D}
Intersection of Set 1 and Set 2: {C}
32 Simple Operations on Sets (4): Subtraction
Set 1: {A, B, C}
Set 2: {C, D}
Set 1 minus Set 2: {A, B}
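These operations map directly onto Python's built-in set type. The snippet is a small illustration that assumes Set 1 is {A, B, C} and Set 2 is {C, D}, which is how the overlapping-sets slides appear to read; the variable names are chosen for this example.

```python
set1 = {"A", "B", "C"}
set2 = {"C", "D"}

print(sorted(set1 | set2))   # union: ['A', 'B', 'C', 'D']
print(sorted(set1 & set2))   # intersection: ['C']
print(sorted(set1 - set2))   # subtraction: ['A', 'B']

# Order and repetition do not matter in a set.
print({"A", "A", "D", "B", "E"} == {"E", "A", "D", "B"})   # True
```

The results are sorted only for display, since a set itself has no significant order.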
33 Formal Languages
Very Important Concept in Formal Language Theory:
A Language is just a Set of Words.
- We use the terms "word" and "string"
interchangeably.
- A Language can be empty, have finite
cardinality, or be infinite in size.
- You can union, intersect and subtract languages,
just like any other sets.
34 Union of Languages (Sets)
Language 1: {dog, cat, rat}
Language 2: {elephant, mouse}
Union of Language 1 and Language 2:
{dog, cat, rat, elephant, mouse}
35 Intersection of Languages (Sets)
Language 1: {dog, cat, rat}
Language 2: {elephant, mouse}
Intersection of Language 1 and Language 2:
{} (the empty language, since no word is shared)
36 Intersection of Languages (Sets)
Language 1: {dog, cat, rat}
Language 2: {rat, mouse}
Intersection of Language 1 and Language 2: {rat}
37 Subtraction of Languages (Sets)
Language 1: {dog, cat, rat}
Language 2: {rat, mouse}
Language 1 minus Language 2: {dog, cat}
38 Languages
- A language is a set of words (strings).
- Words (strings) are composed of symbols (letters)
that are concatenated together.
- At another level, words are composed of
morphemes.
- In most natural languages, we concatenate
morphemes together to form whole words.
For sets consisting of words (i.e. for
Languages), the operation of concatenation is
very important.
39 Concatenation of Languages
Root Language: {work, talk, walk}
Suffix Language: {0, ing, ed, s}, where 0 is the empty string
The concatenation of the Suffix Language after
the Root Language:
{work, working, worked, works, talk, talking,
talked, talks, walk, walking, walked, walks}
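Concatenating two languages means pairing every word of the first with every word of the second. A minimal sketch, assuming the empty string represents the 0 suffix; the function name concatenate is introduced here for illustration.

```python
roots = {"work", "talk", "walk"}
suffixes = {"", "ing", "ed", "s"}   # "" stands for the 0 (empty) suffix

def concatenate(first, second):
    """Return every string x + y with x from `first` and y from `second`."""
    return {x + y for x in first for y in second}

words = concatenate(roots, suffixes)
print(sorted(words))
print(len(words))   # 12

# Unlike union and intersection, concatenation depends on the order
# of its operands.
print(concatenate(suffixes, roots) == words)   # False
```

Three roots and four suffixes give twelve words here because no two root and suffix combinations happen to produce the same string.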
40 Languages and Networks
[Figure: Network/Language 1 and Network/Language 2, followed by the network for the concatenation of Network 1 and Network 2.]
41 Why is Finite State Computing so Interesting?
- Finite-state systems are mathematically elegant,
easily manipulated and modifiable.
- Computationally efficient. Usually very compact.
- The programming we linguists do is declarative.
We describe the facts of our natural language,
i.e. we write grammars. We do not hack ad hoc
code.
- The runtime code, which applies our systems to
linguistic input, is already written and it is
completely language-independent.
- Finite-state systems are inherently
bidirectional: we can use the same system to
analyze and to generate.
42 Languages, Notations and Machines
FINITE STATE LANGUAGE
FINITE STATE NOTATION