TP2663 Asas Pemprosesan Bahasa Tabii

About This Presentation

Slides: 80

Provided by: ftsm3

Transcript and Presenter's Notes

1
TP2663 Asas Pemprosesan Bahasa Tabii (Fundamentals of Natural Language Processing)

Regular Expressions (RE) and Finite State Automata (FSA)

2
Regular Expression (RE)
3
Regular Expressions

RE is a language for specifying text search strings.
RE has been used for text search in UNIX tools (vi, Perl, Emacs, grep) and in MS Word (version 6 and above).
Many RE features are used in various search engines.
RE is also a theoretical tool in computer science and linguistics.

4
Regular Expressions

A RE is a formula in a special language that is used to specify simple classes of strings.
String: a sequence of symbols. For most text-based search techniques,
a string is a sequence of alphanumeric characters (letters, numbers, spaces, tabs and punctuation).
In the examples that follow, a search returns the complete line that contains the match.

5
Basic Regular Expression Patterns

The simplest kind of RE is a sequence of simple characters.
For example, to search for woodchuck, type /woodchuck/
The search might return the sentence "I'm called a little woodchuck"
Other examples

6
Basic Regular Expression Patterns

RE is case-sensitive.
E.g. /woodchucks/ does not match Woodchucks
Use the square brackets [ and ] to specify a disjunction of characters.
Other examples
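
The two points on this slide can be tried out with Python's `re` module. This is only a sketch: the sample sentences are made up for illustration, and Python writes patterns without the surrounding slashes.

```python
import re

line = "I'm called a little woodchuck"

# A plain sequence of characters matches itself, and matching is case-sensitive.
literal = re.search(r"woodchuck", line)
wrong_case = re.search(r"woodchucks", "Woodchucks are rodents")

# Square brackets match any ONE of the characters listed inside them.
either_case = re.search(r"[wW]oodchucks", "Woodchucks are rodents")
```

Here `wrong_case` is `None` because the lowercase `w` never occurs in the sample text, while the bracketed pattern accepts either case for the first letter.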

7
Basic Regular Expression Patterns

To specify a range of characters to match, use the dash (-).
Brackets [ ] together with the dash (-) specify any one character in a range.
Other examples
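
A short illustration of ranges, again using Python's `re` module; the sample text is an assumption chosen only to contain digits and both letter cases.

```python
import re

text = "Chapter 1: Down the Rabbit Hole"

first_digit = re.search(r"[0-9]", text)   # any one digit
first_upper = re.search(r"[A-Z]", text)   # any one uppercase letter
first_lower = re.search(r"[a-z]", text)   # any one lowercase letter
```

Each pattern matches a single character; `re.search` returns the leftmost one it finds.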

8
Basic Regular Expression Patterns

The brackets [ ] and the caret ^ can be used together to specify a character that must not appear in the match.
For this purpose the ^ must come immediately after the [.
If the ^ is the first symbol after the open bracket [, the resulting pattern is negated.
E.g. the pattern /[^a]/ matches any single character except a.
Other examples
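
The position of the caret matters, which a small Python sketch can make concrete (the test strings are illustrative assumptions):

```python
import re

# Caret first inside the brackets: match any character that is NOT "a".
not_a = re.findall(r"[^a]", "banana")

# Caret anywhere else inside the brackets: it is just a literal caret.
caret_literal = re.findall(r"[a^]", "a^b")
```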

9
Basic Regular Expression Patterns

The /?/ operator makes the preceding character optional: the pattern matches whether or not that character is present.
/?/ means the preceding character or nothing.
Other examples
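
A minimal sketch of the optional-character operator in Python; the sample words are assumptions picked for illustration.

```python
import re

pattern = re.compile(r"woodchucks?")   # the final "s" may be present or absent

singular = pattern.search("one woodchuck")
plural = pattern.search("two woodchucks")
spellings = re.findall(r"colou?r", "color and colour")
```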

10
Basic Regular Expression Patterns

The Kleene * is used to match the preceding character whether or not it is present, possibly repeated several times in a row.
Kleene star (*) means zero or more occurrences of the immediately previous character or regular expression.
/a*/ means any string of zero or more a's
Other examples

11
Basic Regular Expression Patterns

The Kleene + is used to match the preceding character, which must occur at least once.
Kleene plus (+) means one or more of the previous character.
It is a shorter way to specify "at least one".
Example
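
The difference between "zero or more" and "one or more" shows up clearly on short inputs. The following Python sketch uses sheep-like strings as assumed test data:

```python
import re

star = re.compile(r"baa*!")   # b, a, then zero or more further a's, then !
plus = re.compile(r"baa+!")   # b, a, then one or more further a's, then !

star_short = star.fullmatch("ba!")     # accepted: the starred "a" occurs zero times
plus_short = plus.fullmatch("ba!")     # rejected: the "+" needs at least one more "a"
plus_long = plus.fullmatch("baaaa!")

empty_ok = re.fullmatch(r"a*", "")     # zero a's is still a match
numbers = re.findall(r"[0-9]+", "room 101, floor 7")
```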

12
Basic Regular Expression Patterns

/./ matches any single character (except a carriage return): the wildcard.
The wildcard is often used with the Kleene * to mean any string of characters.
Example
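
A small Python sketch of the wildcard, alone and combined with the Kleene star. The sample strings are assumptions; note that in Python the `.` excludes the newline character by default.

```python
import re

# One wildcard: exactly one arbitrary character between "beg" and "n".
one_char = re.findall(r"beg.n", "begin began begun beg'n")

# Wildcard plus Kleene star: any string of characters between the two words.
span = re.search(r"aardvark.*aardvark", "the aardvark met another aardvark today")
```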

13
Basic Regular Expression Patterns

Anchors tie a RE to a particular location in a string.
The most common anchors are ^ and $.
^ matches the start of a line.
E.g. /^The/ matches The only at the start of a line.
$ matches the end of a line.
Example
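
The anchors can be sketched in Python as follows. The sentences are illustrative assumptions; `re.MULTILINE` is the standard flag that lets `^` match at the start of every line rather than only at the start of the string.

```python
import re

at_start = re.search(r"^The", "The dog barked")
not_at_start = re.search(r"^The", "See The dog")
at_end = re.search(r"dog\.$", "I saw the dog.")

lines = "The cat\nsaw The dog\nThe end"
line_starts = re.findall(r"^The", lines, flags=re.MULTILINE)
```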

14
Basic Regular Expression Patterns

Other anchors are \b and \B.
\b matches a word boundary.
\B matches a non-boundary.
Example
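
A brief Python sketch of the two boundary anchors, using made-up phrases in which "the" occurs both as a word and inside longer words:

```python
import re

# \b on both sides: "the" only as a whole word.
whole_words = re.findall(r"\bthe\b", "the other one, then the end")

# \B on both sides: "the" only when it sits strictly inside a longer word.
inside_words = re.findall(r"\Bthe\B", "the other one")
```

In the second call the only match is the "the" inside "other".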

15
Finite State Automata (FSA)
16
Three Views

Three equivalent formal ways to look at what
we're up to.

Regular Expressions
Finite State Automata
Regular Languages
17
Finite State Automata

Terminology: Finite State Automata, Finite State Machines, FSA, Finite Automata
Regular expressions are one way of describing finite-state automata (FSA).
REs can be implemented with FSAs.
FSAs can be described with REs.
Both REs and FSAs can be used to describe regular languages.

18
Intuition FSAs as Graphs

Let's start with the sheep language from the text:
/baa+!/

19
Sheep FSA

We can say the following things about this
machine:
It has 5 states
At least b, a, and ! are in its alphabet
q0 is the start state
q4 is an accept state/final state
It has 5 transitions

20
But note

There are other machines that correspond to this
language
More on this one later

21
More Formally Defining an FSA

You can specify an FSA by enumerating the
following things:
The set of states Q
A finite alphabet Σ
A start state q0
A set F of accepting/final states, F ⊆ Q
A transition function δ(q,i) that maps Q × Σ to Q

22
Yet Another View

State-transition table

To read the first row: if we're in state 0 and
we see the input b, we must go to state 1. If
we're in state 0 and we see the input a or !, we
fail.
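
The table itself did not survive in this transcript, so the following is a hedged reconstruction as a Python dictionary. It assumes the sheep language is /baa+!/ and that the states are numbered 0 to 4 in the order they are reached; the names `TABLE`, `count_transitions` and `format_table` are hypothetical helpers for this sketch.

```python
# Sheep-language FSA laid out as a state-transition table.
# Assumption: states 0..4, start state 0, final state 4.
# None marks "no transition", which means the machine fails.
ALPHABET = ("b", "a", "!")
START = 0
FINAL = {4}
TABLE = {
    0: {"b": 1, "a": None, "!": None},
    1: {"b": None, "a": 2, "!": None},
    2: {"b": None, "a": 3, "!": None},
    3: {"b": None, "a": 3, "!": 4},
    4: {"b": None, "a": None, "!": None},
}


def count_transitions(table):
    """Count the table cells that actually lead somewhere."""
    return sum(t is not None for row in table.values() for t in row.values())


def format_table(table):
    """Render the table as text, one row per state, '-' for failure."""
    header = "state  " + "  ".join(ALPHABET)
    rows = [
        f"{state:>5}  " + "  ".join("-" if row[s] is None else str(row[s]) for s in ALPHABET)
        for state, row in sorted(table.items())
    ]
    return "\n".join([header] + rows)
```

Row 0 reads exactly as described: input b leads to state 1, while a and ! are failures.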
23
Exercise

Produce the state-transition table for the following finite-state
automaton

24
Recognition

Recognition is the process of determining if a
string should be accepted by a machine
Or it's the process of determining if a string
is in the language we're defining with the
machine
Or it's the process of determining if a regular
expression matches a string

25
Recognition

Traditionally (Turing's idea), this process is
depicted with a tape.

26
Recognition

Start in the start state
Examine the current input
Consult the table
Go to a new state and update the tape pointer.
Until you run out of tape.

If we are in an accepting state when we run out
of input, the machine has successfully recognized
the string as being in the language
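
The steps above can be sketched as a small table-driven loop in Python. This is a minimal sketch, not the slide's own pseudocode: the function name `d_recognize` and the sparse table `SHEEP` (missing entries mean failure) are assumptions, and the machine is assumed to be the sheep FSA for /baa+!/.

```python
# Sparse transition table: a missing (state, symbol) entry means "fail".
SHEEP = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
}


def d_recognize(tape, table, start, finals):
    """Return True if the deterministic machine accepts the whole tape."""
    state = start                      # start in the start state
    for symbol in tape:                # examine the current input
        state = table.get(state, {}).get(symbol)   # consult the table
        if state is None:              # no transition: reject immediately
            return False
    return state in finals             # out of tape: accept only in a final state
```

Changing the machine means changing only the table passed in; the loop stays the same.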
27
D-Recognize: a deterministic algorithm
Key points
28
Tracing D-Recognize
To read the first row: if we're in state 0 and
we see the input b, we must go to state 1. If
we're in state 0 and we see the input a or !, we
fail.
29
Key Points

Deterministic means that at each point in
processing there is always one unique thing to do
(no choices).
D-recognize is a simple table-driven interpreter
The algorithm is universal for all unambiguous
regular languages.
To change the machine, you change the table.

30
Key Points

Crudely, therefore, matching strings with regular
expressions (à la Perl) is a matter of
translating the expression into a machine (table)
and
passing the table to an interpreter

31
Generative Formalisms

Formal Languages are sets of strings composed of
symbols from a finite set of symbols.
Finite-state automata define formal languages
(without having to enumerate all the strings in
the language)
The term Generative is based on the view that you
can run the machine as a generator to get strings
from the language.

32
Generative Formalisms

FSAs can be viewed from two perspectives
Acceptors that can tell you if a string is in the
language
Generators to produce all and only the strings in
the language
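
The generator view can be sketched by walking the same table and emitting every string that ends in a final state. Because the sheep language is infinite, this sketch stops at a maximum length; `generate`, `SHEEP` and the length bound are assumptions of the example.

```python
from collections import deque

# Sparse table for the sheep FSA (assumed to be /baa+!/).
SHEEP = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
}


def generate(table, start, finals, max_len):
    """Breadth-first walk that lists all accepted strings up to max_len."""
    results = []
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in finals:
            results.append(prefix)
        if len(prefix) < max_len:
            for symbol, nxt in sorted(table.get(state, {}).items()):
                queue.append((nxt, prefix + symbol))
    return results
```

For example, `generate(SHEEP, 0, {4}, 6)` yields the three shortest sheep strings, shortest first.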

33
Dollars and Cents
34
Summary

Regular expressions are just a compact textual
representation of FSAs
Recognition is the process of determining if a
string/input is in the language defined by some
machine.
Recognition is straightforward with deterministic
machines.

35
Three Views

Three equivalent formal ways to look at what
we're up to (thanks to Martin Kay)

Regular Expressions
Finite State Automata
Regular Languages
36
Non-Determinism
DFSA
NFSA
37
Non-Determinism cont.

Yet another technique
Epsilon transitions
Key point: these transitions do not examine or
advance the tape during recognition
Read: if we are in state 3, we are allowed to
move to state 2 without looking at the input or
advancing our input pointer.

38
Using NFSA to accept strings

Since there is more than one choice at some time,
we might take the wrong choice.
Three standard solutions:
Backup: put a marker when we come to a choice
point. If we took the wrong choice, go back
and try another path
Look-ahead: look ahead in the input to help us
decide which path to take
Parallelism: look at every alternative path in
parallel when we come to a choice point.

39
Non-Deterministic Recognition

In a NFSA there exists at least one path through
the machine for a string that is in the language
defined by the machine.
But not all paths through the machine
for an accepted string lead to an accept state.
No paths through the machine lead to an accept
state for a string not in the language.

40
Non-Deterministic Recognition

So success in a non-deterministic recognition
occurs when a path is found through the machine
that ends in an accept state.
Failure occurs when none of the possible paths
lead to an accept state.

41
Example
42
Example
43
Example
44
Example
45
Example
46
Example
47
Example
48
Example
49
Example
50
Key Points

States in the search space are pairings of tape
positions and states in the machine (search-state
vs machine-state).
By keeping track of as yet unexplored states, a
recognizer can systematically explore all the
paths through the machine given an input.

51
ND-Recognize Code
52
ND-Recognize Code cont.

Agenda: keeps track of all the currently
unexplored choices generated during the course of
processing
Current-search-state: the branch choice being
currently explored
ACCEPT-STATE?: returns accept if the current
search-state contains an accepting machine-state and
a pointer to the end of the tape.
GENERATE-NEW-STATES: creates search-states for
any ε-transitions and any normal input-symbol
transitions from the transition table.
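
The ND-Recognize code itself is not in this transcript, so here is a hedged Python sketch built from the four pieces named above. The function names mirror the slide's terms, but the table layout, the `EPS` label for ε-transitions, and the `seen` set are assumptions of this sketch. The `seen` set keeps a search-state from being explored twice, which also prevents endless looping on a cycle of ε-transitions.

```python
EPS = ""  # label for epsilon transitions (an assumption of this sketch)

# Non-deterministic sheep machine: in state 2, "a" may stay in 2 or move to 3.
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},
    3: {"!": {4}},
}


def generate_new_states(search_state, tape, table):
    """Search-states reachable by one epsilon or one input-symbol transition."""
    node, index = search_state
    arcs = table.get(node, {})
    new = [(nxt, index) for nxt in arcs.get(EPS, ())]     # tape not advanced
    if index < len(tape):
        new += [(nxt, index + 1) for nxt in arcs.get(tape[index], ())]
    return new


def accept_state(search_state, tape, finals):
    """True for an accepting machine-state at the end of the tape."""
    node, index = search_state
    return index == len(tape) and node in finals


def nd_recognize(tape, table, start, finals):
    agenda = [(start, 0)]          # unexplored (machine-state, tape-position) pairs
    seen = set(agenda)
    while agenda:
        current = agenda.pop()     # popping from the end gives depth-first search
        if accept_state(current, tape, finals):
            return True
        for state in generate_new_states(current, tape, table):
            if state not in seen:
                seen.add(state)
                agenda.append(state)
    return False                   # every path has been tried
```

Treating the agenda as a stack gives depth-first search with backup; using a queue instead would give breadth-first search.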

53
Recognition as Search

ND-RECOGNIZE accomplishes the task of recognizing
strings in a regular language by providing a way
to systematically explore all the possible paths
through a machine.
You can view this algorithm as state-space
search.
States are pairings of tape positions and state
numbers given by the machine.
Operators are compiled into the table
Goal state is a pairing with the end of tape
position and a final accept state.

54
Infinite Search

If you're not careful, such searches can go into
an infinite loop.
How?

55
Why Bother?

Non-determinism doesn't get us more formal power
and it causes headaches, so why bother?
More natural solutions
Machines based on construction are too big

56
Equivalence

Non-deterministic machines can be converted to
deterministic ones with a fairly simple
construction
That means that they have the same power:
non-deterministic machines are not more powerful
than deterministic ones
It also means that one way to do recognition with
a non-deterministic machine is to turn it into a
deterministic one.
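
One well-known construction of this kind is the subset construction, in which each deterministic state stands for a set of non-deterministic states. The slides do not say which construction they have in mind, so the Python sketch below is an assumption; it also assumes a machine without ε-transitions, and `determinize` and `accepts` are hypothetical names.

```python
# Non-deterministic sheep machine (no epsilon transitions).
NFSA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},
    3: {"!": {4}},
}


def determinize(table, start, finals):
    """Subset construction: build a deterministic table over sets of states."""
    start_set = frozenset([start])
    dfa = {}
    todo = [start_set]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue
        row = {}
        symbols = {s for q in current for s in table.get(q, {})}
        for symbol in symbols:
            target = frozenset(
                t for q in current for t in table.get(q, {}).get(symbol, ())
            )
            row[symbol] = target
            todo.append(target)
        dfa[current] = row
    dfa_finals = {s for s in dfa if s & finals}
    return dfa, start_set, dfa_finals


def accepts(dfa, start, finals, tape):
    """Deterministic recognition over the constructed table."""
    state = start
    for symbol in tape:
        state = dfa[state].get(symbol)
        if state is None:
            return False
    return state in finals
```

In the worst case the number of state sets grows exponentially with the number of original states, which is one reason to keep the non-deterministic machine instead.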

57
Regular languages

The class of languages characterizable by regular
expressions
Given an alphabet Σ, the regular languages over Σ are:
The empty set ∅ is a regular language
∀a ∈ Σ ∪ {ε}, {a} is a regular language
If L1 and L2 are regular languages, then so are:
L1 · L2 = {xy | x ∈ L1, y ∈ L2}, the concatenation of L1 and L2
L1 ∪ L2, the union of L1 and L2
L1*, the Kleene closure of L1
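
The three closure operations can be tried on small finite languages represented as Python sets of strings. This is only an illustration: the Kleene closure is infinite for any language containing a non-empty string, so the hypothetical `kleene` helper below truncates it at a maximum string length.

```python
def concat(l1, l2):
    """All strings xy with x from l1 and y from l2."""
    return {x + y for x in l1 for y in l2}


def union(l1, l2):
    """All strings that are in l1 or in l2."""
    return l1 | l2


def kleene(l1, max_len):
    """Kleene closure of l1, restricted to strings of length <= max_len."""
    result = {""}                 # zero repetitions gives the empty string
    frontier = {""}
    while frontier:
        frontier = {
            x + y for x in frontier for y in l1 if len(x + y) <= max_len
        } - result
        result |= frontier
    return result
```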

58
Going from regexp to FSA

Since all regular languages meet the above properties,
and regular languages are the languages characterizable by
regular expressions,
all regular expression operators can be
implemented by combinations of concatenation,
disjunction (union), and closure
Counters (*, +) are repetition plus closure
Anchors are individual symbols
[ ] and ( ) and . are kinds of disjunction

59
Going from regexp to FSA

So if we could just show how to turn
closure/union/concatenation from regexps into FSAs, this
would give an idea of how FSA compilation works.
The actual proof that regular languages = FSAs has 2 parts:
An FSA can be built for each regular language
A regular language can be built for each automaton
So I'll give the intuition of the first part:
Take any regular expression and build an
automaton
Intuition: induction
Base case: build an automaton for a single symbol
(say a)
Inductive step: show how to imitate the 3 regexp
operations in automata

60
Union

Accept a string in either of two languages

61
Concatenation

Accept a string consisting of a string from
language L1 followed by a string from language L2.
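
The diagrams for union and concatenation are not in this transcript. A common way to build both is to glue two machines together with ε-transitions, and the Python sketch below assumes that approach. A machine is a `(table, start, finals)` triple; `EPS`, `tag`, `symbol`, `union`, `concat` and `accepts` are hypothetical names for this sketch.

```python
EPS = ""  # label for epsilon transitions (an assumption of this sketch)


def symbol(a):
    """Base case: a two-state machine that accepts the single symbol a."""
    return {0: {a: {1}}}, 0, {1}


def tag(machine, label):
    """Rename every state to (label, state) so two machines cannot clash."""
    table, start, finals = machine
    new_table = {
        (label, q): {s: {(label, t) for t in ts} for s, ts in arcs.items()}
        for q, arcs in table.items()
    }
    return new_table, (label, start), {(label, f) for f in finals}


def union(m1, m2):
    """New start state with epsilon arcs to both original start states."""
    t1, s1, f1 = tag(m1, 1)
    t2, s2, f2 = tag(m2, 2)
    return {**t1, **t2, "start": {EPS: {s1, s2}}}, "start", f1 | f2


def concat(m1, m2):
    """Epsilon arcs from every final state of m1 to the start state of m2."""
    t1, s1, f1 = tag(m1, 1)
    t2, s2, f2 = tag(m2, 2)
    table = {**t1, **t2}
    for f in f1:
        table.setdefault(f, {}).setdefault(EPS, set()).add(s2)
    return table, s1, f2


def accepts(machine, tape):
    """Follow all paths in parallel, tracking the set of current states."""
    table, start, finals = machine

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for t in table.get(q, {}).get(EPS, ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for sym in tape:
        step = {t for q in current for t in table.get(q, {}).get(sym, ())}
        current = closure(step)
    return bool(current & finals)
```

For instance, `concat(symbol("a"), symbol("b"))` accepts only "ab", while `union(symbol("a"), symbol("b"))` accepts "a" or "b".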

62
FSAs and Computational Morphology

An important use of FSAs is for morphology, the
study of word parts

63
English Morphology

Morphology is the study of the ways that words
are built up from smaller meaningful units called
morphemes
We can usefully divide morphemes into two classes
Stems: the core meaning-bearing units
Affixes: bits and pieces that adhere to stems to
change their meanings and grammatical functions

64
Nouns and Verbs (English)

Nouns are simple (not really)
Markers for plural and possessive
Verbs are only slightly more complex
Markers appropriate to the tense of the verb

65
Regulars and Irregulars

OK, so it gets a little complicated by the fact
that some words misbehave (refuse to follow the
rules)
Mouse/mice, goose/geese, ox/oxen
Go/went, fly/flew
The terms regular and irregular will be used to
refer to words that follow the rules and those
that don't.

66
Regular and Irregular Nouns and Verbs

Regulars
Walk, walks, walking, walked, walked
Table, tables
Irregulars
Eat, eats, eating, ate, eaten
Catch, catches, catching, caught, caught
Cut, cuts, cutting, cut, cut
Goose, geese

67
Compute

Many paths are possible
Start with compute
Computer -> computerize -> computerization
Computation -> computational
Computer -> computerize -> computerizable
Compute -> computee

68
Why care about morphology?

Stemming in information retrieval
Might want to search for "aardvark" and find
pages with both "aardvark" and "aardvarks"
Morphology in machine translation
Need to know that the Spanish words quiero and
quieres are both related to querer "want"
Morphology in spell checking
Need to know that misclam and antiundoggingly are
not words despite being made up of word parts

69
Can't just list all words

Turkish
Uygarlastiramadiklarimizdanmissinizcasina
"(behaving) as if you are among those whom we
could not civilize"
Uygar "civilized" + las "become" + tir "cause"
+ ama "not able" + dik "past" + lar "plural" + imiz
"p1pl" + dan "abl" + mis "past" + siniz "2pl"
+ casina "as if"

70
What we want

Something to automatically do the following kinds
of mappings:
Cats -> cat +N +PL
Cat -> cat +N +SG
Cities -> city +N +PL
Merging -> merge +V +Present-participle
Caught -> catch +V +past-participle

71
FSAs and the Lexicon

This will actually require a kind of FSA we won't
be studying: the Finite State Transducer (FST)
But we'll give a quick overview anyhow
First we'll capture the morphotactics:
the rules governing the ordering of affixes in a
language.
Then we'll add in the actual words

72
Simple Rules
73
Adding the Words
74
Derivational Rules
75
Parsing/Generation vs. Recognition

Recognition is usually not quite what we need.
Usually if we find some string in the language we
need to find the structure in it (parsing)
Or we have some structure and we want to produce
a surface form (production/generation)
Example:
From cats to cat +N +PL
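
The difference between the two directions can be illustrated with a toy lookup table in Python. This is emphatically not a transducer: it is a hand-written dictionary standing in for one, and the analysis strings (including the "+" tag prefix) and the names `ANALYSES`, `parse` and `generate` are assumptions made for this sketch.

```python
# Toy stand-in for a transducer: surface form -> analysis.
ANALYSES = {
    "cats": "cat +N +PL",
    "cat": "cat +N +SG",
    "cities": "city +N +PL",
    "merging": "merge +V +Present-participle",
    "caught": "catch +V +past-participle",
}

# The same pairs read in the opposite direction: analysis -> surface form.
SURFACE = {analysis: surface for surface, analysis in ANALYSES.items()}


def parse(surface):
    """Parsing: recover the structure hidden in a surface form."""
    return ANALYSES.get(surface.lower())


def generate(analysis):
    """Generation: produce the surface form for a given structure."""
    return SURFACE.get(analysis)
```

A recognizer would only answer yes or no for "cats"; `parse` returns the structure, and `generate` maps that structure back to the word.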

76
Finite State Transducers