TP2663 Asas Pemprosesan Bahasa Tabii (Fundamentals of Natural Language Processing)

Slides: 80
Provided by: ftsm3
1
TP2663 Asas Pemprosesan Bahasa Tabii (Fundamentals of Natural Language Processing)
  • Regular Expressions (RE) and Finite State Automata
    (FSA)

2
Regular Expression (RE)
3
Regular Expressions
  • An RE is a language for specifying text search
    strings.
  • REs are used for text search in UNIX tools (vi,
    Perl, Emacs, grep) and in MS Word (version 6 and
    above).
  • Many RE features are used in various search
    engines.
  • REs are also a theoretical tool in computer
    science and linguistics.

4
Regular Expressions
  • An RE is a formula in a special language that is
    used to specify a simple class of strings.
  • String: a sequence of symbols. For most
    text-based search techniques,
  • a string is a sequence of alphanumeric characters
    (letters, numbers, spaces, tabs, and punctuation).
  • In the examples that follow, a search returns the
    complete line that contains the match.

5
Basic Regular Expression Patterns
  • The simplest kind of RE is a sequence of simple
    characters.
  • For example, to search for woodchuck, type
    /woodchuck/.
  • The search might return the line "I'm called a
    little woodchuck".
  • Other examples
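
A minimal sketch of a literal pattern search, assuming Python's `re` module stands in for the search tools named in the slides. The sample lines are made up for illustration; the search returns whole lines, as the slides describe.

```python
import re

# Hypothetical sample text; each list item plays the role of one line.
lines = [
    "I'm called a little woodchuck",
    "No rodents on this line",
]

# re.search looks for the pattern anywhere in the line, much like grep.
matching_lines = [line for line in lines if re.search(r"woodchuck", line)]
print(matching_lines)
```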

6
Basic Regular Expression Patterns
  • REs are case-sensitive.
  • E.g. /woodchucks/ does not match Woodchucks.
  • Use [ and ] to specify a disjunction of
    characters.
  • Other examples
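
A short sketch of case sensitivity and bracketed disjunction, again assuming Python's `re` module. The sample text is invented for illustration.

```python
import re

text = "Woodchucks and woodchucks"

# A plain pattern is case-sensitive: only the lowercase form matches.
case_sensitive = re.findall(r"woodchucks", text)

# [wW] matches either w or W at that position.
either_case = re.findall(r"[wW]oodchucks", text)

print(case_sensitive, either_case)
```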

7
Basic Regular Expression Patterns
  • To specify a range of characters in a search, use
    the dash (-).
  • Brackets [ ] and the dash (-) are used together
    to specify any one character in a range.
  • Other examples
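
A brief sketch of character ranges inside brackets, assuming Python's `re` module and an invented sample string.

```python
import re

text = "Chapter 7: Drenched Blossoms"

# [A-Z] matches any single uppercase letter.
uppercase = re.findall(r"[A-Z]", text)

# [0-9] matches any single digit.
digits = re.findall(r"[0-9]", text)

print(uppercase, digits)
```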

8
Basic Regular Expression Patterns
  • The caret ^ can be used inside [ ] to make sure a
    given character does not appear in the match.
  • For this purpose, the ^ must come immediately
    after the opening [.
  • If the ^ is the first symbol after the open
    bracket [, the resulting pattern is negated.
  • E.g. the pattern /[^a]/ matches any single
    character except a.
  • Other examples
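
A small sketch of negated character classes, assuming Python's `re` module. The second pattern shows that a caret which is not the first symbol inside the brackets is treated as an ordinary character.

```python
import re

# Caret first inside the brackets: match anything except 'a'.
not_a = re.findall(r"[^a]", "banana")

# Caret elsewhere inside the brackets: it is a literal '^'.
literal_caret = re.findall(r"[a^]", "a^b")

print(not_a, literal_caret)
```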

9
Basic Regular Expression Patterns
  • The /?/ is used to match the preceding character
    whether or not it is present.
  • /?/ means the preceding character or nothing.
  • Other examples
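
A minimal sketch of the optional operator, assuming Python's `re` module. The sample phrase is made up for illustration.

```python
import re

# The trailing s is optional, so singular and plural both match.
pattern = re.compile(r"woodchucks?")
found = pattern.findall("a woodchuck, two woodchucks")
print(found)
```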

10
Basic Regular Expression Patterns
  • The Kleene * is used to match the preceding
    character whether or not it is present, and
    possibly repeated in sequence.
  • Kleene star (*) means zero or more occurrences
    of the immediately previous character or regular
    expression.
  • /a*/ means any string of zero or more a's.
  • Other examples
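
A short sketch of the Kleene star, assuming Python's `re` module. `fullmatch` is used so that the whole string has to fit the pattern, which makes the "zero or more" behaviour easy to see.

```python
import re

star = re.compile(r"ba*")  # b followed by zero or more a's
samples = ["b", "ba", "baaa", "bab"]
star_results = [star.fullmatch(s) is not None for s in samples]

# a* on its own also matches the empty string (zero occurrences).
empty_ok = re.fullmatch(r"a*", "") is not None

print(star_results, empty_ok)
```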

11
Basic Regular Expression Patterns
  • The Kleene + is used to match the preceding
    character, which must appear at least once.
  • Kleene plus (+) means one or more of the previous
    character.
  • It is a shorter way to specify "at least one".
  • Example
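
A brief sketch showing that the Kleene plus is shorthand for "one occurrence followed by a Kleene star", assuming Python's `re` module. The digit patterns are an illustrative choice, not taken from the slides.

```python
import re

plus = re.compile(r"[0-9]+")            # one or more digits
longhand = re.compile(r"[0-9][0-9]*")   # same language, written with *

samples = ["", "7", "2663", "x"]
plus_results = [plus.fullmatch(s) is not None for s in samples]
longhand_results = [longhand.fullmatch(s) is not None for s in samples]

print(plus_results, longhand_results)
```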

12
Basic Regular Expression Patterns
  • The /./ matches any single character (except a
    carriage return): the wildcard.
  • The wildcard is often used with the Kleene * to
    mean any string of characters.
  • Example
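
A small sketch of the wildcard, assuming Python's `re` module. Note one difference from the slide's wording: in Python the dot excludes the newline character by default, rather than the carriage return.

```python
import re

# The dot stands for exactly one character of any kind.
words = re.findall(r"beg.n", "begin began begun beg'n begn")

# .* is greedy: it spans as much text as possible between the two parts.
span = re.search(r"wood.*chuck", "woodland creatures like the woodchuck").group()

# By default the dot does not match a newline.
crosses_newline = re.search(r"a.b", "a\nb") is not None

print(words, span, crosses_newline)
```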

13
Basic Regular Expression Patterns
  • Anchors tie an RE to a particular location in a
    string.
  • The most commonly used anchors are ^ and $.
  • ^ matches the start of a line.
  • E.g. /^The/ matches The only when it is at the
    start of a line.
  • $ matches the end of a line.
  • Example
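
A minimal sketch of the start and end anchors, assuming Python's `re` module and invented sample sentences. In Python the anchors refer to the whole string unless `re.MULTILINE` is set, in which case they also apply at each line break.

```python
import re

at_start = re.search(r"^The", "The dog barked.") is not None
not_at_start = re.search(r"^The", "See The dog.") is not None
at_end = re.search(r"dog\.$", "The dog.") is not None

# With re.MULTILINE, ^ matches at the start of every line.
per_line = re.findall(r"^The", "The cat\nThe dog", flags=re.MULTILINE)

print(at_start, not_at_start, at_end, per_line)
```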

14
Basic Regular Expression Patterns
  • Two other anchors are \b and \B.
  • \b matches a word boundary.
  • \B matches a non-boundary (a position that is not
    a word boundary).
  • Example
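
A short sketch of the word-boundary anchors, assuming Python's `re` module. The sample text is made up so that "the" appears both as a whole word and inside longer words.

```python
import re

text = "the other theology, the"

# \bthe\b: "the" as a whole word only.
whole_word = re.findall(r"\bthe\b", text)

# \Bthe\B: "the" strictly inside a longer word (here, inside "other").
inside_word = re.findall(r"\Bthe\B", text)

print(len(whole_word), len(inside_word))
```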

15
Finite State Automata (FSA)
16
Three Views
  • Three equivalent formal ways to look at what
    we're up to.

Regular Expressions
Finite State Automata
Regular Languages
17
Finite State Automata
  • Terminology: Finite State Automata, Finite State
    Machines, FSA, Finite Automata
  • Regular expressions are one way of describing
    finite-state automata (FSA).
  • REs can be implemented with FSAs.
  • FSAs can be described with REs.
  • REs and FSAs can both be used to describe Regular
    Languages.

18
Intuition: FSAs as Graphs
  • Let's start with the sheep language from the text
  • /baa+!/

19
Sheep FSA
  • We can say the following things about this
    machine
  • It has 5 states
  • At least b, a, and ! are in its alphabet
  • q0 is the start state
  • q4 is an accept state/final state
  • It has 5 transitions

20
But note
  • There are other machines that correspond to this
    language
  • More on this one later

21
More Formally: Defining an FSA
  • You can specify an FSA by enumerating the
    following things.
  • The set of states Q
  • A finite alphabet Σ
  • A start state q0
  • A set F of accepting/final states, F ⊆ Q
  • A transition function δ(q,i) that maps Q × Σ to Q
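
The five components above can be written down directly as a data structure. This is a sketch under assumptions: the class name `FSA` and its field names are hypothetical, and the machine shown is the sheep automaton as described on the earlier slide (5 states, 5 transitions, start state q0, final state q4), assuming it accepts b, then two or more a's, then !.

```python
from dataclasses import dataclass


@dataclass
class FSA:
    states: frozenset    # Q
    alphabet: frozenset  # Σ
    start: int           # q0
    finals: frozenset    # F, a subset of Q
    delta: dict          # δ: (state, symbol) -> state


# Assumed sheep machine: q3 loops on 'a', giving "two or more a's".
sheep = FSA(
    states=frozenset({0, 1, 2, 3, 4}),
    alphabet=frozenset("ba!"),
    start=0,
    finals=frozenset({4}),
    delta={
        (0, "b"): 1,
        (1, "a"): 2,
        (2, "a"): 3,
        (3, "a"): 3,
        (3, "!"): 4,
    },
)

print(len(sheep.states), len(sheep.delta))
```

Pairs missing from `delta` play the role of the "fail" entries in a state-transition table.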

22
Yet Another View
  • State-transition table

To read the first row: if we're in state 0 and
we see the input b, we must go to state 1. If
we're in state 0 and we see the input a or !, we
fail.
23
Exercise
  • Produce the state-transition table for the
    following finite-state automaton

24
Recognition
  • Recognition is the process of determining if a
    string should be accepted by a machine
  • Or it's the process of determining if a string
    is in the language we're defining with the
    machine
  • Or it's the process of determining if a regular
    expression matches a string

25
Recognition
  • Traditionally (Turing's idea), this process is
    depicted with a tape.

26
Recognition
  • Start in the start state
  • Examine the current input
  • Consult the table
  • Go to a new state and update the tape pointer.
  • Until you run out of tape.

If we are in an accepting state when we run out
of input, the machine has successfully recognized
the string as a member of the language
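
The recognition loop just described can be sketched as a small table-driven function. This is an illustration, not the slide's own D-Recognize code: the function name `d_recognize`, the dictionary encoding of the table, and the sheep table itself (b, two or more a's, then !) are assumptions.

```python
def d_recognize(tape, table, start, finals):
    """Return True if the deterministic machine accepts the tape."""
    state = start                      # start in the start state
    for symbol in tape:                # examine the current input
        state = table.get((state, symbol))  # consult the table
        if state is None:              # empty table cell: fail
            return False
    return state in finals             # out of tape: accept if in a final state


# Assumed transition table for the sheep language.
SHEEP_TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

for s in ["baa!", "baaaa!", "ba!", "baa"]:
    print(s, d_recognize(s, SHEEP_TABLE, 0, {4}))
```

Swapping in a different table changes the machine without touching the function.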
27
D-Recognize: a deterministic algorithm
Key points
28
Tracing D-Recognize
To read the first row: if we're in state 0 and
we see the input b, we must go to state 1. If
we're in state 0 and we see the input a or !, we
fail.
29
Key Points
  • Deterministic means that at each point in
    processing there is always one unique thing to do
    (no choices).
  • D-recognize is a simple table-driven interpreter
  • The algorithm is universal for all unambiguous
    languages.
  • To change the machine, you change the table.

D-Recognize
30
Key Points
  • Crudely, therefore, matching strings with regular
    expressions (à la Perl) is a matter of
  • translating the expression into a machine (table)
    and
  • passing the table to an interpreter

31
Generative Formalisms
  • Formal Languages are sets of strings composed of
    symbols from a finite set of symbols.
  • Finite-state automata define formal languages
    (without having to enumerate all the strings in
    the language)
  • The term Generative is based on the view that you
    can run the machine as a generator to get strings
    from the language.

32
Generative Formalisms
  • FSAs can be viewed from two perspectives
  • Acceptors that can tell you if a string is in the
    language
  • Generators to produce all and only the strings in
    the language
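
The generator view can be sketched by walking the same kind of transition table and emitting every string that ends in a final state. This is an illustrative sketch: the function name `generate`, the length bound `max_len` (needed because the language is infinite), and the sheep table are assumptions.

```python
from collections import deque


def generate(table, start, finals, max_len):
    """List all accepted strings of at most max_len symbols, shortest first."""
    results = []
    queue = deque([(start, "")])       # (machine state, string produced so far)
    while queue:
        state, prefix = queue.popleft()
        if state in finals:
            results.append(prefix)
        if len(prefix) < max_len:
            for (src, symbol), dst in sorted(table.items()):
                if src == state:
                    queue.append((dst, prefix + symbol))
    return results


# Assumed sheep machine: b, two or more a's, then !.
SHEEP_TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

generated = generate(SHEEP_TABLE, 0, {4}, max_len=6)
print(generated)
```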

33
Dollars and Cents
34
Summary
  • Regular expressions are just a compact textual
    representation of FSAs
  • Recognition is the process of determining if a
    string/input is in the language defined by some
    machine.
  • Recognition is straightforward with deterministic
    machines.

35
Three Views
  • Three equivalent formal ways to look at what
    we're up to (thanks to Martin Kay)

Regular Expressions
Finite State Automata
Regular Languages
36
Non-Determinism
DFSA
NFSA
37
Non-Determinism cont.
  • Yet another technique
  • Epsilon transitions
  • Key point: these transitions do not examine or
    advance the tape during recognition
  • Read: if we are in state 3, we are allowed to
    move to state 2 without looking at the input or
    advancing our input pointer.


38
Using NFSA to accept strings
  • Since there may be more than one choice at some
    point, we might make the wrong choice.
  • 3 standard solutions
  • Backup: put a marker when we come to a choice
    point. If we took the wrong choice, go back up
    and try another path
  • Look-ahead: look ahead in the input to help us
    decide which path to take
  • Parallelism: look at every alternative path in
    parallel when we come to a choice point.
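
Of the three strategies, parallelism is the shortest to sketch: keep the set of all states the machine could be in and advance the whole set on each input symbol. The function name `parallel_recognize`, the table encoding, and the example machine (a non-deterministic sheep automaton where state 2 can either loop on a or move on) are assumptions for illustration. ε-transitions are not handled here.

```python
def parallel_recognize(tape, delta, start, finals):
    """Track every reachable machine state at once; no backtracking needed."""
    current = {start}
    for symbol in tape:
        current = {
            nxt
            for state in current
            for nxt in delta.get((state, symbol), ())
        }
        if not current:                # every path has died
            return False
    return bool(current & finals)      # at least one path ended in a final state


# Assumed non-deterministic sheep machine: two choices from state 2 on 'a'.
ND_SHEEP = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}

for s in ["baa!", "baaa!", "ba!"]:
    print(s, parallel_recognize(s, ND_SHEEP, 0, {4}))
```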

39
Non-Deterministic Recognition
  • In an NFSA there exists at least one path through
    the machine for a string that is in the language
    defined by the machine.
  • But not all paths through the machine for an
    accepted string lead to an accept state.
  • No paths through the machine lead to an accept
    state for a string not in the language.

40
Non-Deterministic Recognition
  • So success in a non-deterministic recognition
    occurs when a path is found through the machine
    that ends in an accept state.
  • Failure occurs when none of the possible paths
    lead to an accept state.

41
Example
42
Example
43
Example
44
Example
45
Example
46
Example
47
Example
48
Example
49
Example
50
Key Points
  • States in the search space are pairings of tape
    positions and states in the machine (search-state
    vs machine-state).
  • By keeping track of as yet unexplored states, a
    recognizer can systematically explore all the
    paths through the machine given an input.

51
ND-Recognize Code
52
ND-Recognize Code cont.
  • Agenda: keeps track of all the currently
    unexplored choices generated during the course of
    processing
  • Current-search-state: the branch choice currently
    being explored
  • ACCEPT-STATE?: returns accept if the current
    search-state contains an accepting machine-state
    and a pointer to the end of the tape.
  • GENERATE-NEW-STATES: creates search-states for
    any ε-transitions and any normal input-symbol
    transitions from the transition table.
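
The pieces listed above can be put together in a compact agenda-based recognizer. This is a sketch, not the slide's own code: the names `nd_recognize` and `EPSILON`, the use of the empty string to mark ε-transitions, and the example machine are assumptions. The `seen` set is an addition that keeps the search from revisiting a search-state, so loops made only of ε-transitions cannot run forever.

```python
EPSILON = ""  # assumed marker for ε-transitions in the table


def nd_recognize(tape, delta, start, finals):
    agenda = [(start, 0)]              # search-states: (machine-state, tape index)
    seen = set()
    while agenda:
        state, pos = agenda.pop()      # current search-state (stack: depth-first)
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        # ACCEPT-STATE?: accepting machine-state at the end of the tape
        if pos == len(tape) and state in finals:
            return True
        # GENERATE-NEW-STATES: ε-moves keep the tape index unchanged
        for nxt in delta.get((state, EPSILON), ()):
            agenda.append((nxt, pos))
        # GENERATE-NEW-STATES: ordinary moves consume one symbol
        if pos < len(tape):
            for nxt in delta.get((state, tape[pos]), ()):
                agenda.append((nxt, pos + 1))
    return False                       # agenda empty: no path accepts


# Assumed machine: state 3 may jump back to state 2 without reading input.
EPS_SHEEP = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, EPSILON): {2},
    (3, "!"): {4},
}

for s in ["baa!", "baaa!", "ba!"]:
    print(s, nd_recognize(s, EPS_SHEEP, 0, {4}))
```

Popping from the end of the list gives depth-first search; taking items from the front instead would give breadth-first search.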

53
Recognition as Search
  • ND-RECOGNIZE accomplishes the task of recognizing
    strings in a regular language by providing a way
    to systematically explore all the possible paths
    through a machine.
  • You can view this algorithm as state-space
    search.
  • States are pairings of tape positions and state
    numbers given by the machine.
  • Operators are compiled into the table
  • Goal state is a pairing with the end of tape
    position and a final accept state.

54
Infinite Search
  • If you're not careful, such searches can go into
    an infinite loop.
  • How?

55
Why Bother?
  • Non-determinism doesn't get us more formal power
    and it causes headaches, so why bother?
  • More natural solutions
  • Deterministic machines built by construction can
    be too big

56
Equivalence
  • Non-deterministic machines can be converted to
    deterministic ones with a fairly simple
    construction
  • That means that they have the same power:
    non-deterministic machines are not more powerful
    than deterministic ones
  • It also means that one way to do recognition with
    a non-deterministic machine is to turn it into a
    deterministic one.

57
Regular languages
  • The class of languages characterizable by regular
    expressions
  • Given an alphabet Σ, the regular languages over Σ
    are:
  • The empty set ∅ is a regular language
  • ∀a ∈ Σ ∪ ε, {a} is a regular language
  • If L1 and L2 are regular languages, then so are:
  • L1 · L2 = {xy | x ∈ L1, y ∈ L2}, the
    concatenation of L1 and L2
  • L1 ∪ L2, the union of L1 and L2
  • L1*, the Kleene closure of L1
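
The three closure operations can be tried out on small finite languages represented as Python sets of strings. This is a sketch under assumptions: the function names are hypothetical, and since the Kleene closure is infinite for any language containing a non-empty string, `kleene_closure` only returns its members up to a chosen length `max_len`.

```python
def concat(l1, l2):
    """L1 · L2 = {xy | x in L1, y in L2}."""
    return {x + y for x in l1 for y in l2}


def union(l1, l2):
    """L1 ∪ L2."""
    return l1 | l2


def kleene_closure(l1, max_len):
    """Members of L1* that are at most max_len characters long."""
    result = {""}                      # zero repetitions: the empty string
    frontier = {""}
    while frontier:
        frontier = {
            x + y for x in frontier for y in l1 if len(x + y) <= max_len
        } - result
        result |= frontier
    return result


print(concat({"a", "b"}, {"c"}))
print(union({"a"}, {"b"}))
print(sorted(kleene_closure({"ab"}, 4)))
```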

58
Going from regexp to FSA
  • Since all regular languages meet the above
    properties
  • And regular languages are the languages
    characterizable by regular expressions
  • All regular expression operators can be
    implemented by combinations of concatenation,
    union (disjunction), and closure
  • Counters (*, +) are repetition plus closure
  • Anchors are individual symbols
  • [], () and . are kinds of disjunction

59
Going from regexp to FSA
  • So if we could just show how to turn
    closure/union/concat from regexps to FSAs, this
    would give an idea of how FSA compilation works.
  • The actual proof that regular languages = FSAs
    has 2 parts
  • An FSA can be built for each regular language
  • A regular language can be built for each
    automaton
  • So I'll give the intuition of the first part
  • Take any regular expression and build an
    automaton
  • Intuition: induction
  • Base case: build an automaton for a single symbol
    (say a)
  • Inductive step: show how to imitate the 3 regexp
    operations in automata

60
Union
  • Accept a string in either of two languages

61
Concatenation
  • Accept a string consisting of a string from
    language L1 followed by a string from language L2.

62
FSAs and Computational Morphology
  • An important use of FSAs is for morphology, the
    study of word parts

63
English Morphology
  • Morphology is the study of the ways that words
    are built up from smaller meaningful units called
    morphemes
  • We can usefully divide morphemes into two classes
  • Stems: the core meaning-bearing units
  • Affixes: bits and pieces that adhere to stems to
    change their meanings and grammatical functions

64
Nouns and Verbs (English)
  • Nouns are simple (not really)
  • Markers for plural and possessive
  • Verbs are only slightly more complex
  • Markers appropriate to the tense of the verb

65
Regulars and Irregulars
  • OK, so it gets a little complicated by the fact
    that some words misbehave (refuse to follow the
    rules)
  • Mouse/mice, goose/geese, ox/oxen
  • Go/went, fly/flew
  • The terms regular and irregular will be used to
    refer to words that follow the rules and those
    that don't.

66
Regular and Irregular Nouns and Verbs
  • Regulars
  • Walk, walks, walking, walked, walked
  • Table, tables
  • Irregulars
  • Eat, eats, eating, ate, eaten
  • Catch, catches, catching, caught, caught
  • Cut, cuts, cutting, cut, cut
  • Goose, geese

67
Compute
  • Many paths are possible
  • Start with compute
  • Computer -> computerize -> computerization
  • Computation -> computational
  • Computer -> computerize -> computerizable
  • Compute -> computee

68
Why care about morphology?
  • Stemming in information retrieval
  • Might want to search for aardvark and find
    pages with both aardvark and aardvarks
  • Morphology in machine translation
  • Need to know that the Spanish words quiero and
    quieres are both related to querer "want"
  • Morphology in spell checking
  • Need to know that misclam and antiundoggingly are
    not words despite being made up of word parts

69
Can't just list all words
  • Turkish
  • Uygarlaştıramadıklarımızdanmışsınızcasına
  • "(behaving) as if you are among those whom we
    could not civilize"
  • uygar "civilized" + laş "become" + tır "cause" +
    ama "not able" + dık "past" + lar "plural" + ımız
    "p1pl" + dan "abl" + mış "past" + sınız "2pl" +
    casına "as if"

70
What we want
  • Something to automatically do the following kinds
    of mappings
  • Cats -> cat +N +PL
  • Cat -> cat +N +SG
  • Cities -> city +N +PL
  • Merging -> merge +V +Present-participle
  • Caught -> catch +V +Past-participle

71
FSAs and the Lexicon
  • This will actually require a kind of FSA we won't
    be studying: the Finite State Transducer (FST)
  • But we'll give a quick overview anyhow
  • First we'll capture the morphotactics
  • The rules governing the ordering of affixes in a
    language.
  • Then we'll add in the actual words

72
Simple Rules
73
Adding the Words
74
Derivational Rules
75
Parsing/Generation vs. Recognition
  • Recognition is usually not quite what we need.
  • Usually if we find some string in the language we
    need to find the structure in it (parsing)
  • Or we have some structure and we want to produce
    a surface form (production/generation)
  • Example
  • From "cats" to "cat +N +PL"

76
Finite State Transducers
  • The simple story
  • Add another tape
  • Add extra symbols to the transitions
  • On one tape we read "cats", on the other we write
    "cat +N +PL"

77
Transitions
  • c:c means read a c on one tape and write a c on
    the other
  • +N:ε means read a +N symbol on one tape and write
    nothing on the other
  • +PL:s means read +PL and write an s
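
The transition labels above can be sketched as a tiny transducer that reads lexical symbols on one tape and writes surface characters on the other. Everything here is an illustrative assumption: the arc table `ARCS`, the function name `transduce`, the tag spellings `+N`, `+PL` and `+SG`, the `+SG` arc that writes nothing, and the fact that the machine covers only the single stem "cat".

```python
EPS = ""  # writing ε means writing nothing

# (state, symbol read) -> (next state, symbol written)
ARCS = {
    (0, "c"): (1, "c"),      # c:c
    (1, "a"): (2, "a"),      # a:a
    (2, "t"): (3, "t"),      # t:t
    (3, "+N"): (4, EPS),     # +N:ε
    (4, "+PL"): (5, "s"),    # +PL:s
    (4, "+SG"): (5, EPS),    # assumed: singular adds no suffix
}
FINALS = {5}


def transduce(symbols):
    """Map a lexical symbol sequence to a surface string, or None on failure."""
    state, output = 0, []
    for sym in symbols:
        if (state, sym) not in ARCS:
            return None
        state, written = ARCS[(state, sym)]
        output.append(written)
    return "".join(output) if state in FINALS else None


print(transduce(["c", "a", "t", "+N", "+PL"]))
print(transduce(["c", "a", "t", "+N", "+SG"]))
```

Swapping the read and write sides of every arc would run the same machine in the other direction, from surface form to analysis.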

78
Lexical to Intermediate Level
79
End of Topic 4