Semitic Languages, Linguistics and Computers

Slides: 33

Provided by: Nik35

Transcript and Presenter's Notes

1
Semitic Languages, Linguistics and Computers
Kenneth R. BEESLEY, Xerox Research Centre Europe (XRCE)
ken.beesley_at_xrce.xerox.com
University of Malta, March 2001
2
Ken Beesley: Brief Introduction

B.A., Linguistics and Computer Science, Brigham Young University, 1978
Diploma, Linguistics and Phonetics, Univ. of Glasgow, 1979
D.Phil., Epistemics (Cognitive Science), Univ. of Edinburgh, 1983
ALPNET, computer-assisted translation, 1984-1990
1988-1990: Arabic morphology project; exposure to Finite-State Morphology (Two-Level Morphology) from Lauri Karttunen at COLING 1988
Microlytics (Xerox spinoff), 1990-1993
Xerox Corporation, 1993-present
Computational morphology projects: Arabic, Spanish, Portuguese, Italian, Dutch, (Malay), (Aymara); also teaching finite-state programming techniques
Some people are into finite-state programming for the mathematics and algorithms; I'm in it because it lets me build working systems for interesting natural languages.

3
Overview of Today's Talk

Formal Morphology
Morphotactics: study and description of word formation
Morphophonology: study and description of alternations
Challenges/Issues in Semitic morphology
Computational Morphology (Finite-State Morphology paradigm)
General challenges/successes around the world
Semitic languages: always seem to be a bit harder
Significant computational work already done on Semitic languages
Hope to inspire more

4
Concatenative-Polysynthetic (Inuktitut)

Lexical natsiqviniqtuqlauqsimavitli
Surface natsi viniq tu lauq si ma vi l li
natsiq seal (open-class stem)
viniq meat (closed-class substem)
tuq eat (closed-class substem)
lauq before
si perfective
ma resulting state
vi question marker
t you
li but

(but) have you ever eaten seal meat before?
5
Inuktitut

Parismutnngaujumaniraqlauqsimanngitjunga
Pari mu nngau juma nira lauq si ma nngit tunga
Paris Paris
mut terminalis-case
nngau direction-to
juma want
niraq declare that
lauq past
si perfective
ma resulting state
nngit negative
junga 1P pres. indic

I never said that I wanted to go to Paris
6
Concatenative-Agglutinative (Aymara)

Lexical utamana-kapxaraki-iwa
Surface uta ma n ka p xa rak i wa
uta house (noun stem)
ma 2nd person possessive (your)
na in (case suffix)
-ka locative (also verbalizes)
p plural
xa perfect aspect
raki also
-i 3rd person present tense
wa affirmative sentential
also they are in your house

7
Aymara

Morphophonemic Chuñu wi na -ka si -ka -iri yat(a) wa
Surface Chuñü wi n ka s k irï yät wa
chuñu N freeze-dried potatoes
N>V be/make
wi V>N place-of
na in (location)
-ka N>V be-in (location)
si continuative
-ka imperfect
-iri V>N one who
N>V be
yata 1P recent past
wa affirmative sentential
I was (one who was) always at the place for making chuñu

8
Theory-Neutral Morphological Analysis
Analysis undoes the morphotactic and morphophonological processes, separating and identifying the morphemes.
Generation is ideally just the inverse of analysis.
Diagram: Words (e.g. Chuñüwinkaskirïyätwa) -> Black-Box Morphological Analyzer -> Analyses
9
The Claim/Goal of Xerox Finite-State Morphology

Both the morphotactics and the morphophonological
alternations can be described with regular
expressions, or equivalent shorthand notations,
which are compiled into finite-state transducers
(networks)

Combine via composition at compile time:
Morphotactic description (regular expression or lexc) -> Compiler -> FST
.o.
Alternation rules (regular expression) -> Compiler -> FST
Result: Lexical Transducer
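
The composition step can be pictured with finite sets of string pairs standing in for transducers. This is only a toy sketch: the analysis tags and the listed forms are hypothetical illustrations built from the Aymara example earlier in the talk, and a real compiler composes networks rather than enumerated pairs.

```python
def compose(upper_rel, lower_rel):
    """Compose two finite relations given as sets of (upper, lower) string pairs."""
    return {
        (upper, lower)
        for (upper, mid) in upper_rel
        for (mid2, lower) in lower_rel
        if mid == mid2
    }

# Hypothetical morphotactic relation: analysis tags -> morphophonemic form.
morphotactics = {
    ("uta+2Poss+In", "utamana"),
    ("uta+2Poss+In+Loc", "utamana-ka"),
}

# Hypothetical alternation relation: morphophonemic form -> surface form.
alternations = {
    ("utamana", "utamana"),
    ("utamana-ka", "utamanka"),
}

# The intermediate morphophonemic level disappears in the composed result.
lexical_transducer = compose(morphotactics, alternations)
```

The composed relation pairs analyses directly with surface forms, which is the sense in which the two descriptions are combined once, at compile time.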
10
A Properly Defined Lexical Transducer FST can
Perform Morphological Analysis and Generation

bidirectional: same network for both analysis and generation
efficient: processes thousands of words per second
compact: less than 1 MB in compressed form

Canonical form plus inflection codes: vouloir+IndP+SG+P3
Finite-state network
Inflected form: veut
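
Bidirectionality can be illustrated with a single table of pairs that is read in either direction. This is a minimal sketch, not the Xerox runtime: the `+`-separated tag spelling is an assumption, and the second entry (pouvoir) is added here only to make the table non-trivial.

```python
# One shared relation of (analysis, surface) pairs; tag spelling is assumed.
PAIRS = [
    ("vouloir+IndP+SG+P3", "veut"),
    ("pouvoir+IndP+SG+P3", "peut"),
]

def analyze(surface):
    """Look up the surface side and return every matching analysis."""
    return [upper for upper, lower in PAIRS if lower == surface]

def generate(analysis):
    """Look up the analysis side and return every matching surface form."""
    return [lower for upper, lower in PAIRS if upper == analysis]
```

Both functions consult the same data, which mirrors the claim that one network serves for analysis and for generation.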
11
Why is finite-state power interesting?

Formally constrained (not just a bunch of ad hoc code)
Flexible: grammars compile into finite-state automata (networks) that can themselves be combined and modified without needing to change the original grammar
Networks provide efficient storage
Networks can be applied very efficiently: morphological analyzers typically run at thousands of words per second on modern machines
Networks are bi-directional
The application code is language-independent

12
Some Aymara alternation rules

a -> ä, i -> ï, u -> ü _
a i u -> 0 _ -
c h i - -> s _ t s
s t ä (->) t ä s k i _ t a

You can see and download the set of real Aymara alternation rules at http://www.xrce.xerox.com/research/mltt/aymara
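
The vowel-deletion rule can be tried out on the earlier Aymara example. This sketch assumes the rule means "a, i or u is deleted immediately before a suffix boundary written `-`", and that the boundary mark itself does not surface; the rule contexts in this transcript are incomplete, so both points are assumptions. It uses an ordinary regular-expression substitution rather than a compiled transducer.

```python
import re

def drop_vowel_before_boundary(lexical):
    """Delete a, i or u together with the '-' boundary mark that follows it."""
    return re.sub(r"[aiu]-", "", lexical)

surface = drop_vowel_before_boundary("utamana-kapxaraki-iwa")
```

Under these assumptions the lexical string from the "in your house" example comes out as the surface string shown on that slide.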
13
Finite-State Morphology

Software Implementations / Development Environments
Two-Level Morphology (e.g. PC-KIMMO)
Xerox Finite-State Morphology (lexc, xfst, twolc, ...)
AT&T Library, Lextools
Univ. of Groningen, Fsa Utils 6
Morphological Applications
All the commercially interesting Indo-European languages
Also Finnish, Hungarian, Turkish, Swahili, Korean, Japanese
Significant research in Irish, Basque, Malay, Aymara, ...

14
Criticism of Traditional Finite-State
Morphotactics

Two-Level and Finite-State Morphology in general have been widely criticized for handling only concatenative morphotactics.
"Only restricted infixation and reduplication can be handled adequately with the present system. Some extensions or revisions will be necessary for an adequate description of languages possessing extensive infixation or reduplication." (Koskenniemi, 1983, p. 27)
In particular, it is often charged that finite-state morphology is not capable of handling Semitic languages.

15
The Challenge of Fixed-length Reduplication in Tagalog (Antworth 1990:156-162)

pili (choose) > pipili
tahi (sew) > tatahi
kuha (take) > kukuha
Antworth defines a morphophonemic lexical prefix RE plus alternation rules that realize R as the first following consonant and E as the first following vowel.
Lexical REpili REtahi REkuha
Surface p i pili t a tahi k u kuha
This solution is adequate and even elegant for such fixed-length reduplication.
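
The RE prefix can be mimicked procedurally. This is a rough sketch of the idea only: Antworth's solution uses two-level alternation rules, whereas the function below simply scans the stem, and the vowel inventory is an assumption.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for this illustration

def realize_re(lexical):
    """Realize R as the first following consonant and E as the first following vowel."""
    if not lexical.startswith("RE"):
        raise ValueError("expected a lexical form beginning with RE")
    stem = lexical[2:]
    first_consonant = next(ch for ch in stem if ch not in VOWELS)
    first_vowel = next(ch for ch in stem if ch in VOWELS)
    return first_consonant + first_vowel + stem
```

Because the copied material is always exactly one consonant and one vowel, its length is bounded, which is why rules of this kind stay within finite-state power.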

16
Challenge: Malay/Indonesian Full-Stem Reduplication

Simple reduplication: buku+redup, stem buku (book)
buku-buku (books)
Prefixed reduplication: bagi+meN+redup, stem bagi (divide)
membagi-bagi (divide into separate parts)
pijit+meN+redup, stem pijit (get a massage)
memijit-mijit (squeeze)
Redup, prefix-suffix: merah+ke+redup+an, stem merah (red)
kemerah-merahan (reddish)
Prefix-suffix, redup: ubah+redup+per+an, stem ubah (difference)
perubahan-perubahan (alternations/changes)
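
The difference between the last two patterns is the order in which affixation and copying apply. The sketch below shows that ordering with plain string operations; it leaves out the meN- prefix, whose nasal alternation is not covered here, and the helper name is hypothetical.

```python
def redup(form):
    """Copy the whole form, joined by a hyphen as in the written language."""
    return f"{form}-{form}"

simple = redup("buku")                       # stem copied as is
redup_then_affix = "ke" + redup("merah") + "an"   # copy first, then ke-...-an
affix_then_redup = redup("per" + "ubah" + "an")   # per-...-an first, then copy
```

Since the copied unit is a whole stem of arbitrary length, this kind of reduplication cannot be handled by a fixed pair of R and E symbols.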

17
The Xerox compile-replace algorithm

An algorithm that takes a finite-state network as an argument and returns a modified (still finite-state) network
Can be applied to the upper side and/or the lower side of a network, perhaps multiple times.
compile-replace
finds delimited substrings of the form ^[ string ^], where the string is just a string of symbols, joined by concatenation, but which happens to have the format of a regular expression
compiles the string as a regular expression, and then
replaces the delimited substring with the result of the compilation.

18
The (Xerox) finite-state iteration operator

^n: n concatenations, for any integer n
A^2 denotes two concatenations of the language A with itself, equivalent to A A.
A = bagi
A^2 = bagibagi
Finite-state languages and relations are closed under n-ary concatenation.
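
Iteration can be modelled on languages represented as finite sets of strings. This is a small sketch under that simplifying assumption; the function names are illustrative and are not part of any Xerox tool.

```python
def concat(lang_a, lang_b):
    """Concatenate two languages: every string of A followed by every string of B."""
    return {a + b for a in lang_a for b in lang_b}

def power(lang, n):
    """Concatenate a language with itself n times; n = 0 gives the empty string."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result

single = power({"bagi"}, 2)
mixed = power({"bagi", "buku"}, 2)
```

Note that squaring a language with several members also produces mixed strings such as bagibuku, so iteration gives true reduplication only when it is applied to one stem at a time.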

19
Iteration in Morphotactics Malay

define pref [ 0 .x. "^[" "{" ] ;
define root [ b a g i | p e r a t u r a n ] ;
define suff "+Noun":0 [ "+Pl" .x. "}" "^" 2 "^]" | "+Sg" .x. 0 ] ;
define Nouns (pref) root suff ;
The resulting intermediate FST will relate string pairs like the following
(we filter out strings with unmatched delimiters ^[ and ^])
Upper b a g i +Noun +Sg
Lower b a g i 0 0
Upper 0 0 b a g i +Noun +Pl
Lower ^[ { b a g i 0 } ^ 2 ^]

20
compile-replace before and after

Before
Upper bagi+Noun+Pl peraturan+Noun+Pl
Lower ^[{bagi}^2^] ^[{peraturan}^2^]
xfst compile-replace lower
After
Upper bagi+Noun+Pl peraturan+Noun+Pl
Lower bagibagi peraturanperaturan

And it applies similarly to all delimited regular-expression substrings on the lower side. There must be a finite number of them. Note that this operation is performed just once at compile-time.
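
The before/after step can be imitated on plain strings. This is a heavily simplified sketch: it assumes the delimiters are written ^[ and ^] and that the delimited expression always has the shape {stem}^n, whereas the real algorithm compiles arbitrary regular expressions and operates on paths of a network.

```python
import re

# Matches ^[{stem}^n^] and captures the stem and the repeat count.
DELIMITED = re.compile(r"\^\[\{(\w+)\}\^(\d+)\^\]")

def compile_replace(lower):
    """Replace each delimited {stem}^n expression with the string it denotes."""
    return DELIMITED.sub(lambda m: m.group(1) * int(m.group(2)), lower)

before = [
    ("bagi+Noun+Pl", "^[{bagi}^2^]"),
    ("peraturan+Noun+Pl", "^[{peraturan}^2^]"),
]
# Only the lower side is rewritten; the upper side is left untouched.
after = [(upper, compile_replace(lower)) for upper, lower in before]
```

Here the evaluation happens once over a finite list, which corresponds to the remark that the operation runs a single time at compile-time.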
21
Another Challenge Arabic Stem Interdigitation

wasayaktubuwnahaA
wa and
sa future marker
ya imperfect prefix
ktb root k t b
CCVC Form I imperfect template; CCVC -> ktub (stem)
u active-voice vocalization u
una they, masc. plural (imperfect suffix)
ha it/them (direct-object clitic pronoun suffix)
English gloss: and they will write it

Stem Interdigitation
22
Some Formal Analyses of Semitic Stems

Harris, 1944 (Root-Pattern)
root b r, pattern n_a_i_: nabir
root k t b, pattern _a_a_: katab
McCarthy, 1981 (Root-Template-Vocalization)
n, root b r, template CCVCVC, vocalization a i: nabir
root k t b, template CVCVC, vocalization a: katab
Another alternative is simply to ignore or deny
the concept of roots and treat stems as
monolithic morphemes.
23
Finite-State Computational Semitic

Kay, 1987: Arabic stem interdigitation via multi-level transducers (Kiraz, 2000)
Lavie et al., 1988: Two-Level Morphology adapted to Hebrew verbs
Kataja & Koskenniemi, 1988: Ancient Akkadian
Concatenating languages are just a special case
Morphotactics defined using regular expressions/operations
Roots and patterns formalized as regular languages
Roots are INTERSECTED with patterns, rather than concatenated, to form stems
Sublexicon of Roots: ? k ? t ? b ?
Sublexicon of Patterns: CaCaC
Pre-intersected by awk scripts: katab
Then compiled by TwoL
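
Treating roots and patterns as regular languages and intersecting them can be sketched by enumeration. The assumptions here are that each ? in the root stands for any intervening material, that C ranges over a small illustrative consonant set, and that brute-force enumeration is acceptable for a toy example; none of this reproduces the awk scripts mentioned above.

```python
import re
from itertools import product

CONSONANTS = "ktbdrs"  # small illustrative inventory, an assumption

def pattern_language(pattern):
    """Enumerate the strings of a pattern such as CaCaC, with C as any consonant."""
    slots = [CONSONANTS if ch == "C" else ch for ch in pattern]
    return {"".join(choice) for choice in product(*slots)}

def root_language(root):
    """Regular expression for the root consonants in order, with anything between."""
    return re.compile(".*" + ".*".join(re.escape(ch) for ch in root) + ".*")

def intersect(root, pattern):
    """Keep the pattern strings that also belong to the root language."""
    root_re = root_language(root)
    return {word for word in pattern_language(pattern) if root_re.fullmatch(word)}
```

For example, `intersect("ktb", "CaCaC")` keeps only katab, since it is the one CaCaC string whose consonants are k, t, b in that order.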

24
Beesley: Arabic Stem Intersection at Runtime

ALPNET (88-90): k t b, wasayaCCuCunaha
Roots and patterns resided in separate sublexicons
Root and pattern sublexicons were traversed in parallel at runtime
Intersection was simulated in C code (detouring) at runtime
ktb and CCuC were returned as separate morphemes in the analyses
Still mostly a Two-Level System
Beesley: Stem Intersection at Compile-time
Xerox (1996-98): Reimplementation using Xerox Finite-State Morphology
On-line demo available: http://www.xrce.xerox.com/research/mltt/arabic
Use any Java-enabled browser
25
Xerox Arabic Morphological Analyzer

About 4930 roots in the underlying dictionary
Each root is encoded to show which patterns it
can combine with
Roots and patterns are intersected to form over
90,000 stems
With various combinations of prefixes and
suffixes, the system encodes 72,000,000
fully-voweled words, with their morphological
analyses
In addition, it analyzes unvoweled and partially
voweled spellings
The compiled analyzer network is currently
storable in about 5 MB
The web demo is Unicode based and renders Arabic
script as you type
Roots, patterns and other affixes are separated
and returned

26
Intersecting Stems on One Side of a Transducer at
Compile Time

Start with a Two-Level Lexicon
Compose FS Intersecting Rules at Compile Time
Upper wasayaktb CCuCunaha
Lower wasayaktb CCuCunaha
.o.
Finite-State Stem-Intersection Rules
Result
Upper wasayaktb CCuCunaha
Lower wasaya ktub unaha
Then apply the finite-state morphophonological alternation/realization rules, handling weak roots, hamza orthography in general, assimilation, deletion, ...

27
Finite-State Merge: fast special-case intersection

.m>. is the merge-to-the-right operator and .<m. is the merge-to-the-left operator
ktb .m>. CVVCVC .<m. a => kaatab
ktb .m>. CVVCVC .<m. ui => kuutib
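
The effect of the two merges can be approximated by filling template slots. This is a sketch under stated assumptions, not the Xerox merge algorithm: root consonants fill the C slots in order, and when the vocalization is shorter than the number of V slots its first vowel is repeated, a choice made here only because it reproduces the two results above.

```python
def merge(template, root, vocalization):
    """Fill C slots from the root and V slots from the vocalization."""
    n_c, n_v = template.count("C"), template.count("V")
    if n_c != len(root):
        raise ValueError("root must supply one consonant per C slot")
    if not 1 <= len(vocalization) <= n_v:
        raise ValueError("vocalization must fit the V slots")
    # Assumed spreading: repeat the first vowel so the last vowels align rightmost.
    vowels = [vocalization[0]] * (n_v - len(vocalization)) + list(vocalization)
    consonants, vowels = iter(root), iter(vowels)
    return "".join(
        next(consonants) if slot == "C" else next(vowels) if slot == "V" else slot
        for slot in template
    )
```

With these assumptions, `merge("CVVCVC", "ktb", "a")` and `merge("CVVCVC", "ktb", "ui")` give the active and passive stems shown on the slide.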

28
The compile-replace algorithm before and after

Before
Upper ktb.m>.CVVCVC.<m.uia
Lower ktb.m>.CVVCVC.<m.uia
xfst list C k t b d r s m n
xfst list V a i u
xfst compile-replace lower
After
Upper ktb.m>.CVVCVC.<m.uia
Lower kuutib
a
and similarly for about 90,000 stems
29
The compile-replace algorithm

A general compile-time technique that allows the regular-expression compiler to apply to and modify its own output.
Somewhat similar in operation to eval in LISP and Perl.
Appears to handle some classic examples of non-concatenative morphotactics: full-stem reduplication and Semitic stem interdigitation, either
Two-way root-pattern theory, or
Three-way root-template-vocalization theory
We've only begun to explore the possibilities.

30
What is Finite-State Computing Good For?

Mostly lower-level natural language processing
Tokenization
Spelling checking/correction
Phonology
Morphological Analysis/Generation
Part-of-Speech Tagging
Shallow Syntactic Parsing and Chunking

Finite-state techniques cannot do everything, but for tasks where they do apply, they are extremely attractive.
31
What about Maltese?

Necessary preliminary work has already started
Corpora
Lexicography
Formal linguistic description
Finite-state implementation
Xerox finite-state calculus already licensed at
Univ. of Malta
The compile-replace algorithm will soon be
released
The Book (Beesley and Karttunen, forthcoming)
Unique opportunity
Semitic component
Routinely written, in a culture with high literacy

32
Final Observations

Successful computational linguistic projects are
often the result of cooperation between a
computational linguist and a more traditional
descriptive linguist
Computational linguistics can be commercially rewarding
Computational linguistics is a healthy discipline
from the descriptive point of view
Your grammars can literally be tested on millions
of words
Any mistakes or gaps in your grammars soon become
apparent