Title: Semitic Languages, Linguistics and Computers
1. Semitic Languages, Linguistics and Computers
Kenneth R. BEESLEY, Xerox Research Centre Europe (XRCE)
ken.beesley_at_xrce.xerox.com
University of Malta, March 2001
2. Ken Beesley: Brief Introduction
- B.A., Linguistics and Computer Science, Brigham Young University, 1978
- Diploma, Linguistics and Phonetics, Univ. of Glasgow, 1979
- D.Phil., Epistemics (Cognitive Science), Univ. of Edinburgh, 1983
- ALPNET, computer-assisted translation, 1984-1990
  - 1988-1990: Arabic morphology project; exposure to Finite-State Morphology (Two-Level Morphology) from Lauri Karttunen at COLING 1988
- Microlytics (Xerox spinoff), 1990-1993
- Xerox Corporation, 1993-present
  - Computational Morphology projects: Arabic, Spanish, Portuguese, Italian, Dutch, (Malay), (Aymara); also teaching finite-state programming techniques
- Some people are into finite-state programming for the mathematics and algorithms; I'm in it because it lets me build working systems for interesting natural languages.
3. Overview of Today's Talk
- Formal Morphology
  - Morphotactics: study and description of word formation
  - Morphophonology: study and description of alternations
  - Challenges/issues in Semitic morphology
- Computational Morphology (Finite-State Morphology paradigm)
  - General challenges/successes around the world
  - Semitic languages: always seem to be a bit harder
  - Significant computational work already done on Semitic languages
  - Hope to inspire more
4. Concatenative-Polysynthetic (Inuktitut)
- Lexical: natsiqviniqtuqlauqsimavitli
- Surface: natsi viniq tu lauq si ma vi l li
  - natsiq: 'seal' (open-class stem)
  - viniq: 'meat' (closed-class substem)
  - tuq: 'eat' (closed-class substem)
  - lauq: 'before'
  - si: perfective
  - ma: resulting state
  - vi: question marker
  - t: 'you'
  - li: 'but'
- '(But) have you ever eaten seal meat before?'
5. Inuktitut
- Lexical: Parismutnngaujumaniraqlauqsimanngitjunga
- Surface: Pari mu nngau juma nira lauq si ma nngit tunga
  - Paris: 'Paris'
  - mut: terminalis case
  - nngau: direction-to
  - juma: 'want'
  - niraq: 'declare that'
  - lauq: past
  - si: perfective
  - ma: resulting state
  - nngit: negative
  - junga: 1P pres. indic.
- 'I never said that I wanted to go to Paris'
6. Concatenative-Agglutinative (Aymara)
- Lexical: utamana-kapxaraki-iwa
- Surface: uta ma n ka p xa rak i wa
  - uta: 'house' (noun stem)
  - ma: 2nd person possessive ('your')
  - na: 'in' (case suffix)
  - -ka: locative (also verbalizes)
  - p: plural
  - xa: perfect aspect
  - raki: 'also'
  - -i: 3rd person present tense
  - wa: affirmative sentential
- 'Also they are in your house'
7. Aymara
- Morphophonemic: Chuñu wi na -ka si -ka -iri yat(a) wa
- Surface: Chuñü wi n ka s k irï yät wa
  - chuñu: N, 'freeze-dried potatoes'
  - N>V 'be/make'
  - wi: V>N 'place-of'
  - na: 'in' (location)
  - -ka: N>V 'be-in' (location)
  - si: continuative
  - -ka: imperfect
  - -iri: V>N 'one who'
  - N>V 'be'
  - yata: 1P recent past
  - wa: affirmative sentential
- 'I was (one who was) always at the place for making chuñu'
8. Theory-Neutral Morphological Analysis
Analysis undoes the morphotactic and morphophonological processes, separating and identifying the morphemes.
Generation is ideally just the inverse of analysis.
[Diagram: a Black-Box Morphological Analyzer maps Words (e.g. Chuñüwinkaskirïyätwa) to Analyses.]
9. The Claim/Goal of Xerox Finite-State Morphology
- Both the morphotactics and the morphophonological alternations can be described with regular expressions, or equivalent shorthand notations, which are compiled into finite-state transducers (networks).
- Combine via composition at compile time.
[Diagram: the Morphotactic Description (regular expression or lexc) and the Alternation Rules (regular expressions) each go through the Compiler to produce an FST; the two FSTs are combined by composition (.o.) into a single Lexical Transducer.]
10. A Properly Defined Lexical Transducer (FST) Can Perform Morphological Analysis and Generation
- bidirectional
  - same network for both analysis and generation
- efficient
  - processes thousands of words/second
- compact
  - less than 1 MB in compressed form
[Diagram: a finite-state network relates the lexical string vouloir+IndP+SG+P3 (canonical form plus inflection codes) to the inflected form veut.]
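The bidirectionality claim can be illustrated with a minimal Python sketch. It assumes a toy relation held as a list of (lexical, surface) pairs instead of a compiled network; `PAIRS`, `analyze` and `generate` are hypothetical names, and the `+`-separated tag spelling is an assumption about how the slide's example was written.

```python
# Toy stand-in for a lexical transducer: a finite relation between
# lexical strings (upper side) and surface strings (lower side).
# A real system stores this relation as a compiled finite-state network.
PAIRS = [
    ("vouloir+IndP+SG+P3", "veut"),  # pair from the slide; tag spelling assumed
]

def analyze(surface):
    """Apply the relation upward: surface form -> lexical analyses."""
    return [upper for upper, lower in PAIRS if lower == surface]

def generate(lexical):
    """Apply the same relation downward: lexical string -> surface forms."""
    return [lower for upper, lower in PAIRS if upper == lexical]

print(analyze("veut"))                  # ['vouloir+IndP+SG+P3']
print(generate("vouloir+IndP+SG+P3"))   # ['veut']
```

Both functions read the same data, which is the point of the slide: there is one relation, and analysis and generation are just the two directions of applying it.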
11. Why is finite-state power interesting?
- Formally constrained (not just a bunch of ad hoc code)
- Flexible: grammars compile into finite-state automata (networks) that can themselves be combined and modified without needing to change the original grammar
- Networks provide efficient storage
- Networks can be applied very efficiently: morphological analyzers typically run at thousands of words per second on modern machines
- Networks are bi-directional
- The application code is language-independent
12. Some Aymara alternation rules
- a -> ä, i -> ï, u -> ü _
- a i u -> 0 _ -
- c h i - -> s _ t s
- s t ä (->) t ä s k i _ t a
You can see and download the set of real Aymara alternation rules at http://www.xrce.xerox.com/research/mltt/aymara
13. Finite-State Morphology
- Software implementations/development environments
  - Two-Level Morphology (e.g. PC-KIMMO)
  - Xerox Finite-State Morphology (lexc, xfst, twolc, ...)
  - AT&T Library, Lextools
  - Univ. of Groningen, Fsa Utils 6
- Morphological applications
  - All the commercially interesting Indo-European languages
  - Also Finnish, Hungarian, Turkish, Swahili, Korean, Japanese
  - Significant research in Irish, Basque, Malay, Aymara, ...
14. Criticism of Traditional Finite-State Morphotactics
- Two-Level and Finite-State Morphology in general have been widely criticized for handling only concatenative morphotactics.
  - "Only restricted infixation and reduplication can be handled adequately with the present system. Some extensions or revisions will be necessary for an adequate description of languages possessing extensive infixation or reduplication." (Koskenniemi, 1983, p. 27)
- In particular, it is often charged that finite-state morphology is not capable of handling Semitic languages.
15. The Challenge of Fixed-length Reduplication in Tagalog (Antworth 1990:156-162)
- pili 'choose' > pipili
- tahi 'sew' > tatahi
- kuha 'take' > kukuha
- Antworth defines a morphophonemic lexical prefix RE plus alternation rules that realize R as the first following consonant and E as the first following vowel.
  - Lexical: REpili, REtahi, REkuha
  - Surface: p i pili, t a tahi, k u kuha
- This solution is adequate and even elegant for such fixed-length reduplication.
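The RE analysis can be mimicked in a few lines of Python. This is only a sketch of the idea, not Antworth's two-level rules: it assumes a five-vowel inventory and consonant-initial stems containing at least one vowel, and `realize_re` is a hypothetical helper name.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for the illustration

def realize_re(lexical):
    """Realize a lexical 'RE' prefix: R copies the first following
    consonant, E copies the first following vowel."""
    if not lexical.startswith("RE"):
        return lexical
    stem = lexical[2:]
    first_consonant = next(ch for ch in stem if ch not in VOWELS)
    first_vowel = next(ch for ch in stem if ch in VOWELS)
    return first_consonant + first_vowel + stem

for word in ["REpili", "REtahi", "REkuha"]:
    print(word, "->", realize_re(word))
# REpili -> pipili
# REtahi -> tatahi
# REkuha -> kukuha
```

The copy is always exactly two symbols long, which is why a fixed prefix plus context-sensitive realization rules is enough here; full-stem reduplication has no such bound.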
16. Challenge: Malay/Indonesian Full-Stem Reduplication
- Simple reduplication: bukuredup; stem buku ('book')
  - buku-buku 'books'
- Prefixed reduplication: bagimeNredup; stem bagi ('divide')
  - membagi-bagi 'divide into separate parts'
  - pijitmeNredup; stem pijit ('get a massage')
  - memijit-mijit 'squeeze'
- Redup, prefix-suffix: merahkeredupan; stem merah ('red')
  - kemerah-merahan 'reddish'
- Prefix-suffix, redup: ubahredupperan; stem ubah ('difference')
  - perubahan-perubahan 'alternations/changes'
17. The Xerox compile-replace algorithm
- An algorithm that takes a finite-state network as an argument and returns a modified (still finite-state) network
- Can be applied to the upper side and/or the lower side of a network, perhaps multiple times.
- compile-replace
  - finds delimited substrings of the form ^[ string ^], where the string is just a string of symbols, joined by concatenation, but which happens to have the format of a regular expression
  - compiles the string as a regular expression, and then
  - replaces the delimited substring with the result of the compilation.
18. The (Xerox) finite-state iteration operator
- ^n: n concatenations, for any integer n
- A^2 denotes two concatenations of the language A with itself, equivalent to A A.
  - A = bagi
  - A^2 = bagibagi
- Finite-state languages and relations are closed under n-ary concatenation.
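As a rough illustration of the iteration operator, the sketch below models a finite language as a Python set of strings; `concat` and `power` are hypothetical helper names, and a real finite-state toolkit would operate on networks rather than enumerated sets.

```python
from itertools import product

def concat(a, b):
    """Concatenation of two finite languages (sets of strings)."""
    return {x + y for x, y in product(a, b)}

def power(lang, n):
    """n concatenations of lang with itself (n >= 0)."""
    result = {""}  # zero concatenations: only the empty string
    for _ in range(n):
        result = concat(result, lang)
    return result

A = {"bagi"}
print(power(A, 2))  # {'bagibagi'}
```

Note that squaring a language with several members, such as {"bagi", "buku"}, also yields mixed strings like "bagibuku". Reduplication therefore needs the operator applied to each stem separately, not to the whole stem lexicon at once.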
19. Iteration in Morphotactics: Malay
- define pref 0 .x.
- define root b a g i p e r a t u r a n
- define suff Noun0 Pl .x. 2 - Sg .x. 0
- define Nouns (pref) root suff
- The resulting intermediate FST will relate string pairs like the following (we filter out strings with unmatched delimiters ^[ and ^]):
  - Upper: bagiNounSg    0 0bagiNounPl
  - Lower: bagi0 0       bagi 0 2
20. compile-replace: before and after
Before:
- Upper: bagiNounPl    peraturanNounPl
- Lower: bagi2         peraturan2
xfst: compile-replace lower
After:
- Upper: bagiNounPl    peraturanNounPl
- Lower: bagibagi      peraturanperaturan
And it applies similarly to all delimited regular-expression substrings on the lower side. There must be a finite number of them. Note that this operation is performed just once, at compile time.
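The before/after step can be imitated on plain strings. The sketch below is a hedged approximation, not the Xerox algorithm itself: it works on a list of (upper, lower) string pairs instead of a network, it assumes the delimiters are written `^[` and `^]`, and it "compiles" only one expression shape, `{stem}^n`, which is an assumed notation for this illustration. All function names are hypothetical.

```python
import re

DELIMITED = re.compile(r"\^\[(.*?)\^\]")     # a ^[ ... ^] substring (assumed delimiters)
ITERATION = re.compile(r"\{(\w+)\}\^(\d+)")  # the only expression handled: {stem}^n

def compile_expression(expr):
    """'Compile' a tiny regular expression of the form {stem}^n into a string."""
    m = ITERATION.fullmatch(expr.strip())
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    stem, n = m.group(1), int(m.group(2))
    return stem * n

def compile_replace(lower):
    """Replace every delimited substring with the result of compiling it."""
    return DELIMITED.sub(lambda m: compile_expression(m.group(1)), lower)

def compile_replace_lower(pairs):
    """Apply compile-replace to the lower side of each (upper, lower) pair."""
    return [(upper, compile_replace(lower)) for upper, lower in pairs]

before = [
    ("bagiNounPl", "^[{bagi}^2^]"),
    ("peraturanNounPl", "^[{peraturan}^2^]"),
]
for upper, lower in compile_replace_lower(before):
    print(upper, lower)
# bagiNounPl bagibagi
# peraturanNounPl peraturanperaturan
```

As on the slide, the upper side is untouched and the rewriting happens once, ahead of time; at lookup time the pairs contain only ordinary strings.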
21. Another Challenge: Arabic Stem Interdigitation
- wasayaktubuwnahaA
  - wa: 'and'
  - sa: future marker
  - ya: imperfect prefix
  - Stem interdigitation:
    - ktb: root k t b
    - CCVC: Form I imperfect template; CCVC > ktub (stem)
    - u: active-voice vocalization u
  - una: 'they', masc. plural (imperfect suffix)
  - ha: 'it/them' (direct-object clitic pronoun suffix)
- English gloss: 'and they will write it'
22. Some Formal Analyses of Semitic Stems
- Harris, 1944 (Root-Pattern)
    root:          b r        k t b
    pattern:       n_a_i_     _a_a_
    stem:          nabir      katab
- McCarthy, 1981 (Root-Template-Vocalization)
    (prefix tier)  n
    root:          b r        k t b
    template:      CCVCVC     CVCVC
    vocalization:  a i        a
    stem:          nabir      katab
Another alternative is simply to ignore or deny the concept of roots and treat stems as monolithic morphemes.
23. Finite-State Computational Semitic
- Kay, 1987: Arabic stem interdigitation via multi-level transducers (Kiraz, 2000)
- Lavie et al., 1988: Two-Level Morphology adapted to Hebrew verbs
- Kataja & Koskenniemi, 1988: Ancient Akkadian
  - Concatenating languages are just a special case
  - Morphotactics defined using regular expressions/operations
  - Roots and patterns formalized as regular languages
  - Roots are INTERSECTED with patterns, rather than concatenated, to form stems
    - Sublexicon of Roots: ? k ? t ? b ?
    - Sublexicon of Patterns: CaCaC
  - Pre-intersected by awk scripts: katab
  - Then compiled by TwoL
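The intersection idea can be sketched in Python by treating the root as a regular expression and the pattern as a small finite language. Several things here are assumptions made for the illustration: the `?` in the root entry is read as "any string", `C` ranges over a small consonant inventory borrowed from a later slide, and `root_regex`, `pattern_strings` and `intersect` are hypothetical names. The original work used awk scripts and TwoL, not this code.

```python
import re
from itertools import product

CONSONANTS = "ktbdrsmn"  # assumed inventory for the C slots

def root_regex(root):
    """Root as a regular language: its consonants in order, with
    arbitrary material allowed before, between and after them."""
    return re.compile(".*" + ".*".join(re.escape(c) for c in root) + ".*")

def pattern_strings(pattern):
    """Pattern as a finite language: each C slot ranges over CONSONANTS,
    every other symbol stands for itself."""
    slots = [CONSONANTS if ch == "C" else ch for ch in pattern]
    return {"".join(combo) for combo in product(*slots)}

def intersect(root, pattern):
    """Strings that belong to both the root language and the pattern language."""
    rx = root_regex(root)
    return sorted(s for s in pattern_strings(pattern) if rx.fullmatch(s))

print(intersect("ktb", "CaCaC"))  # ['katab']
```

Neither language mentions the other: the root says nothing about vowels and the pattern says nothing about which consonants appear, yet their intersection contains exactly the stem.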
24. Beesley: Arabic Stem Intersection at Runtime
- ALPNET (88-90)
  - [Diagram: root k t b aligned with wasayaCCuCunaha]
  - Roots and patterns resided in separate sublexicons
  - Root and pattern sublexicons were traversed in parallel at runtime
  - Intersection was simulated in C code (detouring) at runtime
  - ktb and CCuC were returned as separate morphemes in the analyses
  - Still mostly a Two-Level System
Beesley: Stem Intersection at Compile-time
- Xerox (1996-98): reimplementation using Xerox Finite-State Morphology
  - On-line demo available: http://www.xrce.xerox.com/research/mltt/arabic
  - Use any Java-enabled browser
25. Xerox Arabic Morphological Analyzer
- About 4930 roots in the underlying dictionary
- Each root is encoded to show which patterns it can combine with
- Roots and patterns are intersected to form over 90,000 stems
- With various combinations of prefixes and suffixes, the system encodes 72,000,000 fully-voweled words, with their morphological analyses
- In addition, it analyzes unvoweled and partially voweled spellings
- The compiled analyzer network is currently storable in about 5 MB
- The web demo is Unicode based and renders Arabic script as you type
- Roots, patterns and other affixes are separated and returned
26. Intersecting Stems on One Side of a Transducer at Compile Time
- Start with a Two-Level Lexicon
- Compose FS Intersecting Rules at Compile Time
  - Upper: wasayaktb CCuCunaha
  - Lower: wasayaktb CCuCunaha
  - .o.
  - Finite-State Stem-Intersection Rules
- Result
  - Upper: wasayaktb CCuCunaha
  - Lower: wasaya ktub unaha
- Then apply the finite-state morphophonological alternation/realization rules, handling weak roots, hamza orthography in general, assimilation, deletion, ...
27. Finite-State Merge: fast special-case intersection
- .m>. is the merge-to-the-right operator and
- .<m. is the merge-to-the-left operator
- ktb .m>. CVVCVC .<m. a > kaatab
- ktb .m>. CVVCVC .<m. ui > kuutib
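The two merge examples can be reproduced with a short Python sketch. It is an approximation under stated assumptions, not the Xerox merge algorithm: the vocalization is treated as a regular expression over the vowels (`a+` for the first example and `u*i` for the second, which is what the outputs kaatab and kuutib imply), the vowel inventory is a, i, u, and `merge_root`, `merge_vocalization` and `stem` are hypothetical names.

```python
import re
from itertools import product

VOWELS = "aiu"  # assumed vowel inventory for the V slots

def merge_root(root, template):
    """Fill the C slots of the template with the root consonants, left to right."""
    if template.count("C") != len(root):
        raise ValueError("root and template disagree on the number of consonants")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

def merge_vocalization(template, vocalization):
    """Fill the V slots with every vowel sequence matching the vocalization regex."""
    rx = re.compile(vocalization)
    n = template.count("V")
    results = []
    for vowels in product(VOWELS, repeat=n):
        if rx.fullmatch("".join(vowels)):
            fill = iter(vowels)
            results.append("".join(next(fill) if ch == "V" else ch for ch in template))
    return results

def stem(root, template, vocalization):
    """Root merged into the template, then the vocalization merged into the result."""
    return merge_vocalization(merge_root(root, template), vocalization)

print(stem("ktb", "CVVCVC", "a+"))   # ['kaatab']
print(stem("ktb", "CVVCVC", "u*i"))  # ['kuutib']
```

Treating the vocalization as a pattern rather than a fixed string is what lets a single vowel spread over three V slots, and lets u fill the leftover slots before the final i.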
28. The compile-replace algorithm: before and after
Before:
- Upper: ktb.m>.CVVCVC.<m.uia
- Lower: ktb.m>.CVVCVC.<m.uia
- xfst: list C k t b d r s m n
- xfst: list V a i u
- xfst: compile-replace lower
After:
- Upper: ktb.m>.CVVCVC.<m.uia
- Lower: kuutib a
and similarly for about 90,000 stems
29. The compile-replace algorithm
- A general compile-time technique that allows the regular-expression compiler to apply to and modify its own output.
- Somewhat similar in operation to eval in LISP and Perl.
- Appears to handle some classic examples of non-concatenative morphotactics: full-stem reduplication and Semitic stem interdigitation, under either
  - Two-way root-pattern theory, or
  - Three-way root-template-vocalization theory
- We've only begun to explore the possibilities.
30. What is Finite-State Computing Good For?
- Mostly lower-level natural language processing
  - Tokenization
  - Spelling checking/correction
  - Phonology
  - Morphological Analysis/Generation
  - Part-of-Speech Tagging
  - Shallow Syntactic Parsing and Chunking
Finite-state techniques cannot do everything, but for tasks where they do apply, they are extremely attractive.
31. What about Maltese?
- Necessary preliminary work has already started
  - Corpora
  - Lexicography
  - Formal linguistic description
- Finite-state implementation
  - Xerox finite-state calculus already licensed at Univ. of Malta
  - The compile-replace algorithm will soon be released
  - The Book (Beesley and Karttunen, forthcoming)
- Unique opportunity
  - Semitic component
  - Routinely written, in a culture with high literacy
32. Final Observations
- Successful computational linguistic projects are often the result of cooperation between a computational linguist and a more traditional descriptive linguist
- Computational linguistics can be commercially rewarding
- Computational linguistics is a healthy discipline from the descriptive point of view
  - Your grammars can literally be tested on millions of words
  - Any mistakes or gaps in your grammars soon become apparent