Title: EXTENDING A PERSIAN MORPHOLOGICAL ANALYZER TO BLOGS
1 EXTENDING A PERSIAN MORPHOLOGICAL ANALYZER TO BLOGS
- Karine Megerdoomian
- University of Maryland,
- College Park
- karinem_at_umiacs.umd.edu
2 Talk Outline
- Persian Weblogs
- Persian is the 4th largest blog language in the world (75,000 sites)
- Description of a finite-state morphological analyzer for Persian
- System description
- Language issues and implementation
- Computational issues in weblogs
3 Language of Blogs
- Contain both formal and informal morphology
- Morphology
- Informal text is very different from formal text
- Features that don't exist in formal text
- Shortened verbal stems and inflection
4 Language of Blogs
- Morphology
- Colloquial pronunciation
- Spelling errors and non-standard punctuation and spacing
- Emoticons and hyperlinks
5 Language of Blogs
- Lexicon
- Wordforms follow pronunciation
- Colloquial forms
- New words
6 Language of Blogs
- Lexicon
- Loan words
- Interjections
- More idiomatic expressions
7 Language of Blogs
- Huge amount of variation!!
- Need for flexible rules
- Phonological rules to represent colloquial speech
- Need to disambiguate (statistical component?)
- Formal blog text is also different from traditional formal text
8 Language of Blogs
- BBC [Persian example text not recoverable]
9 Finite-State Transducers (FST)
- Two-level network or transducer
- Input: lower side of arc
- Output: upper side of arc
- Example: upper side b i r d +Noun +Pl, lower side b i r d s
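A minimal sketch of the two-level idea, using the bird example from the slide. Each arc pairs an upper symbol with a lower symbol. The arc list, the `+Noun` and `+Pl` tag spellings, and the function names are assumptions made for illustration; this is plain Python, not XFST.

```python
# Toy transducer: a chain of arcs (source, upper symbol, lower symbol, target).
# The empty string stands for epsilon (no symbol on that side).
ARCS = [
    (0, "b", "b", 1),
    (1, "i", "i", 2),
    (2, "r", "r", 3),
    (3, "d", "d", 4),
    (4, "+Noun", "", 5),
    (5, "+Pl", "s", 6),
]
FINAL = {6}


def _walk(text, state, read, write):
    """Consume `text` on side `read` and collect the strings seen on side `write`."""
    results = [""] if (not text and state in FINAL) else []
    for arc in ARCS:
        if arc[0] == state and text.startswith(arc[read]):
            for rest in _walk(text[len(arc[read]):], arc[3], read, write):
                results.append(arc[write] + rest)
    return results


def apply_up(surface):
    """Analysis: lower side in, upper side out."""
    return _walk(surface, 0, read=2, write=1)


def apply_down(lexical):
    """Generation: upper side in, lower side out."""
    return _walk(lexical, 0, read=1, write=2)


print(apply_up("birds"))
print(apply_down("bird+Noun+Pl"))
```

The same network serves both directions, which is the property the analyzer relies on.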
10 MA System Description
- Developed on Xerox Finite State Technology (XFST) (Karttunen & Beesley 1992)
- Components
- Lexicon and morphology rules (lexc)
- Phonological rules (regular expressions)
- Compiled into an FST (finite-state transducer)
- FST for each part of speech created separately, then composed -> final FST for morphological analysis
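A rough sketch of what composition does, with each network reduced to a finite set of (upper, lower) string pairs. The tag names (`+Clitic`, `+Past`), the verb entry, and joining the per-POS lexicons by set union are all assumptions for illustration; the real system composes compiled networks in XFST.

```python
def compose(upper_rel, lower_rel):
    """Relation composition: pair x with z whenever x:y and y:z both exist."""
    return {(x, z) for (x, y) in upper_rel for (y2, z) in lower_rel if y == y2}


# Hypothetical per-POS lexicons: analysis string paired with an intermediate form.
noun_lexicon = {
    ("ktab+Noun+Sg+Clitic", "ktabNBš"),
    ("Sda+Noun+Sg+Clitic", "SdaNBš"),
}
verb_lexicon = {("rft+Verb+Past", "rft")}

# Hypothetical phonology relation: intermediate form paired with surface form.
phonology = {
    ("ktabNBš", "ktabš"),
    ("SdaNBš", "Sdayš"),
    ("rft", "rft"),
}

final = compose(noun_lexicon | verb_lexicon, phonology)


def analyze(surface):
    """Look up every analysis whose surface side equals `surface`."""
    return sorted(x for x, z in final if z == surface)


print(analyze("ktabš"))
```

After composition the intermediate forms disappear: the final relation maps analyses directly to surface strings.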
11 MA System Description
[Diagram: the Noun FST, Verb FST, Adverb FST, and phonology rules are combined by COMPOSITION into the final FST for morphology, which maps an input string to an output string]
12 MA System Description
- Coverage: formal Persian language
- Full verbal conjugation
- Nonverbal inflection
- Productive derivational morphology
- 20 phonological rules
- Proper nouns of people, places, organizations
13 Inflectional Morphology
- LEXICON Root
- ktab Noun
- LEXICON Noun
- +Pl:ha
- +Pl:_ha
- +Sg:0
- +Pl:a
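The lexicon fragment above can be mimicked in a few lines, assuming each Noun entry pairs a tag with a suffix (Pl with ha, Sg with nothing, and so on) and that `Noun` on the Root line is a continuation class. The dictionary layout, the `#` end marker, and the `expand` helper are hypothetical; lexc itself compiles such entries into a network rather than enumerating them.

```python
# Each entry: (upper side, lower side, continuation class). "#" ends the word.
LEXICONS = {
    "Root": [("ktab", "ktab", "Noun")],
    "Noun": [
        ("+Pl", "ha", "#"),
        ("+Pl", "_ha", "#"),
        ("+Sg", "", "#"),
        ("+Pl", "a", "#"),
    ],
}


def expand(lexicon="Root"):
    """Yield every (upper, lower) pair reachable from `lexicon`."""
    if lexicon == "#":
        yield ("", "")
        return
    for upper, lower, continuation in LEXICONS[lexicon]:
        for rest_upper, rest_lower in expand(continuation):
            yield (upper + rest_upper, lower + rest_lower)


for upper, lower in expand():
    print(upper, lower)
```

Following continuation classes from Root is all it takes to enumerate the forms; adding a colloquial suffix is a one-line change to the Noun lexicon.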
14 Complex Tokens
- Two different POS categories
- bh+Prep<eqydh+Noun+Sg
- dr+Prep<dftr+Noun+Sg
- ktab+Noun+Pl>av+Pron+Pers+Poss+1P+Pl
- bradr+Noun+Sg>av+Pron+Pers+Poss+1P+Pl
- >bvdn+Verb+Ind+Pres+3P+Sg
15 Verbal Morphology
16 Verbal Morphology
- LEXICON PastStem
- tvanst Infl1
- rft Infl1
- xndyd Infl1
- LEXICON PresentStem
- tvanst:tvan Infl2
- rft:rv Infl2
- xndyd:xnd Infl2
- LEXICON PstStemBlog
- tvnst InflBlog1
- LEXICON PrStemBlog
- tvanst:tvn Infl2
- rft:r Infl2
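One way to read the stem lexicons above: each present-stem entry pairs the past stem (upper side) with the present stem (lower side), and the blog lexicons add shortened variants next to the formal ones. The sketch below assumes that reading; the table layout and the `lookup_stem` helper are hypothetical, and the inflection classes are left out.

```python
# (upper side, lower side) per lexicon; past-stem entries map to themselves.
STEM_LEXICONS = {
    "PastStem": [("tvanst", "tvanst"), ("rft", "rft"), ("xndyd", "xndyd")],
    "PresentStem": [("tvanst", "tvan"), ("rft", "rv"), ("xndyd", "xnd")],
    "PstStemBlog": [("tvnst", "tvnst")],
    "PrStemBlog": [("tvanst", "tvn"), ("rft", "r")],
}


def lookup_stem(surface):
    """Return (upper-side stem, lexicon name) for every entry matching `surface`."""
    return [
        (upper, name)
        for name, entries in STEM_LEXICONS.items()
        for upper, lower in entries
        if lower == surface
    ]


print(lookup_stem("rv"))
print(lookup_stem("r"))
```

Formal and colloquial stems resolve to the same upper-side form, while the lexicon name records which register matched.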
17 Long Distance Dependencies
- Some tenses of the verb can only be determined if we take into account the co-occurrence of the prefix and the person inflection / auxiliary -> problem for linear approaches
18 Long Distance Dependencies
- Leads to very complex paths and continuation classes in lexc
- Using filters greatly increases the size of the FST
- Use flag diacritics for unification (@U.Feature.Value@)
- Keeps FST small
- Can apply constraints between non-adjacent morphemes
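A small sketch of how unification flags constrain non-adjacent morphemes. A path is a list of symbols; a U-type flag sets its feature if unset and otherwise must match the stored value. The `Tense` feature, its values, and the morpheme sequences are hypothetical and are not the flags used in the actual analyzer.

```python
import re

# Matches a unification flag diacritic written as @U.Feature.Value@.
FLAG = re.compile(r"^@U\.([^.@]+)\.([^.@]+)@$")


def unify_path(symbols):
    """Return the surface string if every U-flag unifies, otherwise None."""
    features = {}
    output = []
    for symbol in symbols:
        match = FLAG.match(symbol)
        if match is None:
            output.append(symbol)  # ordinary symbol: keep it
            continue
        feature, value = match.groups()
        # setdefault stores the value on first sight, then returns what is stored.
        if features.setdefault(feature, value) != value:
            return None  # conflicting value: the path is blocked
    return "".join(output)


good = ["@U.Tense.Pres@", "my", "rv", "@U.Tense.Pres@", "d"]
bad = ["@U.Tense.Past@", "my", "rft", "@U.Tense.Pres@", "d"]
print(unify_path(good))
print(unify_path(bad))
```

The check happens while walking a path, so the network does not need a separate copy of the stem lexicon for every prefix value.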
19 Phonology Rules
- Form of affixes may change based on the ending character of the stem, in both formal and informal text
define clitic1 NB -> 0 || Cons __
define clitic2 NB -> y || Vowel __
define clitic3 NB -> \u200c a || e __
ktabNBš SdaNBš hmsayeNBš
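The three clitic rules can be approximated with regular-expression substitutions, reading them as: the boundary NB is deleted after a consonant, becomes y after a vowel, and becomes a zero-width non-joiner plus a after e. The membership of Cons and Vowel is not given on the slide, so the sets below are placeholders chosen so that the three example words work; the function name is hypothetical.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner
VOWELS = "a"     # placeholder Vowel set; the slide does not list its members


def apply_clitic_rules(form):
    """Rewrite the NB boundary marker according to the preceding character."""
    # clitic3: NB -> ZWNJ a after e
    form = re.sub(r"(?<=e)NB", ZWNJ + "a", form)
    # clitic2: NB -> y after a vowel
    form = re.sub(r"(?<=[" + VOWELS + r"])NB", "y", form)
    # clitic1: NB -> nothing after a consonant (anything not in the sets above)
    form = re.sub(r"(?<=[^" + VOWELS + r"e])NB", "", form)
    return form


for word in ["ktabNBš", "SdaNBš", "hmsayeNBš"]:
    print(repr(apply_clitic_rules(word)))
```

The three contexts are disjoint, so the order of the substitutions does not affect the result here.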
20 Evaluation
- FST: 178,452 states, 928,982 arcs before optimization
- Speed: 20.84 seconds of CPU time for a 10 MB file, on a Sun SPARCstation
- Coverage: 97.5%, Accuracy: 95%
- Unanalyzed tokens: proper nouns, missing lexicon words
- No weblog language rules included yet!
21 Conclusion
- Challenges in morphological analysis of Persian formal text -> solutions in XFST system
- New issues and variance due to blog language
- Need robust system
- Lexicon updated with colloquial forms
- Flexible morphological rules and derivational morphology rules
- Transliteration component for loan words
- Statistical approach to disambiguate and to deal with unknowns