EXTENDING A PERSIAN MORPHOLOGICAL ANALYZER TO BLOGS

1
EXTENDING A PERSIAN MORPHOLOGICAL ANALYZER TO
BLOGS
  • Karine Megerdoomian
  • University of Maryland,
  • College Park
  • karinem@umiacs.umd.edu

????? ?????? ?????? ???? ????? ? ??????
  • ??????? ?????

2
Talk Outline
  • Persian Weblogs
  • Persian is the 4th largest blog language in the
    world (75,000 sites)
  • Description of a finite-state morphological
    analyzer for Persian
  • System description
  • Language issues and implementation
  • Computational issues in weblogs

3
Language of Blogs
  • Contain both formal and informal morphology
  • Morphology
  • Informal text is very different from formal text
  • ??? ????? ??? ?????? ??
  • Features that don't exist in formal text
  • ????????? ????
  • Shortened verbal stems and inflection
  • ?? ????? ?????

4
Language of Blogs
  • Morphology
  • Colloquial pronunciation
  • ????? ?????? ? ??? ????? ? ????????? ? ??????? ?
    ??? ???
  • ????? ? ?????? ? ???? ??? ??? ? ?????? ????
  • Spelling errors and non-standard punctuation
    and spacing
  • Emoticons and hyperlinks

5
Language of Blogs
  • Lexicon
  • Wordforms follow pronunciation
  • ????? ? ???? ? ????? ??? ? ???? ? ???? ? ?????? ?
    ?? ????
  • Colloquial forms
  • ?? ??????? ? ???? ???????
  • New words
  • ???????? ? ?????? ????? ????

6
Language of Blogs
  • Lexicon
  • Loan words
  • ?? ??? ? ?? ???? ? ??? ??? ????
  • Interjections
  • ?????! ? ???? ? ??? ? ?????!
  • More idiomatic expressions
  • ??? ??? ???

7
Language of Blogs
  • Huge amount of variation!!
  • Need for flexible rules
  • Phonological rules to represent colloquial speech
  • Need to disambiguate (statistical component?)
  • Formal blog text is also different from
    traditional formal text

8
Language of Blogs
  • BBC ???????
  • ????? ??? ???????
  • ?????? ??? ????????
  • ???? ?? ?????
  • ?? ?? ????
  • ???? ?? ??????
  • ??? ???

9
Finite-State Transducers (FST)
  • Two-level network or transducer
  • Input: lower side of arc
  • Output: upper side of arc

Upper side: b i r d +Noun +Pl
Lower side: b i r d s

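
The bird example can be mimicked in a few lines of Python. This is only a toy sketch, not XFST: the arc list, the state numbering, the use of an empty string for epsilon, and the `analyze` helper are all assumptions made for illustration. Only the symbols b, i, r, d, s and the tags Noun and Pl come from the figure.

```python
# Toy two-level transducer for the "bird(s)" figure.
# Each arc is (source_state, lower_symbol, upper_symbol, target_state).
# The empty string "" stands for epsilon (no symbol on that side).
ARCS = [
    (0, "b", "b", 1),
    (1, "i", "i", 2),
    (2, "r", "r", 3),
    (3, "d", "d", 4),
    (4, "", "+Noun", 5),   # tag appears only on the upper side
    (5, "s", "+Pl", 6),    # surface "s" is paired with the plural tag
]
FINAL_STATES = {6}


def analyze(surface):
    """Return every upper-side string paired with the lower-side input."""
    results = []
    stack = [(0, 0, "")]  # (state, position in surface, upper string so far)
    while stack:
        state, pos, upper = stack.pop()
        if state in FINAL_STATES and pos == len(surface):
            results.append(upper)
        for src, lower, up, dst in ARCS:
            if src != state:
                continue
            if lower == "":
                # Epsilon on the lower side: consume no input.
                stack.append((dst, pos, upper + up))
            elif surface.startswith(lower, pos):
                stack.append((dst, pos + len(lower), upper + up))
    return results


print(analyze("birds"))  # ['bird+Noun+Pl']
```

Reading the same arcs in the other direction (upper side as input) would give generation instead of analysis.
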
10
MA System Description
  • Developed on Xerox Finite State Technology (XFST)
    (Karttunen & Beesley 1992)
  • Components
  • Lexicon and morphology rules (lexc)
  • Phonological rules (regular expressions)
  • Compiled into an FST (finite-state transducer)
  • FST for each part of speech created separately,
    then composed into the final FST for
    morphological analysis

11
MA System Description
Input string → Final FST for Morphology → Output string
Final FST for Morphology = COMPOSITION of the part-of-speech FSTs
(Noun FST, Verb FST, Adverb FST) with the Phonology rules

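
Composition can be illustrated by treating each transducer as a finite set of (upper, lower) string pairs. This is a minimal sketch under that simplifying assumption; a real system composes automata rather than enumerated pairs. The names `compose`, `lookup`, `NOUN_LEXICON`, and `PHONOLOGY`, the boundary marker `^`, and the sample entries are hypothetical.

```python
def compose(upper_fst, lower_fst):
    """Compose two relations given as sets of (upper, lower) string pairs.

    The result pairs x with z whenever upper_fst pairs x with some y
    and lower_fst pairs that same y with z.
    """
    return {(x, z) for (x, y) in upper_fst for (y2, z) in lower_fst if y == y2}


# Hypothetical lexicon relation: analysis string -> intermediate form.
NOUN_LEXICON = {
    ("ktab+Noun+Sg", "ktab"),
    ("ktab+Noun+Pl", "ktab^ha"),
}

# Hypothetical rule relation: intermediate form -> surface form.
PHONOLOGY = {
    ("ktab", "ktab"),
    ("ktab^ha", "ktabha"),
}

NOUN_FST = compose(NOUN_LEXICON, PHONOLOGY)


def lookup(fst, surface):
    """Return all analyses (upper strings) for a surface form."""
    return sorted(upper for (upper, lower) in fst if lower == surface)


print(lookup(NOUN_FST, "ktabha"))  # ['ktab+Noun+Pl']
```

After composition the intermediate forms disappear: the resulting relation maps analyses directly to surface strings.
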
12
MA System Description
  • Coverage: formal Persian language
  • Full verbal conjugation
  • Nonverbal inflection ??????? ? ????
  • Productive derivational morphology ????? ???
  • 20 phonological rules
  • Proper nouns of people, places, organizations

13
Inflectional Morphology
  • LEXICON Root
  • ktab Noun
  • LEXICON Noun
  • +Pl:ha ??????
  • +Pl:_ha ???? ??
  • +Sg:0 ????
  • +Pl:a ?????

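
The way continuation classes chain lexicons together can be sketched in Python. This is an illustration, not lexc: it assumes the Noun entries are upper:lower pairs (tag on the upper side, suffix on the lower side), and the `LEXICONS` table, the `#` end marker, and the `expand` helper are hypothetical names.

```python
# Each entry is (upper_side, lower_side, continuation_class).
# "#" marks the end of a word.
LEXICONS = {
    "Root": [("ktab", "ktab", "Noun")],
    "Noun": [
        ("+Pl", "ha", "#"),  # plural tag realized as the suffix "ha"
        ("+Sg", "", "#"),    # singular tag has no surface realization
    ],
}


def expand(lexicon="Root", upper="", lower=""):
    """Yield every (upper, lower) pair reachable from the given lexicon."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, low, continuation in LEXICONS[lexicon]:
        yield from expand(continuation, upper + up, lower + low)


pairs = list(expand())
print(pairs)  # [('ktab+Pl', 'ktabha'), ('ktab+Sg', 'ktab')]
```

Adding a new noun stem only requires one more entry in `Root`; it inherits every suffix in `Noun` through its continuation class.
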
14
Complex Tokens
  • Two different POS categories
  • ?????? ??? ? ??????? ?????? - ?????? ? ????
  • bh+Prep<eqydh+Noun+Sg ??????
  • dr+Prep<dftr+Noun+Sg ??????
  • ktab+Noun+Pl>av+Pron+Pers+Poss+1P+Pl
    ??????????
  • ??????? bradr+Noun+Sg>av+Pron+Pers+Poss+1P+Pl
  • >bvdn+Verb+Ind+Pres+3P+Sg

15
Verbal Morphology
  • Two different stems

16
Verbal Morphology
  • LEXICON PastStem
  • tvanst Infl1
  • rft Infl1
  • xndyd Infl1
  • LEXICON PresentStem
  • tvanst:tvan Infl2
  • rft:rv Infl2
  • xndyd:xnd Infl2
  • LEXICON PstStemBlog
  • tvnst InflBlog1
  • LEXICON PrStemBlog
  • tvanst:tvn Infl2
  • rft:r Infl2
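
The stem lexicons above amount to a table that maps each surface stem back to a lemma, a tense, and a register. The sketch below assumes that the present-stem entries pair a lemma with its stem (for example tvanst with tvan, and with tvn in blog text); `STEM_TABLE` and `lookup_stem` are hypothetical names, and the stems themselves are taken from the slide.

```python
# (lemma, surface_stem, tense, register)
STEM_TABLE = [
    ("tvanst", "tvanst", "past", "formal"),
    ("rft", "rft", "past", "formal"),
    ("xndyd", "xndyd", "past", "formal"),
    ("tvanst", "tvan", "present", "formal"),
    ("rft", "rv", "present", "formal"),
    ("xndyd", "xnd", "present", "formal"),
    ("tvanst", "tvnst", "past", "blog"),     # shortened past stem
    ("tvanst", "tvn", "present", "blog"),    # shortened present stem
    ("rft", "r", "present", "blog"),
]


def lookup_stem(stem):
    """Return (lemma, tense, register) for every entry with this stem."""
    return [
        (lemma, tense, register)
        for lemma, surface, tense, register in STEM_TABLE
        if surface == stem
    ]


print(lookup_stem("tvn"))  # [('tvanst', 'present', 'blog')]
print(lookup_stem("rv"))   # [('rft', 'present', 'formal')]
```
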

17
Long Distance Dependencies
  • Some tenses of the verb can only be determined if
    we take into account the co-occurrence of the
    prefix and the person inflection / auxiliary →
    a problem for linear approaches

18
Long Distance Dependencies
  • Leads to very complex paths and continuation
    classes in lexc
  • Using filters greatly increases the size of the
    FST
  • Use flag diacritics for unification
    (@U.Feature.Value@)
  • Keeps the FST small
  • Can apply constraints between non-adjacent
    morphemes

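
A unification flag can be simulated as a check performed while walking a path: the first flag for a feature sets its value, and any later flag for the same feature must agree. The sketch below is an assumption-laden illustration, not the XFST implementation. The feature name `Tense`, the morphemes `my` and `m`, and the helper `accept_path` are hypothetical; the stem `rv` comes from the verbal morphology slide.

```python
import re

# Matches unification flags of the form @U.Feature.Value@
FLAG = re.compile(r"^@U\.([^.@]+)\.([^.@]+)@$")


def accept_path(symbols):
    """Return the path's string without flags, or None if unification fails."""
    features = {}
    output = []
    for symbol in symbols:
        match = FLAG.match(symbol)
        if match is None:
            output.append(symbol)  # ordinary symbol: keep it
            continue
        feature, value = match.groups()
        # U flag: succeeds if the feature is unset or already has this value.
        if features.setdefault(feature, value) != value:
            return None
    return "".join(output)


# Prefix and ending agree on the feature value, so the path is accepted.
print(accept_path(["@U.Tense.Pres@", "my", "rv", "@U.Tense.Pres@", "m"]))  # myrvm

# Prefix and ending disagree, so the path is rejected.
print(accept_path(["@U.Tense.Pres@", "my", "rv", "@U.Tense.Past@", "m"]))  # None
```

The constraint links the prefix and the ending without enumerating separate paths for every stem in between, which is why the network stays small.
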
19
Phonology Rules
  • Form of affixes may change based on the ending
    character of the stem
  • Formal: ????? ? ??? ????/????? ? ?????? ??
  • Informal: ????? ? ?????/???? ? ??????

define clitic1 NB -> 0 || Cons _ ;
define clitic2 NB -> y || Vowel _ ;
define clitic3 NB -> \u200c || [a | e] _ ;

ktabNBš SdaNBš hmsayeNBš

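
The effect of each clitic rule on the three sample forms can be sketched in Python. This is an approximation under stated assumptions: NB is read as a boundary marker that is rewritten according to the symbol on its left, `VOWELS` and `CONSONANTS` are guessed inventories for the transliteration, and `apply_rule` is a hypothetical helper. The rules are applied one at a time here; how the original system orders or selects them is not shown on the slide.

```python
import string

ZWNJ = "\u200c"  # zero-width non-joiner

# Assumed inventories for the Latin transliteration used on the slide.
VOWELS = set("aeiou")
CONSONANTS = set(string.ascii_letters) - VOWELS


def apply_rule(word, replacement, left_context):
    """Rewrite the boundary marker NB if the symbol before it is in left_context."""
    index = word.find("NB")
    if index <= 0 or word[index - 1] not in left_context:
        return word  # rule does not apply; leave the word unchanged
    return word[:index] + replacement + word[index + 2:]


# clitic1: NB is deleted after a consonant.
print(apply_rule("ktabNBš", "", CONSONANTS))  # ktabš

# clitic2: NB becomes y after a vowel.
print(apply_rule("SdaNBš", "y", VOWELS))  # Sdayš

# clitic3: NB becomes a zero-width non-joiner after a or e.
print(repr(apply_rule("hmsayeNBš", ZWNJ, {"a", "e"})))
```
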
20
Evaluation
  • FST: 178,452 states, 928,982 arcs before
    optimization
  • Speed: 20.84 seconds of CPU time for a 10 MB
    file, on a Sun SPARCstation
  • Coverage: 97.5%; Accuracy: 95%
  • Unanalyzed tokens: proper nouns, missing lexicon
    words
  • No weblog language rules included yet!

21
Conclusion
  • Challenges in morphological analysis of Persian
    formal text → solutions in the XFST system
  • New issues and variance due to blog language
  • Need robust system

  • Lexicon updated with colloquial forms
  • Flexible morphological rules and derivational
    morphology rules
  • Transliteration component for loan words
  • Statistical approach to disambiguate and to
    deal with unknowns