1
Hindi Parts-of-Speech Tagging & Chunking
  • Baskaran S
  • MSRI

2
What's in?
  • Why POS tagging & chunking?
  • Approach
  • Challenges
  • Unseen tag sequences
  • Unknown words
  • Results
  • Future work
  • Conclusion

3
Intro & Motivation
4
POS
  • Parts-of-Speech
  • Dionysius Thrax (ca 100 BC)
  • 8 types: noun, verb, pronoun, preposition,
    adverb, conjunction, participle and article
  • I get my thing in action. (Verb, that's what's
    happenin') To work, (Verb!) To play,
    (Verb!) To live, (Verb!) To love... (Verb!...)
  • - Schoolhouse Rock

5
Tagging
  • Assigning the appropriate POS (or lexical class)
    marker to words in a given text
  • Symbols, punctuation markers etc. are also
    assigned specific tag(s)

6
Why POS tagging?
  • Gives significant information about a word and
    its neighbours
  • Adjective near noun
  • Adverb near verb
  • Gives clue on how a word is pronounced
  • OBject as noun
  • obJECT as verb
  • Speech synthesis, full parsing of sentences, IR,
    word sense disambiguation etc.

7
Chunking
  • Identifying simple phrases
  • Noun phrase, verb phrase, adjectival phrase
  • Useful as a first step to Parsing
  • Named entity recognition

8
POS tagging & Chunking
9
Stochastic approaches
  • Availability of tagged corpora in large quantity
  • Most are based on HMM
  • Weischedel 93
  • DeRose 88
  • Skut and Brants 98 extending HMM to chunking
  • Zhou and Su 00
  • and lots more

10
HMM
  • Assumptions
  • Probability of a word is dependent only on its
    tag
  • Approximate the tag history to the most recent
    two tags
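The two assumptions above factor the joint probability as P(W, T) = Π P(w_i | t_i) · P(t_i | t_{i-2}, t_{i-1}). A minimal sketch in Python — the probability tables `emit_p` and `trans_p` are illustrative, not from the slides:

```python
import math

def sentence_log_prob(words, tags, emit_p, trans_p):
    """Joint log-probability of a tagged sentence under the two HMM
    assumptions above: a word depends only on its own tag, and a tag
    depends only on the two preceding tags (trigram history).
    emit_p[(tag, word)] and trans_p[(t_prev2, t_prev1, t)] are
    assumed probability dicts."""
    logp = 0.0
    history = ("<s>", "<s>")                         # padded sentence start
    for word, tag in zip(words, tags):
        logp += math.log(trans_p[history + (tag,)])  # P(t_i | t_{i-2}, t_{i-1})
        logp += math.log(emit_p[(tag, word)])        # P(w_i | t_i)
        history = (history[1], tag)
    return logp
```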

11
Structural tags
  • A triple: ⟨POS tag, structural relation, chunk
    tag⟩
  • Originally proposed by Skut & Brants 98
  • Seven relations
  • Enables embedded and overlapping chunks

12
Structural relations
[Figure: a Hindi example sentence segmented into NP and VG chunks in
SSF, with structural-relation labels (Beg, End, 00, 09, 90, 99) on the
transitions; the Devanagari text did not survive transcription]
13
Decoding
  • Viterbi mostly used (also A* or stack decoding)
  • Aims at finding the best path (tag sequence)
    given observation sequence
  • Possible tags are identified for each transition,
    with associated probabilities
  • The best path is the one that maximizes the
    product of these transition probabilities
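The decoding step described above can be sketched as follows. This is a bigram Viterbi for brevity (the slides use a trigram tag history), and the tables `emit_p` and `trans_p` are assumptions for illustration:

```python
import math

def viterbi(words, tagset, emit_p, trans_p):
    """Find the tag sequence maximizing the product of transition and
    emission probabilities, via dynamic programming over a trellis.
    trans_p[(prev_tag, tag)] and emit_p[(tag, word)] are assumed
    probability dicts; "<s>" marks sentence start."""
    # Each trellis column maps tag -> (best log-score, backpointer).
    trellis = [{t: (math.log(trans_p[("<s>", t)]) +
                    math.log(emit_p[(t, words[0])]), None)
                for t in tagset}]
    for w in words[1:]:
        prev = trellis[-1]
        col = {}
        for t in tagset:
            score, back = max(
                (prev[p][0] + math.log(trans_p[(p, t)]) +
                 math.log(emit_p[(t, w)]), p)
                for p in tagset)
            col[t] = (score, back)
        trellis.append(col)
    # Backtrace from the best final tag.
    path = [max(trellis[-1], key=lambda t: trellis[-1][t][0])]
    for i in range(len(trellis) - 1, 0, -1):
        path.append(trellis[i][path[-1]][1])
    return path[::-1]
```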

14-16
[Slides 14-16: animated Viterbi trellis for a Hindi example sentence,
with candidate tags JJ, NLOC, NN, PREP, PRP, QFN, RB, VFM and SYM at
each position; the Devanagari text and the trellis graphics did not
survive transcription]
17
Issues
18
1. Unseen tag sequences
  • Smoothing (Add-One, Good-Turing) and/or backoff
    (deleted interpolation)
  • Idea is to distribute some fractional probability
    (of seen occurrences) to unseen
  • Good-Turing
  • Re-estimates the probability mass of lower-count
    N-grams from that of higher counts
  • c* = (c + 1) · N_{c+1} / N_c, where N_c is the
    number of N-grams occurring c times
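As a sketch of the Good-Turing re-estimation just described (a minimal version that, unlike production taggers, does not smooth the N_c curve, so the adjusted count falls to 0 whenever N_{c+1} is 0):

```python
from collections import Counter

def good_turing(ngram_counts):
    """Replace each raw count c by c* = (c + 1) * N_{c+1} / N_c,
    where N_c is the number of distinct N-grams seen exactly c times.
    This shifts probability mass from seen events toward unseen ones."""
    n = Counter(ngram_counts.values())   # N_c: frequency of frequencies
    return {g: (c + 1) * n.get(c + 1, 0) / n[c]
            for g, c in ngram_counts.items()}
```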

19
2. Unseen words
  • Insufficient corpus (even with 10 million words)
  • Not all of them are proper names
  • Treat them as rare words that occur once in the
    corpus - Baayen and Sproat 96, Dermatas and
    Kokkinakis 95
  • Known Hindi corpus of 25 K words and unseen
    corpus of 6 K words
  • All words vs. Hapax vs. Unknown

20
Tag distribution analysis
21
3. Features
  • Can we use other features?
  • Capitalization
  • Word endings and Hyphenations
  • Weischedel 93 reports about 66% reduction in
    error rate with word endings and hyphenation
  • Capitalization, though useful for proper nouns,
    is not very effective

22
Contd
  • String length
  • Prefix/suffix of fixed character width
  • Character encoding range
  • Complete analysis remains to be done
  • Expected to be very effective for morphologically
    rich languages
  • To be experimented with Tamil
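The surface cues listed above could be packaged for an unknown word roughly as follows (the feature names are ours, not from the slides):

```python
def surface_features(word, width=3):
    """Illustrative extractor for the cues above: word endings,
    hyphenation, string length, and fixed-width prefix/suffix.
    For Hindi or Tamil, the character-encoding range of the word
    would be a further feature (omitted here)."""
    return {
        "prefix": word[:width],
        "suffix": word[-width:],
        "length": len(word),
        "hyphenated": "-" in word,
    }
```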

23
4. Multi-part words
  • Examples
  • In/ terms/ of/
  • United/ States/ of/ America/
  • More problematic in Hindi
  • United/NNPC States/NNPC of/NNPC America/NNP
  • Central/NNC government/NN
  • NNPC: compound proper noun, NN: noun
  • NNP: proper noun, NNC: compound noun
  • How does the system identify the last word in a
    multi-part word?
  • 10% of errors are due to this in Hindi (6 K words
    tested)

24
Results
25
Evaluation metrics
  • Tag precision
  • Unseen word accuracy
  • % of unseen words that are correctly tagged
  • Estimates how well unseen words are handled
  • % reduction in error
  • Reduction in error after the application of a
    particular feature
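The last metric can be made concrete with a small helper (the function name and the 80% → 90% example are ours, not from the slides):

```python
def error_reduction(base_precision, new_precision):
    """% reduction in error after applying a feature: the fraction of
    the baseline's errors the feature removes (precisions in percent)."""
    base_error = 100.0 - base_precision
    new_error = 100.0 - new_precision
    return 100.0 * (base_error - new_error) / base_error
```

For example, raising precision from 80% to 90% halves the errors, a 50% reduction in error.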

26
Results - Tagger
  • No structural tags → better smoothing
  • Unseen data has significantly more unknowns

                       Dev     S-1     S-2     S-3     S-4     Test
Words                  8511    6388    6397    6548    5847    5000
Correctly tagged       6749    5538    5504    5558    5060    3961
Precision (%)          79.29   86.69   86.04   86.06   86.54   79.22
Unseen words           1543    660     648     589     603     1012
Correctly tagged       672     354     323     265     312     421
Unseen precision (%)   43.55   53.63   49.84   44.99   51.74   41.60
27
Results Chunk tagger
  • Training 22 K, development data 8 K
  • 4-fold cross-validation
  • Test data 5 K

            POS tagging    Chunk identification    Chunk labelling
            precision (%)  Pre (%)    Rec (%)      Pre (%)    Rec (%)
Dev data    76.16          69.54      69.05        66.73      66.27
Average     85.02          72.26      73.52        70.01      71.35
Test data   76.49          58.72      61.28        54.36      56.73
28
Results Tagging error analysis
  • Significant issues with nouns/multi-part words
  • NNP → NN
  • NNC → NN
  • Also,
  • VAUX → VFM, VFM → VAUX and
  • NVB → NN, NN → NVB

29
HMM performance (English)
  • > 96% reported accuracies
  • About 85% for unknown words
  • Advantage
  • Simple, and most suitable when annotated data is
    available

30
Conclusion
31
Future work
  • Handling unseen words
  • Smoothing
  • Can we exploit other features?
  • Especially morphological ones
  • Multi-part words

32
Summary
  • Statistical approaches now include linguistic
    features for higher accuracies
  • Improvement required
  • Tagging
  • Precision 79.22%
  • Unknown words 41.6%
  • Chunking
  • Precision 60%
  • Recall 62%