Title: Hindi Parts-of-Speech Tagging & Chunking
1. Hindi Parts-of-Speech Tagging & Chunking
2. What's in?
- Why POS tagging & chunking?
- Approach
- Challenges
- Unseen tag sequences
- Unknown words
- Results
- Future work
- Conclusion
3. Intro & Motivation
4. POS
- Parts-of-Speech
- Dionysius Thrax (ca. 100 BC)
- 8 types: noun, verb, pronoun, preposition, adverb, conjunction, participle and article
- "I get my thing in action. (Verb, that's what's happenin') To work, (Verb!) To play, (Verb!) To live, (Verb!) To love... (Verb!...)" - Schoolhouse Rock
5. Tagging
- Assigning the appropriate POS
- or lexical class marker
- to words in a given text
- Symbols, punctuation markers etc. are also assigned specific tag(s)
6. Why POS tagging?
- Gives significant information about a word and its neighbours
- Adjective near noun
- Adverb near verb
- Gives a clue on how a word is pronounced
- OBject as noun
- obJECT as verb
- Speech synthesis, full parsing of sentences, IR, word sense disambiguation etc.
7. Chunking
- Identifying simple phrases
- Noun phrase, verb phrase, adjectival phrase
- Useful as a first step to Parsing
- Named entity recognition
8. POS tagging & Chunking
9. Stochastic approaches
- Availability of tagged corpora in large quantities
- Most are based on HMMs
- Weischedel 93
- DeRose 88
- Skut and Brants 98 extending HMM to chunking
- Zhou and Su 00
- and lots more
10. HMM
- Assumptions
- Probability of a word depends only on its tag
- Approximate the tag history by the most recent two tags (the resulting model is written out below)
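In symbols, these two assumptions give the usual trigram-HMM objective (the notation is mine, but it follows directly from the assumptions above):

```latex
\hat{t}_1^{\,n} \;=\; \operatorname*{arg\,max}_{t_1 \ldots t_n}\; \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}, t_{i-2})
```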
11. Structural tags
- A triple: POS tag, structural relation, chunk tag
- Originally proposed by Skut & Brants 98
- Seven relations
- Enables embedded and overlapping chunks (a small encoding sketch follows)
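A minimal sketch of the encoding, assuming illustrative tag and relation names (the codes echo the figure on the next slide, not the talk's exact inventory):

```python
from collections import namedtuple

# One structural tag bundles (POS tag, structural relation, chunk tag),
# after Skut & Brants 98. Treating each distinct triple as a single
# atomic state lets an ordinary HMM tagger learn chunk structure too.
StructuralTag = namedtuple("StructuralTag", ["pos", "relation", "chunk"])

sequence = [
    StructuralTag("NN", "09", "NP"),   # noun opening an NP (assumed codes)
    StructuralTag("VFM", "90", "VG"),  # finite verb closing a VG
]

# Each triple collapses to one state symbol for training and decoding:
states = ["/".join(t) for t in sequence]
print(states)  # ['NN/09/NP', 'VFM/90/VG']
```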
12. Structural relations
[Figure: an example Hindi sentence (Devanagari lost in extraction) bracketed in SSF into NP and VG chunks under the sentence root, with Beg/End markers and structural-relation codes 00, 09, 90, 99 on the chunk-boundary transitions]
13. Decoding
- Viterbi mostly used (also A* or stack decoding)
- Aims at finding the best path (tag sequence) given the observation sequence
- Possible tags are identified for each transition, with associated probabilities
- The best path is the one that maximizes the product of these transition probabilities (a minimal sketch follows)
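A minimal Viterbi sketch, assuming a one-tag history for brevity (the slides use a two-tag history; the same code applies with tag-pair states) and placeholder probability tables estimated from the tagged corpus:

```python
def viterbi(words, tags, log_init, log_trans, log_emit):
    """Best tag path for `words` under a bigram-history HMM.
    log_init[t], log_trans[prev][t] and log_emit[t][w] are log-probabilities,
    assumed to be estimated from the tagged corpus. Unknown words get a
    flat penalty, standing in for the unseen-word handling discussed later."""
    UNK = -20.0
    # best[i][t]: score of the best path ending in tag t at position i
    best = [{t: log_init[t] + log_emit[t].get(words[0], UNK) for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        best.append({})
        back.append({})
        for t in tags:
            # pick the predecessor tag that maximizes the path score
            p = max(tags, key=lambda q: best[i - 1][q] + log_trans[q][t])
            best[i][t] = best[i - 1][p] + log_trans[p][t] + log_emit[t].get(w, UNK)
            back[i][t] = p
    # trace the best final tag back to the start
    t = max(tags, key=lambda q: best[-1][q])
    path = [t]
    for i in range(len(words) - 1, 0, -1):
        t = back[i][t]
        path.append(t)
    return path[::-1]
```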
14-16. [Worked example over three slides: a Hindi sentence (Devanagari lost in extraction) decoded step by step; each word's candidate tags are drawn from JJ, NLOC, NN, PREP, PRP, QFN, RB, VFM, SYM, and the best path through them is selected]
17. Issues
18. Issue 1: Unseen tag sequences
- Smoothing (Add-One, Good-Turing) and/or backoff (deleted interpolation)
- Idea is to distribute some fractional probability (of seen occurrences) to unseen events
- Good-Turing
- Re-estimates the probability mass of lower-count N-grams by that of higher counts: c* = (c + 1) N_{c+1} / N_c
- N_c - number of N-grams occurring c times (a small sketch follows)
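A toy re-estimator for the formula above; a real implementation also has to smooth over gaps in the N_c sequence, which this sketch sidesteps by falling back to raw counts:

```python
from collections import Counter

def good_turing(ngram_counts):
    """Good-Turing re-estimation: c* = (c + 1) * N_{c+1} / N_c.
    `ngram_counts` maps each seen N-gram to its raw count."""
    # N_c: number of distinct N-grams occurring exactly c times
    n = Counter(ngram_counts.values())
    total = sum(ngram_counts.values())
    def c_star(c):
        return (c + 1) * n[c + 1] / n[c] if n[c + 1] else c
    probs = {g: c_star(c) / total for g, c in ngram_counts.items()}
    p_unseen = n[1] / total  # probability mass handed to unseen N-grams
    return probs, p_unseen

# toy tag-bigram counts, purely illustrative
probs, p0 = good_turing({("NN", "VFM"): 3, ("NN", "PREP"): 1, ("JJ", "NN"): 1})
```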
19. Issue 2: Unseen words
- Insufficient corpus (even after 10 million words)
- Not all of them are proper names
- Treat them as rare words that occur once in the corpus
- Baayen and Sproat 96, Dermatas and Kokkinakis 95
- Known Hindi corpus of 25 K words and unseen corpus of 6 K words
- All words vs. hapax vs. unknown (a hapax-based sketch follows)
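The hapax idea in code: estimate the tag distribution of unseen words from words seen exactly once. A minimal sketch; the (word, tag) corpus format is an assumption:

```python
from collections import Counter

def hapax_tag_distribution(tagged_corpus):
    """Estimate P(tag | unknown word) from hapax legomena, i.e. treat
    unseen words like words occurring exactly once in the corpus
    (after Baayen & Sproat 96). `tagged_corpus`: list of (word, tag)."""
    freq = Counter(w for w, _ in tagged_corpus)
    hapax = Counter(t for w, t in tagged_corpus if freq[w] == 1)
    total = sum(hapax.values())
    return {t: c / total for t, c in hapax.items()}
```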
20. Tag distribution analysis
21. Issue 3: Features
- Can we use other features?
- Capitalization
- Word endings and hyphenation
- Weischedel 93 reports about a 66% reduction in error rate with word endings and hyphenation
- Capitalization, though useful for proper nouns, is not very effective
22. Contd.
- String length
- Prefix/suffix of fixed character width
- Character encoding range
- Complete analysis remains to be done
- Expected to be very effective for morphologically rich languages
- To be experimented with Tamil (an illustrative feature extractor follows)
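An extractor for the surface cues listed above. Which of them actually help Hindi or Tamil tagging is left open in the talk, so the feature set and `affix_len` are illustrative assumptions:

```python
def surface_features(word, affix_len=3):
    """Surface features of a word as a feature dict (sketch only)."""
    return {
        "prefix": word[:affix_len],      # fixed-width prefix
        "suffix": word[-affix_len:],     # fixed-width suffix (word ending)
        "length": len(word),             # string length
        "hyphenated": "-" in word,       # hyphenation
        # character encoding range: is the word entirely Devanagari?
        "devanagari": all("\u0900" <= ch <= "\u097F" for ch in word),
    }
```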
23. Issue 4: Multi-part words
- Examples
- In/ terms/ of/
- United/ States/ of/ America/
- More problematic in Hindi
- United/NNPC States/NNPC of/NNPC America/NNP
- Central/NNC government/NN
- NNPC - compound proper noun, NN - noun
- NNP - proper noun, NNC - compound noun
- How does the system identify the last word in a multi-part word?
- 10% of errors are due to this in Hindi (6 K words tested)
24. Results
25. Evaluation metrics
- Tag precision
- Unseen word accuracy
- % of unseen words that are correctly tagged
- Estimates how well unseen words are handled
- % reduction in error
- Reduction in error after the application of a particular feature (formulas below)
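Spelled out, using the standard readings of these names:

```latex
\text{tag precision} = \frac{\#\text{ correctly tagged words}}{\#\text{ words}}
\qquad
\text{unseen accuracy} = \frac{\#\text{ correctly tagged unseen words}}{\#\text{ unseen words}}
\qquad
\%\text{ error reduction} = \frac{E_{\text{without feature}} - E_{\text{with feature}}}{E_{\text{without feature}}} \times 100
```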
26. Results - Tagger
- No structural tags → better smoothing
- Unseen data has significantly more unknowns

                      Dev     S-1     S-2     S-3     S-4     Test
Words                 8511    6388    6397    6548    5847    5000
Correctly tagged      6749    5538    5504    5558    5060    3961
Precision (%)         79.29   86.69   86.04   86.06   86.54   79.22
Unseen words          1543    660     648     589     603     1012
Correctly tagged      672     354     323     265     312     421
Unseen precision (%)  43.55   53.63   49.84   44.99   51.74   41.6
27. Results - Chunk tagger
- Training 22 K, development data 8 K
- 4-fold cross validation
- Test data 5 K

            POS tagging    Chunk identification    Chunk labelling
            precision (%)  Pre (%)    Rec (%)      Pre (%)    Rec (%)
Dev data    76.16          69.54      69.05        66.73      66.27
Average     85.02          72.26      73.52        70.01      71.35
Test data   76.49          58.72      61.28        54.36      56.73
28. Results - Tagging error analysis
- Significant issues with nouns/multi-part words
- NNP → NN
- NNC → NN
- Also,
- VAUX → VFM, VFM → VAUX and
- NVB → NN, NN → NVB
29. HMM performance (English)
- > 96% reported accuracies
- About 85% for unknown words
- Advantage
- Simple and most suitable when annotated data is available
30. Conclusion
31. Future work
- Handling unseen words
- Smoothing
- Can we exploit other features?
- Especially morphological ones
- Multi-part words
32. Summary
- Statistical approaches now include linguistic features for higher accuracies
- Improvement required
- Tagging
- Precision 79.22%
- Unknown words 41.6%
- Chunking
- Precision ~60%
- Recall ~62%