Title: Combining Classifiers for Chinese Word Segmentation
1. Combining Classifiers for Chinese Word Segmentation
- Nianwen Xue, Institute for Research in Cognitive Science
- Susan P. Converse, Department of Computer and Information Science
- University of Pennsylvania
2. Organization of the presentation
- The Chinese word segmentation problem
- Recasting the problem
- Supervised machine learning approach combining classifiers
  - A maximum entropy tagger
  - A transformation-based tagger
- Experiments
- Conclusion and future work
3. The Chinese word segmentation problem
- Ambiguity (example from Richard Sproat):
  - 日文 章鱼 怎么 说 ?
    Japanese octopus how say
    "How do you say octopus in Japanese?"
  - 日 文章 鱼 怎么 说 ?
    Japan article fish how say
- New words
  - Proper names (罗斯福(路) 'Roosevelt (Road)')
  - Abbreviations
  - Neologisms (手机 'cell phone')
4. The Chinese word segmentation problem
- A common approach has been to view the problem as a substring matching problem, and to use the maximum matching algorithm.
- This formulation of the problem implies the use of a dictionary.
- Difficulties with this approach:
  - Ambiguity: a string can map onto different sequences of words from a dictionary
  - Words not found in the dictionary being used
5. Organization of the presentation
- The Chinese word segmentation problem
- Recasting the problem
- Supervised machine learning approach combining classifiers
  - A maximum entropy tagger
  - A transformation-based tagger
- Experiments
- Conclusion and future work
6. A different approach to solving the segmentation problem
- Word segmentation is difficult because a given Chinese character can occur in different positions within a word, e.g. 产 'produce':
  - by itself: 产
  - on the left: 产品 'product'
  - in the middle: 生产力 'productivity'
  - on the right: 投产 'start production'
- If we could reliably determine the position of each character within its word in a string of words, the problem of word segmentation would be solved.
7. Towards a solution: tag the character position
- Assign a tag to each character in a sentence based on the position of the character within a word (a POC tag):
  - by itself: 产/LR
  - on the left: 产/LL (产品 'product')
  - in the middle: 产/MM (生产力 'productivity')
  - on the right: 产/RR (投产 'start production')
- Ambiguity arises when a character has multiple possible tags: in a typical sentence, some characters are ambiguous between two tags (e.g. LL/LR, RR/LL, LL/RR, RR/LR) while others are unambiguous.
- The task then is to pick the correct tag based on the context, in a manner similar to the part-of-speech tagging problem; the sketch below shows how POC tags are read off a segmented sentence.
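A minimal sketch (not the authors' code; the function names are mine) of deriving POC tags from segmented words, which is also the conversion used to build the training data later in the talk:

    def poc_tags(word):
        """POC tag for each character of one word: LR for a single-character
        word, otherwise LL (left), MM (middle, repeated), RR (right)."""
        if len(word) == 1:
            return ["LR"]
        return ["LL"] + ["MM"] * (len(word) - 2) + ["RR"]

    def segmented_to_poc(words):
        """Flatten a segmented sentence into (character, POC tag) pairs."""
        return [(ch, tag) for w in words for ch, tag in zip(w, poc_tags(w))]

    # The slide's example words 'productivity' and 'start production':
    print(segmented_to_poc(["生产力", "投产"]))
    # [('生', 'LL'), ('产', 'MM'), ('力', 'RR'), ('投', 'LL'), ('产', 'RR')]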
8. The advantages of this reformulation of the problem
- POC tags are easier to manipulate than the N-gram rules used in other machine-learning approaches
- It is easy to take advantage of new advances in POS tagging technology
9. Feasibility
- Chinese characters are distributed in a constrained manner: some characters are not ambiguous, or are not ambiguous in all possible ways.
- Chinese words are generally short, typically fewer than four characters.
10. Contrast with the dictionary-based formulation
- Substring ambiguity is turned into a different type of ambiguity, in which a character can have multiple tags.
- Although new words are common in Chinese, since some word-formation processes are highly productive, new characters are much less common: one is far more likely to see a new combination of characters than a new character.
11. Organization of the presentation
- The Chinese word segmentation problem
- Recasting the problem
- Supervised machine learning approach combining classifiers
  - A maximum entropy tagger
  - A transformation-based tagger
- Experiments
- Conclusion and future work
12. Combining Classifiers for Word Segmentation
- Supervised learning approach to POC tag assignment that combines
  - a maximum entropy tagger with
  - a transformation-based error-driven tagger
- Division of labor:
  - the maximum entropy tagger is the main workhorse
  - the transformation-based tagger cleans up tagging inconsistencies
13. Training procedure
- Convert a manually segmented training corpus into a corpus of tagged characters
- Train a maximum entropy tagger on the tagged corpus
- Train a transformation-based tagger on the output of the maxent tagger, using the manually segmented corpus as a reference
14. Example of POC training data
- A manually segmented sentence:
- ?? ?? ? ?? ?? ?? ???? , ?? ??
- ?? ?? ?? ?? , ?? ?? ?? ?? ?? ??
- ? ?? ?
- The POC-tagged sequence automatically derived from the manual segmentation:
- ?_LL ?_RR ?_LL ?_RR ?_LR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ?_LL ?_MM ?_MM ?_RR ,_LR ?_LL ?_RR ?_LL
- ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ,_LR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR
- ?_LL ?_RR ?_LR ?_LL ?_RR ?_LR
15. Testing procedure
- POC-tag the testing corpus (distinct from the training corpus) with the maximum entropy tagger
- Clean up some tagging inconsistencies with the transformation-based tagger
- Convert the output into a segmented corpus (see the sketch below)
  - Problems arise when the tag sequence is inconsistent, e.g. LL followed by LL
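A minimal sketch of the conversion step. The repair policy for inconsistent sequences (any LL or LR encountered mid-word silently closes the previous word) is an assumption of mine; the talk does not specify the exact policy.

    def poc_to_words(tagged):
        """Convert (character, POC tag) pairs back into a word list."""
        words, current = [], ""
        for ch, tag in tagged:
            if tag in ("LL", "LR") and current:
                words.append(current)   # close a word left dangling by an
                current = ""            # inconsistent LL/MM before this tag
            current += ch
            if tag in ("RR", "LR"):
                words.append(current)
                current = ""
        if current:                     # flush a word left open at the end
            words.append(current)
        return words

    # An inconsistent sequence such as LL,LL still yields a segmentation:
    print(poc_to_words([("生", "LL"), ("产", "LL"), ("力", "RR")]))
    # ['生', '产力']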
16. Example of tagging by the MaxEnt tagger
- The same example tagged by the maximum entropy tagger (note the tagging inconsistency):
- ?_LL ?_RR ?_LL ?_RR ?_LR ?_LL ?_RR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LR
- ?_LL ,_LR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ,_LR ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LR
- ?_LL ?_RR ?_LR
17. Example after transformations
- The tagging inconsistency is fixed, but the tagging is still wrong in this case:
- ?_LL ?_RR ?_LL ?_RR ?_LR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ?_LR ?_LR ,_LR ?_LL ?_RR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ?_LL ?_RR ,_LR ?_LL ?_RR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ?_LL ?_RR ?_LR ?_LL ?_RR
18. Training the maximum entropy tagger (1)
- Contextual information that is useful in predicting the tag of a character is encoded with features.
- Examples:
  - If the current character is Wi, then it should be tagged Ti
  - If the previous character is tagged Ti-1, then the current character should be tagged Ti
  - If the previous character is Wi-1 and the current character is Wi, then the current character should be tagged Ti
- The features are instantiations of a pre-defined set of feature templates.
19. Training the maximum entropy tagger (2)
- Feature templates used for this tagger:
  - The current character
  - The previous (next) character and the current character
  - The previous (next) two characters
  - The previous character and the next character
  - The tag of the previous character
  - The tag of the character two before the current character
- The maxent training process effectively assigns a 'weight' to each feature; the weight indicates how effective that feature is in predicting how the character should be tagged (see the sketch below).
20. Training the transformation-based tagger (Brill 1995)
- The tagger learns a ranked set of rules instantiated from a pre-defined set of rule templates (a toy sketch of the learning loop follows below).
- After training, the learned rules are applied to the input during testing.
- A sampling of the rule templates used to learn rules. Change tag a to tag b when:
  - the preceding (following) character is tagged z
  - the preceding character is tagged z and the following character is tagged w
  - the preceding (following) character is c
  - one of the two preceding (following) characters is c
  - the current character is c and the preceding (following) character is tagged z
- where a, b, z and w are variables over the set of four tags (LL, RR, LR, MM) and c is a variable over characters
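A toy sketch of the error-driven learning loop, assuming only the simplest template ("change a to b when the next tag is z"); the rule space and scoring are simplified relative to Brill (1995).

    from itertools import product

    TAGS = ["LL", "RR", "LR", "MM"]

    def apply_rule(tags, rule):
        """Change tag `frm` to `to` wherever the next tag is `trig`."""
        frm, to, trig = rule
        out = list(tags)
        for i in range(len(out) - 1):
            if out[i] == frm and out[i + 1] == trig:
                out[i] = to
        return out

    def learn(current, reference, n_rules=5):
        """Greedily pick the rule that most reduces errors, apply, repeat."""
        learned = []
        for _ in range(n_rules):
            base = sum(a == b for a, b in zip(current, reference))
            best, best_gain = None, 0
            for rule in product(TAGS, TAGS, TAGS):
                if rule[0] == rule[1]:
                    continue
                gain = sum(a == b for a, b in
                           zip(apply_rule(current, rule), reference)) - base
                # keep the rule with the highest strictly positive gain
                if gain > best_gain:
                    best, best_gain = rule, gain
            if best is None:            # no rule improves accuracy: stop
                break
            current = apply_rule(current, best)
            learned.append(best)
        return learned

    # maxent output vs. gold tags for a toy three-character word:
    print(learn(["LL", "LL", "RR"], ["LL", "MM", "RR"]))
    # [('LL', 'MM', 'RR')]  i.e. change LL to MM when the next tag is RR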
21. Top five transformations learned
- RR → MM when the next tag is RR (NEXTTAG RR), e.g. RR RR → MM RR
- LL → LR when the next tag is LL (NEXTTAG LL), e.g. LL LL → LR LR
- LL → LR when the next tag is LR (NEXTTAG LR)
- MM → RR when the next two tags are LR LR (NEXTBIGRAM LR LR)
- RR → LR when the previous two tags are RR LR (PREVBIGRAM RR LR), e.g. RR LR RR → RR LR LR
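A sketch of how such a ranked rule list is applied at test time; the encoding of the five rules above is mine.

    def next_tag(tags, i):
        return tags[i + 1] if i + 1 < len(tags) else None

    RULES = [
        ("RR", "MM", lambda t, i: next_tag(t, i) == "RR"),
        ("LL", "LR", lambda t, i: next_tag(t, i) == "LL"),
        ("LL", "LR", lambda t, i: next_tag(t, i) == "LR"),
        ("MM", "RR", lambda t, i: t[i + 1:i + 3] == ["LR", "LR"]),
        ("RR", "LR", lambda t, i: t[max(0, i - 2):i] == ["RR", "LR"]),
    ]

    def apply_rules(tags):
        tags = list(tags)
        for frm, to, ctx in RULES:        # rules fire in learned rank order,
            for i in range(len(tags)):    # each sweeping left to right
                if tags[i] == frm and ctx(tags, i):
                    tags[i] = to
        return tags

    # The slide's last example: rule 5 fires on the final RR.
    print(apply_rules(["RR", "LR", "RR"]))  # ['RR', 'LR', 'LR']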
22. Organization of the presentation
- The Chinese word segmentation problem
- Recasting the problem
- Supervised machine learning approach combining classifiers
  - A maximum entropy tagger
  - A transformation-based tagger
- Experiments
- Conclusion and future work
23. Three experiments
- Training corpus size: 238K words (405K hanzi)
- Testing corpus size: 13K words (22K hanzi)
24. Experiment I
- Two sub-experiments (a sketch of the algorithm follows below):
  - A forward maximum-matching algorithm with a dictionary compiled from the training corpus; 497 new words in the testing corpus (3.95%)
  - The same algorithm with a dictionary compiled from both the training and testing corpora; no new words
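For reference, a minimal sketch of forward maximum matching: at each position, take the longest dictionary word that matches, falling back to a single character. The 4-character window follows the earlier observation that words are generally short; the tiny dictionary is illustrative, not the one compiled from the training corpus.

    def max_match(text, dictionary, max_len=4):
        words, i = [], 0
        while i < len(text):
            for n in range(min(max_len, len(text) - i), 1, -1):
                if text[i:i + n] in dictionary:   # longest match wins
                    words.append(text[i:i + n])
                    i += n
                    break
            else:                                 # no multi-char match:
                words.append(text[i])             # emit a single character
                i += 1
        return words

    print(max_match("生产力", {"生产", "生产力", "产品"}))  # ['生产力']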
25. Experiment II
- The maximum entropy tagger only
26. Experiment III
- Tag the testing corpus with the maximum entropy tagger
- Clean up the output with the transformation-based tagger
27. Results

  tagger(s)  tagging accuracy (%)    segmentation accuracy (%)
             training   testing      p        r        f
  1a         n/a        n/a          87.34    92.34    89.77
  1b         n/a        n/a          94.51    95.80    95.15
  2          97.55      95.95        94.90    94.88    94.89
  3          97.81      96.07        95.21    95.13    95.17

- 1a: maximum matching algorithm applied to testing data with new words
- 1b: maximum matching algorithm applied to testing data without new words
- 2: maximum entropy tagger
- 3: maximum entropy tagger combined with the transformation-based tagger
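The p/r/f columns are standard word-level precision, recall, and F-score. A sketch of how they can be computed by comparing word spans (the talk does not spell out its scorer); the example reuses the two segmentations of the Sproat sentence from slide 3:

    def spans(words):
        """Map a word list to the set of (start, end) character offsets."""
        out, i = set(), 0
        for w in words:
            out.add((i, i + len(w)))
            i += len(w)
        return out

    def prf(gold_words, test_words):
        gold, test = spans(gold_words), spans(test_words)
        correct = len(gold & test)          # words with both boundaries right
        p = correct / len(test)
        r = correct / len(gold)
        return p, r, 2 * p * r / (p + r)

    print(prf(["日文", "章鱼", "怎么", "说"],
              ["日", "文章", "鱼", "怎么", "说"]))
    # ≈ (0.40, 0.50, 0.44)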
28. Conclusions
- The maximum entropy model is effective when applied to Chinese word segmentation.
- The transformation-based tagger improves tagging accuracy by only 0.12%, but it does clean up some tagging inconsistencies: tagging errors are reduced by 3.0%, and segmentation accuracy improves by 0.28% (F-score).
29. Residual issues
- If we used only two tags, one for characters that start a word and one for those that do not, there would be no tagging inconsistency; but segmentation accuracy drops to about 94.3%.
- Will more training data help? We will try the segmenter on the Rocling corpus.
30. Thank You
31. The Chinese word segmentation problem
- ?????
- ? ?? ? ??
- ?????
- ?? ? ? ??
- ?? ? ? ? ??
32. New Results

  tagger(s)  segmentation accuracy (%)
             p        r        f
  1a         87.34    92.34    89.77
  1b         94.51    95.80    95.15
  2          94.90    94.88    94.89
  3          95.21    95.13    95.17