Title: Combining Classifiers for Chinese Word Segmentation
1. Combining Classifiers for Chinese Word Segmentation
- Nianwen Xue, Institute for Research in Cognitive Science
- Susan P. Converse, Department of Computer and Information Science
- University of Pennsylvania
2. Organization of the presentation
- The Chinese word segmentation problem
- Recasting the problem
- Supervised machine learning approach combining classifiers
  - A maximum entropy tagger
  - A transformation-based tagger
- Experiments
- Conclusion and future work
3. The Chinese word segmentation problem
- Ambiguity (example from Richard Sproat):
  - 日文 章鱼 怎么 说 ?
    Japanese octopus how say
    "How do you say octopus in Japanese?"
  - 日 文章 鱼 怎么 说 ?
    Japan article fish how say
- New words
  - Proper names (罗斯福(路) 'Roosevelt (Road)')
  - Abbreviations
  - Neologisms (手机 'cell phone')
4. The Chinese word segmentation problem
- A common approach has been to view the problem as a substring matching problem, and to use the maximum matching algorithm.
- This formulation of the problem implies the use of a dictionary.
- Difficulties with this approach:
  - Ambiguity: a string can map onto different sequences of words from a dictionary
  - Words not found in the dictionary being used
5. Organization of the presentation
- The Chinese word segmentation problem
- Recasting the problem
- Supervised machine learning approach combining classifiers
  - A maximum entropy tagger
  - A transformation-based tagger
- Experiments
- Conclusion and future work
6. A different approach to solving the segmentation problem
- Word segmentation is difficult because a given Chinese character can occur in different positions within a word, e.g. 产 'produce':
  - by itself: 产
  - on the left: 产品 'product'
  - in the middle: 生产力 'productivity'
  - on the right: 投产 'start production'
- If we could reliably determine the position of each character within its word in a string of words, the problem of word segmentation would be solved.
7. Towards a solution: tag the character position
- Assign a tag to each character in a sentence based on the position of the character within a word (a POC tag):
  - by itself: 产/LR
  - on the left: 产/LL (产品 'product')
  - in the middle: 产/MM (生产力 'productivity')
  - on the right: 产/RR (投产 'start production')
- Ambiguity arises when a character has multiple possible tags: in a typical sentence, some characters are ambiguous between two tags (e.g. LL/LR, RR/LL, LL/RR, RR/LR) while others are unambiguous.
- The task then is to pick the correct tag based on the context, in a manner similar to the part-of-speech tagging problem; the sketch below shows how POC tags are read off a segmented sentence.
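A minimal sketch (not the authors' code; the function names are mine) of deriving POC tags from segmented words, which is also the conversion used to build the training data later in the talk:

    def poc_tags(word):
        """POC tag for each character of one word: LR for a single-character
        word, otherwise LL (left), MM (middle, repeated), RR (right)."""
        if len(word) == 1:
            return ["LR"]
        return ["LL"] + ["MM"] * (len(word) - 2) + ["RR"]

    def segmented_to_poc(words):
        """Flatten a segmented sentence into (character, POC tag) pairs."""
        return [(ch, tag) for w in words for ch, tag in zip(w, poc_tags(w))]

    # The slide's example words 'productivity' and 'start production':
    print(segmented_to_poc(["生产力", "投产"]))
    # [('生', 'LL'), ('产', 'MM'), ('力', 'RR'), ('投', 'LL'), ('产', 'RR')]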
8. The advantages of this reformulation of the problem
- POC tags are easier to manipulate than the N-gram rules used in other machine-learning approaches
- It is easy to take advantage of new advances in POS tagging technology
9. Feasibility
- Chinese characters are distributed in a constrained manner: some characters are not ambiguous, or are not ambiguous in all possible ways.
- Chinese words are generally short, typically fewer than four characters.
10. Contrast with the dictionary-based formulation
- Substring ambiguity is turned into a different type of ambiguity, in which a character can have multiple tags.
- Although new words are common in Chinese, since some word-formation processes are highly productive, new characters are much less common: one is far more likely to see a new combination of characters than a new character.
11. Organization of the presentation
- The Chinese word segmentation problem
- Recasting the problem
- Supervised machine learning approach combining classifiers
  - A maximum entropy tagger
  - A transformation-based tagger
- Experiments
- Conclusion and future work
12. Combining Classifiers for Word Segmentation
- Supervised learning approach to POC tag assignment that combines
  - a maximum entropy tagger with
  - a transformation-based error-driven tagger
- Division of labor:
  - the maximum entropy tagger is the main workhorse
  - the transformation-based tagger cleans up tagging inconsistencies
13. Training procedure
- Convert a manually segmented training corpus into a corpus of tagged characters
- Train a maximum entropy tagger on the tagged corpus
- Train a transformation-based tagger on the output of the maxent tagger, using the manually segmented corpus as a reference
14. Example of POC training data
- A manually segmented sentence:
- ?? ?? ? ?? ?? ?? ???? , ?? ??
- ?? ?? ?? ?? , ?? ?? ?? ?? ?? ??
- ? ?? ?
- The POC-tagged sequence automatically derived from the manual segmentation:
- ?_LL ?_RR ?_LL ?_RR ?_LR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ?_LL ?_MM ?_MM ?_RR ,_LR ?_LL ?_RR ?_LL
- ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ,_LR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR
- ?_LL ?_RR ?_LR ?_LL ?_RR ?_LR
15. Testing procedure
- POC-tag the testing corpus (distinct from the training corpus) with the maximum entropy tagger
- Clean up some tagging inconsistencies with the transformation-based tagger
- Convert the output into a segmented corpus (see the sketch below)
  - Problems arise when the tag sequence is inconsistent, e.g. LL followed by LL
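A minimal sketch of the conversion step. The repair policy for inconsistent sequences (any LL or LR encountered mid-word silently closes the previous word) is an assumption of mine; the talk does not specify the exact policy.

    def poc_to_words(tagged):
        """Convert (character, POC tag) pairs back into a word list."""
        words, current = [], ""
        for ch, tag in tagged:
            if tag in ("LL", "LR") and current:
                words.append(current)   # close a word left dangling by an
                current = ""            # inconsistent LL/MM before this tag
            current += ch
            if tag in ("RR", "LR"):
                words.append(current)
                current = ""
        if current:                     # flush a word left open at the end
            words.append(current)
        return words

    # An inconsistent sequence such as LL,LL still yields a segmentation:
    print(poc_to_words([("生", "LL"), ("产", "LL"), ("力", "RR")]))
    # ['生', '产力']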
16. Example of tagging by the MaxEnt tagger
- The same example tagged by the maximum entropy tagger (note the tagging inconsistency):
- ?_LL ?_RR ?_LL ?_RR ?_LR ?_LL ?_RR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LR
- ?_LL ,_LR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ,_LR ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LR
- ?_LL ?_RR ?_LR
17. Example after transformations
- The tagging inconsistency is fixed, but the tagging is still wrong in this case:
- ?_LL ?_RR ?_LL ?_RR ?_LR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ?_LR ?_LR ,_LR ?_LL ?_RR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ?_LL ?_RR ,_LR ?_LL ?_RR
- ?_LL ?_RR ?_LL ?_RR ?_LL ?_RR ?_LL
- ?_RR ?_LL ?_RR ?_LR ?_LL ?_RR
18. Training the maximum entropy tagger (1)
- Contextual information that is useful in predicting the tag of a character is encoded with features.
- Examples:
  - If the current character is Wi, then it should be tagged Ti
  - If the previous character is tagged Ti-1, then the current character should be tagged Ti
  - If the previous character is Wi-1 and the current character is Wi, then the current character should be tagged Ti
- The features are instantiations of a pre-defined set of feature templates.
19. Training the maximum entropy tagger (2)
- Feature templates used for this tagger:
  - The current character
  - The previous (next) character and the current character
  - The previous (next) two characters
  - The previous character and the next character
  - The tag of the previous character
  - The tag of the character two before the current character
- The maxent training process effectively assigns a 'weight' to each feature; the weight indicates how effective that feature is in predicting how the character should be tagged (see the sketch below).
20. Training the transformation-based tagger (Brill 1995)
- The tagger learns a ranked set of rules instantiated from a pre-defined set of rule templates (a toy sketch of the learning loop follows below).
- After training, the learned rules are applied to the input during testing.
- A sampling of the rule templates used to learn rules. Change tag a to tag b when:
  - the preceding (following) character is tagged z
  - the preceding character is tagged z and the following character is tagged w
  - the preceding (following) character is c
  - one of the two preceding (following) characters is c
  - the current character is c and the preceding (following) character is tagged z
- where a, b, z and w are variables over the set of four tags (LL, RR, LR, MM) and c is a variable over characters
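A toy sketch of the error-driven learning loop, assuming only the simplest template ("change a to b when the next tag is z"); the rule space and scoring are simplified relative to Brill (1995).

    from itertools import product

    TAGS = ["LL", "RR", "LR", "MM"]

    def apply_rule(tags, rule):
        """Change tag `frm` to `to` wherever the next tag is `trig`."""
        frm, to, trig = rule
        out = list(tags)
        for i in range(len(out) - 1):
            if out[i] == frm and out[i + 1] == trig:
                out[i] = to
        return out

    def learn(current, reference, n_rules=5):
        """Greedily pick the rule that most reduces errors, apply, repeat."""
        learned = []
        for _ in range(n_rules):
            base = sum(a == b for a, b in zip(current, reference))
            best, best_gain = None, 0
            for rule in product(TAGS, TAGS, TAGS):
                if rule[0] == rule[1]:
                    continue
                gain = sum(a == b for a, b in
                           zip(apply_rule(current, rule), reference)) - base
                # keep the rule with the highest strictly positive gain
                if gain > best_gain:
                    best, best_gain = rule, gain
            if best is None:            # no rule improves accuracy: stop
                break
            current = apply_rule(current, best)
            learned.append(best)
        return learned

    # maxent output vs. gold tags for a toy three-character word:
    print(learn(["LL", "LL", "RR"], ["LL", "MM", "RR"]))
    # [('LL', 'MM', 'RR')]  i.e. change LL to MM when the next tag is RR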
21. Top five transformations learned
- RR → MM when the next tag is RR (NEXTTAG RR), e.g. RR RR → MM RR
- LL → LR when the next tag is LL (NEXTTAG LL), e.g. LL LL → LR LR
- LL → LR when the next tag is LR (NEXTTAG LR)
- MM → RR when the next two tags are LR LR (NEXTBIGRAM LR LR)
- RR → LR when the previous two tags are RR LR (PREVBIGRAM RR LR), e.g. RR LR RR → RR LR LR
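A sketch of how such a ranked rule list is applied at test time; the encoding of the five rules above is mine.

    def next_tag(tags, i):
        return tags[i + 1] if i + 1 < len(tags) else None

    RULES = [
        ("RR", "MM", lambda t, i: next_tag(t, i) == "RR"),
        ("LL", "LR", lambda t, i: next_tag(t, i) == "LL"),
        ("LL", "LR", lambda t, i: next_tag(t, i) == "LR"),
        ("MM", "RR", lambda t, i: t[i + 1:i + 3] == ["LR", "LR"]),
        ("RR", "LR", lambda t, i: t[max(0, i - 2):i] == ["RR", "LR"]),
    ]

    def apply_rules(tags):
        tags = list(tags)
        for frm, to, ctx in RULES:        # rules fire in learned rank order,
            for i in range(len(tags)):    # each sweeping left to right
                if tags[i] == frm and ctx(tags, i):
                    tags[i] = to
        return tags

    # The slide's last example: rule 5 fires on the final RR.
    print(apply_rules(["RR", "LR", "RR"]))  # ['RR', 'LR', 'LR']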
22. Organization of the presentation
- The Chinese word segmentation problem
- Recasting the problem
- Supervised machine learning approach combining classifiers
  - A maximum entropy tagger
  - A transformation-based tagger
- Experiments
- Conclusion and future work
23. Three experiments
- Training corpus size: 238K words (405K hanzi)
- Testing corpus size: 13K words (22K hanzi)
24. Experiment I
- Two sub-experiments (a sketch of the algorithm follows below):
  - A forward maximum-matching algorithm with a dictionary compiled from the training corpus; 497 new words in the testing corpus (3.95%)
  - The same algorithm with a dictionary compiled from both the training and testing corpora; no new words
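For reference, a minimal sketch of forward maximum matching: at each position, take the longest dictionary word that matches, falling back to a single character. The 4-character window follows the earlier observation that words are generally short; the tiny dictionary is illustrative, not the one compiled from the training corpus.

    def max_match(text, dictionary, max_len=4):
        words, i = [], 0
        while i < len(text):
            for n in range(min(max_len, len(text) - i), 1, -1):
                if text[i:i + n] in dictionary:   # longest match wins
                    words.append(text[i:i + n])
                    i += n
                    break
            else:                                 # no multi-char match:
                words.append(text[i])             # emit a single character
                i += 1
        return words

    print(max_match("生产力", {"生产", "生产力", "产品"}))  # ['生产力']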
25. Experiment II
- The maximum entropy tagger only
26. Experiment III
- Tag the testing corpus with the maximum entropy tagger
- Clean up the output with the transformation-based tagger
27. Results

  tagger(s)  tagging accuracy (%)    segmentation accuracy (%)
             training   testing      p        r        f
  1a         n/a        n/a          87.34    92.34    89.77
  1b         n/a        n/a          94.51    95.80    95.15
  2          97.55      95.95        94.90    94.88    94.89
  3          97.81      96.07        95.21    95.13    95.17

- 1a: maximum matching algorithm applied to testing data with new words
- 1b: maximum matching algorithm applied to testing data without new words
- 2: maximum entropy tagger
- 3: maximum entropy tagger combined with the transformation-based tagger
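The p/r/f columns are standard word-level precision, recall, and F-score. A sketch of how they can be computed by comparing word spans (the talk does not spell out its scorer); the example reuses the two segmentations of the Sproat sentence from slide 3:

    def spans(words):
        """Map a word list to the set of (start, end) character offsets."""
        out, i = set(), 0
        for w in words:
            out.add((i, i + len(w)))
            i += len(w)
        return out

    def prf(gold_words, test_words):
        gold, test = spans(gold_words), spans(test_words)
        correct = len(gold & test)          # words with both boundaries right
        p = correct / len(test)
        r = correct / len(gold)
        return p, r, 2 * p * r / (p + r)

    print(prf(["日文", "章鱼", "怎么", "说"],
              ["日", "文章", "鱼", "怎么", "说"]))
    # ≈ (0.40, 0.50, 0.44)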
28. Conclusions
- The maximum entropy model is effective when applied to Chinese word segmentation.
- The transformation-based tagger improves tagging accuracy by only 0.12%, but it does clean up some tagging inconsistencies: tagging errors are reduced by 3.0%, and segmentation accuracy improves by 0.28% (F-score).
29. Residual issues
- If we used only two tags, one for characters that start a word and one for those that do not, there would be no tagging inconsistency; but segmentation accuracy drops to about 94.3%.
- Will more training data help? We will try the segmenter on the Rocling corpus.
30. Thank You
31. The Chinese word segmentation problem
- ?????
- ? ?? ? ??
- ?????
- ?? ? ? ??
- ?? ? ? ? ??
32. New Results

  tagger(s)  segmentation accuracy (%)
             p        r        f
  1a         87.34    92.34    89.77
  1b         94.51    95.80    95.15
  2          94.90    94.88    94.89
  3          95.21    95.13    95.17