Title: Chinese Named Entity Recognition with Multiple Features
1Chinese Named Entity Recognition with Multiple
Features
HLT/EMNLP 2005, Vancouver, B.C., Canada, October
6-8, 2005
- Youzheng Wu, Jun Zhao, Bo Xu, Hao Yu
- Institute of Automation, Chinese Academy of Sciences
- Fujitsu R&D Center Co., Ltd.
- yzwu, jzhao, bxu_at_nlpr.ia.ac.cn
- yu_at_frdc.fujitsu.com.cn
2Outline
- Introduction
- Related Work
- Chinese NER with Multiple Features
- The Hybrid Model (Word Model and POS Model)
- Word Model (Word Entity Model and Word Context Model)
- POS Model (POS Entity Model and POS Context Model)
- Heuristic Human Knowledge
- Experiments
- Conclusion
3Introduction
- Named Entity Recognition (NER) is one of the key techniques in IE, QA, parsing, metadata tagging, etc.
- Our focus is the definition of NEs by PKU.
- PN is divided into 5 sub-classes: Chinese PN, Japanese PN, Russian PN, Euramerican PN and Abbreviated PN.
- LN is divided into 2 sub-classes: CLN and ALN.
- ORG, TIM, NUM.
4Differences Between ENE and CNE Recognition
- The main differences between Chinese NE recognition and English NE recognition include:
- Unlike English, Chinese lacks the capitalization information that plays a very important role in identifying named entities.
- There are no spaces between words in Chinese, so the text has to be segmented before NER. Consequently, errors in word segmentation affect the result of NER.
5Related Work
- Approaches to NER focus on machine learning
- CoTraining, CoBoosting
- Hidden Markov Model
- Maximum entropy model
- Transformation-based learning
- Typical Systems for Chinese NER
- Hsin-Hsi Chen, et al. 1997
- Yu, et al. 1998
- CHUA, et al. 2000
- Sun, et al. 2002
6Chinese NER with Multiple Features
- Features of Chinese NE
- Chinese NEs have very distinct word features in their composition and contextual information.
- Data sparseness is very serious when only word features are used.
- Three Basic Ideas
- Combine the coarse particle feature (POS Model) with the fine particle feature (Word Model).
- Introduce heuristic human knowledge into the statistical model.
- Use separate sub-models to describe Japanese, Russian and Euramerican transliterated person names and Chinese person names.
7The Hybrid Model
- The task of NER
- Given a word/POS sequence, find the optimal sequence WC/TC by splitting, combining and classifying the given sequence.
- We could obtain the optimal sequence WC/TC through the following three models:
- the Word Model
- the POS Model
- the Hybrid Model
8Word Model and POS Model
- Word Model
- Estimates the probability of generating an NE from the viewpoint of the word sequence
- POS Model
- Estimates the probability of generating an NE from the viewpoint of the POS sequence
9The Hybrid Model
- Combines Word Model with POS Model
- where the factor a > 0 balances the Word Model and the POS Model.
- The Hybrid Model consists of four sub-models: word context model P(WC), POS context model P(TC), word entity model P(W|WC) and POS entity model P(T|TC).
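The combination of the four sub-models can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the balancing factor a acts as an exponent on the POS Model term, so that in log space the score is the Word Model term plus a times the POS Model term. The function names and the toy probabilities are hypothetical.

```python
import math


def hybrid_log_score(p_wc, p_w_given_wc, p_tc, p_t_given_tc, a):
    """Log-score of one candidate analysis (WC, TC).

    Assumed form (not confirmed by the slides):
        P(WC) * P(W|WC) * [P(TC) * P(T|TC)] ** a
    """
    word_model = math.log(p_wc) + math.log(p_w_given_wc)  # word context + word entity
    pos_model = math.log(p_tc) + math.log(p_t_given_tc)   # POS context + POS entity
    return word_model + a * pos_model


def best_candidate(candidates, a):
    """Return the name of the candidate with the highest hybrid score."""
    return max(candidates, key=lambda name: hybrid_log_score(*candidates[name], a))


# Toy probabilities, invented purely for illustration:
# (P(WC), P(W|WC), P(TC), P(T|TC)) for two competing analyses.
candidates = {
    "as_person": (0.02, 0.001, 0.10, 0.05),
    "as_location": (0.01, 0.004, 0.05, 0.02),
}

print(best_candidate(candidates, a=0.0))  # only the Word Model term counts
print(best_candidate(candidates, a=2.8))  # the POS Model term dominates
```

Under this assumed form, a larger a gives the POS Model more weight, and the limiting case a = 0 leaves only the Word Model.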
10Word Definition in Word Model
11POS Definition in POS Model
- PKU POS specification is adopted in our system.
- The size of POS set is 48.
12Context Model
- Word Context Model
- POS Context Model
13Entity Model
- Class Definition in Entity Model
14Word Entity Model
- Word Entity Model for PN
- Word Entity Model for LN and ON
- Word Entity Model for ALN
15POS Entity Model
- POS Entity Model for PN
- Use the Word Entity Model for PN in place of the POS Entity Model for PN.
- POS Entity Model for LN and ON
- POS Entity Model for ALN
16Training Corpus for Context Model
- Five months of the People's Daily, tagged with NER tags
17Training Corpus for Entity Model
- Chinese PN 15.6 million
- Japanese PN 0.15 million
- Euramerican PN 0.4 million
- Russian PN 0.44 million
18Heuristic Human Knowledge for PN
- Chinese PN surname list (including 476 items)
- Japanese PN surname list (including 9189 items)
- Russian PN character lists
- Euramerican PN character lists
- NE Length restriction
19Heuristic Human Knowledge for LN
- Location keyword list (including 607 items)
- General word list (such as verbs and prepositions). Words in the list are usually followed by a location name, such as "?/at", "?/go".
- ALN name list (including 407 items)
20Heuristic Human Knowledge for ON
- Organization keyword list (including 3129 items)
- An organization name template list, used to recognize the ONs missed by the statistical model. Some of these templates are as follows.
- ON -- LN D OrgKeyWord
- ON -- PN D OrgKeyWord
- ON -- ON OrgKeyWord
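A template list of this kind could be applied as a post-processing pass over the tagged output. The sketch below is only an illustration under stated assumptions: each template symbol (including `D`, whose meaning the slides do not explain) is treated as a literal tag, matching is a single left-to-right pass, and the function name, the `O` tag and the English placeholder tokens are hypothetical.

```python
# Right-hand sides of the three templates listed on the slide.
TEMPLATES = [
    ("LN", "D", "OrgKeyWord"),
    ("PN", "D", "OrgKeyWord"),
    ("ON", "OrgKeyWord"),
]


def apply_on_templates(tokens):
    """Merge every span whose tag sequence matches a template into one ON.

    tokens: list of (text, tag) pairs. Returns a new list of pairs.
    """
    result = []
    i = 0
    while i < len(tokens):
        for template in TEMPLATES:
            span = tokens[i:i + len(template)]
            if tuple(tag for _, tag in span) == template:
                # Joining with a space is an illustration choice for the
                # English placeholders; Chinese text would be concatenated.
                result.append((" ".join(text for text, _ in span), "ON"))
                i += len(template)
                break
        else:
            # No template starts at position i: keep the token unchanged.
            result.append(tokens[i])
            i += 1
    return result


# Placeholder tokens; "x" stands in for whatever the D symbol covers.
tagged = [("visited", "O"), ("Tokyo", "LN"), ("x", "D"),
          ("Institute", "OrgKeyWord"), ("today", "O")]
print(apply_on_templates(tagged))
```

A real system would more likely match these templates against candidates inside the search rather than as a separate pass; the slides do not say which.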
21Back-off Model to Smooth
- Escape probability to smooth the statistical
model
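The slide does not say how the escape probability is estimated. As a minimal sketch, the code below assumes a Witten-Bell style estimate (number of distinct followers divided by total followers plus distinct followers) and backs off from a bigram to a unigram distribution. The function names and the toy token sequence are hypothetical.

```python
from collections import Counter, defaultdict


def train(tokens):
    """Collect unigram counts and bigram follower counts."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[prev][cur] += 1
    return unigrams, bigrams


def smoothed_prob(cur, prev, unigrams, bigrams):
    """P(cur | prev), interpolated with the unigram model.

    Assumed escape probability (Witten-Bell style, not from the slides):
        escape = T / (N + T)
    where N is the number of tokens seen after prev and T is the number
    of distinct tokens seen after prev.
    """
    p_unigram = unigrams[cur] / sum(unigrams.values())
    followers = bigrams.get(prev)
    if not followers:
        return p_unigram  # unseen history: back off completely
    n = sum(followers.values())
    t = len(followers)
    escape = t / (n + t)
    return (1 - escape) * followers[cur] / n + escape * p_unigram


unigrams, bigrams = train("a b a b a c".split())
print(smoothed_prob("a", "a", unigrams, bigrams))  # unseen bigram, still non-zero
```

The escape mass is what lets an unseen bigram such as ("a", "a") receive a non-zero probability from the lower-order model.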
22Experiments
- Three Experiments
- Will the Hybrid Model be more effective than the Word Model and the POS Model?
- Will the conclusions from different testing sets be consistent?
- Will the performance improve significantly after heuristic human knowledge is combined?
- Metrics for evaluation
- Precision, Recall and F-Measure.
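The three metrics can be computed as sketched below. The sketch assumes exact-match scoring, where an entity counts as correct only if its boundaries and type both agree with the gold standard, and it assumes the balanced F-measure (F1); the slides do not state either choice. The span tuples are invented for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and balanced F-measure.

    gold, predicted: iterables of (start, end, entity_type) tuples.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: one of two predictions is right, three gold entities.
gold = [(0, 2, "PN"), (5, 7, "LN"), (9, 12, "ON")]
predicted = [(0, 2, "PN"), (5, 7, "ON")]
p, r, f = precision_recall_f1(gold, predicted)
print(p, r, f)
```

In this example the second prediction has the right boundaries but the wrong type, so it counts against both precision and recall.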
23Experiment I
- a in the Hybrid Model denotes the balancing factor of the Word Model and the POS Model.
- The larger a is, the larger the contribution of the POS Model. The smaller a is, the larger the contribution of the Word Model.
- The best value of a is sought on a testing corpus of one month of the People's Daily.
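The search for a can be pictured as a simple grid search. The sketch below assumes a hypothetical `evaluate` callback that runs the recognizer with a given a and returns its F-measure on the tuning corpus. The grid and the toy scoring curve are invented and do not reproduce the reported figures.

```python
def tune_balancing_factor(evaluate, grid):
    """Return the value of a in grid with the highest evaluate(a)."""
    return max(grid, key=evaluate)


def toy_f_measure(a):
    """Stand-in for a real evaluation run (invented curve)."""
    return 0.9 - 0.01 * (a - 1.5) ** 2


grid = [i * 0.5 for i in range(9)]  # 0.0, 0.5, ..., 4.0
best_a = tune_balancing_factor(toy_f_measure, grid)
print(best_a)
```

In practice each call to `evaluate` is a full decoding run, so a coarse grid is a reasonable starting point.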
24Experiment I (cont.)
- Performance of Recognizing PNs Impacted by a
25Experiment I (cont.)
- Performance of Recognizing LNs Impacted by a
26Experiment I (cont.)
- Performance of Recognizing ONs Impacted by a
27Experiment I (cont.)
- Choose the best value a = 2.8 after comparing the figures.
28Experiment I (cont.)
- Conclusion 1
- The Hybrid Model can improve the performance of both the Word Model and the POS Model.
- The improvements for PN, LN and ON are different. That is, the POS Model has an obvious side effect on the recall of ON recognition at all times, while the recalls for PN and LN recognition improve at first but decrease later as a increases.
29Experiment II
- Experiments on the MET-2 testing corpus to validate the conclusion from Experiment I.
- This table validates that the Hybrid Model is better than both the Word Model and the POS Model.
30Experiment II (cont.)
- Conclusion 2
- Our algorithm is consistent across different testing sets, i.e. the Hybrid Model, which combines the Word Model with the POS Model, achieves better performance than either the Word Model or the POS Model on different testing sets.
31Experiment III
- To validate the idea that heuristic human knowledge can not only reduce the search space, but also improve the performance.
Model I: Statistical Model without knowledge
Model II: Statistical Model with knowledge
32Experiment III
- Conclusion 3
- From this experiment, we learn that human knowledge can not only reduce the search space, but also significantly improve the performance of the pure statistical model.
33Conclusion
- Propose a hybrid Chinese NER model which combines multiple features
- The main contributions are as follows
- The proposed Hybrid Model emphasizes integrating the coarse particle feature (POS Model) with the fine particle feature (Word Model), so that each can make up for the disadvantages of the other.
- In order to reduce the search space and improve the efficiency of the model, we incorporate heuristic human knowledge into the statistical model, which could increase the performance of NER significantly.
- For capturing intrinsic features in different types of entities, we design several sub-models for different entities. In particular, we divide transliterated person names into three sub-classes according to their character sets, that is, CPN, JPN, RPN and EPN.
34 35Compare with Sun, et al. 2002(2)