Chinese Named Entity Recognition with Multiple Features

1
Chinese Named Entity Recognition with Multiple
Features
HLT/EMNLP 2005, Vancouver, B.C., Canada, October
6-8, 2005

Youzheng Wu, Jun Zhao, Bo Xu, Hao Yu
Institute of Automation, Chinese Academy of
Sciences
Fujitsu R&D Center Co., Ltd.
{yzwu, jzhao, bxu}@nlpr.ia.ac.cn
yu@frdc.fujitsu.com.cn

2
Outline

Introduction
Related Work
Chinese NER with Multiple Features
The Hybrid Model (Word Model + POS Model)
Word Model (Word Entity Model + Word Context
Model)
POS Model (POS Entity Model + POS Context Model)
Heuristic Human Knowledge
Experiments
Conclusion

3
Introduction

Named Entity Recognition (NER) is one of the key
techniques in IE, QA, parsing, metadata tagging,
etc.
Our focus is the definition of NEs given by PKU.
Person names (PN) are divided into 5 sub-classes:
Chinese PN, Japanese PN, Russian PN, Euramerican
PN and abbreviated PN.
Location names (LN) are divided into 2
sub-classes: CLN and ALN.
The remaining classes are ORG, TIM and NUM.

4
Differences Between ENE and CNE Recognition

The main differences between Chinese NE
recognition and English NE recognition are:
Unlike English, Chinese lacks capitalization
information, which plays a very important role in
identifying named entities.
There are no spaces between words in Chinese, so
the text has to be segmented before NER.
Consequently, errors in word segmentation
affect the result of NER.

5
Related Work

Approaches to NER focus on machine learning:
Co-training, CoBoosting
Hidden Markov model
Maximum entropy model
Transformation-based learning
Typical systems for Chinese NER:
Hsin-Hsi Chen, et al. 1997
Yu, et al. 1998
Chua, et al. 2000
Sun, et al. 2002

6
Chinese NER with Multiple Features

Features of Chinese NEs
Chinese NEs have very distinct word features in
their composition and contextual information.
Data sparseness is very serious when only word
features are used.
Three Basic Ideas
Combine a coarse-grained feature (POS Model) with
a fine-grained feature (Word Model).
Introduce heuristic human knowledge into the
statistical model.
Use separate sub-models to describe Japanese,
Russian and Euramerican transliterated person
names and Chinese person names.

7
The Hybrid Model

The task of NER
Given a word/POS sequence, find the optimal
sequence WC/TC by splitting, combining and
classifying the given sequence.
The optimal sequence WC/TC can be obtained
through any of the following three models:
the Word Model
the POS Model
the Hybrid Model

8
Word Model and POS Model

Word Model
Estimates the probability of generating an NE
from the viewpoint of the word sequence.
POS Model
Estimates the probability of generating an NE
from the viewpoint of the POS sequence.

9
The Hybrid Model

Combines the Word Model with the POS Model,
where the factor a > 0 balances the Word Model
and the POS Model.
The Hybrid Model consists of four sub-models:
the word context model P(WC), the POS context
model P(TC), the word entity model P(W|WC) and
the POS entity model P(T|TC).
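
The equation on this slide did not survive in the transcript. One form that is consistent with the four sub-models named above and with a single balancing factor a is sketched below. Reading the entity models as the conditionals P(W | WC) and P(T | TC), and placing a as an exponent on the POS Model, are both assumptions rather than something the slide states.

```latex
(WC^{*}, TC^{*}) \;=\; \arg\max_{WC,\,TC}\;
  \underbrace{P(WC)\,P(W \mid WC)}_{\text{Word Model}}
  \;\cdot\;
  \Bigl[\underbrace{P(TC)\,P(T \mid TC)}_{\text{POS Model}}\Bigr]^{a}
```

Under this reading, a close to 0 leaves only the Word Model, and a larger a gives the POS Model more weight.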

10
Word Definition in Word Model

11
POS Definition in POS Model

The PKU POS specification is adopted in our
system. The POS set contains 48 tags.

12
Context Model

Word Context Model
POS Context Model

13
Entity Model

Class Definition in Entity Model

14
Word Entity Model

Word Entity Model for PN
Word Entity Model for LN and ON
Word Entity Model for ALN

15
POS Entity Model

POS Entity Model for PN
The Word Entity Model for PN is used in place of
a POS Entity Model for PN.
POS Entity Model for LN and ON
POS Entity Model for ALN

16
Training Corpus for Context Model

Five months of the People's Daily, tagged with NE
tags

17
Training Corpus for Entity Model

Chinese PN: 15.6 million
Japanese PN: 0.15 million
Euramerican PN: 0.4 million
Russian PN: 0.44 million

18
Heuristic Human Knowledge for PN

Chinese PN surname list (including 476 items)
Japanese PN surname list (including 9189 items)
Russian PN character lists
Euramerican PN character lists
NE length restriction
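
As a rough illustration of how a surname list and a length restriction can shrink the search space, the sketch below proposes Chinese PN candidates only where a listed surname starts. The four-entry surname set, the limit of two given-name characters and the function name are all hypothetical; the slide gives neither the list contents nor the actual length limit.

```python
# Hypothetical miniature surname list; the list on the slide has 476 items.
CHINESE_SURNAMES = {"王", "李", "张", "欧阳"}
# Assumed length restriction: at most two given-name characters.
MAX_GIVEN_NAME_LEN = 2


def chinese_pn_candidates(text):
    """Return sorted (start, end) spans that begin with a listed surname
    and continue with 1..MAX_GIVEN_NAME_LEN further characters."""
    spans = []
    for start in range(len(text)):
        for surname in CHINESE_SURNAMES:
            if not text.startswith(surname, start):
                continue
            given_start = start + len(surname)
            for given_len in range(1, MAX_GIVEN_NAME_LEN + 1):
                end = given_start + given_len
                if end <= len(text):
                    spans.append((start, end))
    return sorted(spans)


if __name__ == "__main__":
    sentence = "王小明去北京"
    for start, end in chinese_pn_candidates(sentence):
        print(sentence[start:end])
```

Only the spans returned here would need to be scored by the statistical model; every other position in the sentence is skipped for the Chinese PN class.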

19
Heuristic Human Knowledge for LN

Location keyword list (including 607 items)
General word list (such as verbs and
prepositions). Words in this list are usually
followed by a location name, such as the Chinese
words for "at" and "go".
ALN name list (including 407 items)

20
Heuristic Human Knowledge for ON

Organization keyword list (including 3129 items)
An organization name template list, used to
recognize ONs missed by the statistical model.
Some of these templates are as follows.
ON -> LN D OrgKeyWord
ON -> PN D OrgKeyWord
ON -> ON OrgKeyWord
21
Back-off Model for Smoothing

An escape probability is used to smooth the
statistical model.
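
The slide does not say how the escape probability is estimated. As an illustration only, the sketch below uses a Witten-Bell-style estimate (distinct followers divided by total followers plus distinct followers) in a simple bigram-to-unigram back-off. That estimate is swapped in here as an assumption, and the function names are hypothetical. Without exclusion of already-seen words, the backed-off probabilities sum to at most one rather than exactly one.

```python
from collections import Counter, defaultdict


def train(bigrams):
    """Count (prev, word) pairs and word unigrams."""
    pair_counts = defaultdict(Counter)
    unigram_counts = Counter()
    for prev, word in bigrams:
        pair_counts[prev][word] += 1
        unigram_counts[word] += 1
    return pair_counts, unigram_counts


def backoff_prob(word, prev, pair_counts, unigram_counts):
    """P(word | prev) with back-off to the unigram model."""
    total = sum(unigram_counts.values())
    p_lower = unigram_counts[word] / total
    followers = pair_counts.get(prev)
    if not followers:
        # Unseen history: use the lower-order model directly.
        return p_lower
    n = sum(followers.values())   # tokens seen after prev
    t = len(followers)            # distinct types seen after prev
    escape = t / (n + t)          # assumed Witten-Bell-style escape
    if word in followers:
        return (1 - escape) * followers[word] / n
    # Unseen event: pay the escape probability, then back off.
    return escape * p_lower
```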

22
Experiments

Three Experiments
Will the Hybrid Model be more effective than the
Word Model and the POS Model?
Will the conclusion from different testing sets
be consistent?
Will the performance be improved significantly
after combining heuristic human knowledge?
Evaluation metrics
Precision, Recall and F-Measure.
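
For reference, a minimal sketch of the three metrics over sets of entities. It assumes exact matching of boundaries and type, and the balanced F-measure (equal weight on precision and recall); the slide does not state which matching criterion or F weighting was used.

```python
def precision_recall_f1(gold, predicted):
    """gold, predicted: sets of (start, end, entity_type) tuples."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {(0, 3, "PN"), (4, 6, "LN"), (7, 10, "ON")}
    predicted = {(0, 3, "PN"), (4, 6, "ON")}
    print(precision_recall_f1(gold, predicted))
```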

23
Experiment I

a in the Hybrid Model denotes the balancing
factor between the Word Model and the POS Model.
The larger a is, the larger the contribution of
the POS Model; the smaller a is, the larger the
contribution of the Word Model.
The best value of a is found on a testing corpus
of one month of the People's Daily.
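
A minimal sketch of how such a balancing factor could act and be tuned. It assumes the hybrid score is the Word Model log-probability plus a times the POS Model log-probability, which matches the described behaviour (a larger a gives the POS Model more weight) but is not confirmed by the slide. The function names and the candidate tuples are hypothetical.

```python
def hybrid_score(word_logprob, pos_logprob, a):
    """Assumed log-domain combination of the two models."""
    return word_logprob + a * pos_logprob


def best_candidate(candidates, a):
    """candidates: list of (label, word_logprob, pos_logprob)."""
    return max(candidates, key=lambda c: hybrid_score(c[1], c[2], a))[0]


def sweep(a_values, evaluate):
    """Return the a with the highest evaluate(a).

    evaluate would typically run the recognizer with that a and
    return the F-measure on the tuning corpus.
    """
    return max(a_values, key=evaluate)


if __name__ == "__main__":
    # Candidate A is preferred by the Word Model, B by the POS Model.
    candidates = [("A", -2.0, -9.0), ("B", -3.0, -4.0)]
    for a in (0.0, 1.0):
        print(a, best_candidate(candidates, a))
```

With a = 0 only the Word Model decides; as a grows, the POS Model's preference takes over.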

24
Experiment I (cont.)

Impact of a on PN Recognition Performance

25
Experiment I (cont.)

Impact of a on LN Recognition Performance

26
Experiment I (cont.)

Impact of a on ON Recognition Performance

27
Experiment I (cont.)

After comparing the figures, we choose a = 2.8 as
the best value.

28
Experiment I (cont.)

Conclusion 1
The Hybrid Model improves on the performance of
both the Word Model and the POS Model.
The improvements for PN, LN and ON differ.
That is, the POS Model has an obvious negative
effect on the recall of ON recognition throughout,
while the recall for PN and LN recognition
improves at first and then decreases as a
increases.

29
Experiment II

Experiments on the MET-2 testing corpus to
validate the conclusion from Experiment I.
The results confirm that the Hybrid Model is
better than both the Word Model and the POS
Model.

30
Experiment II (cont.)

Conclusion 2
Our algorithm is consistent across testing sets,
i.e. the Hybrid Model, which combines the Word
Model with the POS Model, achieves better
performance than either the Word Model or the POS
Model on different testing sets.

31
Experiment III

To validate the idea that heuristic human
knowledge can not only reduce the search space,
but also improve the performance.

Model I: statistical model without knowledge
Model II: statistical model with knowledge

32
Experiment III (cont.)

Conclusion 3
From this experiment, we learn that human
knowledge can not only reduce the search space,
but also significantly improve the performance of
the pure statistical model.

33
Conclusion

We propose a hybrid Chinese NER model that
combines multiple features.
The main contributions are as follows:
The proposed Hybrid Model integrates a
coarse-grained feature (POS Model) with a
fine-grained feature (Word Model), so that each
compensates for the weaknesses of the other.
To reduce the search space and improve the
efficiency of the model, we incorporate heuristic
human knowledge into the statistical model, which
significantly increases NER performance.
To capture the intrinsic features of different
types of entities, we design separate sub-models
for them. In particular, we model Chinese person
names (CPN) separately and divide transliterated
person names into three sub-classes according to
their character sets: JPN, RPN and EPN.