Title: Chinese Term Extraction Based on Delimiters
1. Chinese Term Extraction Based on Delimiters
- Yuhang Yang, Qin Lu, Tiejun Zhao
- School of Computer Science and Technology, Harbin Institute of Technology
- Department of Computing, The Hong Kong Polytechnic University
- May 2008
2. Outline
- Introduction
- Related Works
- Methodology
- Experiment and Discussion
- Conclusion
3. Basic Concepts
- Terms (terminology): lexical units carrying the most fundamental knowledge of a domain
- Term extraction
  - Term candidate extraction, measured by unithood
  - Terminology verification, measured by termhood
4. Major Problems
- Term boundary identification is based on term features
  - Fewer features are not enough; more features lead to more conflicts
- Limitations in scope
  - Low-frequency terms
  - Long compound terms
  - Dependency on Chinese segmentation
5. Main Idea
- Delimiter-based term candidate extraction: identify the relatively stable, domain-independent words immediately before and after terms
  - Example: "A scanning tunneling microscope is a kind of high angular-resolution microscope based on the quantum tunnelling effect."
  - Example: "The socialist system is the basic system of the People's Republic of China."
- Potential advantages of the proposed approach
  - No strict limits on frequency or word length
  - No need for full segmentation
  - Relatively domain independent
6. Related Works: Statistic-Based Measures
- Internal measures (Schone and Jurafsky, 2001)
  - Internal association between the constituents of a candidate, such as
    - Frequency
    - Mutual information
- Contextual measures
  - Dependency of a candidate on its context
    - Left/right entropy (Sornlertlamvanich et al., 2000)
    - Left/right context dependency (Chien, 1999)
    - Accessor variety criteria (Feng et al., 2004)
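As an illustration of a contextual measure, here is a minimal Python sketch of left/right entropy in the spirit of Sornlertlamvanich et al. (2000), not their exact formulation: a candidate that occurs in many different contexts has high entropy on both sides, suggesting it is a complete unit.

```python
from collections import Counter
from math import log2

def side_entropy(corpus: str, candidate: str, side: str = "left") -> float:
    """Entropy of the characters immediately adjacent to `candidate`.

    High entropy on both sides means the candidate appears in many
    different contexts, which supports treating it as a complete unit;
    low entropy suggests it is a fragment of a longer unit.
    """
    neighbours = Counter()
    start = corpus.find(candidate)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(candidate)
        if 0 <= idx < len(corpus):
            neighbours[corpus[idx]] += 1
        start = corpus.find(candidate, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in neighbours.values())
```

For instance, a candidate preceded by three different characters, once each, scores log2(3) on the left side.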
7. Hybrid Approaches
- The UnitRate algorithm (Chen et al., 2006)
  - Occurrence probability combined with marginal variety probability
- The TCE_SEFCV algorithm (Ji et al., 2007)
  - Significance estimation function combined with the C-value measure
- Limitations
  - Data sparseness for low-frequency terms and long terms
  - Cascading errors from full segmentation
8. Observations
- Sentences are constituted of substantives and functional words
- Domain-specific terms (terms for short) are more likely to be domain substantives
- The predecessors and successors of terms are more likely to be functional words or general substantives connecting terms
- Predecessors and successors are therefore markers of terms, referred to as term delimiters (or simply delimiters)
9. Delimiter-Based Term Extraction
- Characteristics of delimiters
  - Mainly functional words and general substantives
  - Relatively stable
  - Domain independent
  - Can be extracted more easily
- Proposed model
  - Identify the features of delimiters
  - Identify terms by finding their predecessors and successors as boundary words
10. Algorithm Design
- TCE_DI (Term Candidate Extraction by Delimiter Identification)
- Input: Corpusextract (a domain corpus) and DList (a delimiter list)
- (1) Partition Corpusextract into character strings at punctuation marks.
- (2) Partition the character strings at delimiters to obtain term candidates.
  - If a string contains no delimiter, the whole string is regarded as a term candidate.
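The two steps above can be sketched in a few lines of Python; the delimiter set and the toy English sentence below are illustrative stand-ins for the paper's Chinese DList and corpus.

```python
import re

# Punctuation marks used for the first-level split (step 1); this set
# is illustrative, not the paper's exact inventory.
PUNCT = r"[,.;:!?，。；：！？]"

def tce_di(corpus: str, dlist: set[str]) -> list[str]:
    """TCE_DI sketch: split at punctuation, then at delimiters; every
    maximal delimiter-free substring becomes a term candidate."""
    # Match longer delimiters first so they win over their substrings.
    delim_re = ("|".join(re.escape(d)
                         for d in sorted(dlist, key=len, reverse=True))
                if dlist else None)
    candidates = []
    for chunk in re.split(PUNCT, corpus):                       # step (1)
        pieces = re.split(delim_re, chunk) if delim_re else [chunk]  # step (2)
        # A chunk with no delimiter survives whole, as in the paper.
        candidates.extend(p.strip() for p in pieces if p.strip())
    return candidates
```

For example, `tce_di("the scanning tunneling microscope is a kind of microscope", {"the", "is a kind of"})` yields `["scanning tunneling microscope", "microscope"]`.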
11. Acquisition of DList
- From a given stop-word list
  - Produced by experts or derived from a general corpus
  - No training is needed
- From the DList_Ext algorithm, given
  - a training corpus CorpusD_training, and
  - a domain lexicon LexiconDomain
12. The DList_Ext Algorithm
- S1: For each term Ti in LexiconDomain, mark Ti in CorpusD_training as a lexical unit
- S2: Segment the remaining text
- S3: Extract the predecessors and successors of all Ti as delimiter candidates
- S4: Remove all Ti from the delimiter candidates
- S5: Rank the delimiter candidates by frequency, keeping the top NDI (a simple threshold)
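Steps S1-S5 can be sketched as follows, assuming the training corpus is already segmented into token lists (the paper works on Chinese character strings; the whitespace tokens here are an illustrative simplification).

```python
from collections import Counter

def dlist_ext(sentences: list[list[str]], lexicon: set[str], n_di: int) -> list[str]:
    """DList_Ext sketch: collect the tokens immediately before and
    after each known domain term (S1-S3), drop the terms themselves
    (S4), and keep the n_di most frequent candidates (S5)."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in lexicon:                  # Ti, a known domain term
                if i > 0:
                    counts[tokens[i - 1]] += 1  # predecessor
                if i + 1 < len(tokens):
                    counts[tokens[i + 1]] += 1  # successor
    for term in lexicon:                        # S4: no term is a delimiter
        counts.pop(term, None)
    return [w for w, _ in counts.most_common(n_di)]  # S5: threshold NDI
```

On a toy corpus with lexicon {"kernel", "memory"}, function-like words such as "the" surface at the top of the ranking, matching the observation that delimiters are mainly functional words.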
13. Experiments: Data Preparation
- Delimiter lists
  - DListIT: extracted using CorpusIT_Small and LexiconIT
  - DListLegal: extracted using CorpusLegal_Small and LexiconLegal
  - DListSW: 494 general stop words
14. Performance Measurements
- Evaluation: precision (by sampling) and the rate of new term extraction (RNTE)
- Reference algorithms
  - SEFC-value (Ji et al., 2007) for term candidate extraction
  - TFIDF (Frank et al., 1999) for both term candidate extraction and terminology verification
  - LA_TV (Link Analysis based Terminology Verification) for fair comparison
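For reference, a generic TF-IDF style termhood score (a sketch, not the exact formulation of Frank et al., 1999) weighs a candidate's frequency in the domain corpus against how widely it appears in background documents; the corpora below are toy examples.

```python
from math import log

def tf_idf(candidate: str, domain_docs: list[str], background_docs: list[str]) -> float:
    """Generic TF-IDF score: high when the candidate is frequent in
    the domain documents but rare across background documents."""
    tf = sum(doc.count(candidate) for doc in domain_docs)          # term frequency
    df = sum(1 for doc in background_docs if candidate in doc)     # document frequency
    # +1 in the denominator avoids division by zero for unseen candidates.
    return tf * log(len(background_docs) / (1 + df))
```

A candidate appearing three times in the domain corpus but in none of two background documents scores 3 * log(2/1).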
15. Evaluation: The DList_Ext Algorithm and NDI

Coverage of Delimiters on Different Corpora (%)

| Delimiter list | CorpusLegal_Large (11,048 sentences) | CorpusIT_Large (60,508 sentences) |
|---|---|---|
| DListIT (Top 100) | 77.6 | 89.1 |
| DListIT (Top 300) | 84.6 | 92.6 |
| DListIT (Top 500) | 90.3 | 93.4 |
| DListIT (Top 700) | 92.7 | 93.9 |
| DListLegal (Top 100) | 95.8 | 92.6 |
| DListLegal (Top 300) | 97.8 | 96.2 |
| DListLegal (Top 500) | 98.7 | 96.8 |
| DListLegal (Top 700) | 99.1 | 97.1 |
| DListSW | 98.1 | 98.1 |
16. Evaluation: The DList_Ext Algorithm and NDI

(Figure: Frequency of Delimiters on Domain Corpora)
17. Evaluation: The DList_Ext Algorithm and NDI

(Figure: Performance of DListIT on CorpusIT_Large)
(Figure: Performance of DListLegal on CorpusIT_Large)
18. Evaluation: The DList_Ext Algorithm with NDI = 500

(Figure: Performance of DListIT on CorpusLegal_Large)
(Figure: Performance of DListLegal on CorpusLegal_Large)
19. Evaluation on Term Extraction

(Figure: Performance of Different Algorithms on the IT and Legal Domains)
20. Performance Analysis
- Delimiters are domain independent and stable
  - Easily extracted, and useful across domains
- Larger granularity of domain-specific terms
  - Keeps many noisy strings out
- Less sensitivity to frequency
  - The method concentrates on delimiters, without regard to the frequencies of the candidates
21. Evaluation on New Term Extraction (RNTE)

(Figure: Performance of Different Algorithms for New Term Extraction)
22. Error Analysis
- Figure-of-speech phrases
  - "it is not difficult to see that"
  - "in the new methods"
- General words
  - "mental state"
  - "architecture"
- Long strings which contain shorter terms
  - "access shared resources"
  - "traverse again"
23. Conclusion
- A delimiter-based approach for term candidate extraction
- Advantages
  - Less sensitivity to term frequency
  - Requires little prior domain knowledge and relatively little adaptation for new domains
  - Quite significant improvements for term extraction
  - Much better performance for new term extraction
- Future work
  - Improving the overall term extraction algorithms
  - Applying the approach to related NLP tasks such as NER
  - Applying the approach to other languages
24. Q & A