Chinese Term Extraction Based on Delimiters

1
Chinese Term Extraction Based on Delimiters
  • Yuhang Yang, Qin Lu, Tiejun Zhao
  • School of Computer Science and Technology, Harbin
    Institute of Technology
  • Department of Computing, The Hong Kong Polytechnic
    University
  • May, 2008

2
Outline
  • Introduction
  • Related Works
  • Methodology
  • Experiment and Discussion
  • Conclusion

3
Basic Concepts
  • Terms (terminology): lexical units that carry the
    most fundamental knowledge of a domain
  • Term extraction consists of two steps
  • Term candidate extraction, which assesses
    unithood
  • Terminology verification, which assesses
    termhood

4
Major Problems
  • Term boundary identification based on term
    features
  • Too few features are not enough
  • More features lead to more conflicts
  • Limitations in scope
  • Low-frequency terms
  • Long compound terms
  • Dependency on Chinese word segmentation

5
Main Idea
  • Delimiter-based term candidate extraction:
    identify the relatively stable, domain-
    independent words immediately before and after
    terms
  • Example (Chinese original lost in extraction):
    "A scanning tunneling microscope is a kind of
    quantum tunnelling effect-based high angular
    resolution microscope"
  • Example (Chinese original lost in extraction):
    "The socialist system is the basic system of the
    People's Republic of China"
  • Potential Advantages of the proposed approach
  • No strict limits on frequency or word length
  • No need for full segmentation
  • Relatively domain independent

6
Related Works: Statistic-based Measures
  • Internal measure (Schone and Jurafsky, 2001)
  • Internal associative measures between
    constituents of the candidate characters, such
    as
  • Frequency
  • Mutual information
  • Contextual measure
  • Dependency of a candidate on its context
  • The left/right entropy (Sornlertlamvanich et al.,
    2000)
  • The left/right context dependency (Chien, 1999)
  • Accessor variety criteria (Feng et al., 2004).
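
As an illustration of the contextual measures listed above, here is a minimal sketch of left/right entropy computed over the characters adjacent to a candidate string. It uses the generic Shannon entropy formulation and is not taken from Sornlertlamvanich et al. (2000); the function name `branching_entropy` and its parameters are hypothetical.

```python
import math
from collections import Counter


def branching_entropy(text, candidate, side="right"):
    """Entropy (in bits) of the characters next to each occurrence of candidate.

    Hypothetical helper for illustration only. side="right" looks at the
    character following the candidate, side="left" at the one preceding it.
    """
    neighbours = Counter()
    start = text.find(candidate)
    while start != -1:
        idx = start + len(candidate) if side == "right" else start - 1
        if 0 <= idx < len(text):
            neighbours[text[idx]] += 1
        # Continue the search one position later so overlaps are also counted.
        start = text.find(candidate, start + 1)

    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    # Sum of p * log2(1/p) over the observed neighbour characters.
    return sum((n / total) * math.log2(total / n) for n in neighbours.values())
```

A high value on both sides suggests the candidate appears in varied contexts, which such measures take as evidence of a free-standing unit; a value of zero means the candidate is always followed (or preceded) by the same character.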

7
Hybrid Approaches
  • The UnitRate algorithm (Chen et al., 2006)
  • Combines occurrence probability and marginal
    variety probability
  • The TCE_SEFCV algorithm (Ji et al., 2007)
  • Combines a significance estimation function and
    the C-value measure
  • Limitations
  • Data sparseness for low-frequency terms and long
    terms
  • Cascading errors caused by full segmentation

8
Observations
  • Sentences consist of substantives and
    functional words
  • Domain-specific terms (terms for short) are more
    likely to be domain substantives
  • Predecessors and successors of terms are more
    likely to be functional words or general
    substantives connecting terms
  • Predecessors and successors therefore mark
    terms; they are referred to as term delimiters
    (or simply delimiters)

9
Delimiter Based Term Extraction
  • Characteristics of delimiters
  • Mainly functional words and general substantives
  • Relatively stable
  • Domain independent
  • Can be extracted more easily
  • Proposed model
  • Identifying features of delimiters
  • Identify terms by finding their predecessors and
    successors as their boundary words

10
Algorithm design
  • TCE_DI (Term Candidate Extraction, Delimiter
    Identification)
  • Input: Corpusextract (domain corpus), DList
    (delimiter list)
  • (1) Partition Corpusextract into character
    strings at punctuation marks.
  • (2) Partition the character strings at
    delimiters to obtain term candidates.
  • If a string contains no delimiter, the whole
    string is regarded as a term candidate.
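
The two partitioning steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `tce_di`, the `PUNCTUATION` set, and the sample delimiters are hypothetical, and delimiters are matched as plain substrings (longest first), since the slides do not say how delimiters are located in unsegmented text.

```python
import re

# Assumed punctuation set; the slides do not list the marks actually used.
PUNCTUATION = "，。；：！？、,.;:!?()（）"


def tce_di(corpus, dlist):
    """Return term candidates from corpus using the delimiter list dlist."""
    # Step 1: partition the corpus into character strings at punctuation.
    punct_re = re.compile("[" + re.escape(PUNCTUATION) + r"\s]+")
    strings = [s for s in punct_re.split(corpus) if s]

    # Drop empty delimiters; an empty pattern would split at every position.
    delimiters = sorted((d for d in dlist if d), key=len, reverse=True)
    if not delimiters:
        return strings

    # Step 2: partition each string at delimiters. A string without any
    # delimiter comes back whole, i.e. it is itself a term candidate.
    delim_re = re.compile("|".join(re.escape(d) for d in delimiters))
    candidates = []
    for s in strings:
        candidates.extend(piece for piece in delim_re.split(s) if piece)
    return candidates


if __name__ == "__main__":
    # Hypothetical input and delimiters, chosen only for illustration.
    print(tce_di("扫描隧道显微镜是一种显微镜。", ["是", "一种"]))
```

Because the text is only cut at punctuation and delimiters, no full word segmentation is needed, and neither candidate length nor frequency plays a role at this stage.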

11
Acquisition of DList
  • From a given stop word list
  • Produced by experts or from a general corpus
  • No training is needed
  • DList_Ext algorithm
  • Given a training corpus CorpusD_training and a
    domain lexicon LexiconDomain

12
The DList_Ext algorithm
  • S1: For each term Ti in LexiconDomain, mark Ti
    in CorpusD_training as a lexical unit
  • S2: Segment the remaining text
  • S3: Extract the predecessors and successors of
    all Ti as delimiter candidates
  • S4: Remove all Ti from the delimiter candidates
  • S5: Rank the delimiter candidates by frequency
  • Select the top candidates with a simple
    threshold NDI
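
Steps S3 to S5 can be sketched as below. This is a minimal sketch, assuming S1 and S2 have already been carried out, so the training corpus arrives as lists of tokens in which each lexicon term is a single token. The function name `dlist_ext` and its parameters are hypothetical, and NDI is interpreted here as the number of top-ranked candidates to keep.

```python
from collections import Counter


def dlist_ext(segmented_sentences, lexicon, n_di):
    """Return the n_di most frequent neighbours of lexicon terms."""
    terms = set(lexicon)
    counts = Counter()

    # S3: collect the predecessor and successor of every term occurrence.
    for tokens in segmented_sentences:
        for i, token in enumerate(tokens):
            if token not in terms:
                continue
            if i > 0:
                counts[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                counts[tokens[i + 1]] += 1

    # S4: a term adjacent to another term is not a delimiter candidate.
    for term in terms:
        counts.pop(term, None)

    # S5: rank by frequency and keep the top n_di candidates.
    return [word for word, _ in counts.most_common(n_di)]
```

With placeholder tokens, `dlist_ext([["T1", "is", "a", "T2"], ["T2", "is", "used"]], {"T1", "T2"}, 1)` keeps only `"is"`, the most frequent neighbour of the two terms.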

13
Experiments: Data Preparation
  • Delimiter lists
  • DListIT: extracted using CorpusIT_Small and
    LexiconIT
  • DListLegal: extracted using CorpusLegal_Small
    and LexiconLegal
  • DListSW: 494 general stop words

14
Performance Measurements
  • Evaluation: precision (by sampling) and rate of
    new term extraction (RNTE)
  • Reference algorithms
  • SEFC-value (Ji et al., 2007) for term candidate
    extraction
  • TFIDF (Frank et al., 1999) for both term
    candidate extraction and terminology
    verification
  • LA_TV (Link Analysis based Terminology
    Verification) for fair comparison

15
Evaluation: DList_Ext algorithm NDI
Coverage of Delimiters on Different Corpora (%)

Delimiter list         CorpusLegal_Large      CorpusIT_Large
                       (11,048 sentences)     (60,508 sentences)
DListIT (Top100)       77.6                   89.1
DListIT (Top300)       84.6                   92.6
DListIT (Top500)       90.3                   93.4
DListIT (Top700)       92.7                   93.9
DListLegal (Top100)    95.8                   92.6
DListLegal (Top300)    97.8                   96.2
DListLegal (Top500)    98.7                   96.8
DListLegal (Top700)    99.1                   97.1
DListSW                98.1                   98.1

16
Evaluation: DList_Ext algorithm NDI
  • Figure: Frequency of Delimiters on Domain Corpora

17
Evaluation: DList_Ext algorithm NDI
  • Figure: Performance of DListIT on CorpusIT_Large
  • Figure: Performance of DListLegal on
    CorpusIT_Large

18
NDI = 500
  • Figure: Performance of DListIT on
    CorpusLegal_Large
  • Figure: Performance of DListLegal on
    CorpusLegal_Large

19
Evaluation on Term Extraction
  • Figure: Performance of Different Algorithms on
    IT Domain and Legal Domain

20
Performance Analysis
  • Domain-independent and stable delimiters
  • Easy to extract and useful
  • Larger granularity of domain-specific terms
  • Keeps many noisy strings out
  • Less sensitivity to frequency
  • Concentrates on delimiters without regard to
    the frequencies of the candidates

21
Evaluation on New Term Extraction (RNTE)
  • Figure: Performance of Different Algorithms for
    New Term Extraction

Error Analysis
  • Figure-of-speech phrases (Chinese originals lost
    in extraction)
  • "it is not difficult to see that"
  • "in the new methods"
  • General words
  • "mental state"
  • "architecture"
  • Long strings which contain short terms
  • "access shared resources"
  • "traverse again"

23
Conclusion
  • A delimiter based approach for term candidate
    extraction
  • Advantages
  • Less sensitivity to term frequency
  • Requiring little prior domain knowledge,
    relatively less adaptation for new domains
  • Quite significant improvements for term
    extraction
  • Much better performance for new term extraction
  • Future work
  • Improving overall term extraction algorithms
  • Applying to related NLP tasks such as NER
  • Applying to other languages

24
Q & A
  • Thank you!