Terminology Extraction and Automatic Indexing

Provided by: frie83

1
Terminology Extraction and Automatic Indexing
  • - Comparison and Qualitative Evaluation of
    Methods -

2
Outline
  • Introduction
  • Thesis / Aim
  • Methods
  • Experiments
  • Results
  • Conclusions

3
Introduction
  • 1.1 Terminology Extraction
  • Terminology: set of terms representing the
    system of concepts of a particular subject field
    (ISO 1087)
  • Task: identify terms belonging to these concepts,
    feed them into terminological databases
  • Applications
  • Bilingual translation
  • Monolingual dictionaries, standardization, ...

4
Introduction
  • 1.2 Automatic Indexing
  • Index term: content-bearing key
    (Sparck-Jones99)
  • Task: identify terms that best describe the
    content of a document and help to discriminate it
    from others
  • Applications
  • Classical Information Retrieval
  • Feature Selection for Classification / Clustering
  • Knowledge Management (e.g. topic maps, ...)

5
Thesis / Aim
  • 2.1 Thesis
  • The tasks of Terminology Extraction and Automatic
    Indexing have much in common => the same set of
    basic algorithms can be applied in both areas
  • Especially linguistic methods applied in
    terminology extraction should be tried in
    automatic indexing
  • 2.2 Aim
  • Transfer some ideas from terminology extraction
    to automatic indexing, evaluate their performance
  • Find out which adjustments have to be made

6
Methods
  • 3.1 Statistics
  • Idea from Terminology Extraction
  • Use large, well-balanced reference corpus R
  • Extract terms from domain-specific text that
    occur with significantly higher relative
    frequency than in R
  • Apply statistical significance measure to single
    out those terms (e.g. likelihood ratio)
  • Transfer to Automatic Indexing
  • Idea is not new: TF/IDF is very similar
  • Might need extra frequency threshold to ensure
    that selected terms are representative of the
    document's content
  • Advantages of this approach
  • Deterministic
  • Indexing in stand-alone fashion: no large
    collection needed
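
The comparison against a reference corpus R can be sketched as follows. This is a minimal illustration, assuming the likelihood ratio is computed as the common log-likelihood (G2) statistic over term counts; the function names and the `top_n` parameter are hypothetical, while the minimum frequency of 2 follows the setting used later in the experiments.

```python
import math
from collections import Counter


def _cell(observed, expected):
    # Contribution of one contingency-table cell; empty cells add nothing.
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def log_likelihood(freq_doc, size_doc, freq_ref, size_ref):
    """G2 statistic for one term: domain text versus reference corpus R."""
    # Pooled relative frequency under the hypothesis "same rate in both texts".
    p = (freq_doc + freq_ref) / (size_doc + size_ref)
    cells = [
        (freq_doc, size_doc * p),
        (freq_ref, size_ref * p),
        (size_doc - freq_doc, size_doc * (1 - p)),
        (size_ref - freq_ref, size_ref * (1 - p)),
    ]
    return 2.0 * sum(_cell(o, e) for o, e in cells)


def extract_terms(doc_tokens, ref_tokens, min_freq=2, top_n=10):
    """Rank terms that are over-represented in the document relative to R."""
    doc_counts, ref_counts = Counter(doc_tokens), Counter(ref_tokens)
    n_doc, n_ref = len(doc_tokens), len(ref_tokens)
    scored = []
    for term, freq in doc_counts.items():
        if freq < min_freq:
            continue  # extra frequency threshold
        if freq / n_doc <= ref_counts[term] / n_ref:
            continue  # keep only terms with higher relative frequency than in R
        score = log_likelihood(freq, n_doc, ref_counts[term], n_ref)
        scored.append((score, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_n]]
```

Because each document is scored only against the fixed corpus R, the same input always yields the same term list, and no document collection is required.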

7
Methods
  • 3.1 Statistics: Likelihood Ratio Example

Source: Microsoft Encarta, entry on 'Airplane'

8
Methods
  • 3.2 Syntax
  • Idea from Terminology Extraction
  • Technical terms are often noun phrases of a
    characteristic form (e.g. N N or A N or just N)
  • Combining regular expressions over part-of-speech
    tags with frequency filters can yield good
    (multi-word) term lists
  • Transfer to Automatic Indexing
  • Use of phrases as index terms could not be shown
    to improve retrieval performance or
    classification accuracy => normally, it is
    sufficient to include the phrases' constituents
    as index terms
  • Is part-of-speech helpful? Thesis: nouns are more
    content-bearing than other word classes (see
    experiments)
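
A pattern filter of this kind can be sketched as follows. The example assumes the text has already been part-of-speech tagged and that tags are reduced to a hypothetical one-letter set (A = adjective, N = noun, O = anything else); the tagger itself, the function name, and the default frequency threshold are not taken from the slides.

```python
import re
from collections import Counter

# Lookahead lets overlapping matches be found, e.g. "A N N" yields both
# the "A N" and the "N N" candidate.
PATTERN = re.compile(r"(?=(AN|NN))")


def candidate_terms(tagged_tokens, min_freq=2):
    """Return (phrase, count) pairs for A N and N N sequences.

    tagged_tokens: list of (word, tag) pairs; every tag other than
    "A" or "N" is treated as "O".
    """
    tags = "".join(tag if tag in ("A", "N") else "O" for _, tag in tagged_tokens)
    words = [word.lower() for word, _ in tagged_tokens]
    counts = Counter()
    for match in PATTERN.finditer(tags):
        start = match.start()
        length = len(match.group(1))
        counts[" ".join(words[start:start + length])] += 1
    # Frequency filter: keep only phrases seen at least min_freq times.
    return [(phrase, n) for phrase, n in counts.most_common() if n >= min_freq]
```

Encoding the tag sequence as a string with one character per token keeps string offsets aligned with token positions, so ordinary regular expressions can be applied directly.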

9
Methods
  • 3.2 Syntax: Example for patterns A N, N N

Source: Microsoft Encarta, entry on 'Airplane'

10
Methods
  • 3.3 Morphology
  • Idea from Terminology Extraction
  • One-word compounding languages (Danish, German,
    Korean, ...): technical terms are often compounds
  • There are some domain-specific morphemes (cf.
    Heid98) that appear in many compounds
  • Identify these and then extract compounds that
    contain them
  • Transfer to Automatic Indexing
  • Parts of compounds are often free morphemes =>
    domain specific morphemes can be used as index
    terms themselves
  • Compound constituents are more general in meaning
    than the compound itself => describe document
    content in a general way
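
The morpheme idea can be sketched as follows. This is a simplified illustration, assuming a hypothetical lexicon of known morphemes and a naive splitter that ignores linking elements; the function names and the `min_compounds` threshold are assumptions, not the method used in the presentation.

```python
from collections import defaultdict


def split_compound(word, lexicon):
    """Segment `word` completely into lexicon entries, or return None."""
    word = word.lower()
    if not word:
        return []
    # Try the longest possible first part, then segment the remainder.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


def domain_morphemes(tokens, lexicon, min_compounds=2):
    """Morphemes that occur in at least `min_compounds` different compounds."""
    compounds_by_morpheme = defaultdict(set)
    for token in {t.lower() for t in tokens}:
        parts = split_compound(token, lexicon)
        if parts and len(parts) > 1:  # a single part is not a compound
            for part in parts:
                compounds_by_morpheme[part].add(token)
    ranked = sorted(compounds_by_morpheme.items(),
                    key=lambda item: (-len(item[1]), item[0]))
    return [(m, sorted(c)) for m, c in ranked if len(c) >= min_compounds]
```

With a toy lexicon containing "flug", "zeug", "hafen" and "bahn", the German words Flugzeug, Flughafen and Flugbahn would all contribute to the morpheme "flug", which could then serve as an index term in its own right.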

11
Methods
  • 3.3 Morphology: Example

12
Experiments
  • 4.1 Hypotheses to be verified
  • Morphemes are more general in meaning than other
    descriptors => high overlap between morphemes
    selected from different documents
  • Morphemes and nouns are good features for
    classification because they are content-bearing
  • Statistical selection of index terms: likelihood
    ratio tests are superior to simple TF/IDF

13
Experiments
  • 4.2 Setup: Classification of newspaper articles
  • Corpus: 991 newspaper articles, 10 categories
  • Apply different indexing methods as feature
    selection techniques
  • Classification using WEKA package: multinomial
    naïve Bayes with 10-fold cross-validation
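
The evaluation loop can be sketched as follows. The experiments used the WEKA package; the code below is not WEKA but a small stand-in written in plain Python, assuming add-one smoothing and a simple unstratified fold assignment. All function names are hypothetical, and `vocabulary` stands for the index terms selected by one of the indexing methods.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, vocabulary):
    """Collect class counts and per-class term counts over the vocabulary."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in vocabulary)
    return class_counts, term_counts


def predict_nb(model, tokens, vocabulary):
    """Return the class with the highest log posterior for one document."""
    class_counts, term_counts = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total = sum(term_counts[label].values())
        score = math.log(class_counts[label] / n_docs)
        for t in tokens:
            if t in vocabulary:
                # Add-one (Laplace) smoothing over the selected features.
                score += math.log((term_counts[label][t] + 1)
                                  / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, vocabulary, folds=10):
    """Accuracy over `folds` folds; every document is tested exactly once."""
    correct = 0
    for k in range(folds):
        test_idx = set(range(k, len(docs), folds))
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_nb(train_docs, train_labels, vocabulary)
        correct += sum(predict_nb(model, docs[i], vocabulary) == labels[i]
                       for i in test_idx)
    return correct / len(docs)
```

Swapping the `vocabulary` argument between feature sets while keeping the classifier fixed is what allows the indexing methods to be compared on equal terms.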

14
Experiments
  • 4.3 Indexing methods evaluated
  • TF/IDF: extract terms with high TF/IDF, use
    additional stop word list
  • Statistical analysis: extract terms with high
    likelihood ratio values and minimum frequency of
    2
  • Nouns: extract most frequent nouns
  • Morphemes: extract domain-specific morphemes,
    i.e. morphemes that occur in many different
    compounds throughout the text
  • For each method: extract the best X terms as a
    feature vector (X varied) => not a standard
    method, but we want to measure the quality of
    index terms without further intervention!
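
Selecting the best X terms per document can be sketched for the TF/IDF baseline as follows. This is a minimal illustration, assuming raw term frequency and a logarithmic inverse document frequency; the exact weighting used in the experiments is not stated in the slides, and the function name is hypothetical.

```python
import math
from collections import Counter


def top_tfidf_terms(documents, x, stop_words=frozenset()):
    """Return the x highest-scoring TF/IDF terms for each document.

    documents: list of token lists; stop_words: terms to exclude.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    selected = []
    for tokens in documents:
        tf = Counter(t for t in tokens if t not in stop_words)
        scores = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected.append(ranked[:x])
    return selected
```

Unlike the reference-corpus approach, this baseline needs the whole collection to compute document frequencies, so a single document cannot be indexed in isolation.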

15
Results
  • 5.1 Generality and Overlap

16
Results
  • 5.2 Classification accuracy

17
Conclusions
  • Morphemes and nouns are more general in meaning
    than terms acquired from statistical analyses
  • Likelihood ratio outperforms TF/IDF
  • Linguistic knowledge (part-of-speech, morphemes)
    can improve classification results significantly
    when compared to purely statistical approaches
  • Effects become less apparent as more features are
    used => good for reducing dimensionality

18
Thank you!