Terminology Extraction and Automatic Indexing

Provided by: frie83

Tags: automatic | encarta | extraction | indexing | terminology

Transcript and Presenter's Notes

1
Terminology Extraction and Automatic Indexing

2
Outline

3
Introduction

1.1 Terminology Extraction
Terminology: set of terms representing the
system of concepts of a particular subject field
(ISO 1087)
Task: identify terms belonging to these concepts,
feed them into terminological databases
Applications
Bilingual translation
Monolingual dictionaries, standardization, ...

4
Introduction

1.2 Automatic Indexing
Index term: content-bearing key
(Sparck-Jones99)
Task: identify terms that best describe the
content of a document and help to discriminate it
from others
Applications
Classical Information Retrieval
Feature Selection for Classification / Clustering
Knowledge Management (e.g. topic maps, ...)

5
Thesis / Aim

2.1 Thesis
The tasks of Terminology Extraction and Automatic
Indexing have much in common => the same set of
basic algorithms can be applied in both areas
Especially linguistic methods applied in
terminology extraction should be tried in
automatic indexing
2.2 Aim
Transfer some ideas from terminology extraction
to automatic indexing, evaluate their performance
Find out which adjustments have to be made

6
Methods

3.1 Statistics
Idea from Terminology Extraction
Use large, well-balanced reference corpus R
Extract terms from domain-specific text that
occur with significantly higher relative
frequency than in R
Apply statistical significance measure to single
out those terms (e.g. likelihood ratio)
Transfer to Automatic Indexing
Idea is not new: TF/IDF is very similar
Might need extra frequency threshold to ensure
that selected terms are representative of the
document's content
Advantages of this approach
Deterministic
Indexing in stand-alone fashion: no large
collection needed
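
The statistical idea on this slide can be sketched as follows. This is a minimal illustration, not the presenter's implementation: it assumes both texts are already tokenized, uses the standard log-likelihood ratio (G2) over a 2x2 contingency table as the significance measure, and the names `log_likelihood`, `extract_terms`, `min_freq` and `top_n` are made up for this example.

```python
import math
from collections import Counter


def log_likelihood(k_doc, n_doc, k_ref, n_ref):
    """G2 statistic for a term seen k_doc times in n_doc document tokens
    and k_ref times in n_ref reference-corpus tokens."""
    total = n_doc + n_ref
    term_total = k_doc + k_ref
    # (observed count, column total, row total) for the four table cells
    cells = [
        (k_doc, n_doc, term_total),
        (k_ref, n_ref, term_total),
        (n_doc - k_doc, n_doc, total - term_total),
        (n_ref - k_ref, n_ref, total - term_total),
    ]
    g2 = 0.0
    for observed, col_total, row_total in cells:
        if observed > 0:  # an empty cell contributes nothing
            expected = col_total * row_total / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def extract_terms(doc_tokens, ref_tokens, min_freq=2, top_n=10):
    """Rank terms that are relatively more frequent in the document
    than in the reference corpus R (hypothetical helper)."""
    if not doc_tokens or not ref_tokens:
        return []
    doc_counts, ref_counts = Counter(doc_tokens), Counter(ref_tokens)
    n_doc, n_ref = len(doc_tokens), len(ref_tokens)
    scored = []
    for term, k_doc in doc_counts.items():
        k_ref = ref_counts.get(term, 0)
        # extra frequency threshold, plus "higher relative frequency than in R"
        if k_doc < min_freq or k_doc / n_doc <= k_ref / n_ref:
            continue
        scored.append((log_likelihood(k_doc, n_doc, k_ref, n_ref), term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_n]]
```

Because the score depends only on the document and the fixed reference corpus, the same document always yields the same index terms, which matches the "deterministic" and "stand-alone" advantages listed above.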

7
Methods

Source: Microsoft Encarta entry on Airplane
8
Methods

3.2 Syntax
Idea from Terminology Extraction
Technical terms are often noun phrases of a
characteristic form (e.g. N N or A N or just N)
Combining regular expressions over part-of-speech
tags with frequency filters can yield good
(multi-word) term lists
Transfer to Automatic Indexing
Use of phrases as index terms could not be shown
to improve retrieval performance or
classification accuracy => normally, it is
sufficient to include the phrase's constituents
as index terms
Part-of-speech helpful? Thesis: nouns are more
content-bearing than other word classes (see
experiments)
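
A small sketch of the pattern-plus-frequency idea from this slide. It assumes the text has already been part-of-speech tagged by some external tagger and that tags are reduced to single letters ("N" for noun, "A" for adjective); the function name `candidate_terms` and the `min_freq` parameter are illustrative only.

```python
import re
from collections import Counter

# With one character per tag, this matches "N", "N N" and "A N".
TERM_PATTERN = re.compile(r"[AN]?N")


def candidate_terms(tagged_tokens, min_freq=2):
    """Return {term: count} for one- and two-word spans whose tag
    sequence matches TERM_PATTERN and that occur at least min_freq times."""
    words = [word.lower() for word, _ in tagged_tokens]
    # Any tag other than A or N is mapped to a filler character.
    tags = "".join(tag if tag in ("A", "N") else "x" for _, tag in tagged_tokens)
    counts = Counter()
    for start in range(len(tags)):
        for length in (1, 2):
            span = tags[start:start + length]
            if len(span) == length and TERM_PATTERN.fullmatch(span):
                counts[" ".join(words[start:start + length])] += 1
    # Frequency filter: drop candidates that are too rare.
    return {term: n for term, n in counts.items() if n >= min_freq}
```

For indexing, the single-noun matches are the interesting part: per the slide, the constituents of a phrase are normally sufficient as index terms.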

9
Methods

Source: Microsoft Encarta entry on Airplane
10
Methods

3.3 Morphology
Idea from Terminology Extraction
One-word compounding languages (Danish, German,
Korean, ...): technical terms are often compounds
There are some domain-specific morphemes (cf.
Heid98) that appear in many compounds
Identify these and then extract compounds that
contain them
Transfer to Automatic Indexing
Parts of compounds are often free morphemes =>
domain-specific morphemes can be used as index
terms themselves
Compound constituents are more general in meaning
than the compound itself => describe document
content in a general way
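
One way to picture the morpheme step is the sketch below. It is not the method from the slides: it assumes a small lexicon of free morphemes is available, splits words by simple dictionary lookup (ignoring linking elements such as the German "s"), and treats a morpheme as domain-specific when it occurs in at least `min_compounds` different compounds. All names are hypothetical.

```python
from collections import defaultdict


def split_compound(word, lexicon):
    """Split word into lexicon entries, preferring a long first part.
    Returns a list of parts, or None if no full segmentation exists."""
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


def domain_morphemes(words, lexicon, min_compounds=2):
    """Map each morpheme to the compounds containing it, keeping only
    morphemes found in at least min_compounds distinct compounds."""
    compounds_by_morpheme = defaultdict(set)
    for word in set(words):
        parts = split_compound(word, lexicon)
        if parts and len(parts) >= 2:  # a single part is not a compound
            for part in set(parts):
                compounds_by_morpheme[part].add(word)
    return {
        morpheme: sorted(compounds)
        for morpheme, compounds in compounds_by_morpheme.items()
        if len(compounds) >= min_compounds
    }
```

The keys of the returned mapping are the candidate index terms; the values are the compounds that could be fed into a term list.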

11
Methods

12
Experiments

4.1 Hypotheses to be verified
Morphemes are more general in meaning than other
descriptors => high overlap between morphemes
selected from different documents
Morphemes and nouns are good features for
classification because they are content-bearing
Statistical selection of index terms: likelihood
ratio tests are superior to simple TF/IDF

13
Experiments

4.2 Setup: Classification of newspaper articles
Corpus: 991 newspaper articles, 10 categories
Apply different indexing methods as feature
selection techniques
Classification using WEKA package: multinomial
naïve Bayes with 10-fold cross validation
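
The experiments used WEKA; the following stand-alone sketch only illustrates the same kind of setup in plain Python. It assumes Laplace smoothing and simple strided folds (no shuffling or stratification), which may differ from WEKA's behavior, and `vocab` stands for the set of index terms selected by one of the indexing methods. All function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, vocab):
    """Fit a multinomial naive Bayes model restricted to the terms in vocab."""
    class_sizes = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for label, size in class_sizes.items():
        total = sum(term_counts[label].values())
        log_prior = math.log(size / len(labels))
        # Laplace (add-one) smoothing over the selected vocabulary
        log_like = {
            t: math.log((term_counts[label][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def predict_nb(model, tokens):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(sorted(model), key=score)


def cross_validate(docs, labels, vocab, k=10):
    """k-fold cross validation accuracy; fold i holds documents i, i+k, ..."""
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_nb(train_docs, train_labels, vocab)
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)
```

Swapping in a different `vocab` while keeping everything else fixed is what makes the accuracy comparable across indexing methods.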

14
Experiments

4.3 Indexing methods evaluated
TF/IDF: extract terms with high TF/IDF, use
additional stop word list
Statistical analysis: extract terms with high
likelihood ratio values and minimum frequency of
2
Nouns: extract most frequent nouns
Morphemes: extract domain-specific morphemes,
i.e. morphemes that occur in many different
compounds throughout the text
For each method: extract the best X terms as a
feature vector (X varied) => not a standard
method, but we want to measure the quality of the
index terms without further intervention!
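
As an illustration of the TF/IDF baseline with a "best X terms" cutoff, here is a minimal sketch. It assumes tokenized documents, raw term frequency, and the common `log(N / df)` form of IDF; the exact weighting used in the experiments is not stated on the slides, and `top_tfidf_terms`, `top_x` and `stop_words` are illustrative names.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, doc_index, top_x=5, stop_words=frozenset()):
    """Return the top_x terms of docs[doc_index] ranked by TF * IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Term frequency in the target document, after stop word removal
    tf = Counter(t for t in docs[doc_index] if t not in stop_words)
    scores = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
    # Sort by descending score; ties are broken alphabetically
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_x]
```

Unlike the reference-corpus approach, this needs the whole collection to compute document frequencies, so a document cannot be indexed on its own.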

15
Results
5.1 Generality and Overlap
16
Results
5.2 Classification accuracy
17
Conclusions

Morphemes and nouns are more general in meaning
than terms acquired from statistical analyses
Likelihood ratio outperforms TF/IDF
Linguistic knowledge (part-of-speech, morphemes)
can improve classification results significantly
when compared to purely statistical approaches
Effects become less apparent as more features are
used => good for reducing dimensionality

18
Thank you!
