Title: Thesaurus-Based Automatic Keyphrase Indexing
1 Thesaurus-Based Automatic Keyphrase Indexing
Olena Medelyan and Ian H. Witten
Digital Library Lab
Department of Computer Science
The University of Waikato, New Zealand
- Agenda
- Indexing Task
- Keyphrases and vocabularies
- Existing approaches
- KEA
- How does it work
- Computing features
- Evaluation
- Evaluation
- Standard evaluation
- Indexing consistency
- Examples
KEA: Keyphrase Extraction Algorithm
2 Keyphrases
3 Keyphrases
4 Keyphrases
5 Controlled Vocabulary
- FAO's domain-specific thesaurus Agrovoc (FAO: United Nations Food and Agriculture Organization)
- 17,000 descriptors, i.e. allowed index terms
- 11,000 non-descriptors that are linked to descriptors, e.g. Obesity → Overweight
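The descriptor / non-descriptor relation can be pictured as a simple lookup. The following is a minimal Python sketch, assuming a hypothetical in-memory toy vocabulary rather than the real Agrovoc data or any Agrovoc API; only the Obesity → Overweight link is taken from the slide, and the name `to_descriptor` is made up for illustration.

```python
# Hypothetical toy vocabulary (assumption): the real Agrovoc has about
# 17,000 descriptors and 11,000 non-descriptors.
DESCRIPTORS = {"overweight", "food consumption", "taxes"}

# Non-descriptors point to the descriptor that should be used instead.
NON_DESCRIPTORS = {"obesity": "overweight"}


def to_descriptor(term):
    """Return the allowed index term for `term`, or None if it is unknown."""
    key = term.strip().lower()
    if key in DESCRIPTORS:
        return key
    return NON_DESCRIPTORS.get(key)
```

With this table, `to_descriptor("Obesity")` resolves to the descriptor `overweight`, while a term outside the vocabulary yields `None` and cannot be used as an index term.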
6 Manual Indexing
- Professional indexer
- reads the document
- determines the topics of the document
- assigns keyphrases from controlled vocabularies
- Time-consuming, expensive
- Assigning metadata for 1 catalogue entry takes 2h
- Low indexing consistency
- Practically impossible in digital libraries!
7 Automatic Indexing
- Existing approaches
- KEA, combined with a Controlled Vocabulary
Keyphrase Extraction
Select significant n-grams or noun phrases (NPs) according to their characteristics
- Easy and fast implementation
- Not much training required
- Restriction to syntax
- Low quality phrases
- No consistency
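9 How KEA Works
- Inputs: controlled vocabulary (CV) and documents (DOCs)
- Extract candidates
- pseudo-phrase matching: predatory birds → bird predat
- resulting candidates: bird predat, aquacult, fisheri, ...
- Compute features
- Training?
- yes: use manual keyphrases (manual KEYs) to compute the model (Naïve Bayes) → MODEL
- no: use the MODEL to compute probabilities → automatic keyphrases (automatic KEYs)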
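The candidate extraction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which stemmer or stopword list KEA uses, so `STOPWORDS`, `SUFFIXES` and `toy_stem` below are hypothetical stand-ins, chosen so that the slide's example (predatory birds → bird predat) is reproduced.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "in", "of", "on", "the"}  # assumed toy list
SUFFIXES = ("ory", "ing", "ed", "s")  # assumed toy stemming rules


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(phrase):
    """Lowercase, drop stopwords, stem, and sort the remaining words."""
    words = re.findall(r"[a-z]+", phrase.lower())
    stems = sorted(toy_stem(w) for w in words if w not in STOPWORDS)
    return " ".join(stems)


def extract_candidates(text, vocabulary, max_len=3):
    """Count vocabulary terms whose pseudo-phrase matches an n-gram of text."""
    index = {pseudo_phrase(term): term for term in vocabulary}
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i : i + n]
            # Skip n-grams that start or end with a stopword, so the same
            # occurrence is not counted twice.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            key = pseudo_phrase(" ".join(gram))
            if key in index:
                counts[index[key]] += 1
    return counts
```

Because the words are stemmed and sorted, "Birds, predatory" and "predatory birds" map to the same pseudo-phrase, which is what lets document text be matched against vocabulary terms regardless of word order or inflection.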
10 KEA's Features
- TF×IDF: specific for a given document
- First Occurrence: in the beginning/end
- Phrase Length: 2 words
- Node Degree: related to other phrases in the document
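The four features could be computed roughly as follows. This is a hedged sketch: the slides give no formulas, so the TF×IDF and first-occurrence definitions below follow the commonly used formulation and are assumptions, as are all function and parameter names; `related` is a hypothetical mapping from a term to its thesaurus neighbours.

```python
import math


def tf_idf(phrase_count, doc_size, doc_freq, n_docs):
    """Frequency within the document times inverse document frequency."""
    return (phrase_count / doc_size) * -math.log2(doc_freq / n_docs)


def first_occurrence(first_position, doc_size):
    """Relative position of the first occurrence (0 = start of document)."""
    return first_position / doc_size


def phrase_length(phrase):
    """Number of words in the phrase."""
    return len(phrase.split())


def node_degree(phrase, candidates, related):
    """Number of other candidate phrases in the document linked to `phrase`."""
    return sum(
        1
        for other in related.get(phrase, ())
        if other in candidates and other != phrase
    )
```

For example, a phrase that occurs twice in a 100-word document and appears in 1 of 4 documents gets a TF×IDF of 0.02 × 2 = 0.04 under this definition.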
12 Evaluation
- 10-fold cross validation on a 200 document set
- Concept matching (Agrovoc links are taken into account), e.g. predatory birds and noxious birds
- Indexing Consistency
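The evaluation measures can be illustrated with a short Python sketch. The slides do not state which formulas were used, so this assumes exact-match precision and recall for the standard evaluation and Rolling's measure (2C / (A + B)) for indexing consistency; the function names are hypothetical.

```python
def precision_recall(assigned, gold):
    """Exact-match precision and recall of `assigned` against `gold` terms."""
    assigned, gold = set(assigned), set(gold)
    correct = len(assigned & gold)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def rolling_consistency(terms_a, terms_b):
    """Rolling's measure: 2C / (A + B), where C counts the terms in common."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty term sets agree trivially
    return 2 * len(a & b) / (len(a) + len(b))
```

Consistency is symmetric, so it can compare two human indexers with each other as well as an algorithm with a human, which makes the human-to-human figure a natural baseline for the automatic one.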
13 Example
The Growing Global Obesity Problem: Some Policy Options to Address It
Compared: ≥ 2 Indexers vs. KEA
- Exact (assigned by both): overweight, food consumption, taxes, developed countries
- Similar: prices, price fixing, price policies, controlled prices, diets, body weight, fiscal policies, nutrition policies
- No match: feeding habits, saturated fat, food intake, nutritional requirements
14 The Global Obesity Problem
Figure labels: Agrovoc terms; Indexers 1 2 3 4 5 6; KEA
Terms shown in the figure:
- energy value
- public health
- nutritional disorders
- regulations
- weight reduction
- nutrient excesses
- developing countries
- disease control
- nutritional requirements
- diet
- dietary guidelines
- nutrition status
- nutrition programs
- developed countries
- body weight
- feeding habits
- meal patterns
- nutrition surveillance
- overweight
- food policies
- price fixing
- nutritional physiology
- price formation
- controlled prices
- saturated fat
- food intake
- overeating
- human nutrition
- nutrition policies
- price policies
- foods
- food consumption
- fiscal policies
- policies
- prices
- direct taxation
- urbanization
- globalization
- taxes