Title: Keyphrase Indexing with Controlled Vocabularies
1 Keyphrase Indexing with Controlled Vocabularies: Will Computers Index Better than Humans?
Olena Medelyan
Department of Computer Science
The University of Waikato, New Zealand
Agenda
- Manual Indexing
  - Experiment with 6 professionals
  - Inter-indexer consistency
- Automatic Indexing
  - Existing approaches
  - How KEA works
  - Evaluation
- Lexical Chains
  - Extraction from text
  - Weighting
Thanks
2 Indexing with a Controlled Vocabulary: An Experiment
- 10 documents related to agricultural topics
- 6 professional indexers from FAO (United Nations Food and Agriculture Organization)
- FAO's domain-specific thesaurus Agrovoc
  - 17,000 descriptors, i.e. allowed index terms
  - 11,000 non-descriptors that are linked to descriptors, e.g. Obesity → Overweight
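A minimal sketch of how descriptors and non-descriptors interact: a non-descriptor is never used as an index term itself but points to the descriptor that should be used instead. The tiny vocabulary below is illustrative only (apart from the Obesity → Overweight link from the slide) and the function name `to_descriptor` is hypothetical.

```python
# Toy controlled vocabulary; entries are illustrative, not real Agrovoc data
# (only the obesity -> overweight link is taken from the slide).
DESCRIPTORS = {"overweight", "food consumption", "taxes"}
NON_DESCRIPTORS = {"obesity": "overweight"}  # non-descriptor -> descriptor


def to_descriptor(term):
    """Return the allowed index term for `term`, or None if it is not in the vocabulary."""
    key = term.strip().lower()
    if key in DESCRIPTORS:
        return key
    return NON_DESCRIPTORS.get(key)


print(to_descriptor("Obesity"))    # overweight
print(to_descriptor("Taxes"))      # taxes
print(to_descriptor("chocolate"))  # None
```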
4 The Global Obesity Problem
Agrovoc terms
energy value
public health
nutritional disorders
regulations
weight reduction
nutrient excesses
developing countries
disease control
nutritional requirements
diet
dietary guidelines
nutrition status
nutrition programs
developed countries
feeding habits
meal patterns
nutrition surveillance
overweight
food policies
nutritional physiology
price formation
food intake
overeating
human nutrition
nutrition policies
price policies
foods
food consumption
fiscal policies
prices
direct taxation
urbanization
globalization
taxes
5 The Global Obesity Problem
[Figure: the same Agrovoc terms as on slide 4, shown against Indexers 1 to 6]
6 Inter-Indexer Consistency
- Measuring consistency in library catalogues
  - [formula not preserved]
- Conceptual consistency > terminological consistency
  - 2 out of 3 non-matching terms match conceptually
  - non-matching terms: 28, non-matching concepts: 10
- New measure M3: consider semantic relations between terms!
[Figure: diagram of the term sets of Indexer1 and Indexer2, with regions labelled A, C, B]
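The consistency formula on this slide did not survive extraction. Two measures that are widely used for library catalogues are Hooper's and Rolling's; whether the slide showed either of them is an assumption. In the sketch below, A and B are the sizes of the two indexers' term sets and C is the number of terms they share. The example term sets are made up.

```python
def hooper(terms1, terms2):
    """Hooper's measure: C / (A + B - C)."""
    a, b = set(terms1), set(terms2)
    c = len(a & b)
    denominator = len(a) + len(b) - c
    return c / denominator if denominator else 1.0


def rolling(terms1, terms2):
    """Rolling's measure: 2C / (A + B)."""
    a, b = set(terms1), set(terms2)
    c = len(a & b)
    denominator = len(a) + len(b)
    return 2 * c / denominator if denominator else 1.0


indexer1 = {"overweight", "taxes", "food consumption"}
indexer2 = {"overweight", "taxes", "prices", "diet"}
print(hooper(indexer1, indexer2))   # 2 / (3 + 4 - 2) = 0.4
print(rolling(indexer1, indexer2))  # 4 / 7
```

Both measures count only exact term matches, which is the limitation the slide's M3 is meant to address.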
7 Inter-Indexer Consistency
- Set of index terms as a vector
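One way to read "set of index terms as a vector" is a binary vector over the vocabulary, with consistency measured as the cosine between two such vectors. This is a sketch under that assumption; the slide's own formula is not preserved, and the helper names are hypothetical.

```python
import math


def term_vector(terms, vocabulary):
    """Binary vector: 1.0 where the indexer assigned the vocabulary term, else 0.0."""
    return [1.0 if term in terms else 0.0 for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


indexer1 = {"overweight", "taxes", "food consumption"}
indexer2 = {"overweight", "taxes", "prices", "diet"}
vocabulary = sorted(indexer1 | indexer2)

similarity = cosine(term_vector(indexer1, vocabulary), term_vector(indexer2, vocabulary))
print(similarity)  # 2 / sqrt(3 * 4)
```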
8 Integrating Semantic Relations
- Create two matrices
  - RT (related term) links
  - BT/NT (broader term / narrower term) links
  - [formula not preserved]
- Modified measure
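The modified measure itself is not preserved. The sketch below shows one plausible way to fold the two link matrices into a vector similarity: a term pair contributes 1 for an exact match, a smaller weight for an RT link and another weight for a BT/NT link. The form of the formula, the toy links, and the names `w_rt` and `w_bt_nt` are all assumptions; the default values 0.2 and 0.15 are the weights quoted later in the deck, and which weight belongs to which relation is a guess.

```python
import math

# Toy vocabulary and links; illustrative only, not actual Agrovoc relations.
VOCAB = ["prices", "price policies", "taxes", "direct taxation"]
RT_PAIRS = [("prices", "price policies")]
BT_NT_PAIRS = [("taxes", "direct taxation")]


def link_matrix(pairs):
    """Symmetric 0/1 matrix over VOCAB with a 1 for every linked pair."""
    index = {term: i for i, term in enumerate(VOCAB)}
    matrix = [[0.0] * len(VOCAB) for _ in VOCAB]
    for a, b in pairs:
        matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = 1.0
    return matrix


def extended_similarity(terms1, terms2, w_rt=0.2, w_bt_nt=0.15):
    """Cosine-style similarity in which linked terms count as partial matches."""
    u = [1.0 if t in terms1 else 0.0 for t in VOCAB]
    v = [1.0 if t in terms2 else 0.0 for t in VOCAB]
    rt, bt_nt = link_matrix(RT_PAIRS), link_matrix(BT_NT_PAIRS)
    score = 0.0
    for i in range(len(VOCAB)):
        for j in range(len(VOCAB)):
            exact = 1.0 if i == j else 0.0
            score += u[i] * (exact + w_rt * rt[i][j] + w_bt_nt * bt_nt[i][j]) * v[j]
    # For binary vectors the sum of entries equals the squared norm.
    norm = math.sqrt(sum(u)) * math.sqrt(sum(v))
    return score / norm if norm else 0.0


a = {"prices", "taxes"}
b = {"price policies", "direct taxation"}
print(extended_similarity(a, b, w_rt=0.0, w_bt_nt=0.0))  # 0.0: no exact matches
print(extended_similarity(a, b))                         # (0.2 + 0.15) / 2
```

With both weights set to zero the function reduces to the plain cosine over binary vectors, so the two indexers above go from no agreement to partial agreement once the links are counted.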
9 Consistency in a Group of Indexers
- Overall consistency [formula not preserved]
  - Each indexer i is represented by the vector of terms that he or she assigns to document D
- Weights can be chosen to maximize M3'
  - weights: 0.2 and 0.15
- Our indexers achieved M1 = 38%, M3 = 49%, M3' = 51%
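The formula for overall consistency is not preserved. A common way to extend a pairwise measure to a group is to average it over all pairs of indexers; the sketch below does that with Rolling's measure as a stand-in for the pairwise score. Both choices are assumptions, not the deck's definition, and the term sets are made up.

```python
from itertools import combinations


def rolling(terms1, terms2):
    """Rolling's pairwise measure: 2C / (A + B)."""
    a, b = set(terms1), set(terms2)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def group_consistency(term_sets, pairwise=rolling):
    """Mean pairwise consistency over all pairs of indexers for one document."""
    pairs = list(combinations(term_sets, 2))
    if not pairs:
        raise ValueError("need at least two indexers")
    return sum(pairwise(x, y) for x, y in pairs) / len(pairs)


indexers = [
    {"vietnam", "flooding"},
    {"vietnam", "flooding"},
    {"vietnam", "risk management"},
]
print(group_consistency(indexers))  # (1.0 + 0.5 + 0.5) / 3
```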
10 Human Indexing - Summary
- Professional indexing is expensive and time-consuming
- Indexers' consistency is low: under 50%
  - Even professional indexers disagree with each other
  - Indexing by the same person may differ depending on her or his condition that day
- Consistency ? Retrieval Efficiency!
- We do employ humans, because keyphrases are so valuable
  - Organize document collections
  - Provide quick access to documents
  - Are a great search facility!
12 Automatic Indexing
- Existing approaches
- KEA with a Controlled Vocabulary
Keyphrase Extraction: select significant n-grams or NPs (noun phrases) according to their characteristics
- Easy and fast implementation
- Not much training required
- Restriction to syntax
- Low quality phrases
- No consistency
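13 How KEA Works
[Flow chart]
- Input: documents (DOCs) and controlled vocabulary (CV)
- Extract candidates by pseudo-phrase matching, e.g. predatory birds → bird predat (candidates: bird predat, aquacult, fisheri, ...)
- Compute features
- Training?
  - yes: use the manual keyphrases (manual KEYs) to compute the Naïve Bayes model (MODEL)
  - no: use the MODEL to compute probabilities and output automatic keyphrases (automatic KEYs)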
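The candidate extraction step can be sketched as follows: every vocabulary term and every n-gram in the document is reduced to a pseudo-phrase (stopwords removed, words stemmed, stems sorted), and an n-gram becomes a candidate when its pseudo-phrase equals that of a vocabulary term. The stemmer below is a deliberately crude stand-in chosen so that "predatory birds" becomes "bird predat" as on the slide; it is not the stemmer KEA uses, and the stopword list, vocabulary, and function names are hypothetical.

```python
import re

STOPWORDS = {"of", "the", "and", "in", "a", "by", "are", "on"}
SUFFIXES = ("ory", "ors", "or", "s")  # toy stemmer rules, illustration only


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(phrase):
    """Lowercase, drop stopwords, stem, and sort the stems alphabetically."""
    words = re.findall(r"[a-z]+", phrase.lower())
    return " ".join(sorted(crude_stem(w) for w in words if w not in STOPWORDS))


def extract_candidates(text, vocabulary, max_len=3):
    """Return the vocabulary terms whose pseudo-phrase matches some n-gram of the text."""
    index = {pseudo_phrase(term): term for term in vocabulary}
    words = re.findall(r"[a-z]+", text.lower())
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            key = pseudo_phrase(" ".join(words[i:i + n]))
            if key in index:
                found.add(index[key])
    return found


vocabulary = ["predatory birds", "aquaculture", "fisheries"]
print(pseudo_phrase("predatory birds"))  # bird predat
print(extract_candidates("Predatory bird species threaten aquaculture.", vocabulary))
```

Because the stems are sorted, word order and inflection do not matter, so "predatory bird" in the text maps to the vocabulary term "predatory birds".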
14 KEA's Features
- TF×IDF
- First Occurrence: in the beginning/end
- Phrase Length: ? 2 words
- Node Degree: related to other phrases in the doc
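A minimal sketch of how the four features could be computed for one candidate phrase. The definitions follow the way KEA-style features are usually described (TF×IDF as relative frequency times the negative log of document frequency, first occurrence as a relative position, node degree as the number of thesaurus-related candidates in the same document), but the exact formulas, argument names, and example data here are assumptions.

```python
import math


def tfidf(freq_in_doc, doc_length, docs_with_phrase, num_docs):
    """Relative frequency in the document times -log2 of the document frequency ratio."""
    return (freq_in_doc / doc_length) * -math.log2(docs_with_phrase / num_docs)


def first_occurrence(words, phrase_words):
    """Relative position of the first appearance (0.0 = very beginning); 1.0 if absent."""
    n = len(phrase_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == phrase_words:
            return i / len(words)
    return 1.0


def phrase_length(phrase):
    """Number of words in the phrase."""
    return len(phrase.split())


def node_degree(phrase, candidates, related):
    """Number of other candidates in the document that the thesaurus links to `phrase`."""
    linked = related.get(phrase, set())
    return sum(1 for other in candidates if other != phrase and other in linked)


words = "floods damage crops and floods cut roads".split()
candidates = {"floods", "crops", "natural disasters"}
related = {"floods": {"natural disasters"}}  # toy thesaurus links

print(tfidf(freq_in_doc=2, doc_length=len(words), docs_with_phrase=1, num_docs=4))
print(first_occurrence(words, ["crops"]))
print(phrase_length("natural disasters"))
print(node_degree("floods", candidates, related))
```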
15 Evaluation
- 10-fold cross validation on a 200 document set
- Concept matching (Agrovoc links are taken into account), e.g. predatory birds matches noxious birds
- Indexing Consistency
16 Example
The Growing Global Obesity Problem: Some Policy Options to Address It
Exact
  ? 2 Indexers: overweight, food consumption, taxes, developed countries
  KEA: overweight, food consumption, taxes, developed countries
Similar
  ? 2 Indexers: prices, price policies, diets, fiscal policies, nutrition policies
  KEA: price fixing, controlled prices, body weight
No match
  ? 2 Indexers: feeding habits, food intake, nutritional requirements
  KEA: saturated fat
17 The Global Obesity Problem
[Figure: the Agrovoc terms from slide 4, shown against Indexers 1 to 6 and KEA. Terms that appear here but not on slide 4: body weight, price fixing, controlled prices, saturated fat, policies]
19 Natural Disasters in Vietnam
social institutions
government
nongovernmental organizations
vietnam
lowland
public services
impact assessment
dry season
water levels
local government
planning
risk
high water
case studies
storms
rural communities
risk management
cyclones
flooding
management
natural disasters
early warning systems
sustainability
rural development
community involvement
living standards
disasters
weather hazards
emergency relief
sustainable development
damage
capacity building
international cooperation
20 Lexical Chains
- Lexical chains are sequences of related words
  - cow, sheep, wool, scarf, hat
- Used to identify cohesive structure in a text, its discourse flow and topic areas
- Computable with any thesaurus containing semantic links between words
- In NLP
  - used for text segmentation, text summarization, question answering, word sense disambiguation...
  - but not yet for keyphrase indexing
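Given semantic links from a thesaurus, related words found in a text can be grouped by following those links. The sketch below uses the simplest possible grouping, connected components of the link graph, which ignores the order in which words occur in the text. The links are made up for the slide's example words, and `build_chains` is a hypothetical name, not part of any existing tool.

```python
def build_chains(terms, links):
    """Group terms into connected components of the (undirected) link graph."""
    neighbours = {term: set() for term in terms}
    for a, b in links:
        if a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)

    chains, seen = [], set()
    for start in terms:
        if start in seen:
            continue
        chain, stack = [], [start]
        seen.add(start)
        while stack:  # depth-first walk over linked terms
            term = stack.pop()
            chain.append(term)
            for other in sorted(neighbours[term]):
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        chains.append(sorted(chain))
    return chains


terms = ["cow", "sheep", "wool", "scarf", "hat", "river"]
links = [("cow", "sheep"), ("sheep", "wool"), ("wool", "scarf"), ("scarf", "hat")]  # toy links
print(build_chains(terms, links))
```

Terms without any link to another term in the text, such as "river" here, end up as chains of length one.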
21 How a Lexical Chain is Created
1.1.4 Impact of Natural Disasters in the Highlands of Central Vietnam
The most damaging hazard experienced in the highlands is
flashflood, as it occurs with little warning.
People, property and livestock may be washed
away. Crops planted on the hillsides are better
protected than staple crops in river valleys,
such as cassava, on which poorer farmers rely
between paddy harvests. High dependence on
subsistence farming renders highland populations
vulnerable to hunger during the flood season.
Floods from swollen rivers can cut off villages
for days or weeks, which could result in
food shortages. Floods with strong currents cause
permanent damage to fields, washing away the
topsoil. Floodwaters also deposit rock and gravel
onto fields. Heavy rains trigger landslides
that cut off roads and communication networks.
Lexical chain: natural disaster, hazard, flashflood, strong currents, wash away, swollen rivers, ?
22 How a Lexical Chain is Created
[Same passage as on slide 21]
Lexical net: flood, natural disaster, flashflood
23 How a Lexical Chain is Created
[Same passage as on slide 21]
More Information Required!
24 Weighting Lexical Chains
- e.g. Barzilay & Elhadad (97)
  - Score(chain) = all occurrences − distinct occurrences
  - floods (3), natural disasters (1), landslides (1): score 2
  - crops (2), plant (1): score 1
- Other known schemes: weighting relations, graph structure analysis
- Alternative: use multiple-indexers data to learn the best scoring function
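Barzilay and Elhadad score a chain as its length times a homogeneity index, where length is the number of all occurrences and homogeneity is 1 minus the ratio of distinct members to length; the product simplifies to all occurrences minus distinct occurrences. The sketch below applies this to the slide's example. Splitting the five terms into a flood chain and a crop chain is an assumption, chosen because it reproduces the scores 2 and 1 shown on the slide.

```python
def chain_score(occurrences):
    """Score a chain given a mapping from chain member to its occurrence count.

    length * (1 - distinct / length) simplifies to length - distinct.
    """
    length = sum(occurrences.values())  # all occurrences
    distinct = len(occurrences)         # distinct members
    return length - distinct


flood_chain = {"floods": 3, "natural disasters": 1, "landslides": 1}
crop_chain = {"crops": 2, "plant": 1}

print(chain_score(flood_chain))  # 5 - 3 = 2
print(chain_score(crop_chain))   # 3 - 2 = 1
```

A chain whose members each occur once scores zero under this scheme, so repetition is what makes a chain count.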
25 Top 6 Lexical Nets and Human Indexers' Terms
Lexical net terms:
management, risk management, insurance, risk, planning, early warning systems, disasters, damage, risk assessment, investment, entrepreneurship, forecasting, monitoring
Human indexers' terms (with counts):
- 6 vietnam
- 5 natural disasters
- 5 flooding
- 5 sustainable development
- 4 emergency relief
- 4 risk management
- 3 local government
- ...
- 2 planning
- ...
- 1 disasters
- 1 management
- 1 risk
Lexical net terms:
laos, vietnam, marshes, natural disasters, flooding, earthquakes, flooded land, fires, landslides
27 Lexical Chains - Summary
- Lexical chains are practical and useful
- Goal: inclusion in KEA
  - for candidate identification
  - for calculating the features
- Decisions to be made
  - Chains or nets?
  - Over the whole text or its sequences?
  - What additional lexical sources?
  - What scoring function?
  - Should non-occurring words be included?
- Multiple-indexers data can be helpful; however, it is difficult to acquire
28 Agenda
- Manual Indexing
  - Experiment with 6 professionals
  - Inter-indexer consistency
- Automatic Indexing
  - Existing approaches
  - How KEA works
  - Evaluation & Examples
- Lexical Chains
  - Extraction from text
  - Weighting
Consistency is important, but difficult to achieve for humans
Easy for computers! KEA's consistency with humans is high
Human indexing is subjective
An algorithm learns from humans; results are always justified
New promising techniques; other domains & languages
KEA
Will computers index better than humans?