Keyphrase Indexing with Controlled Vocabularies - PowerPoint PPT Presentation

1
Keyphrase Indexing with Controlled Vocabularies: Will Computers Index Better than Humans?
Olena Medelyan
Department of Computer Science
The University of Waikato, New Zealand
Agenda
  • Manual Indexing
    • Experiment with 6 professionals
    • Inter-indexer consistency
  • Automatic Indexing
    • Existing approaches
    • How KEA works
    • Evaluation
  • Lexical Chains
    • Extraction from text
    • Weighting

Thanks
2
Indexing with a Controlled Vocabulary: An Experiment
  • 10 documents related to agricultural topics
  • 6 professional indexers from FAO (United Nations Food and Agriculture Organization)
  • FAO's domain-specific thesaurus: Agrovoc
    • 17,000 descriptors, i.e. allowed index terms
    • 11,000 non-descriptors that are linked to descriptors, e.g. Obesity → Overweight

3
The Global Obesity Problem
Agrovoc terms
4
The Global Obesity Problem
Agrovoc terms
energy value
public health
nutritional disorders
regulations
weight reduction
nutrient excesses
developing countries
disease control
nutritional requirements
diet
dietary guidelines
nutrition status
nutrition programs
developed countries
feeding habits
meal patterns
nutrition surveillance
overweight
food policies
nutritional physiology
price formation
food intake
overeating
human nutrition
nutrition policies
price policies
foods
food consumption
fiscal policies
prices
direct taxation
urbanization
globalization
taxes
5
The Global Obesity Problem
Agrovoc terms (columns: Indexers 1 2 3 4 5 6)
energy value
public health
nutritional disorders
regulations
weight reduction
nutrient excesses
developing countries
disease control
nutritional requirements
diet
dietary guidelines
nutrition status
nutrition programs
developed countries
feeding habits
meal patterns
nutrition surveillance
overweight
food policies
nutritional physiology
price formation
food intake
overeating
human nutrition
nutrition policies
price policies
foods
food consumption
fiscal policies
prices
direct taxation
urbanization
globalization
taxes
6
Inter-Indexer Consistency
  • Measuring consistency in library catalogues
  • Conceptual consistency > terminological consistency
  • 2 out of 3 non-matching terms match conceptually
  • non-matching terms: 28, non-matching concepts: 10
  • New measure M3: consider semantic relations between terms!

[Figure: term sets of Indexer 1 and Indexer 2, with regions labeled A, C, B]
7
Inter-Indexer Consistency
  • Set of index terms as a vector
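
As a rough illustration of the vector view, the sketch below encodes each indexer's term set as a 0/1 vector over a shared vocabulary and computes an overlap score. The formula used, 2C / (A + B) (Rolling's measure, with C the number of shared terms and A, B the sizes of the two term sets), is an assumption about what the A, C, B labels on the previous slide denote; the slide's own formula is not preserved here. The function names and sample term sets are hypothetical.

```python
from typing import List, Set


def to_vector(terms: Set[str], vocabulary: List[str]) -> List[int]:
    """Encode a set of index terms as a 0/1 vector over a fixed vocabulary."""
    return [1 if term in terms else 0 for term in vocabulary]


def rolling_consistency(a: List[int], b: List[int]) -> float:
    """Rolling's measure 2C / (A + B) computed on 0/1 vectors.

    C is the dot product (terms chosen by both indexers); A and B are the
    numbers of terms chosen by each indexer.
    """
    common = sum(x * y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 0.0


# Hypothetical term sets for two indexers (illustrative only).
indexer1 = {"overweight", "taxes", "food consumption"}
indexer2 = {"overweight", "taxes", "prices", "diet"}
vocabulary = sorted(indexer1 | indexer2)

v1 = to_vector(indexer1, vocabulary)
v2 = to_vector(indexer2, vocabulary)
score = rolling_consistency(v1, v2)
print(round(score, 3))
```

With two shared terms out of 3 + 4 assigned terms, the score is 4/7; identical term sets score 1.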

8
Integrating Semantic Relations
  • Create two matrices
  • RT links
  • BT/NT links
  • Modified measure
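
One way the two link matrices could enter a consistency measure is sketched below. Everything specific here is an assumption, since the slide's formula is not preserved: each indexer's vector is expanded by adding a fraction `alpha` of the weight of RT-linked terms and a fraction `beta` of BT/NT-linked terms, and cosine similarity stands in for the actual measure. The values 0.2 and 0.15 are the two weights quoted on the next slide; which weight belongs to which link type is a guess. The toy vocabulary and the RT link between "prices" and "price policies" are hypothetical.

```python
from math import sqrt
from typing import List

Matrix = List[List[int]]


def expand(vec: List[float], rt: Matrix, btnt: Matrix,
           alpha: float, beta: float) -> List[float]:
    """Spread weight from each assigned term to its thesaurus neighbours."""
    n = len(vec)
    out = list(vec)
    for i in range(n):
        for j in range(n):
            out[i] += (alpha * rt[i][j] + beta * btnt[i][j]) * vec[j]
    return out


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, used here only as a stand-in measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vocabulary: 0 = prices, 1 = price policies, 2 = taxes.
# Hypothetical links: prices RT price policies; no BT/NT links.
rt = [[0, 1, 0],
      [1, 0, 0],
      [0, 0, 0]]
btnt = [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]

indexer1 = [1.0, 0.0, 1.0]  # prices, taxes
indexer2 = [0.0, 1.0, 1.0]  # price policies, taxes

plain = cosine(indexer1, indexer2)
semantic = cosine(expand(indexer1, rt, btnt, 0.2, 0.15),
                  expand(indexer2, rt, btnt, 0.2, 0.15))
print(plain, semantic)
```

In this toy case the two indexers share only one exact term, so the plain score is 0.5; counting the related pair raises the score, which is the effect a semantically aware measure is meant to capture.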

9
Consistency in a Group of Indexers
  • Overall consistency
  • is the vector of terms that
    indexer i assigns to document D
  • Weights can be chosen to maximize M3'
  • ? 0.2, ? 0.15
  • Our indexers achieved: M1 = 38%, M3 = 49%, M3' = 51%

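
A minimal sketch of a group-level score, assuming overall consistency is the average of the pairwise scores over all pairs of indexers (the slide's formula is not preserved, so this averaging is an assumption). The pairwise score is the overlap measure 2C / (A + B); the indexer names and term sets are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set


def pair_consistency(a: Set[str], b: Set[str]) -> float:
    """Overlap-based consistency 2C / (A + B) between two term sets."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def group_consistency(assignments: Dict[str, Set[str]]) -> float:
    """Average pairwise consistency over all pairs of indexers."""
    pairs = list(combinations(assignments.values(), 2))
    if not pairs:
        return 0.0
    return sum(pair_consistency(a, b) for a, b in pairs) / len(pairs)


# Hypothetical assignments for one document by three indexers.
assignments = {
    "indexer1": {"overweight", "taxes"},
    "indexer2": {"overweight", "prices"},
    "indexer3": {"overweight", "taxes", "diet"},
}
overall = group_consistency(assignments)
print(round(overall, 3))
```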
10
Human Indexing - Summary
  • Professional indexing is expensive and time-consuming
  • Consistency among indexers is low: under 50%
  • Even professional indexers disagree with each other
  • Indexing by the same person may differ depending on how he or she feels that day
  • Consistency → Retrieval Efficiency!
  • We do employ humans, because keyphrases are so valuable:
    • They organize document collections
    • They provide quick access to documents
    • They are a great search facility!

11
Agenda
  • Manual Indexing
    • Experiment with 6 professionals
    • Inter-indexer consistency
  • Automatic Indexing
    • Existing approaches
    • How KEA works
    • Evaluation and Examples
  • Lexical Chains
    • Extraction from text
    • Weighting

12
Automatic Indexing
  • Existing approaches
  • KEA, and KEA with a Controlled Vocabulary

Keyphrase Extraction: select significant n-grams or NPs according to their characteristics
  • Easy and fast implementation
  • Not much training required
  • Restriction to syntax
  • Low quality phrases
  • No consistency

13
How KEA Works
[Flowchart]
Inputs: DOCs and CV (controlled vocabulary)
  • Extract candidates: pseudo-phrase matching, e.g. predatory birds → bird predat (bird predat aquacult fisheri ...)
  • Compute features
  • Training?
    • yes: manual KEYs → compute model (Naïve Bayes) → MODEL
    • no: MODEL → compute probabilities → automatic KEYs
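
The pseudo-phrase step can be illustrated as follows: lowercase the phrase, drop stopwords, stem each word, and sort the stems, so that word order and inflection no longer matter when matching document n-grams against vocabulary terms. This is a sketch only. The tiny suffix list and stopword set are toy stand-ins chosen to reproduce the slide's example, not KEA's actual stemmer or stopword list, and the function names are hypothetical.

```python
import re

# Toy stand-ins: a real system would use a proper stemmer and stopword list.
STOPWORDS = {"of", "the", "and", "in", "for"}
SUFFIXES = ("ory", "ure", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(phrase: str) -> str:
    """Lowercase, drop stopwords, stem, and sort the remaining words."""
    words = re.findall(r"[a-z]+", phrase.lower())
    stems = [toy_stem(w) for w in words if w not in STOPWORDS]
    return " ".join(sorted(stems))


def match_candidates(ngrams, vocabulary):
    """Map document n-grams to vocabulary terms with the same pseudo-phrase."""
    index = {pseudo_phrase(term): term for term in vocabulary}
    return {ng: index[pseudo_phrase(ng)] for ng in ngrams
            if pseudo_phrase(ng) in index}


print(pseudo_phrase("predatory birds"))
print(match_candidates(["birds predatory", "the paddy"],
                       ["predatory birds", "aquaculture"]))
```

Because both the n-gram and the vocabulary term are normalized the same way, "birds predatory" in a document maps to the descriptor "predatory birds", while n-grams with no vocabulary counterpart are discarded.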
14
KEA's Features
  • TF×IDF
  • First Occurrence: in the beginning/end
  • Phrase Length ? 2 words
  • Node Degree: how closely a phrase is related to other phrases in the doc
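
The four features could be computed along the following lines. The definitions are common textbook forms and are assumptions here, not taken from the slides: TF×IDF as relative frequency times negative log document frequency, first occurrence as a relative position, length in words, and node degree as the number of thesaurus links from a candidate to other candidates in the same document. All function names and sample values are hypothetical.

```python
import math
from typing import Dict, Set


def tf_idf(count_in_doc: int, doc_length: int,
           docs_with_phrase: int, num_docs: int) -> float:
    """Relative frequency in the document times -log2 of document frequency."""
    tf = count_in_doc / doc_length
    idf = -math.log2(docs_with_phrase / num_docs)
    return tf * idf


def first_occurrence(position: int, doc_length: int) -> float:
    """Relative position of the first occurrence (0 = start, 1 = end)."""
    return position / doc_length


def phrase_length(phrase: str) -> int:
    """Length of the phrase in words."""
    return len(phrase.split())


def node_degree(phrase: str, candidates: Set[str],
                links: Dict[str, Set[str]]) -> int:
    """Number of other candidates in the document linked to this phrase."""
    return len(links.get(phrase, set()) & (candidates - {phrase}))


# Hypothetical values for one candidate phrase.
candidates = {"flooding", "natural disasters", "taxes"}
links = {"flooding": {"natural disasters", "storms"}}

features = {
    "tfidf": tf_idf(4, 1000, 10, 200),
    "first": first_occurrence(50, 1000),
    "length": phrase_length("natural disasters"),
    "degree": node_degree("flooding", candidates, links),
}
print(features)
```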

15
Evaluation
  • 10-fold cross validation on a 200 document set
  • Concept matching (Agrovoc links are taken into account), e.g. predatory birds → noxious birds
  • Indexing Consistency
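
Concept matching can be sketched as a three-way classification of each automatically assigned term against the human-assigned terms: exact match, similar (connected by a thesaurus link), or no match. The link table below is a hypothetical stand-in for Agrovoc; only the predatory birds / noxious birds pair comes from the slide, and treating links as symmetric is an assumption.

```python
from typing import Dict, Set


def classify(term: str, reference: Set[str],
             links: Dict[str, Set[str]]) -> str:
    """Classify a term as 'exact', 'similar', or 'no match' against a reference set."""
    if term in reference:
        return "exact"
    for ref in reference:
        # Treat links as symmetric: check both directions.
        if ref in links.get(term, set()) or term in links.get(ref, set()):
            return "similar"
    return "no match"


# Hypothetical link table standing in for the thesaurus.
links = {"predatory birds": {"noxious birds"}}

human_terms = {"predatory birds", "overweight"}
automatic_terms = ["overweight", "noxious birds", "saturated fat"]

result = {term: classify(term, human_terms, links) for term in automatic_terms}
print(result)
```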

16
Example
The Growing Global Obesity Problem: Some Policy Options to Address It

            ? 2 Indexers               KEA
Exact       overweight                 overweight
            food consumption           food consumption
            taxes                      taxes
            developed countries        developed countries
Similar     prices                     price fixing
            price policies             controlled prices
            diets                      body weight
            fiscal policies
            nutrition policies
No match    feeding habits             saturated fat
            food intake
            nutritional requirements
17
The Global Obesity Problem
Agrovoc terms (columns: Indexers 1 2 3 4 5 6, KEA)
energy value
public health
nutritional disorders
regulations
weight reduction
nutrient excesses
developing countries
disease control
nutritional requirements
diet
dietary guidelines
nutrition status
nutrition programs
developed countries
body weight
feeding habits
meal patterns
nutrition surveillance
overweight
food policies
price fixing
nutritional physiology
price formation
controlled prices
saturated fat
food intake
overeating
human nutrition
nutrition policies
price policies
foods
food consumption
fiscal policies
policies
prices
direct taxation
urbanization
globalization
taxes
18
Agenda
  • Manual Indexing
    • Experiment with 6 professionals
    • Inter-indexer consistency
  • Automatic Indexing
    • Existing approaches
    • How KEA works
    • Evaluation and Examples
  • Lexical Chains
    • Extraction from text
    • Weighting

19
Natural Disasters in Vietnam
social institutions
government
nongovernmental organizations
vietnam
lowland
public services
impact assessment
dry season
water levels
local government
planning
risk
high water
case studies
storms
rural communities
risk management
cyclones
flooding
management
natural disasters
early warning systems
sustainability
rural development
community involvement
living standards
disasters
weather hazards
emergency relief
sustainable development
damage
capacity building
international cooperation
20
Lexical Chains
  • Lexical chains are sequences of related words
    • e.g. cow, sheep, wool, scarf, hat
  • Used to identify cohesive structure in a text, its discourse flow and topic areas
  • Computable with any thesaurus containing semantic links between words
  • In NLP
    • used for text segmentation, text summarization, question answering, word sense disambiguation...
    • but not yet for keyphrase indexing
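
A minimal sketch of building chains from thesaurus links: terms found in a text are grouped whenever they are connected, directly or through other found terms, by a semantic link. The link table is hypothetical and uses the slide's example words. Grouping by connectivity yields unordered groups (closer to a net than a strictly linear chain), which is a simplification.

```python
from typing import Dict, List, Set


def build_chains(found_terms: List[str],
                 links: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group found terms into sets of directly or indirectly linked terms."""
    remaining = set(found_terms)
    chains = []
    while remaining:
        start = min(remaining)  # deterministic choice of a starting term
        remaining.remove(start)
        chain = {start}
        stack = [start]
        while stack:
            term = stack.pop()
            for other in list(remaining):
                linked = (other in links.get(term, set())
                          or term in links.get(other, set()))
                if linked:
                    remaining.remove(other)
                    chain.add(other)
                    stack.append(other)
        chains.append(chain)
    return chains


# Hypothetical thesaurus links between the example words.
links = {
    "cow": {"sheep"},
    "sheep": {"wool"},
    "wool": {"scarf"},
    "scarf": {"hat"},
}
found = ["cow", "wool", "sheep", "hat", "flood"]
chains = build_chains(found, links)
print(chains)
```

Here "hat" stays on its own because "scarf", the only term linking it to the others, does not occur in the text; whether such non-occurring words should bridge chains is a separate design decision.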

21
How a Lexical Chain is Created
1.1.4 Impact of Natural Disasters in the Highlands of Central Vietnam
The most damaging
hazard experienced in the highlands is
flashflood, as it occurs with little warning.
People, property and livestock may be washed
away. Crops planted on the hillsides are better
protected than staple crops in river valleys,
such as cassava, on which poorer farmers rely
between paddy harvests. High dependence on
subsistence farming renders highland populations
vulnerable to hunger during the flood season.
Floods from swollen rivers can cut off villages
for days or weeks, which could result in
food shortages. Floods with strong currents cause
permanent damage to fields, washing away the
topsoil. Floodwaters also deposit rock and gravel
onto fields. Heavy rains trigger landslides
that cut off roads and communication networks.
natural disaster
hazard
flashflood
strong currents
wash away
swollen rivers
?
22
How a Lexical Chain is Created
flood
natural disaster
flashflood
lexical net
23
How a Lexical Chain is Created
More Information Required!
24
Weighting Lexical Chains
  • e.g. Barzilay & Elhadad (97)
  • Other known schemes: weighting relations, graph structure analysis
  • Alternative

Score(chain) = all occurrences - distinct occurrences
floods(3)
natural disasters(1)
crops(2)
plant(1)
landslides(1)
score 2
score 1
Use multiple-indexers data to learn the best scoring function
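
The scoring can be checked with a few lines of code, reading the score as all occurrences minus distinct occurrences. The grouping of the listed terms into two chains below is an assumption (the slide layout is not preserved); under that grouping the function reproduces the scores 2 and 1 shown on the slide.

```python
from typing import Dict


def chain_score(chain: Dict[str, int]) -> int:
    """Score = total occurrences of all chain members minus number of distinct members."""
    return sum(chain.values()) - len(chain)


# Assumed grouping of the slide's terms into chains (term -> occurrence count).
chain_a = {"floods": 3, "natural disasters": 1}
chain_b = {"crops": 2, "plant": 1}

print(chain_score(chain_a), chain_score(chain_b))
```

A chain whose members each occur once scores 0, so this score rewards repetition of the same concept rather than sheer chain size.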
25
Top 6 Lexical Nets and Human Indexers' Terms
management
risk management
insurance
risk
planning
early warning systems
disasters
damage
risk assessment
investment
entrepreneurship
forecasting
monitoring
  • 6 vietnam
  • 5 natural disasters
  • 5 flooding
  • 5 sustainable development
  • 4 emergency relief
  • 4 risk management
  • 3 local government
  • ...
  • 2 planning
  • ...
  • 1 disasters
  • 1 management
  • 1 risk

laos
vietnam
marshes
natural disasters
flooding
earthquakes
flooded land
fires
landslides
26
Lexical Chains - Summary
27
Lexical Chains - Summary
  • Lexical chains are practical and useful
  • Goal: inclusion in KEA
    • for candidate identification
    • for calculating the features
  • Decisions to be made:
    • Chains or nets?
    • Over the whole text or its sequences?
    • What additional lexical sources?
    • What scoring function?
    • Should non-occurring words be included?
  • Multiple-indexers data can be helpful; however, it is difficult to acquire

28
Agenda
Consistency is important. Difficult to achieve for humans.
  • Manual Indexing
    • Experiment with 6 professionals
    • Inter-indexer consistency
  • Automatic Indexing
    • Existing approaches
    • How KEA works
    • Evaluation and Examples
  • Lexical Chains
    • Extraction from text
    • Weighting

Easy for computers! KEA's consistency with humans is high.
Human indexing is subjective.
An algorithm learns from humans. Results are always justified.
New promising techniques. Other domains and languages.
KEA
Will computers index better than humans?