Automatic Keyphrase Extraction

1
Automatic Keyphrase Extraction

(Jim Nuyens)
Keywords are an everyday part of looking up
topics and specific content. What are some of the
ways of obtaining keywords/keyphrases by machine
learning?
This presentation reviews some of the work of
Peter Turney of the NRC (National Research
Council of Canada).
The papers are dated 1997 and 1999, so more recent
developments in data mining may suggest
improvements.

2
Automated Keyphrase Extraction

Introduction and definitions
Applications and algorithms
Learning algorithms
Empirical results
Future work

3
Definitions

Information extraction: text analysis in this
domain serves to provide user-anticipated
information. (Ex.: the names of companies in news
service reports.)
Index generation: an index may be created as a
back-of-the-book listing for human use or as an
exhaustive computer listing used by a search
engine.
Important phrase extraction: may be used
especially with scientific journals.
Keyphrase: a phrase of one to three words that
captures the main topic.
Keyphrase list: usually 5 to 15 keyphrases.
Keyphrase generation: obtaining the keyphrases,
some of which are not available in the body of
the text document.
Keyphrase extraction: obtaining the keyphrases
which are available in the body of the text
document.
Note: On average, about 75% of the keyphrases
appear in the text.

4
Applications and Algorithms

Keyphrases may serve as a mini-summary
Partial indexing
Automated keyphrases can help an author with some
keywords or phrases he may have missed
Labels for text documents
Providing highlights for a document.
Algorithms of concern: stemming of words
Word         Porter Stem   Lovins Stem
believes     believ        belief
belief       belief        belief
believable   believ        belief
Turney finds the more aggressive Lovins stemming
algorithm to be more useful for keyword
extraction.
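A minimal sketch of why the more aggressive stemmer can help. It does not run either stemming algorithm; the stems are simply copied from the table above, and the helper name group_by_stem is hypothetical. It shows that the Lovins stems collapse all three word forms into one candidate, while the Porter stems leave two.

```python
from collections import defaultdict

# Stems copied from the slide's table; no stemmer is executed here.
PORTER = {"believes": "believ", "belief": "belief", "believable": "believ"}
LOVINS = {"believes": "belief", "belief": "belief", "believable": "belief"}


def group_by_stem(words, stem_table):
    """Group word forms that share the same stem (hypothetical helper)."""
    groups = defaultdict(list)
    for word in words:
        groups[stem_table[word]].append(word)
    return dict(groups)


words = ["believes", "belief", "believable"]
porter_groups = group_by_stem(words, PORTER)
lovins_groups = group_by_stem(words, LOVINS)

# Number of distinct candidate stems under each table.
print(len(porter_groups), len(lovins_groups))  # 2 1
```

With fewer distinct stems, the occurrences of related word forms are counted together, which is one plausible reading of why the aggressive stemmer is preferred here.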

5
Applications and Algorithms (continued)

"stone church" is not equal to "church stone"
"neural networks" is equal to "neural network"
Sometimes the stemming algorithm does not get it
correct:
Word        Porter Stem   Lovins Stem
realistic   realist       real
reality     realiti       re
Both the Porter and Lovins stemming algorithms
see the two words as distinct.
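The phrase examples at the top of this slide can be sketched as an order-sensitive comparison of stem sequences. The function toy_stem below is a hypothetical stand-in that only strips a plural "s"; it is neither the Porter nor the Lovins algorithm.

```python
def toy_stem(word):
    """Toy stand-in for a real stemmer: lowercase and strip a plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def phrase_key(phrase):
    """Represent a phrase as an ordered tuple of stems, so word order matters."""
    return tuple(toy_stem(word) for word in phrase.split())


same = phrase_key("neural networks") == phrase_key("neural network")
reordered = phrase_key("stone church") == phrase_key("church stone")
print(same, reordered)  # True False
```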

6
Measuring performance of the algorithms

Confusion matrix for keyphrase extraction
                                        Human classified   Human classified
                                        as a keyphrase     as NOT a keyphrase
Machine classified as a keyphrase       a                  b
Machine classified as NOT a keyphrase   c                  d

Example (for a total of 2500 stemmed words)
                                        Human classified   Human classified
                                        as a keyphrase     as NOT a keyphrase
Machine classified as a keyphrase       4                  3
Machine classified as NOT a keyphrase   2                  2500 - 9 = 2491

7
Measuring performance (continued)

Accuracy = (a + d) / (a + b + c + d) = 2495/2500
Precision = a / (a + b) = 4/7
Recall = a / (a + c) = 4/6
The F-measure is used as a balanced measure.
F-measure = (2a) / (2a + b + c) = 8/13
A journal article will typically contain 10,000
words, and these will narrow down to approximately
2500 stemmed word equivalents. Out of the average
7.5 keyphrases used, only 6 keyphrases are
available in the text for extraction.
With about 6 positive cases among roughly 2500
candidates, class imbalance does present machine
learning difficulties.
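A small sketch that recomputes the four measures from the confusion-matrix counts on the previous slide (a = 4, b = 3, c = 2, d = 2500 - 9). The function name keyphrase_metrics is hypothetical; exact fractions are used so the results can be compared directly with the slide.

```python
from fractions import Fraction


def keyphrase_metrics(a, b, c, d):
    """Compute the slide's measures from confusion-matrix counts.

    a: machine and human both say keyphrase
    b: machine says keyphrase, human does not
    c: human says keyphrase, machine does not
    d: neither says keyphrase
    """
    total = a + b + c + d
    return {
        "accuracy": Fraction(a + d, total),
        "precision": Fraction(a, a + b),
        "recall": Fraction(a, a + c),
        "f_measure": Fraction(2 * a, 2 * a + b + c),
    }


metrics = keyphrase_metrics(4, 3, 2, 2500 - 9)
for name, value in metrics.items():
    print(f"{name}: {value} = {float(value):.3f}")

# Accuracy of a system that never outputs a keyphrase, assuming the
# 6 extractable keyphrases among 2500 stemmed words mentioned above.
never_accuracy = Fraction(2500 - 6, 2500)
print(f"never-predict accuracy: {float(never_accuracy):.4f}")
```

Accuracy is close to 1 even for a system that outputs nothing, which is why precision, recall, and the F-measure are the more informative measures under this class imbalance.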

8
Empirical Results (Turney,1997)

Method of            F-measure   F-measure   F-measure
Extraction           (text 1)    (text 2)    (text 3)
Microsoft Word       0.154       0.000       0.545
Brill's Tagger       0.118       0.077       0.400
Verity's Search 97   0.462       0.000       0.429
NRC's Extractor      0.400       0.080       0.462
The above results are from journal articles. Text
2 was a very difficult scientific article.
The author also obtained good results (F-measures)
for extraction of e-mail and Web-page
keyphrases.

9
Machine Learning Results (Turney,1999)

The next paper deals with the different possible
approaches to automatic keyphrase extraction.
Part I: The use of C4.5 software to find the
keyphrases. (Features are provided for the
phrases to determine positive and negative
cases.)
Part II: The use of the GenEx algorithm, which
is the combination of the Genitor genetic
algorithm (Whitley, 1989) and the Extractor
algorithm (NRC).
Part I
The author went through 110 features before
settling on:
1) stemmed_phrase
2) whole_phrase
3) num_words_phrase
4) first_occur_phrase
5) first_occur_word
6) freq_phrase
7) freq_word
8) relative_length
9) proper_noun
10) final_adjective
11) common_verb
12) class
Class 1 is an extracted keyphrase and Class 0 is
NOT a keyphrase.
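A hedged sketch of how three of the listed features might be computed for one candidate phrase. The slide gives only the feature names; the definitions used here (position and frequency normalized by the number of words in the document) are assumptions, and the function phrase_features is hypothetical.

```python
def phrase_features(tokens, phrase):
    """Compute a few illustrative features for a candidate phrase.

    tokens: the document as a list of (already stemmed) words
    phrase: the candidate as a tuple of (already stemmed) words
    Normalizing by document length is an assumption of this sketch.
    """
    n = len(phrase)
    positions = [
        i for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) == phrase
    ]
    if not positions:
        raise ValueError("phrase does not occur in the document")
    return {
        "num_words_phrase": n,
        "first_occur_phrase": positions[0] / len(tokens),
        "freq_phrase": len(positions) / len(tokens),
    }


tokens = "neural network train fast neural network learn".split()
features = phrase_features(tokens, ("neural", "network"))
print(features)
```

Each candidate phrase, described by such a feature vector plus its class (1 or 0), would form one training case for a decision-tree learner such as C4.5.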

10
NRC's Extractor

Ten steps:
1) Find single stems (stemming algorithm)
2) Score single stems
3) Select top single stems
4) Find stem phrases (phrases up to length 3)
5) Score stem phrases
6) Expand single stems
7) Drop duplicates
8) Add suffixes
9) Add capitals
10) Final output
Summary: Extractor is the NRC software that
takes text as the input and produces keyphrases
as the output.

11
NRC's Extractor (continued)

The tests (within the algorithm):
1) The phrase should not have the capitalization
of a proper noun, unless the flag suppress_proper
is set to zero.
2) The phrase should not have an ending that
indicates a possible adjective.
3) The phrase should be longer than the
min_length_low_rank
4) If the phrase is shorter than
min_length_low_rank it may still be acceptable
5) If the phrase fails both tests 3) and 4), it may
still be acceptable if its capitalization
indicates that it is probably an abbreviation.
6) The phrase should not contain any words
commonly used as verbs.
7) The phrase should not match any phrase
Lastly, a phrase must pass tests 1), 2), 6), and
7), and at least one of 3), 4), and 5).
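The closing rule on this slide is a simple boolean combination, sketched below. The individual tests are not implemented; their outcomes are passed in as a dictionary keyed by test number, and the function name passes_extractor_tests is hypothetical.

```python
def passes_extractor_tests(results):
    """Combine the seven test outcomes as described on the slide.

    results: dict mapping test number (1..7) to True (passed) or False.
    A phrase must pass tests 1, 2, 6 and 7, and at least one of 3, 4, 5.
    """
    required = all(results[i] for i in (1, 2, 6, 7))
    alternatives = any(results[i] for i in (3, 4, 5))
    return required and alternatives


# A short phrase that fails the length tests but looks like an abbreviation.
abbreviation = {1: True, 2: True, 3: False, 4: False, 5: True, 6: True, 7: True}
# A long phrase that contains a common verb.
with_verb = {1: True, 2: True, 3: True, 4: False, 5: False, 6: False, 7: True}

print(passes_extractor_tests(abbreviation))  # True
print(passes_extractor_tests(with_verb))     # False
```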

12
NRC's Extractor (continued)

Twelve parameters (used with Extractor and
Genitor)
Parameter Name        Range          Number of bits
Num_phrases           5 to 15        0
Num_working           15 to 75       0
Factor_two_one        1.0 to 3.0     8
Factor_three_one      1.0 to 5.0     8
Min_length_low_rank   0.3 to 3.0     8
Min_rank_low_length   1 to 20        5
First_low_thresh      1 to 1000      10
First_high_thresh     1 to 4000      12
First_low_factor      1.0 to 15.0    8
First_high_factor     0.01 to 1.0    8
Stem_length           1 to 10        4
Suppress_proper       0 to 1         1
Total number of bits is 72 (a 72-bit
binary string).
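A sketch of how a 72-bit string could be decoded into the ten tuned parameters in the table above. The slide gives only ranges and bit widths; the field order, the linear mapping of each bit field onto its range, and the rounding of integer parameters are all assumptions. The two zero-bit parameters are not encoded and would be set separately. Names are lowercased here, and decode_parameters is a hypothetical helper.

```python
# (name, low, high, bits, type) taken from the slide's table.
FIELDS = [
    ("factor_two_one", 1.0, 3.0, 8, float),
    ("factor_three_one", 1.0, 5.0, 8, float),
    ("min_length_low_rank", 0.3, 3.0, 8, float),
    ("min_rank_low_length", 1, 20, 5, int),
    ("first_low_thresh", 1, 1000, 10, int),
    ("first_high_thresh", 1, 4000, 12, int),
    ("first_low_factor", 1.0, 15.0, 8, float),
    ("first_high_factor", 0.01, 1.0, 8, float),
    ("stem_length", 1, 10, 4, int),
    ("suppress_proper", 0, 1, 1, int),
]
TOTAL_BITS = sum(bits for _, _, _, bits, _ in FIELDS)
assert TOTAL_BITS == 72


def decode_parameters(bitstring):
    """Map a 72-character string of '0'/'1' onto parameter values.

    Assumption: each field is an unsigned integer scaled linearly
    between its low and high bound.
    """
    if len(bitstring) != TOTAL_BITS or set(bitstring) - {"0", "1"}:
        raise ValueError("expected a 72-character string of 0s and 1s")
    params, pos = {}, 0
    for name, low, high, bits, kind in FIELDS:
        raw = int(bitstring[pos:pos + bits], 2)
        pos += bits
        value = low + (high - low) * raw / (2 ** bits - 1)
        params[name] = round(value) if kind is int else value
    return params


lowest = decode_parameters("0" * 72)
highest = decode_parameters("1" * 72)
print(lowest["stem_length"], highest["stem_length"])  # 1 10
```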

13
GenEx (Combines Genitor with Extractor)

Genitor is run with a population of 50 for 1050
trials (default setting).
Each trial consists of running Extractor with the
parameter settings specified in the 72-bit binary
string.
The fitness measure is based on the average
precision for the whole training set.
The final output is the highest scoring binary
string.
Experimental results led to adopting a penalty such
that
fitness = precision × penalty
(Modification of the fitness function to output the
correct number of keyphrases. Penalties vary from
0 to 1.)
Notes on Genitor: It is a steady-state genetic
algorithm. The initial population is usually
randomly chosen. The population changes one
individual at a time, such that the least fit
individual is replaced by a newly generated
individual. Whitley (1989) suggests that
steady-state genetic algorithms are more
aggressive than generational genetic algorithms.
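A minimal sketch of a steady-state genetic algorithm over 72-bit strings, using the population size (50) and trial count (1050) from this slide. It is not Genitor itself: the uniform random parent choice, one-point crossover, 10% mutation rate, and the rule that a child only replaces the least fit member when it is at least as fit are all assumptions. The fitness function toy_fitness is a placeholder; in GenEx each evaluation would instead run Extractor on the training set.

```python
import random


def toy_fitness(bits):
    """Placeholder fitness (fraction of 1 bits); stands in for Extractor."""
    return sum(bits) / len(bits)


def steady_state_ga(fitness, n_bits=72, pop_size=50, trials=1050, seed=0):
    """Evolve bit strings one individual at a time.

    Assumption: the initial population's evaluations count toward
    the trial budget, so trials - pop_size children are generated.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    initial_best = max(scores)
    for _ in range(trials - pop_size):
        # Pick two distinct parents and recombine with one-point crossover.
        a, b = rng.sample(range(pop_size), 2)
        cut = rng.randrange(1, n_bits)
        child = pop[a][:cut] + pop[b][cut:]
        if rng.random() < 0.1:  # occasional single-bit mutation
            child[rng.randrange(n_bits)] ^= 1
        child_score = fitness(child)
        # Steady-state step: only the least fit individual can be replaced.
        worst = min(range(pop_size), key=scores.__getitem__)
        if child_score >= scores[worst]:
            pop[worst], scores[worst] = child, child_score
    top = max(range(pop_size), key=scores.__getitem__)
    return pop[top], scores[top], initial_best


best, best_score, initial_best = steady_state_ga(toy_fitness)
print(f"best initial fitness {initial_best:.3f}, best final fitness {best_score:.3f}")
```

Because only the least fit member is ever replaced, the best individual found so far is never lost, which fits the "more aggressive" character attributed to steady-state algorithms.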

14
GenEx (continued)

GenEx may take a significant time to run (750
times longer than C4.5).
GenEx was trained separately on different corpora
(journals, e-mails, and Web-pages) in order to
increase precision.
                              Average Precision +/- Std. Dev.
Training/Testing   Phrases   GenEx             C4.5
Journals           5         0.290 +/- 0.247   0.280 +/- 0.255
                   15        0.177 +/- 0.130   0.170 +/- 0.113
Email              5         0.234 +/- 0.205   0.161 +/- 0.160
                   15        0.122 +/- 0.105   0.100 +/- 0.081
NASA Web-pages     5         0.206 +/- 0.172   0.155 +/- 0.159
                   15        0.118 +/- 0.080   0.092 +/- 0.068
Summary            5         0.239 +/- 0.187   0.181 +/- 0.177
(averages)         15        0.128 +/- 0.089   0.110 +/- 0.078
The question still remains whether 29% precision
is acceptable.
How else can automatic keyphrase extraction be
tested?

15
Human Evaluation of GenEx keyphrases

A website explaining GenEx was created where the
reader was asked to volunteer a URL for
processing.
Keyphrases were extracted and then presented to
the user to judge whether he/she found them Good,
Bad, or No Opinion.
Web-based human evaluation of keyword extraction:
Number of voters: 205
Number of documents: 267
Number of keyphrases: 1869
Max. documents per person: 5
Good: 1159 (62.1%)
Bad: 339 (18.1%)
No Opinion: 371 (19.9%)
From these voters, about 80% of the keyphrases
were found to be acceptable.

16
Last Notes on Keyphrase Algorithms

Frank et al. (1999) developed Kea, which is a
Bayesian approach to keyphrase extraction. Eibe
Frank and his co-authors acknowledge the help of
Peter Turney and NRC.
Kea is available through the internet. (See
Weka.)
Turney believes that GenEx and Kea should give
statistically similar results.
The work that was done on specialized procedural
domain knowledge was the main element in the
automation of keyword extraction.
Future Work
Would under-sampling or over-sampling help the
machine learning process? (Since there is a class
imbalance.)
A thesaurus of synonyms would be a welcome
addition.
For specialized journals or Web-pages (ex.
Medline), a lexicon of frequently used keyphrases
could be compiled.
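The under-sampling question in the future-work list above can be illustrated with a small sketch of random under-sampling of the negative class. It uses the earlier figure of about 6 keyphrases among 2500 stemmed words; the function undersample and its ratio parameter are hypothetical, and whether this actually helps keyphrase extraction is exactly the open question.

```python
import random


def undersample(examples, labels, ratio=1.0, seed=0):
    """Keep all positives and a random subset of negatives.

    ratio: number of negatives kept per positive (assumed parameter).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    keep = sorted(pos + rng.sample(neg, n_keep))
    return [examples[i] for i in keep], [labels[i] for i in keep]


# 6 keyphrases (class 1) among 2500 candidates, as in the earlier example.
labels = [1] * 6 + [0] * 2494
examples = list(range(len(labels)))  # stand-ins for feature vectors
small_x, small_y = undersample(examples, labels)
print(len(small_y), sum(small_y))  # 12 6
```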