Title: Automatic Keyphrase Extraction
1 Automatic Keyphrase Extraction
- (Jim Nuyens)
- Keywords are an everyday part of looking up topics and specific content. What are some of the ways of obtaining keywords/keyphrases by machine learning?
- This talk reviews some of the work of Peter Turney from NRC.
- The papers are dated 1997 and 1999, so recent developments in data mining may suggest improvements.
2 Automated Keyphrase Extraction
- Introduction and definitions
- Applications and algorithms
- Learning algorithms
- Empirical results
- Future work
3 Definitions
- Information extraction: text analysis in this domain serves to provide user-anticipated information. (Ex.: the names of companies in news service reports.)
- Index generation: an index may be created as a back-of-the-book listing for human use or as an exhaustive computer listing used by a search engine.
- Important phrase extraction: may be used especially with scientific journals.
- Keyphrase: a phrase of one to three words that captures the main topic.
- Keyphrase list: usually 5 to 15 keyphrases.
- Keyphrase generation: obtaining keyphrases, some of which do not appear in the body of the text document.
- Keyphrase extraction: obtaining keyphrases that do appear in the body of the text document.
- Note: on average, about 75% of the keyphrases appear in the text.
4 Applications and Algorithms
- Keyphrases may serve as a mini-summary
- Partial indexing
- Automated keyphrases can help an author with some keywords or phrases they may have missed
- Labels for text documents
- Providing highlights for a document
- Algorithms of concern: stemming of words
  Word          Porter stem    Lovins stem
  believes      believ         belief
  belief        belief         belief
  believable    believ         belief
- Turney finds the more aggressive Lovins stemming algorithm to be more useful for keyword extraction.
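The stemming table can be read as a grouping exercise: words that share a stem are treated as the same term. The following is a minimal Python sketch that only looks up the stems copied from the table (it does not implement either stemmer); the names `STEMS` and `group_by_stem` are illustrative and not from the original work.

```python
from collections import defaultdict

# Stems copied from the slide's table; this is a lookup, not a real stemmer.
STEMS = {
    "believes":   {"porter": "believ", "lovins": "belief"},
    "belief":     {"porter": "belief", "lovins": "belief"},
    "believable": {"porter": "believ", "lovins": "belief"},
}

def group_by_stem(stemmer):
    """Group the example words by the stem that the named stemmer assigns."""
    groups = defaultdict(list)
    for word, stems in STEMS.items():
        groups[stems[stemmer]].append(word)
    return dict(groups)

print(group_by_stem("porter"))  # two groups: the words are split across two stems
print(group_by_stem("lovins"))  # one group: all three words share a stem
```

With these three words, the more aggressive Lovins stems collapse everything into one term, while the Porter stems leave "belief" separate from the other two.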
5 Applications and Algorithms (continued)
- Word order matters: "stone church" is not equal to "church stone"
- Stemming matters: "neural networks" is equal to "neural network"
- Sometimes the stemming algorithm does not get it right
  Word          Porter stem    Lovins stem
  realistic     realist        real
  reality       realiti        re
- Both the Porter and Lovins stemming algorithms see the two words as distinct.
6 Measuring performance of the algorithms
- Confusion matrix for keyphrase extraction
                                          Human classified    Human classified
                                          as a keyphrase      as NOT a keyphrase
  Machine classified as a keyphrase       a                   b
  Machine classified as NOT a keyphrase   c                   d
- Example (for a total of 2500 stemmed words)
                                          Human classified    Human classified
                                          as a keyphrase      as NOT a keyphrase
  Machine classified as a keyphrase       4                   3
  Machine classified as NOT a keyphrase   2                   2491 (2500 - 9)
7 Measuring performance (continued)
- Accuracy = (a + d) / (a + b + c + d) = 2495/2500
- Precision = a / (a + b) = 4/7
- Recall = a / (a + c) = 4/6
- The F-measure is used as a balanced measure of precision and recall.
- F-measure = 2a / (2a + b + c) = 8/13
- A journal article will typically contain 10,000 words, and these will narrow down to approximately 2500 stemmed word equivalents. Out of the average 7.5 keyphrases used, only 6 keyphrases are available in the text for extraction.
- Class imbalance (a handful of keyphrases among roughly 2500 candidates) does present machine learning difficulties.
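The four measures can be checked directly from the confusion-matrix counts. This is a small Python sketch using the example counts from the previous slide (a = 4, b = 3, c = 2, d = 2500 - 9); the function name `keyphrase_metrics` is illustrative only.

```python
from fractions import Fraction

def keyphrase_metrics(a, b, c, d):
    """a: machine and human both say keyphrase; b: machine only;
    c: human only; d: neither."""
    return {
        "accuracy": Fraction(a + d, a + b + c + d),
        "precision": Fraction(a, a + b),
        "recall": Fraction(a, a + c),
        "f_measure": Fraction(2 * a, 2 * a + b + c),
    }

metrics = keyphrase_metrics(a=4, b=3, c=2, d=2500 - 9)
for name, value in metrics.items():
    # Fractions print in lowest terms, followed by the decimal value.
    print(f"{name}: {value} = {float(value):.3f}")
```

Accuracy is close to 1 simply because almost every candidate is a non-keyphrase, which is why precision, recall and the F-measure are the more informative numbers here.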
8 Empirical Results (Turney, 1997)
  Method of extraction    F-measure    F-measure    F-measure
                          (text 1)     (text 2)     (text 3)
  Microsoft Word          0.154        0.000        0.545
  Brill's Tagger          0.118        0.077        0.400
  Verity's Search 97      0.462        0.000        0.429
  NRC's Extractor         0.400        0.080        0.462
- The above results are from journal articles. Text 2 was a very difficult scientific article.
- The author also obtained good results (F-measures) for extraction of e-mail and Web page keyphrases.
9 Machine Learning Results (Turney, 1999)
- The next paper deals with the different possible approaches to automatic keyphrase extraction.
- Part I: the use of C4.5 software to find the keyphrases. (Features are provided for the phrases and used to determine positive and negative cases.)
- Part II: the use of the GenEx algorithm, which is the combination of the Genitor genetic algorithm (Whitley, 1989) and the Extractor algorithm (NRC).
- Part I
- The author went through 110 features before settling on:
  1) stemmed_phrase       2) whole_phrase        3) num_words_phrase
  4) first_occur_phrase   5) first_occur_word    6) freq_phrase
  7) freq_word            8) relative_length     9) proper_noun
  10) final_adjective     11) common_verb        12) class
- Class 1 is an extracted keyphrase and Class 0 is NOT a keyphrase.
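To make the feature list more concrete, here is a hedged Python sketch that computes three of the named features for one candidate phrase. The slides give only the feature names, so the definitions used below (normalizing by the number of words in the text) are assumptions for illustration and may differ from those in the paper; the function name `phrase_features` is also hypothetical.

```python
import re

def phrase_features(text, phrase):
    """Compute a few illustrative features for one candidate phrase.

    Assumed definitions (not taken from the slides):
      num_words_phrase   - number of words in the phrase
      first_occur_phrase - index of first occurrence / number of words in text
      freq_phrase        - number of occurrences / number of words in text
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    if n == 0:
        return None
    # Positions where the phrase's words appear consecutively in the text.
    hits = [i for i in range(len(words) - n + 1) if words[i:i + n] == target]
    if not hits:
        return None  # the phrase is not in the text, so it cannot be extracted
    return {
        "num_words_phrase": n,
        "first_occur_phrase": hits[0] / len(words),
        "freq_phrase": len(hits) / len(words),
    }

doc = "Neural networks learn features. Deep neural networks learn many features."
print(phrase_features(doc, "neural networks"))
```

A learner such as C4.5 would receive one such feature vector per candidate phrase, together with the class label (1 or 0).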
10 NRC's Extractor
- Ten steps
- 1) Find single stems (stemming algorithm)
- 2) Score single stems
- 3) Select top single stems
- 4) Find stem phrases (phrases up to length 3)
- 5) Score stem phrases
- 6) Expand single stems
- 7) Drop duplicates
- 8) Add suffixes
- 9) Add capitals
- 10) Final output
- Summary: Extractor is the NRC software that takes text as input and produces keyphrases as output.
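Steps 1) to 3) can be sketched roughly as follows. This is only an illustration under stated assumptions: stemming is done by truncating each word to `stem_length` characters (suggested by the Stem_length parameter listed on a later slide), very short words are skipped, and the score is the raw stem frequency. The slides do not give Extractor's actual scoring, and the function name `top_single_stems` is hypothetical.

```python
import re
from collections import Counter

def top_single_stems(text, stem_length=5, num_working=3):
    """Rough stand-in for steps 1-3: find stems, score them, keep the top ones.

    Assumptions: truncation stemming, words of 3 letters or fewer are
    ignored, and the score is the raw frequency of the stem.
    """
    words = re.findall(r"[a-z]+", text.lower())
    stems = [w[:stem_length] for w in words if len(w) > 3]
    return Counter(stems).most_common(num_working)

text = ("Keyphrase extraction extracts keyphrases. "
        "Extraction of keyphrases helps indexing.")
print(top_single_stems(text, stem_length=5, num_working=2))
```

The later steps (stem phrases, expansion, suffixes, capitals) would then turn these top stems back into readable phrases taken from the text.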
11 NRC's Extractor (continued)
- The tests (within the algorithm)
- 1) The phrase should not have the capitalization of a proper noun, unless the flag suppress_proper is set to zero
- 2) The phrase should not have an ending that indicates a possible adjective
- 3) The phrase should be longer than min_length_low_rank
- 4) If the phrase is shorter than min_length_low_rank it may still be acceptable
- 5) If the phrase fails both tests 3) and 4) it may still be acceptable if its capitalization indicates that it is probably an abbreviation
- 6) The phrase should not contain any words commonly used as verbs
- 7) The phrase should not match any phrase
- Lastly, a phrase must pass tests 1), 2), 6) and 7), and at least one of 3), 4) and 5).
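The final rule for combining the tests is a simple boolean condition. A minimal Python sketch, assuming the outcome of each numbered test is already known for a candidate phrase (the function name `phrase_is_acceptable` and the dictionary layout are illustrative):

```python
def phrase_is_acceptable(passed):
    """passed maps each test number (1-7) to True or False for one phrase."""
    required = all(passed[t] for t in (1, 2, 6, 7))   # all four must pass
    length_ok = any(passed[t] for t in (3, 4, 5))     # at least one must pass
    return required and length_ok

# A short phrase that is rescued by test 5 (probably an abbreviation).
candidate = {1: True, 2: True, 3: False, 4: False, 5: True, 6: True, 7: True}
print(phrase_is_acceptable(candidate))
```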
12 NRC's Extractor (continued)
- Twelve parameters (used with Extractor and Genitor)
  Parameter name         Range           Number of bits
  Num_phrases            5 to 15         0
  Num_working            15 to 75        0
  Factor_two_one         1.0 to 3.0      8
  Factor_three_one       1.0 to 5.0      8
  Min_length_low_rank    0.3 to 3.0      8
  Min_rank_low_length    1 to 20         5
  First_low_thresh       1 to 1000       10
  First_high_thresh      1 to 4000       12
  First_low_factor       1.0 to 15.0     8
  First_high_factor      0.01 to 1.0     8
  Stem_length            1 to 10         4
  Suppress_proper        0 to 1          1
- Total number of bits is 72 (a 72-bit binary string).
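One way to picture the 72-bit string is as ten consecutive bit fields, one per tuned parameter. The Python sketch below decodes such a string using the ranges and bit widths from the table. The encoding itself is an assumption: each field is read as an unsigned integer and scaled linearly onto its range, with integer-valued ranges rounded. The slides do not specify how Genitor's bits map to values, and the names `PARAMS` and `decode` are illustrative.

```python
# (name, low, high, bits) copied from the table; the two 0-bit parameters
# (Num_phrases, Num_working) are not part of the evolved string.
PARAMS = [
    ("factor_two_one", 1.0, 3.0, 8),
    ("factor_three_one", 1.0, 5.0, 8),
    ("min_length_low_rank", 0.3, 3.0, 8),
    ("min_rank_low_length", 1, 20, 5),
    ("first_low_thresh", 1, 1000, 10),
    ("first_high_thresh", 1, 4000, 12),
    ("first_low_factor", 1.0, 15.0, 8),
    ("first_high_factor", 0.01, 1.0, 8),
    ("stem_length", 1, 10, 4),
    ("suppress_proper", 0, 1, 1),
]

def decode(bits):
    """Decode a 72-character string of '0'/'1' into parameter values.

    Assumption: linear scaling of each unsigned field onto [low, high].
    """
    total = sum(width for _, _, _, width in PARAMS)
    if len(bits) != total:
        raise ValueError(f"expected {total} bits, got {len(bits)}")
    values, pos = {}, 0
    for name, low, high, width in PARAMS:
        raw = int(bits[pos:pos + width], 2)
        pos += width
        value = low + (high - low) * raw / (2 ** width - 1)
        # Ranges given as integers in the table are rounded to integers.
        values[name] = round(value) if isinstance(low, int) else value
    return values

print(decode("0" * 72)["stem_length"], decode("1" * 72)["stem_length"])
```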
13 GenEx (Combines Genitor with Extractor)
- Genitor is run with a population of 50 for 1050 trials (default setting).
- Each trial consists of running Extractor with the parameter settings specified in the 72-bit binary string.
- The fitness measure is based on the average precision for the whole training set.
- The final output is the highest-scoring binary string.
- Experiments led to adding a penalty such that
  fitness = precision × penalty
- (The fitness function is modified so that the correct number of keyphrases is output. Penalties vary from 0 to 1.)
- Notes on Genitor: it is a steady-state genetic algorithm. The initial population is usually randomly chosen. The population changes one individual at a time, such that the least fit individual is replaced by a new individual (an offspring of parents selected from the population). Whitley (1989) suggests that steady-state genetic algorithms are more aggressive than generational genetic algorithms.
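The steady-state idea (one new individual per trial, replacing the least fit) can be illustrated with a toy Python sketch. This is not Genitor: parent selection is uniform at random, crossover is one-point, the mutation rate is 1/n_bits, and the fitness function is a stand-in that counts 1-bits rather than running Extractor on a training set. All names are illustrative.

```python
import random

def steady_state_ga(fitness, n_bits=72, pop_size=50, trials=1050, seed=0):
    """Toy steady-state GA: each trial adds one child and drops the worst.

    Assumptions: uniform parent choice, one-point crossover and bit-flip
    mutation; Genitor's own operators are not reproduced here.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(trials):
        mum, dad = rng.sample(pop, 2)
        cut = rng.randrange(1, n_bits)
        child = mum[:cut] + dad[cut:]
        # Flip each bit with probability 1/n_bits.
        child = [bit ^ (rng.random() < 1 / n_bits) for bit in child]
        # The least fit individual is replaced by the new child.
        worst = min(range(pop_size), key=lambda i: fitness(pop[i]))
        pop[worst] = child
    return max(pop, key=fitness)

best = steady_state_ga(fitness=sum)  # stand-in fitness: number of 1-bits
print(sum(best), "of", len(best), "bits set")
```

In GenEx the fitness call would instead decode the bit string into Extractor parameters, run Extractor over the training documents, and return the (penalized) average precision, which is why each trial is expensive.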
14 GenEx (continued)
- GenEx may take a significant time to run (750 times longer than C4.5).
- GenEx was trained separately on different corpora (journals, e-mails and Web pages) in order to increase precision.
- Average precision ± standard deviation
  Training/testing    No. of phrases    GenEx            C4.5
  Journals            5                 0.290 ± 0.247    0.280 ± 0.255
                      15                0.177 ± 0.130    0.170 ± 0.113
  Email               5                 0.234 ± 0.205    0.161 ± 0.160
                      15                0.122 ± 0.105    0.100 ± 0.081
  NASA Web pages      5                 0.206 ± 0.172    0.155 ± 0.159
                      15                0.118 ± 0.080    0.092 ± 0.068
  Summary             5                 0.239 ± 0.187    0.181 ± 0.177
  (averages)          15                0.128 ± 0.089    0.110 ± 0.078
- The question still remains whether 29% precision is acceptable.
- How else can automatic keyphrase extraction be tested?
15 Human Evaluation of GenEx keyphrases
- A website explaining GenEx was created, where the reader was asked to volunteer a URL for processing.
- Keyphrases were extracted and then presented to the user, who judged each one as Good, Bad or No Opinion.
- Web-based human evaluation of keyword extraction
  Number of voters: 205          Number of documents: 267
  Number of keyphrases: 1869     Max. documents per person: 5
  Good          1159 (62.0%)
  Bad            339 (18.1%)
  No Opinion     371 (19.9%)
- From these voters, about 80% of the keyphrases were found to be acceptable.
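The percentages follow directly from the vote counts. The short Python check below recomputes them and shows two possible readings of "acceptable"; which reading the slide intends is not stated, so both are shown as assumptions.

```python
votes = {"Good": 1159, "Bad": 339, "No Opinion": 371}
total = sum(votes.values())  # should equal the 1869 keyphrases judged

shares = {label: 100 * count / total for label, count in votes.items()}
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")

# Reading 1 (assumption): "acceptable" means not judged Bad.
not_bad = shares["Good"] + shares["No Opinion"]
# Reading 2 (assumption): ignore No Opinion and look at Good vs. Bad only.
good_of_opinions = 100 * votes["Good"] / (votes["Good"] + votes["Bad"])
print(f"not Bad: {not_bad:.1f}%, Good among opinions: {good_of_opinions:.1f}%")
```

Either reading lands near the "about 80%" figure quoted on the slide.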
16 Last Notes on Keyphrase Algorithms
- Frank et al. (1999) developed Kea, which is a Bayesian approach to keyphrase extraction. Eibe Frank and his co-authors acknowledge the help of Peter Turney and NRC.
- Kea is available through the internet. (See Weka.)
- Turney believes that GenEx and Kea should give statistically similar results.
- The work that was done on specialized procedural domain knowledge was the main element in the automation of keyword extraction.
- Future Work
- Would under-sampling or over-sampling help the machine learning process? (Since there is a class imbalance.)
- A thesaurus of synonyms would be a welcome addition.
- For specialized journals or Web pages (ex. Medline), a lexicon of frequently used keyphrases could be compiled.
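As an illustration of the under-sampling question, here is a minimal Python sketch of random under-sampling: negatives (class 0) are dropped at random until the two classes are the same size. Whether this would actually help keyphrase learning is exactly the open question above; the sketch only shows the mechanics, and the function name `undersample` and the toy data are hypothetical.

```python
import random

def undersample(examples, seed=0):
    """Randomly drop class-0 examples until both classes are the same size.

    Each example is a (features, label) pair with label 1 or 0.
    """
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    kept = rng.sample(negatives, min(len(negatives), len(positives)))
    balanced = positives + kept
    rng.shuffle(balanced)
    return balanced

# Toy data mirroring the earlier example: 6 keyphrases among 2500 candidates.
examples = [(f"stem{i}", 1 if i < 6 else 0) for i in range(2500)]
balanced = undersample(examples)
print(len(balanced), "examples after under-sampling")
```

Over-sampling would go the other way, repeating the few positive examples instead of discarding negatives.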
17 Bibliography
- Extraction of Keyphrases from Text: Evaluation of Four Algorithms (Turney, 1997)
- Learning Algorithms for Keyphrase Extraction (Turney, 1999)
- Kea: Practical Automatic Keyphrase Extraction (Witten et al., 1999)