Automatic Keyphrase Extraction

1
Automatic Keyphrase Extraction
  • (Jim Nuyens)
  • Keywords are an everyday part of looking up
    topics and specific content. What are some of the
    ways of obtaining keywords/keyphrases by machine
    learning?
  • Reviewing some of the work of Peter Turney from
    NRC.
  • The papers are dated 1997 and 1999, so more
    recent developments in data mining may suggest
    improvements.

2
Automated Keyphrase Extraction
  • Introduction and definitions
  • Applications and algorithms
  • Learning algorithms
  • Empirical results
  • Future work

3
Definitions
  • Information extraction: text analysis in this
    domain serves to provide user-anticipated
    information. (Ex.: the names of companies in news
    service reports.)
  • Index generation: an index may be created as a
    back-of-the-book listing for human use or as an
    exhaustive computer listing used by a search
    engine.
  • Important phrase extraction: may be used
    especially with scientific journals.
  • Keyphrase: a phrase of one to three words that
    captures the main topic.
  • Keyphrase list: usually 5 to 15 keyphrases.
  • Keyphrase generation: obtaining keyphrases,
    some of which do not appear in the body of
    the text document.
  • Keyphrase extraction: obtaining only keyphrases
    that appear in the body of the text
    document.
  • Note: on average, about 75% of the keyphrases
    appear in the text.

4
Applications and Algorithms
  • Keyphrases may serve as a mini-summary
  • Partial indexing
  • Automated keyphrases can help an author with some
    keywords or phrases he may have missed
  • Labels for text documents
  • Providing highlights for a document.
  • Algorithms of concern: stemming of words
      Word         Porter stem   Lovins stem
      believes     believ        belief
      belief       belief        belief
      believable   believ        belief
  • Turney finds the more aggressive Lovins stemming
    algorithm to be more useful for keyword
    extraction.

5
Applications and Algorithms (continued)
  • "stone church" is not equal to "church stone"
    (word order matters).
  • "neural networks" is equal to "neural network"
    (the stems match).
  • Sometimes the stemming algorithm does not get it
    correct:
      Word        Porter stem   Lovins stem
      realistic   realist       real
      reality     realiti       re
  • Both the Porter and Lovins stemming algorithms
    see the two words as distinct.
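
The phrase comparisons on this slide can be sketched in a few lines of Python. The `crude_stem` function is a hypothetical stand-in that only strips a trailing "s"; it is not the Porter or Lovins algorithm. It is just enough to show that stemming merges inflected forms while word order still distinguishes phrases.

```python
def crude_stem(word: str) -> str:
    """Hypothetical toy stemmer: lowercases and strips one trailing 's'.

    A stand-in for Porter or Lovins, not an implementation of either.
    """
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stem_phrase(phrase: str) -> tuple[str, ...]:
    """Stem each word of the phrase but keep the word order."""
    return tuple(crude_stem(w) for w in phrase.split())


# Inflected forms collapse to the same stemmed phrase ...
same = stem_phrase("neural networks") == stem_phrase("neural network")
# ... but a different word order is still a different phrase.
swapped = stem_phrase("stone church") == stem_phrase("church stone")
print(same, swapped)
```

A real stemmer would replace `crude_stem`; the order-preserving tuple comparison would stay the same.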

6
Measuring performance of the algorithms
  • Confusion matrix for keyphrase extraction:
                                         Human classified   Human classified
                                         as a keyphrase     as NOT a keyphrase
      Machine classified as keyphrase          a                  b
      Machine classified as NOT                c                  d
  • Example (for a total of 2500 stemmed words):
                                         Human classified   Human classified
                                         as a keyphrase     as NOT a keyphrase
      Machine classified as keyphrase          4                  3
      Machine classified as NOT                2            2500 - 9 = 2491

7
Measuring performance (continued)
  • Accuracy = (a + d) / (a + b + c + d) = 2495/2500
  • Precision = a / (a + b) = 4/7
  • Recall = a / (a + c) = 4/6
  • The F-measure is used as a balanced measure.
  • F-measure = (2a) / (2a + b + c) = 8/13
  • A journal article typically contains 10,000
    words, which narrow down to approximately
    2500 stemmed word equivalents. Of the average
    7.5 keyphrases assigned, only 6 are
    available in the text for extraction.
  • This class imbalance (about 6 positive cases
    among 2500 candidates) presents machine learning
    difficulties.
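
The four measures can be checked against the example confusion matrix with a short Python sketch. The function name `scores` is chosen here for illustration; the cell names a, b, c and d follow the confusion matrix above.

```python
from fractions import Fraction


def scores(a: int, b: int, c: int, d: int) -> dict[str, Fraction]:
    """Compute the four measures from the confusion-matrix cells.

    a: machine and human both say keyphrase
    b: machine says keyphrase, human does not
    c: human says keyphrase, machine does not
    d: neither says keyphrase
    """
    return {
        "accuracy": Fraction(a + d, a + b + c + d),
        "precision": Fraction(a, a + b),
        "recall": Fraction(a, a + c),
        "f_measure": Fraction(2 * a, 2 * a + b + c),
    }


# Cell values from the worked example (2500 stemmed words in total).
result = scores(a=4, b=3, c=2, d=2500 - 9)
for name, value in result.items():
    print(f"{name}: {value} = {float(value):.3f}")
```

Accuracy comes out at 0.998 even though precision and recall are far lower, which is why accuracy is a poor guide when almost every candidate is a negative case.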

8
Empirical Results (Turney, 1997)
  • F-measure by method of extraction:
      Method                Text 1   Text 2   Text 3
      Microsoft Word        0.154    0.000    0.545
      Brill's Tagger        0.118    0.077    0.400
      Verity's Search 97    0.462    0.000    0.429
      NRC's Extractor       0.400    0.080    0.462
  • The above results are from journal articles.
    Text 2 was a very difficult scientific article.
  • The author also obtained good results (F-measures)
    for extraction of e-mail and Web-page
    keyphrases.

9
Machine Learning Results (Turney, 1999)
  • The next paper deals with the different possible
    approaches to automatic keyphrase extraction.
  • Part I: the use of C4.5 software to find the
    keyphrases. (Features are provided for the
    phrases to determine positive and
    negative cases.)
  • Part II: the use of the GenEx algorithm, which
    is the combination of the Genitor genetic
    algorithm (Whitley, 1989) and the Extractor
    algorithm (NRC).
  • Part I
  • The author went through 110 features before
    settling on:
  •  1) stemmed_phrase       2) whole_phrase       3) num_words_phrase
  •  4) first_occur_phrase   5) first_occur_word   6) freq_phrase
  •  7) freq_word            8) relative_length    9) proper_noun
  • 10) final_adjective     11) common_verb       12) class
  • Class 1 is an extracted keyphrase and Class 0 is
    NOT a keyphrase.
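
To make the feature list more concrete, here is a minimal Python sketch that computes three of the features for one candidate phrase. The feature names come from the slide, but the definitions used here (position and frequency normalized by document length) are assumptions for illustration and may differ from those in Turney (1999). The function name `phrase_features` is hypothetical.

```python
def phrase_features(phrase: str, words: list[str]) -> dict[str, float]:
    """Compute a few illustrative features for one candidate phrase.

    `words` is the tokenized, lowercased document. The definitions are
    assumptions made for this sketch, not the paper's definitions.
    """
    target = phrase.lower().split()
    n = len(target)
    # Start index of every occurrence of the phrase in the document.
    hits = [i for i in range(len(words) - n + 1) if words[i:i + n] == target]
    return {
        "num_words_phrase": n,
        # Position of the first occurrence as a fraction of document length
        # (1.0 is used when the phrase never occurs).
        "first_occur_phrase": hits[0] / len(words) if hits else 1.0,
        # Occurrences of the phrase per word of document.
        "freq_phrase": len(hits) / len(words),
    }


doc = ("keyphrase extraction is useful because "
       "keyphrase extraction summarizes text").split()
features = phrase_features("keyphrase extraction", doc)
print(features)
```

Each candidate phrase would yield one such feature vector, with the class label (1 or 0) attached for training.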

10
NRC's Extractor
  • Ten steps:
  •  1) Find single stems (stemming algorithm)
  •  2) Score single stems
  •  3) Select top single stems
  •  4) Find stem phrases (phrases up to length 3)
  •  5) Score stem phrases
  •  6) Expand single stems
  •  7) Drop duplicates
  •  8) Add suffixes
  •  9) Add capitals
  • 10) Final output
  • Summary: Extractor is the NRC software that
    takes text as input and produces keyphrases as
    output.
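
The first three steps can be sketched roughly as follows. This is not NRC's code: stemming by truncation, scoring by raw frequency, and skipping words of three letters or fewer are all assumptions made for illustration, and `top_single_stems` is a hypothetical name. The arguments `stem_length` and `num_working` borrow the names of Extractor parameters, but how they are used here is assumed.

```python
from collections import Counter


def top_single_stems(text: str, stem_length: int = 5,
                     num_working: int = 3) -> list[str]:
    """Steps 1-3 only, under assumed rules: truncate, count, keep the top."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    # Step 1 (assumed): stem by truncating each word to `stem_length`
    # characters, ignoring very short words.
    stems = [w[:stem_length] for w in words if len(w) > 3]
    # Step 2 (assumed): score each stem by its raw frequency.
    counts = Counter(stems)
    # Step 3: keep the `num_working` highest-scoring stems.
    return [stem for stem, _ in counts.most_common(num_working)]


text = ("Learning algorithms learn from training data. "
        "The learner is trained on labelled data.")
print(top_single_stems(text))
```

Steps 4 to 10 would then build multi-word stem phrases around these stems and restore suffixes and capitalization for the final output.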

11
NRC's Extractor (continued)
  • The tests (within the algorithm):
  • 1) The phrase should not have the capitalization
    of a proper noun, unless the flag suppress_proper
    is set to zero.
  • 2) The phrase should not have an ending that
    indicates a possible adjective.
  • 3) The phrase should be longer than
    min_length_low_rank.
  • 4) If the phrase is shorter than
    min_length_low_rank it may still be acceptable
  • 5) If the phrase fails both tests 3) and 4) it may
    still be acceptable if its capitalization
    indicates that it is probably an abbreviation.
  • 6) The phrase should not contain any words
    commonly used as verbs.
  • 7) The phrase should not match any phrase
  • Lastly, a phrase must pass tests 1), 2), 6) and
    7), and at least one of 3), 4) and 5).
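
The final rule is a simple boolean combination of the seven test results. A minimal sketch, assuming each test has already been evaluated and stored under its number (the function name `phrase_is_acceptable` is hypothetical):

```python
def phrase_is_acceptable(passed: dict[int, bool]) -> bool:
    """`passed` maps test number (1-7) to whether the phrase passed it."""
    # Tests 1, 2, 6 and 7 are all mandatory.
    required = all(passed[t] for t in (1, 2, 6, 7))
    # Tests 3, 4 and 5 are alternatives: one pass is enough.
    any_length_test = any(passed[t] for t in (3, 4, 5))
    return required and any_length_test


# A short phrase that fails tests 3 and 4 but looks like an abbreviation.
abbreviation = {1: True, 2: True, 3: False, 4: False, 5: True, 6: True, 7: True}
print(phrase_is_acceptable(abbreviation))
```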

12
NRC's Extractor (continued)
  • Twelve parameters (used with Extractor and
    Genitor):
      Parameter name         Range          Number of bits
      Num_phrases            [5, 15]         0
      Num_working            [15, 75]        0
      Factor_two_one         [1.0, 3.0]      8
      Factor_three_one       [1.0, 5.0]      8
      Min_length_low_rank    [0.3, 3.0]      8
      Min_rank_low_length    [1, 20]         5
      First_low_thresh       [1, 1000]      10
      First_high_thresh      [1, 4000]      12
      First_low_factor       [1.0, 15.0]     8
      First_high_factor      [0.01, 1.0]     8
      Stem_length            [1, 10]         4
      Suppress_proper        [0, 1]          1
  • Total number of bits is 72 (a 72-bit
    binary string).
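
One way to picture the 72-bit string is to decode it back into parameter values. The sketch below assumes that the fields appear in table order and that each field maps linearly onto its range; neither detail is stated on the slide, and rounding for integer-valued parameters is left out. The names `PARAMS` and `decode` are hypothetical.

```python
# (name, low, high, bits) copied from the parameter table.
PARAMS = [
    ("Num_phrases", 5, 15, 0),
    ("Num_working", 15, 75, 0),
    ("Factor_two_one", 1.0, 3.0, 8),
    ("Factor_three_one", 1.0, 5.0, 8),
    ("Min_length_low_rank", 0.3, 3.0, 8),
    ("Min_rank_low_length", 1, 20, 5),
    ("First_low_thresh", 1, 1000, 10),
    ("First_high_thresh", 1, 4000, 12),
    ("First_low_factor", 1.0, 15.0, 8),
    ("First_high_factor", 0.01, 1.0, 8),
    ("Stem_length", 1, 10, 4),
    ("Suppress_proper", 0, 1, 1),
]
TOTAL_BITS = sum(bits for *_, bits in PARAMS)


def decode(bitstring: str) -> dict[str, float]:
    """Decode a binary string into parameter values (assumed linear mapping)."""
    if len(bitstring) != TOTAL_BITS:
        raise ValueError(f"expected {TOTAL_BITS} bits, got {len(bitstring)}")
    values, pos = {}, 0
    for name, low, high, bits in PARAMS:
        if bits == 0:
            continue  # zero bits: not encoded in the string, so not decoded
        raw = int(bitstring[pos:pos + bits], 2)
        # Map 0 .. 2^bits - 1 linearly onto low .. high.
        values[name] = low + (high - low) * raw / (2 ** bits - 1)
        pos += bits
    return values


print(TOTAL_BITS)
print(decode("1" * TOTAL_BITS)["First_high_thresh"])
```

The two parameters with 0 bits take no space in the string, which is how twelve parameters fit into 72 bits.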

13
GenEx (Combines Genitor with Extractor)
  • Genitor is run with a population of 50 for 1050
    trials (default setting).
  • Each trial consists of running Extractor with the
    parameter settings specified in the 72-bit binary
    string.
  • The fitness measure is based on the average
    precision for the whole training set.
  • The final output is the highest scoring binary
    string.
  • Experimental results led to adopting a penalty
    such that
      fitness = precision × penalty
    (a modification of the fitness function so that
    the correct number of keyphrases is output; the
    penalty varies from 0 to 1).
  • Notes on Genitor: it is a steady-state genetic
    algorithm. The initial population is usually
    randomly chosen. The population changes one
    individual at a time, such that the least fit
    individual is replaced by a new randomly selected
    individual. Whitley (1989) suggests that
    steady-state genetic algorithms are more
    aggressive than generational genetic algorithms.
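
The steady-state idea can be illustrated with a toy loop in Python. This is not Genitor itself: the one-point crossover of two randomly chosen parents is an assumed variation operator, the fitness function is a placeholder (in GenEx each evaluation would run Extractor on the training set), and whether the initial scoring counts toward the trial budget is not addressed. The name `steady_state_ga` is hypothetical.

```python
import random


def steady_state_ga(fitness, n_bits=72, pop_size=50, trials=1050, seed=0):
    """Toy steady-state GA: each trial replaces only the least fit individual."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    scored = [(fitness(ind), ind) for ind in population]
    for _ in range(trials):
        # Assumed operator: one-point crossover of two random parents.
        (_, mother), (_, father) = rng.sample(scored, 2)
        cut = rng.randrange(1, n_bits)
        child = mother[:cut] + father[cut:]
        # Steady state: the population changes one individual at a time.
        worst = min(range(pop_size), key=lambda i: scored[i][0])
        scored[worst] = (fitness(child), child)
    # The final output is the highest scoring binary string.
    return max(scored, key=lambda pair: pair[0])


# Placeholder fitness: the number of 1 bits in the string.
best_score, best_bits = steady_state_ga(sum)
print(best_score)
```

A generational algorithm would instead rebuild the whole population in each round; here the best individual found so far is never discarded.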

14
GenEx (continued)
  • GenEx may take a significant time to run (750
    times longer than C4.5).
  • GenEx was trained separately on different corpora
    (journals, e-mail and Web pages) in order to
    increase precision.
  • Average precision ± standard deviation:
      Training/testing   Phrases   GenEx            C4.5
      Journals            5        0.290 ± 0.247    0.280 ± 0.255
                         15        0.177 ± 0.130    0.170 ± 0.113
      Email               5        0.234 ± 0.205    0.161 ± 0.160
                         15        0.122 ± 0.105    0.100 ± 0.081
      NASA Web pages      5        0.206 ± 0.172    0.155 ± 0.159
                         15        0.118 ± 0.080    0.092 ± 0.068
      Summary             5        0.239 ± 0.187    0.181 ± 0.177
      (averages)         15        0.128 ± 0.089    0.110 ± 0.078
  • The question remains whether 29% precision
    is acceptable.
  • How else can automatic keyphrase extraction be
    tested?
15
Human Evaluation of GenEx keyphrases
  • A website explaining GenEx was created where the
    reader was asked to volunteer a URL for
    processing.
  • Keyphrases were extracted and then presented to
    the user to judge whether he/she found them Good
    or Bad or No Opinion.
  • Web-based human evaluation of keyword extraction:
      Number of voters: 205        Number of documents: 267
      Number of keyphrases: 1869   Max. documents per person: 5
      Good:        1159 (62.1%)
      Bad:          339 (18.1%)
      No opinion:   371 (19.9%)
  • From these voters, about 80% of the keyphrases
    were found to be acceptable.

16
Last Notes on Keyphrase Algorithms
  • Frank et al. (1999) developed Kea, which is a
    Bayesian approach to keyphrase extraction. Eibe
    Frank and his co-authors acknowledge the help of
    Peter Turney and NRC.
  • Kea is available through the internet. (See
    Weka.)
  • Turney believes that GenEx and Kea should give
    statistically similar results.
  • The work that was done on specialized procedural
    domain knowledge was the main element in the
    automation of keyword extraction.
  • Future Work
  • Would under-sampling or over-sampling help the
    machine learning process? (Since there is a class
    imbalance.)
  • A thesaurus of synonyms would be a welcome
    addition.
  • For specialized journals or Web pages (e.g.,
    Medline) a lexicon of frequently used keyphrases
    could be found.
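
As an illustration of the under-sampling question, the sketch below balances a training set by randomly discarding majority-class examples. It only shows what under-sampling means; it was not part of the reviewed work, and the function name `undersample` is hypothetical. The example sizes (6 keyphrases among 2500 candidates) follow the earlier worked example.

```python
import random


def undersample(examples, labels, seed=0):
    """Randomly drop majority-class examples until both classes are equal."""
    rng = random.Random(seed)
    positives = [x for x, y in zip(examples, labels) if y == 1]
    negatives = [x for x, y in zip(examples, labels) if y == 0]
    minority, majority = sorted((positives, negatives), key=len)
    # Keep a random subset of the majority class, as large as the minority.
    kept = rng.sample(majority, len(minority))
    minority_label = 1 if minority is positives else 0
    data = ([(x, minority_label) for x in minority]
            + [(x, 1 - minority_label) for x in kept])
    rng.shuffle(data)
    return data


# 6 keyphrases (class 1) among 2500 candidate phrases.
candidates = list(range(2500))
classes = [1] * 6 + [0] * 2494
balanced = undersample(candidates, classes)
print(len(balanced))
```

Over-sampling would go the other way and repeat the 6 positive cases instead of discarding negative ones.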

17
Bibliography
  • Extraction of Keyphrases from Text: Evaluation of
    Four Algorithms (Turney, 1997)
  • Learning Algorithms for Keyphrase Extraction
    (Turney, 1999)
  • Kea: Practical Automatic Keyphrase Extraction
    (Witten et al., 1999)