1
I256: Applied Natural Language Processing
Marti Hearst, Oct 2, 2006
2
Contents
  • Introduction and Applications
  • Types of summarization tasks
  • Basic paradigms
  • Single document summarization
  • Evaluation methods

3
Introduction
  • The problem: information overload
  • 4 billion URLs indexed by Google
  • 200 TB of data on the Web [Lyman and Varian 03]
  • Information is created every day in enormous
    amounts
  • One solution: summarization
  • Abstracts promote current awareness
  • save reading time
  • facilitate selection
  • facilitate literature searches
  • aid in the preparation of reviews
  • But what is an abstract??

4
Introduction
  • abstract:
  • brief but accurate representation of the
    contents of a document
  • goal:
  • take an information source, extract the most
    important content from it and present it to the
    user in a condensed form and in a manner
    sensitive to the user's needs.
  • compression:
  • the amount of text to present, or the ratio of the
    length of the summary to the length of the source.

5
History
  • The problem has been addressed since the
    '50s [Luhn 58]
  • Numerous methods are currently being suggested
  • Most methods still rely on 50s-70s algorithms
  • The problem is still hard, yet there are some
    applications
  • MS Word
  • www.newsinessence.com by Drago Radev's research
    group

6
(No Transcript)
7
MS Word AutoSummarize
8
Applications
  • Abstracts for Scientific and other articles
  • News summarization (mostly multiple document
    summarization)
  • Classification of articles and other written data
  • Web pages for search engines
  • Web access from PDAs, Cell phones
  • Question answering and data gathering

9
Types of Summaries
  • Indicative vs Informative
  • Informative: a substitute for the entire document
  • Indicative: gives an idea of what is there
  • Background
  • Does the reader have the needed prior knowledge?
  • Expert reader vs Novice reader
  • Query based or General
  • Query-based: a form is being filled in, and the
    questions should be answered
  • General: general-purpose summarization

10
Types of Summaries (input)
  • Single document vs multiple documents
  • Domain specific (chemistry) or general
  • Genre specific (newspaper items) or general

11
Types of Summaries (output)
  • extract vs abstract
  • Extracts: representative paragraphs/sentences/
    phrases/words, fragments of the original text
  • Abstracts: a concise summary of the central
    subjects in the document.
  • Research shows that sometimes readers prefer
    Extracts!
  • language chosen for summarization
  • format of the resulting summary
    (table/paragraph/key words)

12
Methods
  • Quantitative heuristics, manually scored
  • Machine-learning based statistical scoring
    methods
  • Higher semantic/syntactic structures
  • Network (graph) based methods
  • Other methods (rhetorical analysis, lexical
    chains, co-reference chains)
  • AI methods

13
Quantitative Heuristics
  • General method
  • score each entity (sentence, word), combine
    scores, choose the best sentence(s)
  • Scoring techniques
  • Word frequencies throughout the text (Luhn 58)
  • Position in the text (Edmundson 69, Lin & Hovy 97)
  • Title method (Edmundson 69)
  • Cue phrases in sentences (Edmundson 69)

14
Using Word Frequencies (Luhn 58)
  • Very first work in automated summarization
  • Assumptions
  • Frequent words indicate the topic
  • Frequent is measured with reference to the corpus
    frequency
  • Clusters of frequent words indicate summarizing
    sentence
  • Stemming based on similar prefix characters
  • Very common words and very rare words are ignored

15
Ranked Word Frequency
Zipf's curve
16
Word frequencies (Luhn 58)
  • Find consecutive sequences of high-weight
    keywords
  • Allow a certain number of gaps of low-weight
    terms
  • Sentences with the highest sum of cluster weights are
    chosen
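A minimal Python sketch of this cluster scoring (names and the gap limit are illustrative; the squared-count-over-span score roughly follows Luhn's significance factor). It assumes each sentence is already tokenized into lowercase words and that significant holds the high-frequency, non-stopword terms:

  # Find runs of significant words separated by at most max_gap
  # insignificant ones; score the sentence by its best cluster.
  def luhn_score(sentence_words, significant, max_gap=4):
      best, start, count, gap = 0.0, None, 0, 0
      for i, w in enumerate(sentence_words):
          if w in significant:
              if start is None:
                  start, count = i, 0
              count, gap = count + 1, 0
              best = max(best, count * count / (i - start + 1))
          elif start is not None:
              gap += 1
              if gap > max_gap:   # too many low-weight words: close the cluster
                  start, count, gap = None, 0, 0
      return best

Sentences would then be ranked by this score and the top ones extracted.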

17
Position in the text (Edmunson 69)
  • Claim: important sentences occur in specific
    positions
  • lead-based summary
  • inverse of position in the document works well for
    news (see the sketch below)
  • Important information occurs in specific sections
    of the document (introduction/conclusion)
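As a rough illustration (not from the slides), a lead-biased position feature can be as simple as:

  # Earlier sentences score higher, matching the lead-based heuristic
  # that works well for news.
  def position_score(index, n_sentences):
      return (n_sentences - index) / n_sentences   # 1.0 for the first sentence

Section-sensitive variants would instead boost sentences in the introduction and conclusion.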

18
Title method (Edmunson 69)
  • Claim: the title of a document indicates its content
  • Unless editors are being cute
  • Not true for novels usually
  • What about blogs ?
  • words in title help find relevant content
  • create a list of title words, remove stop words
  • Use those as keywords in order to find important
    sentences
  • (for example with Luhn's methods)
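A minimal sketch of the title method, assuming simple tokenization and a tiny illustrative stop-word list:

  # Count how many (non-stopword) title words a sentence contains;
  # higher counts suggest more relevant content.
  STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "on", "is"}

  def title_score(sentence_words, title_words):
      keywords = {w.lower() for w in title_words} - STOPWORDS
      return sum(1 for w in sentence_words if w.lower() in keywords)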

19
Cue phrases method (Edmunson 69)
  • Claim: important sentences contain cue
    words / indicative phrases
  • The main aim of the present paper is to
    describe (IND)
  • The purpose of this article is to review (IND)
  • In this report, we outline (IND)
  • Our investigation has shown that (INF)
  • Some words are considered bonus, others stigma
  • bonus: comparatives, superlatives, conclusive
    expressions, etc.
  • stigma: negatives, pronouns, etc.
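A minimal sketch of a cue-word score; the bonus/stigma sets below are tiny illustrative stand-ins, not Edmundson's actual dictionaries:

  # Bonus words raise a sentence's score, stigma words lower it.
  BONUS = {"significant", "greatest", "best", "purpose", "conclude", "conclusion"}
  STIGMA = {"hardly", "impossible", "he", "she", "they"}

  def cue_score(sentence_words):
      words = [w.lower() for w in sentence_words]
      return sum(w in BONUS for w in words) - sum(w in STIGMA for w in words)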

20
Feature combination (Edmundson 69)
  • Linear contribution of 4 features
  • title, cue, keyword, position
  • the weights are adjusted using training data with
    any minimization technique
  • Evaluated on a corpus of 200 chemistry articles
  • Length ranged from 100 to 3900 words
  • Judges were told to extract 25% of the sentences,
    to maximize coherence and minimize redundancy.
  • Features
  • Position (sensitive to types of headings for
    sections)
  • cue
  • title
  • keyword
  • Best results obtained with
  • cue + title + position (see the sketch below)
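A minimal sketch of the linear combination and sentence selection. The per-sentence feature values are assumed to be computed elsewhere (e.g. by scorers like those sketched above), and the weights shown are placeholders that would be tuned on training data:

  # Combine per-sentence feature scores linearly, keep the top-k
  # sentences, then restore document order for readability.
  def combine(features, weights=(1.0, 1.0, 0.5, 1.0)):
      # features = (cue, title, keyword, position) for one sentence
      return sum(w * f for w, f in zip(weights, features))

  def select(feature_table, k):
      scored = [(i, combine(f)) for i, f in enumerate(feature_table)]
      best = sorted(scored, key=lambda x: x[1], reverse=True)[:k]
      return sorted(i for i, _ in best)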

21
Bayesian Classifier (Kupiec et al. 95)
  • Statistical learning method
  • Feature set
  • sentence length
  • S > 5
  • fixed phrases
  • 26 manually chosen
  • paragraph
  • sentence position in paragraph
  • thematic words
  • binary: whether sentence is included in a manual
    extract
  • uppercase words
  • not common acronyms
  • Corpus
  • 188 document/summary pairs from scientific
    journals

22
Bayesian Classifier (Kupiec et al. 95)
  • Uses a Bayesian classifier
  • Assuming statistical independence of the features
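The equation on this slide was not transcribed; in the standard naive Bayes form used by Kupiec et al., the probability that a sentence s belongs in the summary S, given its feature values F_1, ..., F_k, is

  P(s \in S \mid F_1, \dots, F_k) \approx \frac{P(s \in S) \prod_{j=1}^{k} P(F_j \mid s \in S)}{\prod_{j=1}^{k} P(F_j)}

with each probability on the right estimated from a training corpus, as the next slide notes.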

23
Bayesian Classifier (Kupiec et al. 95)
  • Each probability is calculated empirically from a
    corpus
  • Higher-probability sentences are chosen to be in
    the summary (see the scoring sketch below)
  • Performance
  • For 25% summaries, 84% precision
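A minimal Python sketch of applying that formula; the data structures are illustrative, and the per-feature probabilities are assumed to have been estimated by counting over a corpus of document/extract pairs:

  import math

  # Log-probability (up to a constant) that a sentence belongs in the summary.
  # p_f_given_s[j][v] = P(feature j has value v | sentence in summary)
  # p_f[j][v]         = P(feature j has value v)
  def kupiec_score(feature_values, p_f_given_s, p_f, p_in_summary):
      score = math.log(p_in_summary)
      for j, v in enumerate(feature_values):
          score += math.log(p_f_given_s[j][v]) - math.log(p_f[j][v])
      return score

Sentences are ranked by this score and the highest-scoring ones are kept until the target compression (e.g. 25%) is reached.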

24
Evaluation methods
  • When a manual summary is available
  • 1. choose a granularity (clause; sentence;
    paragraph),
  • 2. create a similarity measure for that
    granularity (word overlap; multi-word overlap;
    perfect match),
  • 3. measure the similarity of each unit in the new
    summary to the most similar unit(s) in the manual summary,
  • 4. measure Recall and Precision (a minimal
    sentence-level sketch follows this list).
  • Otherwise
  • 1. Intrinsic: how good is the summary as a
    summary?
  • 2. Extrinsic: how well does the summary help the
    user?
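A minimal sketch of steps 3-4 under the simplest choices above (sentence granularity, exact-match similarity):

  # Compare a system extract against a manual extract at the sentence level.
  def precision_recall(system_sents, manual_sents):
      system, manual = set(system_sents), set(manual_sents)
      overlap = len(system & manual)
      precision = overlap / len(system) if system else 0.0
      recall = overlap / len(manual) if manual else 0.0
      return precision, recall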

25
Intrinsic measures
  • Intrinsic measures (glass-box): how good is the
    summary as a summary?
  • Problem: how do you measure the goodness of a
    summary?
  • Studies compare to an ideal (Edmundson 69; Kupiec
    et al. 95; Salton et al. 97; Marcu 97) or
    supply criteria: fluency, informativeness,
    coverage, etc. (Brandow et al. 95).
  • Summary evaluated on its own or by comparing it with
    the source
  • Is the text cohesive and coherent?
  • Does it contain the main topics of the document?
  • Are important topics omitted?

26
Extrinsic measures
  • (Black-box): how well does the summary help a
    user with a task?
  • Problem: does summary quality correlate with
    performance?
  • Studies: GMAT tests (Morris et al. 92); news
    analysis (Miike et al. 94); IR (Mani and
    Bloedorn 97); text categorization (SUMMAC 98;
    Sundheim 98).
  • Evaluation in a specific task
  • Can the summary be used instead of the document?
  • Can the document be classified by reading the
    summary?
  • Can we answer questions by reading the summary?

27
The Document Understanding Conference (DUC)
  • This is really the Text Summarization Competition
  • Started in 2001
  • Task and Evaluation (for 2001-2004)
  • Various target sizes were used (10-400 words)
  • Both single and multiple-document summaries
    assessed
  • Summaries were manually judged for both content
    and readability.
  • Each peer (human or automatic) summary was
    compared against a single model summary
  • using SEE (http://www.isi.edu/~cyl/SEE/)
  • estimates the percentage of information in the
    model that was covered in the peer.
  • Also used ROUGE (Lin 04) in 2004
  • Recall-Oriented Understudy for Gisting Evaluation
  • Uses counts of n-gram overlap between candidate
    and gold-standard summary, assumes fixed-length
    summaries
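A minimal sketch of a ROUGE-N-style recall score against a single reference summary; the real ROUGE toolkit adds multiple references, stemming, and other options:

  from collections import Counter

  def ngrams(words, n):
      return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

  # Clipped n-gram overlap divided by the number of reference n-grams (recall).
  def rouge_n(candidate_words, reference_words, n=2):
      cand = Counter(ngrams(candidate_words, n))
      ref = Counter(ngrams(reference_words, n))
      overlap = sum(min(c, ref[g]) for g, c in cand.items())
      return overlap / sum(ref.values()) if ref else 0.0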

28
The Document Understanding Conference (DUC)
  • Made a big change in 2005
  • An extrinsic evaluation was proposed but rejected (write
    a natural-disaster summary)
  • Instead, a complex question-focused summarization
    task that required summarizers to piece together
    information from multiple documents to answer a
    question or set of questions as posed in a DUC
    topic.
  • Also indicated a desired granularity of
    information

29
The Document Understanding Conference (DUC)
  • Evaluation metrics for the new task
  • Grammaticality
  • Non-redundancy
  • Referential clarity
  • Focus
  • Structure and Coherence
  • Responsiveness (content-based evaluation)
  • This was a difficult task to do well in.

30
Let's make a summarizer!
  • Each person (or pair) writes code for one small
    part of the problem, using Kupiec et al.'s method.
  • We'll combine the parts in class.

31
Next Time
  • More on Bayesian classification
  • Other summarization approaches (Marcu paper)
  • Multi-document summarization (Goldstein et al.
    paper)
  • In-class summarizer!