Lecture 7: Abstract Generation

1
Lecture 7: Abstract Generation
  • Prof. Xiaotie Deng
  • Department of Computer Science

2
Outline
  • Introduction
  • Luhn's Method
  • Salton's Rank of Documents
  • Advanced Approaches
    • Statistical
    • Linguistic

3
Introduction: Role of Summary in IR
  • Reduces the processing time
  • Identified as a critical research area with
    increasing attention from the commercial sectors.

4
Introduction: Advantage of Abstract
  • Compare the abstract with queries instead of the
    full text.
  • The similarity of two documents is defined as the
    similarity of their abstracts.
  • The similarity of a query with a document is the
    similarity of the query with the abstract of the
    document.
  • Improvement in performance is related to
    • the ratio of document size to abstract size
    • the ratio of words in the document to words in
      the abstract

5
Luhn's Method: Introduction
  • Establish a set of significant words in a file.
  • Measure the significance of sentences using the
    significant words.
  • The automatic abstract for a document consists
    of its highest-ranked sentences.
  • H.P. Luhn, The Automatic Creation of Literature
    Abstracts, IBM Journal of Research and
    Development 2(2), 159-165, 1958.

6
Luhn's Method: Arguments (Page 1)
  • The more a writer repeats a word, the more the
    writer emphasizes it. This may be taken as an
    indicator of significance.
  • The more often certain words are found in each
    other's company, the more significance may be
    attributed to each of these words.
  • Certain common words must be present to tie
    other words together, but they are not
    significant (stop words).

7
Luhn's Method: Arguments (Page 2)
  • Thesaurus not used
  • Even if the author makes a reasonable effort to
    select synonyms for stylistic reasons, he soon
    runs out of legitimate alternatives and falls
    into repetition if the notion being expressed was
    potentially significant in the first place.

8
Luhn's Method: Arguments (Consequences)
  • The method avoids linguistic considerations such
    as grammar and syntax.
  • It does not differentiate between
    • differ
    • difference
    • different
    • differently
  • This implies a stemming method.

9
Luhn's Method: Algorithm
  • Apply a stemming method to the words
  • Sort the words in descending order of frequency
  • Remove all words with frequencies higher than a
    cutoff value
  • Alternatively, remove common words using a stop
    list
  • Remove all words with frequencies lower than
    another cutoff value
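
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list, the suffix table and the `toy_stem` helper are hypothetical stand-ins (a real system would use a proper stemmer), and the function and parameter names are not from the lecture.

```python
import re
from collections import Counter
from typing import Dict, Optional

# Hypothetical stop list and suffix table, for illustration only.
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "is", "its", "was"}
SUFFIXES = ("ing", "ers", "er", "es", "s")


def toy_stem(word: str) -> str:
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def significant_words(
    text: str, low_cutoff: int, high_cutoff: Optional[int] = None
) -> Dict[str, int]:
    """Return stems whose frequency lies between the two cutoffs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words, then stem what is left.
    stems = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(stems)
    return {
        stem: n
        for stem, n in counts.most_common()  # descending frequency
        if n >= low_cutoff and (high_cutoff is None or n <= high_cutoff)
    }
```

Here the stop list plays the role of the upper cutoff; passing `high_cutoff` as well applies both filters.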

10
Luhn's Method Concept: Word-Frequency Diagram
[Figure: word-frequency diagram. Labels: frequencies; significance; individual words in the order of frequency.]

11
Luhn's Method Concept: Significant Sentences
  • Wherever the greatest number of significant
    different words are found in great physical
    proximity to each other, the probability is very
    high that the information being conveyed is most
    representative of the article.

12
Luhn's Method Concept: Relative Significance of Sentences
  • Find the clusters of significant words in a
    sentence
    • within a cluster, significant words are
      separated by no more than 5 non-significant
      words.
  • Calculate the significance factor of each cluster
    • the square of the number of significant words in
      the cluster, divided by the total number of words
      in the cluster.
  • The highest factor among the clusters is taken as
    the measure for the sentence.
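
A minimal sketch of this sentence score, assuming the sentence is already tokenized and stemmed so that words can be compared directly with the significant-word set. The function name and the `max_gap` parameter are illustrative choices, not part of Luhn's paper.

```python
from typing import List, Set


def sentence_significance(
    words: List[str], significant: Set[str], max_gap: int = 5
) -> float:
    """Best cluster score: (number of significant words)^2 / cluster length."""
    # Positions of the significant words in the sentence.
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0

    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Gap is small enough: extend the current cluster.
            count += 1
        else:
            # Gap too wide: close the cluster and start a new one.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    # Score the last open cluster as well.
    return max(best, count ** 2 / (prev - start + 1))
```

A cluster always starts and ends on a significant word, so its length is the distance between its first and last significant word plus one. Note that a single-word cluster scores 1 here; a caller may prefer to discard such clusters.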

13
Luhn's Method Example: File
  • The Department of Computer Science was
    established in 1984 and has since evolved from a
    primarily teaching-oriented department in its
    Polytechnic days into one which excels in both
    teaching and research within the Faculty of
    Science and Engineering of the now City
    University.
  • The Department launched its first BSc(Hons) in
    Computer Studies in 1987, followed by the MSc in
    Computer Science which was started in 1991.
  • The Department also produced its first PhD
    graduate in 1994.
  • In addition to offering traditional courses
    such as foundations of computer science, computer
    architecture and software engineering, our
    curriculum also exposes our students to the
    latest advances in distributed databases,
    parallel computing, computer graphics, internet
    programming, multimedia systems and high speed
    networking.
  • Students will also have the opportunity, as part
    of their learning, to undertake a major design
    and development project in new areas such as
    electronic commerce, virtual reality, multimedia
    information retrieval, computer vision,
    object-oriented and distributed databases,
    data-mining and webcasting.
  • The Department is also committed to continuing
    education, particularly, in the applications of
    Information Technology (IT) in education.
  • This is reflected in the two part-time
    programmes initiated by the Department for
    in-service primary and secondary school teachers.
  • These programmes aim to equip school teachers
    with the necessary fundamental knowledge and
    skills needed to apply IT in teaching and school
    management as well as to exploit multimedia
    technology in courseware preparation and
    delivery.
  • All the teaching within the Department is
    subject to stringent quality assurance procedures
    to ensure the highest quality of instruction.

14
Luhn's Method Example: Frequencies of words (not in stop list)
  • comput: 12
  • department: 9
  • teach: 6
  • science: 5
  • multimedia: 4
  • 3 each: stud, distrib, program, technology,
    school, high
  • 2 each: primary, oriented, research, engineering,
    course, advanc, database, graphics, information,
    system, network, major, education, quality,
    include
  • 1 each: all others

15
Luhn's Method Example: Significant Words
  • Set the cutoff value at 4 to determine the
    significant words
    • comput: 12
    • department: 9
    • teach: 6
    • science: 5
    • multimedia: 4

16
Luhn's Method Example: Mark significant words
  • The example file above is repeated with every
    occurrence of a significant word (comput,
    department, teach, science, multimedia) marked.

17
Luhn's Method Example: Remove non-significant sentences
  • The Department of Computer Science was
    established in 1984 and has since evolved from a
    primarily teaching-oriented department in its
    Polytechnic days into one which excels in both
    teaching and research within the Faculty of
    Science and Engineering of the now City
    University.
  • The Department launched its first BSc(Hons) in
    Computer Studies in 1987, followed by the MSc in
    Computer Science which was started in 1991.
  • The Department also produced its first PhD
    graduate in 1994.
  • In addition to offering traditional courses
    such as foundations of computer science, computer
    architecture and software engineering, our
    curriculum also exposes our students to the
    latest advances in distributed databases,
    parallel computing, computer graphics, internet
    programming, multimedia systems and high speed
    networking.
  • Students will also have the opportunity, as part
    of their learning, to undertake a major design
    and development project in new areas such as
    electronic commerce, virtual reality, multimedia
    information retrieval, computer vision,
    object-oriented and distributed databases,
    data-mining and webcasting.
  • This is reflected in the two part-time
    programmes initiated by the Department for
    in-service primary and secondary school teachers.
  • These programmes aim to equip school teachers
    with the necessary fundamental knowledge and
    skills needed to apply IT in teaching and school
    management as well as to exploit multimedia
    technology in courseware preparation and
    delivery.
  • All the teaching within the Department is
    subject to stringent quality assurance procedures
    to ensure the highest quality of instruction.

18
Luhn's Method Example: Find clusters
  • Department of Computer Science
  • teaching-oriented department teaching
    Science
  • The Department launched its first BSc(Hons) in
    Computer Computer Science Department
  • computer science, computer
  • computing, computer graphics, internet
    programming, multimedia
  • multimedia information retrieval, computer
  • Department
  • teachers. teachers
  • teaching multimedia
  • teaching within the Department
  • Department maintains state-of-the-art
    computing
  • supercomputer,
  • The Department multimedia and image
    computing, distributed and real-time systems,
    theoretical computer science,

19
Luhn's Method Example: Remove single-word clusters
  • Department of Computer Science
  • teaching-oriented department
  • Department launched its first BSc(Hons) in
    Computer
  • Computer Science
  • computer science, computer
  • computing, computer graphics, internet
    programming, multimedia
  • multimedia information retrieval, computer
  • teaching within the Department
  • Department maintains state-of-the-art computing
  • multimedia and image computing, distributed and
    real-time systems, theoretical computer science,

20
Luhn's Method Example: Calculate the significance of clusters
  • Department of Computer Science: 3²/4 = 2.25
  • teaching-oriented department: 2²/2 = 2
  • Department launched its first BSc(Hons) in
    Computer: < 2
  • Computer Science: 2²/2 = 2
  • computer science, computer: 3²/3 = 3
  • computing, computer graphics, internet
    programming, multimedia: < 2
  • multimedia information retrieval, computer: < 2
  • teaching within the Department: < 2
  • Department maintains state-of-the-art computing:
    < 2
  • multimedia and image computing, distributed and
    real-time systems, theoretical computer science:
    4²/11 < 2
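
The arithmetic on this slide can be checked with a short script. Each score is read as (number of significant words)² divided by (total words in the cluster). The word counts below are my own reading of the clusters (for instance, "teaching-oriented" is counted as one word, as the slide's score implies), so treat them as assumptions.

```python
# (significant words, total words) per cluster; counts are read off the slide.
clusters = {
    "Department of Computer Science": (3, 4),
    "teaching-oriented department": (2, 2),
    "computer science, computer": (3, 3),
    "Department launched its first BSc(Hons) in Computer": (2, 7),
}

# Significance factor: square of the significant count over the cluster length.
scores = {text: sig ** 2 / total for text, (sig, total) in clusters.items()}

# List the clusters from highest to lowest score.
for text, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:5.2f}  {text}")
```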

21
Luhn's Method Example: Two highest-ranked clusters
  • Department of Computer Science: 3²/4 = 2.25
  • computer science, computer: 3²/3 = 3

22
Luhn's Method Example: Abstract (Two Highest-Ranked Sentences)
  • The Department of Computer Science was
    established in 1984 and has since evolved from a
    primarily teaching-oriented department in its
    Polytechnic days into one which excels in both
    teaching and research within the Faculty of
    Science and Engineering of the now City
    University.
  • In addition to offering traditional courses
    such as foundations of computer science, computer
    architecture and software engineering, our
    curriculum also exposes our students to the
    latest advances in distributed databases,
    parallel computing, computer graphics, internet
    programming, multimedia systems and high speed
    networking.

23
Luhn's Method: Further Improvement 1
  • Use synonyms
    • Two words with the same meaning
  • Replace words with the same meaning in a text by
    one word, to increase its chance of becoming a
    significant word
  • Example: search for "car" in WordNet
  • http://www.cogsci.princeton.edu/cgi-bin/webwn1.7.1
    ?stage2wordcarposnumber1searchtypenumber26
    sensesshowglosses1

24
Luhn's Method: Further Improvement 2
  • Use hypernyms (X is a kind of ...)
  • Two or more synonyms in the same document with
    the same hypernym can be replaced by their
    common hypernym.
  • Then we may proceed as before to construct the
    abstract
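
A small sketch of the hypernym idea, assuming a hand-written lookup table instead of a real thesaurus. The `HYPERNYMS` table and the function name are hypothetical; a real system might obtain the relation from WordNet. Terms are replaced only when at least two distinct terms in the text share the hypernym.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical hand-written hypernym table, for illustration only.
HYPERNYMS: Dict[str, str] = {
    "car": "vehicle",
    "truck": "vehicle",
    "bus": "vehicle",
    "apple": "fruit",
    "pear": "fruit",
}


def merge_by_hypernym(words: List[str], table: Dict[str, str]) -> List[str]:
    """Replace words by their hypernym when 2+ distinct words share it."""
    members: Dict[str, Set[str]] = {}
    for w in set(words):
        if w in table:
            members.setdefault(table[w], set()).add(w)
    # Hypernyms backed by at least two different words of the text.
    shared = {h for h, ws in members.items() if len(ws) >= 2}
    return [table[w] if table.get(w) in shared else w for w in words]


words = ["car", "truck", "car", "apple", "road"]
merged = merge_by_hypernym(words, HYPERNYMS)
print(Counter(merged).most_common(1))
```

After merging, the counts of "car" and "truck" are pooled under "vehicle", which raises its chance of passing the frequency cutoff.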

25
Salton's Rank of Documents
  • Vector space model of documents
  • Documents are ranked, with respect to each set of
    query keywords, according to the weights of those
    keywords for those documents [Sa].
  • [Sa] G. Salton, The SMART Retrieval System:
    Experiments in Automatic Document Processing,
    Prentice Hall, 1971.

26
Salton Background: Term-Document Association Matrix (Illustration)
[Figure: term-document association matrix. Labels: terms; documents; weight of a term in the document.]

27
Salton Background: Term-Document Association Matrix (Deciding the weight)
  • Combine two factors in the document-term weight
  • tf_ij: frequency of term j in document i
  • df_j: document frequency of term j
    • number of documents containing term j
  • idf_j: inverse document frequency of term j
    • idf_j = log2(N / df_j), where N is the number of
      documents in the collection
  • Inverse document frequency is an indication of a
    term's value as a document discriminator.

28
Salton Background: Term-Document Association Matrix (Tf-idf term weight)
  • A typical combined term importance indicator
    • w_ij = tf_ij × idf_j = tf_ij × log2(N / df_j)
  • A term that occurs frequently in one document but
    rarely in the rest of the collection has a high
    weight in that document
  • If df_j = N (term j appears in every document),
    then idf_j = log2(1) = 0
  • Therefore, w_ij = 0 for all i. This is the
    interpretation of a stop list:
    • a term that appears in every document has no
      information value.
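
The tf-idf weight can be computed directly from its definition. The sketch below assumes each document is already a list of terms and uses raw counts for the term frequency; the function name is illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """For each document i, map term j to w_ij = tf_ij * log2(N / df_j)."""
    n_docs = len(docs)
    # df_j: number of documents that contain term j.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: raw count of term j in document i
        weights.append({t: tf[t] * math.log2(n_docs / df[t]) for t in tf})
    return weights


docs = [["the", "cat", "cat"], ["the", "dog"], ["the", "cat"], ["the", "bird"]]
weights = tf_idf(docs)
print(weights[0])
```

In this toy collection "the" occurs in every document, so its weight is 0 everywhere, which matches the stop-list interpretation.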

29
Salton: Significance of Sentence
  • Use a cutoff value to decide the significant
    words
  • Clusters are defined as before
  • Significance of a cluster
    • square of the total weight of the significant
      words in the cluster, divided by the total
      weight of all words in the cluster
  • Return the maximum cluster significance in a
    sentence
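
A sketch of this weighted variant, under two assumptions that the slide does not spell out: a word counts as significant when its weight is at least the cutoff, and words without a known weight contribute 0. Names and the `max_gap` default of 5 are illustrative.

```python
from typing import Dict, List


def weighted_sentence_significance(
    words: List[str],
    weights: Dict[str, float],
    cutoff: float,
    max_gap: int = 5,
) -> float:
    """Best cluster score: (weight of significant words)^2 / (weight of all words)."""

    def weight(word: str) -> float:
        return weights.get(word, 0.0)  # assumption: unknown words weigh 0

    def score(start: int, end: int) -> float:
        span = words[start : end + 1]
        total = sum(weight(w) for w in span)
        significant = sum(weight(w) for w in span if weight(w) >= cutoff)
        return significant ** 2 / total if total > 0 else 0.0

    # Positions of the significant words in the sentence.
    positions = [i for i, w in enumerate(words) if weight(w) >= cutoff]
    if not positions:
        return 0.0

    best = 0.0
    start = prev = positions[0]
    for pos in positions[1:]:
        if pos - prev - 1 > max_gap:
            # Gap too wide: close the current cluster, start a new one.
            best = max(best, score(start, prev))
            start = pos
        prev = pos
    return max(best, score(start, prev))
```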

30
Salton: Abstract
  • Method 1 (usual)
    • Choose the most significant sentences as the
      abstract.
  • Method 2 (alternative)
    • Choose a set of significant words and their
      weights in a document as the abstract for the
      document.
    • Question: which set of words to choose? (The
      most significant ones may be highly correlated.)

31
Advanced Approaches
  • Two types
  • Statistical features: relevance feedback and
    user interest profile
  • Linguistic features: experimental identification
    of linguistic features important in summaries

32
Advanced Approaches: Statistical: Relevance Feedback
  • Use of user feedback to help summarize a
    document
  • G. Salton and C. Buckley, Improving retrieval
    performance by relevance feedback, Journal of
    the American Society for Information Science
    41 (1990) 288-297.
  • Sumner, R. G., Jr., Yang, K., Akers, R., & Shaw,
    W. M., Jr. (1998). Interactive retrieval using
    IRIS: TREC-6 experiments. In E. M. Voorhees & D.
    K. Harman (Eds.), The Sixth Text REtrieval
    Conference (TREC-6) (NIST Spec. Publ. 500-240,
    pp. 711-734). Washington, DC: U.S. Government
    Printing Office.

33
Advanced Approaches: Linguistic
  • Sentences are ranked for potential inclusion in
    the summary using linguistic features derived
    from an analysis of news-wire summaries.
  • V. Mittal, M. Kantrowitz, J. Goldstein,
    J. Carbonell, Selecting Text Spans for Document
    Summaries: Heuristics and Metrics

34
Advanced Approaches: Linguistic: Positive Features
  • Greater average word length
  • Thematic phrases
    • finally
    • in conclusion
  • Density of related words
    • words that have multiple related terms in the
      document, such as synonyms, hypernyms and
      antonyms

35
Advanced Approaches: Linguistic: Negative Features
  • Words and phrases common in direct or indirect
    quotation
    • according, adding, said, and other verbs related
      to communication
  • Informal and imprecise terms: got, really, use
  • Auxiliary verbs: was, could, did
  • Honorifics: Dr., Mr., Mrs.
  • Negations: don't, no, never
  • Integers
  • Evaluative and qualifying words: often, about,
    significant
  • Prepositions: at, by, for, of, in, to, with

36
Advanced Approaches: Relevant Novelty
  • Carbonell and Goldstein proposed relevant novelty
    as a measure (independent of relevance) in
    automatic document summarization to reduce
    redundancy.
  • J. Carbonell, J. Goldstein, The Use of MMR,
    Diversity-Based Reranking for Reordering
    Documents and Producing Summaries, SIGIR '98,
    335-336.

37
Advanced Approaches: Relevant Novelty: Maximal Marginal Relevance Criteria (Page 1)
  • Documents are selected one by one
  • Given a query Q, select a document D that
    maximizes the difference of two factors
    • similarity of Q and D (times λ)
    • maximum similarity of D with documents already
      selected (times 1 - λ)
  • We want D to be similar to the query and not
    similar to the documents already chosen.

38
Advanced Approaches: Relevant Novelty: Maximal Marginal Relevance Criteria (Page 2)
  • Sequentially select documents
  • Given a query Q, select a document D from the
    not-yet-chosen documents to maximize the
    difference of two factors
    • λ · sim(Q, D)
    • (1 - λ) · max_i sim(D, D_i),
  • where the D_i are the already chosen documents.
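
The greedy selection can be sketched as follows. The trade-off parameter is written `lam`, and the similarity function is passed in by the caller. The word-overlap (Jaccard) similarity below is only a toy stand-in chosen for this example, not the measure used by Carbonell and Goldstein.

```python
from typing import Callable, List


def mmr_select(
    query: str,
    candidates: List[str],
    sim: Callable[[str, str], float],
    k: int,
    lam: float = 0.5,
) -> List[str]:
    """Greedily pick k items maximizing lam*sim(Q,D) - (1-lam)*max_i sim(D,D_i)."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def mmr_score(d: str) -> float:
            # Redundancy: highest similarity to anything already chosen.
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * sim(query, d) - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


def jaccard(a: str, b: str) -> float:
    """Toy similarity: shared words over all words (illustration only)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


docs = [
    "automatic text summary methods",
    "methods summary text automatic",
    "summary of news",
]
print(mmr_select("automatic text summary", docs, jaccard, k=2, lam=0.5))
```

With `lam=0.5` the second pick skips the near-duplicate of the first document in favour of the less relevant but novel one; with `lam=1.0` the ranking reduces to plain relevance.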

39
Summary
  • Introduction
  • Luhn's Method
  • Salton's Rank of Documents
  • Advanced Approaches
    • Statistical
    • Linguistic