Title: Lecture 7: Abstract Generation
1. Lecture 7: Abstract Generation
- Prof. Xiaotie Deng
- Department of Computer Science
2. Outline
- Introduction
- Luhn's Method
- Salton's Rank of Documents
- Advanced Approaches
  - Statistical
  - Linguistic
3. Introduction: Role of Summary in IR
- Reduces the processing time
- Identified as a critical research area with increasing attention from the commercial sectors.
4. Introduction: Advantage of Abstract
- Compare the abstract with queries instead of the full text.
- The similarity of two documents is defined as the similarity of their abstracts.
- The similarity of a query with a document is the similarity of the query with the abstract of the document.
- The improvement in performance is related to
  - the ratio of document size to abstract size
  - the ratio of words in the document to words in the abstract
5. Luhn's Method: Introduction
- Establish a set of significant words in a file.
- Measure the significance of sentences using the significant words.
- The automatic abstract for a document consists of its highest-ranked sentences.
- H. P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal of Research and Development 2(2), 159-165, 1958.
6. Luhn's Method: Arguments, Page 1
- The more a writer repeats a word, the more the writer emphasizes it. This may be taken as an indicator of significance.
- The more often certain words are found in each other's company, the more significance may be attributed to each of these words.
- Certain common words must be present to tie other words together, but they are not significant (stop words).
7. Luhn's Method: Arguments, Page 2
- A thesaurus is not used.
- Even if the author makes a reasonable effort to select synonyms for stylistic reasons, he soon runs out of legitimate alternatives and falls into repetition if the notion being expressed was potentially significant in the first place.
8. Luhn's Method: Arguments, Consequences
- The method avoids linguistic implications such as grammar and syntax.
- It does not differentiate between
  - differ
  - difference
  - different
  - differently
- This implies a stemming method.
9. Luhn's Method: Algorithm
- Apply a stemming method to the words.
- Sort the words in descending order of frequency.
- Remove all words with frequencies higher than an upper cutoff value.
- Alternatively, remove common words using a stop list.
- Remove all words with frequencies lower than another (lower) cutoff value.
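The steps of the algorithm can be sketched in Python as follows. This is only an illustration: `crude_stem` is a hypothetical stand-in for a real stemmer, and the function name, parameter names and tokenization rule are assumptions, not part of Luhn's paper.

```python
import re
from collections import Counter


def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def significant_words(text, low_cutoff, high_cutoff=None,
                      stop_words=frozenset(), stem=crude_stem):
    """Return {stem: frequency} for the significant words, most frequent first."""
    # Assumed tokenization: lowercase runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words (if a stop list is given), then count stems.
    counts = Counter(stem(t) for t in tokens if t not in stop_words)
    # most_common() sorts the stems in descending order of frequency.
    return {
        word: freq
        for word, freq in counts.most_common()
        if freq >= low_cutoff and (high_cutoff is None or freq <= high_cutoff)
    }
```

For example, `significant_words(text, low_cutoff=4, stop_words=my_stop_list)` keeps the stems that occur at least 4 times, where `my_stop_list` is whatever stop list is in use.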
10. Luhn's Method: Concept, Word-Frequency Diagram
[Figure: word-frequency diagram. Labels: frequencies, significance. Horizontal axis: individual words in the order of frequency.]
11. Luhn's Method: Concept, Significant Sentences
- Wherever the greatest number of significant different words are found in great physical proximity to each other, the probability is very high that the information being conveyed is most representative of the article.
12. Luhn's Method: Concept, Relative Significance of Sentences
- Obtain a cluster of significant words in a sentence
  - consecutive significant words in a cluster are separated by no more than 5 non-significant words.
- Calculate the significance factor of each cluster
  - the square of the number of significant words in the cluster, divided by the total number of words in the cluster.
- The highest factor among the clusters is taken as the measure for the sentence.
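A minimal Python sketch of this scoring rule follows. It assumes the sentence is already tokenized and that the tokens have been normalized (lowercased, stemmed) the same way as the significant-word set; the function names are illustrative.

```python
def cluster_scores(tokens, significant, max_gap=5):
    """Significance factor of every cluster of significant words in one sentence."""
    positions = [i for i, tok in enumerate(tokens) if tok in significant]
    if not positions:
        return []
    scores = []
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Few enough non-significant words in between: same cluster.
            count += 1
        else:
            # Gap too large: close the current cluster and start a new one.
            scores.append(count * count / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    scores.append(count * count / (prev - start + 1))
    return scores


def sentence_significance(tokens, significant, max_gap=5):
    """The highest cluster factor is the measure for the sentence (0.0 if none)."""
    return max(cluster_scores(tokens, significant, max_gap), default=0.0)
```

A cluster runs from its first to its last significant word, so the cluster "department of computer science" with three significant words out of four scores 3 × 3 / 4 = 2.25.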
13. Luhn's Method: Example, File
- The Department of Computer Science was established in 1984 and has since evolved from a primarily teaching-oriented department in its Polytechnic days into one which excels in both teaching and research within the Faculty of Science and Engineering of the now City University.
- The Department launched its first BSc(Hons) in Computer Studies in 1987, followed by the MSc in Computer Science which was started in 1991.
- The Department also produced its first PhD graduate in 1994.
- In addition to offering traditional courses such as foundations of computer science, computer architecture and software engineering, our curriculum also exposes our students to the latest advances in distributed databases, parallel computing, computer graphics, internet programming, multimedia systems and high speed networking.
- Students will also have the opportunity, as part of their learning, to undertake a major design and development project in new areas such as electronic commerce, virtual reality, multimedia information retrieval, computer vision, object-oriented and distributed databases, data-mining and webcasting.
- The Department is also committed to continuing education, particularly, in the applications of Information Technology (IT) in education.
- This is reflected in the two part-time programmes initiated by the Department for in-service primary and secondary school teachers.
- These programmes aim to equip school teachers with the necessary fundamental knowledge and skills needed to apply IT in teaching and school management as well as to exploit multimedia technology in courseware preparation and delivery.
- All the teaching within the Department is subject to stringent quality assurance procedures to ensure the highest quality of instruction.
14. Luhn's Method: Example, Frequencies of Words (Not in Stop List)
- comput: 12
- department: 9
- teach: 6
- science: 5
- multimedia: 4
- 3: stud, distrib, program, technology, school, high
- 2: primary, oriented, research, engineering, course, advanc, database, graphics, information, system, network, major, education, quality, include
- 1: all others
15. Luhn's Method: Example, Significant Words
- Set the cutoff value at 4 to determine the significant words
  - comput: 12
  - department: 9
  - teach: 6
  - science: 5
  - multimedia: 4
16. Luhn's Method: Example, Mark Significant Words
- Significant words (stems comput, department, teach, science, multimedia) are shown in [brackets].
- The [Department] of [Computer] [Science] was established in 1984 and has since evolved from a primarily [teaching]-oriented [department] in its Polytechnic days into one which excels in both [teaching] and research within the Faculty of [Science] and Engineering of the now City University.
- The [Department] launched its first BSc(Hons) in [Computer] Studies in 1987, followed by the MSc in [Computer] [Science] which was started in 1991.
- The [Department] also produced its first PhD graduate in 1994.
- In addition to offering traditional courses such as foundations of [computer] [science], [computer] architecture and software engineering, our curriculum also exposes our students to the latest advances in distributed databases, parallel [computing], [computer] graphics, internet programming, [multimedia] systems and high speed networking.
- Students will also have the opportunity, as part of their learning, to undertake a major design and development project in new areas such as electronic commerce, virtual reality, [multimedia] information retrieval, [computer] vision, object-oriented and distributed databases, data-mining and webcasting.
- The [Department] is also committed to continuing education, particularly, in the applications of Information Technology (IT) in education.
- This is reflected in the two part-time programmes initiated by the [Department] for in-service primary and secondary school [teachers].
- These programmes aim to equip school [teachers] with the necessary fundamental knowledge and skills needed to apply IT in [teaching] and school management as well as to exploit [multimedia] technology in courseware preparation and delivery.
- All the [teaching] within the [Department] is subject to stringent quality assurance procedures to ensure the highest quality of instruction.
17. Luhn's Method: Example, Remove Non-Significant Sentences
- The Department of Computer Science was established in 1984 and has since evolved from a primarily teaching-oriented department in its Polytechnic days into one which excels in both teaching and research within the Faculty of Science and Engineering of the now City University.
- The Department launched its first BSc(Hons) in Computer Studies in 1987, followed by the MSc in Computer Science which was started in 1991.
- The Department also produced its first PhD graduate in 1994.
- In addition to offering traditional courses such as foundations of computer science, computer architecture and software engineering, our curriculum also exposes our students to the latest advances in distributed databases, parallel computing, computer graphics, internet programming, multimedia systems and high speed networking.
- Students will also have the opportunity, as part of their learning, to undertake a major design and development project in new areas such as electronic commerce, virtual reality, multimedia information retrieval, computer vision, object-oriented and distributed databases, data-mining and webcasting.
- This is reflected in the two part-time programmes initiated by the Department for in-service primary and secondary school teachers.
- These programmes aim to equip school teachers with the necessary fundamental knowledge and skills needed to apply IT in teaching and school management as well as to exploit multimedia technology in courseware preparation and delivery.
- All the teaching within the Department is subject to stringent quality assurance procedures to ensure the highest quality of instruction.
18. Luhn's Method: Example, Find Clusters
- Department of Computer Science
- teaching-oriented department
- teaching
- Science
- Department launched its first BSc(Hons) in Computer
- Computer Science
- Department
- computer science, computer
- computing, computer graphics, internet programming, multimedia
- multimedia information retrieval, computer
- Department
- teachers
- teachers
- teaching
- multimedia
- teaching within the Department
- Department maintains state-of-the-art computing
- supercomputer
- Department
- multimedia and image computing, distributed and real-time systems, theoretical computer science
19. Luhn's Method: Example, Remove Clusters of a Single Word
- Department of Computer Science
- teaching-oriented department
- Department launched its first BSc(Hons) in Computer
- Computer Science
- computer science, computer
- computing, computer graphics, internet programming, multimedia
- multimedia information retrieval, computer
- teaching within the Department
- Department maintains state-of-the-art computing
- multimedia and image computing, distributed and real-time systems, theoretical computer science
20. Luhn's Method: Example, Calculate the Significance of Clusters
- Department of Computer Science: $3 \times 3 / 4 = 2.25$
- teaching-oriented department: $2 \times 2 / 2 = 2$
- Department launched its first BSc(Hons) in Computer: $< 2$
- Computer Science: $2 \times 2 / 2 = 2$
- computer science, computer: $3 \times 3 / 3 = 3$
- computing, computer graphics, internet programming, multimedia: $< 2$
- multimedia information retrieval, computer: $< 2$
- teaching within the Department: $< 2$
- Department maintains state-of-the-art computing: $< 2$
- multimedia and image computing, distributed and real-time systems, theoretical computer science: $4 \times 4 / 11 < 2$
21. Luhn's Method: Example, Two Highest-Ranked Clusters
- Department of Computer Science: $3 \times 3 / 4 = 2.25$
- computer science, computer: $3 \times 3 / 3 = 3$
22. Luhn's Method: Example, Abstract (Two Highest-Ranked Sentences)
- The Department of Computer Science was established in 1984 and has since evolved from a primarily teaching-oriented department in its Polytechnic days into one which excels in both teaching and research within the Faculty of Science and Engineering of the now City University.
- In addition to offering traditional courses such as foundations of computer science, computer architecture and software engineering, our curriculum also exposes our students to the latest advances in distributed databases, parallel computing, computer graphics, internet programming, multimedia systems and high speed networking.
23. Luhn's Method: Further Improvement 1
- Use synonyms
  - Two words with the same meaning
  - Replace words with the same meaning in a text by a single word, to increase its chance of becoming a significant word
- Example: search for "car" in WordNet
- http//www.cogsci.princeton.edu/cgi-bin/webwn1.7.1
?stage2wordcarposnumber1searchtypenumber26
sensesshowglosses1
24. Luhn's Method: Further Improvement 2
- Use hypernyms (X is a kind of ...)
- Two or more synonyms in the same document that share a hypernym can be replaced by their common hypernym.
- Then we may proceed as before to construct the abstract.
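The two improvements can be illustrated with a small Python sketch that normalizes tokens before counting frequencies. The `SYNONYMS` and `HYPERNYMS` tables are hypothetical toy data; a real system would obtain them from a lexical resource such as WordNet.

```python
# Hypothetical toy lexicon, for illustration only.
SYNONYMS = {"automobile": "car", "auto": "car"}    # word -> canonical synonym
HYPERNYMS = {"car": "vehicle", "truck": "vehicle"}  # word -> hypernym


def normalize(tokens, synonyms=SYNONYMS, hypernyms=HYPERNYMS):
    # Improvement 1: replace words of the same meaning by one word.
    canonical = [synonyms.get(t, t) for t in tokens]
    # Improvement 2: find hypernyms shared by two or more distinct words
    # of this document.
    members = {}
    for word in set(canonical):
        if word in hypernyms:
            members.setdefault(hypernyms[word], set()).add(word)
    shared = {h for h, words in members.items() if len(words) >= 2}
    # Replace a word by its hypernym only when that hypernym is shared.
    return [hypernyms[w] if hypernyms.get(w) in shared else w for w in canonical]
```

After normalization, word frequencies are counted as before, so the merged word has a better chance of passing the significance cutoff.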
25. Salton's Rank of Documents
- Vector space model of documents
- Documents are ranked, with respect to each set of query keywords, according to the weights of those keywords for those documents [Sa].
- G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall, 1971.
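26. Salton Background: Term-Document Association Matrix, Illustration
[Figure: term-document association matrix. One axis lists the terms, the other lists the documents; each entry is the weight of a term in the document.]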
27. Salton Background: Term-Document Association Matrix, Deciding the Weight
- Combine two factors in the document-term weight
  - $tf_{ij}$: frequency of term $j$ in document $i$
  - $df_j$: document frequency of term $j$
    - the number of documents containing term $j$
  - $idf_j$: inverse document frequency of term $j$
    - $idf_j = \log_2 (N / df_j)$, where $N$ is the number of documents in the collection
- Inverse document frequency is an indication of a term's value as a document discriminator.
28. Salton Background: Term-Document Association Matrix, Tf-idf Term Weight
- A typical combined term importance indicator
  - $w_{ij} = tf_{ij} \times idf_j = tf_{ij} \times \log_2 (N / df_j)$
- A term that occurs frequently in one document but rarely in the rest of the collection has a high weight in that document.
- If a term appears in every document, then $df_j = N$.
  - Therefore $idf_j = \log_2 1 = 0$ and $w_{ij} = 0$ for all $i$.
  - This is the interpretation of the stop list: a term that appears in every document has no information value.
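The weight formula can be checked with a short Python sketch. It assumes each document is already a list of tokens; the function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # df[t] = number of documents that contain term t.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # tf[t] = frequency of term t in this document
        weights.append(
            {term: freq * math.log2(n_docs / df[term]) for term, freq in tf.items()}
        )
    return weights
```

With the two documents `["the", "cat", "cat"]` and `["the", "dog"]`, the term "the" occurs in both, so its weight is 0 in each, while "cat" gets weight 2 × log2(2/1) = 2 in the first document.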
29. Salton: Significance of a Sentence
- Use a cutoff value to decide the significant words
- Clusters are defined as before
- Significance of a cluster
  - the square of the total weight of the significant words in the cluster, divided by the total weight of all words in the cluster
- Return the maximum cluster significance in a sentence
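A weighted version of the cluster score might look like the sketch below. Two points are assumptions rather than statements from the slides: a word counts as significant when its weight is at or above the cutoff, and words missing from the weight table get weight 0.

```python
def weighted_sentence_significance(tokens, weights, cutoff, max_gap=5):
    """Maximum weighted cluster significance in one sentence (0.0 if no cluster)."""
    w = [weights.get(tok, 0.0) for tok in tokens]
    positions = [i for i, x in enumerate(w) if x >= cutoff]
    if not positions:
        return 0.0
    # Group significant words separated by at most max_gap other words.
    clusters = []
    start = prev = positions[0]
    for pos in positions[1:]:
        if pos - prev - 1 > max_gap:
            clusters.append((start, prev))
            start = pos
        prev = pos
    clusters.append((start, prev))
    best = 0.0
    for first, last in clusters:
        span = w[first:last + 1]
        significant_weight = sum(x for x in span if x >= cutoff)
        total_weight = sum(span)
        if total_weight > 0:
            best = max(best, significant_weight ** 2 / total_weight)
    return best
```

When every word has weight 1 and the cutoff is 1, this does not reduce to Luhn's count-based factor unless non-significant words also carry weight, which is why the weight table should cover all words of the sentence.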
30. Salton: Abstract
- Method 1 (usual)
  - Choose the most significant sentences as the abstract.
- Method 2 (alternative)
  - Choose a set of significant words and their weights in a document as the abstract for the document.
  - Question: what set of words to choose? (The most significant ones may be highly correlated.)
31. Advanced Approaches
- Two types
  - Statistical features
    - relevance feedback and user interest profile
  - Linguistic features
    - experimental identification of linguistic features important in summaries
32. Advanced Approaches: Statistical, Relevance Feedback
- Use user feedback to help summarize documents.
- G. Salton and C. Buckley, Improving retrieval performance by relevance feedback, Journal of the American Society for Information Science 41 (1990), 288-297.
- Sumner, R. G., Jr., Yang, K., Akers, R., & Shaw, W. M., Jr. (1998). Interactive retrieval using IRIS: TREC-6 experiments. In E. M. Voorhees & D. K. Harman (Eds.), The Sixth Text REtrieval Conference (TREC-6) (NIST Spec. Publ. 500-240, pp. 711-734). Washington, DC: U.S. Government Printing Office.
33. Advanced Approaches: Linguistic
- Sentences are ranked for potential inclusion in the summary using linguistic features derived from an analysis of news-wire summaries.
- V. Mittal, M. Kantrowitz, J. Goldstein, J. Carbonell, Selecting Text Spans for Document Summaries: Heuristics and Metrics.
34. Advanced Approaches: Linguistic, Positive Features
- Greater average word length
- Thematic phrases
  - finally
  - in conclusion
- Density of related words
  - words that have multiple related terms in the document, such as synonyms, hypernyms and antonyms
35. Advanced Approaches: Linguistic, Negative Features
- Words and phrases common in direct or indirect quotation
  - according, adding, said, and other verbs related to communication
- Informal and imprecise terms: got, really, use
- Auxiliary verbs: was, could, did
- Honorifics: Dr., Mr., Mrs.
- Negations: don't, no, never
- Integers
- Evaluative and qualifying words: often, about, significant
- Prepositions: at, by, for, of, in, to, with
36. Advanced Approaches: Relevant Novelty
- Carbonell and Goldstein proposed relevant novelty as a measure (independent of relevance) in automatic document summarization, to reduce redundancy.
- J. Carbonell, J. Goldstein, The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries, SIGIR '98, 335-336.
37. Advanced Approaches: Relevant Novelty, Maximal Marginal Relevance Criteria, Page 1
- Documents are selected one by one.
- Given a query $Q$, select a document $D$ that maximizes the difference of two factors
  - the similarity of $Q$ and $D$ (times $\lambda$)
  - the maximum similarity of $D$ with the documents already selected (times $1 - \lambda$)
- We want $D$ to be similar to the query and not similar to the documents already chosen.
38. Advanced Approaches: Relevant Novelty, Maximal Marginal Relevance Criteria, Page 2
- Select documents sequentially.
- Given a query $Q$, select a document $D$ from the not-yet-chosen documents to maximize the difference of two factors
  - $\lambda \, \mathrm{sim}(Q, D)$
  - $(1 - \lambda) \max_i \mathrm{sim}(D, D_i)$,
  - where the $D_i$ are the already chosen documents.
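The selection loop can be sketched in Python as follows. The similarity function is passed in as a parameter; `jaccard` is only an example similarity on word sets and is an assumption of this sketch, as are the function names and the default value of `lam`.

```python
def jaccard(a, b):
    # Example similarity (an assumption): overlap of two word sets.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(query, candidates, sim, k, lam=0.5):
    """Greedily pick up to k documents by maximal marginal relevance."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def mmr_score(doc):
            # Redundancy: maximum similarity to any already chosen document.
            redundancy = max((sim(doc, s) for s in selected), default=0.0)
            return lam * sim(query, doc) - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the ranking depends on relevance to the query alone; smaller values penalize documents that resemble those already chosen. The same loop can pick sentences for a summary if sentences are passed in place of documents.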
39. Summary
- Introduction
- Luhn's Method
- Salton's Rank of Documents
- Advanced Approaches
  - Statistical
  - Linguistic