Title: Processing of large document collections
1 Processing of large document collections
2 Text summarization
- Process of distilling the most important
information from a source to produce an abridged
version for a particular user or task
3 Text summarization
- Many everyday uses
- headlines (from around the world)
- outlines (notes for students)
- minutes (of a meeting)
- reviews (of books, movies)
- ...
4 Architecture of a text summarization system
- Input
- a single document or multiple documents
- text, images, audio, video
- database
5 Architecture of a text summarization system
- output
- extract or abstract
- compression rate
- ratio of summary length to source length
- connected text or fragmentary
- generic or user-focused/domain-specific
- indicative or informative
6 Architecture of a text summarization system
- Three phases
- analyzing the input text
- transforming it into a summary representation
- synthesizing an appropriate output form
7 Condensation operations
- Selection of more salient or non-redundant
information
- aggregation of information (e.g. from different
parts of the source, or of different linguistic
descriptions)
- generalization of specific information with more
general, abstract information
8 The level of processing
- Surface level
- entity level
- discourse level
9 Surface-level approaches
- Tend to represent information in terms of shallow
features
- the features are then selectively combined
together to yield a salience function used to
extract information
10 Surface level
- Shallow features
- thematic features
- presence of statistically salient terms, based on
term frequency statistics
- location
- position in text, position in paragraph, section
depth, particular sections
- background
- presence of terms from the title or headings in
the text, or from the user's query
11 Surface level
- Cue words and phrases
- "in summary", "our investigation"
- emphasizers like "important", "in particular"
- domain-specific bonus (+) and stigma (-) terms
12 Entity-level approaches
- Build an internal representation for text
- modeling text entities and their relationships
- tend to represent patterns of connectivity in the
text to help determine what is salient
13 Relationships between entities
- Similarity (e.g. vocabulary overlap)
- proximity (distance between text units)
- co-occurrence (words related based on their
occurring in common contexts)
- thesaural relationships among words (synonymy,
hypernymy, part-of relations)
- co-reference (of referring expressions such as
noun phrases)
14 Relationships between entities
- Logical relationships (agreement, contradiction,
entailment, consistency)
- syntactic relations (based on parse trees)
- meaning representation-based relations (e.g.
based on predicate-argument relations)
15 Discourse-level approaches
- Model the global structure of the text and its
relation to communicative goals
- structure can include
- format of the document (e.g. hypertext markup)
- threads of topics as they are revealed in the
text
- rhetorical structure of the text, such as
argumentation or narrative structure
16 Classical approaches
17 Luhn's method
- Filter terms in the document using a stoplist
- terms are normalized by aggregating
orthographically similar terms
- frequencies of the aggregated terms are calculated,
and non-frequent terms are removed
18 Luhn's method
- Sentences are weighted using the resulting set
of significant terms and a term density measure
- each sentence is divided into segments bracketed
by significant terms not more than 4
non-significant terms apart
19 Luhn's method
- each segment is scored by taking the square of
the number of bracketed significant terms divided
by the total number of bracketed terms
- the score of the highest-scoring segment is taken
as the sentence score
- the highest-scoring sentences are chosen for the
summary
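As a rough sketch, the segment-scoring step described above can be written as follows. The tokenized sentence and the significant-term set are assumed to be already computed (i.e. after the stoplist filtering and frequency thresholding of the earlier steps); the gap limit of 4 follows the slides.

```python
# A minimal sketch of Luhn's sentence scoring (illustrative, not
# Luhn's original implementation).

def luhn_score(sentence_words, significant, max_gap=4):
    """Score a sentence by its densest cluster of significant terms."""
    # positions of significant terms in the sentence
    positions = [i for i, w in enumerate(sentence_words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = 0
    # split into segments whose significant terms are at most
    # max_gap non-significant terms apart
    for i in range(1, len(positions) + 1):
        if i == len(positions) or positions[i] - positions[i - 1] - 1 > max_gap:
            seg = positions[start:i]        # one bracketed segment
            seg_len = seg[-1] - seg[0] + 1  # total terms in the segment
            # (significant terms)^2 / total bracketed terms
            best = max(best, len(seg) ** 2 / seg_len)
            start = i
    return best
```

The sentence score is the score of its best segment; the top-scoring sentences would then be extracted into the summary.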
20 Edmundson's method
- Extends earlier work to look at three features in
addition to word frequencies
- cue phrases (e.g. "significant", "impossible",
"hardly")
- title and heading words
- location
21 Edmundson's method
- Programs to weight sentences based on each of the
four methods separately
- programs were evaluated by comparison against
manually created extracts
- corpus-based methodology: training set and test
set
- in the training phase, weights were manually
readjusted
22 Edmundson's method
- Results
- the three additional features dominated word
frequency measures
- the combination of cue-title-location was the
best, with location being the best individual
feature
- keywords alone were the worst
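Edmundson's combination of features amounts to a weighted linear sum per sentence. A minimal sketch, with purely illustrative weights (in the original work the weights were readjusted manually on the training set):

```python
# Hypothetical sketch of Edmundson-style sentence scoring: a weighted
# sum of cue, key(word), title, and location feature scores.

def edmundson_score(sentence, weights=(1.0, 0.5, 1.0, 1.0)):
    """sentence: dict of per-feature scores; returns the weighted sum."""
    a, b, c, d = weights  # cue, key, title, location weights (illustrative)
    return (a * sentence["cue"] + b * sentence["key"]
            + c * sentence["title"] + d * sentence["location"])
```

Sentences would be ranked by this score and the top ones extracted; setting a weight to zero removes that feature, which is how the per-feature evaluations above can be simulated.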
23 Fundamental issues
- What are the most powerful but also most general
features to exploit for summarization?
- How do we combine these features?
- How can we evaluate how well we are doing?
24 Corpus-based approaches
- In the classical methods, various features
(thematic features, title, location, cue phrase)
were used to determine the salience of
information for summarization
- an obvious issue: determining the relative
contribution of different features to any given
text summarization task
25 Corpus-based approaches
- Contribution is dependent on the text genre, e.g.
location
- in newspaper stories, the leading text often
contains a summary
- in TV news, a preview segment may contain a
summary of the news to come
- in scientific text, an author-written abstract
26 Corpus-based approaches
- The importance of different text features for any
given summarization problem can be determined by
counting the occurrences of such features in text
corpora - in particular, analysis of human-generated
summaries, along with their full-text sources,
can be used to learn rules for summarization
27 Corpus-based approaches
- One could use a corpus to model particular
components, without using a completely trainable
approach
- e.g. a corpus can be used to compute term weights
(TF*IDF)
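As an illustration of the corpus-derived weights mentioned above, here is a toy TF*IDF computation (a minimal sketch; real systems use smoothed variants and larger corpora):

```python
# Toy TF*IDF weight: term frequency in the document times the log
# inverse document frequency over the corpus.
import math

def tfidf(term, doc, corpus):
    """doc: list of tokens; corpus: list of token lists."""
    tf = doc.count(term)                         # term frequency
    df = sum(1 for d in corpus if term in d)     # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf
```

Terms with high TF*IDF are frequent in the document but rare in the corpus, which is one way to approximate the "statistically salient terms" of the surface-level approaches.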
28 Corpus-based approaches
- Challenges
- creating a suitable text corpus and designing an
annotation scheme
- ensuring a suitable set of summaries is
available
- may already be available (e.g. for scientific papers)
- if not: an author, a professional abstractor, or a
judge can create them
29 KPC method
- Kupiec, Pedersen, Chen (1995): A Trainable
Document Summarizer
- a learning method using a corpus of abstracts
written by professional human abstractors
(Engineering Information Co.)
- a naïve Bayesian classification method is used
30 KPC method features
- Sentence-length cut-off feature
- given a threshold (e.g. 5 words), the feature is
true for all sentences longer than the threshold,
and false otherwise
- fixed-phrase feature
- this feature is true for sentences that contain
any of 26 indicator phrases (e.g. "this letter",
"In conclusion"), or that follow a section heading
that contains specific keywords (e.g. "results",
"conclusion")
31 KPC method features
- Paragraph feature
- sentences in the first 10 paragraphs and the last
5 paragraphs in a document get a higher value
- within paragraphs, paragraph-initial,
paragraph-final, and paragraph-medial positions
are distinguished
32 KPC method features
- Thematic word feature
- a small number of thematic words (the most
frequent content words) are selected
- each sentence is scored as a function of the
frequency of the thematic words
- the highest-scoring sentences are selected
- binary feature: the feature is true for a
sentence if the sentence is present in the set of
highest-scoring sentences
33 KPC method features
- Uppercase word feature
- proper names and explanatory text for acronyms
are usually important
- the feature is computed like the thematic word
feature
- an uppercase thematic word
- is not sentence-initial, begins with a
capital letter, and must occur several times
- the first occurrence is scored twice as much as
later occurrences
34 KPC method classifier
- For each sentence s, we compute the probability that it
will be included in a summary S given the k
features F_j, j = 1..k
- the probability can be expressed using Bayes'
rule:
P(s ∈ S | F_1,...,F_k) = P(F_1,...,F_k | s ∈ S) P(s ∈ S) / P(F_1,...,F_k)
35 KPC method classifier
- Assuming statistical independence of the
features:
P(s ∈ S | F_1,...,F_k) = P(s ∈ S) · Π_j P(F_j | s ∈ S) / Π_j P(F_j)
- P(s ∈ S) is a constant, and P(F_j | s ∈ S) and P(F_j)
can be estimated directly from the training set
by counting occurrences
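The scoring rule under the independence assumption can be sketched directly. The probability tables below are hypothetical placeholders; in the KPC method they would be estimated by counting over the training corpus.

```python
# Sketch of the naive Bayes sentence score P(s in S | F_1..F_k),
# using the independence approximation stated above.

def kpc_score(feature_values, p_f_given_s, p_f, p_s):
    """
    feature_values: tuple of observed feature values for a sentence
    p_f_given_s[j]: dict mapping value -> P(F_j = value | s in S)
    p_f[j]:         dict mapping value -> P(F_j = value)
    p_s:            the constant prior P(s in S)
    """
    num = p_s
    den = 1.0
    for j, f in enumerate(feature_values):
        num *= p_f_given_s[j][f]  # P(F_j = f | s in S)
        den *= p_f[j][f]          # P(F_j = f)
    return num / den
```

Sentences are ranked by this score, and the top-ranked ones are extracted as the summary.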
36 KPC method corpus
- The corpus was acquired from Engineering Information
Co., which provides abstracts of technical
articles to online information services
- the articles do not have author-written abstracts
- abstracts were created by professional abstractors
37 KPC method corpus
- 188 document/summary pairs sampled from 21
publications in the scientific/technical domain
- summaries are mainly indicative; average length
is 3 sentences
- the average number of sentences in the original
documents is 86
- author, address, and bibliography sections were
removed
38 KPC method sentence matching
- The abstracts from the human abstractors are not
extracts, but are inspired by the original sentences
- the automatic summarization task here:
- extract sentences that the human abstractor might
have chosen to prepare the summary text (with minor
modifications)
39 KPC method sentence matching
- For training, a correspondence between the manual
summary sentences and sentences in the original
document needs to be obtained
- matching can be done in several ways
40 KPC method sentence matching
- matching can be done in several ways
- a direct sentence match
- the same sentence is found in both
- a direct join
- 2 or more original sentences were used to form a
summary sentence
- a summary sentence can be unmatchable
- a summary sentence (single or joined) can be
incomplete
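The simplest of these cases, the direct sentence match, can be sketched as normalized string equality (a toy illustration; the actual KPC matching was partly manual and tolerated minor modifications):

```python
# Toy sketch of finding "direct sentence match" cases between a
# manual summary and the source document.

def normalize(s):
    """Lowercase and collapse whitespace for comparison."""
    return " ".join(s.lower().split())

def direct_matches(summary_sentences, document_sentences):
    """Summary sentences that appear verbatim (normalized) in the document."""
    doc = {normalize(s) for s in document_sentences}
    return [s for s in summary_sentences if normalize(s) in doc]
```

Joins, unmatchable, and incomplete sentences would need looser similarity measures and, as in the original work, manual assignment.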
41 KPC method sentence matching
- Matching was done in two passes
- first, the best one-to-one sentence matches were
found automatically (79%)
- second, these matches were used as a starting
point for the manual assignment of
correspondences
42 KPC method evaluation
- Cross-validation strategy for evaluation
- documents from a given journal were selected for
testing one at a time; all other document/summary
pairs were used for training
- unmatchable and incomplete sentences were
excluded
- a total of 498 unique sentences
43 KPC method evaluation
- Two ways of evaluation
- the fraction of manual summary sentences that
were faithfully reproduced by the summarizer
program
- the summarizer produced the same number of
sentences as were in the corresponding manual
summary
- → 35%
- 83% is the highest possible value, since
unmatchable and incomplete sentences were excluded
44 KPC method evaluation
- The fraction of the matchable sentences that
were correctly identified by the summarizer
- → 42%
- the effect of different features was also studied
- best combination (44%): paragraph, fixed-phrase,
sentence-length
- baseline: selecting sentences from the beginning
of the document (result: 24%)
- if 25% of the original sentences are selected: 84%
45 Discourse-based approaches
- Discourse structure appears to play an important
role in the strategies used by human abstractors
and in the structure of their abstracts
- an abstract is not just a collection of
sentences, but has an internal structure
- → an abstract should be coherent and should
represent some of the argumentation used in the
source
46 Discourse models
- Cohesion
- relations between words or referring expressions,
which determine how tightly connected the text is
- anaphora, ellipsis, synonymy, hypernymy ("dog"
is a kind of "animal")
- Coherence
- overall structure of a multi-sentence text in
terms of macro-level relations between sentences
(e.g. "although" → contrast)
47 Boguraev, Kennedy (BG)
- Goal: identify those phrasal units across the
entire span of the document that best function as
representative highlights of the document's
content
- these phrasal units are called topic stamps
- a set of topic stamps is called a capsule
overview
48 BG
- A capsule overview
- not a set/sequence of sentences
- a semi-formal (normalised) representation of the
document, derived through a process of data
reduction over the original text
- not always very readable, but still represents
the flow of the narrative
- can be combined with surrounding information to
produce a more coherent presentation
49 BG
- Primary consideration: methods should apply to
any document type and source (domain
independence)
- also an efficient and scalable technology
- shallow syntactic analysis; no comprehensive
parsing engine needed
50 BG
- Based on findings on technical terms
- technical terms have linguistic properties
that can be used to find terms automatically in
different domains quite reliably
- technical terms seem to be topical
- the task of content characterization:
identifying phrasal units that have
- lexico-syntactic properties similar to technical
terms
- discourse properties that signify their status as
most prominent
51 BG terms as content indicators
- Problems
- undergeneration
- overgeneration
- differentiation
52 Undergeneration
- A set of phrases should contain an exhaustive
description of all the entities that are
discussed in the text
- the set of technical terms has to be extended to
also include expressions with pronouns etc.
53 Overgeneration
- Already the set of technical terms can be large
- extensions make the information overload even
worse
- solution: phrases that refer to one participant
in the discourse are combined via referential
links
54 Differentiation
- The same list of terms may be used to describe
two documents, even if they, e.g., focus on
different subtopics
- it is necessary to differentiate term sets not
only according to their membership, but also
according to the relative representativeness of
the terms they contain
55 Term sets and coreference classes
- Phrases are extracted using a phrasal grammar
(e.g. a noun with modifiers)
- expressions with pronouns and incomplete
expressions are also extracted
- using a (Lingsoft) tagger that provides
information about the part of speech, number,
gender, and grammatical function of tokens in a
text
- this solves the undergeneration problem
56 Term sets and coreference classes
- The phrase set has to be reduced to solve the
problem of overgeneration
- → a smaller set of expressions that uniquely
identify the objects referred to in the text
- an application of anaphora resolution
- e.g. to which noun does a pronoun like "he" refer?
57 Resolving coreferences
- Procedure
- move through the text sentence by sentence,
analysing the nominal expressions in each
sentence from left to right
- either an expression is identified as a new
participant in the discourse, or it is taken to
refer to a previously mentioned referent
58 Resolving coreferences
- Coreference is determined by a three-step procedure
- a set of candidates is collected: all nominals
within a local segment of discourse
- some candidates are eliminated due to
morphological mismatches or syntactic
restrictions
- the remaining candidates are ranked according to
their relative salience in the discourse
59 Salience factors
- sent(term) = 100 iff term is in the current
sentence
- cntx(term) = 50 iff term is in the current
discourse segment
- subj(term) = 80 iff term is a subject
- acc(term) = 50 iff term is a direct object
- dat(term) = 40 iff term is an indirect object
- ...
60 Local salience of a candidate
- The local salience of a candidate is the sum of
the values of the salience factors
- the most salient candidate is selected as the
antecedent for the anaphor
- if a coreference link cannot be established to
some other expression, the nominal is taken to
introduce a new referent
- → coreference classes
61 Topic stamps
- In order to further reduce the referent set, some
additional structure has to be imposed
- the term set is ranked according to the salience
of its members
- i.e. the relative prominence or importance in the
discourse of the entities to which they refer
- objects in the centre of discussion have a high
degree of salience
62 Saliency
- Measured like local saliency in coreference
resolution, but tries to measure the importance
of unique referents in the discourse
63 Priest is charged with Pope attack
A Spanish priest was charged here today with
attempting to murder the Pope. Juan Fernandez
Krohn, aged 32, was arrested after a man armed
with a bayonet approached the Pope while he was
saying prayers at Fatima on Wednesday
night. According to the police, Fernandez told
the investigators today that he trained for the
past six months for the assault. He was alleged
to have claimed the Pope looked furious on
hearing the priest's criticism of his handling of
the church's affairs. If found guilty, the
Spaniard faces a prison sentence of 15-20 years.
64 Saliency
- the priest is the primary element
- eight references to the same actor in the body of
the story
- these references occur in important syntactic
positions: 5 are subjects of main clauses, 2 are
subjects of embedded clauses, 1 is a possessive
- the Pope attack is also important
- "Pope" occurs 5 times, but not in such important
positions (2 are direct objects)
65 Discourse segments
- If the intention is to use very concise
descriptions of one or two salient phrases, i.e.
topic stamps, longer texts have to be broken down
into smaller segments
- topically coherent, contiguous segments can be
found by using a lexical similarity measure
- assumption: the distribution of words used changes
when the topic changes
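One common way to operationalize this assumption is TextTiling-style segmentation: compare the vocabulary of adjacent stretches of text and place a boundary where similarity drops. The simplified sketch below (comparing adjacent sentences with cosine similarity; the threshold is an illustrative assumption, not the method BG actually used) shows the idea:

```python
# Sketch of lexical-similarity segmentation: a segment boundary is
# placed where the word-overlap (cosine) similarity between adjacent
# sentences falls below a threshold.
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundaries(sentences, threshold=0.2):
    """Indices i such that a new segment starts at sentence i."""
    cuts = []
    for i in range(1, len(sentences)):
        if cosine(Counter(sentences[i - 1]), Counter(sentences[i])) < threshold:
            cuts.append(i)
    return cuts
```

Real implementations smooth the similarity curve over sentence blocks rather than single sentences, but the underlying signal is the same: a dip in lexical similarity marks a likely topic shift.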
66 BG Summarization process
- Linguistic analysis
- discourse segmentation
- extended phrase analysis
- anaphora resolution
- calculation of discourse salience
- topic stamp identification
- capsule overview
67 Knowledge-rich approaches
- Structured information can be used as the
starting point for summarization
- structured information, e.g. data and knowledge
bases, may have been produced by processing input
text
- the summarizer does not have to address the
linguistic complexities and variability of the
input, but the structure of the input text
is not available either
68 Knowledge-rich approaches
- There is a need for measures of salience and
relevance that are dependent on the knowledge
source - addressing coherence, cohesion, and fluency
becomes the entire responsibility of the generator
69 STREAK, PLANDOC
- McKeown, Robin, Kukich (1995): Generating concise
natural language summaries
- goal: folding information from multiple facts
into a single sentence using concise linguistic
constructions
70 STREAK
- Produces summaries of basketball games
- first creates a draft of essential facts
- then uses revision rules constrained by the draft
wording to add in additional facts as the text
allows
71 STREAK
- Input
- a set of box scores for a basketball game
- historical information (from a database)
- task
- summarize the highlights of the game,
underscoring their significance in the light of
previous games
- output
- a short summary (a few sentences)
72 STREAK
- The box score input is represented as a
conceptual network that expresses relations
between what were the columns and rows of the
table
- essential facts: the game result, its location,
date, and at least one final game statistic (the
most remarkable statistic of a winning team
player)
73 STREAK
- Essential facts can be obtained directly from the
box score
- in addition, other potential facts
- other notable game statistics of individual
players (from the box score)
- game result streaks, e.g. "Utah recorded its fourth
straight win" (historical)
- extremum performances such as maximums or
minimums (historical)
74 STREAK
- Essential facts are always included
- potential facts are included if there is space
- the decision on which potential facts to include
could be based on whether they can be combined
with the essential information in cohesive
and stylistically successful ways
75 STREAK
- Given the facts
- "Karl Malone scored 39 points."
- "Karl Malone's 39-point performance is equal to
his season high."
- a single sentence is produced
- "Karl Malone tied his season high with 39 points."
76 PLANDOC
- Produces summaries of telephone network planning
activity
- uses discourse planning, looking ahead in its
text plan to group together facts which can be
expressed concisely using conjunction and
deleting repetitions
77 PLANDOC
- The system must produce a report documenting how
an engineer investigated what new technology is
needed in a telephone route to meet demand
through use of a sophisticated software planning
system
78 PLANDOC
- Input
- a trace of user interaction with the planning
system software PLAN
- output
- a 1-2 page report, including a paragraph summarizing
PLAN's solution, a summary of refinements that an
engineer made to the system solution, and a
closing paragraph summarizing the engineer's
final proposal
79 Summary generation
- Summaries must convey maximal information in a
minimal amount of space
- this requires the use of complex sentence structures
- multiple modifiers of a noun or a verb
- conjunction ("and")
- ellipsis (deletion of repetitions)
- selection of words that convey multiple aspects
of the information