Title: Processing of large document collections
1 Processing of large document collections
2 Text summarization
- Process of distilling the most important
information from a source to produce an abridged
version for a particular user or task
3 Text summarization
- Many everyday uses
- headlines (from around the world)
- outlines (notes for students)
- minutes (of a meeting)
- reviews (of books, movies)
- ...
4 Architecture of a text summarization system
- Input
- a single document or multiple documents
- text, images, audio, video
- database
5 Architecture of a text summarization system
- output
- extract or abstract
- compression rate
- ratio of summary length to source length
- connected text or fragmentary
- generic or user-focused/domain-specific
- indicative or informative
6 Architecture of a text summarization system
- Three phases
- analyzing the input text
- transforming it into a summary representation
- synthesizing an appropriate output form
7 Condensation operations
- Selection of more salient or non-redundant
information
- aggregation of information (e.g. from different
parts of the source, or of different linguistic
descriptions)
- generalization of specific information with more
general, abstract information
8 The level of processing
- Surface level
- entity level
- discourse level
9 Surface-level approaches
- Tend to represent information in terms of shallow
features
- the features are then selectively combined
together to yield a salience function used to
extract information
10 Surface level
- Shallow features
- thematic features
- presence of statistically salient terms, based on
term frequency statistics
- location
- position in text, position in paragraph, section
depth, particular sections
- background
- presence of terms from the title or headings in
the text, or from the user's query
11 Surface level
- Cue words and phrases
- "in summary", "our investigation"
- emphasizers like "important", "in particular"
- domain-specific bonus (+) and stigma (-) terms
12 Entity-level approaches
- Build an internal representation for text
- modeling text entities and their relationships
- tend to represent patterns of connectivity in the
text to help determine what is salient
13 Relationships between entities
- Similarity (e.g. vocabulary overlap)
- proximity (distance between text units)
- co-occurrence (words related based on their
occurring in common contexts)
- thesaural relationships among words (synonymy,
hypernymy, part-of relations)
- co-reference (of referring expressions such as
noun phrases)
14 Relationships between entities
- Logical relationships (agreement, contradiction,
entailment, consistency)
- syntactic relations (based on parse trees)
- meaning representation-based relations (e.g.
based on predicate-argument relations)
15 Discourse-level approaches
- Model the global structure of the text and its
relation to communicative goals
- structure can include
- format of the document (e.g. hypertext markup)
- threads of topics as they are revealed in the
text
- rhetorical structure of the text, such as
argumentation or narrative structure
16 Classical approaches
17 Luhn's method
- Filter terms in the document using a stoplist
- terms are normalized by aggregating
orthographically similar terms
- frequencies of the aggregated terms are calculated,
and non-frequent terms are removed
18 Luhn's method
- Sentences are weighted using the resulting set
of significant terms and a term density measure
- each sentence is divided into segments bracketed
by significant terms not more than 4
non-significant terms apart
19 Luhn's method
- each segment is scored by taking the square of
the number of bracketed significant terms divided
by the total number of bracketed terms
- the score of the highest-scoring segment is taken
as the sentence score
- the highest-scoring sentences are chosen for the
summary
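As a rough sketch, the segment-scoring step described above can be written as follows. The tokenized sentence and the significant-term set are assumed to be already computed (i.e. after the stoplist filtering and frequency thresholding of the earlier steps); the gap limit of 4 follows the slides.

```python
# A minimal sketch of Luhn's sentence scoring (illustrative, not
# Luhn's original implementation).

def luhn_score(sentence_words, significant, max_gap=4):
    """Score a sentence by its densest cluster of significant terms."""
    # positions of significant terms in the sentence
    positions = [i for i, w in enumerate(sentence_words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = 0
    # split into segments whose significant terms are at most
    # max_gap non-significant terms apart
    for i in range(1, len(positions) + 1):
        if i == len(positions) or positions[i] - positions[i - 1] - 1 > max_gap:
            seg = positions[start:i]        # one bracketed segment
            seg_len = seg[-1] - seg[0] + 1  # total terms in the segment
            # (significant terms)^2 / total bracketed terms
            best = max(best, len(seg) ** 2 / seg_len)
            start = i
    return best
```

The sentence score is the score of its best segment; the top-scoring sentences would then be extracted into the summary.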
20 Edmundson's method
- Extends earlier work to look at three features in
addition to word frequencies
- cue phrases (e.g. "significant", "impossible",
"hardly")
- title and heading words
- location
21 Edmundson's method
- Programs to weight sentences based on each of the
four methods separately
- programs were evaluated by comparison against
manually created extracts
- corpus-based methodology: training set and test
set
- in the training phase, weights were manually
readjusted
22 Edmundson's method
- Results
- the three additional features dominated word
frequency measures
- the combination of cue-title-location was the
best, with location being the best individual
feature
- keywords alone were the worst
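Edmundson's combination of features amounts to a weighted linear sum per sentence. A minimal sketch, with purely illustrative weights (in the original work the weights were readjusted manually on the training set):

```python
# Hypothetical sketch of Edmundson-style sentence scoring: a weighted
# sum of cue, key(word), title, and location feature scores.

def edmundson_score(sentence, weights=(1.0, 0.5, 1.0, 1.0)):
    """sentence: dict of per-feature scores; returns the weighted sum."""
    a, b, c, d = weights  # cue, key, title, location weights (illustrative)
    return (a * sentence["cue"] + b * sentence["key"]
            + c * sentence["title"] + d * sentence["location"])
```

Sentences would be ranked by this score and the top ones extracted; setting a weight to zero removes that feature, which is how the per-feature evaluations above can be simulated.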
23 Fundamental issues
- What are the most powerful but also most general
features to exploit for summarization?
- How do we combine these features?
- How can we evaluate how well we are doing?
24 Corpus-based approaches
- In the classical methods, various features
(thematic features, title, location, cue phrase)
were used to determine the salience of
information for summarization
- an obvious issue: determining the relative
contribution of different features to any given
text summarization task
25 Corpus-based approaches
- Contribution is dependent on the text genre, e.g.
location
- in newspaper stories, the leading text often
contains a summary
- in TV news, a preview segment may contain a
summary of the news to come
- in scientific text, an author-written abstract
26 Corpus-based approaches
- The importance of different text features for any
given summarization problem can be determined by
counting the occurrences of such features in text
corpora - in particular, analysis of human-generated
summaries, along with their full-text sources,
can be used to learn rules for summarization
27 Corpus-based approaches
- One could use a corpus to model particular
components, without using a completely trainable
approach
- e.g. a corpus can be used to compute term weights
(TF*IDF)
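As an illustration of the corpus-derived weights mentioned above, here is a toy TF*IDF computation (a minimal sketch; real systems use smoothed variants and larger corpora):

```python
# Toy TF*IDF weight: term frequency in the document times the log
# inverse document frequency over the corpus.
import math

def tfidf(term, doc, corpus):
    """doc: list of tokens; corpus: list of token lists."""
    tf = doc.count(term)                         # term frequency
    df = sum(1 for d in corpus if term in d)     # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf
```

Terms with high TF*IDF are frequent in the document but rare in the corpus, which is one way to approximate the "statistically salient terms" of the surface-level approaches.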
28 Corpus-based approaches
- Challenges
- creating a suitable text corpus and designing an
annotation scheme
- ensuring a suitable set of summaries is
available
- may already be available (e.g. for scientific papers)
- if not: an author, a professional abstractor, or a
judge can create them
29 KPC method
- Kupiec, Pedersen, Chen (1995): A Trainable
Document Summarizer
- a learning method using a corpus of abstracts
written by professional human abstractors
(Engineering Information Co.)
- a naïve Bayesian classification method is used
30 KPC method features
- Sentence-length cut-off feature
- given a threshold (e.g. 5 words), the feature is
true for all sentences longer than the threshold,
and false otherwise
- fixed-phrase feature
- this feature is true for sentences that contain
any of 26 indicator phrases (e.g. "this letter",
"In conclusion"), or that follow a section heading
that contains specific keywords (e.g. "results",
"conclusion")
31 KPC method features
- Paragraph feature
- sentences in the first 10 paragraphs and the last
5 paragraphs in a document get a higher value
- within paragraphs, paragraph-initial,
paragraph-final, and paragraph-medial positions
are distinguished
32 KPC method features
- Thematic word feature
- a small number of thematic words (the most
frequent content words) are selected
- each sentence is scored as a function of the
frequency of the thematic words
- the highest-scoring sentences are selected
- binary feature: the feature is true for a
sentence if the sentence is present in the set of
highest-scoring sentences
33 KPC method features
- Uppercase word feature
- proper names and explanatory text for acronyms
are usually important
- the feature is computed like the thematic word
feature
- an uppercase thematic word
- is not sentence-initial, begins with a
capital letter, and must occur several times
- the first occurrence is scored twice as much as
later occurrences
34 KPC method classifier
- For each sentence s, we compute the probability that it
will be included in a summary S given the k
features F_j, j = 1..k
- the probability can be expressed using Bayes'
rule:
P(s ∈ S | F_1,...,F_k) = P(F_1,...,F_k | s ∈ S) P(s ∈ S) / P(F_1,...,F_k)
35 KPC method classifier
- Assuming statistical independence of the
features:
P(s ∈ S | F_1,...,F_k) = P(s ∈ S) · Π_j P(F_j | s ∈ S) / Π_j P(F_j)
- P(s ∈ S) is a constant, and P(F_j | s ∈ S) and P(F_j)
can be estimated directly from the training set
by counting occurrences
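The scoring rule under the independence assumption can be sketched directly. The probability tables below are hypothetical placeholders; in the KPC method they would be estimated by counting over the training corpus.

```python
# Sketch of the naive Bayes sentence score P(s in S | F_1..F_k),
# using the independence approximation stated above.

def kpc_score(feature_values, p_f_given_s, p_f, p_s):
    """
    feature_values: tuple of observed feature values for a sentence
    p_f_given_s[j]: dict mapping value -> P(F_j = value | s in S)
    p_f[j]:         dict mapping value -> P(F_j = value)
    p_s:            the constant prior P(s in S)
    """
    num = p_s
    den = 1.0
    for j, f in enumerate(feature_values):
        num *= p_f_given_s[j][f]  # P(F_j = f | s in S)
        den *= p_f[j][f]          # P(F_j = f)
    return num / den
```

Sentences are ranked by this score, and the top-ranked ones are extracted as the summary.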
36 KPC method corpus
- The corpus was acquired from Engineering Information
Co., which provides abstracts of technical
articles to online information services
- the articles do not have author-written abstracts
- abstracts were created by professional abstractors
37 KPC method corpus
- 188 document/summary pairs sampled from 21
publications in the scientific/technical domain
- summaries are mainly indicative; average length
is 3 sentences
- the average number of sentences in the original
documents is 86
- author, address, and bibliography sections were
removed
38 KPC method sentence matching
- The abstracts from the human abstractors are not
extracts, but are inspired by the original sentences
- the automatic summarization task here:
- extract sentences that the human abstractor might
have chosen to prepare the summary text (with minor
modifications)
39 KPC method sentence matching
- For training, a correspondence between the manual
summary sentences and sentences in the original
document needs to be obtained
- matching can be done in several ways
40 KPC method sentence matching
- matching can be done in several ways
- a direct sentence match
- the same sentence is found in both
- a direct join
- 2 or more original sentences were used to form a
summary sentence
- a summary sentence can be unmatchable
- a summary sentence (single or joined) can be
incomplete
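The simplest of these cases, the direct sentence match, can be sketched as normalized string equality (a toy illustration; the actual KPC matching was partly manual and tolerated minor modifications):

```python
# Toy sketch of finding "direct sentence match" cases between a
# manual summary and the source document.

def normalize(s):
    """Lowercase and collapse whitespace for comparison."""
    return " ".join(s.lower().split())

def direct_matches(summary_sentences, document_sentences):
    """Summary sentences that appear verbatim (normalized) in the document."""
    doc = {normalize(s) for s in document_sentences}
    return [s for s in summary_sentences if normalize(s) in doc]
```

Joins, unmatchable, and incomplete sentences would need looser similarity measures and, as in the original work, manual assignment.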
41 KPC method sentence matching
- Matching was done in two passes
- first, the best one-to-one sentence matches were
found automatically (79%)
- second, these matches were used as a starting
point for the manual assignment of
correspondences
42 KPC method evaluation
- Cross-validation strategy for evaluation
- documents from a given journal were selected for
testing one at a time; all other document/summary
pairs were used for training
- unmatchable and incomplete sentences were
excluded
- a total of 498 unique sentences
43 KPC method evaluation
- Two ways of evaluation
- the fraction of manual summary sentences that
were faithfully reproduced by the summarizer
program
- the summarizer produced the same number of
sentences as were in the corresponding manual
summary
- → 35%
- 83% is the highest possible value, since
unmatchable and incomplete sentences were excluded
44 KPC method evaluation
- The fraction of the matchable sentences that
were correctly identified by the summarizer
- → 42%
- the effect of different features was also studied
- best combination (44%): paragraph, fixed-phrase,
sentence-length
- baseline: selecting sentences from the beginning
of the document (result: 24%)
- if 25% of the original sentences are selected: 84%
45 Discourse-based approaches
- Discourse structure appears to play an important
role in the strategies used by human abstractors
and in the structure of their abstracts
- an abstract is not just a collection of
sentences, but has an internal structure
- → an abstract should be coherent and should
represent some of the argumentation used in the
source
46 Discourse models
- Cohesion
- relations between words or referring expressions,
which determine how tightly connected the text is
- anaphora, ellipsis, synonymy, hypernymy ("dog"
is a kind of "animal")
- Coherence
- overall structure of a multi-sentence text in
terms of macro-level relations between sentences
(e.g. "although" → contrast)
47 Boguraev, Kennedy (BG)
- Goal: identify those phrasal units across the
entire span of the document that best function as
representative highlights of the document's
content
- these phrasal units are called topic stamps
- a set of topic stamps is called a capsule
overview
48 BG
- A capsule overview
- not a set/sequence of sentences
- a semi-formal (normalised) representation of the
document, derived through a process of data
reduction over the original text
- not always very readable, but still represents
the flow of the narrative
- can be combined with surrounding information to
produce a more coherent presentation
49 BG
- Primary consideration: methods should apply to
any document type and source (domain
independence)
- also an efficient and scalable technology
- shallow syntactic analysis; no comprehensive
parsing engine needed
50 BG
- Based on findings on technical terms
- technical terms have linguistic properties
that can be used to find terms automatically in
different domains quite reliably
- technical terms seem to be topical
- the task of content characterization:
identifying phrasal units that have
- lexico-syntactic properties similar to technical
terms
- discourse properties that signify their status as
most prominent
51 BG terms as content indicators
- Problems
- undergeneration
- overgeneration
- differentiation
52 Undergeneration
- A set of phrases should contain an exhaustive
description of all the entities that are
discussed in the text
- the set of technical terms has to be extended to
also include expressions with pronouns etc.
53 Overgeneration
- Already the set of technical terms can be large
- extensions make the information overload even
worse
- solution: phrases that refer to one participant
in the discourse are combined via referential
links
54 Differentiation
- The same list of terms may be used to describe
two documents, even if they, e.g., focus on
different subtopics
- it is necessary to differentiate term sets not
only according to their membership, but also
according to the relative representativeness of
the terms they contain
55 Term sets and coreference classes
- Phrases are extracted using a phrasal grammar
(e.g. a noun with modifiers)
- expressions with pronouns and incomplete
expressions are also extracted
- using a (Lingsoft) tagger that provides
information about the part of speech, number,
gender, and grammatical function of tokens in a
text
- this solves the undergeneration problem
56 Term sets and coreference classes
- The phrase set has to be reduced to solve the
problem of overgeneration
- → a smaller set of expressions that uniquely
identify the objects referred to in the text
- an application of anaphora resolution
- e.g. to which noun does a pronoun like "he" refer?
57 Resolving coreferences
- Procedure
- move through the text sentence by sentence,
analysing the nominal expressions in each
sentence from left to right
- either an expression is identified as a new
participant in the discourse, or it is taken to
refer to a previously mentioned referent
58 Resolving coreferences
- Coreference is determined by a three-step procedure
- a set of candidates is collected: all nominals
within a local segment of discourse
- some candidates are eliminated due to
morphological mismatches or syntactic
restrictions
- the remaining candidates are ranked according to
their relative salience in the discourse
59 Salience factors
- sent(term) = 100 iff term is in the current
sentence
- cntx(term) = 50 iff term is in the current
discourse segment
- subj(term) = 80 iff term is a subject
- acc(term) = 50 iff term is a direct object
- dat(term) = 40 iff term is an indirect object
- ...
60 Local salience of a candidate
- The local salience of a candidate is the sum of
the values of the salience factors
- the most salient candidate is selected as the
antecedent for the anaphor
- if a coreference link cannot be established to
some other expression, the nominal is taken to
introduce a new referent
- → coreference classes
61 Topic stamps
- In order to further reduce the referent set, some
additional structure has to be imposed
- the term set is ranked according to the salience
of its members
- i.e. the relative prominence or importance in the
discourse of the entities to which they refer
- objects in the centre of discussion have a high
degree of salience
62 Saliency
- Measured like local saliency in coreference
resolution, but tries to measure the importance
of unique referents in the discourse
63 Priest is charged with Pope attack
A Spanish priest was charged here today with
attempting to murder the Pope. Juan Fernandez
Krohn, aged 32, was arrested after a man armed
with a bayonet approached the Pope while he was
saying prayers at Fatima on Wednesday
night. According to the police, Fernandez told
the investigators today that he trained for the
past six months for the assault. He was alleged
to have claimed the Pope looked furious on
hearing the priest's criticism of his handling of
the church's affairs. If found guilty, the
Spaniard faces a prison sentence of 15-20 years.
64 Saliency
- the priest is the primary element
- eight references to the same actor in the body of
the story
- these references occur in important syntactic
positions: 5 are subjects of main clauses, 2 are
subjects of embedded clauses, 1 is a possessive
- the Pope attack is also important
- "Pope" occurs 5 times, but not in such important
positions (2 are direct objects)
65 Discourse segments
- If the intention is to use very concise
descriptions of one or two salient phrases, i.e.
topic stamps, longer texts have to be broken down
into smaller segments
- topically coherent, contiguous segments can be
found by using a lexical similarity measure
- assumption: the distribution of words used changes
when the topic changes
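One common way to operationalize this assumption is TextTiling-style segmentation: compare the vocabulary of adjacent stretches of text and place a boundary where similarity drops. The simplified sketch below (comparing adjacent sentences with cosine similarity; the threshold is an illustrative assumption, not the method BG actually used) shows the idea:

```python
# Sketch of lexical-similarity segmentation: a segment boundary is
# placed where the word-overlap (cosine) similarity between adjacent
# sentences falls below a threshold.
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundaries(sentences, threshold=0.2):
    """Indices i such that a new segment starts at sentence i."""
    cuts = []
    for i in range(1, len(sentences)):
        if cosine(Counter(sentences[i - 1]), Counter(sentences[i])) < threshold:
            cuts.append(i)
    return cuts
```

Real implementations smooth the similarity curve over sentence blocks rather than single sentences, but the underlying signal is the same: a dip in lexical similarity marks a likely topic shift.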
66 BG Summarization process
- Linguistic analysis
- discourse segmentation
- extended phrase analysis
- anaphora resolution
- calculation of discourse salience
- topic stamp identification
- capsule overview
67 Knowledge-rich approaches
- Structured information can be used as the
starting point for summarization
- structured information, e.g. data and knowledge
bases, may have been produced by processing input
text
- the summarizer does not have to address the
linguistic complexities and variability of the
input, but the structure of the input text
is not available either
68 Knowledge-rich approaches
- There is a need for measures of salience and
relevance that are dependent on the knowledge
source - addressing coherence, cohesion, and fluency
becomes the entire responsibility of the generator
69 STREAK, PLANDOC
- McKeown, Robin, Kukich (1995): Generating concise
natural language summaries
- goal: folding information from multiple facts
into a single sentence using concise linguistic
constructions
70 STREAK
- Produces summaries of basketball games
- first creates a draft of essential facts
- then uses revision rules constrained by the draft
wording to add in additional facts as the text
allows
71 STREAK
- Input
- a set of box scores for a basketball game
- historical information (from a database)
- task
- summarize the highlights of the game,
underscoring their significance in the light of
previous games
- output
- a short summary (a few sentences)
72 STREAK
- The box score input is represented as a
conceptual network that expresses relations
between what were the columns and rows of the
table
- essential facts: the game result, its location,
date, and at least one final game statistic (the
most remarkable statistic of a winning team
player)
73 STREAK
- Essential facts can be obtained directly from the
box score
- in addition, other potential facts
- other notable game statistics of individual
players (from the box score)
- game result streaks, e.g. "Utah recorded its fourth
straight win" (historical)
- extremum performances such as maximums or
minimums (historical)
74 STREAK
- Essential facts are always included
- potential facts are included if there is space
- the decision on which potential facts to include
could be based on whether they can be combined
with the essential information in cohesive
and stylistically successful ways
75 STREAK
- Given the facts
- "Karl Malone scored 39 points."
- "Karl Malone's 39-point performance is equal to
his season high."
- a single sentence is produced
- "Karl Malone tied his season high with 39 points."
76 PLANDOC
- Produces summaries of telephone network planning
activity
- uses discourse planning, looking ahead in its
text plan to group together facts which can be
expressed concisely using conjunction and
deleting repetitions
77 PLANDOC
- The system must produce a report documenting how
an engineer investigated what new technology is
needed in a telephone route to meet demand
through use of a sophisticated software planning
system
78 PLANDOC
- Input
- a trace of user interaction with the planning
system software PLAN
- output
- a 1-2 page report, including a paragraph summarizing
PLAN's solution, a summary of refinements that an
engineer made to the system solution, and a
closing paragraph summarizing the engineer's
final proposal
79 Summary generation
- Summaries must convey maximal information in a
minimal amount of space
- this requires the use of complex sentence structures
- multiple modifiers of a noun or a verb
- conjunction ("and")
- ellipsis (deletion of repetitions)
- selection of words that convey multiple aspects
of the information