Text Summarization In Search of Effective Ideas and Techniques - PowerPoint PPT Presentation

Slides: 42

Provided by: sliu2

Transcript and Presenter's Notes

1
Text Summarization -- In Search of Effective
Ideas and Techniques

Shuhua Liu, Assistant Professor
Department of Information Systems
Åbo Akademi University, Finland
Visiting scholar at BISC, UC Berkeley
Berkeley, September 21, 2004

2
What is text summarization?

To reduce (long) textual information to its most
essential points.
To distill the most important information from a
source or sources to produce an abridged version
of it (Endres-Niggemeyer, 1998; Mani and Maybury,
1999; Spärck Jones, 1999).

3
Text summarization: a context-dependent activity
4
Text summarization

Key issues:
How to identify the most important content out of
the rest of the text?
How to synthesize the substance and formulate a
summary text based on the identified content?
Major approaches:
Selection based: produces extracts
Text understanding based: produces abstracts

6
Selection based summarization: how does it work?

The most content-bearing sentences or passages
are identified and selected to compose a summary.
Compute a significance value for each sentence
(Luhn, 1958; Edmundson, 1969), based on:
word frequency
the keywords, title words, and cue words it contains
the position of the sentence
RST (Rhetorical Structure Theory) based discourse
analysis (Marcu, 1997)
Passage and sentence similarity analysis
(Goldstein et al., 2000; CMU)
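
The significance value described above can be sketched as a weighted sum of simple features. This is a minimal illustration in the spirit of Luhn and Edmundson, not their exact formulas: the stopword and cue word lists, the feature weights, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative word lists (assumptions); a real system would use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
             "it", "that", "this", "for", "on", "we"}
CUE_WORDS = {"significant", "conclusion", "important", "summary"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def score_sentences(title, sentences, w_freq=1.0, w_title=1.0, w_cue=1.0, w_pos=1.0):
    """Return one significance score per sentence (higher = more important)."""
    content = [[w for w in tokenize(s) if w not in STOPWORDS] for s in sentences]
    freq = Counter(w for words in content for w in words)
    title_words = set(tokenize(title)) - STOPWORDS
    scores = []
    for i, words in enumerate(content):
        if not words:
            scores.append(0.0)
            continue
        # Frequency feature: average document frequency of the content words.
        f = sum(freq[w] for w in words) / len(words)
        # Title, cue word, and position features.
        t = len(title_words & set(words))
        c = len(CUE_WORDS & set(words))
        p = 1.0 if i == 0 or i == len(sentences) - 1 else 0.0
        scores.append(w_freq * f + w_title * t + w_cue * c + w_pos * p)
    return scores


def extract(title, sentences, k=2):
    """Pick the k highest-scoring sentences and keep them in document order."""
    scores = score_sentences(title, sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Returning the selected sentences in their original order is a small design choice that helps readability of the extract; it does not address dangling references.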

7
MSWord AutoSummarize
8
MEAD/NewsINEssence (Radev et al, 2003)
11
Text understanding system

A text understanding task often aims to recover
all of the information that there is in a text,
including what is only implicit in what is
actually written.
All the richness of natural language becomes
fair game, including metaphor, metonymy,
discourse structure, and the recognition of the
author's underlying intentions, and the full
interplay between language and world knowledge
becomes central to the task.

12
Text understanding based summarization

Depends on complete sentence analysis and
discourse analysis with full knowledge support:
Syntactic parser, semantic interpreter
Linguistic knowledge, world knowledge, domain
knowledge
Reasoning mechanisms that work effectively over
huge knowledge collections.

13
Selection based vs. Understanding based

Selection based: generally applicable, but
incoherent content and poor readability due to
unclear relationships between the selected text
excerpts, dangling references, and so on.
Understanding based: high precision, but very
slow, a large amount of wasted computation, highly
domain specific.
Endres-Niggemeyer (2000) found that people
sometimes prefer extractive summaries to
glossed-over abstractive summaries!

14
The reality

The dominant approach in practice is still
selection based.
Understanding based systems exist only in theory,
and will continue to do so for quite a while.
However, certain text understanding tasks on a
small scale or in restricted domains can be done.

15
Topic guided text summarization: TIDE

TIDE is our effort to apply text summarization
techniques to business applications.
Such real world applications will require a
combination of these different forms of
summarization.
A purely extractive summary will not do.
A purely abstractive summary will not do.
Information extraction alone will not do.

16
Topic guided text summarization: TIDE

Text summarization as a process of topic
analysis, passage extraction, text
understanding, information integration/fusion,
and text generation.
Passage extraction guided by topic structure is
expected to preserve the logical relationships
between the extracted text parts, e.g. sentences
are arranged logically according to topic structure.
Topic representation will also be very helpful in
the next phase of text analysis and information
integration.

17
Phase 1: Theme detection, topic labels,
sentence/passage selection

Theme detection through pairwise passage
similarity analysis
Vector space model of terms and documents
TF-IDF as the baseline method
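
A minimal sketch of the TF-IDF baseline for pairwise passage similarity follows. Each passage becomes a sparse term vector and passages are compared by cosine similarity. The smoothed IDF variant, the tokenizer, and the function names are assumptions chosen for the example; the slides do not specify the exact weighting used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(passages):
    """Map each passage to a sparse {term: weight} TF-IDF vector."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    # Smoothed IDF so terms occurring in every passage keep a small positive weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_matrix(passages):
    """Pairwise cosine similarity between all passages."""
    vecs = tfidf_vectors(passages)
    return [[cosine(a, b) for b in vecs] for a in vecs]
```

Passages that share no terms get a similarity of exactly zero under this model, which is the main limitation that latent semantic methods try to address.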

18
Passage similarity analysis with LSA method

LSA (Latent Semantic Analysis)
http://www.cs.utk.edu/lsi/ (Deerwester et al.,
1990)
http://lsa.colorado.edu/
Results similar to those obtained with TF-IDF
Fuzzy LSI approach (Nikravesh, 2002)

19
Passage similarity analysis

OKAPI (TREC-3; Robertson et al., 1996)
Weight functions take into account document
length, average document length, and relevance
feedback factors, in addition to term frequency
and collection frequency
The current standard
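
To make the weighting factors listed above concrete, here is a sketch of the commonly cited BM25 form of the Okapi weight. It is an assumption that this variant matches the one used in the work presented; the parameter defaults `k1=1.2` and `b=0.75` are conventional choices, and relevance feedback is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def bm25_scores(query, passages, k1=1.2, b=0.75):
    """Score every passage against the query with a BM25-style weight."""
    docs = [Counter(tokenize(p)) for p in passages]
    lengths = [sum(d.values()) for d in docs]
    n = len(docs)
    avgdl = sum(lengths) / n if n else 0.0
    df = Counter(t for d in docs for t in d)
    scores = []
    for d, dl in zip(docs, lengths):
        s = 0.0
        for t in set(tokenize(query)):
            tf = d.get(t, 0)
            if tf == 0:
                continue
            # Collection frequency component (always positive in this variant).
            idf = math.log(1.0 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency saturates, and passages longer than average are penalised.
            norm = tf + k1 * (1.0 - b + b * dl / avgdl)
            s += idf * tf * (k1 + 1.0) / norm
        scores.append(s)
    return scores
```

For passage-to-passage similarity, one passage can be passed as the `query` and scored against all the others.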

20
Passage adjacency matrix (partial)
21
Passage Relation Map
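
One way to derive a passage adjacency matrix and a relation map from pairwise similarities is sketched below: two passages are linked when their similarity reaches a threshold, and connected groups of passages form clusters. The threshold value and the function names are assumptions; the figures on these slides are not available in the transcript.

```python
def adjacency_matrix(sim, threshold=0.2):
    """Link passages i and j when their similarity reaches the threshold."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
            for i in range(n)]


def passage_clusters(adj):
    """Connected components of the relation map, each a sorted list of indices."""
    n = len(adj)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if adj[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(component))
    return clusters
```

A passage with no links ends up in a cluster of its own, which corresponds to a distinct topic in the document.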
22
Passage Extraction Rules

Passage clusters help us to identify themes and
topics: unconnected passages form distinct topics
covered in a document.
The MMR algorithm (CMU) (Goldstein et al., 2000)
The sentence/passage closest to the centroid of the
cluster is chosen to be included in the summary.
Sentences that are maximally similar to the
document and maximally dissimilar to sentences
already in the summary are selected to compose a
summary.
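
The MMR selection rule above can be sketched as a greedy loop that trades relevance to the document against redundancy with what has already been selected. The trade-off parameter `lam`, its default value, and the function name are assumptions for this illustration; the similarities are taken as given inputs.

```python
def mmr_select(doc_sim, pair_sim, k, lam=0.7):
    """Greedy MMR selection.

    doc_sim[i]     similarity of sentence i to the whole document
    pair_sim[i][j] similarity between sentences i and j
    Returns the indices of up to k sentences in the order they were picked.
    """
    candidates = set(range(len(doc_sim)))
    selected = []
    while candidates and len(selected) < k:
        def mmr(i):
            # Redundancy is the highest similarity to any sentence already chosen.
            redundancy = max((pair_sim[i][j] for j in selected), default=0.0)
            return lam * doc_sim[i] - (1.0 - lam) * redundancy
        # Sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the rule reduces to picking the sentences most similar to the document; lower values push the selection toward diversity.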

23
Creating theme labels

Keywords (TF based)
Word families (semantically related words in a
passage cluster)
Key phrases:
Linguistic approach
Statistical approach: simple heuristics (Kelledy
and Smeaton, 1997) seem quite effective.

24
Next step
25
WordNet, since 1985

Lexical database developed at Princeton
University, led by George Miller
Hand-coded, freely available
Word knowledge of nouns, verbs, adjectives,
adverbs
Semantic network representation with only a few
semantic relations
Synonym, hypernym, ...
Categorization relation: Is-a
Widely used in query expansion, word similarity
determination (based on synsets)
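
The two uses named above, query expansion and word similarity, can be illustrated on a toy lexical network. The data below is made up for the example and is not taken from WordNet, and the function names are hypothetical; a real system would query the WordNet database instead.

```python
# Toy data for illustration only; it is not taken from WordNet.
SYNSETS = {
    "car": {"car", "automobile"},
    "automobile": {"car", "automobile"},
    "firm": {"firm", "company"},
    "company": {"firm", "company"},
}
HYPERNYM = {  # child -> parent (is-a relation)
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "artifact",
    "company": "organization",
}


def expand_query(terms):
    """Add every synonym of every query term."""
    expanded = set()
    for t in terms:
        expanded |= SYNSETS.get(t, {t})
    return expanded


def ancestors(word):
    """The word followed by its chain of hypernyms, nearest first."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def path_similarity(a, b):
    """1 / (1 + length of the shortest is-a path), or 0.0 if unconnected."""
    up_a, up_b = ancestors(a), ancestors(b)
    for i, node in enumerate(up_a):
        if node in up_b:
            return 1.0 / (1 + i + up_b.index(node))
    return 0.0
```

Here each word has a single parent, so the first shared ancestor gives the shortest path; a word with several senses or several hypernyms would need a graph search instead.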

29
FrameNet, The Berkeley FrameNet project

A lexical resource, but one that contains much
richer information about words than WordNet
Contains rich linguistic knowledge necessary for
text understanding
Documents the range of semantic and syntactic
combinatory possibilities of each word (nouns,
verbs, adjectives) in each of its senses
Manual annotation of example sentences;
automatic capture and organization of the
annotation results (using IE technology)
Can be displayed and queried via the web

32
How to use FrameNet?

Frames are formed in accordance with the various
uses of prepositions around the verb sense
The cases associated with a verb sense are
related to the questions that we would usually ask
about an event, such as who did what to whom, and
when?
Parts of a sentence are used to instantiate a
frame, and content is recognized from the text
segments to fill in the frame slots.
Much needs to be explored.
Its coverage is limited.

35
ConceptNet, MIT Media Lab

Common sense knowledge base with NLP capability
Extracted automatically from common sense
knowledge expressed in semi-structured NL
sentences from OMCSNet (Open Mind Common Sense)
by applying about 50 extraction rules
Example sentence: "The effect of falling off a
bike is you get hurt."
"A lime is a very sour fruit" at OMCS is
extracted into two assertions:
IsA (lime, fruit)
PropertyOf (lime, very sour)
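
A single extraction rule of this kind can be sketched as a regular expression over the sentence. The pattern below is a hypothetical rule written for this example, not one of the actual ConceptNet rules; it only covers sentences shaped like the lime example.

```python
import re

# Hypothetical rule for sentences shaped like "A <noun> is a <modifiers> <noun>".
IS_A_RULE = re.compile(
    r"^(?:a|an|the)\s+(?P<subj>\w+)\s+is\s+(?:a|an)\s+"
    r"(?:(?P<mods>.+)\s+)?(?P<head>\w+)\.?$",
    re.IGNORECASE,
)


def extract_assertions(sentence):
    """Return (relation, arg1, arg2) tuples, or [] if the rule does not match."""
    m = IS_A_RULE.match(sentence.strip())
    if not m:
        return []
    subj = m.group("subj").lower()
    # The last word of the predicate is treated as the category.
    assertions = [("IsA", subj, m.group("head").lower())]
    # Anything between the article and the category is treated as a property.
    if m.group("mods"):
        assertions.append(("PropertyOf", subj, m.group("mods").lower()))
    return assertions
```

Sentences with other shapes, such as the bike example, do not match this rule and would need rules of their own.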

37
ConceptNet (Liu and Singh, 2004a, 2004b)

Inference:
Spreading activation: node activation radiating
outward from an origin node
GetContext (node)
GetAnalogousConcept (node)
Graph traversal:
FindPathBetweenNodes (node1, node2)
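
The two inference styles above can be illustrated on a toy concept graph. This is a simplified sketch, not the ConceptNet toolkit's implementation: the graph is made up, the function names only echo the operations listed on the slide, and activation here is simply multiplied by a fixed decay per hop rather than combined across weighted links.

```python
from collections import deque

# Toy undirected concept graph; the nodes and links are made up for illustration.
GRAPH = {
    "bike": {"ride", "fall", "wheel"},
    "fall": {"bike", "hurt"},
    "hurt": {"fall", "hospital"},
    "ride": {"bike"},
    "wheel": {"bike"},
    "hospital": {"hurt"},
}


def get_context(graph, origin, decay=0.5, max_hops=2):
    """Spreading activation: activation is multiplied by decay on every hop."""
    activation = {origin: 1.0}
    frontier = [origin]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbour in sorted(graph.get(node, ())):
                if neighbour not in activation:
                    activation[neighbour] = activation[node] * decay
                    next_frontier.append(neighbour)
        frontier = next_frontier
    del activation[origin]
    # Strongest activation first, ties broken alphabetically.
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))


def find_path_between_nodes(graph, start, goal):
    """Breadth-first search; returns a shortest path or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None
```

Limiting the number of hops keeps the context focused on concepts near the origin node; a larger `max_hops` widens it at the cost of weaker, noisier associations.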

38
ConceptNet (Liu and Singh, 2004a, 2004b)

Supports:
Topic sensing
Query expansion
Semantic similarity of words
Lexical generalization
Thematic generalization
Much needs to be examined
Uncontrolled vocabulary; can be biased in terms
of content, but the knowledge seems quite reliable.

39
Topic-Sensing
40
Eurovoc multilingual thesaurus

Controlled vocabulary, 20 languages, broad fields:
politics, international relations, European
Communities, law, economics, trade, finance,
social questions, education, science,
international organizations, employment and
working conditions,
industry, business and competition, production,
technology and research,
transport, environment, energy,
agriculture, forestry and fisheries,
agri-foodstuffs,
geography

41
Next step work

It is not clear how the various current knowledge
resources will help in real world business
applications, but it is important to take a
deeper look at them.
Study the peculiarities of specific business
document corpora to improve the selection
process.
Other knowledge resources
