Title: Text Summarization In Search of Effective Ideas and Techniques
1Text Summarization -- In Search of Effective
Ideas and Techniques
- Shuhua Liu, Assistant Professor
- Department of Information Systems
- Åbo Akademi University, Finland
- Visiting scholar at BISC, UC Berkeley
- Berkeley, September 21, 2004
2What is text summarization?
- to reduce (long) textual information to its most
essential points
- to distill the most important information from a
source or sources to produce an abridged version
of it (Endres-Niggemeyer, 1998; Mani and Maybury,
1999; Spärck-Jones, 1999).
3Text summarization: a context-dependent activity
4Text summarization
- Key issues
- how to identify the most important content out of
the rest of the text?
- how to synthesize the substance and formulate a
summary text based on the identified content?
- Major approaches
- Selection based: produce extracts
- Text understanding based: produce abstracts
6Selection based summarization: how does it work?
- The most content-bearing sentences or passages
are identified and selected to compose a summary.
- Compute a significance value for each sentence
(Luhn, 1958; Edmundson, 1969)
- Count word frequency
- the keywords, title words, cue words it contains
- the position of the sentence
- RST (Rhetorical Structure Theory) based discourse
analysis (Marcu, 1997)
- Passage and sentence similarity analysis
(Goldstein et al., 2000, CMU)
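The sentence scoring idea above can be sketched as follows. This is a minimal illustration in the spirit of Luhn (1958) and Edmundson (1969), not their exact formulas: the stopword list, the cue-word list, the equal weights, and the 1/(i+1) position score are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stopword and cue-word lists, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "it", "this", "that", "for", "on"}
CUE_WORDS = {"significant", "conclusion", "important", "summary"}


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_sentences(sentences, title, w_freq=1.0, w_title=1.0, w_cue=1.0, w_pos=1.0):
    """Return one significance score per sentence (higher means more important)."""
    content = [t for s in sentences for t in tokenize(s) if t not in STOPWORDS]
    freq = Counter(content)
    max_freq = max(freq.values()) if freq else 1
    title_words = set(tokenize(title)) - STOPWORDS
    scores = []
    for i, sentence in enumerate(sentences):
        words = [t for t in tokenize(sentence) if t not in STOPWORDS]
        if not words:
            scores.append(0.0)
            continue
        f = sum(freq[w] for w in words) / (len(words) * max_freq)  # word frequency
        t = sum(w in title_words for w in words) / len(words)      # title words
        c = sum(w in CUE_WORDS for w in words) / len(words)        # cue words
        p = 1.0 / (i + 1)                                          # position (assumed form)
        scores.append(w_freq * f + w_title * t + w_cue * c + w_pos * p)
    return scores


def extract(sentences, title, k=2):
    """Pick the k best-scoring sentences and return them in document order."""
    scores = score_sentences(sentences, title)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Returning the selected sentences in document order, rather than score order, is one simple way to reduce the readability problems that extracts tend to have.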
7MSWord AutoSummarize
8MEAD/NewsINEssence (Radev et al, 2003)
9MEAD/NewsINEssence (Radev et al, 2003)
10MEAD/NewsINEssence (Radev et al, 2003)
11Text understanding system
- A text understanding task often aims to recover
all of the information that there is in a text,
including what is only implicit in what is
actually written.
- All the richness of natural language becomes
fair game, including metaphor, metonymy,
discourse structure, and the recognition of the
author's underlying intentions, and the full
interplay between language and world knowledge
becomes central to the task.
12Text understanding based summarization
- Depends on complete sentence analysis and
discourse analysis with full knowledge support
- Syntactic parser, semantic interpreter
- Linguistic knowledge, world knowledge, domain
knowledge
- Reasoning mechanisms that work effectively over
huge knowledge collections.
13Selection based vs. Understanding based
- Selection based: generally applicable, but
incoherent content, poor readability due to
unclear relationships between the selected text
excerpts, dangling references, and so on.
- Understanding based: high precision, but very
slow, large amount of wasted computation, highly
domain specific.
- Endres-Niggemeyer (2000) found that people
sometimes prefer extractive summaries to
gloss-over abstractive summaries!
14The reality
- The dominant approach in practice is still
selection-based
- Understanding based systems only exist in theory,
and will continue to do so for quite a while
- However, certain text understanding tasks on a
small scale or in restricted domains can be done.
15Topic guided text summarization: TIDE
- TIDE is our effort to apply text summarization
techniques to business applications.
- Such real world applications will require a
combination of these different forms of
summarization.
- An extractive summary alone will not do.
- An abstractive summary alone will not do.
- Information extraction alone will not do.
16Topic guided text summarization: TIDE
- Text summarization as a process of topic
analysis, passage extraction, text
understanding, information integration/fusion,
and text generation.
- Passage extraction guided by topic structure is
expected to keep the logical relationships between
the extracted text parts, e.g. sentences are
arranged logically according to topic structure
- Topic representation will also be very helpful in
the next phase of text analysis and information
integration.
17Phase 1: Theme detection, topic labels,
sentence/passage selection
- Theme detection through pairwise passage
similarity analysis
- Vector space model of term and document
- TF-IDF baseline method
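A minimal sketch of the TF-IDF baseline for pairwise passage similarity is given below. It assumes raw term counts, a smoothed IDF variant, and cosine similarity between sparse vectors; the slides do not specify which weighting variant was used, so these choices are illustrative only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(passages):
    """Map each passage to a sparse {term: weight} TF-IDF vector."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency per term
    # Smoothed IDF (an assumed variant): terms found in every passage keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(passages):
    """Return the full pairwise passage similarity matrix."""
    vecs = tfidf_vectors(passages)
    return [[cosine(a, b) for b in vecs] for a in vecs]
```

Passages that share no terms get a similarity of exactly zero under this model, which is one motivation for trying latent-space methods as an alternative.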
18Passage similarity analysis with the LSA method
- LSA (Latent Semantic Analysis)
- http://www.cs.utk.edu/lsi/ (Deerwester et al.,
1990)
- http://lsa.colorado.edu/
- Results similar to those obtained with TF-IDF
- Fuzzy LSI approach (Nikravesh, 2002)
19Passage similarity analysis
- OKAPI (TREC-3, Robertson et al., 1996)
- Weight functions take into account document
length, average document length, and relevance
feedback factors, in addition to term frequency
and collection frequency
- Current standard
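The length-normalized weighting described above can be illustrated with the commonly used BM25 form of the Okapi weight function. This is a sketch under assumptions: it omits the relevance feedback factors, uses the conventional defaults k1 = 1.2 and b = 0.75, and uses a non-negative IDF variant. None of these settings are taken from the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bm25_scores(query, passages, k1=1.2, b=0.75):
    """Score every passage against the query with a BM25-style weight."""
    if not passages:
        return []
    docs = [Counter(tokenize(p)) for p in passages]
    lengths = [sum(d.values()) for d in docs]
    avg_len = (sum(lengths) / len(docs)) or 1.0  # guard against all-empty passages
    n = len(docs)
    query_terms = set(tokenize(query))
    scores = []
    for d, dl in zip(docs, lengths):
        score = 0.0
        for term in query_terms:
            tf = d[term]
            if tf == 0:
                continue
            df = sum(1 for other in docs if term in other)    # collection frequency
            idf = math.log(1.0 + (n - df + 0.5) / (df + 0.5))  # assumed non-negative variant
            # Document length relative to the average dampens the term frequency.
            norm = tf + k1 * (1.0 - b + b * dl / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores
```

With equal term frequencies, a shorter passage scores higher than a longer one, which is the effect of the document length normalization.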
20Passage adjacency matrix (partial)
21Passage Relation Map
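One way to obtain a passage adjacency matrix and a relation map from pairwise similarities is to link two passages whenever their similarity reaches a threshold, then read the clusters off as connected components. The sketch below assumes this construction; the threshold value of 0.2 and the function names are hypothetical and are not taken from the slides.

```python
def adjacency_matrix(sim, threshold=0.2):
    """Link passages i and j (i != j) when their similarity reaches the threshold."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
            for i in range(n)]


def passage_clusters(adj):
    """Return the connected components of the relation map as sorted index lists."""
    n = len(adj)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # depth-first walk over linked passages
            node = stack.pop()
            component.append(node)
            for other in range(n):
                if adj[node][other] and other not in seen:
                    seen.add(other)
                    stack.append(other)
        clusters.append(sorted(component))
    return clusters
```

A passage with no links ends up in a cluster of its own, so isolated passages show up as separate topics.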
22Passage Extraction Rules
- Passage clusters help us to identify themes and
topics: unconnected passages form distinct topics
covered in a document.
- The MMR algorithm (CMU) (Goldstein et al., 2000)
- The sentence/passage closest to the centroid of the
cluster is chosen for inclusion in the summary.
- Sentences that are maximally similar to the
document and maximally dissimilar to sentences
already in the summary are selected to compose a
summary.
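The selection rule above (maximally similar to the document, maximally dissimilar to what is already selected) can be sketched as a greedy loop. This is a minimal illustration of the MMR idea, assuming precomputed similarity values and a trade-off parameter `lam`; the parameter name and its default are assumptions, not values from Goldstein et al.

```python
def mmr_select(doc_sim, pair_sim, k, lam=0.7):
    """Greedy MMR selection of k sentence indices.

    doc_sim[i]     : similarity of sentence i to the whole document
    pair_sim[i][j] : similarity between sentences i and j
    lam            : 1.0 means relevance only; lower values penalize redundancy more
    """
    selected = []
    candidates = set(range(len(doc_sim)))
    while candidates and len(selected) < k:
        def mmr(i):
            # Redundancy is the highest similarity to any sentence already chosen.
            redundancy = max((pair_sim[i][j] for j in selected), default=0.0)
            return lam * doc_sim[i] - (1.0 - lam) * redundancy
        best = max(sorted(candidates), key=mmr)  # ties go to the lowest index
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the loop reduces to plain relevance ranking; lowering it makes a near-duplicate of an already selected sentence lose out to a less relevant but novel one.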
23Creating theme labels
- Keywords (TF based)
- Word families (semantically related words in a
passage cluster)
- Key phrases
- Linguistic approach
- Statistical: simple heuristics (Kelledy and
Smeaton, 1997) seem quite effective.
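As a rough illustration of the TF-based keyword option only, a theme label can be formed from the most frequent content words in a passage cluster. The stopword list and the function name `theme_label` are hypothetical; word families and key phrases would need more than this.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "as", "on", "for"}


def theme_label(cluster_passages, k=3):
    """Return the k most frequent content words across the passages of a cluster."""
    counts = Counter(
        token
        for passage in cluster_passages
        for token in re.findall(r"[a-z]+", passage.lower())
        if token not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(k)]
```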
24Next step
25WordNet, since 1985
- Lexical database developed at Princeton
University, led by George Miller
- Hand-coded, freely available
- Word knowledge of nouns, verbs, adjectives,
adverbs
- Semantic network representation with only a few
semantic relations
- Synonym, hypernym
- Categorization relation: Is-a
- Widely used in query expansion, word similarity
determination (based on synsets)
29FrameNet, The Berkeley FrameNet project
- A lexical resource, but contains much richer
information about words than WordNet
- Contains rich linguistic knowledge necessary for
text understanding
- Documents the range of semantic and syntactic
combinatory possibilities of each word (nouns,
verbs, adjectives) in each of its senses
- Manual annotation of example sentences;
automatic capture and organization of the
annotation results (using IE technology)
- Can be displayed and queried via the web
32How to use FrameNet?
- Frames are formed in accordance with the various
uses of prepositions around the verb sense
- The cases associated with a verb sense are
related to the questions that we would usually ask
about an event, such as who did what to whom, and
when?
- Parts of a sentence are used to instantiate a
frame, and content is recognized from the text
segments to fill in the frame slots.
- Much needs to be explored.
- Its coverage is limited.
35ConceptNet, MIT Media Lab
- Common sense knowledge base with NLP capability
- Extracted automatically from common sense
knowledge expressed in semi-structured NL
sentences from OMCSNet (Open Mind Common Sense),
applying about 50 extraction rules
- "The effect of falling off a bike is you get
hurt."
- "A lime is a very sour fruit" at OMCS is
extracted into two assertions
- IsA (lime, fruit)
- PropertyOf (lime, very sour)
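The lime example above can be reproduced with a single pattern-based rule. The regular expression below is a hypothetical rule written for this illustration; it is not one of the roughly 50 rules used to build ConceptNet, and it only handles sentences of the form "A X is a (modifiers) Y".

```python
import re

# One hypothetical extraction rule: "A <noun> is a [<modifiers>] <noun>".
RULE = re.compile(
    r"^(?:an?|the)\s+(\w+)\s+is\s+(?:an?|the)\s+(?:(.+)\s+)?(\w+)\.?$",
    re.IGNORECASE,
)


def extract_assertions(sentence):
    """Turn a matching sentence into (relation, concept, value) triples."""
    match = RULE.match(sentence.strip())
    if not match:
        return []  # sentences this rule does not cover are left alone
    subject = match.group(1).lower()
    modifiers = match.group(2)
    head = match.group(3).lower()
    assertions = [("IsA", subject, head)]
    if modifiers:
        assertions.append(("PropertyOf", subject, modifiers.lower()))
    return assertions
```

Calling `extract_assertions("A lime is a very sour fruit")` yields the IsA and PropertyOf assertions shown on the slide, while the "falling off a bike" sentence is not matched by this rule and would need a separate rule of its own.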
37ConceptNet (Liu and Singh, 2004a, 2004b)
- Inference
- Spreading activation: node activation radiating
outward from an origin node
- GetContext (node)
- GetAnalogousConcept (node)
- Graph traversal
- FindPathBetweenNodes (node1, node2)
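The two kinds of operation listed above can be illustrated on a toy graph. The functions below borrow their names from the slide, but they are simplified stand-ins and not the ConceptNet API: the graph, the decay factor, and the hop limit are assumptions, and each node is activated only once, by the first hop that reaches it.

```python
from collections import deque

# Toy undirected concept graph; the nodes and links are illustrative only.
GRAPH = {
    "lime": {"fruit", "sour"},
    "fruit": {"lime", "food", "tree"},
    "sour": {"lime", "taste"},
    "food": {"fruit", "eat"},
    "tree": {"fruit"},
    "taste": {"sour"},
    "eat": {"food"},
}


def get_context(graph, origin, decay=0.5, max_hops=2):
    """Spread activation outward from origin; each hop multiplies the energy by decay."""
    activation = {origin: 1.0}
    frontier = [origin]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, ()):
                if neighbor not in activation:
                    activation[neighbor] = activation[node] * decay
                    next_frontier.append(neighbor)
        frontier = next_frontier
    del activation[origin]
    # Strongest activation first, ties broken alphabetically.
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))


def find_path_between_nodes(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list of nodes, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None
```

Concepts closer to the origin receive more activation, so the head of the returned list serves as a rough context for the origin concept.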
38ConceptNet (Liu and Singh, 2004a, 2004b)
- Support
- Topic sensing
- Query expansion
- Semantic similarity of words
- Lexical generalization
- Thematic generalization
- Much needs to be examined
- Uncontrolled vocabulary; can be biased in terms
of content, but the knowledge seems quite reliable.
39Topic-Sensing
40Eurovoc multilingual thesaurus
- Controlled vocabulary, 20 languages, broad fields
- politics, international relations, European
Communities, law, economics, trade, finance,
social questions, education, science,
international organizations, employment and
working conditions
- industry, business and competition, production,
technology and research
- transport, environment, energy
- agriculture, forestry and fisheries,
agri-foodstuffs
- geography
41Next steps
- It is not clear how the various current knowledge
resources will help in real world business
applications. But it is important to take a
closer look at them.
- Study the peculiarities of specific business
document corpora to improve the selection
process.
- Other knowledge resources