Telugu-English Dictionary Based Cross Language Query Focused Multi-Document Summarization

Provided by: carbonVide

1
Telugu-English Dictionary Based Cross Language
Query Focused Multi-Document Summarization

IIIA-2006

Prasad Pingali, J. Jagadeesh, Vasudeva Varma
International Institute of Information Technology, Hyderabad, India
{pvvpr, vv}@iiit.ac.in
July 08, 2006
2
Cross-Language Information Systems

CLIR and factoid-based question answering have been thoroughly researched
TREC, CLEF and NTCIR are some of the focused workshops in this area
Efforts are under way to make state-of-the-art research usable in applications
Machine translation capabilities are a bottleneck for end-user applications
Cross-language (CL) query-focused summarization has not been studied

3
CL Query Focused Summarization Use Case
4
Potential Benefits of CLQ Summarization

A bridge between CLIR and MT
Output is a coherent, readable and syntactically correct paragraph
MT systems need syntactically correct sentences for proper translation
The summary can be tailored to suit the constraints of the MT system, for example a summary with minimal natural-language ambiguity

5
Problem Statement

To synthesize, from a set of 25-50 documents in a language L2 that are related to a given topic, a brief, well-organized, fluent answer to a need for information given in a language L1 that cannot be met by just stating a name, date, quantity, etc.

6
Extraction Based Summary

Extraction-based summarization process:
Identify sentences from the documents
Score these sentences using some function
Choose the top-scoring sentences
Identify and eliminate exact duplicates and near duplicates (redundancy)
Concatenate the sentences in some logical order (e.g., by timestamp)
Sentence scoring function:
Query independent, or
Query based, or
A combination of the two
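The steps listed on this slide can be sketched in a few lines of Python. This is only an illustrative skeleton, not the authors' system: the naive sentence splitter, the query-overlap score, the position heuristic, and the interpolation weight `lam` are all assumptions chosen to show how a query-based and a query-independent score can be combined. Near-duplicate removal and final ordering are left out here.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence: str) -> List[str]:
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def query_based(sentence: str, query: str) -> float:
    """Fraction of query terms that occur in the sentence (assumed scorer)."""
    q = set(tokens(query))
    if not q:
        return 0.0
    return len(q & set(tokens(sentence))) / len(q)


def query_independent(position: int) -> float:
    """Position heuristic: earlier sentences score higher (assumed scorer)."""
    return 1.0 / (1 + position)


def extract(documents: List[str], query: str, k: int = 2, lam: float = 0.7) -> List[str]:
    """Score every sentence, drop exact duplicates, return the top k."""
    scored = []
    seen = set()
    for doc in documents:
        for pos, sent in enumerate(split_sentences(doc)):
            key = " ".join(tokens(sent))
            if key in seen:  # exact duplicate after normalization
                continue
            seen.add(key)
            # "Combination": linear interpolation of the two scores
            score = lam * query_based(sent, query) + (1 - lam) * query_independent(pos)
            scored.append((score, sent))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [sent for _, sent in scored[:k]]


docs = [
    "Floods hit the coast. The weather was calm elsewhere.",
    "Floods hit the coast. Relief funds reached flood victims.",
]
top = extract(docs, "flood relief funds", k=2)
```

Setting `lam` to 1.0 gives a purely query-based ranking and 0.0 a purely query-independent one.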

7
Sentence Scoring using RBLM

Relevance-based language modeling (RBLM) framework

Using conditional sampling, the joint probability can be rewritten as

Term dependencies in a CL setting are calculated as
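
The formulas on this slide were images and did not survive in the transcript. The block below is a hedged reconstruction, assuming the standard relevance-model formulation in which the query terms q_1 ... q_k (in language L1) are sampled independently given a word w, and using the two factors named on the following slide, with e_j ranging over L2 terms. The exact form shown in the original talk may differ.

```latex
% Relevance of a word w to the query Q = q_1 ... q_k
P(w \mid Q) = \frac{P(w, q_1, \ldots, q_k)}{P(q_1, \ldots, q_k)}

% Conditional sampling: query terms are independent given w
P(w, q_1, \ldots, q_k) = P(w) \prod_{i=1}^{k} P(q_i \mid w)

% Cross-language term dependency: marginalize over L2 terms e_j
P(q_i \mid w) = \sum_{j} P(q_i \mid e_j) \, P(e_j \mid w)
```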

8
Sentence Scoring using RBLM (contd.)

P(qi | ej) as translation probability
Explored in statistical machine translation and CLIR
P(ej | w) as post-translation query expansion
Calculated using dictionary based
Term co-occurrence statistics
Pseudo-relevance feedback

9
Calculations in our Experiments (Telugu-English)

P(qi | ej) as translation probability
Used a bilingual dictionary and assumed uniform probability for all possible translations
P(ej | w) as post-translation query expansion
Used the Hyperspace Analogue to Language (HAL) feature to compute term dependencies
HAL is based on skip-bigrams; the window length is a parameter
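
A minimal sketch of these two estimates follows. Several details are assumptions rather than facts from the talk: each of the n dictionary translations of a Telugu term receives weight 1/n, and that weight is used directly as the translation probability; the HAL weights decrease linearly with distance inside the window; left and right contexts are merged into one symmetric matrix; and each row is normalized to obtain the term dependency. The dictionary entry is a hypothetical romanized example.

```python
from collections import defaultdict
from typing import Dict, List


def uniform_translation_probs(bilingual_dict: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Give each listed translation of a source term the same weight 1/n."""
    probs = {}
    for source_term, translations in bilingual_dict.items():
        unique = list(dict.fromkeys(translations))  # drop repeated entries, keep order
        probs[source_term] = {e: 1.0 / len(unique) for e in unique} if unique else {}
    return probs


def hal_matrix(tokens: List[str], window: int = 3) -> Dict[str, Dict[str, float]]:
    """Distance-weighted co-occurrence counts within a sliding window.

    A pair at distance d (1 <= d <= window) gets weight window - d + 1.
    Counts are accumulated symmetrically (an assumption of this sketch).
    """
    hal = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d >= len(tokens):
                break
            v = tokens[i + d]
            weight = window - d + 1
            hal[w][v] += weight
            hal[v][w] += weight
    return hal


def term_dependency(hal: Dict[str, Dict[str, float]], w: str) -> Dict[str, float]:
    """Normalize the HAL row of w into a distribution over co-occurring terms."""
    row = hal.get(w, {})
    total = sum(row.values())
    return {e: c / total for e, c in row.items()} if total else {}


# Hypothetical romanized dictionary entry, for illustration only
translation = uniform_translation_probs({"varada": ["flood", "deluge"]})
hal = hal_matrix(["flood", "relief", "funds", "flood"], window=2)
dependency = term_dependency(hal, "flood")
```

A larger window lets more distant word pairs contribute, at the cost of noisier associations.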

10
Sentence Score

The score of each word 'w' with respect to the query can be written as

The score of each sentence 'S' with respect to the query can be written as
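
The two formulas are missing from the transcript, so the following is only a sketch under stated assumptions. The word score is taken to be the prior of the word multiplied, for every query term, by the sum over English terms of translation probability times term dependency. The sentence score is taken to be the length-normalized sum of its word scores. The smoothing constant `eps` and the length normalization are assumptions of this sketch, not details confirmed by the slides.

```python
from typing import Dict, List


def word_score(
    w: str,
    query_terms: List[str],
    prior: Dict[str, float],             # P(w), e.g. relative frequency in the collection
    trans: Dict[str, Dict[str, float]],  # trans[q][e] ~ P(q | e)
    dep: Dict[str, Dict[str, float]],    # dep[w][e]   ~ P(e | w)
    eps: float = 1e-6,                   # assumed smoothing so one unmatched term does not zero the product
) -> float:
    """P(w) * prod_i ( sum_j P(q_i | e_j) * P(e_j | w) + eps )."""
    score = prior.get(w, 0.0)
    related = dep.get(w, {})
    for q in query_terms:
        score *= sum(p * related.get(e, 0.0) for e, p in trans.get(q, {}).items()) + eps
    return score


def sentence_score(
    sentence_tokens: List[str],
    query_terms: List[str],
    prior: Dict[str, float],
    trans: Dict[str, Dict[str, float]],
    dep: Dict[str, Dict[str, float]],
    eps: float = 1e-6,
) -> float:
    """Average word score over the sentence (assumed normalization)."""
    if not sentence_tokens:
        return 0.0
    total = sum(word_score(w, query_terms, prior, trans, dep, eps) for w in sentence_tokens)
    return total / len(sentence_tokens)


# Toy values, invented purely to exercise the functions
prior = {"relief": 0.2, "flood": 0.3}
trans = {"q1": {"flood": 0.5, "deluge": 0.5}}
dep = {"relief": {"flood": 0.4, "funds": 0.6}, "flood": {"relief": 1.0}}
relief_score = word_score("relief", ["q1"], prior, trans, dep, eps=0.0)
pair_score = sentence_score(["relief", "flood"], ["q1"], prior, trans, dep, eps=0.0)
```

Without smoothing, a word with no association to some query term scores zero, which is why a small `eps` is used by default here.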

11
Summary Generation

Top-ranking sentences are tested for redundancy
Cosine similarity between sentences (unigram overlap) is used to measure redundancy
After redundant sentences are removed, the remaining candidates are concatenated in order of the publication date of their documents
Sentences belonging to the same document are concatenated in their order of occurrence in the document
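
The selection and ordering described on this slide can be sketched as follows. The similarity threshold of 0.7, the summary length limit, and the `Candidate` record are hypothetical choices for illustration; the slides do not state the values used.

```python
import math
import re
from collections import Counter
from datetime import date
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str
    pub_date: date   # publication date of the source document
    doc_id: str
    position: int    # index of the sentence inside its document


def unigrams(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: str, b: str) -> float:
    """Cosine similarity between unigram count vectors."""
    va, vb = unigrams(a), unigrams(b)
    dot = sum(count * vb[term] for term, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def build_summary(ranked: List[Candidate], threshold: float = 0.7, max_sentences: int = 3) -> str:
    """ranked must be sorted by descending sentence score."""
    selected: List[Candidate] = []
    for cand in ranked:
        if len(selected) >= max_sentences:
            break
        # keep the candidate only if it is not too similar to anything already chosen
        if all(cosine(cand.text, kept.text) < threshold for kept in selected):
            selected.append(cand)
    # order by publication date, then by position within the same document
    selected.sort(key=lambda c: (c.pub_date, c.doc_id, c.position))
    return " ".join(c.text for c in selected)


ranked = [
    Candidate("Floods hit the coastal districts.", date(2005, 3, 2), "d2", 0),
    Candidate("Floods hit the coastal districts today.", date(2005, 3, 1), "d1", 3),
    Candidate("Relief funds were released.", date(2005, 3, 1), "d1", 1),
    Candidate("Officials visited the camps.", date(2005, 3, 1), "d1", 0),
]
summary = build_summary(ranked)
```

Because selection is greedy over the ranked list, the higher-scoring member of a near-duplicate pair is the one that survives.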

12
Evaluation

Experiments on the Telugu-English language pair
Used the DUC 2005 dataset
Manually translated the DUC queries into Telugu
Used the Telugu queries as input to the system
The system generates an English summary
Summaries evaluated against DUC model summaries using the ROUGE package
The ROUGE package contains many metrics; for paragraph-length summaries, ROUGE-2 and ROUGE-SU4 correlate best with human evaluations
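
To make the two metrics concrete, here is a simplified sketch of what they count. It is not the ROUGE package used in the talk: it handles a single reference, treats the whole text as one token sequence, and does no stemming or stop-word removal. ROUGE-2 compares bigrams; ROUGE-SU4 compares unigrams plus ordered word pairs with at most four words between them.

```python
from collections import Counter
from typing import List


def bigrams(toks: List[str]) -> Counter:
    """Adjacent word pairs (the units counted by ROUGE-2)."""
    return Counter(zip(toks, toks[1:]))


def skip_bigrams_with_unigrams(toks: List[str], max_gap: int = 4) -> Counter:
    """Unigrams plus ordered pairs with at most max_gap words in between."""
    units = Counter((t,) for t in toks)
    for i, left in enumerate(toks):
        for j in range(i + 1, min(i + max_gap + 2, len(toks))):
            units[(left, toks[j])] += 1
    return units


def f_measure(candidate_units: Counter, reference_units: Counter) -> float:
    """Balanced F-measure over clipped unit matches."""
    overlap = sum((candidate_units & reference_units).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(candidate_units.values())
    recall = overlap / sum(reference_units.values())
    return 2 * precision * recall / (precision + recall)


def rouge_2(candidate: List[str], reference: List[str]) -> float:
    return f_measure(bigrams(candidate), bigrams(reference))


def rouge_su4(candidate: List[str], reference: List[str]) -> float:
    return f_measure(skip_bigrams_with_unigrams(candidate), skip_bigrams_with_unigrams(reference))


candidate = "relief funds reached victims".split()
reference = "relief funds reached flood victims".split()
r2 = rouge_2(candidate, reference)
rsu4 = rouge_su4(candidate, reference)
```

The skip-bigram variant is more forgiving of an inserted or dropped word than the strict bigram match, which is visible in the higher score for the same sentence pair.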

13
An Example
14
Evaluation Results
15
Evaluation Results (per topic)
16
Conclusions

We extended our mono-lingual summarization framework to a cross-lingual setting within the RBLM framework
We designed a cross-lingual experimental setup using the DUC 2005 dataset
Experiments were conducted for the Telugu-English language pair
Comparison with the mono-lingual baseline shows about 90% of its performance in ROUGE-SU4 and about 85% in ROUGE-2 F-measures

17
DUC-2006
18
Invitation
Workshop on Cross Lingual Information Access
http://search.iiit.ac.in/CLIA2007
January 06, 2007, held at IJCAI 07 in Hyderabad, India
19
Thank you