SlideSeer: A DL of aligned document and presentation pairs - PowerPoint PPT Presentation

Slides: 28

Provided by: tanyeefanm

Transcript and Presenter's Notes

1
SlideSeer: A DL of aligned document and presentation pairs

Min-Yen Kan
WING (Web IR / NLP Group), National University of Singapore

2
Scholarly Digital Libraries: what do we use them for?

Find articles to print, read offline
Browse, select research work
Assess authors, publication venues, research groups
Papers (documents) don't store all of the information about a discovery
Datasets
Tools
Implementation details / conditions
They also don't help a person learn the research
Textbooks
Slide presentations

We'll focus on this
3
Qualities of slide presentations

Good slide sets complement a document. They often:
focus and highlight findings in the document
create a bridge into the document itself
are a visual and oral summary of a document
How can we leverage slides in a digital library?

What about poor slides?
PowerPoint is presenter-oriented, not content-oriented or audience-oriented. The remedy? Visual reasoning usually works more effectively when the relevant evidence is shown adjacent in space within the eyespan. (Tufte, 2006)
Four score and seven years ago
4
Documents and presentations as duals

Present identical or highly overlapping materials
Document: for archival and reference purposes
Presentation: for introducing and summarizing the work
As the two can be seen as duals, we should allow them to be viewed together.
Would like random access of the presentation and document pair
Answer: find pairs of documents and presentations.

5
A model: MIT's OpenCourseWare

A better answer: add fine-grained alignment.

Audio of lecture
Slides in context
Simplified transcript of lecture
6
Talk Outline

Motivation
Architecture
1. Resource Discovery
2. Alignment
3. User Interface
Demo
Status and Conclusions

7
1. Resource Discovery

Algorithm
Obtain suitable document metadata
Web search to find candidate presentations
Post-process to usable form

8
1. Resource Discovery: Obtaining Metadata

Start with CiteSeer (thanks to IST: C. L. Giles, I. Councill)
750K records with parsed header metadata
Complete with .pdf documents
Enhancement: merge DBLP snapshot (Aug 2006, 1.2M docs) with CiteSeer
Large-scale record linkage task; O(nm) complexity unacceptable
Indexed DBLP into Lucene, then used each CiteSeer (CS) record to retrieve DBLP variants, resulting in O(n) complexity
Result size: 1.5M
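
The index-based linkage on this slide can be illustrated roughly as follows. This is only a sketch: a plain in-memory inverted index stands in for Lucene, and the function names, the `min_shared` cutoff, and the toy titles are all hypothetical, not taken from SlideSeer.

```python
from collections import defaultdict

def tokens(title):
    """Lowercase a title and split it into a set of alphanumeric word tokens."""
    return set("".join(c.lower() if c.isalnum() else " " for c in title).split())

def build_index(records):
    """Map each token to the ids of the records whose title contains it."""
    index = defaultdict(set)
    for rec_id, title in records.items():
        for tok in tokens(title):
            index[tok].add(rec_id)
    return index

def candidates(index, query_title, min_shared=2):
    """Return ids of indexed records sharing at least min_shared tokens with the query.

    Only records that share a token are ever touched, so each query avoids
    a scan over the whole indexed collection.
    """
    counts = defaultdict(int)
    for tok in tokens(query_title):
        for rec_id in index.get(tok, ()):
            counts[rec_id] += 1
    return {rec_id for rec_id, n in counts.items() if n >= min_shared}

# Toy stand-ins for DBLP records (hypothetical data).
dblp = {
    "d1": "Aligning slides with scholarly documents",
    "d2": "Efficient record linkage in large data sets",
}
index = build_index(dblp)
matches = candidates(index, "Aligning Slides With Scholarly Documents.")
```

Each CiteSeer record would be issued as one such query, which is where the per-record (rather than per-pair) cost comes from.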

9
1. Resource Discovery: Finding presentations

Google API on title, author to find corresponding presentation
Use simple Jaccard similarity threshold to decide matches
threshold θ3 for title+author similarity

[Diagram: CiteSeer and DBLP are merged via a DBLP Lucene index; presentations are found by Web search (filetype:ppt); thresholds θ1, θ2 and θ3 label the stages.]
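
A Jaccard match on title and author could look like the sketch below. The tokenization, the `is_match` helper, and the threshold value of 0.5 are assumptions for illustration; the slides do not give the actual threshold.

```python
def word_set(text):
    """Lowercase text and return its set of alphanumeric word tokens."""
    return set("".join(c.lower() if c.isalnum() else " " for c in text).split())

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = word_set(a), word_set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_match(record, candidate, threshold=0.5):
    """Accept a candidate presentation if its title+author text is similar enough.

    threshold=0.5 is a hypothetical value, not the one used in SlideSeer.
    """
    return jaccard(record, candidate) >= threshold
```

For example, `jaccard("a b c", "b c d")` shares two of four distinct tokens and evaluates to 0.5.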
10
1. Resource Discovery: Conversion

Final results
85% precision; recall difficult to calculate (80%)
11K pairs after processing 200K of 1.5M records
Many caveats
only .pdf and .ppt formats currently handled
conversion fails often; pdf conversion difficult
current work: use OCR to redo text extraction

Documents (.pdf), via pdftohtml: text, formatted text
Presentations (.ppt), via czppt2gif/convert: png, text

11
2. Alignment: Problem formulation

Q: What are we aligning?
A: Text of slides to document text
Use paragraphs to delimit text units in documents
Use document headers to delimit sections
Q: What type of alignment is necessary?
A: Depends. Presentation- or document-centered view?
Presentation: 1 slide aligned to 0 or more paragraphs
Document: 1 section aligned to 0 or more slides
Q: What's the approach?
A: Two stages
Basic similarity measure to calculate a similarity matrix
Alignment schemes to establish alignment mapping

Concentrate on this
[Figure: similarity matrix of text units (1 to p) against slides (1 to s)]
12
2. Alignment: Related Work

Narration to presentation alignment
Usually naturally synchronous: monotonic alignment
Multilingual text alignment
Used in Machine Translation (MT)
Polynomial complexity (O(n^3)), but heuristics tend to work well
Slide/abstract to document alignment
Use Hidden Markov Model (HMM) for alignment
Doesn't handle missing materials well
Desiderata
Should take context into account
But shouldn't enforce monotonicity
Nil (zero) alignments needed when materials don't overlap

13
2. Alignment: Similarity Measures

Take text units, cut into tokens. Then calculate similarity using:
Cosine
Standard IR metric
TF-IDF for token weight
Calculate slide, paragraph vector similarity using cosine
Jaccard
unigram tokens
bigram
unigram + bigram
Use IDF weighting for tokens
For both schemes, use IDF weighting from WebBase corpus
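
The cosine variant can be sketched as below, producing a matrix with one row per paragraph and one column per slide. Whitespace tokenization and the toy `idf` table are assumptions; in the slides the IDF weights come from the WebBase corpus.

```python
import math
from collections import Counter

def tfidf(tokens, idf):
    """Weight each token by term frequency times IDF (unknown tokens get weight 0)."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(paragraphs, slides, idf):
    """Return sim[p][s]: cosine similarity of paragraph p and slide s."""
    p_vecs = [tfidf(p.lower().split(), idf) for p in paragraphs]
    s_vecs = [tfidf(s.lower().split(), idf) for s in slides]
    return [[cosine(p, s) for s in s_vecs] for p in p_vecs]

# Hypothetical IDF weights and texts, for illustration only.
idf = {"alignment": 2.0, "slides": 1.5, "library": 1.0}
sim = similarity_matrix(
    ["alignment of slides", "digital library"],
    ["slides alignment", "library"],
    idf,
)
```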

14
2. Alignment: Schemes
Using matrix of <p,s> similarity, align using:

1. Max Similarity
Baseline
Can't do nil alignment
2. Edit Distance
Efficient dynamic programming
But outputs only monotonic alignments
3. Local Jump Model
Variation on 2 to allow local backward jumps
Backward jumps within 5 of text units
Still doesn't handle reordered sections

4. Hidden Markov Model
Word-based
Attempts to find origin of s in p
Only handles overlapping information

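[Diagram: slide context s_{i-5} ... s_{i-1}, s_i, s_{i+1} ... s_{i+5}; word context w_{j-5} ... w_{j-1}, w_{j+1} ... w_{j+5}; paragraph sequence p1 > p2 > p3 > p4 > p5 > p6]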
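
To contrast schemes 1 and 2, the sketch below implements the max-similarity baseline and a simple monotonic dynamic program over a similarity matrix `sim[p][s]`. The dynamic program is a simplified stand-in, not the edit-distance formulation itself: it only assumes that each slide is aligned to one paragraph and that paragraph indices never decrease. Like the baseline, it cannot produce nil alignments.

```python
def max_similarity(sim):
    """Baseline: align each slide to its single most similar paragraph."""
    n_par, n_slides = len(sim), len(sim[0])
    return [max(range(n_par), key=lambda p: sim[p][s]) for s in range(n_slides)]

def monotonic_alignment(sim):
    """Best-scoring alignment in which paragraph indices never decrease."""
    n_par, n_slides = len(sim), len(sim[0])
    # best[p]: best total score so far with the current slide on paragraph p.
    best = [sim[p][0] for p in range(n_par)]
    back = []  # back[s-1][p]: paragraph chosen for slide s-1 when slide s is on p
    for s in range(1, n_slides):
        new_best, pointers = [], []
        arg = 0  # index of the best predecessor among paragraphs 0..p
        for p in range(n_par):
            if best[p] > best[arg]:
                arg = p
            new_best.append(best[arg] + sim[p][s])
            pointers.append(arg)
        best = new_best
        back.append(pointers)
    # Trace back from the best final paragraph.
    p = max(range(n_par), key=lambda q: best[q])
    path = [p]
    for pointers in reversed(back):
        p = pointers[p]
        path.append(p)
    return path[::-1]

# Toy matrix: 3 paragraphs (rows) by 3 slides (columns).
toy = [
    [0.9, 0.1, 0.6],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.5],
]
```

On the toy matrix the baseline sends the last slide back to the first paragraph, while the monotonic version is forced to keep moving forward.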
15
2. Alignment: Span Extension
As Maximum Similarity does quite well, let's extend the algorithm

Idea: post-process to extend from points to spans
Retrieve top n (n = 10) most similar paragraphs
Try all (n choose 2) possible spans for alignment
alignment_score(x, y): span_sim combined with ln(span_length)
Slightly favor longer spans
16
2. Alignment: Alignment Correction
Neighboring alignments can help to correct a spurious one

[Figure: three panels (a), (b), (c), each showing slides s_{i-1}, s_i, s_{i+1} aligned against paragraphs p1 ... pn]

(a) monotonic alignment → ok
(b) s_i jumps back from s_{i-1}, but then proceeds monotonically → probably ok, minor penalty
(c) s_i jumps back, but s_{i+1} jumps forward again → looks more like an error, major penalty applied
Final alignment score = alignment_score × (1 - penalty)
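
One possible reading of the three cases, as a sketch: the penalty values and the exact test for each case are assumptions, since the slide gives only "minor" and "major". Here case (c) is taken to mean that the next slide returns to or beyond the previous slide's position.

```python
MINOR_PENALTY = 0.1  # hypothetical value
MAJOR_PENALTY = 0.5  # hypothetical value

def jump_penalty(prev_p, cur_p, next_p):
    """Penalty for slide i given the paragraph indices of slides i-1, i and i+1."""
    if cur_p >= prev_p:
        return 0.0            # (a) monotonic: no penalty
    if next_p >= prev_p:
        return MAJOR_PENALTY  # (c) isolated backward jump: likely an error
    return MINOR_PENALTY      # (b) backward jump that the next slide follows

def final_score(alignment_score, prev_p, cur_p, next_p):
    """Scale the alignment score by (1 - penalty)."""
    return alignment_score * (1 - jump_penalty(prev_p, cur_p, next_p))
```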

17
2. Alignment: Nil classifier
But not all text units should be aligned

Use machine learning (SVM) to learn a binary classifier
Features
Similarity score
Number of words on slide
Few words can indicate figures, pictures with less preference for alignment
Words on slide
Cue phrases: outline, questions, thanks
Alignment path
Jumping alignments (e.g., outline slides)
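
The feature list could be turned into classifier input roughly as follows. This sketch covers feature extraction only (the SVM itself is not shown); the function name, the dictionary keys, and the boolean encoding of the alignment-path feature are hypothetical.

```python
import string

CUE_PHRASES = ("outline", "questions", "thanks")  # cue phrases named on the slide

def nil_features(slide_text, similarity, is_jump):
    """Build a feature dict for deciding whether a slide should stay unaligned."""
    words = [w.strip(string.punctuation) for w in slide_text.lower().split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    return {
        "similarity": similarity,                                # best similarity score
        "num_words": len(words),                                 # few words may indicate a figure
        "has_cue_phrase": any(w in CUE_PHRASES for w in words),  # e.g. a "Thanks" slide
        "is_jump": is_jump,                                      # alignment path jumps here
    }

features = nil_features("Thanks! Questions?", similarity=0.05, is_jump=True)
```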

18
2. Alignment: Evaluation Dataset

Manually compiled alignment dataset by author and fellow researcher
Gold standard: annotate all acceptable spans, or nil
20 presentation and document pairs from databases
Dataset is freely downloadable

19
2. Alignment: Evaluation
Weighted Jaccard Accuracy

40%? Why is it so difficult?
Noise in conversion process. Other studies have used clean data.
Others have used soft accuracy (any overlap is correct)
Use Weighted Jaccard accuracy as metric
Fractional accuracy for partially correct answers
Give false positives (extra spurious alignments) less weight
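
The slides do not define the metric precisely, so the following is only one plausible formulation: a Jaccard-style ratio in which spurious predictions count for less than missed ones. The `fp_weight` value and the treatment of nil alignments are assumptions.

```python
def weighted_jaccard_accuracy(predicted, gold, fp_weight=0.5):
    """Fractional accuracy for one slide's predicted vs. gold paragraph sets.

    fp_weight < 1 down-weights false positives; fp_weight = 1 gives plain Jaccard.
    The default of 0.5 is a hypothetical choice.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # a correctly predicted nil alignment
    hits = len(predicted & gold)
    misses = len(gold - predicted)
    extras = len(predicted - gold)
    denom = hits + misses + fp_weight * extras
    return hits / denom if denom else 0.0
```

With predicted paragraphs {3, 4} and gold {4, 5}, there is one hit, one miss and one extra, giving 1 / (1 + 1 + 0.5) = 0.4 at the default weight.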

20
3. User Interface: Rationale
How might fine-grained aligned pairs be utilized in a large DL?

Coordinated Views
Learning / Comprehension
Summarization
Offline Viewing

Collection Interface
Comparing pairs
Searching for suitable materials

21
3. UI: Coordinated Views
Gallery View
Slideshow View
Full Document View
Document View
Slide View
Print View
Slide centric
Document centric
22
SlideSeer Prototype Demo

Production environment differs from demo

23
3. UI: Collection Interface

Searching
Lucene indexing of the static print view
Show title along with the set of results
Spider-friendly
Main content loaded dynamically by JavaScript, not spiderable
Currently use print view (as it is static) for spiderable interface
URLs
Most material in the form <subject/surname/year/title/view/type?offset>
Implies hierarchy of papers
Constructed URLs to promote browsing access
Simple keyboard shortcuts
For expert user navigation

24
Conclusion

Alignment of documents to presentations
Simple approach works well thus far
Tweaks to get more mileage out of simple
approach
Span alignment, nil alignment modifications
But certainly more models to try!
40% best performance, certainly much room to improve
Deployment status
In Alpha (development)
Beta hopefully in mid 2008
Usability testing underway

Interested in digital anthologies?
Join our mailing list (web dAnth)
Current text extraction project for ACL
Anthology

25
Other slides
26
Future Work

Planning to hook up current work in progress
2-stage CRF/SVM re-ranking citation segmentation algorithm
Automatic keyphrase extraction program
Automatic synthetic image classification
Automatic de-duplication module
Partnering with Simone Teufel (Cambridge U.) to
do argumentative zoning of documents
What is a citation used for?

27
Poor slides