Title: SlideSeer: A DL of aligned document and presentation pairs
Slide 1: SlideSeer: A DL of aligned document and presentation pairs
- Min-Yen Kan
- WING (Web IR / NLP Group), National University of Singapore
Slide 2: Scholarly Digital Libraries: what do we use them for?
- Find articles to print, read offline
- Browse, select research work
- Assess authors, publication venues, research groups
- Papers (documents) don't store all of the information about a discovery
  - Datasets
  - Tools
  - Implementation details / conditions
- They also don't help a person learn the research
  - Textbooks
  - Slide presentations (we'll focus on this)
Slide 3: Qualities of slide presentations
- Good slide sets complement a document. They often:
  - focus and highlight findings in the document
  - create a bridge into the document itself
  - are a visual and oral summary of a document
- How can we leverage slides in a digital library?
What about poor slides?
PowerPoint is presenter-oriented, not content-oriented or audience-oriented. The remedy? Visual reasoning usually works more effectively when the relevant evidence is shown adjacent in space within the eyespan. (Tufte, 2006)
Four score and seven years ago
Slide 4: Documents and presentations as duals
- Present identical or highly overlapping materials
  - Document: for archival and reference purposes
  - Presentation: for introducing and summarizing the work
- As the two can be seen as duals, we should allow them to be viewed together.
- Would like random access of the presentation and document pair
- Answer: find pairs of documents and presentations.
Slide 5: A model: MIT's OpenCourseWare
- A better answer: add fine-grained alignment.
[Figure labels: Audio of lecture; Slides in context; Simplified transcript of lecture]
Slide 6: Talk Outline
- Motivation
- Architecture
  - 1. Resource Discovery
  - 2. Alignment
  - 3. User Interface
- Demo
- Status and Conclusions
Slide 7: 1. Resource Discovery
- Algorithm
  - Obtain suitable document metadata
  - Web search to find candidate presentations
  - Post-process to usable form
Slide 8: 1. Resource Discovery: Obtaining Metadata
- Start with CiteSeer (thanks to IST: C. L. Giles, I. Councill)
  - 750K records with parsed header metadata
  - Complete with .pdf documents
- Enhancement: merge DBLP snapshot (Aug 2006, 1.2M docs) with CiteSeer
  - Large-scale record linkage task; O(nm) complexity unacceptable
  - Indexed DBLP into Lucene, then used each CiteSeer record to retrieve DBLP variants, resulting in O(n) complexity
  - Result size: 1.5M
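The indexing idea can be sketched in a few lines. This is a minimal illustration and not the SlideSeer code: a plain in-memory inverted index stands in for Lucene, the function names (`tokenize`, `build_index`, `candidates`) and the `min_shared` cutoff are hypothetical, and the toy records are made up. Each query touches only records that share tokens with it, so the cost grows with the number of queries rather than with every query-record pair, provided each lookup returns a small candidate set.

```python
from collections import defaultdict


def tokenize(title):
    """Lowercase a title and split it into alphanumeric word tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return cleaned.split()


def build_index(records):
    """Map each token to the set of record ids whose title contains it."""
    index = defaultdict(set)
    for rec_id, title in records.items():
        for tok in set(tokenize(title)):
            index[tok].add(rec_id)
    return index


def candidates(index, title, min_shared=2):
    """Return ids of records sharing at least min_shared tokens with the query.

    min_shared is a hypothetical cutoff chosen for illustration only.
    """
    counts = defaultdict(int)
    for tok in set(tokenize(title)):
        for rec_id in index.get(tok, ()):
            counts[rec_id] += 1
    return {rec_id for rec_id, shared in counts.items() if shared >= min_shared}


# Hypothetical toy records standing in for DBLP entries.
dblp = {
    "d1": "SlideSeer: A Digital Library of Aligned Document and Presentation Pairs",
    "d2": "Fast Record Linkage for Large Bibliographies",
}
index = build_index(dblp)
matches = candidates(
    index, "Slideseer a digital library of aligned document and presentation pairs"
)
print(matches)
```

The retrieved candidates would still need a finer comparison before being merged; the index only narrows down which pairs are worth comparing.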
Slide 9: 1. Resource Discovery: Finding presentations
- Google API on title, author to find corresponding presentation
- Use simple Jaccard similarity threshold to decide matches
  - threshold θ3 for title+author similarity
[Figure labels: CiteSeer; DBLP merge; Presentations; DBLP Lucene index; Web (filetype:ppt); thresholds θ1, θ2, θ3]
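A Jaccard match decision of the kind described on this slide might look like the sketch below. The tokenization (lowercased whitespace split), the `is_match` helper, and the 0.5 threshold are assumptions for illustration; the slide does not give the actual threshold values.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def is_match(record_text, candidate_text, threshold=0.5):
    """Accept a candidate when token overlap reaches the threshold.

    The 0.5 default is an arbitrary illustrative value, not the deck's setting.
    """
    score = jaccard(record_text.lower().split(), candidate_text.lower().split())
    return score >= threshold


# Hypothetical title+author strings for a record and two search hits.
record = "slideseer aligned document presentation pairs kan"
good = "slideseer aligned document and presentation pairs min-yen kan"
bad = "introduction to operating systems lecture 3"

print(is_match(record, good), is_match(record, bad))
```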
Slide 10: 1. Resource Discovery: Conversion
- Final results
  - 85% precision, recall difficult to calculate (80%)
  - 11K pairs after processing 200K of 1.5M records
- Many caveats
  - only .pdf and .ppt formats currently handled
  - conversion fails often, pdf conversion difficult
  - current work: use OCR to redo text extraction
- Via pdftohtml
  - text
  - formatted text
- Via czppt2gif/convert
  - png
  - text
Slide 11: 2. Alignment: Problem formulation
- Q: What are we aligning?
- A: Text of slides to document text
  - Use paragraphs to delimit text units in documents
  - Use document headers to delimit sections
- Q: What type of alignment is necessary?
- A: Depends. Presentation-centered or document-centered view?
  - Presentation: 1 slide aligned to 0 or more paragraphs
  - Document: 1 section aligned to 0 or more slides
- Q: What's the approach?
- A: Two stages
  - Basic similarity measure to calculate a similarity matrix
  - Alignment schemes to establish alignment mapping
[Figure: similarity matrix of text units (1 to p) against slides (1 to s); callout: Concentrate on this]
Slide 12: 2. Alignment: Related Work
- Narration to presentation alignment
  - Usually naturally synchronous: monotonic alignment
- Multilingual text alignment
  - Used in Machine Translation (MT)
  - Polynomial complexity (O(n^3)), but heuristics tend to work well
- Slide/abstract to document alignment
  - Use Hidden Markov Model (HMM) for alignment
  - Doesn't handle missing materials well
- Desiderata
  - Should take context into account
  - But shouldn't enforce monotonicity
  - Nil (zero) alignments needed when materials don't overlap
Slide 13: 2. Alignment: Similarity Measures
- Take text units, cut into tokens. Then calculate similarity using:
- Cosine
  - Standard IR metric
  - TF-IDF for token weight
  - Calculate slide, paragraph vector similarity using cosine
- Jaccard
  - unigram tokens
  - bigram
  - unigram + bigram
  - Use IDF weighting for tokens.
- For both schemes, use IDF weighting from WebBase corpus
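Both measures can be sketched compactly. The following is an illustration under stated assumptions, not the deck's implementation: the smoothed IDF formula, the default IDF for unseen tokens, the reading of "IDF-weighted Jaccard" as summed IDF over the intersection divided by summed IDF over the union, and the tiny background corpus (standing in for WebBase) are all hypothetical choices.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Smoothed IDF per token from a list of tokenized documents."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def tfidf(tokens, idf, default_idf=1.0):
    """Sparse TF-IDF vector as a dict from token to weight."""
    return {tok: tf * idf.get(tok, default_idf) for tok, tf in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weighted_jaccard(a, b, idf, default_idf=1.0):
    """IDF mass of shared tokens over IDF mass of all tokens (assumed form)."""
    set_a, set_b = set(a), set(b)
    union = sum(idf.get(t, default_idf) for t in set_a | set_b)
    inter = sum(idf.get(t, default_idf) for t in set_a & set_b)
    return inter / union if union else 0.0


def bigrams(tokens):
    """Adjacent token pairs, usable as Jaccard units alongside unigrams."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical background corpus standing in for WebBase.
background = [
    ["the", "alignment", "of", "slides"],
    ["the", "digital", "library"],
    ["the", "similarity", "matrix"],
]
idf = idf_table(background)
slide = ["alignment", "of", "slides"]
para = ["the", "alignment", "of", "slides", "to", "paragraphs"]
other = ["digital", "library"]

print(cosine(tfidf(slide, idf), tfidf(para, idf)))
print(weighted_jaccard(slide, para, idf))
```

Computing either measure for every slide and paragraph pair fills the similarity matrix that the alignment schemes consume.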
Slide 14: 2. Alignment: Schemes
Using matrix of <p,s> similarity, align using:
- 1. Max Similarity
  - Baseline
  - Can't do nil alignment
- 2. Edit Distance
  - Efficient dynamic programming
  - But outputs only monotonic alignments
- 3. Local Jump Model
  - Variation on 2 to allow local backward jumps
  - Backward jumps within 5 of text units
  - Still doesn't handle reordered sections
- 4. Hidden Markov Model
  - Word-based
  - Attempts to find origin of s in p
  - Only handles overlapping information
[Figure labels: s_{i-5} ... s_{i-1}, s_i, s_{i+1} ... s_{i+5}; w_{j-5} ... w_{j-1}, w_{j+1} ... w_{j+5}; p1 > p2 > p3 > p4 > p5 > p6]
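The contrast between the first two schemes can be shown on a toy similarity matrix. This is a sketch only: `max_similarity` picks each slide's best paragraph independently, while `monotonic_alignment` is a simplified dynamic program that forces paragraph indices to be non-decreasing. It stands in for the edit-distance scheme and is not the formulation used in the talk; the matrix values are invented.

```python
def max_similarity(sim):
    """For each slide (row), pick the paragraph (column) with the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def monotonic_alignment(sim):
    """Best assignment with non-decreasing paragraph indices, by dynamic programming."""
    n_slides, n_paras = len(sim), len(sim[0])
    # best[i][j]: best total score for slides 0..i with slide i placed on paragraph j
    best = [[0.0] * n_paras for _ in range(n_slides)]
    back = [[0] * n_paras for _ in range(n_slides)]
    best[0] = list(sim[0])
    for i in range(1, n_slides):
        run_best, run_arg = float("-inf"), 0
        for j in range(n_paras):
            # Track the best predecessor among paragraphs 0..j
            if best[i - 1][j] > run_best:
                run_best, run_arg = best[i - 1][j], j
            best[i][j] = run_best + sim[i][j]
            back[i][j] = run_arg
    # Trace back from the best final cell
    j = max(range(n_paras), key=best[-1].__getitem__)
    path = [j]
    for i in range(n_slides - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]


# Invented matrix: 3 slides (rows) by 4 paragraphs (columns).
sim = [
    [0.1, 0.2, 0.9, 0.1],
    [0.8, 0.1, 0.3, 0.2],
    [0.1, 0.1, 0.2, 0.7],
]
print(max_similarity(sim))       # slide 1 jumps back to paragraph 0
print(monotonic_alignment(sim))  # slide 1 is held at paragraph 2 instead
```

Neither function can return a nil alignment, which is one reason a separate nil decision is needed.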
Slide 15: 2. Alignment: Span Extension
As Maximum Similarity does quite well, let's extend the algorithm
- Idea: post-process to extend from points to spans
- Retrieve top n (n = 10) most similar paragraphs
- Try all (n) possible spans for alignment
- alignment_score (x,y) span_sim ln(span_length)
2
Slightly favor longer spans
Slide 16: 2. Alignment: Alignment Correction
Neighboring alignments can help to correct a spurious one
[Figure: three panels (a), (b), (c), each showing slides s_{i-1}, s_i, s_{i+1} aligned against paragraphs p1 to pn]
- (a) monotonic alignment → ok
- (b) s_i jumps back from s_{i-1}, but then proceeds monotonically → probably ok, minor penalty
- (c) s_i jumps back, but s_{i+1} jumps back forward → looks more like an error, major penalty applied
- Final alignment score = alignment_score × (1 − penalty)
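The three cases can be turned into a small rule, sketched here under several assumptions: the penalty values are hypothetical, the test used to tell case (b) from case (c) (whether the next slide lands at or past the previous slide's paragraph) is one possible reading of the figure, and the final score is taken to be the alignment score multiplied by (1 - penalty).

```python
MINOR_PENALTY = 0.1  # hypothetical value for case (b)
MAJOR_PENALTY = 0.5  # hypothetical value for case (c)


def jump_penalty(prev_p, cur_p, next_p):
    """Penalty for slide i given paragraph indices of slides i-1, i, i+1."""
    if cur_p >= prev_p:
        return 0.0  # (a) monotonic: no penalty
    if next_p >= prev_p:
        # (c) the next slide resumes at or past the earlier position,
        # so the backward jump looks like a one-off error
        return MAJOR_PENALTY
    # (b) backward jump, and the next slide stays in the new region
    return MINOR_PENALTY


def corrected_score(alignment_score, penalty):
    """Assumed form of the final score: alignment_score * (1 - penalty)."""
    return alignment_score * (1.0 - penalty)


print(jump_penalty(3, 4, 5))  # case (a)
print(jump_penalty(8, 2, 3))  # case (b)
print(jump_penalty(8, 2, 9))  # case (c)
print(corrected_score(0.8, jump_penalty(8, 2, 9)))
```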
Slide 17: 2. Alignment: Nil classifier
But not all text units should be aligned
- Use machine learning (SVM) to learn a binary classifier
- Features
  - Similarity score
  - Number of words on slide
    - Few words can indicate figures, pictures with less preference for alignment
  - Words on slide
    - Cue phrases: outline, questions, thanks
  - Alignment path
    - Jumping alignments (e.g., outline slides)
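A feature extractor along these lines could feed such a classifier. This sketch covers only feature construction, not SVM training; the function name `nil_features`, the feature names, and the use of the absolute jump distance as the alignment-path feature are assumptions. The cue phrases are the three named on the slide.

```python
CUE_PHRASES = ("outline", "questions", "thanks")  # cue phrases named on the slide


def nil_features(slide_text, best_similarity, prev_p, cur_p):
    """Build one feature dict for a slide (feature set is an illustrative guess).

    best_similarity: the slide's top similarity score against any paragraph.
    prev_p, cur_p: paragraph indices aligned to the previous and current slide.
    """
    words = slide_text.lower().split()
    return {
        "similarity": best_similarity,
        "num_words": len(words),
        # Substring test so that "questions?" still counts as a cue phrase
        "has_cue_phrase": any(cue in w for w in words for cue in CUE_PHRASES),
        "jump": abs(cur_p - prev_p),
    }


closing = nil_features("Questions? Thanks!", 0.05, 10, 2)
content = nil_features("We align slides to paragraphs", 0.6, 3, 4)
print(closing)
print(content)
```

A short slide with a cue phrase, low similarity, and a large jump is the kind of example one would expect a trained classifier to label as nil.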
Slide 18: 2. Alignment: Evaluation Dataset
- Manually compiled alignment dataset by author and fellow researcher
  - Gold standard: annotate all acceptable spans, or nil
- 20 presentation and document pairs from databases
- Dataset is freely downloadable
Slide 19: 2. Alignment: Evaluation
[Figure: Weighted Jaccard Accuracy]
- 40%? Why is it so difficult?
  - Noise in conversion process. Other studies have used clean data.
  - Others have used soft accuracy (any overlap is correct)
- Use Weighted Jaccard accuracy as metric
  - Fractional accuracy for partially correct answers
  - Give false positives (extra spurious alignments) less weight
Slide 20: 3. User Interface: Rationale
How might fine-grained aligned pairs be utilized in a large DL?
- Coordinated Views
  - Learning / Comprehension
  - Summarization
  - Offline Viewing
- Collection Interface
  - Comparing pairs
  - Searching for suitable materials
Slide 21: 3. UI: Coordinated Views
[Figure labels: Gallery View; Slideshow View; Full Document View; Document View; Slide View; Print View; Slide centric; Document centric]
Slide 22: SlideSeer Prototype Demo
- Production environment differs from demo
Slide 23: 3. UI: Collection Interface
- Searching
  - Lucene indexing of the static print view
  - Show title along with the set of results
- Spider-friendly
  - Main content loaded dynamically by JavaScript, not spiderable
  - Currently use print view (as it is static) for spiderable interface
- URLs
  - Most material in the form <subject/surname/year/title/view/type?offset>
  - Implies hierarchy of papers
  - Constructed URLs to promote browsing access
- Simple keyboard shortcuts
  - For expert user navigation
Slide 24: Conclusion
- Alignment of documents to presentations
  - Simple approach works well thus far
  - Tweaks to get more mileage out of simple approach
    - Span alignment, nil alignment modifications
  - But certainly more models to try!
    - 40% best performance, certainly much room to improve
- Deployment status
  - In Alpha (development)
  - Beta hopefully in mid 2008
  - Usability testing underway
- Interested in digital anthologies?
  - Join our mailing list (web dAnth)
  - Current text extraction project for ACL Anthology
Slide 25: Other slides
Slide 26: Future Work
- Planning to hook up current work in progress
  - 2-stage CRF/SVM re-ranking citation segmentation algorithm
  - Automatic keyphrase extraction program
  - Automatic synthetic image classification
  - Automatic de-duplication module
- Partnering with Simone Teufel (Cambridge U.) to do argumentative zoning of documents
  - What is a citation used for?
Slide 27: Poor slides
- Often represent a biased view of the full results
  - Cherry picking evidence to support claims
  - Imply that evidence is independent (when it is statistically correlated)
- May summarize other findings inaccurately (secondary or tertiary sources)