SlideSeer: A DL of aligned document and presentation pairs

1
SlideSeer: A DL of aligned document and
presentation pairs
  • Min-Yen Kan
  • WING (Web IR / NLP Group), National University of
    Singapore

2
Scholarly Digital Libraries: what do we use them
for?
  • Find articles to print, read offline
  • Browse, select research work
  • Assess authors, publication venues, research
    groups
  • Papers (documents) don't store all of the
    information about a discovery:
  • Datasets
  • Tools
  • Implementation details / conditions
  • They also don't help a person learn the research:
  • Textbooks
  • Slide presentations (we'll focus on this)

3
Qualities of slide presentations
  • Good slide sets complement a document. They
    often:
  • focus and highlight findings in the document
  • create a bridge into the document itself
  • are a visual and oral summary of a document
  • How can we leverage slides in a digital library?

What about poor slides?
PowerPoint is presenter-oriented, not
content-oriented or audience-oriented. The
remedy? Visual reasoning usually works more
effectively when the relevant evidence is shown
adjacent in space within the eyespan. (Tufte,
2006)
Four score and seven years ago
4
Documents and presentations as duals
  • Present identical or highly overlapping materials
  • Document: for archival and reference purposes
  • Presentation: for introducing and summarizing
    the work
  • As the two can be seen as duals, we should allow
    them to be viewed together.
  • Would like random access of the presentation and
    document pair
  • Answer: find pairs of documents and
    presentations.

5
A model: MIT's OpenCourseWare
  • A better answer: add fine-grained alignment.

[Figure: lecture page combining audio of lecture, slides in context, and a simplified transcript of lecture]
6
Talk Outline
  • Motivation
  • Architecture
  • 1. Resource Discovery
  • 2. Alignment
  • 3. User Interface
  • Demo
  • Status and Conclusions

7
1. Resource Discovery
  • Algorithm:
  • Obtain suitable document metadata
  • Web search to find candidate presentations
  • Post-process to usable form

8
1. Resource Discovery: Obtaining Metadata
  • Start with CiteSeer (thanks to IST: C. L. Giles, I.
    Councill)
  • 750K records with parsed header metadata
  • Complete with .pdf documents
  • Enhancement: merge DBLP snapshot (Aug 2006, 1.2M
    docs) with CiteSeer
  • Large-scale record linkage task; O(nm)
    complexity unacceptable
  • Indexed DBLP into Lucene; use each CiteSeer record
    to retrieve DBLP variants, resulting in O(n)
    complexity
  • Result size: 1.5M
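
The index-then-retrieve idea above can be sketched in a few lines. This is a minimal illustration, not the SlideSeer implementation: a plain in-memory inverted index stands in for Lucene, and the function names, the `min_shared` cutoff, and the toy titles are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokens(title):
    """Lowercased alphanumeric tokens of a title, as a set."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def build_index(dblp_titles):
    """Inverted index: token -> ids of DBLP records containing it.

    An in-memory stand-in for the Lucene index mentioned on the slide.
    """
    index = defaultdict(set)
    for rec_id, title in enumerate(dblp_titles):
        for tok in tokens(title):
            index[tok].add(rec_id)
    return index


def candidates(index, citeseer_title, min_shared=2):
    """DBLP ids sharing at least min_shared tokens with the query title.

    min_shared is a hypothetical cutoff; only records that share a token
    are ever touched, so no full pairwise comparison is needed.
    """
    counts = defaultdict(int)
    for tok in tokens(citeseer_title):
        for rec_id in index.get(tok, ()):
            counts[rec_id] += 1
    return sorted(r for r, c in counts.items() if c >= min_shared)


# Toy data, invented for illustration only.
dblp = [
    "Linking slides to scholarly papers",
    "Fast record linkage for bibliographic data",
    "Scholarly search engines revisited",
]
index = build_index(dblp)
hits = candidates(index, "Linking Slides to Scholarly Papers.")
```

Each CiteSeer record triggers one lookup per token rather than a comparison against every DBLP record, which is where the drop from O(nm) comes from.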

9
1. Resource Discovery: Finding presentations
  • Google API on title, author to find corresponding
    presentation
  • Use simple Jaccard similarity threshold to
    decide matches
  • threshold 3 (of three) for title+author similarity
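
A thresholded Jaccard match of this kind might look like the sketch below. The tokenization, the threshold value of 0.5, and the record layout are assumptions for illustration; the slides do not give the actual values.

```python
import re


def token_set(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over token sets."""
    sa, sb = token_set(a), token_set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def is_match(record, candidate_text, threshold=0.5):
    """Accept a candidate presentation if title+author overlap is high enough.

    threshold=0.5 is a hypothetical value chosen for the example.
    """
    query = record["title"] + " " + record["author"]
    return jaccard(query, candidate_text) >= threshold


# Toy record and candidates, invented for illustration.
record = {"title": "Aligning slides to papers", "author": "A. Author"}
good = "Aligning Slides to Papers - A. Author"
bad = "Course schedule and lecture slides"
```

Here `good` shares every token with the query, while `bad` shares only "slides", so only the first would pass the threshold.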

[Figure: pipeline linking CiteSeer, the DBLP merge (DBLP Lucene index), a Web search with filetype:ppt, and the resulting presentations, with three thresholds labeled 1 to 3]
10
1. Resource Discovery: Conversion
  • Final results:
  • 85% precision; recall difficult to calculate
    (80%)
  • 11K pairs after processing 200K of 1.5M records
  • Many caveats:
  • only .pdf and .ppt formats currently handled
  • conversion fails often; pdf conversion difficult
  • current work: use OCR to redo text extraction
  • Via pdftohtml: text, formatted text
  • Via czppt2gif/convert: png, text

11
2. Alignment: Problem formulation
  • Q: What are we aligning?
  • A: Text of slides to document text
  • Use paragraphs to delimit text units in
    documents
  • Use document headers to delimit sections
  • Q: What type of alignment is necessary?
  • A: Depends. Presentation- or document-centered
    view?
  • Presentation: 1 slide aligned to 0 or more
    paragraphs
  • Document: 1 section aligned to 0 or more slides
  • Q: What's the approach?
  • A: Two stages
  • Basic similarity measure to calculate a
    similarity matrix
  • Alignment schemes to establish alignment mapping

[Figure: similarity matrix of text units 1..p against slides 1..s; callout: "Concentrate on this"]
12
2. Alignment: Related Work
  • Narration to presentation alignment
  • Usually naturally synchronous: monotonic
    alignment
  • Multilingual text alignment
  • Used in Machine Translation (MT)
  • Polynomial complexity (O(n^3)) but heuristics
    tend to work well
  • Slide/abstract to document alignment
  • Use Hidden Markov Model (HMM) for alignment
  • Doesn't handle missing materials well.
  • Desiderata:
  • Should take context into account
  • But shouldn't enforce monotonicity
  • Nil (zero) alignments needed, when materials
    don't overlap

13
2. Alignment: Similarity Measures
  • Take text units, cut into tokens. Then calculate
    similarity using:
  • Cosine
  • Standard IR metric
  • TF-IDF for token weight
  • Calculate slide, paragraph vector similarity
    using cosine
  • Jaccard
  • unigram tokens
  • bigram
  • unigram + bigram
  • Use IDF weighting for tokens.
  • For both schemes, use IDF weighting from WebBase
    corpus
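
Both measures can be sketched over unigram tokens as follows. This is an illustrative sketch under stated assumptions: IDF is estimated from the texts at hand with a smoothed formula rather than from the WebBase corpus, bigrams are omitted, and all function names are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased alphanumeric unigram tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(documents):
    """Smoothed IDF estimated from the given texts (assumption: the
    slides take IDF from the WebBase corpus instead)."""
    n = len(documents)
    df = Counter(tok for doc in documents for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def cosine(a, b, idf):
    """Cosine similarity between TF-IDF vectors of two texts."""
    va = {t: c * idf.get(t, 1.0) for t, c in Counter(tokenize(a)).items()}
    vb = {t: c * idf.get(t, 1.0) for t, c in Counter(tokenize(b)).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def weighted_jaccard(a, b, idf):
    """Jaccard over token sets, each token counted with its IDF weight."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sum(idf.get(t, 1.0) for t in sa | sb)
    inter = sum(idf.get(t, 1.0) for t in sa & sb)
    return inter / union if union else 0.0


def similarity_matrix(slides, paragraphs, measure, idf):
    """Rows are slides, columns are paragraphs."""
    return [[measure(s, p, idf) for p in paragraphs] for s in slides]


# Toy texts, invented for illustration.
paragraphs = [
    "We index records with an inverted index.",
    "Alignment uses a similarity matrix.",
    "Thanks to our sponsors.",
]
slides = ["Similarity matrix for alignment", "Inverted index of records"]
idf = idf_table(paragraphs + slides)
cos_matrix = similarity_matrix(slides, paragraphs, cosine, idf)
jac_matrix = similarity_matrix(slides, paragraphs, weighted_jaccard, idf)
```

Either matrix can then be handed to an alignment scheme; the measure is just a parameter.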

14
2. Alignment - Schemes
Using matrix of <p,s> similarity, align using:
  • 1. Max Similarity
  • Baseline
  • Can't do nil alignment
  • 2. Edit Distance
  • Efficient dynamic programming
  • But outputs only monotonic alignments
  • 3. Local Jump Model
  • Variation on 2 to allow local backward jumps
  • Backward jumps within 5 of text units
  • Still doesn't handle reordered sections
  • 4. Hidden Markov Model
  • Word-based
  • Attempts to find origin of s in p
  • Only handles overlapping information
[Figure: context windows si-5 ... si-1, si, si+1 ... si+5 and wj-5 ... wj-1, wj+1 ... wj+5; paragraph sequence p1 > p2 > p3 > p4 > p5 > p6]
15
2. Alignment: Span Extension
As Maximum Similarity does quite well, let's
extend the algorithm
  • Idea: post-process to extend from points to spans
  • Retrieve top n (n = 10) most similar paragraphs
  • Try all possible spans among them for alignment
  • alignment_score(x, y) combines span_sim with
    ln(span_length), slightly favoring longer spans
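
The span-extension step can be sketched as below. The exact scoring formula is not recoverable from the slide, so the combination used here (mean similarity over the span plus a small weighted log-length bonus) is an assumption, as are the `length_weight` value and the choice to bound spans by pairs of top-n paragraphs.

```python
import math


def best_span(sim_row, n=10, length_weight=0.1):
    """Pick the best contiguous paragraph span for one slide.

    sim_row[j] is the slide's similarity to paragraph j. Candidate spans
    start and end at paragraphs among the top n by similarity.
    """
    top = sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)[:n]
    top.sort()
    best, best_score = None, float("-inf")
    for a in range(len(top)):
        for b in range(a, len(top)):
            start, end = top[a], top[b]
            length = end - start + 1
            span_sim = sum(sim_row[start:end + 1]) / length
            # Assumed scoring: a small log bonus slightly favors longer spans.
            score = span_sim + length_weight * math.log(length)
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Toy row: paragraphs 1 and 2 are both strong matches.
row = [0.05, 0.6, 0.55, 0.05, 0.1]
span = best_span(row, n=3)
```

With a small bonus the two adjacent strong paragraphs are merged into one span, but the weak paragraph at index 4 is not pulled in.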
16
2. Alignment: Alignment Correction
Neighboring alignments can help to correct a
spurious one
[Figure: three panels (a), (b), (c), each aligning slides si-1, si, si+1 against paragraphs p1 ... pn]
  • (a) monotonic alignment → ok
  • (b) si jumps back from si-1, but then proceeds
    monotonically → probably ok, minor penalty
  • (c) si jumps back, but si+1 jumps forward again
    → looks more like an error, major penalty applied
  • Final alignment score = alignment_score ×
    (1 - penalty)
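
The three cases can be expressed as a small penalty function. The penalty values and the exact test that separates case (b) from case (c) are assumptions made for illustration; the slides only state that one penalty is minor and the other major.

```python
MINOR_PENALTY = 0.1  # hypothetical value for case (b)
MAJOR_PENALTY = 0.5  # hypothetical value for case (c)


def jump_penalty(prev_p, cur_p, next_p):
    """Penalty for slide i given the paragraphs aligned to slides i-1, i, i+1."""
    if cur_p >= prev_p:
        return 0.0                 # (a) monotonic step
    if next_p >= prev_p:
        return MAJOR_PENALTY       # (c) isolated backward jump, then forward again
    return MINOR_PENALTY           # (b) and any other backward jump


def corrected_score(alignment_score, prev_p, cur_p, next_p):
    """Final score = alignment_score * (1 - penalty)."""
    return alignment_score * (1.0 - jump_penalty(prev_p, cur_p, next_p))
```

For example, `corrected_score(0.8, 5, 2, 6)` treats the jump from paragraph 5 back to 2 and forward to 6 as case (c) and halves the score under these assumed values.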

17
2. Alignment: Nil classifier
But not all text units should be aligned
  • Use machine learning (SVM) to learn a binary
    classifier
  • Features:
  • Similarity score
  • Number of words on slide
  • Few words can indicate figures, pictures with
    less preference for alignment
  • Words on slide
  • Cue phrases: outline, questions, thanks
  • Alignment path
  • Jumping alignments (e.g., outline slides)
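
A feature extractor along these lines could feed such a classifier. Only the feature categories come from the slide; the feature names, the jump-size encoding of the alignment path, and the function signature are assumptions, and the SVM training step itself is left out.

```python
import re

# Cue phrases listed on the slide.
CUE_PHRASES = ("outline", "questions", "thanks")


def nil_features(slide_text, best_similarity, prev_unit, aligned_unit):
    """Feature dict for deciding whether a slide should get a nil alignment.

    prev_unit / aligned_unit are the text-unit indices chosen for the
    previous slide and this slide (hypothetical encoding of the path).
    """
    words = re.findall(r"[a-z0-9']+", slide_text.lower())
    return {
        "similarity": best_similarity,
        "num_words": len(words),
        "has_cue_phrase": int(any(w in CUE_PHRASES for w in words)),
        "jump": abs(aligned_unit - prev_unit),
    }


features = nil_features("Thanks! Questions?", 0.05, prev_unit=40, aligned_unit=3)
```

A closing slide like this one yields low similarity, few words, a cue phrase, and a large jump, which is the kind of pattern a binary classifier could learn to map to nil.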

18
2. Alignment: Evaluation Dataset
  • Manually compiled alignment dataset by author
    and fellow researcher
  • Gold standard: annotate all acceptable spans, or
    nil
  • 20 presentation and document pairs from databases
  • Dataset is freely downloadable

19
2. Alignment: Evaluation
Weighted Jaccard Accuracy
  • 40%? Why is it so difficult?
  • Noise in conversion process. Other studies have
    used clean data.
  • Others have used soft accuracy (any overlap is
    correct)
  • Use Weighted Jaccard accuracy as metric
  • Fractional accuracy for partially correct
    answers
  • Give false positives (extra spurious alignments)
    less weight
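
One way such a metric could be defined is sketched below. The slides do not give the formula, so this is an assumed form: a Jaccard ratio over paragraph sets in which spurious paragraphs count with a reduced weight (`fp_weight`, a hypothetical parameter), and a single gold span stands in for the full set of acceptable spans.

```python
def weighted_jaccard_accuracy(predicted, gold, fp_weight=0.5):
    """Fractional accuracy of one slide's predicted paragraphs.

    predicted, gold: iterables of paragraph indices; empty means nil.
    Spurious paragraphs (false positives) count fp_weight instead of 1.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # correctly predicted nil alignment
    hits = len(predicted & gold)
    missed = len(gold - predicted)
    spurious = len(predicted - gold)
    denom = hits + missed + fp_weight * spurious
    return hits / denom if denom else 0.0


def soft_accuracy(predicted, gold):
    """Soft accuracy: any overlap (or a matching nil) counts as correct."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    return 1.0 if predicted & gold else 0.0


partial = weighted_jaccard_accuracy({3, 4, 5}, {4, 5, 6})
```

For the partially correct prediction above, soft accuracy would give full credit, while this weighted form gives 2 / 3.5, roughly 0.57.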

20
3. User Interface: Rationale
How might fine-grained aligned pairs be utilized
in a large DL?
  • Coordinated Views
  • Learning / Comprehension
  • Summarization
  • Offline Viewing
  • Collection Interface
  • Comparing pairs
  • Searching for suitable materials

21
3. UI: Coordinated Views
[Figure: coordinated views (Gallery View, Slideshow View, Full Document View, Document View, Slide View, Print View), grouped as slide-centric or document-centric]
22
SlideSeer Prototype Demo
  • Production environment differs from demo

23
3. UI: Collection Interface
  • Searching
  • Lucene indexing of the static print view
  • Show title along with the set of results
  • Spider-friendly
  • Main content loaded dynamically by JavaScript,
    not spiderable
  • Currently use print view (as it is static) for
    spiderable interface
  • URLs
  • Most material in the form
    <subject/surname/year/title/view/type?offset>
  • Implies hierarchy of papers
  • Constructed URLs to promote browsing access
  • Simple keyboard shortcuts
  • For expert user navigation

24
Conclusion
  • Alignment of documents to presentations
  • Simple approach works well thus far
  • Tweaks to get more mileage out of simple
    approach
  • Span alignment, nil alignment modifications
  • But certainly more models to try!
  • 40% best performance, certainly much room to
    improve
  • Deployment status
  • In Alpha (development)
  • Beta hopefully in mid 2008
  • Usability testing underway
  • Interested in digital anthologies?
  • Join our mailing list (web dAnth)
  • Current text extraction project for ACL
    Anthology

25
Other slides
26
Future Work
  • Planning to hook up current work in progress
  • 2-stage CRF/SVM re-ranking citation segmentation
    algorithm
  • Automatic keyphrase extraction program
  • Automatic synthetic image classification
  • Automatic de-duplication module
  • Partnering with Simone Teufel (Cambridge U.) to
    do argumentative zoning of documents
  • What is a citation used for?

27
Poor slides
  • Often represent a biased view of the full
    results
  • Cherry picking evidence to support claims
  • Imply that evidence is independent (when it is
    statistically correlated)
  • May summarize other findings inaccurately
    (secondary or tertiary sources)