Automatic Timeline Generation

Provided by: JoshT79
Learn more at: http://web.stanford.edu

1
Automatic Timeline Generation
  • Jessica Jenkins
  • Josh Taylor
  • CS 276b

2
Corpus
  • Subset of TDT-3 text news articles
  • 875 articles
  • 49 topics (collections of related articles)
  • 6-30 articles per topic
  • Sources: ABC, APW, CNN, NBC, NYT, VOA
  • Focus on sentence-level event detection within a
    topic

3
Labeling / Data Processing
  • Manual annotation for evaluation
  • 10 topics, 120 articles
  • Decide on a set of events for each topic
  • Annotate each sentence in each article with a set
    of relevant topic events
  • Sentence boundary detection
      • MXTerminator maximum entropy classifier (Reynar
        and Ratnaparkhi)

4
Data Processing
  • Sentences tokenized with case-folding,
    punctuation stripping, Porter stemming, and
    English stop-word removal
  • Part-of-speech tagging
      • Stanford Log-linear PoS Tagger (Toutanova and
        Manning)
  • Noun phrase chunking
      • BaseNP chunking (Ramshaw and Marcus)
  • Of the terms outside noun phrases, only verbs
    were retained
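
The token-level steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop list is a tiny hypothetical sample, and stemming is left as an optional hook (`stem`) because Porter stemming needs an external library; NLTK's `PorterStemmer().stem` is one function that could be passed in.

```python
import string

# Tiny illustrative stop list (an assumption); a real run would use a fuller English list.
STOP_WORDS = {"a", "an", "and", "but", "the", "of", "to", "in", "on", "for", "was", "by", "is"}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def tokenize(sentence, stem=None):
    """Case-fold, strip punctuation, drop stop words, then optionally stem."""
    folded = sentence.lower()
    stripped = folded.translate(_STRIP_PUNCT)
    tokens = [t for t in stripped.split() if t not in STOP_WORDS]
    if stem is not None:
        # `stem` is any callable mapping a token to its stem.
        tokens = [stem(t) for t in tokens]
    return tokens


tokens = tokenize("But Pakistan was never refunded for the F-16s.")
```

Note that stripping punctuation before splitting fuses hyphenated forms, so "F-16s" becomes the single token `f16s`.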

5
Evaluation
  • Precision and recall analogs
  • Defined relative to set of annotated events
  • NU-precision: count of sentences labeled with
    novel events (relative to higher-ranked
    sentences) over number of selected sentences
  • NU-recall: number of events in selected sentences
    over number of events in topic
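
One way to read the two definitions above is sketched below. The function name and the input layout (one set of annotated event labels per selected sentence, in rank order) are assumptions made for illustration.

```python
def nu_metrics(ranked_events, topic_events):
    """Compute (NU-precision, NU-recall) for a ranked list of selected sentences.

    ranked_events: list of sets of event labels, one set per selected
                   sentence, highest-ranked sentence first.
    topic_events:  set of all annotated events for the topic.
    """
    seen = set()          # events already covered by higher-ranked sentences
    novel_sentences = 0   # sentences that contribute at least one new event
    for events in ranked_events:
        relevant = set(events) & topic_events
        if relevant - seen:
            novel_sentences += 1
        seen |= relevant
    precision = novel_sentences / len(ranked_events) if ranked_events else 0.0
    recall = len(seen) / len(topic_events) if topic_events else 0.0
    return precision, recall


# Hypothetical annotations: four selected sentences, four events in the topic.
precision, recall = nu_metrics([{"A"}, {"A"}, {"B", "C"}, set()], {"A", "B", "C", "D"})
```

In this hypothetical case two of the four sentences introduce a new event (NU-precision 0.5), and three of the four topic events are covered (NU-recall 0.75).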

6
Event Selection Methods
  • Language model-based sentence scoring (Allan,
    Gupta, Khandelwal)
  • Useful score: measure of how likely a sentence
    is to be on-topic.
  • Novel score: indication that sentence content
    is unlike previous sentences
  • Sentences ranked using a weighted blend of their
    Useful and Novel scores
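
The weighted blend can be sketched as below. The slides do not give the scoring formulas, so this example makes two loud assumptions: Useful scores are supplied precomputed on a 0 to 1 scale, and novelty is approximated by a simple stand-in (1 minus the highest Jaccard overlap with any earlier sentence) rather than the language-model score of Allan, Gupta, and Khandelwal. The `weight` parameter is likewise hypothetical.

```python
def novelty(tokens, earlier):
    """Stand-in Novel score: 1 minus the highest Jaccard overlap with any earlier sentence."""
    current = set(tokens)
    best = 0.0
    for prev in earlier:
        prev_set = set(prev)
        union = current | prev_set
        if union:
            best = max(best, len(current & prev_set) / len(union))
    return 1.0 - best


def rank_sentences(sentences, useful_scores, weight=0.5):
    """Return sentence indices ordered by weight*useful + (1 - weight)*novel, best first.

    sentences:     token lists in their original (document) order.
    useful_scores: precomputed Useful scores in [0, 1], one per sentence (assumed input).
    """
    scored = []
    for i, tokens in enumerate(sentences):
        novel = novelty(tokens, sentences[:i])
        scored.append((weight * useful_scores[i] + (1.0 - weight) * novel, i))
    # Sort by descending score; ties fall back to original order.
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


sentences = [
    ["pakistan", "paid", "f16"],
    ["pakistan", "paid", "f16"],   # exact repeat, so its novelty is 0
    ["clinton", "met", "sharif"],
]
order = rank_sentences(sentences, useful_scores=[0.9, 0.9, 0.6])
```

With equal weighting, the repeated sentence drops to the bottom of the ranking even though its Useful score is high.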

7
Sample Timeline
  • Top 5 sentences by score, ordered
    chronologically
  • As Mr. Sharif left the White House, he said the
    talks focused on the India-Pakistan dispute over
    the Kashmir region and U.S. concerns about
    nuclear arms tests conducted by India and
    Pakistan.
  • In 1989, Pakistan paid the U.S. close to $700
    million for 28 F-16 fighter planes.
  • But Pakistan was never refunded for the F-16s.
  • In return, Pakistan will withdraw its claim to
    the F-16 aircraft.
  • Delivery of the jets was stopped in 1990 because
    of the U.S. arms embargo against Pakistan because
    of its nuclear program.
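
The selection step behind a timeline like this one (keep the top-scoring sentences, then present them in time order) might look like the sketch below. The scores and dates are hypothetical, and the slides do not say which date is attached to each sentence; the article's publication date is assumed here.

```python
from datetime import date


def build_timeline(candidates, k=5):
    """Keep the k highest-scoring candidates, then order them chronologically.

    candidates: list of (score, date, text) tuples.
    """
    top = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return [text for _, _, text in sorted(top, key=lambda c: c[1])]


# Hypothetical scores and dates, for illustration only.
candidates = [
    (0.9, date(1998, 12, 3), "But Pakistan was never refunded for the F-16s."),
    (0.8, date(1998, 12, 1), "In return, Pakistan will withdraw its claim to the F-16 aircraft."),
    (0.1, date(1998, 12, 2), "A low-scoring sentence that is not selected."),
]
timeline = build_timeline(candidates, k=2)
```

Sorting by score first and by date second keeps the two concerns separate: ranking decides what appears, chronology decides where.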

9
Event Selection Methods
  • Event discovery through clustering
  • Similar to text summarization
  • Each cluster represents an event
  • Hierarchical agglomerative clustering with sparse
    vectors
  • K-means (k = 5, 10, 20) with and without
    dimensionality reduction
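
As a rough sketch of hierarchical agglomerative clustering over sparse sentence vectors: each vector is a dict from term to weight, and the two clusters with the highest average cross-cluster cosine similarity are merged until the requested number of clusters remains. That linkage is one common reading of "group average" and is an assumption here, as are all function names. The implementation is deliberately naive and only suited to small inputs.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def average_link(a, b, vectors):
    """Mean cosine similarity over all pairs with one member in each cluster."""
    total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def hac(vectors, num_clusters):
    """Merge the most similar pair of clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    target = max(num_clusters, 1)
    while len(clusters) > target:
        # combinations yields index pairs (a, b) with a < b.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


vectors = [
    {"f16": 1.0, "pakistan": 1.0},
    {"f16": 1.0, "delivery": 1.0},
    {"treaty": 1.0, "nuclear": 1.0},
    {"nuclear": 1.0, "test": 1.0},
]
clusters = hac(vectors, num_clusters=2)
```

Here the two F-16 sentences end up in one cluster and the two nuclear-test sentences in the other.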

10
Clustering Results
  • Tried group-average HAC without
    dimensionality reduction
  • Representative sentence chosen for maximal cosine
    similarity with cluster centroid
  • Underwhelming results with respect to evaluation
    metrics
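
Picking a representative by cosine similarity to the cluster centroid can be sketched as follows. Sparse vectors are again dicts from term to weight, and the helper names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def centroid(members):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in members:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(members) for term, weight in total.items()}


def representative(cluster, vectors):
    """Index of the cluster member with maximal cosine similarity to the centroid."""
    center = centroid([vectors[i] for i in cluster])
    return max(cluster, key=lambda i: cosine(vectors[i], center))


vectors = [
    {"f16": 1.0, "pakistan": 1.0},
    {"f16": 1.0, "pakistan": 1.0, "delivery": 1.0},
    {"delivery": 1.0},
]
best = representative([0, 1, 2], vectors)
```

In this toy cluster the second sentence is chosen because it points in the same direction as the centroid.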

11
Sample Timeline 2
  • Top 5 useful cluster representatives, taken
    from 15 clusters, ordered chronologically
  • Mr. Clinton also said he and Mr. Sharif would try
    to resolve a dispute over a canceled sale of U.S.
    fighter aircraft.
  • Mr. Clinton wants India and Pakistan, which each
    conducted nuclear tests last May, to sign the
    Comprehensive Test Ban Treaty as they have
    indicated they would do.
  • All of you know my concern to do everything we
    can to end the nuclear competition in South Asia
    which I believe is a threat to Pakistan and India
    and to the stability of the world.
  • The delivery was blocked under a 1990 law barring
    direct U.S. military sales to Pakistan because of
    its development of nuclear weapons.
  • He met with Pakistan's Prime Minister Omar Sharif.

12
Clustering in Progress
  • Using SVD dimensionality reduction
  • Noun phrase features with part-of-speech
    filtering
  • Representative sentences chosen from each cluster
    based on Useful score

13
Problem Areas
  • Small annotated corpus
  • Few annotated events per topic give a coarse
    evaluation
  • Weak correspondence between clusters and labeled
    events

14
Unexplored Possibilities
  • Detection and exploitation of temporal features
    in sentences
  • Clustering with different sentence similarity
    measures
  • WordNet-derived semantic distance