A Statistical Model for Domain-Independent Text Segmentation

1
A Statistical Model for Domain-Independent Text
Segmentation
  • Masao Utiyama and Hitoshi Isahara
  • Presentation by Matthew Waymost

2
Introduction
  • The algorithm finds the maximum-probability
    segmentation using a statistical method.
  • No training required.
  • Domain-independent.

3
Other Methods
  • Lexical Cohesion
  • Statistical
  • Hidden Markov model (Yamron et al., 1998)

4
Statistical Model
  • Find the probability Pr(S|W) of a segmentation S
    given a text W.
  • Use Bayes' rule to find the maximum-probability
    segmentation.
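
The Bayes-rule step can be written out as below. The hat notation for the chosen segmentation is an assumption made here for illustration; the point is only that Pr(W) is the same for every candidate S and so drops out of the maximization.

```latex
\hat{S} = \arg\max_{S} \Pr(S \mid W)
        = \arg\max_{S} \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W)}
        = \arg\max_{S} \Pr(W \mid S)\,\Pr(S)
```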

5
Definition of Pr(W|S)
  • Assume statistical independence of topics and of
    words within the scope of a topic.
  • Assume different topics have different word
    distributions.
  • Can be broken down into a double product of
    probabilities across words and segments.
  • Uses Laplace estimator for word frequency
    prediction.
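
A minimal sketch of the double product, computed in log space. The function names are hypothetical, and the smoothing shown (add one to each count, add the vocabulary size to the segment length) is the standard Laplace estimator, assumed here to be what the slide refers to.

```python
import math
from collections import Counter

def segment_log_prob(segment, vocab_size):
    """Log-probability of one segment's words under a Laplace-smoothed
    unigram model estimated from that same segment."""
    counts = Counter(segment)
    n = len(segment)
    # Laplace estimator: (count + 1) / (segment length + vocabulary size)
    return sum(math.log((counts[w] + 1) / (n + vocab_size)) for w in segment)

def text_log_prob(segments):
    """log Pr(W|S): sum over segments, then over the words in each segment.
    The sums mirror the double product under the independence assumptions."""
    vocab_size = len({w for seg in segments for w in seg})
    return sum(segment_log_prob(seg, vocab_size) for seg in segments)
```

For example, `text_log_prob([["a", "a"], ["b", "b"]])` is higher than `text_log_prob([["a", "b"], ["a", "b"]])`, because segments with a consistent vocabulary are more probable under this model.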

6
Definition of Pr(S)
  • Varies depending on prior information.
  • In general, assume no prior information.
  • Prevents the algorithm from generating too many
    segments, counteracting Pr(W|S).
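
One uninformative prior with this effect is sketched below. It is an assumption about the form used, not something stated on the slide: n is the number of words in the text and m the number of segments, so each additional segment adds a fixed penalty of log n to the cost.

```latex
\Pr(S) = n^{-m}
\quad\Longrightarrow\quad
-\log \Pr(S) = m \log n
```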

7
Algorithm
  • Convert the probability function into a cost
    function by taking the negative log.
  • Given a text W, define g_i to be the gap between
    word w_i and w_(i+1).
  • Create a directed graph where the nodes are the
    gaps between words and the edges cover a segment
    between the gaps the edge connects.
  • Calculate all edge weights by using the cost
    function and find the minimum-cost path from the
    first to last node.

8
Algorithm
  • The calculated path represents the minimum-cost
    segmentation: each edge on the path corresponds
    to one segment.
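
The graph search can be sketched as a simple dynamic program over the gaps. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the edge cost assumes a Laplace-smoothed word model plus a per-segment prior cost of log n, and edge costs are recomputed naively rather than cached.

```python
import math
from collections import Counter

def segment_cost(words, vocab_size, total_words):
    """Edge weight: negative log-probability of the segment's words
    (Laplace-smoothed) plus an assumed per-segment prior cost log(n)."""
    counts = Counter(words)
    n = len(words)
    likelihood = sum(math.log((n + vocab_size) / (counts[w] + 1)) for w in words)
    return likelihood + math.log(total_words)

def segment_text(words):
    """Return the gaps on the minimum-cost path from gap 0 to gap n.
    Gap j sits after the first j words; edge (i, j) covers words[i:j]."""
    n = len(words)
    vocab_size = len(set(words))
    best = [0.0] + [math.inf] * n   # best[j]: cheapest cost of reaching gap j
    back = [0] * (n + 1)            # back[j]: previous gap on that path
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + segment_cost(words[i:j], vocab_size, n)
            if cost < best[j]:
                best[j], back[j] = cost, i
    # Follow the back-pointers from the last gap to the first.
    gaps = [n]
    while gaps[-1] > 0:
        gaps.append(back[gaps[-1]])
    return gaps[::-1]
```

On a toy text of ten "a" tokens followed by ten "b" tokens, `segment_text` returns `[0, 10, 20]`, i.e. a single boundary where the vocabulary changes.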

9
Algorithm Features
  • Determines the number of segments automatically,
    but the user can also fix it by specifying the
    number of edges in the shortest path.
  • Can specify where segmentation occurs by only
    using a subset of all possible edges where both
    nodes connected by the edge meet user-specified
    conditions.
  • Algorithm is insensitive to text length.
  • Good for summarization
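
Both constraints fit naturally into a layered shortest-path search. The sketch below is an assumption about how this could be done, with hypothetical names throughout: `edge_cost(i, j)` is any caller-supplied cost for the segment between gaps i and j, `num_segments` fixes the number of edges, and `allowed` restricts which interior gaps may be boundaries.

```python
import math

def constrained_segmentation(n, edge_cost, num_segments=None, allowed=None):
    """Minimum-cost path from gap 0 to gap n under optional constraints."""
    if n < 1:
        raise ValueError("need at least one word")
    # Only gaps that meet the user's condition (plus both ends) are nodes.
    nodes = [g for g in range(n + 1)
             if allowed is None or g in allowed or g in (0, n)]
    max_edges = num_segments if num_segments is not None else n
    best = [{0: 0.0}]   # best[k][g]: cheapest path to gap g using k edges
    back = [{}]         # back[k][g]: previous gap on that path
    for _ in range(max_edges):
        layer, prev = {}, {}
        for i, cost_so_far in best[-1].items():
            for j in nodes:
                if j > i:
                    cost = cost_so_far + edge_cost(i, j)
                    if cost < layer.get(j, math.inf):
                        layer[j], prev[j] = cost, i
        best.append(layer)
        back.append(prev)
    # Either use the requested edge count or pick the cheapest one.
    if num_segments is not None:
        candidates = [num_segments]
    else:
        candidates = range(1, max_edges + 1)
    k = min(candidates, key=lambda m: best[m].get(n, math.inf))
    if n not in best[k]:
        raise ValueError("no path satisfies the constraints")
    gaps = [n]
    while k > 0:
        gaps.append(back[k][gaps[-1]])
        k -= 1
    return gaps[::-1]
```

With the toy cost `lambda i, j: (j - i) ** 2 + 1` and `n = 6`, requesting `num_segments=2` gives `[0, 3, 6]`, while `allowed={2}` gives `[0, 2, 6]`.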

10
Algorithm Evaluation
  • The algorithm was compared against C99 (Choi 2000).
  • An artificial test corpus extracted from the Brown
    corpus was used.
  • A probabilistic error metric was used to evaluate
    performance.
  • Results of the Utiyama algorithm were significantly
    better, at the 1% level, than the Choi algorithm.
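
The slide does not name the metric. Assuming it is the P_k measure of Beeferman et al., which is commonly used in this line of work, a minimal sketch looks as follows. The function name and the input format (one segment id per word or sentence) are choices made here for illustration.

```python
def pk(reference, hypothesis, k=None):
    """Fraction of position pairs (i, i + k) on which the reference and the
    hypothesis disagree about whether both positions share a segment.
    Inputs are equal-length lists of segment ids, one per unit."""
    n = len(reference)
    if k is None:
        # Conventional choice: half the mean reference segment length.
        k = max(1, round(n / (2 * len(set(reference)))))
    errors = 0
    for i in range(n - k):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        errors += same_ref != same_hyp
    return errors / (n - k)
```

A perfect hypothesis scores 0.0; lower is better. For the reference `[0, 0, 0, 0, 1, 1, 1, 1]`, a hypothesis that places everything in one segment scores 1/3.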

11
Algorithm Evaluation
  • Assessment of algorithm using real texts is
    needed.
  • Advantages over HMM:
  • No training required (implies domain independence).
  • Can incorporate probabilistic information into
    model.
  • Might be expandable to detect word descriptions
    in text.