A Statistical Model for Domain-Independent Text Segmentation

About This Presentation

Slides: 12

Provided by: MatthewW170

Learn more at: http://www1.cs.columbia.edu

Transcript and Presenter's Notes

1
A Statistical Model for Domain-Independent Text
Segmentation

2
Introduction

3
Other Methods

4
Statistical Model

5
Definition of Pr(W|S)

Assume statistical independence of topics and of
words within the scope of a topic.
Assume different topics have different word
distributions.
The probability can therefore be broken down into a
double product of word probabilities, taken over
segments and over the words within each segment.
Uses the Laplace estimator to estimate word
probabilities within each segment.
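
The double product with Laplace smoothing can be sketched as follows. This is a minimal illustration, not the presenters' code: it assumes the add-one form (count + 1) / (segment length + vocabulary size), and the names segment_log_prob, text_log_prob and vocab_size are hypothetical.

```python
import math
from collections import Counter

def segment_log_prob(segment, vocab_size):
    """Log-probability of one segment's words under add-one (Laplace) smoothing.

    Assumed form: Pr(w | segment) = (count(w) + 1) / (len(segment) + vocab_size).
    """
    counts = Counter(segment)
    n = len(segment)
    return sum(math.log((counts[w] + 1) / (n + vocab_size)) for w in segment)

def text_log_prob(segments, vocab_size):
    """Sum over segments (the outer product) of the per-segment log-probabilities."""
    return sum(segment_log_prob(seg, vocab_size) for seg in segments)

# Example: one text split two ways, with a two-word vocabulary.
words = ["a", "a", "a", "b", "b", "b"]
one_segment = text_log_prob([words], vocab_size=2)
two_segments = text_log_prob([words[:3], words[3:]], vocab_size=2)
print(one_segment, two_segments)
```

Working in log space turns the double product into a double sum, which avoids numerical underflow on long texts.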

6
Definition of Pr(S)

7
Algorithm

Convert the probability function into a cost
function by taking the negative log.
Given a text W, define g_i to be the gap between
words w_i and w_{i+1}.
Create a directed graph whose nodes are the gaps
between words and whose edges each cover the
segment between the two gaps they connect.
Calculate all edge weights using the cost
function and find the minimum-cost path from the
first node to the last.
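
The steps above can be sketched as a dynamic program over the gap graph. This is an illustration under stated assumptions rather than the authors' implementation: word probabilities use add-one smoothing, and because the transcript does not give the definition of Pr(S), a hypothetical constant segment_penalty stands in for the negative log of the prior's contribution per segment.

```python
import math
from collections import Counter

def edge_cost(words, i, j, vocab_size, segment_penalty):
    """Cost of the edge from gap i to gap j, i.e. of the segment words[i:j].

    Assumption: negative log of add-one-smoothed word probabilities,
    plus a constant per-segment penalty standing in for -log Pr(S).
    """
    seg = words[i:j]
    counts = Counter(seg)
    n = len(seg)
    data_cost = -sum(math.log((counts[w] + 1) / (n + vocab_size)) for w in seg)
    return data_cost + segment_penalty

def segment(words, segment_penalty):
    """Return the segment end positions on the minimum-cost path from gap 0 to gap n."""
    n = len(words)
    vocab_size = len(set(words))
    best = [math.inf] * (n + 1)   # best[j]: cheapest known cost of reaching gap j
    back = [0] * (n + 1)          # back[j]: previous gap on that cheapest path
    best[0] = 0.0
    for j in range(1, n + 1):     # edges only point forward, so this order is topological
        for i in range(j):
            cost = best[i] + edge_cost(words, i, j, vocab_size, segment_penalty)
            if cost < best[j]:
                best[j], back[j] = cost, i
    bounds, j = [], n
    while j > 0:                  # walk the back-pointers from the last gap to the first
        bounds.append(j)
        j = back[j]
    return bounds[::-1]

words = ["a"] * 4 + ["b"] * 4
print(segment(words, segment_penalty=1.0))
```

Since every edge runs from an earlier gap to a later one, the graph is acyclic and a single left-to-right pass finds the shortest path. The sketch recomputes each edge cost from scratch for clarity, which is slower than necessary.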

8
Algorithm

The resulting path gives the minimum-cost
segmentation: each edge on the path corresponds
to one segment.

9
Algorithm Features

Determines the number of segments automatically,
but the user can also fix it by specifying the
number of edges in the shortest path.
Can specify where segmentation occurs by only
using a subset of all possible edges where both
nodes connected by the edge meet user-specified
conditions.
Algorithm is insensitive to text length.
Good for summarization.
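
Both constraints can be sketched in one variant of the shortest-path search: fixing the number of edges on the path, and allowing only edges whose end nodes belong to a permitted set of gaps. This is a hypothetical illustration, not the presenters' code; segment_constrained, num_segments and allowed_gaps are made-up names, and the edge cost again assumes add-one smoothing.

```python
import math
from collections import Counter

def seg_cost(seg, vocab_size):
    """Negative log-probability of a segment under add-one smoothing (assumed form)."""
    counts = Counter(seg)
    n = len(seg)
    return -sum(math.log((counts[w] + 1) / (n + vocab_size)) for w in seg)

def segment_constrained(words, num_segments, allowed_gaps=None):
    """Minimum-cost path with exactly num_segments edges over the permitted gaps."""
    n = len(words)
    vocab_size = len(set(words))
    if allowed_gaps is None:
        nodes = list(range(n + 1))
    else:
        # The start and end of the text are always nodes of the graph.
        nodes = sorted({0, n} | {g for g in allowed_gaps if 0 < g < n})
    # best[k][j]: cheapest cost of reaching gap j using exactly k edges
    best = [{g: math.inf for g in nodes} for _ in range(num_segments + 1)]
    back = [{} for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, num_segments + 1):
        for j in nodes:
            for i in nodes:
                if i >= j or best[k - 1][i] == math.inf:
                    continue
                cost = best[k - 1][i] + seg_cost(words[i:j], vocab_size)
                if cost < best[k][j]:
                    best[k][j] = cost
                    back[k][j] = i
    if best[num_segments][n] == math.inf:
        raise ValueError("no path with that many edges over the permitted gaps")
    bounds, j = [], n
    for k in range(num_segments, 0, -1):
        bounds.append(j)
        j = back[k][j]
    return bounds[::-1]

words = ["a"] * 4 + ["b"] * 4
print(segment_constrained(words, 2))
print(segment_constrained(words, 2, allowed_gaps={3, 6}))
```

In practice the permitted gaps might be sentence or paragraph ends, so that a boundary never falls mid-sentence.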

10
Algorithm Evaluation

Compared algorithm against C99 (Choi 2000).
Used an artificial test corpus extracted from the
Brown corpus.
Used a probabilistic error metric to evaluate
performance.
Results of the Utiyama algorithm were significantly
better than those of the Choi algorithm at the 1%
level.
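
The slide does not name the probabilistic error metric. The sketch below assumes it is the P_k metric of Beeferman et al., the usual choice in this line of work: it estimates the probability that two words k positions apart are wrongly judged to be in the same or in different segments. The function names and the default window (half the mean reference segment length) are assumptions made for illustration.

```python
def labels(bounds):
    """Map each word position to its segment index, given segment end positions."""
    out, start = [], 0
    for seg_id, end in enumerate(bounds):
        out.extend([seg_id] * (end - start))
        start = end
    return out

def pk(ref_bounds, hyp_bounds, k=None):
    """P_k error between a reference and a hypothesised segmentation (lower is better)."""
    ref, hyp = labels(ref_bounds), labels(hyp_bounds)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of words")
    n = len(ref)
    if k is None:
        # Assumed default: half the average reference segment length.
        k = max(1, round(n / len(ref_bounds) / 2))
    if n <= k:
        raise ValueError("text is too short for the chosen window")
    errors = 0
    for i in range(n - k):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        errors += same_in_ref != same_in_hyp
    return errors / (n - k)

print(pk([4, 8], [4, 8]))  # identical segmentations
print(pk([4, 8], [8]))     # hypothesis misses the only boundary
```

Segmentations are given as lists of segment end positions, so [4, 8] means two segments of four words each.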

11
Algorithm Evaluation