Title: Clustering of Imperfect Transcripts Using a Novel Similarity Measure
1Clustering of Imperfect Transcripts Using a Novel
Similarity Measure
- Oktay Ibrahimov¹, Ishwar K. Sethi¹ and Nevenka Dimitrova²
- ¹Department of Computer Science and Engineering
- Oakland University, Rochester, MI 48309
- ²Philips Research
- 345 Scarborough Road
- Briarcliff Manor, NY 10510-2099
2Outline
- Motivation
- Chi-square Measure of Similarity
- Clustering Procedure
- Experimental Results
- Summary
3Motivation
- Multimedia information retrieval
- To complement other analysis channels, e.g.
video, closed captioned text etc.
- Stand alone applications, e.g. telephone
conversations, radio broadcasts etc.
4Our Approach to Content Extraction from Audio
Documents
- Commercial off-the-shelf (COTS) software for
automatic speech recognition (ASR)
5The bill would ban unlimited soft
money contributions to political parties by
corporations, unions, and individuals it also
would bar corporations and unions from paying for
their own issue ads that mention federal
candidates just before an election and it would
require outside interest groups that run ads to
disclose expenditures and contributors ..
(Original)
A limited soft money contributions to political
party is more corporations unions and individual
to also would bar corporations and unions from
paying for their own issue at the mention federal
candidates just before election and it would
require special risk groups but what has to
disclose their expenditures and contributors
(Transcribed)
6Characteristics of Documents Transcribed Through
ASR
- Transcription errors (> 10%)
- Ill-defined boundaries between different topics
- Conversational speech versus written text
- Length (≈ 100 words or less)
7Measuring Similarity between Transcribed
Documents (χ² Measure)
- Transcript intersection
- Determined through common words
- Information contribution
- Evaluate the amount of information contributed by
a document to the intersection
- Informative closeness
- Assess informative closeness of common words
- Compute similarity
8Assumptions
- Information contained in a document is the sum of
information contained in its words
- Information conveyed by a word in a document is
proportional to its weight
- Informative closeness of common words (usage
context) is conveyed by the distribution pattern
of their weights
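The first two assumptions can be illustrated with a short sketch. It assumes each document is already represented as a mapping from word to weight; the function names and the toy weights are hypothetical and are not taken from the slides.

```python
def document_information(weights):
    """Information in a document: the sum of its word weights."""
    return sum(weights.values())


def intersection_contribution(doc, other):
    """Share of doc's information carried by the words it has in common with other."""
    total = document_information(doc)
    if total == 0:
        return 0.0
    common = doc.keys() & other.keys()
    return sum(doc[w] for w in common) / total


# Toy word weights, invented for illustration only.
d1 = {"iraq": 4.0, "council": 4.0, "agree": 2.0}
d2 = {"iraq": 3.0, "council": 1.0, "market": 6.0}

print(intersection_contribution(d1, d2))  # common words carry 8 of d1's 10 units
print(intersection_contribution(d2, d1))  # common words carry 4 of d2's 10 units
```

The contribution is not symmetric: the same common words can carry most of one document's information and only a small part of the other's.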
9Information Conveyed by a Word in a Document
(Okapi Technique)
- Collection frequency weight
- Number of documents containing the word wi
- Frequency of wi in Dj
- Mean normalized document length
- N: total number of documents
- K and b: empirical constants
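A minimal sketch of an Okapi-style word weight built from the quantities listed above. It assumes the commonly cited combined-weight form (collection frequency weight times a length-normalized term frequency); the function and parameter names, and the default values K = 2.0 and b = 0.75, are assumptions for illustration and are not taken from the slides.

```python
import math


def collection_frequency_weight(n_docs, docs_with_word):
    """Rarer words get a larger weight: log(N / n_i)."""
    return math.log(n_docs / docs_with_word)


def okapi_weight(tf, docs_with_word, n_docs, ndl, K=2.0, b=0.75):
    """Combined weight of a word in a document.

    tf             -- frequency of the word in the document
    docs_with_word -- number of documents containing the word
    n_docs         -- total number of documents (N)
    ndl            -- document length divided by the mean document length
    K, b           -- empirical constants (assumed defaults)
    """
    cfw = collection_frequency_weight(n_docs, docs_with_word)
    return cfw * tf * (K + 1) / (K * ((1 - b) + b * ndl) + tf)


# A word seen once, in a document of average length, found in 10 of 1000 documents.
print(okapi_weight(tf=1, docs_with_word=10, n_docs=1000, ndl=1.0))
```

With this form, a word that occurs once in a document of average length receives exactly its collection frequency weight, and the same word in a longer document receives less.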
10Information Conveyed by Documents
[Figure: documents D1, D2 and D3, with the intersection of documents marked]
11Similarity Computation
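One way the steps outlined earlier (intersection, information contribution, informative closeness) might be combined is sketched below. This is an illustrative stand-in, not the authors' formula: closeness is taken as one minus the chi-square statistic of the common-word weights divided by their total, and it is scaled by the smaller of the two information shares. All names are hypothetical.

```python
def chi_square_closeness(doc_a, doc_b):
    """Closeness in [0, 1] of the weight patterns of the common words."""
    common = [w for w in doc_a.keys() & doc_b.keys()
              if doc_a[w] > 0 and doc_b[w] > 0]
    if not common:
        return 0.0
    row_a = [doc_a[w] for w in common]
    row_b = [doc_b[w] for w in common]
    total_a, total_b = sum(row_a), sum(row_b)
    grand = total_a + total_b
    stat = 0.0
    for wa, wb in zip(row_a, row_b):
        column = wa + wb
        expected_a = total_a * column / grand
        expected_b = total_b * column / grand
        stat += (wa - expected_a) ** 2 / expected_a
        stat += (wb - expected_b) ** 2 / expected_b
    # For a two-row table the statistic never exceeds the grand total.
    return 1.0 - stat / grand


def similarity(doc_a, doc_b):
    """Closeness scaled by the smaller share of information in the intersection."""
    common = doc_a.keys() & doc_b.keys()
    total_a, total_b = sum(doc_a.values()), sum(doc_b.values())
    if not common or total_a == 0 or total_b == 0:
        return 0.0
    share_a = sum(doc_a[w] for w in common) / total_a
    share_b = sum(doc_b[w] for w in common) / total_b
    return chi_square_closeness(doc_a, doc_b) * min(share_a, share_b)


a = {"x": 1.0, "y": 3.0}
b = {"x": 3.0, "y": 1.0}
print(similarity(a, a), similarity(a, b))
```

Two documents that share the same words but weight them very differently score lower than two documents whose common words follow the same weight pattern, which is the behaviour the third assumption describes.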
12Sample Results
Original Reader ViaVoice Trainer
13χ² Measure Discrimination Ability
14Transcript Clustering
- Clusters are formed sequentially
- Documents in a cluster must exhibit certain
information sharing
- Cluster center is a document sharing highest
amount of information with other documents in the
cluster
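The procedure above can be sketched as a simple sequential pass. This is a sketch under stated assumptions, not the authors' exact procedure: the seed choice, the threshold test, and the Jaccard word overlap used as the similarity are all stand-ins, and every name is hypothetical.

```python
def sequential_clusters(doc_ids, sim, threshold):
    """Form clusters one at a time from the documents not yet assigned."""
    unassigned = list(doc_ids)
    clusters = []
    while unassigned:
        seed = unassigned[0]
        # Members must share enough information with the seed.
        members = [d for d in unassigned
                   if d == seed or sim(seed, d) >= threshold]
        # Center: the member with the highest total similarity to the others.
        center = max(members,
                     key=lambda c: sum(sim(c, d) for d in members if d != c))
        clusters.append({"center": center, "members": members})
        unassigned = [d for d in unassigned if d not in members]
    return clusters


# Toy documents as word sets, with Jaccard overlap standing in for the similarity.
docs = {
    "a": {"iraq", "council", "un"},
    "b": {"iraq", "council", "annan"},
    "c": {"stock", "dow"},
    "d": {"stock", "dow", "nasdaq"},
}


def jaccard(x, y):
    return len(docs[x] & docs[y]) / len(docs[x] | docs[y])


result = sequential_clusters(docs, jaccard, threshold=0.4)
print([c["members"] for c in result])
```

Any pairwise similarity function with the same signature can be passed in place of `jaccard`.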
15Cluster Centroid
Minimum of
16Results: Processing Stages
Stop List Filtering
Stemming
Weight Computation
Chi-square similarity and clustering
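The first two stages can be sketched as follows. The stop list and the suffix-stripping stemmer here are deliberately tiny, hypothetical stand-ins; they are not the stop list or stemming method used in the experiments.

```python
import re

# Illustrative stop list only; a real one is much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "for", "by", "that"}
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The bill would ban unlimited contributions to political parties"))
```

The resulting stems are the units that the weight computation and similarity stages then operate on.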
17Experiment 1
- TDT 2 Careful Transcription Text Corpus
- Broadcast news from ABC, CNN, and VOA
- 202 stories out of 540 stories
- Average story length 72 words
- Vocabulary size 17,000 words after stopping and
table look-up
18Sample Results
19Cluster 43: Iraq, UN Resolution (11 documents)
CNN19980302.2130.0072.utf sim = 0.995
President Clinton says the _U_N security council
has sent a clear message to Iraqi, give
international inspectors unrestricted access to
all suspected weapons sites. The security council
unanimously approved a resolution tonight
endorsing the agreement with Iraq over _U_N
weapon inspections. _U_N secretary general Kofi
Annan says now it's up to Baghdad to hold up its
end of the agreement.
agree 5.3079, 5.3079
understand 5.1537, 5.1537
council 4.8601, 4.8601
iraq 4.7936, 4.7936
resolution 4.6209, 4.6209
insist 4.5653, 4.5653
unanimous 4.4657, 4.4657
accord 4.4657, 4.4657
deception 4.4657, 4.4657
CNN19980217.1600.0129.utf sim = 0.232266817212973
Iraqi officials were quick to criticize the
president's comments. Deputy prime minister
Tariq Aziz says the _U_S has no authority to
attack and is out to destroy a nation. Diplomats
say it's increasingly likely _U_N secretary
general Kofi Annan will go to Baghdad to try to
resolve this. He is scheduled to meet again today
with delegations from the five _U_N security
council members.
diplomacy 3.8244, 4.9120
positive 3.2995, 4.6051
balance 3.1117, 4.3431
iraq 4.7936, 4.0809
member 2.6172, 3.6529
council 4.8601, 3.5292
20Cluster 32: Stock Market (9 documents)
CNN19980114.1600.1011.utf sim = 0.995
In fact, a second wind of sorts in the stock
market this afternoon, the final hour of trading,
by the closing bell the Dow Jones Industrials
finished near their highs of the day up fifty-two
and a half points at seventy-seven eighty-four in
very active trading. The broader market indices
all posted gains as well, and the NASDAQ
composite continuing to move higher, it is up six
and half points. Shares of Intel, however, fell
one and a half at seventy-five and
seven-sixteenths after the chip giant said its
profit margins will shrink in the near term.
schuch 6.0760, 6.0760
upside 6.0760, 6.0760
shrink 6.0760, 6.0760
beverly 6.0760, 6.0760
trade 5.5716, 5.5716
finish 5.3503, 5.3503
index 5.2826, 5.2826
desk 5.2826, 5.2826
chip 5.2826, 5.2826
intel 5.2826, 5.2826
margin 5.2826, 5.2826
CNN19980126.1600.1022.utf sim = 0.299013101987292
In any case blue chip stocks breaking a three day
losing streak. Investors remain cautious and kept
the Dow Jones Industrials in moderate rate
trading range for most of the day. The Dow did
manage to close higher by twelve points, at
seventy-seven twelve. The broader markets closed
however mostly lower, weakness in the technology
sector dragging the NASDAQ composite down
fourteen points. And bonds surged as the dollar
gained strength, the thirty year treasury up one
and four thirty-seconds of a point in price. That
has the yield back down to five point eighty-nine
percent.
trade 5.5716, 4.3196
chip 5.2826, 5.9290
compose 4.8185, 5.4081
market 4.4230, 3.4291
stock 4.3087, 3.3405
broad 4.0251, 4.5176
dow 3.8487, 6.0489
nasdaq 3.8487, 4.3196
gain 3.5610, 3.9967
jones 3.4404, 3.8614
industry 3.0553, 3.4291
21Experiment 2
- 25 short audio files from the Internet
- 4 IBM ViaVoice trainers to obtain 4 transcribed
versions of varying quality for each file
- A total of 125 documents (25 original transcripts
and 4 × 25 transcribed versions)
22Audio File Characteristics
23Audio File Characteristics
24Transcription Summary
ERROR RATES FOR 4 MODEL (2 MALE, 2 FEMALE)
TRANSCRIPTS
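The slides do not state how the error rates were computed. A common choice, assumed here, is the word error rate: the word-level edit distance between the original text and the transcript, divided by the number of words in the original. The function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j]: edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One inserted word against five reference words.
print(word_error_rate("energy crisis has gone national",
                      "energy crisis has gone on national"))
```

Because insertions count as errors, this rate can exceed 1 when a transcript is much longer than the original.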
25Sample Transcripts
Wave File: Business01.wav
Original Text: California's energy crisis has gone
national. On Tuesday, the new Bush administration
guaranteed electricity and natural gas would
continue to be available to California, but only
for two weeks. Today, Federal Reserve Chairman
Greenspan expressed his own concerns.
Gasior Model (error rate 42): One is energy crisis has
gone national on this that and it was an
illustration of their electricity and natural gas
would continue to a nearby wall in California but
only for July launched a Federal Reserve Chairman
Greenspan expressed his
Iraicu Model (error rate 29): On his energy crisis has gone on
national all Tuesday the new bushel
administration and guaranteeing electricity and
natural gas would continue the be a battle over
California's but only for two weeks today fender
reserved jam Greenspan expressed his home can
star -
Lewis Model (error rate 90): Warner's
letter to the core of national or
Molitoris Model (error rate 42): One is energy crisis
along national if his abortion laws rationing
guarantee electricity and natural-gas with
continued to be available in California but only
for two weeks to a Federal Reserve Chairman
Greenspan and expressed
26Sample Transcripts
Wave File: Business02.wav
Original Text: The
problems that we're now seeing in the energy
area, specifically-- obviously, natural gas--
around the country, and very obviously,
California's electric power production, are not
aberrations, but in my judgment, a significant
problem that this country is going to have to
address, and I think address rather quickly, and
I think coherently and cogently. Obviously, if
the energy structure of this country is
inadequate, or in some way excessively costly, it
will undermine economic growth in this country,
and therefore, is a major issue which must be
addressed.
Gasior Model (error rate 42): nine
problems or non sheen in the energy around great
of its fleet California's record par for a
one-hour operations in March and a different
today significant problem this country and have
to impress American press rather quickly and I
coherently encouraging when a structure of this
country is and I would Warren shine when have
well undermine economic growth in this country
and therefore is a major issue which must be
crushed
Iraicu Model (error rate 29): be
problems or winnow seen the damage your call to
coat . you look on not apparition but and
significant problem this country , have cut all
structure of this country as and will
Lewis Model (error rate 90): Colin and normal
rehearse broken we'll "
Molitoris Model (error rate 42): the problems or were not 0 3 on not
operation makes but structural and will undermine
the not grow from
27Highlights of the Clustering Results
- A total of 53 clusters are formed.
- Only 25 clusters have at least two transcripts.
- 22 of these clusters are formed by the different
versions of the same transcript.
- 22 of the 28 singleton clusters are formed by
transcripts corresponding to Model 3.
28Highlights of the Clustering Results (Porter's
Algorithm)
- A total of 48 clusters are formed.
- Again only 25 clusters with at least two
transcripts are formed.
- 20 of these clusters are formed by the different
versions of the same transcript.
- 18 of the 22 singleton clusters are formed by
Model 3 transcripts.
29Summary and Conclusion
- Robust performance against high transcription
errors
- Potential for use with telephone quality speech
- Further improvements are being explored through
fuzzy clustering
30Thank You!