Title: Goals and Objectives
1Phonetic Dissection of Switchboard-Corpus Automatic Speech Recognition Systems
Steven Greenberg and Shuangyu Chang
International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704
steveng, shawnc_at_icsi.berkeley.edu
http://www.icsi.berkeley.edu/steveng
Large Vocabulary Continuous Speech Recognition Workshop, Maritime Institute of Technology, Linthicum Heights, MD, May 4, 2001
2Take Home Messages
- PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
- Many different analyses (to follow) support this conclusion
- Consonants appear to be more important than vowels
- SYLLABLE STRUCTURE IS ALSO AN IMPORTANT FACTOR FOR ACCURATE RECOGNITION
- The pattern of errors differs across the syllable (onset, nucleus, coda) and exhibits consistent patterns that are difficult to discern with other units of analysis
- STRESS-ACCENT MAY PLAY AN IMPORTANT ROLE, PARTICULARLY FOR UNDERSTANDING THE NATURE OF WORD-DELETION ERRORS
- Relation among stress-accent, syllable structure, vocalic identity and length
- THE NATURE OF PRONUNCIATION MODELS AND THEIR RELATION TO LEXICAL REPRESENTATIONS IS A POTENTIALLY KEY FACTOR
- The unit of lexical representation (phones, articulatory features, etc.) is probably of the utmost importance for optimizing ASR performance
- FUTURE PROGRESS IN ASR SYSTEM DEVELOPMENT IS LIKELY TO DEPEND ON DEEP INSIGHT INTO THE NATURE OF SPOKEN LANGUAGE
3Structure of the Presentation
- DESCRIPTION OF THE CORPUS MATERIALS FOR THE 2000 AND 2001 EVALUATIONS
- 2000: Brief (2-17 s) utterances spoken by hundreds of different speakers. No relation to competitive evaluation
- 2001: A subset of the competitive evaluation
- BRIEF OVERVIEW OF THE ANALYSIS REGIME COMMON TO THE 2000 AND 2001 PHONETIC EVALUATIONS
- File formats, time-mediated alignment, statistical analysis of the corpora, etc.
- Details are contained in Linguistic Dissection .. (in workshop notebook) and in An Introduction . (NIST Speech Transcription Workshop, 2000)
- ANALYSES AND PATTERNS COMMON TO BOTH 2000 AND 2001 EVALUATIONS
- Syllable structure, phonetic segments, articulatory-acoustic features. Details pertaining to the 2000 evaluation are in the papers cited above
- PHONETIC CONFUSION MATRICES FOR THE 2001 EVALUATION
- FUTURE ANALYSIS PLANNED FOR THIS SPRING WHEN REMAINING 2001 SUBMISSIONS ARRIVE
- Relationship between phonetic classification, pronunciation and language models
4Evaluation Material - 2000
- SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
- Switchboard contains informal telephone dialogues
- 54 minutes of material that was previously phonetically transcribed (by highly trained phonetics students from UC-Berkeley)
- All of this material was hand-segmented at either the phonetic-segment or syllabic level by the transcribers
- The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material. This automatic segmentation was manually verified.
- THE PHONETIC SYMBOL SET AND STP TRANSCRIPTIONS USED IN THE CURRENT PROJECT ARE AVAILABLE ON THE PHONEVAL WEB SITE
- http://www.icsi.berkeley.edu/real/phoneval
- THE ORIGINAL FOUR HOURS OF TRANSCRIPTION MATERIAL ARE AVAILABLE AT
- http://www.icsi.berkeley.edu/real/stp
5Evaluation Material Details - 2000
- 581 DIFFERENT SPEAKERS
- AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS
- BROAD DISTRIBUTION OF UTTERANCE DURATIONS
- 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10%
- COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN SWITCHBOARD
- A WIDE RANGE OF DISCUSSION TOPICS
- VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD)
[Figures: number of utterances by subjective difficulty and by dialect region]
6Evaluation Material - 2001
- SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
- Seventy-four minutes of material phonetically labeled by five highly trained phonetics students from UC-Berkeley plus S. Greenberg
- The material was hand-segmented at the syllabic level by the transcribers
- The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained originally on 72 minutes of hand-segmented Switchboard material (similar to the process performed the previous year)
- THE PHONETIC SYMBOL SET AND STP TRANSCRIPTIONS USED ARE AVAILABLE ON THE PHONEVAL WEB SITE
- http://www.icsi.berkeley.edu/real/phoneval
7Evaluation Material Details - 2001
- A SUBSET OF THE HUB-5 COMPETITIVE EVALUATION CORPUS
- A representative selection from the evaluation set, including an even distribution of data from the three main recording conditions (cellular and 2 land-line conditions)
- 21 SEPARATE CONVERSATIONS (2 speakers per conversation)
- 42 DIFFERENT SPEAKERS
- A TOTAL OF 74 MINUTES OF SPOKEN LANGUAGE MATERIAL
- (including FILLED PAUSES, JUNCTURES, etc.)
- AVERAGE LENGTH OF SPEECH PER SPEAKER: 106 seconds
- RANGE OF LENGTH PER SPEAKER: 48 s (least) to 226 s (most)
- STANDARD DEVIATION: 38 s
- APPROXIMATELY ONE-THIRD OF THE MATERIAL FROM CELL PHONES
8Evaluation Sites - 2000
- EIGHT SITES PARTICIPATED IN THE EVALUATION
- All eight provided material for the unconstrained-recognition phase
- Six sites also provided sufficient forced-alignment-recognition material (i.e., phone/word labels and segmentation given the word transcript for each utterance) for a detailed analysis
- ATT (forced-alignment recognition incomplete, not analyzed)
- Bolt, Beranek and Newman
- Cambridge University
- Dragon (forced-alignment recognition incomplete, not analyzed)
- Johns Hopkins University
- Mississippi State University
- SRI International
- University of Washington
9Evaluation Sites - 2001
- SEVEN SITES ARE PARTICIPATING IN THE EVALUATION
- Unconstrained-recognition phase - 6 Sites
- Forced-alignment - 7 Sites
- Phone classification confidence scores - 5 Sites
- Variable condition recognition - 2 Sites
- Phone strings to words - 1 Site
- ATT
- Bolt, Beranek and Newman
- IBM
- Johns Hopkins University
- Mississippi State University
- Philips
- SRI International
10Evaluation Data Status - 2001
- However, NOT ALL OF THE MATERIAL REQUIRED TO PERFORM THE ANALYSES HAS MATERIALIZED
- The tables below summarize the commitments and currently usable data (certain data arrived in not-quite-ready-for-prime-time form)
[Tables: Commitments; Current (usable data)]
11Initial Recognition File - Example
- Parameter Key
- START - Begin time (in seconds) of phone
- DUR - Duration (in sec) of phone
- PHN - Hypothesized phone ID
- WORD - Hypothesized Word ID
- Format is for all 674 files in the evaluation set
- (Example courtesy of MSU)
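The parameter key above can be read as a four-column record. The sketch below is a minimal illustration, assuming whitespace-separated columns in the order START, DUR, PHN, WORD; the actual column order and delimiter of the site files are not shown here, and the sample rows are hypothetical, not taken from any submission.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhoneRecord:
    start: float  # START: begin time of the phone, in seconds
    dur: float    # DUR: duration of the phone, in seconds
    phn: str      # PHN: hypothesized phone ID
    word: str     # WORD: hypothesized word ID


def parse_recognition_lines(lines: List[str]) -> List[PhoneRecord]:
    """Parse START DUR PHN WORD rows (assumed layout, whitespace-separated)."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines in this sketch
        try:
            start, dur = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # e.g. a header row such as "START DUR PHN WORD"
        records.append(PhoneRecord(start, dur, fields[2], fields[3]))
    return records


# Hypothetical sample rows for illustration only.
sample = [
    "START DUR PHN WORD",
    "0.00 0.08 dh the",
    "0.08 0.05 ax the",
    "0.13 0.11 k cat",
]
records = parse_recognition_lines(sample)
```

Keeping each phone as a typed record makes it straightforward to derive end times (start + dur) for time-mediated comparison against the reference transcription.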
12Phone Mapping Procedure
- EACH SUBMISSION SITE USED A (QUASI) CUSTOM PHONE SET
- Most of the phone sets are available on the PHONEVAL web site
- THE SITES' PHONE SETS WERE MAPPED TO A COMMON REFERENCE PHONE SET
- The reference phone set is based on the ICSI Switchboard transcription material (STP), but is adapted to match the less granular symbol sets used by the submission sites
- The set of mapping conventions relating to the STP (and reference) sets is also available on the PHONEVAL web site
- THE REFERENCE PHONE SET WAS ALSO MAPPED TO THE SUBMISSION SITE PHONE SETS
- This reverse mapping was done in order to ensure that variants of a phone were given due credit in the scoring procedure
- For example, em (syllabic nasal) is mapped to ix m; the vowel ix maps in certain instances to both ih and ax, depending on the specifics of the phone set
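The two mapping directions can be pictured with a small lookup sketch. Only the em and ix entries echo the examples in the text; the table names, the remaining entries, and both helper functions are hypothetical, and the real mapping files on the PHONEVAL site may be organized quite differently.

```python
from typing import Dict, List, Set

# Forward direction: site phone -> sequence of reference phones.
# Hypothetical fragment; only "em" follows the example in the text.
SITE_TO_REFERENCE: Dict[str, List[str]] = {
    "em": ["ix", "m"],  # syllabic nasal expands to vowel + nasal
    "dh": ["dh"],
    "ax": ["ax"],
}

# Reverse direction: reference phone -> site variants that earn credit.
REFERENCE_VARIANTS: Dict[str, Set[str]] = {
    "ix": {"ix", "ih", "ax"},
}


def map_to_reference(phones: List[str]) -> List[str]:
    """Rewrite a site phone string in the reference set (identity if unmapped)."""
    mapped: List[str] = []
    for phone in phones:
        mapped.extend(SITE_TO_REFERENCE.get(phone, [phone]))
    return mapped


def is_credited(ref: str, hyp: str) -> bool:
    """True if the hypothesized phone counts as a variant of the reference phone."""
    return hyp in REFERENCE_VARIANTS.get(ref, {ref})


mapped = map_to_reference(["dh", "em"])
```

Here `mapped` is the reference-set rendering of a site string, and `is_credited("ix", "ih")` shows how the reverse map gives a variant due credit.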
13Phone Scoring Procedures - 2001
- TWO METHODS WERE USED FOR THE 2001 EVALUATION
- The UNCOMPENSATED form is the same as last year's scoring method. Only common phone ambiguities (such as ix, ih, ah, ax, etc.) are allowed
- The TRANSCRIPTION-COMPENSATED form allows for certain phones commonly confused among human transcribers to be scored as correct, even though they would otherwise be scored as wrong
- The compensated form of transcription lowers the phone error by ca. 10-20%
- TIME-MEDIATED SCORING WAS OF TWO VARIETIES
- A STRICT form is identical to that used in last year's evaluation. There is a severe penalty for deviations from time boundaries for words and phones
- A LENIENT form allows for a much looser fit between time markers associated with words and phones. A weighting of 0.15 (relative to the STRICT form) was used (by modifying the penalty algorithm in SC-Lite). The 0.15 weight reduced the number of phone errors by ca. 20% without a significant decline in false-positive responses
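One way to picture the effect of the 0.15 weighting is a penalty that scales with the deviation between reference and hypothesized time boundaries. The function below is a hypothetical illustration only: it is not SC-Lite's penalty algorithm, and the absolute-difference form is an assumption chosen for simplicity. Only the 0.15 relative weight comes from the text.

```python
def boundary_penalty(ref_start: float, ref_end: float,
                     hyp_start: float, hyp_end: float,
                     weight: float = 1.0) -> float:
    """Weighted sum of start- and end-boundary deviations, in seconds.

    Illustrative form only; the real scoring tool may compute its
    time-mediated distance differently.
    """
    return weight * (abs(ref_start - hyp_start) + abs(ref_end - hyp_end))


STRICT_WEIGHT = 1.0    # assumed baseline weight
LENIENT_WEIGHT = 0.15  # relative weight quoted in the text

# Hypothetical segment: reference 0.10-0.20 s, hypothesis 0.14-0.26 s.
strict = boundary_penalty(0.10, 0.20, 0.14, 0.26, STRICT_WEIGHT)
lenient = boundary_penalty(0.10, 0.20, 0.14, 0.26, LENIENT_WEIGHT)
```

Under this sketch the same boundary mismatch costs far less in the lenient condition, so an aligner minimizing total cost is more willing to pair slightly misaligned segments.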
14Visualization of a 3-D Confusion Matrix
- When the matrix is sparsely coded, as below, it
is more efficient to view the pattern as if
squashed against a brick wall (see below)
The diagonal is plotted in a linear plane
15Interlabeler Agreement (74) - 3 Transcribers
- Highest for consonants (especially the stops)
- Lowest for vowels (particularly the lax monophthongs)
[Figure: proportion concordance by phonetic segment; panels: Vowels, Consonants]
Numbers refer to the concordance diagonal in the confusion matrices
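The concordance diagonal can be computed directly from pairs of labels assigned to the same segment. The sketch below assumes the pairs have already been aligned; the function names and the sample pairs are hypothetical and do not reproduce the actual agreement data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def confusion_matrix(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each reference label co-occurs with each compared label."""
    matrix: Dict[str, Counter] = defaultdict(Counter)
    for ref, other in pairs:
        matrix[ref][other] += 1
    return matrix


def concordance_diagonal(matrix: Dict[str, Counter]) -> Dict[str, float]:
    """Proportion of each row that falls on the diagonal (same label)."""
    return {ref: row[ref] / sum(row.values()) for ref, row in matrix.items()}


# Hypothetical aligned label pairs (first labeler, second labeler).
pairs = [
    ("t", "t"), ("t", "t"), ("t", "d"),
    ("ih", "ih"), ("ih", "ix"), ("ih", "ax"), ("ih", "ih"),
]
matrix = confusion_matrix(pairs)
diag = concordance_diagonal(matrix)
```

The off-diagonal counts in `matrix` are what reveal which segments are confused with which, while `diag` gives the single per-segment number plotted as proportion concordance.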
16Interlabeler Disagreement Patterns - 2001
- INTERLABELER DISAGREEMENT PATTERNS WERE DERIVED FROM THE 2000 EVALUATION MATERIAL
- Several minutes of material transcribed in common by 3 transcribers were analyzed (2 from 1996-1997 STP, 1 from 2001 STP)
- THE FOLLOWING PATTERNS WERE OBSERVED IN THE INTERLABELER DISAGREEMENT ANALYSIS
- Consonants
- Stop and nasal consonants exhibit a small amount of disagreement
- Fricatives exhibit slightly higher amounts of disagreement
- Liquids show a moderate amount of disagreement
- Vowels
- Lax monophthongs exhibit a high amount of disagreement
- Diphthongs show a relatively small amount of disagreement
- Tense, low monophthongs show relatively little disagreement (except for ao, probably a dialect issue)
- Overall Transcriber Agreement was 70%
17Interlabeler Disagreement Patterns - 2001
- FROM SUCH PATTERNS THE FOLLOWING FORMS OF TOLERANCES WERE ALLOWED IN TRANSCRIPTION-COMPENSATED SCORING
Segment    UNcompensated    Compensated
d          d                d dx
k          k                k
s          s                s z
n          n                n nx ng en
r          r                r axr er
iy         iy               iy ix ih
ao         ao               ao aa ow
ax         ax               ax ah aa ix
ix         ix ih ax         ix ih iy ax
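The tolerance listing above lends itself to a set-membership check. The sketch below encodes one reading of that listing (each group begins with the segment itself) and scores pre-aligned reference/hypothesis pairs. The grouping is an interpretation of the flattened slide, the function name and sample pairs are hypothetical, and insertions and deletions are ignored for brevity.

```python
from typing import Dict, List, Set, Tuple

# One reading of the slide's tolerance listing; treat as an assumption.
UNCOMPENSATED: Dict[str, Set[str]] = {
    "d": {"d"}, "k": {"k"}, "s": {"s"}, "n": {"n"}, "r": {"r"},
    "iy": {"iy"}, "ao": {"ao"}, "ax": {"ax"}, "ix": {"ix", "ih", "ax"},
}
COMPENSATED: Dict[str, Set[str]] = {
    "d": {"d", "dx"}, "k": {"k"}, "s": {"s", "z"},
    "n": {"n", "nx", "ng", "en"}, "r": {"r", "axr", "er"},
    "iy": {"iy", "ix", "ih"}, "ao": {"ao", "aa", "ow"},
    "ax": {"ax", "ah", "aa", "ix"}, "ix": {"ix", "ih", "iy", "ax"},
}


def substitution_error_rate(aligned: List[Tuple[str, str]],
                            tolerances: Dict[str, Set[str]]) -> float:
    """Fraction of aligned (reference, hypothesis) pairs scored as wrong."""
    wrong = sum(1 for ref, hyp in aligned
                if hyp not in tolerances.get(ref, {ref}))
    return wrong / len(aligned)


# Hypothetical aligned pairs for illustration.
aligned = [("d", "dx"), ("s", "z"), ("ix", "ih"), ("k", "g"), ("iy", "iy")]
uncompensated_rate = substitution_error_rate(aligned, UNCOMPENSATED)
compensated_rate = substitution_error_rate(aligned, COMPENSATED)
```

Swapping the tolerance table is the only difference between the two scoring conditions in this sketch, which mirrors how compensation lowers the phone error without changing the recognizer output.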
18Transcription Compensation Affects Phone Error
- COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS
LOWERS THE PHONE ERROR APPRECIABLY FOR MOST
SITES
[Figure: error rate, STRICT time mediation]
19Transcription Compensation Affects Phone Error
- COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS
LOWERS THE PHONE ERROR APPRECIABLY FOR MOST
SITES
[Figure: error rate, LENIENT time mediation]
20Generation of Evaluation Data - 1
21CTM File Format for Word Scoring
- EACH SITE'S MATERIAL WAS PROCESSED THROUGH SC-LITE TO OBTAIN A WORD-ERROR SCORE AND ERROR ANALYSIS (IN TERMS OF ERROR TYPE)
ERROR KEY: C = CORRECT, I = INSERTION, N = NULL ERROR, S = SUBSTITUTION
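Once each word carries an error-type code, a word-error score follows from simple counts. The sketch below uses the standard definition WER = (S + D + I) / (C + S + D). The code D for deletions is an assumption, since the key above does not list it, and the sample code sequence is hypothetical.

```python
from collections import Counter
from typing import Iterable


def word_error_rate(codes: Iterable[str]) -> float:
    """Compute WER from per-word error codes.

    C = correct, S = substitution, I = insertion, and (assumed) D = deletion.
    The denominator is the number of reference words: C + S + D.
    """
    counts = Counter(codes)
    ref_words = counts["C"] + counts["S"] + counts["D"]
    if ref_words == 0:
        raise ValueError("no reference words to score")
    return (counts["S"] + counts["D"] + counts["I"]) / ref_words


# Hypothetical code sequence: 7 correct, 1 substitution, 1 insertion, 1 deletion.
codes = list("CCCSCICDCC")
wer = word_error_rate(codes)
```

Insertions count as errors but not as reference words, which is why WER can in principle exceed 100% for very poor output.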
22Generation of Evaluation Data - 2
23Summary of Corpus Acoustic Properties
- LEXICAL PROPERTIES
- Lexical Identity
- Unigram Frequency
- Number of Syllables in Word
- Number of Phones in Word
- Word Duration
- Speaking Rate
- Prosodic Prominence
- Energy Level
- Lexical Compounds
- Non-Words
- Word Position in Utterance
- SYLLABLE PROPERTIES
- Syllable Structure
- Syllable Duration
- Syllable Energy
- Prosodic Prominence
- Prosodic Context
- PHONE PROPERTIES
- Phonetic Identity
- Phone Frequency
- Position within the Word
- Position within the Syllable
- Phone Duration
- Speaking Rate
- Phonetic Context
- Contiguous Phones Correct
- Contiguous Phones Wrong
- Phone Segmentation
- Articulatory Features
- Articulatory Feature Distance
- Phone Confusion Matrices
- OTHER PROPERTIES
- Speaker (Dialect, Gender)
- Utterance Difficulty
- Utterance Energy
- Utterance Duration
24Word- and Phone-Centric Big Lists
- THE BIG LISTS CONTAIN SUMMARY INFORMATION ON 55-65 SEPARATE PARAMETERS ASSOCIATED WITH PHONES, SYLLABLES, WORDS, UTTERANCES AND SPEAKERS, SYNCHRONIZED TO EITHER THE WORD (THIS SLIDE) OR THE PHONE
25Generation of Evaluation Data - 3
26Phoneval-2000 Web Site
- FORCED ALIGNMENT FILES
- Forced Alignment Files
- BBN, JHU, MSU, WASH
- Word-Level Alignment Errors
- BBN, CU, JHU, MSU, SRI, WASH
- Phone Error (Forced Alignment)
- CU, BBN, JHU, MSU, SRI, WASH
- Alignment Word-Phone Mapping
- BBN, JHU, MSU, WASH
- BIG LISTS
- Word-Centric
- BBN, CU, JHU, MSU, SRI, WASH
- Phone-Centric
- BBN, JHU, MSU, WASH
- Phonetic Confusion Matrices
- BBN, JHU, MSU, WASH
- RECOGNITION FILES
- Converted Submissions
- ATT, BBN, JHU, MSU, SRI, WASH
- Word Level Recognition Errors
- ATT, CU, BBN, JHU, MSU, SRI, WASH
- Phone Error (Free Recognition)
- ATT, BBN, JHU, MSU, WASH
- Word Recognition Phone Mapping
- ATT, BBN, JHU, MSU, WASH
- BIG LISTS
- Word-Centric
- ATT, CU, BBN, JHU, MSU, SRI, WASH
- Phone-Centric
- ATT, BBN, JHU, MSU, WASH
- Phonetic Confusion Matrices
- ATT, BBN, JHU, MSU, WASH
- Description of the STP Phone Set
- STP Transcription Material
- Phone-Word Reference
- Syllable-Word Reference
- Phone Mapping for Each Site
- ATT, BBN, JHU, MSU, WASH
- STP-to-Reference Map
- STP Phone-to-Articulatory-Feature Map
http://www.icsi.berkeley.edu/real/phoneval
27A Syllable-Centric Perspective
In this presentation we will drill down from
the lexical to the phonetic tiers by way of the
syllable, the phone and articulatory-acoustic
features
Words
Stress-accent
Phonetic segment
Articulatory-Acoustic Features
28Coarse Word and Phone Recognition
- THE FOLLOWING SLIDES PROVIDE DETAILS ABOUT THE COARSE WORD AND PHONE SCORES FOR THE 2000 AND 2001 EVALUATIONS
- ALTHOUGH THE WORD AND PHONE SCORES ARE ROUGHLY COMPARABLE ACROSS YEARS (FOR ANALOGOUS CONDITIONS), THE 2001 EVALUATION HAS FOUR TIMES THE NUMBER OF SCORING CONDITIONS (FOR PHONES), BASED ON THE LENIENT vs. STRICT TIME-MEDIATION AND THE COMPENSATED vs. UNCOMPENSATED TRANSCRIPTION SCORING
29Word Recognition Error (2000)
- WORD ERROR RATES VARY BETWEEN 27% AND 43%
- Substitutions are the major source of word errors
[Figure: word error rate by site and error type]
30Prosodic Stress Word Error Rate (2000)
- The effect of stress is most concentrated among
word-deletion errors
Data represent averages across all eight ASR
systems
31Syllable Structure Word Error Rate (2000)
- Vowel-initial forms show the greatest error
- Polysyllabic forms exhibit the lowest error
- Data are averaged across all eight sites
32Syllable Structure Word Error Rate (2000)
- VOWEL-INITIAL forms exhibit the HIGHEST error
- POLYSYLLABLES have the LOWEST error rate
33Word Recognition Error (2001)
- WORD ERROR RATES VARY BETWEEN 33% AND 49%
- Substitutions are the major source of word errors
[Figure: word error rate by site and error type; STRICT time mediation]
34Word Recognition Error (2001)
- WORD ERROR RATES VARY BETWEEN 31% AND 44%
- Substitutions are the major source of word errors
[Figure: word error rate by site and error type; LENIENT time mediation]
35Prosodic Stress Word Error Rate (2001)
- NOT YET
- PROSODIC LABELING OF THIS MATERIAL REQUIRED FIRST
- ANALYSIS SCHEDULED FOR JUNE, 2001
36Syllable Structure Word Error Rate (2001)
- Vowel-initial forms show the greatest error
- Polysyllabic forms exhibit the lowest error, except for CVCV forms (probably due to forms such as gonna, etc.)
- Data are averaged across all five sites
37Syllable Structure Word Error Rate (2001)
- VOWEL-INITIAL forms exhibit the HIGHEST error
- POLYSYLLABLES have the LOWEST error rate
38Are Word and Phone Errors Related? (2000)
- COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
- The correlation between the two parameters is 0.78
Pronunciation Models? The differential error rate is probably related to the use of either pronunciation or language models (or both)
[Figure: error rate by submission site]
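A correlation between per-site word and phone error rates can be computed with the Pearson coefficient. The sketch below assumes Pearson is the intended statistic (the slide does not say which one was used), and the per-site rates are hypothetical placeholders, not the evaluation results.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-site error rates in percent (illustration only).
phone_error = [39.0, 42.0, 45.0, 50.0, 55.0]
word_error = [27.0, 33.0, 31.0, 40.0, 43.0]
r = pearson(phone_error, word_error)
```

With only a handful of sites the coefficient is sensitive to any single system, so a value such as the reported 0.78 is best read as suggestive rather than conclusive.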
39Are Word and Phone Errors Related? (2001)
- COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
Pronunciation Model?
[Figure: error rate; STRICT time mediation, transcription UNcompensated]
40Are Word and Phone Errors Related? (2001)
- COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
Pronunciation Model?
[Figure: error rate; LENIENT time mediation, transcription UNcompensated]
41Phonetic - Pronunciation Mismatch
- THERE ARE A FAR GREATER NUMBER OF PRONUNCIATIONS IN THE TRANSCRIPTION MATERIALS THAN IN THE ASR LEXICONS
- GIVEN THAT MOST WORDS ARE CORRECTLY RECOGNIZED, THIS RESULT IMPLIES THAT PHONETIC CLASSIFICATION IN ASR SYSTEMS IS, BY NECESSITY, HIGHLY AGRANULAR
- THUS, UNUSUAL PRONUNCIATIONS ARE UNLIKELY TO BE DECODED CORRECTLY
- THE COARSE NATURE OF THE PRONUNCIATION MODELS ALSO MAKES IT DIFFICULT TO FINE-TUNE THE RELATION BETWEEN THE PHONETIC CLASSIFIER AND PRONUNCIATION MODEL COMPONENTS
42Pronunciation Variation in ASR Lexicons
- MOST WORDS IN THE ASR LEXICONS HAVE A SINGLE PRONUNCIATION
- EXCEPTIONS ARE HIGHLY FREQUENT WORDS (SUCH AS "THE" AND "AND"), WHICH HAVE 2 OR 3 PRONUNCIATION VARIANTS. NO WORD HAS MORE THAN 5 PRONUNCIATION VARIANTS (AT LEAST NOT IN THE PHONETIC OUTPUT PROVIDED TO ICSI FOR THE EVALUATION)
43Pronunciation Variation in Switchboard (2001)
- THERE ARE DOZENS OF DIFFERENT PRONUNCIATIONS FOR
THE 100 MOST FREQUENT WORDS IN THE PHONETIC
EVALUATION MATERIAL
[Table columns: WORD, INSTANCES, PRON]
44Pronunciation Variation in Switchboard (2001)
- THERE ARE DOZENS OF DIFFERENT PRONUNCIATIONS FOR
THE 100 MOST FREQUENT WORDS IN THE PHONETIC
EVALUATION MATERIAL
[Table columns: WORD, INSTANCES, PRON]
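Counts of word instances and distinct pronunciations can be tallied from a phonetically transcribed word list. The sketch below is a minimal illustration; the function name and the sample tokens are hypothetical and do not reproduce the Switchboard counts.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def pronunciation_variants(
    tokens: Iterable[Tuple[str, List[str]]]
) -> Dict[str, Tuple[int, int]]:
    """Map each word to (number of instances, number of distinct pronunciations)."""
    instances: Dict[str, int] = defaultdict(int)
    variants: Dict[str, Set[Tuple[str, ...]]] = defaultdict(set)
    for word, phones in tokens:
        instances[word] += 1
        variants[word].add(tuple(phones))  # tuples are hashable, lists are not
    return {word: (instances[word], len(variants[word])) for word in instances}


# Hypothetical transcribed tokens: (word, phone sequence).
tokens = [
    ("the", ["dh", "ax"]),
    ("the", ["dh", "iy"]),
    ("the", ["dh", "ax"]),
    ("and", ["ae", "n"]),
    ("and", ["ix", "n"]),
    ("and", ["ae", "n", "d"]),
]
table = pronunciation_variants(tokens)
```

Comparing the second number in each pair against the number of entries a recognizer's lexicon holds for the same word makes the transcription/lexicon mismatch concrete.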
45Phone Error and Word Length (2000)
- For CORRECT words, only one phone (on average) is misclassified
- Implication: short words are highly tolerant of phone errors
- For INCORRECT words, phone errors increase linearly with word length
- Data are averaged across all eight sites
46Phone Error and Word Length (2001)
- For CORRECT words, only one phone (on average) is misclassified
- Implication: short words are highly tolerant of phone errors
- For INCORRECT words, phone errors increase linearly with word length
- Data are averaged across all five sites
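The relation between phone errors and word length can be examined by grouping words by length and recognition outcome. The sketch below assumes each word is summarized as (number of phones, correctly recognized?, number of misclassified phones); that record layout and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mean_phone_errors(
    words: Iterable[Tuple[int, bool, int]]
) -> Dict[Tuple[int, bool], float]:
    """Average phone errors per word, keyed by (word length in phones, word correct?)."""
    totals: Dict[Tuple[int, bool], List[int]] = defaultdict(lambda: [0, 0])
    for n_phones, is_correct, n_errors in words:
        bucket = totals[(n_phones, is_correct)]
        bucket[0] += n_errors  # running sum of phone errors
        bucket[1] += 1         # number of words in this bucket
    return {key: err_sum / count for key, (err_sum, count) in totals.items()}


# Hypothetical per-word summaries for illustration.
words = [
    (2, True, 1), (2, True, 0), (2, False, 2),
    (4, True, 1), (4, False, 3), (4, False, 4),
]
means = mean_phone_errors(words)
```

Plotting the means for correct and incorrect words separately against length is what exposes the flat versus linearly rising pattern described above.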
47Phone Error - Forced Alignment (2000)
- PHONE ERROR RATES VARY BETWEEN 35% AND 49%
- This, despite having the word transcript!!!
ATT and Dragon did not provide a complete set of forced alignments
[Figure: phone error rate by site and error type]
48Phone Error - Forced Alignment (2001)
- PHONE ERROR RATES VARY BETWEEN 40% AND 50%
- Same picture for 2001. Suggests a potential mismatch between lexical and phonetic representations
[Figure: phone error rate by site and error type; STRICT time mediation, transcription UNcompensated]
49Phone Error - Forced Alignment (2001)
- PHONE ERROR RATES VARY BETWEEN 30% AND 44%
- Still a poor match between phonetic transcripts and lexical representations
[Figure: phone error rate by site and error type; LENIENT time mediation, transcription UNcompensated]
50Phone Error - Forced Alignment (2001)
- PHONE ERROR RATES VARY BETWEEN 32% AND 38%
- Still a lack of concordance with a tolerant scoring method
[Figure: phone error rate by site and error type; STRICT time mediation, transcription compensated]
51Phone Error - Forced Alignment (2001)
- PHONE ERROR RATES VARY BETWEEN 23% AND 29%
- With the most tolerant scoring there is still some lack of concordance
[Figure: phone error rate by site and error type; LENIENT time mediation, transcription compensated]
52Visualization of a 3-D Confusion Matrix
- When the matrix is sparsely coded, as below, it
is more efficient to view the pattern as if
squashed against a brick wall (see below)
The diagonal is plotted in a linear plane
53Phonetic Confusion Matrix - CVC Syllables
- Onset consonants tend to be highly concordant with transcription
- Coda consonants are slightly less concordant, particularly some fricatives
[Figure: proportion concordance by phonetic segment, forced alignment; two CVC panels]
Numbers refer to the concordance diagonal in the confusion matrices
54Phonetic Confusions - CCVC, CVCC Syllables
- Certain fricatives are problematic in CVCC coda position
- Redo this figure and others - no wrong words, compare CVC, CVC etc.
[Figure: proportion concordance by phonetic segment, forced alignment; panels: CCVC, CVCC]
Numbers refer to the concordance diagonal in the confusion matrices
55Phonetic Confusions - CV and CVC Nuclei
- Diphthongs and tense, low monophthongs tend to be concordant
- Lax monophthongs tend to be less concordant (cf. Stress-accent paper)
[Figure: proportion concordance by phonetic segment, forced alignment; panels: CVC, CV]
Numbers refer to the concordance diagonal in the confusion matrices
56Phone Error - Unconstrained Recognition (2000)
- PHONE ERROR RATES VARY BETWEEN 39% AND 55%
- Phone error is only slightly greater than for forced alignments
57Phone Error - Unconstrained Recognition(2001)
- PHONE ERROR RATES VARY BETWEEN 44% AND 55%
- Results similar to 2000 evaluation
Condition most analogous to 2000 evaluation
[Figure: phone error rate by site and error type; STRICT time mediation, transcription uncompensated]
58Phone Error - Unconstrained Recognition (2001)
- PHONE ERROR RATES VARY BETWEEN 38% AND 48%
- Relaxing time-mediation brings down the error slightly
[Figure: phone error rate by site and error type; LENIENT time mediation, transcription uncompensated]
59Phone Error - Unconstrained Recognition(2001)
- PHONE ERROR RATES VARY BETWEEN 25% AND 39%
- Transcription compensation also brings down the error
[Figure: phone error rate by site and error type; STRICT time mediation, transcription compensated]
60Phone Error - Unconstrained Recognition(2001)
- PHONE ERROR RATES VARY BETWEEN 27% AND 38%
- Phone errors decline somewhat more with lax scoring
[Figure: phone error rate by site and error type; LENIENT time mediation, transcription compensated]
61Phonetic Confusion Matrix - CV Onsets
- ARROWS pinpoint problem segments
- AFFRICATES and FRICATIVES are problematic in CV onset position
- d is also problematic
[Figure: concordance by phonetic segment, unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
62Phonetic Confusion Matrix - CVC Onsets
- Fricatives and affricates are problematic in CVC onset position
[Figure: proportion concordance by phonetic segment, unconstrained recognition; panels: Correct Words, Wrong Words]
Numbers refer to the concordance diagonal in the confusion matrices
63Phonetic Confusion Matrix - CCVC Onsets
- Certain fricatives are particularly problematic in CCVC onset position
[Figure: proportion concordance by phonetic segment, unconstrained recognition; panels: Correct Words, Wrong Words]
Numbers refer to the concordance diagonal in the confusion matrices
64Phonetic Confusion Matrix - CVC Codas
- Fricatives are particularly problematic in CVC coda position
- Certain stops are also problematic in CVC coda position
[Figure: proportion concordance by phonetic segment, unconstrained recognition; panels: Correct Words, Wrong Words]
Numbers refer to the concordance diagonal in the confusion matrices
65Phonetic Confusion Matrix - CVCC Codas
- Certain fricatives are problematic in CVCC coda position
- d is also problematic in CVCC coda position
[Figure: proportion concordance by phonetic segment, unconstrained recognition; panels: Correct Words, Wrong Words]
Numbers refer to the concordance diagonal in the confusion matrices
66Phonetic Confusion Matrix - CVC Nuclei
- Certain vowels are a problem in CVC nucleus position
- Note that the level of concordance is much lower for vowels than for consonants (in onset or coda position), even for correct words
[Figure: proportion concordance by phonetic segment, unconstrained recognition; panels: Correct Words, Wrong Words]
Numbers refer to the concordance diagonal in the confusion matrices
67Phonetic Confusion Matrix - CV Nuclei
- Diphthongs and low, tense vowels are more concordant with the transcription than the lax monophthongs (cf. Stress-accent paper)
[Figure: proportion concordance by phonetic segment, unconstrained recognition; panels: Correct Words, Wrong Words]
Numbers refer to the concordance diagonal in the confusion matrices
68Consonantal Onsets and AF Errors (2000)
- Syllable onsets are intolerant of AF errors in CORRECT words
- Place and manner AF errors are particularly high in INCORRECT onsets
- Data are averaged across all eight sites
69Consonantal Onsets and AF Errors (2001)
- Syllable onsets are intolerant of AF errors, particularly place, in CORRECT words
- Place and manner AF errors are particularly high in INCORRECT onsets
- Syllable structure does not have the same effect as in the 2000 analysis
- Data are averaged across all five sites
70Consonantal Codas and AF Errors (2000)
- Syllable codas exhibit a slightly higher tolerance for error than onsets
- There is a high degree of AF error for wrong words
- Data are averaged across all eight sites
71Consonantal Codas and AF Errors (2001)
- Syllable codas exhibit a slightly higher tolerance for error than onsets
- There is a high degree of AF error for wrong words
- Data are averaged across all five sites
72Vocalic Nuclei and AF Errors (2000)
- Nuclei exhibit a much higher tolerance for error than onsets and codas
- There are many more errors than among syllabic onsets and codas
- Data are averaged across all eight sites
73Vocalic Nuclei and AF Errors (2001)
- Nuclei exhibit a much higher tolerance for error than onsets and codas, particularly for height and front/back
- There are many more errors than among syllabic onsets and codas
- Data are averaged across all five sites
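An articulatory-feature (AF) error can be counted by comparing the feature values of the reference and hypothesized phones along each dimension. The sketch below uses a small, hypothetical consonant table with standard place, manner and voicing values; the actual STP phone-to-articulatory-feature map used in the evaluation may define its features differently.

```python
from typing import Dict, List

# Hypothetical fragment of a phone-to-feature table (standard phonetic values).
AF: Dict[str, Dict[str, str]] = {
    "p": {"place": "bilabial", "manner": "stop", "voicing": "voiceless"},
    "b": {"place": "bilabial", "manner": "stop", "voicing": "voiced"},
    "t": {"place": "alveolar", "manner": "stop", "voicing": "voiceless"},
    "d": {"place": "alveolar", "manner": "stop", "voicing": "voiced"},
    "s": {"place": "alveolar", "manner": "fricative", "voicing": "voiceless"},
    "z": {"place": "alveolar", "manner": "fricative", "voicing": "voiced"},
    "m": {"place": "bilabial", "manner": "nasal", "voicing": "voiced"},
    "n": {"place": "alveolar", "manner": "nasal", "voicing": "voiced"},
}

DIMENSIONS = ("place", "manner", "voicing")


def af_errors(ref: str, hyp: str) -> List[str]:
    """List the feature dimensions on which the hypothesis differs from the reference."""
    return [dim for dim in DIMENSIONS if AF[ref][dim] != AF[hyp][dim]]


voicing_only = af_errors("t", "d")
place_and_manner = af_errors("p", "s")
```

The length of the returned list gives a simple AF distance, so a t/d confusion (one feature) can be distinguished from a p/s confusion (two features) when tallying errors by syllable position.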
74Into the (Near) Future
- WITH THE ARRIVAL OF THE REMAINING FORCED-ALIGNMENT AND UNCONSTRAINED RECOGNITION DATA
- It will be possible to investigate the relative contribution of the phonetic classification, pronunciation and language models to recognition performance
- In order to do this, it is necessary to obtain unconstrained recognition, forced alignment and phone-confidence material from each site (to the extent possible); the phone-confidence metric is problematic
- CUSTOMIZED ANALYSES FOR INDIVIDUAL SITES
- SRI has different versions of their system (with and w/o adaptation, etc.)
- ATT will use phone strings from ICSI transcription material
- Individual diagnostics for each site (are there significant differences for specific parameters?)
- MOST OF THE DATA FOR THE 2001 EVALUATION WILL BE POSTED ON THE PHONEVAL WEB SITE SHORTLY
- WEB-BASED ORACLE DATABASE APPLICATION IS NEAR COMPLETION
- Will enable searches over the web of the Phoneval corpus and be able to graph the results (this is the tricky part, given the ugly nature of Oracle Web DB)
- A PAPER DESCRIBING THE FULL SET OF ANALYSES WILL BE AVAILABLE AT THE END OF JUNE (2001)
75Summary and Conclusions
- PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
- Many different analyses (to follow) support this conclusion
- Consonants appear to be more important than vowels
- SYLLABLE STRUCTURE IS ALSO AN IMPORTANT FACTOR FOR ACCURATE RECOGNITION
- The pattern of errors differs across the syllable (onset, nucleus, coda) and exhibits consistent patterns that are difficult to discern with other units of analysis
- STRESS-ACCENT MAY PLAY AN IMPORTANT ROLE, PARTICULARLY FOR UNDERSTANDING THE NATURE OF WORD-DELETION ERRORS
- Relation among stress-accent, syllable structure, vocalic identity and length
- THE NATURE OF PRONUNCIATION MODELS AND THEIR RELATION TO LEXICAL REPRESENTATIONS IS A POTENTIALLY KEY FACTOR
- The unit of lexical representation (phones, articulatory features, etc.) is probably of the utmost importance for optimizing ASR performance
- FUTURE PROGRESS IN ASR SYSTEM DEVELOPMENT IS LIKELY TO DEPEND ON DEEP INSIGHT INTO THE NATURE OF SPOKEN LANGUAGE