Title: Goals and Objectives
1From Here to Utility: Melding Phonetic Insight
With Speech Technology
Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/steveng
steveng_at_icsi.berkeley.edu
2Acknowledgements and Thanks
Automatic Feature Classification and Analysis
Joy Hollenback, Shawn Chang, Leah Hitchcock
Research Funding
U.S. National Science Foundation
U.S. Department of Defense
8Road Map of the Presentation
- What is Truth?
- The story of Rashomon, a film by Akira Kurosawa
- Its application to spoken language
- The Varieties of Scientific Experience
- The Fundamental Duality
- The Eternal Pentangle
- The Inner Triangle
- The Importance of Being Phonetically Annotated
- A Corpus-Centric Perspective on Spoken Language
- Phonetic Annotation of Spontaneous American
English Discourse
- Phonetic Dissection of Automatic Speech
Recognition Systems
- Stress Accent and Word Error Rate
- Syllable Structure and Word Error Rate
- The Relation Between Stress Accent and Vocalic
Identity
- The Relation Between Segmental Duration and Vowel
Height
- Durational Differences Between Stressed and
Unstressed Vowels
- The Relation Between Vowel Height and Stress
Accent
- Spoken Language What is Truth?
- Fundamental Questions Remain Unanswered
9Part One WHAT IS TRUTH?
- The Story of Rashomon
- Its Moral for the Study of Spoken Language
10Rashomon What is Truth?
It is twelfth-century Japan, and a nobleman has
died.
11Rashomon What is Truth?
This we learn from a conversation between a
woodcutter, a priest and a peasant under a gate
in the ancient city of Kyoto.
13Rashomon What is Truth?
The woodcutter and the priest have just come from
a judicial inquest into the death, and are
telling the peasant what they have heard.
The woodcutter testified at the inquest, having
witnessed the sequence of events resulting in the
nobleman's death.
14Rashomon What is Truth?
The story begins with the capture of the
notorious bandit, Tajomaru, who is the accused in
the nobleman's death.
15Rashomon What is Truth?
The nobleman and his wife had been traveling
through the forest.
17Rashomon What is Truth?
When, all of a sudden, they are confronted by
Tajomaru, who halts their progress.
18Rashomon What is Truth?
The nobleman and bandit go off alone into a
thicket, where the former winds up being subdued
by the latter
19Rashomon What is Truth?
The nobleman is tied to a tree and forced to
watch as his wife is violated by the bandit
20Rashomon What is Truth?
The wife, at first, resists.
21Rashomon What is Truth?
But eventually drops the dagger and submits
22Rashomon What is Truth?
So far, all parties concerned agree (roughly) as
to the course of events, but from this point on
the picture becomes murky, with each participant
telling a somewhat different version of the
story
23Rashomon What is Truth?
In two versions (Tajomaru's and the woodcutter's)
the wife insists that her husband and the bandit
fight for her honor. The nobleman's death results
from losing the duel.
24Rashomon What is Truth?
In the wife's version, the bandit departs, with
the husband still tied to the tree. The husband
proceeds to taunt his wife, telling her how
ashamed he is of her!
25Rashomon What is Truth?
She cuts the rope binding her husband to the tree
and asks to be killed! The wife promptly faints
and when she awakens, finds the dagger in the
chest of her (now very dead) husband
26Rashomon What is Truth?
In yet another version (the husband's, through a
spirit medium) his wife betrays him and tries to
convince the bandit to kill the husband
29Rashomon What is Truth?
However, the bandit is repulsed by this
suggestion and quickly departs.
The nobleman, still tied to the tree, picks up
the dagger and plunges it into his chest, thus
taking his own life.
Some time later the (now very dead) nobleman
is aware of someone (it is not clear who)
removing the dagger from his chest.
32Rashomon What is Truth?
The film ends as the priest, woodcutter and
peasant mull over the significance of the
disparate accounts of the nobleman's death,
seeking some kernel of truth in the morass of
ambiguity and uncertainty.
It is unclear whether ANY witness has been
entirely truthful (probably not).
37Rashomon What is Truth?
The story of Rashomon is cited often in
philosophical discussions of truth.
As nothing is known (or knowable) with absolute
certainty, all knowledge is relative (and hence
ephemeral).
The concept of truth is a chimera and therefore
unworthy of pursuit.
40Rashomon What is Truth?
Yet, there is an alternative interpretation, one
that questions not the concept of truth itself,
but rather the capacity of its assimilation
through a single vantage point.
Perhaps the true message of Rashomon is that deep
and everlasting knowledge can only be gained
through exposure to a variety of perspectives,
no single source providing sufficient depth and
detail to comprehend a situation as complex (and
as tragic) as the murder of a man.
45Spoken Language What is Truth?
Can an intellectual domain as complex as spoken
language be fully understood through the
testimony of a single perspective?
Or must orthogonal varieties of evidence be sought
with which to reconstruct the truth?
How does true insight proceed from objective study
of spoken language?
Is it possible to fully comprehend the multivocal
nature of a scientific domain from the sole
vantage point of a laboratory?
Or does the spirit of Rashomon compel us to seek
testimony from other sources in the pursuit of
objective knowledge?
46Part Two THE VARIETIES OF SCIENTIFIC
EXPERIENCE
- The Fundamental Duality
- The Eternal Pentangle
- The Inner Triangle
51The Fundamental Duality
- Technology and science appear to oppose each
other in perspective
- Technology is concerned with what works (and can
sell)
- Science is concerned with what is (and can be
published)
The Art of the Sellable
The Art of the Workable
The Art of the Soluble
The Art of the Publishable
55The Fundamental Duality
- There is an essential tension between Science
and Technology
- Science is often deemed pure
- Technology is usually perceived as applied (and
therefore not quite as pure)
The Art of the Sellable
The Art of the Workable
The Art of the Soluble
The Art of the Publishable
61The Eternal Pentangle
- Speech Research Provides an Excellent Example of
the Tension between Science and Technology
- Phonetic insight is on the side of the angels
(a.k.a. science)
- While speech technology is on the side of the
apes (a.k.a. the real world)
The Real World
Phonetic Insight
67The Inner Triangle
- The Inner Triangle of the Eternal Pentangle Can
Potentially Shed Light on this Philosophical (and
Methodological) Conundrum
- Manual annotation provides the empirical
foundation with which to train machine
algorithms
- Statistical characterization of the annotated
material provides the basis for structuring the
machine learning regime
- Machine learning provides a method for evaluating
phonetic knowledge
- Phonetic knowledge can be used to efficiently
train machine algorithms
- Statistical characterization can serve as a
reality check on phonetic knowledge
74The Inner Triangle
- Thus, the three apices of the Inner Triangle feed
into each other and provide insight and
perspective difficult to achieve from a single
vantage point
- In a manner analogous to Rashomon, insight may be
gained from this multi-dimensional perspective
that deepens our knowledge of spoken language
- And thus enables the development of superior
technology that truly works in the real world
- The development of sterling technology provides
(in principle) a means to fund further basic
technology-driven research
- And that, in turn, results in further
technological advances
- And so on (forever after)
75Part Three THE IMPORTANCE OF BEING PHONETICALLY
ANNOTATED
- A Corpus-Centric Perspective on Spoken Language
- Phonetic Annotation of Spontaneous American
English Discourse
89Phonetic Annotation is Useful, Because
- Many Properties of Spontaneous Spoken Language
Differ from Those of Laboratory and Citation
Speech
- There are systematic patterns in real speech
that potentially reveal underlying principles
of linguistic organization
- Such Corpora Provide Empirical Material for the
Study of Spoken Language
- Such data provide an important basis for
scientific insight and understanding
- And facilitate development of new models of
spoken language
- They Also Provide Training Material for
Technology Applications in
- Automatic speech recognition, particularly
pronunciation models
- Speech synthesis, in pronunciation models as well
as in
- Cross-linguistic transfer of technology
algorithms, etc.
- They Promote Development of NOVEL Algorithms for
Speech Technology
- Including pronunciation models and lexical
representations for
- automatic speech recognition and speech
synthesis, as well as
- Multi-tier representations of spoken language
- All of Which Can be Used for Gaining Further
Insight into Spoken Language
94Corpus-Centric View of Spoken Language
Each Tier of Linguistic Organization Provides a
Unique Perspective
However, integrating the annotated material across
levels is tricky. And a lot of work!
Let's Focus on a Specific Aspect of Linguistic
Organization in Order to Exemplify the Concepts
Involved
In order to do so, we first consider the nature of
the transcription material used
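The point about integrating annotated material across tiers can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `Interval` record, the `group_by_parent` helper, the midpoint rule, and the timings and labels for the word "seven" are all hypothetical and are not taken from the annotated corpus or its tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One labeled, time-aligned unit on an annotation tier (hypothetical format)."""
    start: float  # seconds from the start of the recording
    end: float
    label: str


def group_by_parent(parents, children):
    """Attach each child interval to the parent whose span contains its midpoint."""
    grouped = {parent: [] for parent in parents}
    for child in children:
        midpoint = (child.start + child.end) / 2
        for parent in parents:
            if parent.start <= midpoint < parent.end:
                grouped[parent].append(child)
                break
    return grouped


# Invented example data for the word "seven"; not drawn from any corpus.
syllables = [Interval(0.00, 0.20, "syl1"), Interval(0.20, 0.45, "syl2")]
phones = [
    Interval(0.00, 0.08, "s"),
    Interval(0.08, 0.20, "eh"),
    Interval(0.20, 0.30, "v"),
    Interval(0.30, 0.40, "ax"),
    Interval(0.40, 0.45, "n"),
]

grouped = group_by_parent(syllables, phones)
for syllable, members in grouped.items():
    print(syllable.label, "->", [phone.label for phone in members])
```

The midpoint rule is just one possible choice. It tolerates boundaries that do not coincide exactly across tiers, which is presumably one reason cross-tier integration takes care.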
102Phonetic Transcription of Spontaneous English
Telephone Dialogues of 5-10 minutes duration,
from the SWITCHBOARD CORPUS, have been
phonetically annotated (labeled and
segmented) Most of this Material has been
Manually Annotated    4 hours labeled
at the phone level and segmented at the syllabic
level 1 hour labeled and segmented at the
phonetic-segment level The remaining material has
been segmented at the phonetic-segment level
using automatic methods 45 minutes of
stress-accent-labeled material An additional four
hours of material automatically labeled with
respect to accent (this latter material not used
in the current analysis, but will be available
soon) Â There is a Lot of Diversity in the
Material Transcribed
103Phonetic Transcription of Spontaneous English
Telephone Dialogues of 5-10 minutes duration,
from the SWITCHBOARD CORPUS, have been
phonetically annotated (labeled and
segmented) Most of this Material has been
Manually Annotated    4 hours labeled
at the phone level and segmented at the syllabic
level 1 hour labeled and segmented at the
phonetic-segment level The remaining material has
been segmented at the phonetic-segment level
using automatic methods 45 minutes of
stress-accent-labeled material An additional four
hours of material automatically labeled with
respect to accent (this latter material not used
in the current analysis, but will be available
soon) Â There is a Lot of Diversity in the
Material Transcribed Spans speech of both genders
(ca. 50/50), reflecting a wide range of American
dialectal variation, speaking rate and voice
quality
104Phonetic Transcription of Spontaneous English
Telephone Dialogues of 5-10 minutes duration, from the SWITCHBOARD CORPUS, have been phonetically annotated (labeled and segmented)
Most of this Material has been Manually Annotated
- 4 hours labeled at the phone level and segmented at the syllabic level
- 1 hour labeled and segmented at the phonetic-segment level
- The remaining material has been segmented at the phonetic-segment level using automatic methods
- 45 minutes of stress-accent-labeled material
- An additional four hours of material automatically labeled with respect to accent (this latter material was not used in the current analysis, but will be available soon)
There is a Lot of Diversity in the Material Transcribed
- Spans speech of both genders (ca. 50/50), reflecting a wide range of American dialectal variation, speaking rate and voice quality
Transcription System
- A variant of Arpabet, with phonetic diacritics such as _gl, _cr, _fr, _n, _vl, _vd
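As a rough illustration of how labels written in such a system might be handled in software, the sketch below splits a label into a base symbol and its diacritics. It assumes that diacritics are appended as underscore-prefixed suffixes, as in the examples listed on the slide. The function name and the sample labels are hypothetical, and the actual transcription files may be laid out differently.

```python
def split_label(label):
    """Split a label such as 't_gl' into its base symbol and a list of diacritics.

    Assumption: diacritics follow the base symbol, each introduced by '_'.
    """
    base, *diacritics = label.split("_")
    return base, diacritics


# Hypothetical sample labels, for illustration only.
for label in ["t_gl", "ih_n", "aa", "k_vd_fr"]:
    base, diacritics = split_label(label)
    print(f"{label}: base={base}, diacritics={diacritics}")
```

Keeping the base symbol separate from its diacritics makes it possible to compare labels either strictly (symbol plus diacritics) or loosely (symbol only).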
113Phonetic Transcription of Spontaneous English
The Data are Available at http://www.icsi.berkeley.edu/real/stp
This Means there is Phonetically Validated Material at the Level of the
- WORD
- SYLLABLE
- PHONETIC SEGMENT
- ARTICULATORY-ACOUSTIC FEATURE
- and STRESS ACCENT
(as well as at the utterance level)
114The Eternal Pentangle (Redux)
Let's re-examine the eternal triangle from the perspective of manual annotation for three linguistic tiers.
123Phonetic Transcription
How was the Labeling and Segmentation Performed?
VERY carefully ... by UC-Berkeley linguistics students
Using a display of the
- signal waveform,
- spectrogram,
- word transcription
- and forced alignments (automatic estimates of phones and boundaries)
- audio (listening at multiple time scales - phone, word, utterance)
on Sun workstations
Additionally, automatic segmentation and labeling of articulatory manner was used as a guide for phonetic labeling and segmentation in the current year
125Phonetic Transcription
In addition to phonetic labels and syllabic segmentation, 45 minutes of this material was labeled with respect to stress accent for each syllable
Three levels of stress were marked - FULLY Stressed, Unstressed and Intermediate Stress
129Phonetic Transcription
Such material can be used to perform statistical characterization of spontaneous speech as well as to train machine algorithms to label and segment additional material
In addition, the transcription material can be used to evaluate the performance of automatic speech recognition systems
Let's first consider how this transcription can be used for ASR evaluation
We'll focus on stress-accent, but then relate this to syllable structure
130Part Four
PHONETIC DISSECTION OF AUTOMATIC SPEECH RECOGNITION SYSTEMS
A Case Study
- Stress Accent and Word Error Rate
- Syllable Structure and Word Error Rate
In Collaboration with Shawn Chang
131The Eternal Pentangle (Redux)
Let's re-examine the eternal triangle from the perspective of automatic speech recognition ...
132Generation of Evaluation Data - 1
A complex sequence of data formatting was
required to place the speech recognition data
of 8 separate sites into register with the
transcription material (and vice versa)
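The formatting steps themselves are not described on the slide. As one illustration of what placing recognizer output "into register" with a time-aligned reference transcription might involve, the sketch below pairs each reference word with the hypothesis word that overlaps it most in time, then marks the reference word as correct, substituted or deleted. The tuple layout, function names, overlap threshold and toy timings are all assumptions made for illustration. This is not the procedure used in the evaluation.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Return the duration (in seconds) shared by two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def score_reference_words(reference, hypothesis, min_fraction=0.5):
    """Label each reference word as 'correct', 'substitution' or 'deletion'.

    reference, hypothesis: lists of (word, start, end) tuples (assumed layout).
    min_fraction: assumed threshold; a hypothesis word must cover at least
    this fraction of the reference word to count as its counterpart.
    """
    results = []
    for word, start, end in reference:
        duration = end - start
        # Hypothesis word with the greatest temporal overlap, if any exist.
        best = max(hypothesis,
                   key=lambda h: overlap(start, end, h[1], h[2]),
                   default=None)
        shared = overlap(start, end, best[1], best[2]) if best else 0.0
        if duration <= 0 or shared / duration < min_fraction:
            results.append((word, "deletion"))
        elif best[0] == word:
            results.append((word, "correct"))
        else:
            results.append((word, "substitution"))
    return results


# Toy timings, invented for illustration only.
reference = [("i", 0.00, 0.10), ("think", 0.10, 0.40), ("so", 0.40, 0.70)]
hypothesis = [("think", 0.08, 0.41), ("sew", 0.42, 0.70)]
print(score_reference_words(reference, hypothesis))
```

Scoring from the reference side only captures deletions and substitutions. Insertions would require a second pass from the hypothesis side.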
134Generation of Evaluation Data - 2
Let's not sweat the details during this presentation
Interested parties may consult the relevant papers (Greenberg, Hollenback and Chang, 2000; Greenberg and Chang, 2000) at www.icsi.berkeley.edu/steveng
135Generation of Evaluation Data - 3
Recognition performance was analyzed with
reference to ca. 50 separate acoustic,
linguistic and structural parameters
136Summary of Corpus Acoustic Properties
- LEXICAL PROPERTIES
  - Lexical Identity
  - Unigram Frequency
  - Number of Syllables in Word
  - Number of Phones in Word
  - Word Duration
  - Speaking Rate
  - Prosodic Prominence
  - Energy Level
  - Lexical Compounds
  - Non-Words
  - Word Position in Utterance
- SYLLABLE PROPERTIES
  - Syllable Structure
  - Syllable Duration
  - Syllable Energy
  - Prosodic Prominence
  - Prosodic Context
- PHONE PROPERTIES
  - Phonetic Identity
  - Phone Frequency
  - Position within the Word
  - Position within the Syllable
  - Phone Duration
  - Speaking Rate
  - Phonetic Context
  - Contiguous Phones Correct
  - Contiguous Phones Wrong
  - Phone Segmentation
  - Articulatory Features
  - Articulatory Feature Distance
  - Phone Confusion Matrices
- OTHER PROPERTIES
  - Speaker (Dialect, Gender)
  - Utterance Difficulty
  - Utterance Energy
  - Utterance Duration
142What is (usually) Meant by Stress Accent?
- Prosody is supposed to pertain to extra-phonetic cues in the acoustic signal
- The pattern of variation over a sequence of SYLLABLES pertaining to syllabic DURATION, AMPLITUDE and PITCH (f0) variation over time
- But, the plot thickens (considerably) ... as we'll shortly see
144Stress Accent and Word Error Rate
The effect of stress accent is most discernible among word-deletion errors
There is no essential relation between accent and word-substitution errors
- Data are averaged across all eight sites
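One way a breakdown of this kind could be tabulated, given per-word recognition outcomes tagged with a stress level, is sketched below. It assumes each word carries a single stress label and a single outcome. The record layout, label names and toy records are hypothetical and say nothing about the actual results.

```python
from collections import Counter, defaultdict


def error_rates_by_stress(records):
    """Return deletion and substitution rates for each stress level.

    records: iterable of (stress_level, outcome) pairs, where outcome is
    'correct', 'deletion' or 'substitution' (assumed labels).
    """
    counts = defaultdict(Counter)
    for stress, outcome in records:
        counts[stress][outcome] += 1

    rates = {}
    for stress, tally in counts.items():
        total = sum(tally.values())
        rates[stress] = {
            "deletion": tally["deletion"] / total,
            "substitution": tally["substitution"] / total,
        }
    return rates


# Toy records, invented purely to exercise the function.
toy_records = [
    ("unstressed", "deletion"), ("unstressed", "correct"),
    ("unstressed", "correct"), ("unstressed", "deletion"),
    ("full", "correct"), ("full", "substitution"),
    ("full", "correct"), ("full", "correct"),
]
for stress, rate in error_rates_by_stress(toy_records).items():
    print(stress, rate)
```

Averaging across sites, as on the slide, would amount to pooling (or averaging) such tables over the output of each recognition system.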
146Syllable Structure and Word Error Rate
Let's now consider syllable structure with respect to ASR word error
There is a certain similarity with the pattern observed for stress accent ...
150Syllable Structure and Word Error Rate
Vowel-initial forms show the greatest error, particularly for word deletions
Polysyllabic forms manifest the lowest error, especially for word deletions
The vowel-initial forms tend to be unstressed, so ...
Perhaps the similarity in pattern is not so surprising after all
- Data are averaged across all eight sites
151The Plot So Far
- The Proportion of Word (Deletion) Errors is Much
Higher Among Unstressed Syllables
152The Plot So Far
- The Proportion of Word (Deletion) Errors is Much
Higher Among Unstressed Sylla