Title: AN ACOUSTIC PROFILE OF SPEECH EFFICIENCY
1 AN ACOUSTIC PROFILE OF SPEECH EFFICIENCY
R.J.J.H. van Son, Barbertje M. Streefkerk, and Louis C.W. Pols
Institute of Phonetic Sciences / ACLC, University of Amsterdam
Herengracht 338, 1016 CG Amsterdam, The Netherlands
tel: 31 20 5252183, fax: 31 20 5252197, email: Rob.van.Son_at_hum.uva.nl
ICSLP2000, Beijing, China, Oct. 20, 2000
2 INTRODUCTION
- Speech is "efficient":
- Important components are emphasized
- Less important ones are de-emphasized
- Two mechanisms:
- 1) Prosody: Lexical Stress and Sentence Accent (Prominence)
- 2) Predictability: Frequency of Occurrence (tested) and Context (not tested)
3 MECHANISMS FOR EFFICIENT SPEECH
- Speech emphasis should mirror importance, which largely corresponds to unpredictability
- Prosodic structure distributes emphasis according to importance (lexical stress, sentence accent / prominence)
- Speakers can (de-)emphasize according to supposed (un)importance
- Speech production mechanisms can facilitate redundant speech or hamper unpredictable speech
4 QUESTIONS
- Can the distribution of emphasis or reduction be completely explained from Prosody (Lexical Stress and Sentence Accent / Prominence)?
- If not, can we identify a speech production mechanism that would assist efficiency in speech?
- E.g., preprogrammed articulation of redundant and/or high-frequency syllable-like segments?
5 SPEECH MATERIAL (DUTCH)
- Single Male Speaker (Vowels and Consonants): matched Informal and Read speech, 791 matched VCV pairs
- Polyphone (Vowels only): 273 speakers (out of 5000), telephone speech, 1244 read sentences, segmented with a modified HMM recognizer (Xue Wang)
- Corpora sizes: number of realizations of vowels and consonants

Corpus                       Unstressed        Stressed          Total
                             Accent ? / ?      Accent ? / ?
Single Speaker, consonants   550    180        569    283        1582
Single Speaker, vowels       812    461        528    224        2025
Polyphone, vowels            4435   4942       9603   3516       22496

- Accent: Sentence accent / Prominence
- Stressed/Unstressed: Lexical stress
6 METHODS: SPEECH PREPARATION
- Single speaker corpus:
- All 2 x 791 VCV segments hand-labeled
- Sentence accent also determined by hand
- 22 native listeners identified consonants from this corpus
- Polyphone corpus:
- Automatically labeled using a pronunciation lexicon and a modified HMM recognizer
- 10 judges marked prominent words (prominence 1-10)
- Word and syllable -log2(frequencies) for both corpora were determined from Dutch CELEX
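
The -log2(frequency) measure used for words and syllables can be illustrated with a minimal sketch. It assumes relative frequencies estimated from raw counts; the function name `neg_log2_frequencies` and the toy syllable list are hypothetical, and the slides do not state how the CELEX counts were normalized.

```python
import math
from collections import Counter


def neg_log2_frequencies(tokens):
    """Return -log2(relative frequency) for each distinct token.

    Rare tokens get high values (less predictable), frequent tokens
    get low values (more predictable).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: -math.log2(n / total) for tok, n in counts.items()}


# Hypothetical toy syllable list, NOT taken from CELEX.
syllables = ["de", "de", "de", "de", "van", "van", "sprak", "taal"]
info = neg_log2_frequencies(syllables)

for syllable, bits in sorted(info.items(), key=lambda kv: kv[1]):
    print(f"{syllable}: {bits:.2f} bits")
```

In this toy list the most frequent syllable carries the fewest bits, which is the direction of the frequency measure that the later slides correlate with acoustic reduction.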
7 METHODS: ANALYSIS, Single Speaker Corpus (Consonants and Vowels)
- Duration in ms (vowels and consonants)
- Contrast (vowels only): F1 / F2 distance to (300, 1450) Hz in semitones
- Spectral Center of Gravity (CoG) (V and C): weighted mean frequency in semitones at point of maximum energy
- Log2(Perplexity) from consonant identification: calculated from confusion matrices
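
Two of the measures listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the vowel contrast is taken here as the Euclidean distance in the (F1, F2) semitone plane, and log2(perplexity) is taken as the entropy (in bits) of the listener responses in one row of a confusion matrix. All function names are hypothetical.

```python
import math


def semitones(f_hz, ref_hz):
    """Interval between f_hz and ref_hz in semitones (12 per octave)."""
    return 12.0 * math.log2(f_hz / ref_hz)


def vowel_contrast(f1_hz, f2_hz, center=(300.0, 1450.0)):
    """Distance of a vowel from the center point (300, 1450) Hz.

    Assumption: Euclidean distance between semitone-scaled F1 and F2.
    """
    d1 = semitones(f1_hz, center[0])
    d2 = semitones(f2_hz, center[1])
    return math.hypot(d1, d2)


def log2_perplexity(response_counts):
    """Entropy in bits of one confusion-matrix row (responses to one consonant).

    Assumption: perplexity = 2 ** entropy, so log2(perplexity) = entropy.
    """
    total = sum(response_counts)
    probs = [c / total for c in response_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


# A vowel with F1 one octave above the center F1 and F2 at the center F2.
print(vowel_contrast(600.0, 1450.0))

# Hypothetical response rows for 22 listeners.
print(log2_perplexity([22, 0, 0]))  # all listeners agree
print(log2_perplexity([11, 11]))    # responses split evenly over two consonants
```

A consonant that every listener identifies the same way has zero log2(perplexity); the value grows as responses spread over more alternatives.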
8 METHODS: ANALYSIS, Polyphone Corpus (Vowels only)
- Loudness: in sone
- Spectral Center of Gravity (CoG): weighted mean frequency in semitones averaged over the segment
- Prominence (1-10): the number of 'PROMINENT' listener judgements; 0-5 is considered Unaccented, 6-10 is considered Accented
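
As a rough sketch of the CoG and prominence definitions above: the code below computes a power-weighted mean frequency and converts it to semitones, then applies the 0-5 / 6-10 split. Several details are assumptions not given in the slides: the weighting by power, averaging in Hz before the semitone conversion, and the semitone reference `ref_hz` (a hypothetical 100 Hz). The function names are hypothetical as well.

```python
import math


def center_of_gravity_st(freqs_hz, power, ref_hz=100.0):
    """Power-weighted mean frequency, expressed in semitones re ref_hz.

    Assumptions: weights are spectral power values, the mean is taken in Hz
    and converted afterwards, and ref_hz is an arbitrary reference.
    """
    total = sum(power)
    mean_hz = sum(f * p for f, p in zip(freqs_hz, power)) / total
    return 12.0 * math.log2(mean_hz / ref_hz)


def is_accented(prominent_judgements, threshold=5):
    """Apply the slide's split: 0-5 is Unaccented, 6-10 is Accented."""
    return prominent_judgements > threshold


# Toy two-bin spectrum with equal power at 100 Hz and 300 Hz.
print(center_of_gravity_st([100.0, 300.0], [1.0, 1.0]))

# Prominence scores for three hypothetical words.
print([is_accented(score) for score in (2, 5, 6)])
```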
9 CONSISTENCY OF MEASUREMENTS: Correlation coefficients between factors
Figure: correlation coefficients between factors; panels: Single Speaker, Polyphone
Filled symbols: P < 0.01
- Duration in ms, Loudness in sones
- CoG: Spectral Center of Gravity (semitones)
- Px: log2(Perplexity), plotted is R
- Contrast: F1 / F2 distance to (300, 1450) Hz (semitones)
10 CONSONANT REDUCTION VERSUS FREQUENCY OF OCCURRENCE (correlation coefficients)
Single speaker corpus (n = 1582)
Filled symbols: P < 0.01
- CoG: Spectral Center of Gravity (semitones)
- Perplexity: log2(Perplexity), plotted is R
- Syllable and word frequencies were correlated (R = 0.230, p0.01)
11 VOWEL REDUCTION VERSUS FREQUENCY OF OCCURRENCE (correlation coefficients)
Single speaker corpus (n = 2025)
Filled symbols: P < 0.01
- Duration in ms
- Contrast: F1 / F2 distance to (300, 1450) Hz (semitones)
- CoG: Spectral Center of Gravity (semitones)
- Syllable and word frequencies were correlated (R = 0.280, p < 0.01)
12 DISCUSSION OF SINGLE SPEAKER DATA
- There are consistent correlations between frequency of occurrence and acoustic reduction (duration, CoG and contrast), but not for consonant identification (perplexity)
- Correlations for syllable frequencies tend to be larger than those for word frequencies (p?0.01)
- Correlations were found after accounting for Phoneme identity, Lexical Stress and Sentence Accent
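
The slides do not say how phoneme identity, stress and accent were accounted for. One common approach, shown here purely as an assumption, is to subtract the mean of each class (phoneme x stress x accent) from both variables and correlate the residuals. The helper names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def center_within_classes(values, classes):
    """Subtract each class mean, removing between-class differences."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, c in zip(values, classes):
        sums[c] += v
        counts[c] += 1
    return [v - sums[c] / counts[c] for v, c in zip(values, classes)]


# Hypothetical toy data: two classes of three vowel tokens each.
classes = [("a", "stressed", "accented")] * 3 + [("@", "unstressed", "unaccented")] * 3
info = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]                 # -log2(syllable frequency)
duration = [120.0, 130.0, 140.0, 50.0, 60.0, 70.0]    # ms

raw_r = pearson_r(info, duration)
within_r = pearson_r(
    center_within_classes(info, classes),
    center_within_classes(duration, classes),
)
print(raw_r, within_r)
```

In this toy example the raw correlation is negative because the unstressed class is both rarer and shorter, while the within-class correlation is positive. This is why class effects have to be removed before frequency effects can be interpreted.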
13 PROMINENCE VERSUS VOWEL REDUCTION AND FREQUENCY OF OCCURRENCE (correlation coefficients)
Polyphone corpus (n = 22496)
Figure: correlation coefficients for Loudness, CoG, Syllable freq. and Word freq.
Filled symbols: P < 0.01
- Loudness (sone)
- CoG: Spectral Center of Gravity (semitones)
- Syllable and word frequencies (-log2(freq))
14 VOWEL REDUCTION VERSUS FREQUENCY OF OCCURRENCE (correlation coefficients)
Polyphone corpus (n = 22496)
Filled symbols: P < 0.01
- Accent split: Prom > 5 versus Prom < 5
- Loudness (sone)
- CoG: Spectral Center of Gravity (semitones)
- Syllable and word frequencies were correlated (R = 0.316, p < 0.01)
15 DISCUSSION OF POLYPHONE DATA
- Perceived prominence correlates with acoustic vowel reduction (loudness, CoG) and frequency of occurrence (syllable and word)
- There are small but consistent correlations between acoustic vowel reduction and frequency of occurrence
- Correlations were found after accounting for Vowel identity, Lexical Stress and Prominence
16 CONCLUSIONS
- LEXICAL STRESS and SENTENCE ACCENT / PROMINENCE cannot explain all of the efficiency of speech
- FREQUENCY OF OCCURRENCE and possibly CONTEXT in general are needed for a full account
- A SYLLABARY which speeds up (and reduces) the articulation of stored, high-frequency syllables with respect to computed, rare syllables might explain at least part of our data
17 SPOKEN LANGUAGE CORPUS: How Efficient is Speech
- 8-10 speakers, 60 minutes of speech each (fixed and variable materials)
- Informal story telling and retold stories: 15 min
- Reading continuous texts: 15 min
- Reading isolated (pseudo-) sentences: 20 min
- Word lists: 5 min
- Syllable lists: 5 min
18 MEASURING SPEECH EFFICIENCY
- Speaking Style differences
- (Informal, Retold, Read, Sentences, Lists)
- Predictability
- Frequency of Occurrence (words and syllables)
- In Context (language models)
- Cloze-tests
- Shadowing (RT or delay)
- Acoustic Reduction
- Segment identification
- Duration
- Spectral reduction