Title: Speech Communication
1 Speech Communication
EEM.ssr: Speech and Speaker Recognition
Lecturer in speech and audio, Centre for Vision, Speech and Signal Processing, Department of Electronic Engineering.
http://www.ee.surrey.ac.uk/Teaching/Courses/eem.ssr
2 World of speech technologies
Spoken language generation
Spoken language understanding
Spoken dialogue processing
Automatic speech recognition
Emotion in speech
Speaker recognition
Speech technology
Speech perception
Phonetics
Speech enhancement
Speech production
Speech modification
Speech analysis
Speech synthesis
Speech coding
3 Speech-related disciplines
Linguistics
Psychology
Maths and stats
Phonetics
Speech science
Acoustics
Computer science
Signal processing
Electronics
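4 The speech chain
[Figure: the speech chain between SPEAKER and LISTENER]
- Speaker: motor nerves drive the vocal muscles; a feedback link runs through the speaker's own ear and sensory nerves
- Sound waves carry the signal from speaker to listener
- Listener: ear and sensory nerves
- Levels along the chain: linguistic, physiological, acoustic, physiological, linguistic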
5 It comes as naturally as breathing
- Speech is man's preferred modality
- Can use natural language for interacting with complex systems
- Hands-free
- Eyes-free
- Small footprint
- Requires no training
6 Ideas and language
- Ideas are concepts or abstract notions
- Language has a grammar and syntax, and is made up of words
- We develop our understanding of the world at the same time as we are learning to talk
- Many of our thoughts are framed in terms of words
- Language (and culture) affect the way we think
7 Written vs. spoken language
- Written language
- discrete words separated by spaces
- usually complete, correct spelling
- opportunity to skip, skim or re-read
- Spoken language
- continuous sequence of sounds, usually without spaces
- often damaged, interrupted, parts mumbled
8 Speech is not acoustic text
9 Sounds and words
- Phonetics
- How speech sounds are produced
- Acoustic result of speech articulation
- Phonology
- How sounds are used to make words
- The functions of the sounds within a particular language
10 Acoustic signal
- Sound produced by vibration of vocal cords
- Sound modified by resonances of the vocal tract
- International Phonetic Alphabet (IPA)
- Phoneme: the smallest unit in speech where substitution could change meaning
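The phoneme definition above can be tried out with minimal pairs: two words that differ in exactly one sound, where that one substitution changes the meaning. The sketch below is only an illustration. The tiny lexicon and its informal ASCII symbols are assumptions made for the example, not IPA and not part of the course material.

```python
from itertools import combinations

# Hypothetical toy lexicon: word -> sequence of informal phoneme symbols.
LEXICON = {
    "pat": ("p", "a", "t"),
    "bat": ("b", "a", "t"),
    "pit": ("p", "i", "t"),
    "bad": ("b", "a", "d"),
}


def minimal_pairs(lexicon):
    """Return (word1, word2, sound1, sound2) for words differing in exactly one sound."""
    pairs = []
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        if len(p1) != len(p2):
            continue  # only compare transcriptions of equal length
        diffs = [(a, b) for a, b in zip(p1, p2) if a != b]
        if len(diffs) == 1:
            pairs.append((w1, w2, diffs[0][0], diffs[0][1]))
    return pairs


for w1, w2, s1, s2 in minimal_pairs(LEXICON):
    print(f"{w1} / {w2}: /{s1}/ vs /{s2}/")
```

Each pair printed is evidence that the two contrasting sounds belong to different phonemes in this toy lexicon.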
11 Speech sounds
- Speech production
- Articulators: how do they affect the speech waveform?
- Phonemes
- What are they, why are they useful?
- Phonemes are speech sounds in an ideal world.
- Phonetics
- How are phonemes actually realized?
- Phones are speech sounds in the real world.
- Allophones are different types of realisation.
- The wider context
- Language, accent,
- Speaker differences,
- Effect of external factors.
12 Vowels, consonants and syllables
- Vowels
- Vibrating vocal cords in larynx with clear vocal tract
- Produced using slower extrinsic muscles
- Consonants
- Usually some occlusion of the vocal tract
- Sound source can be from larynx, click or hiss
- Produced using faster intrinsic muscles
- Syllables
- All languages have CV syllables
- Basic unit of articulation
- Consonant clusters
13 Phonetics vs. orthography
- Letter-phoneme mapping is not 1-to-1
- Some sounds require several letters
- e.g., sh, ph
- Some letters have several pronunciations
- e.g., g, c
- Some sounds have several transcriptions
- e.g., /f/: f and ph
- Some letters produce several sounds
- e.g., x: /ks/
- Some combinations have complex relations
- e.g., -ough-
- Different accents use different phonemes
- e.g., bath
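The many-to-many mapping on this slide can be made concrete with a small table. The sketch below is illustrative only: the correspondences reuse the slide's examples, while the informal ASCII sound symbols and the example words in the comments are assumptions added for the illustration.

```python
from collections import defaultdict

# Illustrative (spelling, sounds) correspondences; symbols are informal, not IPA.
CORRESPONDENCES = [
    ("sh", ("S",)),      # as in "ship": one sound, two letters
    ("ph", ("f",)),      # as in "phone"
    ("f", ("f",)),       # as in "fun"
    ("g", ("g",)),       # as in "go"
    ("g", ("dZ",)),      # as in "gem"
    ("c", ("k",)),       # as in "cat"
    ("c", ("s",)),       # as in "city"
    ("x", ("k", "s")),   # as in "box": one letter, two sounds
]


def summarise(correspondences):
    """Find spellings with several pronunciations and sounds with several spellings."""
    by_spelling = defaultdict(set)
    by_sounds = defaultdict(set)
    for spelling, sounds in correspondences:
        by_spelling[spelling].add(sounds)
        by_sounds[sounds].add(spelling)
    ambiguous_spellings = sorted(s for s, v in by_spelling.items() if len(v) > 1)
    ambiguous_sounds = sorted(p for p, v in by_sounds.items() if len(v) > 1)
    return ambiguous_spellings, ambiguous_sounds


spellings, sounds = summarise(CORRESPONDENCES)
print("Spellings with several pronunciations:", spellings)
print("Sounds with several spellings:", sounds)
```

A real pronunciation dictionary is far larger, but even this toy table shows why a recogniser cannot rely on simple letter-to-sound rules.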
14 Prosody
- Pitch
- Corresponds to the frequency of vibration of the vocal cords
- (Has phonetic significance in tonal languages)
- Intensity
- How loud a particular word or syllable is
- Timing
- Durations depend on the phrasing (punctuation), context (cf. league, leek), etc.
- Stress-timed vs. syllable-timed languages
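Since pitch corresponds to the vibration frequency of the vocal cords, it can be estimated from the waveform by finding the lag at which the signal best matches a shifted copy of itself. The sketch below uses plain autocorrelation, which is one common approach and not necessarily the one used in this course. The function name, the 50 to 500 Hz search range and the synthetic sine wave standing in for voiced speech are all assumptions.

```python
import math


def estimate_pitch(samples, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation at this lag.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic test signal: 0.1 s of a 125 Hz sine sampled at 8 kHz.
sample_rate = 8000
samples = [math.sin(2 * math.pi * 125.0 * n / sample_rate) for n in range(800)]
print(f"Estimated pitch: {estimate_pitch(samples, sample_rate):.1f} Hz")
```

Real speech needs more care, such as short analysis frames, a voicing decision and protection against octave errors, but the principle is the same.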
15 Language, accent and dialect
- Language
- A system of communication with a vocabulary of words, grammar and syntax
- Different languages have different phonetic contrasts (right, light)
- Accent
- Pronunciation variations that do not affect the meaning of a spoken utterance (good, food)
- Intelligible by native speakers
- Dialect
- Variations in vocabulary, and possibly other aspects, for a distinct population
16 Non-acoustic signals
- Many other sources of information from other senses: face, body, gesture, touch, ...
- Can make you hear things differently
- Lip reading
- Information about articulation can be derived from (peripherally) observing lips
- Major cue for hearing impaired
- Significant effect for normal hearers (McGurk)
- Para-linguistic information
- Facial mood and emotion
- Culturally-grounded gestures
- Modifying gestures
- Body language
- Stress and emphasis
17 Complexity demands intelligence
- Speech is very complex
- requires fusion of many sources of knowledge
- Humans have developed large brains and supreme intelligence in the animal kingdom to deal with it
- very large number of neurons, in parallel
18 Summary of speech comm.
- Speech is the natural modality for man to interact with machines
- Ideas and language
- Written vs. spoken language (phonology)
- Continuous acoustic signal (phonetics)
- Phonemes, phones and allophones
- Vowels and consonants
- Phonetic vs. orthographic transcriptions
- Intelligible by native speakers
- Para-linguistic information
- Prosody: intensity, pitch and timing
- Language, accent and dialect
- Visual, haptic and contextual information
19 Speech recognition
20 What is speech recognition?
- Types of spoken language processing
- Automatic speech recognition (ASR)
- Spoken language understanding
- Dialogue systems
- Paralinguistic speech processing
- Speech verification
- Speech coding, enhancement and modification
- Speech synthesis
- Spoken language generation
- Speaker recognition: identification and authentication/verification
21 Speech recognition problem
- The dream and reality
- Intelligent machines?
- Size of vocabulary: 50, 1000, 20000 words
- Speaker-dependent/-independent
- Discovering our ignorance
- How does the ear work?
- How does the brain process sounds to perceive concepts?
- Circumventing our ignorance
- Ad-hoc rules vs. pattern matching techniques
- Most successful based on stochastic modelling
- Recent advances in neural network approaches
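As an illustration of the pattern-matching idea above, the sketch below recognises an isolated word by comparing it against stored templates with dynamic time warping (DTW). DTW is a classic template-matching technique chosen here for illustration; the slide does not name it. The one-dimensional "feature" sequences and the two-word vocabulary are made up for the example, and real systems compare sequences of spectral feature vectors.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognise(utterance, templates):
    """Return the template word with the smallest DTW distance to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical templates and a slower, time-stretched rendition of "yes".
templates = {"yes": [1, 3, 4, 3, 1], "no": [4, 2, 1, 2, 4]}
utterance = [1, 1, 3, 4, 4, 3, 1]
print(recognise(utterance, templates))
```

Warping the time axis lets the match tolerate differences in speaking rate, which a frame-by-frame comparison cannot.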
22 Dimensions of difficulty
- Speaker dependency
- Vocabulary size
- Isolated words vs. continuous speech
- Language constraints and knowledge sources
- Acoustic ambiguity
- Noise robustness
23 Speech recognition summary
- Dream and reality
- Speech-to-text machines
- Vocabulary size and speaker-dependency trade off against recognition accuracy
- Incomplete specification
- Of language, of the human ear, the auditory nerves and of how the cortex processes speech to derive meaning
- An engineering solution
- Use pattern matching techniques
- Most successful based on Hidden Markov Models
- Recent advances in HMM/ANN hybrids
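To give a flavour of how Hidden Markov Models are used for recognition, the sketch below scores an observation sequence against two word models with the forward algorithm and picks the more likely one. It is a minimal sketch under stated assumptions: the two-state left-to-right models, their probabilities and the binary observation symbols are all hypothetical, and real systems use many more states with continuous acoustic features.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete HMM, computed with the forward algorithm."""
    n_states = len(start)
    # Initialisation: probability of starting in each state and emitting obs[0].
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    # Recursion: sum over predecessor states, then emit the next symbol.
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical word models: emit[state][symbol], with symbols 0 and 1.
MODELS = {
    "A": {"start": [1.0, 0.0],
          "trans": [[0.5, 0.5], [0.0, 1.0]],
          "emit": [[0.9, 0.1], [0.1, 0.9]]},   # mostly 0s, then mostly 1s
    "B": {"start": [1.0, 0.0],
          "trans": [[0.5, 0.5], [0.0, 1.0]],
          "emit": [[0.1, 0.9], [0.9, 0.1]]},   # mostly 1s, then mostly 0s
}


def recognise(obs, models):
    """Return the name of the model that assigns obs the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, **models[name]))


print(recognise([0, 0, 1, 1], MODELS))
```

For long utterances the probabilities become very small, so practical implementations rescale them at each step or work in the log domain.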