Title: Voice DSP Processing I


1
Voice DSP Processing I
  • Yaakov J. Stein
  • Chief Scientist, RAD Data Communications

2
Voice DSP
  • Part 1 Speech biology and what we can learn from
    it
  • Part 2 Speech DSP (AGC, VAD, features, echo
    cancellation)
  • Part 3 Speech compression techniques
  • Part 4 Speech Recognition

3
Voice DSP - Part 1a
  • Speech production mechanisms
  • Biology of the vocal tract
  • Pitch and formants
  • Sonograms
  • The basic LPC model
  • The cepstrum
  • LPC cepstrum
  • Line spectral pairs

4
Voice DSP - Part 1b
  • Speech perception mechanisms
  • Biology of the ear
  • Psychophysical phenomena
  • Weber's law
  • Fechner's law
  • Changes
  • Masking

5
Voice DSP - Part 1c
  • Speech quality measurement
  • Subjective measurement
  • MOS and its variants
  • Objective measurement
  • PSQM, PESQ

6
Voice DSP - Part 2a
  • Basic speech processing
  • Simplest processing
  • AGC
  • Simplistic VAD
  • More complex processing
  • pitch tracking
  • formant tracking
  • U/V decision
  • computing LPC and other features

7
Voice DSP - Part 2b
  • Echo Cancellation
  • Sources of echo (acoustic vs. line echo)
  • Echo suppression and cancellation
  • Adaptive noise cancellation
  • The LMS algorithm
  • Other adaptive algorithms
  • The standard LEC

8
Voice DSP - Part 3
  • Speech compression techniques
  • PCM
  • ADPCM
  • SBC
  • VQ
  • ABS-CELP
  • MBE
  • MELP
  • STC
  • Waveform Interpolation

9
Voice DSP - Part 4
  • Speech Recognition tasks
  • ASR Engine
  • Phonetic labeling
  • DTW
  • HMM
  • State-of-the-Art

10
Voice DSP - Part 1a
  • Speech
  • production
  • mechanisms

11
Speech Production Organs

Brain
Hard Palate
Nasal cavity
Velum
Teeth
Uvula
Lips
Mouth cavity
Pharynx
Tongue
Larynx
Trachea
Lungs
12
Speech Production Organs - cont.
  • Air from lungs is exhaled into trachea (windpipe)
  • Vocal cords (folds) in larynx can produce
    periodic pulses of air
  • by opening and closing (glottis)
  • Throat (pharynx), mouth, tongue and nasal cavity
    modify air flow
  • Teeth and lips can introduce turbulence
  • Epiglottis separates esophagus (food pipe) from
    trachea

13
Voiced vs. Unvoiced Speech
  • When vocal cords are held open air flows
    unimpeded
  • When laryngeal muscles stretch them glottal flow
    is in bursts
  • When glottal flow is periodic it is called voiced
    speech
  • Basic interval/frequency called the pitch
  • Pitch period usually between 2.5 and 20
    milliseconds
  • Pitch frequency between 50 and 400 Hz
  • You can feel the vibration of the larynx
  • Vowels are always voiced (unless whispered)
  • Consonants come in voiced/unvoiced pairs
  • for example B/P G/K D/T V/F J/CH TH/th
    W/WH Z/S ZH/SH
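
The pitch period and pitch frequency ranges quoted above are two views of the same quantity, since frequency is the reciprocal of the period. A minimal Python sketch of the conversion (the function name is illustrative, not from the slides):

```python
def pitch_frequency_hz(period_ms: float) -> float:
    """Convert a pitch period in milliseconds to a pitch frequency in Hz."""
    if period_ms <= 0:
        raise ValueError("pitch period must be positive")
    return 1000.0 / period_ms


# The two ends of the pitch-period range from the slide.
shortest_period_hz = pitch_frequency_hz(2.5)   # upper end of the pitch range
longest_period_hz = pitch_frequency_hz(20.0)   # lower end of the pitch range
```

The shortest period maps to the highest pitch frequency and the longest period to the lowest, which is why the two ranges on the slide run in opposite directions.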

14
Excitation spectra
  • Voiced speech
  • Pulse train is not sinusoidal - harmonic
    rich
  • Unvoiced speech
  • Common assumption: white noise
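
To see why a pulse train is called harmonic rich, one can compare its DFT with that of a sinusoid of the same period. The sketch below is only an illustration: the 64-sample frame, the 8-sample period, and the helper names are arbitrary assumptions, and a naive DFT is used so that only the standard library is needed.

```python
import cmath
import math

N = 64        # frame length in samples (assumed)
PERIOD = 8    # pitch period in samples (assumed)


def dft_magnitudes(x):
    """Magnitudes of DFT bins 0..N/2 of a real sequence (naive O(N^2) DFT)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def active_bins(x, threshold=1e-6):
    """Indices of the DFT bins that carry non-negligible energy."""
    return [k for k, m in enumerate(dft_magnitudes(x)) if m > threshold]


pulse_train = [1.0 if t % PERIOD == 0 else 0.0 for t in range(N)]
sinusoid = [math.sin(2 * math.pi * t / PERIOD) for t in range(N)]

pulse_bins = active_bins(pulse_train)   # DC plus every multiple of N / PERIOD
sine_bins = active_bins(sinusoid)       # only the fundamental
```

The pulse train has energy at every multiple of the fundamental bin, while the sinusoid occupies a single bin.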

[Figures: excitation spectra plotted against frequency f]
15
Effect of vocal tract
  • Mouth and nasal cavities have resonances
  • Resonant frequencies
  • depend on geometry

16
Effect of vocal tract - cont.
  • Sound energy at these resonant frequencies is
    amplified
  • Frequencies of peak amplification are called
    formants

[Figure: frequency response versus frequency; F0 marked]
17
Formant frequencies
  • Peterson - Barney data (note the vowel triangle)

18
Sonograms
19
Cylinder model(s)
  • Rough model of throat and mouth cavity
  • With nasal cavity

[Figure labels: Voice Excitation, open, open; Voice Excitation, open/closed]
20
Phonemes
  • The smallest acoustic unit that can change
    meaning
  • Different languages have different phoneme sets
  • Types (notations: phonetic, CVC, ARPABET)
  • Vowels
  • front (heed, hid, head, hat)
  • mid (hot, heard, hut, thought)
  • back (boot, book, boat)
  • diphthongs (buy, boy, down, date)
  • Semivowels
  • liquids (l, r)
  • glides (w, y)

21
Phonemes - cont.
  • Consonants
  • nasals (murmurs) (n, m, ng)
  • stops (plosives)
  • voiced (b,d,g)
  • unvoiced (p, t, k)
  • fricatives
  • voiced (v, that, z, zh)
  • unvoiced (f, think, s, sh)
  • affricates (j, ch)
  • whispers (h, what)
  • gutturals ( ? ,? )
  • clicks, etc. etc. etc.

22
Basic LPC Model

Pulse Generator
LPC synthesis filter
U/V Switch
White Noise Generator
23
Basic LPC Model - cont.
  • Pulse generator produces a harmonic rich periodic
    impulse train (with pitch period and gain)
  • White noise generator produces a random signal
  • (with gain)
  • U/V switch chooses between voiced and unvoiced
    speech
  • LPC filter amplifies formant frequencies
  • (all-pole or AR IIR filter)
  • The output will resemble true speech to within
    residual error
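
The block diagram above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the model of any particular codec: the sign convention s[n] = e[n] + sum_k a[k] s[n-1-k], the single two-pole resonator standing in for a formant, the 8 kHz sampling rate, and all function names are choices made here for illustration.

```python
import math
import random


def excitation(n_samples, voiced, pitch_period=80, gain=1.0, seed=0):
    """U/V switch: periodic impulse train if voiced, white noise otherwise."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def lpc_synthesis(e, a):
    """All-pole (AR) filter: s[n] = e[n] + sum_k a[k] * s[n-1-k]."""
    s = []
    for n, e_n in enumerate(e):
        acc = e_n
        for k, a_k in enumerate(a):
            if n - 1 - k >= 0:
                acc += a_k * s[n - 1 - k]
        s.append(acc)
    return s


def resonator(freq_hz, bandwidth_hz, fs=8000.0):
    """Coefficients of a two-pole resonance (one 'formant'), assumed fs = 8 kHz."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / fs         # pole angle from frequency
    return [2.0 * r * math.cos(theta), -r * r]


a = resonator(500.0, 100.0)                      # hypothetical formant at 500 Hz
voiced = lpc_synthesis(excitation(400, voiced=True), a)
unvoiced = lpc_synthesis(excitation(400, voiced=False), a)
```

A real LPC synthesizer would use ten or more coefficients estimated from speech frames; the single resonance here only shows how the same filter shapes either excitation.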

24
Cepstrum
  • Another way of thinking about the LPC model
  • Speech spectrum is obtained from
    multiplication
  • Spectrum of (pitch) pulse train times
  • Vocal tract (formant) frequency response
  • So log of this spectrum is obtained from addition
  • Log spectrum of pitch train plus
  • Log of vocal tract frequency response
  • Consider this log spectrum to be the spectrum of
    some new signal
  • called the cepstrum
  • The cepstrum is the sum of two components
  • excitation plus vocal tract
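
The additivity claimed above can be checked numerically. The sketch below computes a real cepstrum as the inverse DFT of the log magnitude spectrum and confirms that the cepstrum of a (circular) convolution is the sum of the two cepstra. The two short sequences are toy stand-ins for excitation and vocal tract response, chosen here so that neither spectrum has zeros; they are assumptions, not speech data.

```python
import cmath
import math


def dft(x):
    """Naive DFT of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum of x."""
    n = len(x)
    log_mag = [math.log(abs(X)) for X in dft(x)]
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]


def circular_convolve(x, h):
    """Circular convolution, whose DFT is the product of the two DFTs."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]


N = 16
source = [1.0, 0.5] + [0.0] * (N - 2)    # toy "excitation" (assumed)
tract = [1.0, -0.3] + [0.0] * (N - 2)    # toy "vocal tract" response (assumed)
speech = circular_convolve(source, tract)

c_speech = real_cepstrum(speech)
c_sum = [s + t for s, t in zip(real_cepstrum(source), real_cepstrum(tract))]
max_mismatch = max(abs(p - q) for p, q in zip(c_speech, c_sum))
```

With real speech the excitation contributes at high quefrency (around the pitch period) and the vocal tract at low quefrency, which is what makes the two components separable.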

25
Cepstrum - cont.
  • Cepstral processing has its own language
  • Cepstrum (note that this is really a signal in
    the time domain)
  • Quefrency (its units are seconds)
  • Liftering (filtering)
  • Alanysis
  • Saphe
  • Several variants
  • complex cepstrum
  • power cepstrum
  • LPC cepstrum

26
Do we know enough?
  • Standard speech model (LPC)
  • (used by most speech processing/compression/
    recognition systems)
  • is a model of speech production
  • Unfortunately, speech production and speech
    perception systems
  • are not matched
  • So next we'll look at the biology of the hearing
    (auditory) system
  • and some psychophysics (perception)

27
Voice DSP - Part 1b
  • Speech
  • Hearing perception mechanisms

28
Hearing Organs
29
Hearing Organs - cont.
  • Sound waves impinge on outer ear and enter auditory
    canal
  • Amplified waves cause eardrum to vibrate
  • Eardrum separates outer ear from middle ear
  • The Eustachian tube equalizes air pressure of
    middle ear
  • Ossicles (hammer, anvil, stirrup) amplify
    vibrations
  • Oval window separates middle ear from inner ear
  • Stirrup excites oval window which excites liquid
    in the cochlea
  • The cochlea is curled up like a snail
  • The basilar membrane runs along middle of cochlea
  • The organ of Corti transduces vibrations to
    electric pulses
  • Pulses are carried by the auditory nerve to the
    brain

30
Function of Cochlea
  • Cochlea has 2 1/2 to 3 turns
  • were it straightened out it would be 3 cm in
    length
  • The basilar membrane runs down the center of the
    cochlea
  • as does the organ of Corti
  • 15,000 cilia (hairs) contact the vibrating
    basilar membrane
  • and release neurotransmitter stimulating
    30,000 auditory neurons
  • Cochlea is wide (1/2 cm) near oval window and
    tapers towards apex
  • is stiff near oval window and
    flexible near apex
  • Hence high frequencies cause section near oval
    window to vibrate
  • low frequencies cause section
    near apex to vibrate
  • Frequency decomposition by an overlapping bank of filters

31
Psychophysics - Weber's law
  • Ernst Weber: professor of physiology at Leipzig in
    the early 1800s
  • Just Noticeable Difference (JND)
  • minimal stimulus change that can be detected
    by senses
  • Discovery: $\Delta I = K I$
  • Example
  • Tactile sense: place coins in each hand
  • subject could discriminate between 10 coins
    and 11,
  • but not 20/21, but could 20/22!
  • Similarly for vision (lengths of lines), taste
    (saltiness), sound (frequency)
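
The coin example can be replayed in code. The sketch assumes Weber's law in the form "a change is noticeable when it is at least K times the stimulus", with K = 1/10 chosen here only because it is consistent with the three coin comparisons on the slide; the slide itself gives no value for K.

```python
from fractions import Fraction

# Assumed Weber fraction: 1/10 fits the coin example (10 vs 11 detectable,
# 20 vs 21 not, 20 vs 22 detectable). Exact fractions avoid float rounding.
WEBER_K = Fraction(1, 10)


def is_noticeable(intensity, changed_intensity, k=WEBER_K):
    """True if the change is at least one JND, i.e. |delta I| >= K * I."""
    return abs(changed_intensity - intensity) >= k * intensity


coins_10_vs_11 = is_noticeable(10, 11)
coins_20_vs_21 = is_noticeable(20, 21)
coins_20_vs_22 = is_noticeable(20, 22)
```

The point of the example is that the threshold scales with the stimulus: one extra coin is enough at 10 coins but not at 20.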

32
Weber's law - cont.
  • This makes a lot of sense

[Figure: Bill Gates]
33
Psychophysics - Fechner's law
  • Weber's law is not a true psychophysical law
  • it relates stimulus threshold to stimulus
    (both physical entities)
  • not internal representation (feelings) to
    physical entity
  • Gustav Theodor Fechner: student of Weber
    (medicine, physics, philosophy)
  • Simplest assumption: a JND is a single internal unit
  • Using Weber's law we find
  • $Y = A \log I + B$
  • Fechner Day (October 22 1850)
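
The step from Weber's law to the logarithmic law can be written out as follows. This is the standard textbook argument, sketched under the slide's assumption that one JND corresponds to one unit of internal sensation Y; treating the finite differences as differentials is an additional approximation.

```latex
\begin{align*}
\Delta I &= K I
  && \text{(Weber's law: size of one JND at intensity } I\text{)} \\
\Delta Y &= 1
  && \text{(assumption: one JND is one internal unit)} \\
\frac{dY}{dI} &\approx \frac{\Delta Y}{\Delta I} = \frac{1}{K I} \\
Y &= \frac{1}{K} \ln I + B = A \log I + B
  && \text{(integrating; } B \text{ is the constant of integration)}
\end{align*}
```

For the natural logarithm A = 1/K; for any other base the conversion factor is absorbed into A.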

34
Fechner's law - cont.
  • Log is very compressive
  • Fechner's law explains the fantastic ranges of
    our senses
  • Sight: single photon to direct sunlight, $10^{15}$
  • Hearing: eardrum moving by 1 H atom to jet plane, $10^{12}$
  • Bel defined to be $\log_{10}$ of power ratio
  • decibel (dB): one tenth of a Bel
  • $d(\mathrm{dB}) = 10 \log_{10} (P_1 / P_2)$
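
A short Python sketch of the decibel definition, applied to the two dynamic ranges quoted above (read here as power ratios of 10^15 and 10^12, which is an interpretation of the slide; the function name is illustrative):

```python
import math


def power_ratio_db(p1: float, p2: float) -> float:
    """Ratio of two powers expressed in decibels: 10 * log10(p1 / p2)."""
    if p1 <= 0 or p2 <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(p1 / p2)


sight_range_db = power_ratio_db(1e15, 1.0)     # roughly 150 dB
hearing_range_db = power_ratio_db(1e12, 1.0)   # roughly 120 dB
```

Twelve to fifteen orders of magnitude collapse to a range of 120 to 150 on the decibel scale, which is the compressive behaviour the slide refers to.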

35
Fechner's law - sound amplitudes
  • Companding
  • adaptation of logarithm to positive/negative
    signals
  • μ-law and A-law are piecewise linear
    approximations
  • Equivalent to linear sampling at 12-14 bits
  • (8 bit linear sampling is significantly more
    noisy)
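
The benefit of companding for quiet signals can be sketched in Python. Note the assumptions: the code uses the continuous μ-law curve with μ = 255 rather than the piecewise linear approximation used in practice, a simple uniform 8-bit quantizer, and a test sample of 0.01 of full scale; all names are illustrative.

```python
import math

MU = 255.0  # assumed companding constant


def mu_law_compress(x: float, mu: float = MU) -> float:
    """Continuous mu-law curve for x in [-1, 1]; the sign is handled separately."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: float = MU) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize(y: float, bits: int = 8) -> float:
    """Uniform quantizer on [-1, 1] with 2**(bits-1) - 1 levels per polarity."""
    levels = 2 ** (bits - 1) - 1
    return round(y * levels) / levels


x = 0.01  # a quiet sample, 1% of full scale (assumed)
linear_error = abs(quantize(x) - x)
companded_error = abs(mu_law_expand(quantize(mu_law_compress(x))) - x)
```

For this quiet sample the companded 8-bit path reconstructs the value far more accurately than plain 8-bit linear quantization, because the logarithmic curve spends more of its levels near zero.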

36
Fechner's law - sound frequencies
  • octaves, well tempered scale
  • Critical bands
  • Frequency warping
  • Mel (melody): 1 KHz = 1000 mel, JNDs afterwards:
    $M = 1000 \log_2 (1 + f_{KHz})$
  • Bark (Barkhausen), can be simultaneously heard:
    $B = 25 + 75 (1 + 1.4 f_{KHz}^2)^{0.69}$
  • excite different basilar
    membrane regions
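
The two frequency-warping formulas can be tried out directly. The sketch assumes the slide's formulas read M = 1000 log2(1 + f) for the mel scale and B = 25 + 75 (1 + 1.4 f^2)^0.69 for the critical bandwidth in Hz, with f in kHz in both; the function names are illustrative.

```python
import math


def mel(f_khz: float) -> float:
    """Mel value for a frequency in kHz (assumed form: 1000 * log2(1 + f))."""
    return 1000.0 * math.log2(1.0 + f_khz)


def critical_bandwidth_hz(f_khz: float) -> float:
    """Critical bandwidth in Hz around a centre frequency given in kHz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


mel_at_1khz = mel(1.0)                      # anchors the scale: 1 kHz is 1000 mel
mel_at_3khz = mel(3.0)                      # warping: 3 kHz is only about 2000 mel
band_at_dc = critical_bandwidth_hz(0.0)     # about 100 Hz wide at low frequency
band_at_1khz = critical_bandwidth_hz(1.0)   # wider at higher frequency
```

Both functions grow roughly linearly at low frequency and logarithmically above about 1 kHz, which matches the way regions of the basilar membrane map to frequency.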

37
Psychophysics - changes
  • Our senses respond to changes

38
Psychophysics - masking
  • Masking: strong tones block weaker ones at nearby
    frequencies
  • narrowband noise blocks
    tones (up to critical band)

39
Voice DSP - Part 1c
  • Speech
  • Quality
  • Measurement

40
Why does it sound the way
it sounds?
  • PSTN
  • BW = 0.2-3.8 KHz, SNR > 30 dB
  • PCM, ADPCM (BER $10^{-3}$)
  • five nines reliability
  • line echo cancellation
  • Voice over packet network
  • speech compression
  • delay, delay variation, jitter
  • packet loss/corruption/priority
  • echo cancellation

41
Subjective Voice Quality
  • Old Measures
  • 5/9
  • DRT
  • DAM
  • The modern scale
  • MOS
  • DMOS

Rhyming test words: meet, neat, seat, feet, Pete, beat, heat
42
MOS according to ITU
  • P.800 Subjective Determination of Transmission
    Quality
  • Annex B Absolute Category Rating (ACR)

    Score  Listening Quality  Listening Effort
    5      excellent          relaxed
    4      good               attention needed
    3      fair               moderate effort
    2      poor               considerable effort
    1      bad                no meaning with feasible effort
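
A Mean Opinion Score is the arithmetic mean of the listeners' category ratings. A minimal Python sketch for the listening-quality scale above follows; the panel of ratings is hypothetical and the function name is illustrative.

```python
# Listening-quality categories of the Absolute Category Rating scale.
ACR_LABELS = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}


def mean_opinion_score(ratings):
    """Average a list of ACR ratings (integers 1..5) into a MOS."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in ACR_LABELS:
            raise ValueError(f"not an ACR category: {rating!r}")
    return sum(ratings) / len(ratings)


panel = [4, 4, 3, 5, 4, 4, 3, 4]   # hypothetical ratings from eight listeners
mos = mean_opinion_score(panel)
```

A real test averages over many listeners and many speech samples under controlled listening conditions; the arithmetic itself is no more than this.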

43
MOS according to ITU (cont)
  • Annex D Degradation Category Rating (DCR)
  • Annex E Comparison Category Rating (CCR)
  • ACR not good at high quality speech

    Score  DCR                CCR
     5     inaudible
     4     not annoying
     3     slightly annoying  much better
     2     annoying           better
     1     very annoying      slightly better
     0                        the same
    -1                        slightly worse
    -2                        worse
    -3                        much worse

44
Some MOS numbers
  • Effect of Speech Compression
  • (from ITU-T Study Group 15)

    Condition                                   MOS
    Quiet room, 48 KHz 16 bit linear sampling   5.0
    PCM (A-law/μ-law) 64 Kb/s                   4.1
    G.723.1 @ 6.3 Kb/s                          3.9
    G.729 @ 8 Kb/s                              3.9
    ADPCM G.726 32 Kb/s                         3.8  (toll quality)
    GSM @ 13 Kb/s                               3.6
    VSELP IS54 @ 8 Kb/s                         3.4

45
The Problem(s) with MOS
  • Accurate MOS tests are the only reliable
    benchmark
  • BUT
  • MOS tests are off-line
  • MOS tests are slow
  • MOS tests are expensive
  • Different labs give consistently different
    results
  • Most MOS tests only check one aspect of system

46
The Problem(s) with SNR
  • Naive question: isn't CCR the same as SNR?
  • SNR does not correlate well with subjective
    criteria
  • Squared difference is not an accurate comparator
  • Gain
  • Delay
  • Phase
  • Nonlinear processing
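
The sensitivity of SNR to gain, delay, and phase can be demonstrated with a pure tone. In the sketch below the test signals are a half-amplitude copy, an inverted copy, and a copy delayed by half a period; the tone length and period are arbitrary assumptions. All three would sound essentially like the original, yet the squared-difference SNR is poor for each.

```python
import math


def snr_db(reference, test):
    """SNR in dB, treating the sample-by-sample difference as noise."""
    signal_energy = sum(r * r for r in reference)
    noise_energy = sum((r - t) ** 2 for r, t in zip(reference, test))
    if noise_energy == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_energy / noise_energy)


PERIOD = 40            # samples per cycle (assumed)
HALF = PERIOD // 2
tone = [math.sin(2.0 * math.pi * n / PERIOD) for n in range(400)]

quieter = [0.5 * s for s in tone]        # gain change only
inverted = [-s for s in tone]            # phase inversion only
delayed = tone[-HALF:] + tone[:-HALF]    # circular delay of half a period

snr_gain = snr_db(tone, quieter)      # about +6 dB
snr_phase = snr_db(tone, inverted)    # about -6 dB
snr_delay = snr_db(tone, delayed)     # about -6 dB
```

A measure that reports negative SNR for a signal that sounds unchanged cannot serve as a stand-in for a listening test, which is the slide's point.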

47
Speech distance measures
  • Many objective measures have been proposed
  • Segmental SNR
  • Itakura-Saito distance
  • Euclidean distance in Cepstrum space
  • Bark spectral distortion
  • Coherence Function
  • None correlate well with MOS
  • ITU target - find a quality-measure that does
    correlate well

48
Some objective methods
  • Perceptual Speech Quality Measure (PSQM)
  • ITU-T P.861
  • Perceptual Analysis Measurement System (PAMS)
  • BT proprietary technique
  • Perceptual Evaluation of Speech Quality (PESQ)
  • ITU-T P.862
  • Objective Measurement of Perceived Audio Quality
    (PEAQ)
  • ITU-R BS.1387

49
Objective Quality Strategy
[Figure: block diagram; surviving label: speech]
50
PSQM philosophy (from P.861)
[Block diagram labels: Perceptual model, Internal Representation (two parallel paths); Audible Difference; Cognitive Model]
51
PSQM philosophy (cont)
  • Perceptual Modelling (Internal representation)
  • Short time Fourier transform
  • Frequency warping (telephone-band filtering, Hoth
    noise)
  • Intensity warping
  • Cognitive Modelling
  • Loudness scaling
  • Internal cognitive noise
  • Asymmetry
  • Silent interval processing
  • PSQM Values
  • 0 (no degradation) to 6.5 (maximum degradation)
  • Conversion to MOS
  • PSQM to MOS calibration using known references
  • Equivalent Q values

52
Problems with PSQM
  • Designed for telephony grade speech codecs
  • Doesn't take network effects into account
  • filtering
  • variable time delay
  • localized distortions
  • Draft standard P.862 adds
  • transfer function equalization
  • time alignment, delay skipping
  • distortion averaging

53
PESQ philosophy (from P.862)
[Block diagram labels: Time Alignment; Perceptual model, Internal Representation (two parallel paths); Audible Difference; Cognitive Model]