Title: Voice Transformations
1. Voice Transformations
Definition: modifying a signal to intentionally change its characteristics
- Challenges: signal processing techniques have advanced faster than our understanding of the underlying physics
- Examples
  - Change the rate of articulation while maintaining the formant structure
  - Alter F0 and modify the spacing between the harmonic components (e.g., change between male, female, and child voices)
  - Modify the intensity by multiplying the amplitudes of signal sections
- Voice transformation: alter a person's speech to sound like another's
- Voice morphing: morph audio spoken by one speaker to sound like the same audio spoken by another
2. Helium's Effect on Speech
- Changes the formants (the resonances), but not the pitch
- Vocal fold tension, geometry, and length determine the pitch, and helium does not change them
- The speed of sound is greater in helium, so the resonances are shifted higher
- Diagram: the second formant is shifted to the right, off the diagram
- There is less power at the lower frequencies, so vowels articulate differently
Figure: normal voice spectrum vs. helium voice spectrum; the vertical lines mark the resonances
3. Voice Characteristics
- Breathy voice: the amplitude of the first F0 harmonic is much larger than the amplitude of the second F0 harmonic (large glottal opening); see the sketch below
- Creaky voice: subtracting the amplitude at the higher formants from the amplitude of the first F0 harmonic gives a small or negative value (a flat or negative spectral tilt)
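The breathy/creaky contrast above is often quantified as the difference between the first two harmonic amplitudes. A minimal sketch, assuming a single voiced frame and an already-known F0 (both hypothetical inputs), and a 20 Hz search tolerance chosen purely for illustration:

```python
import numpy as np

def h1_h2_db(frame, sample_rate, f0):
    """Measure the dB difference between the first and second F0 harmonics.

    A large positive difference suggests breathy voice; a small or negative
    one suggests creaky voice (per the spectral-tilt idea above).
    """
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    def peak_amp(target_hz, tolerance_hz=20.0):
        band = (freqs > target_hz - tolerance_hz) & (freqs < target_hz + tolerance_hz)
        return spectrum[band].max()

    h1 = peak_amp(f0)        # amplitude of the first harmonic
    h2 = peak_amp(2 * f0)    # amplitude of the second harmonic
    return 20 * np.log10(h1 / h2)
```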
4. Vowel Acoustics
- Each person has a unique acoustic space; vowels exhibit patterns within that space
- Vowels are primarily distinguished by their first two formant frequencies, F1 and F2
- F1 corresponds to vowel height
  - A smaller F1 implies a higher vowel
  - A larger F1 implies a lower vowel
- F2 corresponds to a front or back vowel
  - A smaller F2 implies a back vowel
  - A larger F2 implies a front vowel
- Lip rounding tends to lower both F1 and F2
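As a toy illustration of the F1/F2 space described above, a minimal sketch that labels a vowel's height and frontness from its first two formant frequencies. The threshold values are rough, hypothetical round numbers for an adult voice, not values from the slides:

```python
def describe_vowel(f1_hz, f2_hz):
    """Roughly place a vowel in the height/frontness space from F1 and F2.

    Lower F1 -> higher vowel; higher F2 -> fronter vowel (see bullets above).
    Thresholds are illustrative only.
    """
    height = "high" if f1_hz < 400 else "low" if f1_hz > 700 else "mid"
    frontness = "front" if f2_hz > 1800 else "back" if f2_hz < 1200 else "central"
    return f"{height} {frontness} vowel"

print(describe_vowel(280, 2250))   # e.g. /i/ -> "high front vowel"
print(describe_vowel(750, 1100))   # e.g. /ɑ/ -> "low back vowel"
```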
5. Vowels at Different Pitches
Figure: spectra of the same vowel at 100 Hz, 120 Hz, and 150 Hz; F1 moves slightly to the right and F2 to the left as F0 increases
9. Men: lower F0; women: higher F0
11. Synthesizing Speech
- Source-filter model
  - Excitation: the glottal signal (source)
  - Time-varying linear filter (the vocal tract)
- Simplest form
  - Excitation
    - Quasi-periodic pulse sequences (voiced speech)
    - Noise (unvoiced speech)
  - Time-varying linear filter (linear prediction)
- Challenge: define an excitation sequence that produces natural-sounding speech
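A minimal source-filter sketch, assuming LP coefficients have already been estimated for the frame; the excitation choices (unit pulse train or white noise) are the "simplest form" named above, and the function name and parameters are illustrative:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(lp_coeffs, n_samples, sample_rate, f0=None):
    """Drive an all-pole (LP) vocal-tract filter with a simple excitation.

    f0=None  -> white-noise excitation (unvoiced speech)
    f0=110.0 -> pulse train at 110 Hz  (voiced speech)
    """
    if f0 is None:
        excitation = np.random.randn(n_samples)           # noise source
    else:
        excitation = np.zeros(n_samples)
        period = int(round(sample_rate / f0))
        excitation[::period] = 1.0                        # quasi-periodic pulses

    # All-pole filter 1 / (1 - sum_i a_i z^-i) models the vocal tract.
    return lfilter([1.0], np.concatenate(([1.0], -np.asarray(lp_coeffs))), excitation)
```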
12. Synthesis Approaches
- Multi-pulse sequences of zeros and ones to better represent the glottal excitation
- Combine a series of sinusoids to create a glottal-like excitation
- Determine F0 and use harmonics of F0 as excitation inputs
- Concatenation and unit-selection approaches
- Most modern synthesis implementations utilize unit selection; however, because of a desire to implement voice transformation algorithms, there is a renewed focus on utilizing digital signal processing techniques
13. Pitch and Rate of Change
- TD-PSOLA (time-domain pitch-synchronous overlap-add); a sketch follows this list
- Advantages
  - Does a good job when changes are less than a factor of two
  - A time-domain algorithm, so very efficient
- Disadvantages: not sufficient for complex transformations
  - Hard to maintain the amplitude and phase relationships between formants
  - Repeated fricative frames start sounding tonal; reversing or randomizing the fricative spectra helps, but not for voiced fricatives
  - Increased articulation compresses vowels by 50% but consonants by only 25% (we protect consonants, which carry more information)
  - The pitch values and contour are affected
  - Non-linearities between sub-glottal resonances
  - Unexpected artifacts appear in the synthesized signal
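A minimal sketch of the overlap-add idea behind TD-PSOLA, assuming pitch marks (glottal pulse positions) have already been estimated and using one average period for every grain. Real implementations use local periods per grain and also move the marks to change pitch; this only changes duration:

```python
import numpy as np

def psola_time_stretch(signal, pitch_marks, stretch):
    """Simplified TD-PSOLA-style time stretch (pitch contour preserved).

    pitch_marks: ascending sample indices of glottal pulses (assumed given).
    stretch:     1.5 makes the output 50% longer, 0.75 makes it shorter.
    """
    pitch_marks = np.asarray(pitch_marks)
    avg_period = int(np.mean(np.diff(pitch_marks)))
    out_len = int(len(signal) * stretch)
    out = np.zeros(out_len)

    out_mark = int(pitch_marks[0])
    while out_mark < out_len - avg_period:
        # Map the output mark back to input time and grab the nearest input mark;
        # marks get reused (slow down) or skipped (speed up).
        in_time = out_mark / stretch
        src = int(pitch_marks[np.argmin(np.abs(pitch_marks - in_time))])
        lo, hi = src - avg_period, src + avg_period
        if lo >= 0 and hi <= len(signal) and out_mark - avg_period >= 0:
            grain = signal[lo:hi] * np.hanning(2 * avg_period)   # two-period Hann grain
            out[out_mark - avg_period:out_mark + avg_period] += grain
        out_mark += avg_period   # keep the original spacing => pitch unchanged
    return out
```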
14. Energy Modification
- Naïve approach: multiply each sample by some constant (see the sketch below)
- Problems
  - When we speak louder, we emphasize some parts of the signal more than others; we stress consonants more than vowels
  - More sub-glottal pressure stresses higher frequencies more than lower ones
  - Pitch tends to rise as speech becomes louder
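The naïve approach is a one-liner; a crude improvement suggested by the bullets is to tilt the gain toward higher frequencies. A sketch, where the first-order pre-emphasis filter and the tilt value are illustrative assumptions rather than anything prescribed by the slides:

```python
import numpy as np
from scipy.signal import lfilter

def naive_gain(signal, gain):
    """Naive loudness change: scale every sample equally."""
    return gain * signal

def tilted_gain(signal, gain, tilt=0.3):
    """Crude 'louder voice' approximation: overall gain plus a first-order
    pre-emphasis that boosts high frequencies (more sub-glottal pressure
    stresses the highs more than the lows)."""
    emphasized = lfilter([1.0, -tilt], [1.0], signal)   # y[n] = x[n] - tilt * x[n-1]
    return gain * emphasized
```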
15. Harmonic Plus Excitation Model
- Speech = harmonic component + excitation component
- Harmonic component: the vocal tract modeled as a linear prediction filter (a sketch follows this list)
- Noise component: a collection of sinusoids with time-varying amplitudes and frequencies
- Harmonic component (linear prediction): y(n) = Σ_{i=1..P} a_i · y(n−i) + r(n), i.e., the prediction is ŷ(n) = Σ_{i=1..P} a_i · y(n−i)
  - Residue r(n): the excitation plus nasal/sub-glottal non-linearities
- Excitation signal estimate: e(t) = Σ_{k=0..K(t)} m_k(t) · e^{i φ_k(t)}
  - K(t) is the number of sinusoids at time t
  - m_k(t) is the amplitude of the k-th sinusoid at time t
  - φ_k(t) is the phase of the k-th sinusoid at time t
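A minimal sketch of the linear-prediction step: estimate the coefficients a_i for one windowed frame with the autocorrelation method and compute the residue r(n). The prediction order of 12 is an illustrative choice:

```python
import numpy as np
from scipy.signal import lfilter

def lp_residual(frame, order=12):
    """Estimate LP coefficients a_i and return the residue
    r(n) = y(n) - sum_i a_i * y(n - i)."""
    y = frame * np.hamming(len(frame))
    # Autocorrelation method: solve the normal equations R a = r.
    acf = np.correlate(y, y, mode="full")[len(y) - 1:]
    R = np.array([[acf[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, acf[1:order + 1])
    # Inverse (analysis) filter A(z) = 1 - sum_i a_i z^-i yields the residue.
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], y)
    return a, residual
```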
16. The Harmonic Model
Excitation signal: e(t) = Σ_{k=0..K(t)} m_k(t) · e^{i φ_k(t)}
- Questions to answer
  - How do we determine which sine waves to use?
  - How do we determine the phases and amplitudes?
  - How many sine waves should we use?
  - How do we represent unvoiced speech?
- Note: φ_k(t) advances at a rate of 2π · k · F0(t)
  - The sinusoids are harmonics of F0 (the fundamental frequency)
  - Otherwise this would be a sinusoidal model (not a harmonic one)
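A minimal real-valued sketch of the harmonic excitation sum, assuming a constant F0 and given harmonic amplitudes; the zero phases and the 1/k roll-off in the example are simplifying assumptions, not part of the model above:

```python
import numpy as np

def harmonic_excitation(f0_hz, amps, duration, sample_rate):
    """Build e(t) = sum_k m_k * cos(2*pi*k*F0*t) for a constant F0.

    amps: harmonic amplitudes m_k for k = 1..K (assumed given, e.g. measured
    from FFT peaks); phases are taken as zero for simplicity.
    """
    t = np.arange(int(duration * sample_rate)) / sample_rate
    e = np.zeros_like(t)
    for k, m_k in enumerate(amps, start=1):
        e += m_k * np.cos(2 * np.pi * k * f0_hz * t)   # k-th harmonic of F0
    return e

# e.g. a 100 Hz excitation whose harmonics roll off as 1/k
excitation = harmonic_excitation(100.0, [1.0 / k for k in range(1, 21)], 0.5, 16000)
```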
17. Linear Interpolation
- Goal: compute a partial's phase/amplitude at time t
- Formula: (y − y0)/(x − x0) = (y1 − y0)/(x1 − x0)
- Application
  - Assume a window size of w ms
  - Frame n represents time n·w
  - Frame n+1 represents time (n+1)·w
  - n·w < t < (n+1)·w is the time of interest
  - x0, x1: phases at times n·w and (n+1)·w
  - y0, y1: amplitudes at times n·w and (n+1)·w
  - x, y: phase and amplitude at time t
- Note: cubic interpolation uses the successive and previous windows and interpolates the points between them
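A minimal sketch of the interpolation step for one partial, applied per value (amplitude, or unwrapped phase) between two adjacent analysis frames; the function and parameter names are illustrative:

```python
def interpolate_partial(t, frame_time_0, frame_time_1, value_0, value_1):
    """Linearly interpolate a partial's amplitude (or unwrapped phase) at time t,
    given its values at the two surrounding analysis frames."""
    alpha = (t - frame_time_0) / (frame_time_1 - frame_time_0)
    return value_0 + alpha * (value_1 - value_0)

# e.g. amplitude 0.8 at frame time 10 ms and 0.6 at 20 ms -> value at 14 ms
amp_at_t = interpolate_partial(0.014, 0.010, 0.020, 0.8, 0.6)
```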
18. McAulay-Quatieri Algorithm
- Perform an FFT on the signal
- Extract the peak frequencies with their phases/amplitudes
- Find the F0 whose harmonics most closely represent the partials
- Connect the partials of successive and previous windows (a sketch follows this list)
- Generate time-varying sine waves (cubic interpolation)
- Apply the vocal tract filter to generate the synthesized speech
- Death of a track: no matching partial in the successive window
- Birth of a track: no matching partial in the previous window
- Partial: an FFT peak extracted with its phase and amplitude
- Track: the connections between partials of adjacent windows
- Note: the typical number of partials for synthesis is from 20 to 160
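A minimal sketch of the analysis half: peak (partial) extraction from one frame, then connecting partials of adjacent windows. The greedy nearest-frequency matching and the 50 Hz threshold are simplifying assumptions, not the algorithm's exact matching rule:

```python
import numpy as np

def find_partials(frame, sample_rate, n_peaks=40):
    """Pick spectral peaks (partials) with their frequency, amplitude, and phase."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    mag = np.abs(spectrum)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sample_rate)
    # Local maxima of the magnitude spectrum, strongest first.
    peaks = [i for i in range(1, len(mag) - 1) if mag[i] > mag[i - 1] and mag[i] > mag[i + 1]]
    peaks = sorted(peaks, key=lambda i: mag[i], reverse=True)[:n_peaks]
    return [(freqs[i], mag[i], np.angle(spectrum[i])) for i in sorted(peaks)]

def match_tracks(prev_partials, cur_partials, max_hz=50.0):
    """Connect partials of adjacent windows by nearest frequency.

    An unmatched previous partial is a track death; an unmatched current
    partial is a track birth."""
    matches, used = [], set()
    for j, (f_prev, _, _) in enumerate(prev_partials):
        candidates = [(abs(f - f_prev), i) for i, (f, _, _) in enumerate(cur_partials) if i not in used]
        if candidates:
            dist, i = min(candidates)
            if dist <= max_hz:
                used.add(i)
                matches.append((j, i))        # track continues
    deaths = [j for j in range(len(prev_partials)) if j not in {m[0] for m in matches}]
    births = [i for i in range(len(cur_partials)) if i not in used]
    return matches, deaths, births
```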
19. Sinusoid Death and Birth
20. Unvoiced Speech
- Problem
  - Unvoiced speech resembles noise
  - Noise requires too many sinusoids for an accurate representation
  - Signal transformations (such as stretching) applied to closely related harmonics produce sound heard as "wormy" or "jittery"
  - Unvoiced tracks span only a small number of windows, so interpolation methods become problematic
- Solution: bandwidth-enhanced oscillators
21. Definitions
- Carrier signal: a sinusoidal signal transmitted at a steady frequency
- Modulation: the process of varying one or more properties of a high-frequency periodic carrier waveform
- Oscillation: repetitive variation, typically in time
22. Bandwidth-Enhanced Oscillation
- Technique: a partial's energy is increased relative to its spectral amplitude and spread across adjacent frequencies
- Details: (a) the center frequency stays the same, (b) the energy is spread evenly on both sides, (c) random modulations
- Parameters: widening amount, fall-off intensities
- Result: a closer representation of the original signal
Figure: (a) a partial with no widening, (b) a partial with moderate widening, (c) a partial with a large amount of widening
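A minimal sketch of one bandwidth-enhanced oscillator following the idea above: the partial's center frequency stays fixed while filtered noise spreads its energy to adjacent frequencies. The one-pole low-pass noise shaper and the noisiness mixing rule are illustrative assumptions:

```python
import numpy as np
from scipy.signal import lfilter

def bandwidth_enhanced_partial(freq_hz, amp, noisiness, duration, sample_rate):
    """One bandwidth-enhanced oscillator: a sinusoid whose amplitude is jittered
    by low-pass-filtered noise, spreading energy around the center frequency.

    noisiness in [0, 1]: 0 -> pure partial (no widening), 1 -> mostly a noise band.
    """
    n = int(duration * sample_rate)
    t = np.arange(n) / sample_rate
    # Low-pass-filtered noise controls how far the energy spreads (the "widening").
    noise = lfilter([0.05], [1.0, -0.95], np.random.randn(n))
    envelope = amp * ((1.0 - noisiness) + noisiness * noise)
    return envelope * np.cos(2 * np.pi * freq_hz * t)   # center frequency unchanged
```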
23. Algorithm Refinements
- Add bandwidth-enhanced oscillation
- Vary the spread of the bandwidths based on the amount of voicing in the signal
- Formula: Y(t) = Σ_{k=0..K−1} Σ_{n=0..N} (A_k(t) + β_t) · sin(k · N_n · F0 + φ_k(t))
  - Y(t) is the synthesized signal at time t
  - A_k(t) is the carrier frequency amplitude at time t
  - k is a harmonic multiple of F0 (a partial); K is the number of partials
  - φ_k(t) is the phase of the k-th partial
  - N is the number of oscillations used for introducing noise
  - N_n is the output of a random number generator that modulates F0
  - β is a noise modulation factor
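One way to read the formula above as a time-domain sum. The 2π·t factor inside the sine, the 1% jitter scale for N_n, the normalization by N, and treating A_k and φ_k as constant over the frame are all interpretive assumptions layered on the slide's notation:

```python
import numpy as np

def refined_synthesis(amps, phases, f0_hz, beta, n_noise, duration, sample_rate):
    """Sketch of the refined synthesis sum: for every partial k, sum N noisy
    oscillations whose frequencies jitter around the k-th harmonic of F0.

    amps[k], phases[k]: A_k and phi_k per partial; beta adds noise to the
    amplitude; n_noise is N.
    """
    t = np.arange(int(duration * sample_rate)) / sample_rate
    y = np.zeros_like(t)
    for k, (a_k, phi_k) in enumerate(zip(amps, phases)):
        for _ in range(n_noise):
            jitter = 1.0 + 0.01 * np.random.randn()          # N_n: random F0 modulation
            amp = a_k + beta * np.random.randn()             # A_k(t) + beta_t
            y += amp * np.sin(2 * np.pi * k * jitter * f0_hz * t + phi_k)
    return y / max(n_noise, 1)
```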