Voice Transformations - PowerPoint PPT Presentation

Slides: 24
Provided by: Harve164
Learn more at: https://cs.sou.edu

1
Voice Transformations
Definition: Modifying a signal to intentionally
change its characteristics
  • Challenges: Signal processing techniques have
    advanced faster than our understanding of the
    physics
  • Examples
  • Rate of articulation: change the speaking rate
    while maintaining the formant structure
  • Alter F0 and modify the spacing between the
    harmonic components. Change between male,
    female, and child voices.
  • Modify the intensity: multiply the amplitudes
    of signal sections
  • Voice Transformation: Alter a person's speech to
    sound like another's
  • Voice Morphing: Morph audio spoken by one speaker
    to sound like the same audio spoken by another

2
Helium's Effect on Speech
  • Changes the formants (resonances of the vocal
    tract), but not the pitch
  • Vocal fold tension, geometry, and length affect
    the pitch
  • The speed of sound in helium is greater, so the
    resonances shift higher
  • Diagram: Second formant shifted to the right, off
    the diagram.
  • Less power at lower frequencies; vowels
    articulate differently
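
The resonance shift can be illustrated with a minimal sketch that treats the vocal tract as a uniform tube closed at one end, a common textbook simplification that the slide itself does not state. The tract length (0.17 m) and the speed of sound in pure helium (about 1007 m/s) are assumed values; real inhaled gas is a helium-air mix, so the actual shift is smaller.

```python
def tube_formants(speed_of_sound_m_s, tract_length_m=0.17, count=3):
    """Resonances of a tube closed at one end: F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * speed_of_sound_m_s / (4 * tract_length_m)
            for n in range(1, count + 1)]

C_AIR = 343.0      # m/s, dry air near 20 C
C_HELIUM = 1007.0  # m/s, pure helium near 20 C (assumed value)

air = tube_formants(C_AIR)
helium = tube_formants(C_HELIUM)

# Every resonance scales by the same factor; F0 does not appear in the
# model at all, which is why the pitch stays put.
shift = helium[0] / air[0]
print([round(f) for f in air], [round(f) for f in helium], round(shift, 2))
```

In this model the formants scale with the speed of sound while the source (vocal fold vibration) is untouched, which matches the first bullet.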

Figure: normal voice spectrum and helium voice spectrum.
The vertical lines are harmonics of F0.
3
Voice Characteristics
  • Breathy voice: The amplitude of the first F0
    harmonic is much larger than the amplitude
    of the second F0 harmonic (large vocal opening)
  • Creaky voice: Subtracting the amplitude of the
    higher harmonics of F0 from the amplitude of the
    first harmonic gives a small or negative value
    (spectral tilt)

4
Vowel Acoustics
  • Each person has a unique acoustic space; vowels
    exhibit patterns within that space
  • Vowels are primarily distinguished by their first
    two formant frequencies, F1 and F2
  • F1 corresponds to vowel height
  • A lower F1 frequency implies a higher vowel
  • A higher F1 frequency implies a lower vowel
  • F2 corresponds to a front or back vowel
  • A lower F2 frequency implies a back vowel
  • A higher F2 frequency implies a front vowel
  • Lip rounding tends to lower both F1 and F2

5
at different pitches
Figure panels: F0 at 100 Hz, 120 Hz, and 150 Hz.
F1 moves slightly to the right and F2 to the left
as F0 increases.
6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
Men: lower F0. Women: higher F0.
10
(No Transcript)
11
Synthesizing Speech
  • Source-filter model
  • Excitation: glottal signal (source)
  • Time-varying linear filter (vocal tract)
  • Simplest form
  • Excitation
  • Quasi-periodic pulse sequences (voiced speech)
  • Noise (unvoiced speech)
  • Time-varying linear filter (linear prediction)
  • Challenge: Define an excitation sequence that
    produces natural-sounding speech
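
The simplest form of the source-filter model can be sketched as follows. This is an illustrative toy, not the presenter's code: the filter is fixed rather than time-varying, it has a single two-pole resonance, and the names `make_excitation`, `resonator_coeffs`, and `all_pole_filter` as well as the 700 Hz resonance are assumptions chosen for the example.

```python
import numpy as np

def make_excitation(n_samples, fs, f0=None, seed=0):
    """Pulse train for voiced speech (f0 in Hz), white noise if f0 is None."""
    if f0 is None:
        return np.random.default_rng(seed).standard_normal(n_samples)
    pulses = np.zeros(n_samples)
    pulses[::int(round(fs / f0))] = 1.0  # one pulse per pitch period
    return pulses

def resonator_coeffs(center_hz, radius, fs):
    """Two-pole resonator: y[n] = e[n] + a1*y[n-1] + a2*y[n-2]."""
    omega = 2 * np.pi * center_hz / fs
    return [2 * radius * np.cos(omega), -radius ** 2]

def all_pole_filter(excitation, coeffs):
    """Run the excitation through y[n] = e[n] + sum_i coeffs[i-1] * y[n-i]."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for i, a in enumerate(coeffs, start=1):
            if n - i >= 0:
                acc += a * y[n - i]
        y[n] = acc
    return y

fs = 8000
coeffs = resonator_coeffs(center_hz=700, radius=0.97, fs=fs)
voiced = all_pole_filter(make_excitation(fs, fs, f0=100), coeffs)
unvoiced = all_pole_filter(make_excitation(fs, fs), coeffs)
```

The voiced output keeps the 100 Hz harmonic spacing of the source, while the filter decides which harmonic is strongest. A bare pulse train like this one sounds buzzy, which is the naturalness challenge the slide names.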

12
Synthesis Approaches
  • Multi-pulse: sequences of zeros and ones to better
    represent the glottal excitation
  • Combine a series of sinusoids to create a
    glottal-like excitation.
  • Determine F0 and use harmonics of F0 as
    excitation inputs.
  • Concatenation and unit selection approaches

Most modern synthesis implementations use
unit selection. However, because of a desire to
implement voice transformation algorithms, there
is a renewed focus on digital signal
processing techniques.
13
Pitch and Rate of Change
  • TD-PSOLA (Time-Domain Pitch-Synchronous
    Overlap and Add)
  • Advantages
  • Does a good job when changes are less than a
    factor of two
  • Time-domain algorithm: very efficient
  • Disadvantages: Not sufficient for complex
    transformations
  • Maintain amplitude and phase relationships
    between formants
  • Repeated fricative frames start to sound tonal.
    Reversing or randomizing fricative spectra
    helps, but not for voiced fricatives.
  • An increased articulation rate compresses
    vowels/consonants by 50%/25% (we protect
    consonants, which carry more information).
  • The pitch values and contour are affected.
  • Non-linearities between sub-glottal resonances
  • Unexpected artifacts contained in the synthesized
    signal
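
A heavily simplified sketch of the TD-PSOLA idea follows. It assumes a constant, already known pitch period, whereas a real implementation has to detect pitch marks and handle unvoiced frames; the function name `td_psola_pitch_shift` is hypothetical. Two-period Hann-windowed grains are cut around each analysis mark and overlap-added at a new spacing, which changes F0 while keeping the duration.

```python
import numpy as np

def td_psola_pitch_shift(x, period, factor):
    """Shift pitch by `factor`, assuming a constant known period (in samples)."""
    new_period = int(round(period / factor))
    window = np.hanning(2 * period + 1)
    analysis_marks = np.arange(period, len(x) - period, period)
    y = np.zeros(len(x))
    for t in range(period, len(x) - period, new_period):
        # Reuse the analysis grain whose mark is closest to this synthesis mark.
        m = analysis_marks[np.argmin(np.abs(analysis_marks - t))]
        grain = x[m - period : m + period + 1] * window
        y[t - period : t + period + 1] += grain
    return y

# Synthetic voiced signal: five harmonics of 100 Hz at fs = 8000 Hz.
fs, period = 8000, 80
n = np.arange(fs)
x = sum(np.cos(2 * np.pi * k * n / period) / k for k in range(1, 6))

y = td_psola_pitch_shift(x, period, factor=1.25)  # new period: 64 samples
```

Because each grain carries the local spectral envelope, the formants stay roughly in place while the pulse spacing changes. Grains are repeated or skipped as needed, which is where the repeated-frame artifacts in the list above come from.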

14
Energy Modification
  • Naïve approach: Multiply each sample by some
    constant.
  • Problems
  • When we speak louder, we emphasize some parts of
    the signal more than others; we stress consonants
    more than vowels.
  • More sub-glottal pressure stresses higher
    frequencies more than lower ones.
  • Pitch tends to rise as speech becomes louder.
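
To make the contrast concrete, here is a small sketch comparing the naïve constant gain with a gain that also tilts the spectrum toward higher frequencies. The first-order pre-emphasis filter and its coefficient are assumptions picked for illustration, not a method from the slides, and the sketch ignores the consonant/vowel and pitch effects.

```python
import numpy as np

def naive_gain(x, gain_db):
    """Multiply every sample by the same constant."""
    return x * 10 ** (gain_db / 20)

def gain_with_tilt(x, gain_db, emphasis=0.5):
    """Apply y[n] = x[n] - emphasis * x[n-1] (boosts highs), then the gain."""
    y = np.array(x, dtype=float)
    y[1:] = x[1:] - emphasis * x[:-1]
    return naive_gain(y, gain_db)

def band_ratio(x, low_bin, high_bin):
    """Energy at one high-frequency bin relative to one low-frequency bin."""
    power = np.abs(np.fft.rfft(x)) ** 2
    return power[high_bin] / power[low_bin]

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)

flat = naive_gain(x, 6.0)
tilted = gain_with_tilt(x, 6.0)
```

The naïve version leaves the balance between the 200 Hz and 3000 Hz components unchanged, while the tilted version raises the high component relative to the low one.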

15
Harmonic Plus Excitation Model
  • Speech: harmonic and excitation components
  • Harmonic: Vocal tract as a linear prediction
    filter
  • Noise component: collection of sinusoids with
    time-varying amplitudes and frequencies
  • Harmonic component: Linear prediction
  • $y_n = r_n + \sum_{i=1}^{P} a_i y_{n-i}$ or $y_n \approx \sum_{i=1}^{P} a_i y_{n-i}$
  • Residue $r_n$: excitation and nasal/sub-glottal
    non-linearities
  • Excitation signal estimate: $e(t) = \sum_{k=0}^{K(t)} m_k(t) e^{i\phi_k(t)}$
  • $K(t)$ is the number of sinusoids at time t
  • $m_k(t)$ is the amplitude of the kth sinusoid at time t
  • $\phi_k(t)$ is the phase of the kth sinusoid at time t
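
The linear prediction step can be sketched with an ordinary least-squares fit. This is one of several ways to estimate the coefficients (autocorrelation with Levinson-Durbin recursion is the more usual choice in speech coding); the function name `linear_prediction` and the synthetic second-order test signal are assumptions for the example.

```python
import numpy as np

def linear_prediction(y, order):
    """Least-squares fit of y[n] ~ sum_i a[i] * y[n-i]; returns (a, residue)."""
    # Column i holds the signal delayed by i samples.
    past = np.column_stack([y[order - i : len(y) - i]
                            for i in range(1, order + 1)])
    target = y[order:]
    a, *_ = np.linalg.lstsq(past, target, rcond=None)
    residue = target - past @ a  # what the predictor cannot explain
    return a, residue

# Synthetic signal generated by a known predictor driven by white noise.
rng = np.random.default_rng(0)
true_a = np.array([1.3, -0.6])
r = rng.standard_normal(5000)
y = np.zeros(5000)
for n in range(5000):
    y[n] = r[n] + sum(true_a[i] * y[n - 1 - i]
                      for i in range(2) if n - 1 - i >= 0)

a, residue = linear_prediction(y, order=2)
```

On this synthetic signal the fitted coefficients land close to `true_a`, and the residue closely tracks the driving noise, which plays the role of the excitation in the model above.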

16
The Harmonic Model
Excitation signal: $e(t) = \sum_{k=0}^{K(t)} m_k(t) e^{i\phi_k(t)}$
  • Questions to answer
  • How do we determine which sine waves to use?
  • How do we determine the phases and amplitudes?
  • How many sine waves should we use?
  • How do we represent unvoiced speech?
  • Note: $\phi_k(t) = 2\pi k F_0(t)$
  • The sinusoids are harmonics of F0 (fundamental
    frequency)
  • Otherwise this would be a sinusoidal model (not
    harmonic)
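
A minimal sketch of a harmonic excitation, under stated assumptions: it uses the real part (cosines) rather than complex exponentials, starts at k = 1 so there is no DC term, and obtains each partial's phase by accumulating the instantaneous frequency k·F0 over time. The function name `harmonic_excitation` and the fixed amplitudes are illustrative.

```python
import numpy as np

def harmonic_excitation(f0_track, amps, fs):
    """Sum of cosines at k * F0, k = 1..K, with per-sample F0 in Hz."""
    # Accumulate the fundamental's phase so a changing F0 stays continuous.
    fundamental_phase = 2 * np.pi * np.cumsum(f0_track) / fs
    k = np.arange(1, len(amps) + 1)[:, None]
    return np.sum(amps[:, None] * np.cos(k * fundamental_phase[None, :]), axis=0)

fs = 8000
f0_track = np.full(fs, 200.0)        # one second at a constant 200 Hz
amps = np.array([1.0, 0.5, 0.25])    # m_k for k = 1, 2, 3
e = harmonic_excitation(f0_track, amps, fs)

# With one-second input the FFT bins are 1 Hz apart.
spectrum = np.abs(np.fft.rfft(e)) / (len(e) / 2)
```

All energy sits at 200, 400, and 600 Hz. Letting the partial frequencies drift away from integer multiples of F0 would turn this into the general sinusoidal model mentioned in the last bullet.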

17
Linear Interpolation
Goal: Compute partial phases/amplitudes at time t
  • Formula: $(y - y_0)/(x - x_0) = (y_1 - y_0)/(x_1 - x_0)$
  • Application
  • Assume window size w ms
  • Frame n represents time nw
  • Frame n+1 represents time (n+1)w
  • nw < t < (n+1)w is the time of interest
  • x0, x1: phases at times nw, (n+1)w
  • y0, y1: amplitudes at times nw, (n+1)w
  • x, y: phase and amplitude at time t

Note: Cubic interpolation uses the successive and
previous windows and interpolates points between them.
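
As a small worked example, the sketch below interpolates a single partial parameter between two frames. It interpolates against time, which is an assumption made for clarity; the slide's formula pairs phase with amplitude instead. The function name `interpolate_partial` is hypothetical.

```python
def interpolate_partial(t, n, w, value_n, value_next):
    """Linear interpolation between frame n (time n*w) and frame n+1."""
    t0, t1 = n * w, (n + 1) * w
    if not t0 <= t <= t1:
        raise ValueError("t must lie between frame n and frame n+1")
    fraction = (t - t0) / (t1 - t0)
    return value_n + (value_next - value_n) * fraction

# Window size 10 ms: frame 2 is at 20 ms, frame 3 at 30 ms.
# Halfway between them the amplitude is halfway between 0.4 and 0.8.
amplitude = interpolate_partial(t=25, n=2, w=10, value_n=0.4, value_next=0.8)
```

Phase needs extra care in practice because it wraps around every 2π, which is one reason cubic interpolation over more windows is preferred.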
18
McAulay-Quatieri Algorithm
  • Perform an FFT on the signal
  • Extract peak frequencies with phases/amplitudes.
  • Find the F0 whose harmonics closely represent the
    partials
  • Connect partials of successive and previous
    windows
  • Generate time-varying sine waves using cubic
    interpolation
  • Apply the vocal tract filter to generate
    synthesized speech
  • Death of a track: If no matching successive
    window partial
  • Birth of a track: If no matching previous window
    partial
  • Partial: An FFT peak extracted with its phases
    and amplitudes
  • Track: Connections between partials of adjacent
    windows
  • Note: Typical number of partials for synthesis
    is from 20 to 160.
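
The peak extraction and track connection steps can be sketched as below. This is a simplified illustration, not the full McAulay-Quatieri procedure: matching is a greedy nearest-frequency search, and the names `extract_partials`, `match_tracks`, and the `max_jump_hz` threshold are assumptions for the example.

```python
import numpy as np

def extract_partials(frame, fs, max_partials=20):
    """Return (frequency_hz, amplitude, phase) for the strongest FFT peaks."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    mag = np.abs(spectrum)
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]
    strongest = sorted(sorted(peaks, key=lambda k: mag[k], reverse=True)[:max_partials])
    return [(k * fs / len(frame), mag[k], np.angle(spectrum[k])) for k in strongest]

def match_tracks(prev_freqs, next_freqs, max_jump_hz):
    """Greedy matching of partials between adjacent windows."""
    continued, deaths = [], []
    unused = set(range(len(next_freqs)))
    for i, f in enumerate(prev_freqs):
        candidates = [j for j in unused if abs(next_freqs[j] - f) <= max_jump_hz]
        if candidates:
            j = min(candidates, key=lambda c: abs(next_freqs[c] - f))
            continued.append((i, j))
            unused.remove(j)
        else:
            deaths.append(i)       # no successor: the track dies
    births = sorted(unused)        # no predecessor: a new track is born
    return continued, deaths, births

fs, n = 8000, 512
t = np.arange(n) / fs
frame = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
partials = extract_partials(frame, fs, max_partials=2)

continued, deaths, births = match_tracks(
    [100.0, 200.0, 300.0], [102.0, 305.0, 450.0], max_jump_hz=10.0)
```

In the matching example the 100 Hz and 300 Hz partials continue, the 200 Hz partial dies, and the 450 Hz partial is born.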

19
Sinusoid Death and Birth
20
Unvoiced Speech
  • Problem
  • Unvoiced speech resembles noise
  • Noise requires too many sinusoids for an accurate
    representation
  • Signal transformations (such as stretching)
    applied to closely related harmonics produce a
    sound heard as wormy or jittery
  • Unvoiced tracks span only a small number of
    windows, so interpolation methods become
    problematic
  • Solution: Bandwidth-enhanced oscillators

21
Definitions
  • Carrier signal: A sinusoidal signal transmitted
    at a steady frequency
  • Modulation: The process of varying one or more
    properties of a high-frequency periodic carrier
    waveform
  • Oscillation: Repetitive variation,
    typically in time

22
Bandwidth Enhanced Oscillation
  • Technique: A partial's energy is increased
    relative to its spectral amplitude and spread
    across adjacent frequencies
  • Details: (a) The center frequency stays the same,
    (b) Energy is spread evenly on both sides, (c)
    Random modulations
  • Parameters: widening amount, fall-off intensities
  • Result: A closer representation of the original
    signal
  • Figure: (a) Partial with no widening (b) Partial with
    moderate widening (c) Partial with large amount
    of widening

23
Algorithm Refinements
  • Add bandwidth-enhanced oscillation
  • Vary the spread of bandwidths based on the amount
    of voicing in the signal
  • Formula: $Y_t = \sum_{k=0}^{K-1} \sum_{n=0}^{N} (A_k(t) + \beta_t) \sin(k N_n F_0 + \theta_k(t))$
  • $Y_t$ is the synthesized signal at time t
  • $A_k(t)$ is the carrier frequency amplitude at time
    t
  • k is a harmonic multiple of F0 (partial); K is the
    number of partials
  • $\theta_k(t)$ is the phase of the kth partial
  • N is the number of oscillations for introducing
    noise
  • $N_n$ is the output of a random number generator to
    modulate F0
  • $\beta$ is a noise modulation factor
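
The effect of bandwidth enhancement on one partial can be sketched as follows. Note the swap: instead of the slide's formula, which modulates the frequency with random numbers, this sketch modulates the carrier's amplitude with smoothed noise, a simpler variant that also spreads energy symmetrically around an unchanged center frequency. The function name, the moving-average smoother, and all parameter values are assumptions.

```python
import numpy as np

def bandwidth_enhanced_partial(freq_hz, amp, beta, n_samples, fs, seed=0):
    """Sinusoid whose amplitude is perturbed by low-pass noise scaled by beta."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples) / fs
    noise = rng.standard_normal(n_samples)
    # A moving average keeps the spread close to the carrier frequency.
    noise = np.convolve(noise, np.ones(32) / 32, mode="same")
    return (amp + beta * noise) * np.sin(2 * np.pi * freq_hz * t)

def carrier_fraction(y, carrier_bin):
    """Share of total spectral energy that sits exactly at the carrier bin."""
    power = np.abs(np.fft.rfft(y)) ** 2
    return power[carrier_bin] / power.sum()

fs = 8000
pure = bandwidth_enhanced_partial(1000, 1.0, beta=0.0, n_samples=fs, fs=fs)
wide = bandwidth_enhanced_partial(1000, 1.0, beta=2.0, n_samples=fs, fs=fs)
```

With `beta=0` essentially all energy stays in the 1000 Hz bin; with a larger `beta` a noticeable share moves into neighboring frequencies, which is the widening the parameters above control.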