Title: Microcomputer Systems 2
1. Microcomputer Systems 2
- Time Stretching and Pitch Shifting of Audio Signals
2. Time Stretching and Pitch Shifting of Audio Signals
- Outline
- Introduction
- Techniques Used for Time Compression/Expansion and Pitch Shifting
- Comparison
- Timbre and Formants
3. Outline
- Introduction
- Frequency Shift vs. Pitch Shift Audio Examples
- Time Compression/Expansion
- Techniques Used for Time Compression/Expansion and Pitch Shifting
- The Phase Vocoder
- Related Topics
- Why Phase
- Time Domain Harmonic Scaling (TDHS)
- More recent approaches
- Comparison
- Which Method to Use
- Pitch Shifting Considerations
- Audio Examples
- Timbre and Formants
- Phase Vocoder and Formants
- Time Domain Harmonic Scaling and Formants
4. Introduction
- Time Stretching and Pitch Shifting are two dominant techniques used for speech and sound manipulation.
- Typical applications entail
- Changing the playback speed (altering the length of the signal) without altering the pitch of the voice and/or instruments
- Changing the pitch of the voice and/or instruments without changing the length of the signal.
5. Pitch Shifting
6. Pitch Shifting
- As opposed to pitch transposition achieved by a simple sample rate conversion, Pitch Shifting changes the pitch of a signal without changing its length.
- In practical applications, this is usually achieved by changing the length of the sound using one of the methods discussed next and then performing a sample rate conversion to change the pitch.
7. Introduction
- Pitch Shifting is NOT Frequency Shifting
- There is some confusion in the terminology used in the literature, as Pitch Shifting is often incorrectly called 'Frequency Shifting'.
- A true Frequency Shift (as obtained by modulating an analytic signal with a complex exponential) translates the spectrum of a sound, moving every partial by the same number of hertz, while
- Pitch Shifting dilates the spectrum, scaling every partial by the same ratio and so preserving the harmonic relationships of the sound.
- Frequency Shifting yields a metallic, inharmonic sound, which may well be an interesting special effect but is a totally inadequate process for changing the pitch of any harmonic sound except a single sine wave.
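The difference between translating and dilating a spectrum can be checked with a few lines of arithmetic. The sketch below is illustrative only: the partial frequencies, the 50 Hz offset, the 1.5 ratio and all function names are assumptions chosen for the example, not values from the slides.

```python
def frequency_shift(partials_hz, offset_hz):
    """Add the same offset to every partial (the spectrum is translated)."""
    return [f + offset_hz for f in partials_hz]

def pitch_shift(partials_hz, ratio):
    """Multiply every partial by the same ratio (the spectrum is dilated)."""
    return [f * ratio for f in partials_hz]

def is_harmonic(partials_hz, tol=1e-9):
    """True if every partial is an integer multiple of the lowest one."""
    f0 = partials_hz[0]
    return all(abs(f / f0 - round(f / f0)) < tol for f in partials_hz)

# Hypothetical harmonic sound: fundamental at 100 Hz plus three overtones.
partials = [100.0, 200.0, 300.0, 400.0]
shifted = frequency_shift(partials, 50.0)
scaled = pitch_shift(partials, 1.5)

print("frequency shifted:", shifted, "harmonic:", is_harmonic(shifted))
print("pitch shifted:    ", scaled, "harmonic:", is_harmonic(scaled))
```

After the frequency shift the partials are no longer integer multiples of the lowest one, which is the inharmonic quality described above; after the pitch shift they still are.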
8. Audio Examples of Pitch Shifting vs. Frequency Shifting
- Original Sound
- Pitch Shifted
- Frequency Shifted
9. Time Compression/Expansion
10. Time Compression/Expansion
- Time Compression/Expansion, also known as "Time Stretching", is the reciprocal process to Pitch Shifting.
- It leaves the pitch of the signal intact while changing its speed (tempo).
- This is useful when you wish to change the speed of a voiceover without altering the timbre of the voice.
11. Time Compression/Expansion
- There are several fairly good methods for time compression/expansion and pitch shifting, but most of them do not perform well on all kinds of signals or for every desired shift/stretch ratio.
- Typically, good algorithms allow pitch shifting by up to 5 semitones on average, or stretching the length by 130%.
- When time stretching and pitch shifting single-instrument recordings you might even be able to achieve a 200% time stretch, or a one-octave pitch shift, with no audible loss in quality.
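To put the semitone and percentage figures into numbers, the sketch below converts a semitone count to a frequency ratio and a stretch percentage to an output length. It assumes equal temperament (12 semitones per octave) and reads "stretching by 130%" as an output that is 130% as long as the input; both the interpretation and the function names are assumptions made for illustration.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in equal temperament."""
    return 2.0 ** (semitones / 12.0)

def stretched_length(n_samples, percent):
    """Output length when the duration becomes `percent` % of the original."""
    return round(n_samples * percent / 100.0)

five_up = semitones_to_ratio(5)    # roughly a 33% increase in frequency
octave = semitones_to_ratio(12)    # a doubling of frequency

print(f"5 semitones  -> ratio {five_up:.3f}")
print(f"12 semitones -> ratio {octave:.3f}")
print("10000 samples at 130%:", stretched_length(10000, 130))
print("10000 samples at 200%:", stretched_length(10000, 200))
```

Seen this way, a one-octave shift and a 200% stretch both correspond to a factor of two, while 5 semitones and 130% both correspond to a factor of about 1.3.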
12. Time Compression/Expansion of Speech
- Typical Goals
- To either speed up or slow down a speech signal while maintaining the approximate pitch
- Applications
- Change voice mail playback speed
- Court stenographers: play proceedings back more quickly
- Sound effects
- Etc.
13. Techniques Used for Time Compression/Expansion and Pitch Shifting
- Option 1: Change sample rate
- If you modify the playback sample rate, you can change the speed, but the pitch also changes
- Increase sample rate: higher pitch (chipmunk sound)
- Decrease sample rate: lower pitch (drawn-out echo sound)
- Option 2: Decimate or interpolate the signal
- If you change the number of samples and play them back at the original rate, the result is the same as modifying the sample rate
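Options 1 and 2 can be tried on a synthetic tone. The following sketch decimates and interpolates a 100 Hz cosine with plain linear interpolation and estimates the resulting pitch from zero crossings. The test tone, the 10 kHz rate, the crude pitch estimator and all function names are assumptions for illustration; a practical resampler would also low-pass filter before decimating.

```python
import math

def tone(freq_hz, fs, n):
    """n samples of a cosine at freq_hz, sampled at fs."""
    return [math.cos(2 * math.pi * freq_hz * k / fs) for k in range(n)]

def resample_linear(x, step):
    """Read x at positions 0, step, 2*step, ... using linear interpolation."""
    out, pos = [], 0.0
    while pos <= len(x) - 1:
        i = int(pos)
        frac = pos - i
        nxt = x[i + 1] if i + 1 < len(x) else x[i]
        out.append((1.0 - frac) * x[i] + frac * nxt)
        pos += step
    return out

def estimate_freq(x, fs):
    """Rough pitch estimate: upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if a < 0.0 <= b)
    return crossings * fs / len(x)

fs = 10000
original = tone(100.0, fs, 1000)
fast = resample_linear(original, 2.0)   # decimate: keep every other sample
slow = resample_linear(original, 0.5)   # interpolate: one new sample per gap

for name, sig in (("original", original), ("fast", fast), ("slow", slow)):
    print(name, "samples:", len(sig), "pitch (Hz):", round(estimate_freq(sig, fs)))
```

Played back at the same 10 kHz rate, the decimated signal is half as long and an octave higher, and the interpolated one is about twice as long and an octave lower: speed and pitch move together.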
14. Techniques Used for Time Compression/Expansion and Pitch Shifting
- Option 3: Use more complex methods
- These change the speed of the signal while preserving the pitch
- Short Time Fourier Transform
- Short Time Fourier Transform Magnitude
- Sinusoidal Synthesis
- Linear Prediction Synthesis
15. Techniques Used for Time Compression/Expansion and Pitch Shifting
- Currently, two principal time compression/expansion and pitch shifting schemes are employed in most of today's applications
- Phase Vocoder
- Time Domain Harmonic Scaling (TDHS)
16. Phase Vocoder
17. Phase Vocoder
- Phase Vocoder. This method was introduced by Flanagan and Golden in 1966 and digitally implemented by Portnoff ten years later.
- Portnoff, M.R. 1981a. "Short-Time Fourier Analysis of Sampled Speech." IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-29(3): 364-373.
- Portnoff, M.R. 1981b. "Time-Scale Modification of Speech Based on Short-Time Fourier Analysis." IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-29(3): 374-390.
18. Phase Vocoder
- It uses a Short Time Fourier Transform (abbreviated STFT from here on) to convert the audio signal to a complex Fourier representation.
- Since the STFT returns the frequency domain representation of the signal on a fixed frequency grid, the actual frequencies of the partial bins have to be found by converting the relative phase change between two STFT outputs to actual frequency changes.
- Note that the term 'partial' has nothing to do with the signal harmonics. In fact, an STFT will never readily give you any information about true harmonics unless you match the STFT length to the fundamental frequency of the signal, and even then the frequency domain resolution is quite different from what our ear and auditory system perceive.
- The timebase of the signal is changed by calculating the frequency changes in the Fourier domain on a different time basis, and then an inverse STFT (iSTFT) is performed to regain the time domain representation of the signal.
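The step of turning a phase change between two STFT frames into an actual frequency can be sketched for a single bin. This is a minimal illustration rather than a full phase vocoder: the 1010 Hz test tone, the 256-sample frame, the 64-sample hop, the Hann window and every function name are assumptions chosen for the example.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft_bin(frame, k):
    """Value of DFT bin k for one frame (direct sum, no FFT needed here)."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))

def wrap_phase(p):
    """Map a phase to the interval [-pi, pi)."""
    return (p + math.pi) % (2 * math.pi) - math.pi

def true_bin_frequency(signal, k, frame_len, hop, fs):
    """Estimate the frequency in bin k from the phase advance over one hop."""
    w = hann(frame_len)
    frame1 = [s * wi for s, wi in zip(signal[:frame_len], w)]
    frame2 = [s * wi for s, wi in zip(signal[hop:hop + frame_len], w)]
    dphi = cmath.phase(dft_bin(frame2, k)) - cmath.phase(dft_bin(frame1, k))
    expected = 2 * math.pi * k * hop / frame_len   # advance of the bin centre
    deviation = wrap_phase(dphi - expected)        # extra advance of the partial
    return (k + deviation * frame_len / (2 * math.pi * hop)) * fs / frame_len

fs = 10000.0
frame_len, hop = 256, 64
tone_hz = 1010.0
signal = [math.cos(2 * math.pi * tone_hz * n / fs) for n in range(frame_len + hop)]
k = round(tone_hz * frame_len / fs)   # nearest bin on the fixed grid
estimate = true_bin_frequency(signal, k, frame_len, hop, fs)

print("bin centre (Hz):", k * fs / frame_len)
print("estimated frequency (Hz):", round(estimate, 2))
```

The bin centre is several hertz away from the tone, while the phase-based estimate lands close to it. The hop has to be short enough that the phase deviation stays within half a cycle, otherwise the wrapped value is ambiguous.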
19. Phase Vocoder
- Phase vocoder algorithms are used mainly in scientific and educational software products (to show the use and limitations of the Fourier Transform) but have gained in popularity over the past few years due to improvements that greatly reduce the artifacts of the "original" phase vocoder algorithm.
- The basic phase vocoder suffers from a severe drawback: it introduces a considerable amount of artifacts, audible as 'smearing' and 'reverberation' (even at low expansion ratios), due to the non-synchronized vertical coherence of the sine and cosine basis functions that are used to change the timebase.
20. Phase Vocoder
- Puckette, Laroche and Dolson have shown that the phasiness can be greatly reduced by picking peaks in the Fourier spectrum and keeping the relative phases around the peaks unchanged. Even though this improves the quality considerably, it still renders the result somewhat phasey and diffuse when compared to time domain methods.
- Current research focuses on improving the phase vocoder by applying intra-frame sinusoidal sweep and ramp rate correction (Bristow-Johnson and Bogdanowicz) and multi-resolution phase vocoder concepts (Bonada).
21. Links to Publicly Available Vocoders
- Pointers - Phase Vocoder
- The MIT Lab Phase Vocoder
- WaveMasher - GPL/Open Source Phase Vocoder by Kenneth Sturgis
- Sculptor - A Real Time Phase Vocoder by Nick Bailey
- A Phase Vocoder implementation using Matlab
- More reading on the Phase Vocoder
- The IRCAM "Super Phase Vocoder"
- S.M. Bernsee's Pitch Shifting Using The Fourier Transform article (with C code)
22. Time Domain Harmonic Scaling (TDHS)
23. Time Domain Harmonic Scaling (TDHS)
- Time Domain Harmonic Scaling (TDHS). This is based on a method proposed by Rabiner and Schafer in 1978. It relies heavily on a correct estimate of the fundamental frequency of the sound being processed.
24. Theory
- Short Time Fourier Transform Methods
- Chapter 7 in our text (Discrete-Time Speech Signal Processing)
- Refer to the in-class notes for the mathematical theory of operation
- I will pick up from where Dr. Kepuska stopped in his notes
25. How is the Speech/Sound Signal Processed
- Link
- Ch7-Short-Time_Fourier_Transform_Analysis_and_Synthesis.ppt
26. Terminology: Basic Idea
[Figure labels: Frame Rate, Window Size]
27. Short Time Fourier Transform
- Short Time Fourier Transform
- Also called the Fairbanks method
- Extract successive short-time segments and then
discard the following ones
28. Short Time Fourier Transform
- Frame Rate factor: L
- In the frequency domain, after taking the STFT, you get
- X(nL, ω)
- Form a new signal by
- Y(nL, ω) = X(snL, ω)
- where s = compression factor
- Take the Inverse Fourier Transform
- Use the Overlap and Add method to form the new signal
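The frame-discarding rule Y(nL, ω) = X(snL, ω) can be sketched directly in the time domain, because a forward transform followed by an unmodified inverse transform returns the windowed frame unchanged. The sketch below is a simplified illustration under that assumption, with an even-length triangular window, an integer compression factor s, and hypothetical function names.

```python
def triangular_window(n):
    """Even-length triangular window whose half-overlapped copies sum to 1."""
    half = n // 2
    rise = [(2 * k + 1) / n for k in range(half)]
    return rise + rise[::-1]

def fairbanks_compress(x, window_len, hop, s):
    """Keep every s-th frame of x and overlap-add the kept frames at the original hop."""
    w = triangular_window(window_len)
    n_frames = (len(x) - window_len) // (s * hop) + 1
    y = [0.0] * ((n_frames - 1) * hop + window_len)
    for n in range(n_frames):
        start_in = s * n * hop    # analysis position s*n*L
        start_out = n * hop       # synthesis position n*L
        for i in range(window_len):
            y[start_out + i] += x[start_in + i] * w[i]
    return y

# Constant test input, so the overlap-add gain is easy to inspect.
x = [1.0] * 2000
y = fairbanks_compress(x, window_len=200, hop=100, s=2)
print("input length:", len(x), "output length:", len(y))
print("interior gain:", min(y[100:1000]), max(y[100:1000]))
```

With s = 2 the output is roughly half as long as the input, and away from the edges the overlapped windows add up to unity gain. On real speech the retained frames generally do not join at matching points in the pitch period.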
29. Short Time Fourier Transform
[Figure: X(nL, ω) and Y(nL, ω) = X(2nL, ω)]
30. Short Time Fourier Transform
[Figure: Original Windowed Sequence and New Sequence]
31. Short Time Fourier Transform
- Problems
- Pitch Synchronization
- It is highly likely that the pitch periods will
not line up properly
32. Short Time Fourier Transform Magnitude
- Short Time Fourier Transform Magnitude
- Problems with the STFT method relate directly to the linear phase component of the STFT
- A time shift corresponds to a phase change
- An alternate approach is to use only the magnitude portion of the STFT: the Short Time Fourier Transform Magnitude
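The link between a time shift and a linear phase term can be verified numerically with a small DFT. The sketch below uses a circular shift on a short made-up sequence; the sequence, the shift of 3 samples and the function names are assumptions for illustration only.

```python
import cmath
import math

def dft(x):
    """Direct DFT of a short sequence."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))
            for k in range(n)]

def circular_shift(x, n0):
    """Delay x by n0 samples, wrapping around the end."""
    return [x[(i - n0) % len(x)] for i in range(len(x))]

x = [1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
n0 = 3
X = dft(x)
X_shifted = dft(circular_shift(x, n0))

# Prediction: same magnitude, phase rotated by -2*pi*k*n0/N in bin k.
predicted = [X[k] * cmath.exp(-2j * math.pi * k * n0 / len(x)) for k in range(len(x))]
max_error = max(abs(a - b) for a, b in zip(X_shifted, predicted))
max_mag_change = max(abs(abs(a) - abs(b)) for a, b in zip(X_shifted, X))

print("largest deviation from the linear-phase prediction:", max_error)
print("largest change in magnitude:", max_mag_change)
```

The magnitudes are unaffected by the shift and only the phase changes, which is why a magnitude-only representation is insensitive to where a frame sits in time.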
33. Short Time Fourier Transform Magnitude
- Compression
- With the Fairbanks method, time slices were discarded
- Now we can just compress the time slices
- Form a new signal by
- Y(nM, ω) = X(nL, ω), where
- M = compression factor = L / speed
- i.e. for speeding up by two => M = L/2
34. Short Time Fourier Transform Magnitude
- Compression
- Take Inverse Fourier Transform
- Use Overlap and Add method to form new signal
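The hop change Y(nM, ω) = X(nL, ω) with M = L / speed can be sketched as an overlap-add in which frames are read every L samples and written every M samples. This is a simplified time-domain illustration, not the magnitude-only synthesis itself: it skips the transform step, rescales by M / W0 to compensate for the denser overlap (a choice made for this sketch), and uses hypothetical function names.

```python
def triangular_window(n):
    """Even-length triangular window (rising half mirrored)."""
    half = n // 2
    rise = [(2 * k + 1) / n for k in range(half)]
    return rise + rise[::-1]

def ola_change_hop(x, window_len, analysis_hop, speed):
    """Read frames every L samples, overlap-add them every M = L / speed samples."""
    synthesis_hop = int(analysis_hop / speed)
    w = triangular_window(window_len)
    gain = synthesis_hop / sum(w)            # M / W0, assumed normalisation
    n_frames = (len(x) - window_len) // analysis_hop + 1
    y = [0.0] * ((n_frames - 1) * synthesis_hop + window_len)
    for n in range(n_frames):
        for i in range(window_len):
            y[n * synthesis_hop + i] += x[n * analysis_hop + i] * w[i] * gain
    return y, synthesis_hop

x = [1.0] * 1000
y, m = ola_change_hop(x, window_len=200, analysis_hop=100, speed=2)
print("M =", m, "input length:", len(x), "output length:", len(y))
print("interior gain:", min(y[150:450]), max(y[150:450]))
```

Every frame is kept, so nothing is discarded as in the Fairbanks method; the frames are simply packed closer together. If M grows beyond the window length (a speed below L / window length), consecutive frames no longer overlap at all.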
35. Short Time Fourier Transform Magnitude
[Figure: X(nL, ω) and Y(nM, ω) = X(nL, ω), M = L/2]
36. Short Time Fourier Transform Magnitude
[Figure: Original Windowed Sequence and New Sequence]
37. Other Methods
- Sinusoidal Synthesis (Chapter 9)
- Time-warp the sinewave frequency track and the amplitude function
- This technique has been successful with not only speech but also music, biological, and mechanical signals
- Problems
- Does not maintain the original phase relations
- Suffers from reverberance
38. Other Methods
- Linear Prediction Synthesis
- Use Homomorphic and Linear Prediction results to modify the time base
- The book briefly mentions that this is possible, but I ran out of time before I could investigate this process further
39. Other Methods
- New Techniques
- An Internet search showed several methods that try to improve on what is currently available
- Software
- Various software programs will change the speed for you
- Adobe Audition is one of the most all-encompassing right now
40. Matlab Code - Prepare the Workspace
%
% Prepare Workspace
%
close all;
clear all;
window_size_1 = 200;
frame_rate_1 = 100;
% Speed factor (values above 1 shorten the signal)
speed = 2;
41. Matlab Code - Load the Speech Signal
%
% Load Data File
%
filename = input('Please enter the file name to be used. ');
% wavread belongs to older MATLAB releases; newer ones provide audioread
[sample_data, sample_rate, nbits] = wavread(filename);
loop_time = floor(max(size(sample_data))/frame_rate_1);
% Zero-pad so that the last window is complete
sample_data((max(size(sample_data))):(loop_time+1)*frame_rate_1) = 0;
42. Matlab Code - Develop the Window
%
% Create Windows
%
% Want windows of 25ms
% File sampled at 10,000 samples/sec
% Want a window of size 10000 * 25ms (10ms)
triangle_30ms = triang(window_size_1);
% triangle_30ms = hamming(window_size_1);
W0 = sum(triangle_30ms);
43. Matlab Code - Window the Entire Speech Signal
%
% Window the speech
%
for i = 0:loop_time-1
    window_data(:,i+1) = sample_data((frame_rate_1*i)+1:((i+2)*frame_rate_1)).*triangle_30ms;
end
44. Matlab Code - Perform the Fast Fourier Transform
%
% Create FFT
%
for i = 1:loop_time
    window_data_fft(:,i) = fft(window_data(:,i),1024);
end
45. Matlab Code - Recreate the Modified Signal
%
% Recreate Original Signal
%
% Initialize the recreated signals
reconstructed_signal(1:(loop_time+1)*frame_rate_1) = 0;
real_reconstructed_signal(1:(loop_time+1)*frame_rate_1) = 0;
modified_reconstructed_signal(1:(loop_time+3)*(frame_rate_1/speed)) = 0;
modified_reconstructed_signal_compressed(1:(loop_time+3)*(frame_rate_1/speed)) = 0;
46. Matlab Code - Recreate the Modified Signal
% Perform the ifft
for i = 1:loop_time
    recreated_data_ifft(:,i) = ifft(window_data_fft(:,i),1024);
    % Magnitude-only version (STFTM)
    real_recreated_data_ifft(:,i) = ifft(abs(window_data_fft(:,i)),1024);
    % Drop the zero padding and scale by frame rate / window energy
    truncated_recreated_data_ifft(:,i) = recreated_data_ifft(1:window_size_1,i).*(frame_rate_1/W0);
    real_truncated_recreated_data_ifft(:,i) = real_recreated_data_ifft(1:window_size_1,i).*(frame_rate_1/W0);
end
47. Matlab Code - Recreate the Modified Signal
% Get back to the original signal
for i = 0:loop_time-1
    reconstructed_signal((frame_rate_1*i)+1:((i+2)*frame_rate_1)) = reconstructed_signal((frame_rate_1*i)+1:((i+2)*frame_rate_1)) + truncated_recreated_data_ifft(:,i+1)';
    real_reconstructed_signal((frame_rate_1*i)+1:((i+2)*frame_rate_1)) = real_reconstructed_signal((frame_rate_1*i)+1:((i+2)*frame_rate_1)) + real_truncated_recreated_data_ifft(:,i+1)';
end
48. Matlab Code - Recreate the Modified Signal
% Get a modified signal by deleting certain parts (STFT)
for i = 0:(loop_time-1)/speed
    modified_reconstructed_signal((frame_rate_1*i)+1:((i+2)*frame_rate_1)) = modified_reconstructed_signal((frame_rate_1*i)+1:((i+2)*frame_rate_1)) + real_truncated_recreated_data_ifft(:,i*speed+1)';
end
49. Matlab Code - Recreate the Modified Signal
% Initialize the compressed sequence (STFTM)
modified_reconstructed_signal_compressed(1:frame_rate_1+frame_rate_1/speed+1) = truncated_recreated_data_ifft(frame_rate_1-frame_rate_1/speed:window_size_1,1)';
% Get a modified signal by compressing
for i = 0:(loop_time-2)
    modified_reconstructed_signal_compressed((frame_rate_1/speed*i)+1:(frame_rate_1/speed*i)+window_size_1) = modified_reconstructed_signal_compressed((frame_rate_1/speed*i)+1:(frame_rate_1/speed*i)+window_size_1) + real_truncated_recreated_data_ifft(:,i+2)';
end
50. Matlab Code - Plot Results
%
% Plot Results
%
figure; subplot(211);
plot(sample_data);
title('Original Speech'); v1 = axis;
hold on; subplot(212);
plot(real(modified_reconstructed_signal));
title(['STFT Synthesis w/ Speed ',num2str(speed),'X']); v2 = axis;
% Use the longer of the two time axes for both plots
if speed > 1
    subplot(211); axis(v1);
    subplot(212); axis(v1);
else
    subplot(211); axis(v2);
    subplot(212); axis(v2);
end
51. Matlab Code - Write Sound Files
%
% Write sound files
%
wavwrite(modified_reconstructed_signal,sample_rate,nbits,'C:\Classes\ECE_5525\tea party fairbanks 2x.wav');
52. Examples: Baseline Samples
- Original File
- Sample Rate 2X
- Sample Rate 0.5X
- STFT Sound file
- STFTM Sound file
53. Examples: STFT, Speed 0.5X
Sound file
54. Examples: STFT, Speed 2X
Sound file
55. Examples: STFT, Speed 4X
Sound file
56. Examples: STFTM, Speed 0.5X
Sound file
57. Examples: STFTM, Speed 2X
Sound file
58. Examples: STFTM, Speed 4X
Sound file
59. More Results
- Change in window size
- If the window size becomes too small, then a change in pitch will occur
- The window needs to be 2 to 3 pitch periods long
- I generally used 20-30 ms windows
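The window-length guideline can be turned into a quick calculation. The sketch below assumes a 10 kHz sample rate and a 100 Hz fundamental (a plausible value for a low-pitched voice, not a figure from the slides); the function names are hypothetical.

```python
def window_bounds(fs, f0_hz, min_periods=2, max_periods=3):
    """Window length range, in samples, covering 2 to 3 pitch periods."""
    period = fs / f0_hz
    return round(min_periods * period), round(max_periods * period)

def samples_to_ms(n_samples, fs):
    """Duration of n_samples at sample rate fs, in milliseconds."""
    return 1000.0 * n_samples / fs

fs = 10000
shortest, longest = window_bounds(fs, 100.0)
print("window length (samples):", shortest, "to", longest)
print("window length (ms):", samples_to_ms(shortest, fs), "to", samples_to_ms(longest, fs))
```

Under these assumptions the guideline gives 200 to 300 samples, that is 20 to 30 ms. A higher-pitched voice has shorter pitch periods and would tolerate a shorter window.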
60. More Results
- Change in frame rate
- If the frame rate (the step L between successive windows) decreases too much, too many samples overlap to give an intelligible signal
61. More Results
- Change filter (window) type
- Tried Hamming: not much perceptual difference
- Using the window energy becomes important here
- Frame Rate/W0 is not equal to one
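The remark about the window energy can be checked by computing the scale factor Frame Rate / W0 for both window types. The sketch below uses a 200-sample window and a frame rate of 100 samples; the window formulas are the usual symmetric definitions, written out by hand here as an assumption rather than taken from a particular toolbox.

```python
import math

def triangular_window(n):
    """Even-length triangular window (rising half mirrored)."""
    half = n // 2
    rise = [(2 * k + 1) / n for k in range(half)]
    return rise + rise[::-1]

def hamming_window(n):
    """Symmetric Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def ola_scale(window, frame_rate):
    """Scale factor Frame Rate / W0, where W0 is the sum of the window."""
    return frame_rate / sum(window)

window_size, frame_rate = 200, 100
tri_scale = ola_scale(triangular_window(window_size), frame_rate)
ham_scale = ola_scale(hamming_window(window_size), frame_rate)

print("triangular: Frame Rate / W0 =", round(tri_scale, 4))
print("Hamming:    Frame Rate / W0 =", round(ham_scale, 4))
```

For the triangular window the factor comes out as one, so leaving it out goes unnoticed. For the Hamming window it is below one, and omitting it would make the reconstructed signal come out slightly too loud.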
62. Conclusion
- Optimum area
- Frame rate is one half of the window size
- Window size needs to be 2 to 3 pitch periods long
- It is possible to easily change the time scale and still maintain the original pitch, although the result is not always natural sounding
63. Conclusion
- Further investigation
- What to do when you want to slow down to less than half speed
- Using the STFTM means there will be gaps between the sequences
64. Conclusion
- Further investigation
- What to do when you want to slow down to less than half speed
- Could replicate windowed segments
65. Conclusion
- Further investigation
- Use the other methods to determine quality
- Implement Sinusoidal Synthesis
- Implement Linear Predictive Synthesis using linear prediction and homomorphic methods
- Work on synchronizing pitch periods
- Shift samples so that the peaks line up
- Scott and Gerber: Synchronized Overlap and Add (SOLA)
- Cross-correlation of two samples to find peak
- Use the peaks to line up samples
- Align the window at the same relative location within a pitch period
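The cross-correlation step for lining up pitch periods can be sketched as a search for the lag at which a new segment best matches the signal already written. This is a minimal illustration of that alignment idea, not a complete SOLA implementation; the normalised correlation measure, the 50-sample test period, the 13-sample offset and the function name are assumptions.

```python
import math

def best_lag(reference, candidate, max_lag):
    """Lag in 0..max_lag at which candidate lines up best with reference."""
    best, best_score = 0, float("-inf")
    ref_energy = sum(a * a for a in reference)
    for lag in range(max_lag + 1):
        segment = candidate[lag:lag + len(reference)]
        if len(segment) < len(reference):
            break
        num = sum(a * b for a, b in zip(reference, segment))
        den = math.sqrt(ref_energy * sum(b * b for b in segment)) or 1.0
        score = num / den                 # normalised cross-correlation
        if score > best_score:
            best, best_score = lag, score
    return best

period = 50
# Tail of the output so far: two pitch periods starting at phase zero.
reference = [math.sin(2 * math.pi * i / period) for i in range(100)]
# Next segment: the same waveform, but starting 13 samples out of step.
candidate = [math.sin(2 * math.pi * (k - 13) / period) for k in range(200)]

lag = best_lag(reference, candidate, max_lag=40)
print("shift the new segment by", lag, "samples before overlap-adding")
```

Limiting the search range to less than one pitch period keeps the answer unique; a wider range would find equally good matches one period later.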
66. Questions
67. References
- Quatieri, Thomas F. Discrete-Time Speech Signal Processing. Prentice Hall, Upper Saddle River, NJ, 2002.
- Rabiner, L.R. and Schafer, R.W. Digital Processing of Speech Signals. Prentice Hall, Upper Saddle River, NJ, 1978.
- Oppenheim, A.V. and Schafer, R.W. Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1975.
- Scott, R. and Gerber, S. Pitch Synchronous Time-Compression of Speech, Proc. Conf. Speech Communications Processing, pp. 63-85, April 1972.
68. References
- Fairbanks, G., Everitt, W.L., and Jaeger, R.P. Method for Time or Frequency Compression-Expansion of Speech, IEEE Transactions on Audio and Electroacoustics, vol. AU-2, pp. 7-12, Jan. 1954.
69. Reference Material
- http://www.dspdimension.com/ by Stephan M. Bernsee