Title: A Segment-Based Probabilistic Generative Model of Speech
1. A Segment-Based Probabilistic Generative Model of Speech
- Kannan Achan
- Joint work with Sam Roweis, Aaron Hertzmann, and Brendan Frey
- University of Toronto
- http://www.psi.toronto.edu/kannan/segmental
2. Time-Domain Speech Processing
- Speech processing purely in the time domain is generally considered difficult
  - Very high variability (microphones, room acoustics)
  - Noise can be a serious problem
- A time-frequency representation is generally used instead
  - Stable
  - Spectrogram reading
- But:
  - Phase information is generally discarded
  - Timing information is lost
  - Arbitrary windowing is employed
- Figure: the same utterance recorded through different microphones
3. Still, the time domain is appealing
- No information is discarded from the input signal
- Notice that there is a lot of amazing structure in the time signal
4. The speech wave: a quick primer
- Vibrating vocal cords → voiced speech
  - The frequency of vibration is called the pitch
- Turbulent air flow → unvoiced speech
  - Coloured noise
- Silence periods
5. Goal: Segment the Waveform
- Group samples into voiced, unvoiced, or silence regions
- Segment voiced regions into glottal pulses
6. Generative Model of Speech Production
- Segments are assumed to be generated by a first-order Markov process; four types of transitions are possible:
  1. Voiced to Voiced
  2. Voiced to Unvoiced
  3. Unvoiced to Voiced
  4. Unvoiced to Unvoiced
- Given segment boundaries b, segment types v (voiced v = 1 or unvoiced v = 0) and transformations t, the generative model is a conditional Markov model (a schematic factorization is sketched below)
- For simplicity, we discard silence periods beforehand
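As a rough illustration (the notation here is mine, not taken from the slides), the conditional Markov structure described above can be written as

    p(x, v \mid b, t) = p(v_1)\, p(x_1 \mid v_1) \prod_{k=2}^{K} p(v_k \mid v_{k-1})\, p(x_k \mid x_{k-1}, v_k, t_k)

where x_k denotes the waveform samples between boundaries b_{k-1} and b_k: each segment depends only on the previous segment, its own voicing type, and its transformation.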
7. Time-domain modeling: voiced regions
- Voiced → Voiced transition: the next segment is a noisy copy of a transformed version of the previous one
- Transformations t(α, β, γ) (a small sketch follows below):
  - Time warp (α): stretch/shrink, mapping an n-vector to an αn-vector
  - Amplitude scaling (β): scalar multiplication (βx)
  - Amplitude shift (γ): scalar addition (x + γ)
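A minimal sketch of such a transformation, assuming a linear-interpolation time warp (the slides do not specify the interpolation scheme); the function name and arguments are mine.

    import numpy as np

    def transform_segment(x, new_len, beta, gamma):
        # Time warp (alpha): map the n-vector x to a new_len-vector by resampling,
        # then apply amplitude scaling (beta) and amplitude shift (gamma).
        n = len(x)
        warped = np.interp(np.linspace(0, n - 1, new_len), np.arange(n), x)
        return beta * warped + gamma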
8. Generative Model: Harmonic regions
- Successive voiced regions
- The red overlay on the second period is the prediction
- The best transformation can be found locally using linear regression (a sketch follows below)
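A hedged sketch of that local fit: for a fixed time warp, the amplitude scale and shift that best predict the next segment y from the warped previous segment x_w are given by ordinary least squares (variable names are mine).

    import numpy as np

    def fit_amplitude(x_w, y):
        # Least-squares fit of y ~ beta * x_w + gamma for one candidate warp.
        A = np.column_stack([x_w, np.ones_like(x_w)])
        (beta, gamma), *_ = np.linalg.lstsq(A, y, rcond=None)
        sse = np.sum((y - (beta * x_w + gamma)) ** 2)   # residual error of the prediction
        return beta, gamma, sse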
9. Generative Model: Non-harmonic regions
- When two successive frames are not voiced, we assume that the phase information in the latter cannot be reliably predicted → model only the power spectrum
- The model's normalized power spectrum has a learned mean λ_k and covariance; f(y) is the normalized power spectrum of y (sketched below)
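A minimal sketch of that scoring, assuming a diagonal covariance and an FFT length of my choosing; lam and psi stand in for the learned mean and variances mentioned above.

    import numpy as np

    def normalized_power_spectrum(y, n_fft=256):
        p = np.abs(np.fft.rfft(y, n_fft)) ** 2
        return p / p.sum()                      # f(y): power spectrum normalized to sum to 1

    def unvoiced_log_lik(y, lam, psi, n_fft=256):
        # Gaussian score of f(y); lam and psi are mean and per-bin variance vectors
        # of the same length as f(y).
        f = normalized_power_spectrum(y, n_fft)
        return -0.5 * np.sum((f - lam) ** 2 / psi + np.log(2 * np.pi * psi))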
10. Waveform continuity: Zero Crossings (due to John Hopfield)
- Constrain segment boundaries to start and end only at upward zero crossings
  - Ensures waveform continuity
  - Makes optimization tractable
- To further regularize the space of valid segment boundaries, we can impose constraints on the minimum and maximum segment length (see the sketch below)
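A small sketch of how the candidate boundaries might be extracted (my implementation, not the authors'): keep only the sample positions where the signal crosses zero going upward.

    import numpy as np

    def upward_zero_crossings(x):
        # Indices where the signal goes from negative to non-negative.
        return np.where((x[:-1] < 0) & (x[1:] >= 0))[0] + 1

Pairs of these indices whose spacing falls outside the allowed minimum/maximum segment length can then simply be discarded.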
11. Inference
- Computational task: infer the segment boundaries (b), segment types (v) and transformation parameters (t)
- Exact inference is intractable: the number of valid boundary configurations is exponential
- Find MAP estimates using dynamic programming (a sketch of the recursion follows below)
  - A two-dimensional dynamic-programming grid whose size is given by the number of zero crossings
  - For every valid pair of boundaries (a, b), the grid entry is the probability of (a, b) being the last segment in the best segmentation of the signal up to b
  - The grid is sparse
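A hedged sketch of such a recursion over pairs of zero crossings; the variable names and the local_score callback (log-probability of a segment given its predecessor, maximized over voicing type and transformation) are assumptions of mine.

    def segment_dp(zc, local_score, min_len, max_len):
        # zc: sorted candidate boundaries (upward zero crossings)
        # best[(a, b)]: score of the best segmentation of the signal up to b
        #               whose last segment runs from a to b
        best, back = {}, {}
        for j in range(1, len(zc)):
            for i in range(j):
                a, b = zc[i], zc[j]
                if not (min_len <= b - a <= max_len):
                    continue
                if i == 0:                                   # first segment of the signal
                    cands = [(local_score(None, a, b), None)]
                else:                                        # extend a segmentation ending at (p, a)
                    cands = [(best[(p, a)] + local_score(p, a, b), p)
                             for p in zc[:i] if (p, a) in best]
                if cands:
                    best[(a, b)], back[(a, b)] = max(cands, key=lambda c: c[0])
        return best, back    # trace back through `back` from the best final entry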
12. Learning
- Learn the parameters of the model, λ_0 and λ_1, by maximizing the expected complete log-likelihood (the posterior is the delta function computed during inference)
- The updates for λ_0 and λ_1 correspond to the normalized average spectra of voiced and unvoiced segments; the variance parameters correspond to the variances of these spectra (a sketch of the update follows)
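A minimal sketch of that update, assuming the diagonal-variance parameterization used in the earlier sketch; the function and argument names are mine.

    import numpy as np

    def update_spectral_params(spectra, labels, v):
        # Mean and per-bin variance of the normalized power spectra of all
        # segments whose inferred type equals v (0 = unvoiced, 1 = voiced).
        F = np.array([f for f, lab in zip(spectra, labels) if lab == v])
        return F.mean(axis=0), F.var(axis=0)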
13. Results: a typical segmentation
14. Time-Scale Modification
- Stochastically remove or add frames (a sketch follows below)
- Audio examples: original clip, 2× slower, 2× faster
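One way this could look, as a sketch under my own assumptions (whole pitch-period segments are repeated or dropped at random to reach the target rate):

    import numpy as np

    def time_scale(segments, rate, rng=None):
        # rate > 1 slows the clip down (repeat segments); rate < 1 speeds it up (drop segments).
        rng = rng or np.random.default_rng()
        out = []
        for seg in segments:
            copies = int(rate) + (rng.random() < rate - int(rate))
            out.extend([seg] * copies)          # zero copies removes the segment
        return np.concatenate(out) if out else np.zeros(0)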
15. Pitch Tracking
- Counting the number of samples in each voiced segment gives an estimate of the pitch period (see the sketch below)
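A tiny sketch of that read-out (my function; it assumes one glottal pulse per voiced segment): the pitch in Hz is simply the sampling rate divided by the segment length in samples.

    def pitch_track(boundaries, voiced, fs):
        # boundaries: sorted sample indices of segment edges; voiced[k] labels segment k
        periods = zip(boundaries[:-1], boundaries[1:])
        return [fs / (b1 - b0) for (b0, b1), v in zip(periods, voiced) if v]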
16. Voicing Detection
- Performed automatically in the course of the dynamic programming → we can simply read off the optimal segmentation labels
17. Quantitative Results
18. Filling in missing/corrupted regions of speech
- Our algorithm treats the corrupted region as unvoiced
- To reconstruct, fill in the corrupted region by generating new segments whose periods lie between those of the two bounding voiced regions (a sketch follows below)
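A hedged sketch of that idea, with details of my own choosing (linearly interpolated periods, each filler segment a warped copy of the nearest intact segment):

    import numpy as np

    def fill_gap(left_seg, right_seg, gap_len):
        out, used = [], 0
        while used < gap_len:
            frac = used / gap_len
            # period interpolates between the two bounding pitch periods
            period = max(1, round((1 - frac) * len(left_seg) + frac * len(right_seg)))
            src = left_seg if frac < 0.5 else right_seg
            out.append(np.interp(np.linspace(0, len(src) - 1, period),
                                 np.arange(len(src)), src))
            used += period
        return np.concatenate(out)[:gap_len]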
19. Work in progress: Declipping / Denoising
- Clipped-speech restoration
  - Saturation due to poor recording / quantization
  - Can we use the inferred transformations to complete the clipped regions?
20. Conclusion
- We have presented a simple segmental model for analyzing speech waveforms directly in the time domain
- A wide range of applications becomes possible within this single framework
- We are also investigating many other possible applications, including voice conversion, volume equalization and reverberant filtering
21. Dynamic programming in detail
22. Estimating the transformation can be seen as linear regression
23. Generative model of noisy time-domain signals
24. Voice/Gender conversion
- A very naïve approach
- The pitch of a male voice is around 110 Hz
- The pitch of a female voice is around 210 Hz
- Idea: stretch/shrink segments to decrease/increase the pitch (a sketch follows below)
- Cubic-spline smoothing along segment boundaries
- Work in progress
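A naive sketch along these lines, using SciPy's cubic-spline interpolation as a stand-in for the smoother mentioned above (the function name and resampling factor are mine):

    import numpy as np
    from scipy.interpolate import CubicSpline

    def shift_pitch(segments, factor):
        # factor < 1 shortens each pitch period (raises pitch); factor > 1 lowers it.
        out = []
        for seg in segments:
            n_new = max(2, round(len(seg) * factor))
            cs = CubicSpline(np.arange(len(seg)), seg)
            out.append(cs(np.linspace(0, len(seg) - 1, n_new)))
        return np.concatenate(out)

For example, a factor of roughly 110/210 would move a ~110 Hz male pitch toward the ~210 Hz female range quoted above.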
25. Current Work
- Multiple sound sources
  - Several templates evolving simultaneously
  - Requires a more complicated model
  - Example: voice + background music, and its time-scale-modified (slower) version
- Denoising
- Compression
- Companding (volume normalization)
- Reverberant filtering