Statistical and Signal Processing Approaches for Voicing Detection

SPOKEN LANGUAGE SYSTEMS. MIT Computer Science and Artificial ...

1
Statistical and Signal Processing Approaches for
Voicing Detection
  • Alex Park
  • July 25th, 2003

2
Overview
  • Motivation and background for voicing detection
  • Overview of recent methods
      • Signal processing approaches
      • Statistical approaches
  • Performance comparison of voicing detection
    methods
      • Detection error rates on small task
      • Example outputs
  • Conclusions and future work

Introduction
3
Motivation
  • Voicing is not necessary for speech understanding
      • E.g. whispered speech: excitation is provided
        by aspiration
      • E.g. sinewave speech: no periodic excitation,
        resonances produced directly
  • What is the value of adding voicing to the speech
    signal?
      • Separability? Pitch is useful for distinguishing
        between concurrent speakers and background
      • Redundancy? Harmonics provide regular structure
        from which we can detect speech in multiple bands
      • Robustness? Unvoiced speech has lower SNR than
        voiced speech
          • Whispering is intended to prevent unwanted
            listeners from hearing
          • Shouting/singing is not possible without
            voicing
          • Low frequencies are less attenuated over
            distance
  • Current speech recognition systems typically
    discard voicing information in the front end
    because:
      • Energy is environment dependent, pitch is speaker
        dependent
      • Vocal tract configuration carries most phonetic
        information

4
Background
  • Voicing is produced by periodic vibrations of the
    vocal folds
      • In time, voiced speech consists of repeated
        segments
      • In frequency, the spectrum has harmonic structure
        shaped by formant resonances
  • Pitch estimation and the voicing decision can be made
      • In time, using repetition rate and similarity of
        pitch periods
      • In frequency, using spacing and relative height
        of harmonic peaks

[Figure panels: Time Domain, Frequency Domain]
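
The frequency-domain cue above (spacing and relative height of harmonic peaks) can be illustrated with a minimal sketch. It synthesizes a harmonic frame and lists its spectral peaks. The sample rate, the 125 Hz fundamental, the frame length, the `rel_height` cutoff, and the function names are all assumptions chosen for illustration, not values from the presentation.

```python
import cmath
import math

def dft_magnitude(frame):
    """Magnitude of the DFT for bins 0..N/2 (naive O(N^2), fine for one short frame)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def harmonic_peaks(mags, fs, n, rel_height=0.1):
    """Frequencies (Hz) of local maxima taller than rel_height times the global maximum."""
    floor = rel_height * max(mags)
    return [k * fs / n
            for k in range(1, len(mags) - 1)
            if mags[k] > floor and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]

# Assumed toy signal: five harmonics of 125 Hz with 1/h amplitudes, 40 ms at 8 kHz.
fs, f0, n = 8000, 125.0, 320
frame = [sum(math.sin(2 * math.pi * h * f0 * t / fs) / h for h in range(1, 6))
         for t in range(n)]

peaks = harmonic_peaks(dft_magnitude(frame), fs, n)
spacing = [b - a for a, b in zip(peaks, peaks[1:])]
print(peaks)    # peak frequencies in Hz
print(spacing)  # a regular spacing suggests voicing; the spacing estimates the pitch
```

Real speech frames would normally be windowed first, and the peaks would be shaped by the formant envelope rather than by a simple 1/h rolloff.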
5
Signal Processing Approaches
  • Signal processing approaches are marked by the lack
    of a training phase
  • Voicing detection is typically paired with pitch
    extraction
  • Well-known approach: peak-picking (spectral or
    temporal)
      • Usually followed by smoothing of gross errors via
        dynamic programming
  • Many proposed solutions
      • Spectral
          • Cepstral pitch tracking
          • Harmonic Product Sum
          • Logarithmic DFT pitch tracker (Wang)
      • Temporal
          • Autocorrelation
          • Sinusoid matching (Saul)
          • Synchrony (Seneff)
      • Exotic methods
          • Image-based pitch tracking (Quatieri)
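
Peak-picking followed by dynamic-programming smoothing can be sketched as a small Viterbi search over per-frame pitch candidates. The cost function (negative candidate score plus a penalty on octave-scale jumps), the `jump_penalty` value, and the candidate lists are assumptions for illustration; the presentation does not specify them.

```python
import math

def smooth_pitch_track(candidates, jump_penalty=2.0):
    """Pick one (f0_hz, score) candidate per frame by Viterbi search.

    candidates: list of frames, each a non-empty list of (f0_hz, score) pairs.
    Path cost = -sum(score) + jump_penalty * sum(|log2(f0_t / f0_{t-1})|).
    """
    cost = [-score for _, score in candidates[0]]
    back = []
    for prev, cur in zip(candidates, candidates[1:]):
        new_cost, pointers = [], []
        for f_cur, s_cur in cur:
            # Cost of reaching this candidate from each candidate in the previous frame.
            options = [cost[i] + jump_penalty * abs(math.log2(f_cur / f_prev))
                       for i, (f_prev, _) in enumerate(prev)]
            best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[best] - s_cur)
            pointers.append(best)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final state.
    idx = min(range(len(cost)), key=cost.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [candidates[t][i][0] for t, i in enumerate(path)]

# Hypothetical candidates: in the third frame the octave-doubled peak scores highest.
frames = [
    [(120.0, 0.9), (240.0, 0.5)],
    [(122.0, 0.8), (244.0, 0.6)],
    [(248.0, 0.9), (124.0, 0.8)],
    [(125.0, 0.9), (250.0, 0.4)],
]
track = smooth_pitch_track(frames)
print(track)
```

Picking the top-scoring peak independently in each frame would jump to 248 Hz in the third frame; the transition penalty keeps the track on the lower octave.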

Signal Processing
6
I. Autocorrelation
  • Temporal domain approach, used in ESPS tool
    get_f0
  • Compute inner product of signal with shifted
    version of itself
  • If $x[n]$, $n = 0, \dots, N-1$, is a speech frame,
    then the autocorrelation at lag $k$ is
    $R_x[k] = \sum_{n=0}^{N-1-k} x[n]\, x[n+k]$

[Figure: a speech frame and its short-time autocorrelation; peaks occur at multiples of the fundamental period]
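
A minimal sketch of autocorrelation-based voicing detection follows. It is not the ESPS get_f0 algorithm: the pitch search range, the 0.5 decision threshold, and the function names are assumptions made for illustration.

```python
import math
import random

def autocorrelation(frame, max_lag):
    """Inner product of the frame with a shifted copy of itself, for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def detect_voicing(frame, fs, f_min=60.0, f_max=400.0, threshold=0.5):
    """Return (is_voiced, f0_hz) from the largest normalized autocorrelation peak.

    The frame must be longer than fs / f_min samples. The threshold is an assumed value.
    """
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    r = autocorrelation(frame, lag_max)
    if r[0] <= 0.0:
        return False, 0.0  # silent frame
    best_lag = max(range(lag_min, lag_max + 1), key=lambda k: r[k])
    strength = r[best_lag] / r[0]
    voiced = strength > threshold
    return voiced, (fs / best_lag if voiced else 0.0)

fs = 8000
# Assumed test frames: a 100 Hz harmonic signal and white noise, 50 ms each.
voiced_frame = [sum(math.sin(2 * math.pi * h * 100.0 * t / fs) / h for h in (1, 2, 3))
                for t in range(400)]
rng = random.Random(0)
noise_frame = [rng.gauss(0.0, 1.0) for _ in range(400)]

print(detect_voicing(voiced_frame, fs))
print(detect_voicing(noise_frame, fs))
```

Restricting the search to a plausible pitch range excludes the trivial peak at lag 0, and with this frame length it also keeps the peak at twice the period out of range.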
7
II. Band-limited Sinusoid Fitting (Saul 2002)
  • Filter bandwidths allow at least one filter to
    resolve single harmonics
  • Frames of filtered signals are fit with a sinusoid of
    frequency ω and error u
  • At each step, the lowest u gives the voicing
    probability and the corresponding ω gives the
    pitch estimate
  • Algorithm is fast and gives accurate pitch tracks
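
The idea of fitting a sinusoid to each band and keeping the best fit can be sketched as follows. This is one simple least-squares formulation and not necessarily the one used in the cited work: it relies on the identity x[n-1] + x[n+1] = 2 cos(ω) x[n], which holds for any pure sinusoid. The error normalization, the function name, and the two stand-in "bands" (no filterbank is implemented here) are assumptions.

```python
import math
import random

def fit_sinusoid(frame, fs):
    """Least-squares fit of x[n-1] + x[n+1] = c * x[n], with c = 2*cos(w).

    Returns (freq_hz, error). The error is near 0 for a pure tone and
    larger for noise; its normalization is a hypothetical choice.
    """
    mid = frame[1:-1]
    neighbours = [a + b for a, b in zip(frame[:-2], frame[2:])]
    energy = sum(x * x for x in mid)
    if energy == 0.0:
        return 0.0, 1.0  # silent frame: report a poor fit
    c = sum(x * s for x, s in zip(mid, neighbours)) / energy
    residual = sum((s - c * x) ** 2 for x, s in zip(mid, neighbours))
    error = residual / (4.0 * energy)
    w = math.acos(max(-1.0, min(1.0, c / 2.0)))  # radians per sample
    return w * fs / (2.0 * math.pi), error

fs = 8000
# Stand-ins for two filterbank outputs: one band resolves a 150 Hz harmonic, one holds noise.
tone_band = [math.sin(2 * math.pi * 150.0 * n / fs) for n in range(200)]
rng = random.Random(0)
noise_band = [rng.gauss(0.0, 1.0) for _ in range(200)]

fits = [fit_sinusoid(band, fs) for band in (noise_band, tone_band)]
pitch_hz, error = min(fits, key=lambda fit: fit[1])  # band with the lowest error wins
print(pitch_hz, error)
```

Because the fit is a closed-form ratio of two sums per band, it is cheap to evaluate at every frame, which is consistent with the slide's remark that the approach is fast.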

8
Statistical Approaches
  • Statistical voicing detectors are not strictly
    dependent on spectral features (but these are the
    features widely used)
  • Training data useful for capturing acoustic cues
    of voicing not explicitly specified in signal
    processing approaches
  • Possible classifiers suitable for voicing
    detection include
      • GMM classifier (w/ MFCC features)
      • Structured Bayesian network (alternative
        features)
      • Neural network classifier
      • Support vector machines

Statistical
9
I. GMM Classifier
  • Train two GMMs, p(x|V) and p(x|UV), using
    frame-level feature vectors (MFCCs plus surrounding
    frames, for Δs and ΔΔs)
  • 50 mixtures each, dimensions reduced to 50 via
    PCA
  • Using Bayes' rule, the voicing score is given by
    the likelihood ratio
  • Discriminative framework is useful because it
    uses knowledge of unvoiced speech characteristics
    in making the decision
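
The likelihood-ratio score can be sketched with two small diagonal-covariance GMMs. The model parameters below are hand-set toy values in two dimensions, purely for illustration; the models on the slide would be trained on MFCC-based features with 50 mixtures each, which is not reproduced here. The function names are assumptions.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, gmm):
    """gmm: list of (weight, mean, var) components. Uses log-sum-exp for stability."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))

def voicing_score(x, gmm_voiced, gmm_unvoiced):
    """Log-likelihood ratio log p(x|V) - log p(x|UV); positive favours voiced."""
    return gmm_log_likelihood(x, gmm_voiced) - gmm_log_likelihood(x, gmm_unvoiced)

# Hypothetical toy models (not trained on any data).
gmm_v = [(0.5, [2.0, 0.0], [1.0, 1.0]), (0.5, [3.0, 1.0], [1.0, 1.0])]
gmm_uv = [(1.0, [-2.0, 0.0], [1.0, 1.0])]

score_a = voicing_score([2.5, 0.5], gmm_v, gmm_uv)
score_b = voicing_score([-2.0, 0.0], gmm_v, gmm_uv)
print(score_a > 0.0, score_b > 0.0)
```

Comparing the score against a threshold of 0 corresponds to equal priors on the two classes; a different threshold shifts the operating point.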

10
II. Bayesian Network (Saul/Rahim/Allen 1999)
  • Feature vector constructed for frames of
    narrowband speech
      • (Autocorrelation peaks and valleys) and (SNR
        estimate): 5 dims/band/frame
  • Individual voicing decisions made on each channel
      • Channel sigmoid decision weights (θs) trained
        via EM algorithm
  • Overall voicing decision triggered by positive
    example in individual channels
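
The per-channel decisions and their combination can be sketched as below. Reading "triggered by positive example in individual channels" as a noisy-OR over channel probabilities is an assumption, and the weights here are hand-picked rather than trained via EM. Feature values and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def channel_probability(features, weights, bias):
    """Sigmoid voicing decision for one narrowband channel."""
    return sigmoid(bias + sum(w * f for w, f in zip(weights, features)))

def overall_voicing_probability(channel_probs):
    """Noisy-OR: the frame is voiced if at least one channel fires."""
    none_fire = 1.0
    for p in channel_probs:
        none_fire *= 1.0 - p
    return 1.0 - none_fire

# Hypothetical 5-dimensional features per band (the slide gives 5 dims/band/frame).
weights, bias = [1.0] * 5, -5.0
quiet_band = [0.1] * 5
strong_band = [2.0] * 5

all_quiet = overall_voicing_probability(
    [channel_probability(quiet_band, weights, bias) for _ in range(3)])
one_strong = overall_voicing_probability(
    [channel_probability(band, weights, bias)
     for band in (quiet_band, strong_band, quiet_band)])
print(all_quiet, one_strong)
```

With this combination rule a single clean band is enough to flag the frame as voiced, which matches the robustness argument for detecting voicing in multiple bands.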

11
Comparison: Matched Conditions
  • Trained on 410 TIMIT sentences from 40 speakers
    (126k frames)
  • Evaluated on 100 TIMIT sentences from 10 speakers
    (28k frames)
  • Speech was resampled to 8 kHz; phone labels used
    as voicing reference
  • Also evaluated on Keele database (laryngograph
    reference)
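
Frame-level detection error rates against a voicing reference can be computed as in the sketch below. The exact scoring procedure used for the comparison is not given in the transcript, so the three rates and their names are assumptions; the label sequences are made up.

```python
def voicing_error_rates(reference, hypothesis):
    """Compare per-frame voicing labels (truthy = voiced) against a reference."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    n_voiced = sum(1 for r in reference if r)
    n_unvoiced = len(reference) - n_voiced
    misses = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    false_alarms = sum(1 for r, h in zip(reference, hypothesis) if h and not r)
    return {
        "voiced_as_unvoiced": misses / n_voiced if n_voiced else 0.0,
        "unvoiced_as_voiced": false_alarms / n_unvoiced if n_unvoiced else 0.0,
        "overall": (misses + false_alarms) / len(reference) if reference else 0.0,
    }

# Made-up labels: one missed voiced frame and one false alarm out of eight frames.
ref = [1, 1, 1, 0, 0, 0, 1, 1]
hyp = [1, 0, 1, 0, 1, 0, 1, 1]
rates = voicing_error_rates(ref, hyp)
print(rates)
```

Reporting the two error types separately matters because voiced and unvoiced frames are usually not equally frequent, so the overall rate alone can hide a biased detector.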

Results
12
Sample Outputs: Matched Conditions
  • Some example voicing tracks output by individual
    methods

13
Comparison: Mismatched Conditions
  • Evaluated with different kinds of signal
    corruption
  • Condition not known a priori → same threshold as
    before
      • Threshold can be adaptive to environment (same as
        modifying output prob.)
  • Overall error rates are unsatisfactory
  • GMM classifier has best performance on clean
    data, but unpredictable results in varied
    conditions
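
One way to read the remark that an adaptive threshold is the same as modifying the output probability is sketched below, assuming the detector thresholds a log-likelihood ratio. The symbols $\tau$, $P(V)$, and $P(UV)$ are introduced here for illustration and do not appear on the slide.

```latex
\begin{align*}
P(V \mid x) > P(UV \mid x)
  &\iff p(x \mid V)\,P(V) > p(x \mid UV)\,P(UV) \\
  &\iff \log\frac{p(x \mid V)}{p(x \mid UV)} > \log\frac{P(UV)}{P(V)} \equiv \tau
\end{align*}
```

Under this reading, raising $\tau$ in a noisy environment has the same effect as lowering the assumed prior $P(V)$.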

[Figure panels: GMM, Autocorrelation, Sinusoid Fit]
14
Sample Outputs: Mismatched Conditions
  • Voicing tracks on NTIMIT utterance

15
Conclusions and Future Work
  • Error rates are still high compared with
    literature
      • Post-processing to remove stray frames
      • Problem with scoring procedure?
  • Statistical framework with knowledge-based
    features
      • Weight contribution of multiple detectors using
        SNR-based variable
  • Using same approach, apply to phonetic detectors
    for voiced speech
      • Nasality: broad F1 bandwidth, low spectral slope
        in F1-F2 region, stable low-frequency energy
      • Rounding: low F1, F2
      • Retroflex: low F3, rising formants
  • Combine feature streams with SNR-based weight as
    input to HMM
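
Post-processing to remove stray frames could be as simple as a median filter over the binary voicing track. The presentation does not say which post-processing was intended, so the filter, its width, and the example track are assumptions.

```python
def median_smooth(track, width=3):
    """Median-filter a 0/1 voicing track; windows are truncated at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(track)):
        window = sorted(track[max(0, i - half): i + half + 1])
        smoothed.append(window[len(window) // 2])
    return smoothed

# Made-up track with one stray voiced frame and one stray unvoiced frame.
raw = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
clean = median_smooth(raw)
print(clean)
```

A wider window removes longer stray runs at the cost of blurring genuine short voiced or unvoiced segments.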

Conclusions
16
References
  • L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun
    (2003). "Real time voice processing with
    audiovisual feedback: Toward autonomous agents
    with perfect pitch," in S. Becker, S. Thrun, and
    K. Obermayer (eds.), Advances in Neural
    Information Processing Systems 15. MIT Press,
    Cambridge, MA.
  • L. K. Saul, M. G. Rahim, and J. B. Allen
    (2001). "A statistical model for robust
    integration of narrowband cues in speech,"
    Computer Speech and Language 15(2), 175-194.
  • C. Wang and S. Seneff (2000). "Robust Pitch
    Tracking for Prosodic Modeling in Telephone
    Speech," in Proc. ICASSP 00, Istanbul, Turkey.
  • S. Seneff (1985). "Pitch and spectral analysis of
    speech based on an auditory synchrony model,"
    Ph.D. Thesis, Dept. of Electrical Engineering,
    M.I.T., Cambridge, MA.
  • T. F. Quatieri (2002). "2-D Processing of Speech
    with Application to Pitch Estimation," in Proc.
    ICSLP 02, Denver, Colorado.
