1
Neuro-Fuzzy and Soft Computing for Speaker Recognition
1999 CSIST
  • J.-S. Roger Jang
  • CS Dept., Tsing Hua Univ., Taiwan
  • http://www.cs.nthu.edu.tw/~jang
  • jang@cs.nthu.edu.tw

2
Outline
  • Introduction
  • Data acquisition
  • Feature extraction
  • Data reduction
  • Condensing, editing, fuzzy clustering
  • Fuzzy classifier refinement
  • Random search
  • Experiments
  • Conclusions and future work

3
Speaker Recognition
  • Types
  • Text-dependent or text-independent
  • Closed-set or open-set
  • Methodologies involved
  • Digital signal processing
  • Pattern recognition
  • Clustering or vector quantization
  • Nonlinear optimization
  • Neuro-fuzzy techniques

4
Data Acquisition
  • Recording
  • Recording program of Windows 95/98
  • 8 kHz sampling rate, 8-bit resolution
    (worse than phone quality)
  • A 5-second speech signal takes about 40 KB (8,000 samples/s × 1 byte = 8 KB/s).
  • Samples
  • Speaker 1
  • Speaker 2
  • Speaker 3

5
Feature Extraction
  • Major steps
  • Overlapping frames of 256 points (32 ms)
  • Hamming windowing to lessen distortion
  • cepstrum(frame) = real(IFFT(log |FFT(frame)|)) (see the code sketch after this list)
  • FFT: Fast Fourier Transform
  • IFFT: Inverse FFT
  • A feature vector consists of the first 14
    cepstral coefficients of a frame.
  • Optional steps
  • Frequency-selective filter to reduce noise
  • Mel-warped cepstral coefficients
  • Feature-wise normalization
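
A minimal NumPy sketch of the major steps above (optional steps omitted). The 50% frame overlap and the small epsilon inside the log are assumptions; the deck says only "overlapping frames". Names are illustrative:

```python
import numpy as np

def cepstral_features(signal, frame_len=256, step=128, n_coeffs=14):
    """Frame the signal, apply a Hamming window, and keep the first
    n_coeffs real cepstral coefficients of each frame."""
    window = np.hamming(frame_len)             # lessen spectral distortion
    n_frames = 1 + (len(signal) - frame_len) // step
    features = []
    for i in range(n_frames):
        frame = signal[i * step : i * step + frame_len] * window
        spectrum = np.abs(np.fft.fft(frame))
        # cepstrum(frame) = real(IFFT(log|FFT(frame)|))
        ceps = np.fft.ifft(np.log(spectrum + 1e-10)).real
        features.append(ceps[:n_coeffs])       # first 14 coefficients
    return np.array(features)                  # shape: (n_frames, n_coeffs)
```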

6
Physical Meanings of Cepstrum
7
Feature Extraction
(Block diagram of the pipeline:)
2.39-sec speech signal → low-pass filter → take frames → 148 frames of 256
points → Hamming windowing → FFT → abs → log → resample → IFFT → real →
first 14 coefficients → normalization → 148 feature vectors of length 14
8
Feature Extraction
  • Upper: speaker 1; lower: speaker 2

9
Pattern Recognition
  • Schematic diagram

Sample speech → Feature Extraction → Data Reduction → Sample set
Test speech → Feature Extraction → Classifier (using the sample set) → Recognized speaker
10
Pattern Recognition Methods
  • K-NNR (K-nearest neighbor rule)
  • Euclidean distance
  • Mahalanobis distance
  • Maximum log likelihood
  • Adaptive networks
  • Multilayer perceptrons
  • Radial basis function networks
  • Fuzzy classifiers with random search

11
K-Nearest Neighbor Rule (K-NNR)
  • Steps
  • 1. Find the k nearest neighbors of a given point.
  • 2. Assign the point the majority class among these k neighbors by a
    voting mechanism (see the sketch after the figure below).

(Figure: class-A points, class-B points, and a point of unknown class in the Feature 1 vs. Feature 2 plane.)
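
A sketch of the two steps in Python, assuming Euclidean distance; names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_classify(x, samples, labels, k=5):
    """Step 1: find the k nearest neighbors of x among the samples.
    Step 2: assign x the majority class among those neighbors."""
    dists = np.linalg.norm(samples - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```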
12
Decision Boundary for 1-NNR
  • Voronoi diagram: the decision boundary is piecewise linear

13
Distance Metrics
  • Euclidean distance: d(x, y) = sqrt( (x - y)^T (x - y) )
  • Mahalanobis distance: d(x, y) = sqrt( (x - y)^T Σ^(-1) (x - y) ),
    where Σ is the covariance matrix (both sketched below)
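
Both metrics in a few lines of NumPy (a sketch; `cov` stands for the covariance matrix Σ):

```python
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

def mahalanobis(x, y, cov):
    """sqrt((x - y)^T Σ^{-1} (x - y)); reduces to the Euclidean
    distance when Σ is the identity matrix."""
    diff = x - y
    return np.sqrt(diff @ np.linalg.solve(cov, diff))
```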

14
Maximum Log Likelihood
  • Multivariate normal distribution N(μ, Σ)
  • Likelihood of x in class j:
    p(x | j) = (2π)^(-d/2) |Σ_j|^(-1/2) exp( -(1/2)(x - μ_j)^T Σ_j^(-1) (x - μ_j) )
  • Log likelihood:
    ln p(x | j) = -(d/2) ln 2π - (1/2) ln |Σ_j| - (1/2)(x - μ_j)^T Σ_j^(-1) (x - μ_j)

15
Maximum Log Likelihood
  • Likelihood of X = {x1, ..., xn} in class j: p(X | j) = Π_{i=1..n} p(x_i | j)
  • Log likelihood: ln p(X | j) = Σ_{i=1..n} ln p(x_i | j) (sketched below)
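
A sketch of the class score, assuming each class j is modeled by a single Gaussian N(μ_j, Σ_j) estimated from its sample frames:

```python
import numpy as np

def log_likelihood(X, mu, cov):
    """Sum of Gaussian log likelihoods of the frames X = {x1, ..., xn};
    the test utterance is assigned to the class maximizing this score."""
    d = len(mu)
    diff = X - mu                                      # shape (n, d)
    mahal_sq = np.sum(diff @ np.linalg.inv(cov) * diff, axis=1)
    _, logdet = np.linalg.slogdet(cov)                 # stable log|Σ|
    log_norm = -0.5 * (d * np.log(2 * np.pi) + logdet)
    return np.sum(log_norm - 0.5 * mahal_sq)
```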

16
Data Reduction
  • Purpose
  • Reduce NNR computation load
  • Increase data consistency
  • Techniques
  • To reduce data size
  • Editing: eliminate noisy (boundary) data
  • Condensing: eliminate redundant (deeply embedded) data
  • Vector quantization: find representative data
  • To reduce data dimensions
  • Principal component projection: reduce the dimensions of the feature sets
  • Discriminant projection: find the set of vectors that best separates the patterns

17
Editing
  • To remove noisy (boundary) data

18
Condensing
  • To remove redundant (deeply embedded) data (editing and condensing are both sketched below)
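
The deck does not name the exact algorithms; a common choice is Wilson editing followed by Hart condensing, sketched here with NumPy arrays X (features) and y (non-negative integer labels):

```python
import numpy as np

def wilson_edit(X, y, k=3):
    """Editing: drop samples misclassified by their k nearest
    neighbors, i.e., noisy points near the class boundary."""
    keep = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        nearest = np.argsort(dists)[1:k + 1]      # skip the point itself
        if np.bincount(y[nearest]).argmax() == y[i]:
            keep.append(i)
    return X[keep], y[keep]

def hart_condense(X, y):
    """Condensing: keep only samples the current subset misclassifies
    by 1-NNR, discarding redundant, deeply embedded points."""
    keep = [0]                                    # arbitrary seed point
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            dists = np.linalg.norm(X[keep] - X[i], axis=1)
            if y[keep][dists.argmin()] != y[i]:
                keep.append(i)
                changed = True
    return X[keep], y[keep]
```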

19
VQ: Fuzzy C-Means Clustering
  • A point can belong to several clusters, each with a different degree of
    membership (see the sketch below).
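
A minimal FCM sketch using the standard alternating updates; the fuzziness exponent `m` and iteration count are assumed defaults, not from the deck:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Fuzzy C-Means: U[i, j] is the degree to which point i
    belongs to cluster j; centers are membership-weighted means."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)             # memberships sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = 1.0 / d ** (2 / (m - 1))              # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U
```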

20
Fuzzy Classifier
  • Rule base
  • If x is close to (A1 or A2 or A3), then x belongs to class A.
  • If x is close to (B1 or B2 or B3), then x belongs to class B.
(Figure: class-A prototypes A1-A3 and class-B prototypes B1-B3, each with MF width v.)
A fuzzy classifier is equivalent to a 1-NNR if all MFs have the same width.
21
Fuzzy Classifier
  • Adaptive network representation

(Network diagram: inputs x1 and x2 feed multidimensional MFs A1-A3 and B1-B3;
a max node pools each class's MF outputs, and their difference gives the output y.)
x = (x1, x2) belongs to class A if y > 0, and to class B if y < 0
(see the sketch below).
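
A sketch of this classifier for two classes, assuming Gaussian MFs centered at the prototypes (the deck shows only prototype centers and widths):

```python
import numpy as np

def fuzzy_classify(x, centers_a, centers_b, widths_a, widths_b):
    """One rule per prototype; 'or' is realized as max, and the
    decision is the sign of y = max_A - max_B."""
    def firing(x, centers, widths):
        d2 = np.sum((centers - x) ** 2, axis=1)   # distance to prototypes
        return np.max(np.exp(-d2 / (2 * widths ** 2)))
    y = firing(x, centers_a, widths_a) - firing(x, centers_b, widths_b)
    return 'A' if y > 0 else 'B'
```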
22
Refining Fuzzy Classifier
(Figure: left, MFs with the same width v; right, MF widths refined via random
search. A minimal sketch of such a search follows.)
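
A minimal random-search sketch; `loss` would be the classification error of the fuzzy classifier as a function of the MF widths, and the step size `sigma` is an assumed tuning parameter:

```python
import numpy as np

def random_search(loss, widths0, n_iter=200, sigma=0.05, seed=0):
    """Derivative-free refinement: perturb the widths and keep the
    candidate whenever the loss decreases."""
    rng = np.random.default_rng(seed)
    best, best_loss = widths0.copy(), loss(widths0)
    for _ in range(n_iter):
        cand = np.abs(best + sigma * rng.standard_normal(best.shape))
        cand_loss = loss(cand)                 # widths kept positive above
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best
```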
23
Principal Component Projection
  • Eigenvalues of the covariance matrix: λ1 > λ2 > λ3 > ... > λd
  • (Figures: projection onto (v1, v2) and onto (v3, v4); a sketch follows.)
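
A PCA projection sketch in NumPy:

```python
import numpy as np

def pca_project(X, k=2):
    """Project features onto the k leading eigenvectors of the
    covariance matrix (eigenvalues sorted λ1 > λ2 > ...)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # re-sort descending
    return Xc @ eigvecs[:, order[:k]]
```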

24
Discriminant Projection
  • Best discriminant vectors v1, v2, ..., vd
  • (Figures: projection onto (v1, v2) and onto (v3, v4); a sketch follows.)
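
A standard multiclass Fisher discriminant sketch; the deck does not spell out the criterion, so this assumes the usual between/within scatter formulation:

```python
import numpy as np
from scipy.linalg import eigh

def discriminant_project(X, y, k=2):
    """Project onto the leading generalized eigenvectors of
    (S_between, S_within): the directions best separating the classes."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))    # within-class scatter
    Sb = np.zeros_like(Sw)                     # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    eigvals, eigvecs = eigh(Sb, Sw)            # generalized eigenproblem
    order = np.argsort(eigvals)[::-1]
    return X @ eigvecs[:, order[:k]]
```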

25
Experiments
  • Experimental data
  • Sample size: 578; test size: 1063; number of classes: 3
  • Sample data per speaker: 148 / 280 / 150
  • Test data per speaker: 256 / 457 / 350
  • Experiments
  • K-NNR with all sample data
  • K-NNR with reduced sample data
  • Fuzzy classifier refined via random search

26
Performance Using All Samples
  • Sample size: 578
  • Test size: 1063
(Figures: recognition rates as functions of the speech signal length; confusion matrix.)

27
Performance After Editing and Condensing
  • Sample size: 497 after editing, 64 after condensing
  • Test size: 1063
(Figures: recognition rates as functions of the speech signal length; confusion matrix.)

28
Performance After VQ (FCM)
  • Sample size: 60 after FCM
  • Test size: 1063
(Figures: recognition rates as functions of the speech signal length; confusion matrix.)
29
Performance After VQ + Random Search
  • Sample (rule) size: 60, tuned via random search
  • Test size: 1063
(Figures: recognition rates as functions of the speech signal length; confusion matrix.)
30
On-line Recognition Hardware Setup
31
Conclusions
  • Performance after editing and condensing is unpredictable.
  • Performance after VQ (FCM) is consistently better than that after
    editing and condensing.
  • A simple derivative-free optimization method, i.e., random search,
    can significantly enhance performance.

32
Future work
  • Data dimension reduction
  • Other feature extraction methods (e.g., LPC)
  • Scale up the problem size
  • More speakers (ten or more)
  • Other vocal signals (laughter, coughs, singing,
    etc.)
  • Other biometric identification using
  • Faces
  • Fingerprints and palm prints
  • Retina and iris scans
  • Hand shapes/sizes/proportions
  • Hand vein distributions