Title: Neuro-Fuzzy and Soft Computing for Speaker Recognition
1 Neuro-Fuzzy and Soft Computing for Speaker Recognition
1999 CSIST
- J.-S. Roger Jang
- CS Dept., Tsing Hua Univ., Taiwan
- http://www.cs.nthu.edu.tw/jang
- jang_at_cs.nthu.edu.tw
2 Outline
- Introduction
- Data acquisition
- Feature extraction
- Data reduction
- Condensing, editing, fuzzy clustering
- Fuzzy classifier refinement
- Random search
- Experiments
- Conclusions and future work
3 Speaker Recognition
- Types
- Text-dependent or text-independent
- Closed-set or open-set
- Methodologies involved
- Digital signal processing
- Pattern recognition
- Clustering or vector quantization
- Nonlinear optimization
- Neuro-fuzzy techniques
4 Data Acquisition
- Recording
- Recording program of Windows 95/98
- 8 kHz sampling rate, 8-bit resolution (worse than phone quality)
- A 5-second speech signal takes about 40 KB (8,000 samples/s x 1 byte x 5 s = 40,000 bytes)
- Samples
- Speaker 1
- Speaker 2
- Speaker 3
5 Feature Extraction
- Major steps
- Overlapping frames of 256 points (32 ms)
- Hamming windowing to lessen distortion
- cepstrum(frame) = real(IFFT(log|FFT(frame)|)), as sketched in code below
- FFT: Fast Fourier Transform
- IFFT: Inverse FFT
- A feature vector consists of the first 14 cepstral coefficients of a frame.
- Optional steps
- Frequency-selective filter to reduce noise
- Mel-warped cepstral coefficients
- Feature-wise normalization
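The major steps above can be expressed as a short NumPy sketch; the frame hop size and the small epsilon inside the log are assumptions, not settings stated in the presentation.

```python
import numpy as np

def cepstral_features(signal, frame_len=256, hop=128, n_coeffs=14):
    """Compute real-cepstrum feature vectors from a 1-D speech signal.

    Pipeline from the slide: overlapping frames, Hamming window,
    cepstrum(frame) = real(IFFT(log |FFT(frame)|)), keep the first 14
    coefficients of each frame.
    """
    window = np.hamming(frame_len)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window   # Hamming windowing
        spectrum = np.abs(np.fft.fft(frame))                # FFT, abs
        log_spectrum = np.log(spectrum + 1e-10)             # log (epsilon avoids log 0)
        cepstrum = np.real(np.fft.ifft(log_spectrum))       # IFFT, real
        features.append(cepstrum[:n_coeffs])                # first 14 coefficients
    return np.array(features)                               # shape: (n_frames, n_coeffs)
```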
6 Physical Meanings of Cepstrum
7 Feature Extraction
- Flowchart: 2.39-sec. speech signal -> low-pass filter -> take frames (148 frames of 256 points) -> Hamming windowing -> FFT -> abs -> log -> resample -> IFFT -> real -> first 14 coefficients -> normalization -> 148 feature vectors of length 14
8 Feature Extraction
- Upper: speaker 1; lower: speaker 2
9 Pattern Recognition
- Training path: sample speech -> feature extraction -> data reduction -> sample set
- Recognition path: test speech -> feature extraction -> classifier (matched against the sample set) -> recognized speaker
10 Pattern Recognition Methods
- K-NNR: K-nearest neighbor rule
- Euclidean distance
- Mahalanobis distance
- Maximum log likelihood
- Adaptive networks
- Multilayer perceptrons
- Radial basis function networks
- Fuzzy classifiers with random search
11 K-Nearest Neighbor Rule (K-NNR)
- Steps
- 1. Find the first k nearest neighbors of a given point.
- 2. Determine the class of the given point by a voting mechanism among these k nearest neighbors (see the sketch below).
(Figure: class-A points, class-B points, and a point of unknown class plotted in the Feature 1 vs. Feature 2 plane)
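A minimal sketch of the K-NNR voting rule with Euclidean distance, assuming the sample set is a NumPy array of feature vectors with one label per row:

```python
import numpy as np
from collections import Counter

def knn_classify(x, samples, labels, k=5):
    """Classify x by majority vote among its k nearest samples (Euclidean distance)."""
    dists = np.linalg.norm(samples - x, axis=1)   # distance from x to every sample
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    votes = Counter(labels[i] for i in nearest)   # voting mechanism
    return votes.most_common(1)[0][0]             # majority class wins
```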
12 Decision Boundary for 1-NNR
- Voronoi diagram: piecewise-linear boundary
13 Distance Metrics
- Euclidean distance
- Mahalanobis distance
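In standard notation (the slide's formulas did not survive conversion), with m a class prototype or mean and Σ the class covariance matrix:

```latex
d_{\mathrm{Euclidean}}(x, m)   = \sqrt{(x - m)^{\top}(x - m)}
d_{\mathrm{Mahalanobis}}(x, m) = \sqrt{(x - m)^{\top}\,\Sigma^{-1}\,(x - m)}
```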
14 Maximum Log Likelihood
- Multivariate normal distribution N(μ, Σ)
- Likelihood of x in class j
- Log likelihood
15 Maximum Log Likelihood
- Likelihood of X = {x1, ..., xn} in class j
- Log likelihood
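The standard forms for a d-dimensional Gaussian class model N(μ_j, Σ_j), which the slide's lost equations presumably showed:

```latex
p(x \mid j) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_j|^{1/2}}
  \exp\!\Big(-\tfrac{1}{2}(x - \mu_j)^{\top}\Sigma_j^{-1}(x - \mu_j)\Big)

\log p(X \mid j) = \sum_{i=1}^{n} \log p(x_i \mid j)
  = -\frac{n}{2}\log|\Sigma_j|
    -\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu_j)^{\top}\Sigma_j^{-1}(x_i - \mu_j)
    -\frac{nd}{2}\log 2\pi
```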
16 Data Reduction
- Purpose
- Reduce NNR computation load
- Increase data consistency
- Techniques
- To reduce data size
- Editing: to eliminate noisy (boundary) data
- Condensing: to eliminate redundant (deeply embedded) data
- Vector quantization: to find representative data
- To reduce data dimensions
- Principal component projection: to reduce the dimensions of the feature sets
- Discriminant projection: to find the set of vectors that best separates the patterns
17 Editing
- To remove noisy (boundary) data
18 Condensing
- To remove redundant (deeply embedded) data
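One classic way to implement condensing is Hart's condensed nearest neighbor rule; the presentation does not name the exact variant used, so the following is only an illustrative sketch (samples and labels are assumed to be NumPy arrays):

```python
import numpy as np

def condense(samples, labels):
    """Hart-style condensing: keep only the samples that 1-NN needs in order
    to classify the remaining samples correctly; deeply embedded points
    are never misclassified and therefore never get added."""
    keep = [0]                                       # start with one arbitrary sample
    changed = True
    while changed:
        changed = False
        for i in range(len(samples)):
            if i in keep:
                continue
            dists = np.linalg.norm(samples[keep] - samples[i], axis=1)
            nearest = keep[int(np.argmin(dists))]    # 1-NN within the kept set
            if labels[nearest] != labels[i]:         # misclassified -> must keep it
                keep.append(i)
                changed = True
    return samples[keep], labels[keep]
```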
19 VQ: Fuzzy C-Means Clustering
- A point can belong to various clusters with various degrees.
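A bare-bones fuzzy c-means sketch using the standard FCM updates; the fuzzifier m, cluster count, and iteration count are assumptions, not the presentation's settings.

```python
import numpy as np

def fuzzy_c_means(X, c=20, m=2.0, iters=100, seed=0):
    """Basic FCM: every point holds a membership degree in every cluster."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m                                   # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None] # weighted cluster centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = 1.0 / (d ** (2 / (m - 1)))               # closer centers get higher degrees
        U /= U.sum(axis=1, keepdims=True)            # renormalize memberships
    return centers, U                                # centers serve as the reduced sample set
```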
20 Fuzzy Classifier
- Rule base
- If x is close to (A1 or A2 or A3), then x belongs to class A
- If x is close to (B1 or B2 or B3), then x belongs to class B
- A fuzzy classifier is equivalent to a 1-NNR if all MFs have the same width.
(Figure: membership functions A1, A2, A3 and B1, B2, B3 plotted in the feature plane)
21 Fuzzy Classifier
- Adaptive network representation (see the sketch below)
(Figure: inputs x1 and x2 feed multidimensional MFs A1-A3 and B1-B3; one max node combines the A MFs, another the B MFs, and their difference gives the output y)
- x = (x1, x2) belongs to class A if y > 0 and to class B if y < 0
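A compact sketch of such a classifier, assuming Gaussian multidimensional MFs centered at the prototypes; the MF shape is an assumption, since the slide only shows the network structure.

```python
import numpy as np

def gaussian_mf(x, center, width):
    """Multidimensional Gaussian membership function."""
    return np.exp(-np.sum((x - center) ** 2) / (2 * width ** 2))

def fuzzy_classify(x, centers_A, centers_B, widths_A, widths_B):
    """y = max membership over class-A MFs minus max membership over class-B MFs;
    class A if y > 0, class B if y < 0.  With all widths equal this reduces to
    picking the class of the nearest prototype, i.e., 1-NNR."""
    mu_A = max(gaussian_mf(x, c, w) for c, w in zip(centers_A, widths_A))
    mu_B = max(gaussian_mf(x, c, w) for c, w in zip(centers_B, widths_B))
    y = mu_A - mu_B
    return 'A' if y > 0 else 'B'
```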
22 Refining Fuzzy Classifier
(Figure: left, MFs with the same width; right, MF widths refined via random search)
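A sketch of the derivative-free refinement: perturb the MF widths at random and keep a change only if the recognition rate improves. Here eval_rate is a hypothetical placeholder for a function that scores a width vector on the sample set; the step size and iteration count are assumptions.

```python
import numpy as np

def random_search(widths, eval_rate, iters=200, step=0.05, seed=0):
    """Derivative-free random search over MF widths."""
    rng = np.random.default_rng(seed)
    best = np.asarray(widths, dtype=float)
    best_rate = eval_rate(best)
    for _ in range(iters):
        candidate = best + step * rng.standard_normal(best.shape)  # random perturbation
        candidate = np.clip(candidate, 1e-3, None)                 # keep widths positive
        rate = eval_rate(candidate)
        if rate > best_rate:                                       # keep only improvements
            best, best_rate = candidate, rate
    return best, best_rate
```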
23 Principal Component Projection
- Eigenvalues of the covariance matrix: λ1 > λ2 > λ3 > ... > λd
- Projection onto (v1, v2) vs. projection onto (v3, v4)
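A minimal PCA projection sketch in NumPy, sorting the eigenvectors so that λ1 > λ2 > ... > λd and keeping the leading ones:

```python
import numpy as np

def pca_project(X, n_dims=2):
    """Project feature vectors onto the leading eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)                          # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                # sort so that l1 > l2 > ...
    V = eigvecs[:, order[:n_dims]]                   # leading principal directions
    return Xc @ V                                    # projected data
```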
24 Discriminant Projection
- Best discriminant vectors v1, v2, ..., vd
- Projection onto (v1, v2) vs. projection onto (v3, v4)
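One common formulation of the discriminant projection is Fisher's criterion, taking eigenvectors of Sw^{-1} Sb; the presentation does not specify which variant was used, so this is only a sketch (labels is assumed to be a NumPy array).

```python
import numpy as np

def discriminant_project(X, labels, n_dims=2):
    """Fisher-style discriminant projection: directions that best separate the classes."""
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                            # within-class scatter
    Sb = np.zeros((d, d))                            # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]           # best discriminant vectors first
    V = eigvecs[:, order[:n_dims]].real
    return X @ V
```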
25 Experiments
- Experimental data
- Sample size: 578; test size: 1063; no. of classes: 3
- Samples per speaker (sample data): 148, 280, 150
- Samples per speaker (test data): 256, 457, 350
- Experiments
- K-NNR with all sample data
- K-NNR with reduced sample data
- Fuzzy classifier refined via random search
26 Performance Using All Samples
- Sample size: 578
- Test size: 1063
Recognition rates as functions of the speech signal length
27 Performance After Editing and Condensing
- Sample size: 497 after editing, 64 after condensing
- Test size: 1063
Recognition rates as functions of the speech signal length
28 Performance After VQ (FCM)
- Sample size: 60 after FCM
- Test size: 1063
Recognition rates as functions of the speech signal length
Confusion matrix
29 Performance After VQ and Random Search
- Sample (rule) size: 60, tuned via random search
- Test size: 1063
Recognition rates as functions of the speech signal length
Confusion matrix
30 On-line Recognition Hardware Setup
31 Conclusions
- Performance after editing and condensing is unpredictable.
- Performance after VQ (FCM) is consistently better than that after editing and condensing.
- A simple derivative-free optimization method, i.e., random search, can significantly enhance the performance.
32 Future Work
- Data dimension reduction
- Other feature extraction methods (e.g., LPC)
- Scale up the problem size
- More speakers (ten or more)
- Other vocal signals (laughter, coughs, singing, etc.)
- Other biometric identification using
- Faces
- Fingerprints and palm prints
- Retina and iris scans
- Hand shapes/sizes/proportions
- Hand vein distributions