Title: Speaker, Speech, and Facial Recognition
By Austin Ouyang and Xin Chen
TA: Tony Mangognia
Overall Design
[Block diagram: a Microphone and a Webcam feed the Digital Signal Processor (C6713), which exchanges control signals and the speaker identification class with a Matlab GUI]
3 Stages of Verification
- Speaker Identification
  - Identifies a person's voice based on the physical structure of the throat generating his or her voice
- Speech Pattern Recognition
  - Identifies what a person is saying, assuming the utterance is already in the database
- Facial Recognition
  - Matches a person's face against those present in the database
Basic Recognition System
Basic structure used for speaker, speech, and facial recognition:
signal -> Pre-processing -> Feature Extraction -> Classification -> Post-processing
Speaker Recognition System
signal -> Framing and Windowing -> Feature Extraction using Linear Prediction Coding (LPC) -> Linear Classifier -> Majority Ruling
- Key parameters
  - Size of frames
  - Hop sizes
  - Number of LPC coefficients
  - Thresholds for majority ruling (imposter determination)
Framing and Windowing
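The framing-and-windowing step can be sketched as follows. The frame size, hop size, and choice of a Hamming window are placeholder assumptions; the deck lists frame size and hop size only as tunable parameters:

```python
import numpy as np

def frame_signal(signal, frame_size=256, hop_size=128):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    frame_size and hop_size are example values; both are listed in the
    deck as key tunable parameters.
    """
    window = np.hamming(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop_size
    frames = np.empty((n_frames, frame_size))
    for i in range(n_frames):
        start = i * hop_size
        frames[i] = signal[start:start + frame_size] * window
    return frames
```

With a 50% hop, each sample (apart from the edges) contributes to two frames, which smooths the frame-to-frame variation of the features extracted next.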
Feature Extraction using Linear Prediction Coding (LPC)
1. Find the autocorrelation coefficients, based on how many LPC coefficients are needed.
2. Compute the LPC coefficients with the Levinson-Durbin algorithm.
3. The LPC coefficients are the filter coefficients of an all-pole IIR filter that can model the frequency response of the signal; the LPC coefficients thus act like a spectral envelope in the frequency domain.
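The two computational steps above can be sketched as follows, a minimal version assuming the frame has already been windowed:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Compute LPC coefficients of one frame via autocorrelation and the
    Levinson-Durbin recursion.

    Returns [1, a1, ..., a_order]: the prediction-error filter whose
    inverse is the all-pole model of the frame's spectral envelope.
    """
    # Step 1: autocorrelation for lags 0..order
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    # Step 2: Levinson-Durbin recursion
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a[1:i].copy()
        a[1:i] = a_prev + k * a_prev[::-1]  # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)                # shrink the prediction error
    return a
```

Applied to a signal generated by a known all-pole filter, the recursion recovers (approximately) that filter's coefficients, which is exactly why the coefficients trace the spectral envelope.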
Linear Classifier
The linear classifier assumes that the system is linear; with non-linear systems this classifier would not be effective. This system can be treated as linear, since the LPC coefficients are themselves derived by linear computation.

t = Wx

such that W is the template matrix, x is the matrix containing all of the LPC coefficients, and t is the known class matrix. W is m x n, x is n x l, and t is m x l, where m is the number of classes, n is the number of LPC coefficients, and l is the number of frames.
Linear Classifier (cont.)
During training, the W matrix needs to be found. To find it, the t matrix is multiplied by the inverse of the x matrix, using typical matrix inversion techniques:

W = t x^(-1),   with A^(-1) = C / det(A), where C is the adjugate matrix of A
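Since x (n coefficients by l frames) is generally rectangular, the "inversion" in practice amounts to a least-squares solve with the Moore-Penrose pseudo-inverse. A minimal sketch of training and per-frame classification, with synthetic data for illustration:

```python
import numpy as np

def train_template(t, x):
    """Solve t = W x for the template matrix W in the least-squares sense.

    t : (m, l) known class matrix (one-hot columns, one per frame)
    x : (n, l) matrix of LPC coefficient vectors, one column per frame
    """
    return t @ np.linalg.pinv(x)

def classify_frames(w, x):
    """Return the predicted class index for each frame (column of x)."""
    return np.argmax(w @ x, axis=0)
```

The pseudo-inverse gives the W that minimizes the squared error between Wx and the known class matrix t, which is the natural reading of "inverting" a non-square x.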
Majority Ruling
During testing, a t matrix is generated of order m x l, where m is the number of classes and l is the number of frames. For each frame (column), the index with the largest value is the identified class for that frame.
In the example, given 8 columns, 5 had their maximum at index 4 and the remaining 3 at index 5. With majority ruling, the test is assigned class 4, since more maxima are present at index 4 than at any other index.
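The majority rule described above reduces to an argmax per column followed by a vote count:

```python
import numpy as np

def majority_rule(t):
    """Pick the winning class from a test-time t matrix (m classes x l frames).

    Each column votes for the row index holding its maximum; the class
    with the most votes wins.
    """
    votes = np.argmax(t, axis=0)        # one vote per frame
    return np.bincount(votes).argmax()  # most common class index
</antml_code>```

On a t matrix where 5 of 8 columns peak at one row and 3 at another, this returns the majority row, matching the slide's example.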
Speech Pattern Recognition (based on work by Bernd Plannerer)
signal -> Framing -> Mel Filter Bank -> Distance calculations with samples in the database
- Key parameters
  - Size of frames
  - Hop sizes
  - Number of channels for Mel filters
Mel-Frequency Filter Banks
Humans do not hear frequencies on a linear scale, but rather on an approximately logarithmic one. As seen in the figure, human hearing is more sensitive to lower frequencies than to higher frequencies. The triangular filters perform a masking effect: multiplying the mel-frequency filters with the power spectrum of the signal returns a weighting of how strong the signal is in each frequency band, and this provides the feature coefficients.
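A triangular mel filter bank can be sketched as below. The channel count, FFT size, and sample rate are placeholder values; the deck lists only the number of channels as a key parameter:

```python
import numpy as np

def mel_filter_bank(n_channels=20, n_fft=512, fs=8000):
    """Build triangular filters spaced evenly on the mel scale.

    Returns an (n_channels, n_fft // 2 + 1) matrix; multiplying it with a
    power spectrum yields one energy weighting per mel channel.
    """
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Channel edges: evenly spaced in mel, converted back to FFT bin indices
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_channels + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)

    fbank = np.zeros((n_channels, n_fft // 2 + 1))
    for ch in range(n_channels):
        lo, mid, hi = bin_edges[ch], bin_edges[ch + 1], bin_edges[ch + 2]
        for k in range(lo, mid):            # rising edge of the triangle
            fbank[ch, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):            # falling edge of the triangle
            fbank[ch, k] = (hi - k) / max(hi - mid, 1)
    return fbank
```

Because the edges are equally spaced in mel rather than hertz, the filters are narrow at low frequencies and wide at high frequencies, mirroring the sensitivity described above.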
Distance
The Euclidean distance is computed between two arbitrary frames. The algorithm then uses predecessor distances to accumulate cost, so it finds the shortest possible distance matching sample x with sample y; the total accumulated distance ends up in the bottom-right corner of the cost matrix.
Where m is the number of channels for the mel banks, n is the number of frames in sample x, and l is the number of frames in sample y.
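The accumulation described above is a standard dynamic time warping recursion; a minimal sketch:

```python
import numpy as np

def dtw_distance(x, y):
    """Accumulated dynamic-time-warping distance between two feature
    sequences x (n frames x m channels) and y (l frames x m channels).

    Each cell adds the Euclidean frame distance to the cheapest of its
    three predecessors; the total lands in the bottom-right corner.
    """
    n, l = len(x), len(y)
    acc = np.full((n + 1, l + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, l + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # Euclidean frame distance
            acc[i, j] = d + min(acc[i - 1, j],       # stretch y
                                acc[i, j - 1],       # stretch x
                                acc[i - 1, j - 1])   # advance both
    return acc[n, l]
```

Because frames may repeat along the warping path, a password spoken slightly faster or slower still accumulates a small total distance against its database sample.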
Face Recognition (based on work by David Lowe)
image -> Feature Extraction using the Scale-Invariant Feature Transform (SIFT) -> closest vector, with the 2nd closest far enough -> greatest number of matched keypoints
- SIFT (Scale-Invariant Feature Transform)
  - Keypoint detector
  - Edge / low-contrast removal
  - Orientation assignment
  - Vector creation
Face Recognition - SIFT
- Each layer on the left is the original image convolved with a Gaussian of increasing σ (scaled by a factor k)
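The scale-space construction can be sketched as follows; the number of scales and the base σ are placeholder values, and a simple separable blur stands in for the incremental blurring of a full SIFT implementation:

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur implemented with 1-D convolutions."""
    radius = int(3 * sigma)
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-(xs ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    # convolve rows, then columns ('same' keeps the image size)
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, image)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="same"), 0, rows)

def dog_pyramid(image, n_scales=4, sigma0=1.6, k=2 ** 0.5):
    """Blur with Gaussians of increasing sigma (a factor k apart) and
    return the difference-of-Gaussian layers used to find keypoints."""
    blurred = [gaussian_blur(image.astype(float), sigma0 * k ** i)
               for i in range(n_scales)]
    return [blurred[i + 1] - blurred[i] for i in range(n_scales - 1)]
```

Subtracting adjacent blurred layers yields the difference-of-Gaussian stack in which the extrema of the next slide are searched.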
Face Recognition - SIFT
- How keypoints are found: x is a keypoint if it is a maximum or minimum compared to all of its neighbors in its own layer and the adjacent layers
Face Recognition - SIFT
- Remove keypoints that lie on edges or in areas of low contrast
Face Recognition - SIFT
- Assign an orientation based on the gradients of the pixels neighboring the keypoint, weighting them with a Gaussian at the keypoint's scale
Face Recognition - SIFT
- Each descriptor is a 4x4x8 = 128-element vector
Face Recognition - Classification
- Compare against each image in the database separately.
- For each database image j, sort its vectors by angle to the test vector and accept the match only if the closest is sufficiently better than the 2nd closest:
  - dotprod = des(i,:) * database_des{j}';
  - [vals, indx] = sort(acos(dotprod));
  - if (vals(1) < vals(2) * ratio)
  -     count(j) = count(j) + 1;
  - end
Face Recognition - SIFT
- Advantages
  - Invariant to shift, rotation angle, and scale
  - Fast to match
  - Distinctive
- Disadvantage
  - Covered by a patent: cannot be used repeatedly for commercial purposes unless a license is obtained
Testing Speaker Recognition
- 1st stage: test with a 2-person database; mid to poor performance
  - Decided to treat noise as a class and not count it
- 2nd stage: test with noise as a class; improvement
  - Decided to eliminate noise when training the other classes by repeatedly saying the password
- 3rd stage: test with the repeated password
  - Computation of the W template took too long: 15.3 seconds with 2 people
Testing Speaker Recognition
- 4th stage: reduced the LPC size from 64 to 32 but increased fs from 8 kHz to 44.1 kHz
  - Time for the LPC calculation increased from 0.5 seconds to 2.5 seconds
  - Time for the template calculation for 2 people decreased from 15.3 seconds to 8 seconds
- 5th stage: increased the input level
  - Some softer voices were performing poorly; this increased the signal-to-noise ratio
Testing Speaker Recognition
- Results: percentage of features identified to the right person (minus noise)
[Table: matched person vs. test person, per trial number]
Testing Speech Pattern Recognition
- 1st stage: tested the speech pattern algorithm with various country names, with one person's voice populating the database
  - Accuracy with the same person was extremely high; however, accuracy when other people tested was not as good
- 2nd stage: populated the database with passwords said by different users
  - Accuracy of speech recognition with users in the database saying their own password was good, around 75%
  - Accuracy with users in the database saying other people's passwords was very low, making imposter recognition very difficult
Testing Speech Pattern Recognition
- 3rd stage: trained the database by having users repeat the password, so as to minimize the noise space in the 3-second buffer
  - Accuracy of users saying their own passwords was very high, well over 95%
  - Accuracy of users saying other people's passwords improved significantly; however, errors were still present. This is most likely because, in training, each password is trained with only one person's voice.
Testing Speech Pattern Recognition
- Results: minimal accumulated distance per person
[Table: matched person vs. test person]
Testing Face Recognition
- 1st stage: captured with the camera, then performed facial recognition
  - Detected hundreds of features; too much in the background
  - Distance, positioning, lighting, and angle were inconsistent
- 2nd stage: used a white background with consistent lighting and a fixed webcam
  - Number of features dropped to around 100
  - Matching performance at 100% with an 8-person database
- 3rd stage: used the linear classifier to match
  - Performance dropped to almost random classification
  - Reason: SIFT is a nonlinear system
Testing Face Recognition
- Invariant to what it is comparing: it can be faces, objects, etc.
Testing Face Recognition
- Results: number of features matched to each person
[Table: matched person vs. test person]
Additional Considerations
- Speaker Recognition
  - Training on the password does not encompass all phonemes in the English language
- Facial Recognition
  - If the head is turned side to side, performance drops significantly
  - The algorithm does not care about what is being matched to what: no locality
List of sources
- http://labts.troja.mff.cuni.cz/machl5bm/sift/
- David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60, 2 (2004), pp. 91-110
- Sebastian Thrun and Jana Košecká
- Ling Feng at the Technical University of Denmark
- Ogg Vorbis software