Title: Speaker, Speech, and Facial Recognition
By Austin Ouyang and Xin Chen
TA: Tony Mangognia
Overall Design
[Block diagram: a Microphone and a Webcam feed the Digital Signal Processor (C6713), which exchanges control signals and the speaker identification class with a Matlab GUI]
3 Stages of Verification
- Speaker Identification
  - Identifies a person's voice based on the physical structure of the throat generating his or her voice
- Speech Pattern Recognition
  - Identifies what a person is saying, assuming the utterance is already in the database
- Facial Recognition
  - Matches a person's face against those present in the database
Basic Recognition System
Basic structure used for speaker, speech, and facial recognition:
signal -> Pre-processing -> Feature Extraction -> Classification -> Post-processing
Speaker Recognition System
signal -> Framing and Windowing -> Feature Extraction using Linear Prediction Coding (LPC) -> Linear Classifier -> Majority Ruling
- Key parameters
  - Size of frames
  - Hop sizes
  - Number of LPC coefficients
  - Thresholds for majority ruling (imposter determination)
Framing and Windowing
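The framing-and-windowing step can be sketched as follows. The frame size, hop size, and choice of a Hamming window are placeholder assumptions; the deck lists frame size and hop size only as tunable parameters:

```python
import numpy as np

def frame_signal(signal, frame_size=256, hop_size=128):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    frame_size and hop_size are example values; both are listed in the
    deck as key tunable parameters.
    """
    window = np.hamming(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop_size
    frames = np.empty((n_frames, frame_size))
    for i in range(n_frames):
        start = i * hop_size
        frames[i] = signal[start:start + frame_size] * window
    return frames
```

With a 50% hop, each sample (apart from the edges) contributes to two frames, which smooths the frame-to-frame variation of the features extracted next.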
Feature Extraction using Linear Prediction Coding (LPC)
1. Find the autocorrelation coefficients, based on how many LPC coefficients are needed.
2. Compute the LPC coefficients with the Levinson-Durbin algorithm.
3. The LPC coefficients are the filter coefficients of an all-pole IIR filter that can model the frequency response of the signal; the LPC coefficients thus act like a spectral envelope in the frequency domain.
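The two computational steps above can be sketched as follows, a minimal version assuming the frame has already been windowed:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Compute LPC coefficients of one frame via autocorrelation and the
    Levinson-Durbin recursion.

    Returns [1, a1, ..., a_order]: the prediction-error filter whose
    inverse is the all-pole model of the frame's spectral envelope.
    """
    # Step 1: autocorrelation for lags 0..order
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    # Step 2: Levinson-Durbin recursion
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a[1:i].copy()
        a[1:i] = a_prev + k * a_prev[::-1]  # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)                # shrink the prediction error
    return a
```

Applied to a signal generated by a known all-pole filter, the recursion recovers (approximately) that filter's coefficients, which is exactly why the coefficients trace the spectral envelope.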
Linear Classifier
The linear classifier assumes that the system is linear; with non-linear systems this classifier would not be effective. This system can be treated as linear, since the LPC coefficients are themselves derived by linear computation.

t = Wx

such that W is the template matrix, x is the matrix containing all of the LPC coefficients, and t is the known class matrix. W is m x n, x is n x l, and t is m x l, where m is the number of classes, n is the number of LPC coefficients, and l is the number of frames.
Linear Classifier (cont.)
During training, the W matrix needs to be found. To find it, the t matrix is multiplied by the inverse of the x matrix, using typical matrix inversion techniques:

W = t x^(-1),   with A^(-1) = C / det(A), where C is the adjugate matrix of A
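Since x (n coefficients by l frames) is generally rectangular, the "inversion" in practice amounts to a least-squares solve with the Moore-Penrose pseudo-inverse. A minimal sketch of training and per-frame classification, with synthetic data for illustration:

```python
import numpy as np

def train_template(t, x):
    """Solve t = W x for the template matrix W in the least-squares sense.

    t : (m, l) known class matrix (one-hot columns, one per frame)
    x : (n, l) matrix of LPC coefficient vectors, one column per frame
    """
    return t @ np.linalg.pinv(x)

def classify_frames(w, x):
    """Return the predicted class index for each frame (column of x)."""
    return np.argmax(w @ x, axis=0)
```

The pseudo-inverse gives the W that minimizes the squared error between Wx and the known class matrix t, which is the natural reading of "inverting" a non-square x.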
Majority Ruling
During testing, a t matrix is generated of order m x l, where m is the number of classes and l is the number of frames. For each frame (column), the index with the largest value is the identified class for that frame.
In the example, given 8 columns, 5 had their maximum at index 4 and the remaining 3 at index 5. With majority ruling, the test is assigned class 4, since more maxima are present at index 4 than at any other index.
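The majority rule described above reduces to an argmax per column followed by a vote count:

```python
import numpy as np

def majority_rule(t):
    """Pick the winning class from a test-time t matrix (m classes x l frames).

    Each column votes for the row index holding its maximum; the class
    with the most votes wins.
    """
    votes = np.argmax(t, axis=0)        # one vote per frame
    return np.bincount(votes).argmax()  # most common class index
</antml_code>```

On a t matrix where 5 of 8 columns peak at one row and 3 at another, this returns the majority row, matching the slide's example.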
Speech Pattern Recognition (based on work by Bernd Plannerer)
signal -> Framing -> Mel Filter Bank -> Distance calculations with samples in the database
- Key parameters
  - Size of frames
  - Hop sizes
  - Number of channels for Mel filters
Mel-Frequency Filter Banks
Humans do not hear frequencies on a linear scale, but rather on an approximately logarithmic one. As seen in the figure, human hearing is more sensitive to lower frequencies than to higher frequencies. The triangular filters perform a masking effect: multiplying the mel-frequency filters with the power spectrum of the signal returns a weighting of how strong the signal is in each frequency band, and this provides the feature coefficients.
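A triangular mel filter bank can be sketched as below. The channel count, FFT size, and sample rate are placeholder values; the deck lists only the number of channels as a key parameter:

```python
import numpy as np

def mel_filter_bank(n_channels=20, n_fft=512, fs=8000):
    """Build triangular filters spaced evenly on the mel scale.

    Returns an (n_channels, n_fft // 2 + 1) matrix; multiplying it with a
    power spectrum yields one energy weighting per mel channel.
    """
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Channel edges: evenly spaced in mel, converted back to FFT bin indices
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_channels + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)

    fbank = np.zeros((n_channels, n_fft // 2 + 1))
    for ch in range(n_channels):
        lo, mid, hi = bin_edges[ch], bin_edges[ch + 1], bin_edges[ch + 2]
        for k in range(lo, mid):            # rising edge of the triangle
            fbank[ch, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):            # falling edge of the triangle
            fbank[ch, k] = (hi - k) / max(hi - mid, 1)
    return fbank
```

Because the edges are equally spaced in mel rather than hertz, the filters are narrow at low frequencies and wide at high frequencies, mirroring the sensitivity described above.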
Distance
The Euclidean distance is computed between two arbitrary frames. The algorithm then uses predecessor distances to accumulate cost, so it finds the shortest possible distance matching sample x with sample y; the total accumulated distance ends up in the bottom-right corner of the cost matrix.
Where m is the number of channels for the mel banks, n is the number of frames in sample x, and l is the number of frames in sample y.
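The accumulation described above is a standard dynamic time warping recursion; a minimal sketch:

```python
import numpy as np

def dtw_distance(x, y):
    """Accumulated dynamic-time-warping distance between two feature
    sequences x (n frames x m channels) and y (l frames x m channels).

    Each cell adds the Euclidean frame distance to the cheapest of its
    three predecessors; the total lands in the bottom-right corner.
    """
    n, l = len(x), len(y)
    acc = np.full((n + 1, l + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, l + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # Euclidean frame distance
            acc[i, j] = d + min(acc[i - 1, j],       # stretch y
                                acc[i, j - 1],       # stretch x
                                acc[i - 1, j - 1])   # advance both
    return acc[n, l]
```

Because frames may repeat along the warping path, a password spoken slightly faster or slower still accumulates a small total distance against its database sample.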
Face Recognition (based on work by David Lowe)
image -> Feature Extraction using the Scale-Invariant Feature Transform (SIFT) -> closest vector, with the 2nd closest far enough -> greatest number of matched keypoints
- SIFT (Scale-Invariant Feature Transform)
  - Keypoint detector
  - Edge / low-contrast removal
  - Orientation assignment
  - Vector creation
Face Recognition - SIFT
- Each layer on the left is the original image convolved with a Gaussian of increasing σ (scaled by a factor k)
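The scale-space construction can be sketched as follows; the number of scales and the base σ are placeholder values, and a simple separable blur stands in for the incremental blurring of a full SIFT implementation:

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur implemented with 1-D convolutions."""
    radius = int(3 * sigma)
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-(xs ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    # convolve rows, then columns ('same' keeps the image size)
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, image)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="same"), 0, rows)

def dog_pyramid(image, n_scales=4, sigma0=1.6, k=2 ** 0.5):
    """Blur with Gaussians of increasing sigma (a factor k apart) and
    return the difference-of-Gaussian layers used to find keypoints."""
    blurred = [gaussian_blur(image.astype(float), sigma0 * k ** i)
               for i in range(n_scales)]
    return [blurred[i + 1] - blurred[i] for i in range(n_scales - 1)]
```

Subtracting adjacent blurred layers yields the difference-of-Gaussian stack in which the extrema of the next slide are searched.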
Face Recognition - SIFT
- How keypoints are found: x is a keypoint if it is a maximum or minimum compared to all of its neighbors in its own layer and the adjacent layers
Face Recognition - SIFT
- Remove keypoints that lie on edges or in areas of low contrast
Face Recognition - SIFT
- Assign an orientation based on the gradients of the pixels neighboring the keypoint, weighting them with a Gaussian at the keypoint's scale
Face Recognition - SIFT
- Each descriptor is a 4x4x8 = 128-element vector
Face Recognition - Classification
- Compare against each image in the database separately.
- For each database image j, sort its vectors by angle to the test vector and accept the match only if the closest is sufficiently better than the 2nd closest:
  - dotprod = des(i,:) * database_des{j}';
  - [vals, indx] = sort(acos(dotprod));
  - if (vals(1) < vals(2) * ratio)
  -     count(j) = count(j) + 1;
  - end
Face Recognition - SIFT
- Advantages
  - Invariant to shift, rotation angle, and scale
  - Fast to match
  - Distinctive
- Disadvantage
  - Covered by a patent: cannot be used repeatedly for commercial purposes unless a license is obtained
Testing Speaker Recognition
- 1st stage: test with a 2-person database; mid to poor performance
  - Decided to treat noise as a class and not count it
- 2nd stage: test with noise as a class; improvement
  - Decided to eliminate noise when training the other classes by repeatedly saying the password
- 3rd stage: test with the repeated password
  - Computation of the W template took too long: 15.3 seconds with 2 people
Testing Speaker Recognition
- 4th stage: reduced the LPC size from 64 to 32 but increased fs from 8 kHz to 44.1 kHz
  - Time for the LPC calculation increased from 0.5 seconds to 2.5 seconds
  - Time for the template calculation for 2 people decreased from 15.3 seconds to 8 seconds
- 5th stage: increased the input level
  - Some softer voices were performing poorly; this increased the signal-to-noise ratio
Testing Speaker Recognition
- Results: percentage of features identified to the right person (minus noise)
[Table: matched person vs. test person, per trial number]
Testing Speech Pattern Recognition
- 1st stage: tested the speech pattern algorithm with various country names, with one person's voice populating the database
  - Accuracy with the same person was extremely high; however, accuracy when other people tested was not as good
- 2nd stage: populated the database with passwords said by different users
  - Accuracy of speech recognition with users in the database saying their own password was good, around 75%
  - Accuracy with users in the database saying other people's passwords was very low, making imposter recognition very difficult
Testing Speech Pattern Recognition
- 3rd stage: trained the database by having users repeat the password, so as to minimize the noise space in the 3-second buffer
  - Accuracy of users saying their own passwords was very high, well over 95%
  - Accuracy of users saying other people's passwords improved significantly; however, errors were still present. This is most likely because, in training, each password is trained with only one person's voice.
Testing Speech Pattern Recognition
- Results: minimal accumulated distance per person
[Table: matched person vs. test person]
Testing Face Recognition
- 1st stage: captured with the camera, then performed facial recognition
  - Detected hundreds of features; too much in the background
  - Distance, positioning, lighting, and angle were inconsistent
- 2nd stage: used a white background with consistent lighting and a fixed webcam
  - Number of features dropped to around 100
  - Matching performance at 100% with an 8-person database
- 3rd stage: used the linear classifier to match
  - Performance dropped to almost random classification
  - Reason: SIFT is a nonlinear system
Testing Face Recognition
- Invariant to what it is comparing: it can be faces, objects, etc.
Testing Face Recognition
- Results: number of features matched to each person
[Table: matched person vs. test person]
Additional Considerations
- Speaker Recognition
  - Training on the password does not encompass all phonemes in the English language
- Facial Recognition
  - If the head is turned side to side, performance drops significantly
  - The algorithm does not care about what is being matched to what: no locality
List of sources
- http://labts.troja.mff.cuni.cz/machl5bm/sift/
- David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60, 2 (2004), pp. 91-110
- Sebastian Thrun and Jana Košecká
- Ling Feng at the Technical University of Denmark
- Ogg Vorbis software