Title: AdvAIR
AdvAIR
An Advanced Audio Information Retrieval System
- Supervised by Prof. Michael R. Lyu
- Prepared by Alex Fok, Shirley Ng
- 2002 Fall
Outline
- Introduction
- System Overview
- Applications
- Experiment
- Future Work
- Q & A
Motivation
- Rapid expansion of audio information due to the blooming of the Internet
- Little attention paid to audio mining
- Lack of a framework for generic audio information processing
Targets
- An open platform that can provide a basis for various voice-oriented applications
- Enhanced audio information retrieval performance with guaranteed accuracy
- Generic speech analysis tools for data mining
Approaches
- A robust low-level sound information preprocessing module
- Speed-oriented yet accurate algorithms
- A generalized model concept for various usages
- A visual framework for presentation
System Flow Chart
(Flow chart. Core platform: Audio Signal → Preprocessing → Features Extraction → Segmentation and Clustering → Training and Modeling → Speaker Identification / Linguistic Identification → Database Storage. Extended tools: Scene Cutting, which implements Video Scene Change and Speaker Tracking.)
Features Extraction
- Energy measurement
- Zero crossing rate
- Pitch
- Humans resolve frequencies non-linearly across the audio spectrum
- MFCC approach
- Simulates the vocal tract shape
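Two of the low-level features above can be computed directly from the waveform. A minimal sketch assuming NumPy (the frame length and sampling rate are illustrative, not values from the system):

```python
import numpy as np

def frame_features(frame):
    """Short-time energy and zero-crossing rate for one audio frame."""
    energy = float(np.sum(frame ** 2))              # energy measurement
    signs = np.sign(frame)
    signs[signs == 0] = 1                           # count exact zeros as positive
    zcr = float(np.mean(signs[1:] != signs[:-1]))   # fraction of sign flips
    return energy, zcr

# A 440 Hz tone, 20 ms at an assumed 8 kHz sampling rate.
tone = np.sin(2 * np.pi * 440 * np.arange(160) / 8000)
energy, zcr = frame_features(tone)
```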
Features Extraction (cont)
- The idea of a filter bank, which approximates the non-linear frequency resolution
- Bins hold a weighted sum representing the spectral magnitude of the channels
- Lower and upper frequency cut-offs
(Figure: triangular filter bank — magnitude vs. frequency)
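The filter-bank idea can be sketched as code: triangular bins spaced uniformly on a perceptual (mel) scale between the lower and upper cut-offs, each holding a weighted sum of spectral magnitudes. The mel formula and all parameter values below are common defaults, not values taken from the slides:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000, f_lo=0.0, f_hi=8000.0):
    """Triangular filters equally spaced on the mel scale between the
    lower and upper frequency cut-offs."""
    mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                       # rising edge of triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                       # falling edge of triangle
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

fb = mel_filterbank()
```

Multiplying a power spectrum by `fb.T` yields one weighted sum per channel, which is then logged and decorrelated (DCT) to obtain MFCCs.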
Segmentation
- Segmentation cuts the audio stream at the acoustic change points
- BIC (Bayesian Information Criterion) is used
- It is threshold-free and robust
- The input audio stream is modeled as Gaussians
(Figure: a Gaussian distribution and its mean)
Segmentation (cont)
- Notation for an audio stream
- N: the number of frames
- X = {x_i : i = 1, 2, …, N}: a set of feature vectors
- µ is the mean
- Σ is the full covariance matrix
Segmentation for single change pt.
- Assume the change point is at frame i
- H0, H1: two different models
- H0 models the data as one Gaussian
- x1 … xN ~ N(µ, Σ)
- H1 models the data as two Gaussians
- x1 … xi ~ N(µ1, Σ1)
- x(i+1) … xN ~ N(µ2, Σ2)
Segmentation for single change pt. (cont)
- The maximum likelihood ratio statistic is
- R(i) = N log|Σ| - N1 log|Σ1| - N2 log|Σ2|
(Figure: audio stream from frame 1 to frame N, with the change point at frame i)
Segmentation for single change pt. (cont)
- BIC(i) = R(i) - λP, where P is the model complexity penalty
- BIC(i) is +ve → i is the change point
- BIC(i) is -ve → i is not the change point
- Which model fits the data better: a single Gaussian (H0) or two Gaussians (H1)?
Segmentation for single change pt. (cont)
- To detect a single change point, we calculate BIC(i) for all i = 1, 2, …, N
- The frame i with the largest BIC value is the change point
- O(N) to detect a single change point
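The single change point search can be sketched as follows. Only the overall form BIC(i) = R(i) − λP comes from the slides; the ½ factors on the log-likelihood terms and the penalty P = ½(d + d(d+1)/2) log N are the usual Gaussian-BIC choices and are assumptions here:

```python
import numpy as np

rng = np.random.default_rng(0)

def logdet_cov(x):
    """log|S| for the full covariance matrix of frames x (rows = frames)."""
    _, val = np.linalg.slogdet(np.atleast_2d(np.cov(x, rowvar=False)))
    return val

def delta_bic(x, i, lam=1.0):
    """BIC(i) = R(i) - lam * P for a candidate change point after frame i."""
    n, d = x.shape
    r = 0.5 * (n * logdet_cov(x)
               - i * logdet_cov(x[:i])
               - (n - i) * logdet_cov(x[i:]))
    p = 0.5 * (d + d * (d + 1) / 2) * np.log(n)   # assumed penalty form
    return r - lam * p

def single_change_point(x, margin=5):
    """Frame i with the largest BIC(i); None when no BIC(i) is positive."""
    scores = {i: delta_bic(x, i) for i in range(margin, len(x) - margin)}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Two synthetic 3-D Gaussian segments with a change point at frame 100.
x = np.vstack([rng.normal(0.0, 1.0, (100, 3)),
               rng.normal(4.0, 1.0, (80, 3))])
cp = single_change_point(x)
```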
Segmentation for multiple change pt.
- Step 1: Initialize the interval [a, b]; set a = 1, b = 2
- Step 2: Detect a change point in interval [a, b] through the BIC single change point detection algorithm
- Step 3: If there is no change point in interval [a, b], then set b = b + 1; else let t be the change point detected, and set a = t + 1, b = t + 2
- Step 4: Go to Step 2
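Steps 1-4 above can be sketched as a growing-window loop. The BIC scoring inside is the same assumed Gaussian form as before, and `grow` generalizes the step size (the slide's +1; the enhanced engine later uses +100):

```python
import numpy as np

rng = np.random.default_rng(1)

def logdet(x):
    """log|S| of the full covariance of frames x (rows = frames)."""
    _, v = np.linalg.slogdet(np.atleast_2d(np.cov(x, rowvar=False)))
    return v

def best_bic(x, lam=1.0, margin=10):
    """(i, BIC(i)) for the best candidate change point inside x."""
    n, d = x.shape
    p = 0.5 * (d + d * (d + 1) / 2) * np.log(n)   # assumed penalty form
    ld = logdet(x)
    scores = [(i, 0.5 * (n * ld - i * logdet(x[:i]) - (n - i) * logdet(x[i:]))
                  - lam * p)
              for i in range(margin, n - margin)]
    return max(scores, key=lambda s: s[1])

def multiple_change_points(x, grow=20, margin=10):
    """Widen [a, b) until a change point is confirmed, then restart past it."""
    a = 0
    b = 2 * margin + grow
    found = []
    while b <= len(x):
        i, score = best_bic(x[a:b], margin=margin)
        if score > 0:                 # change point confirmed at a + i
            t = a + i
            found.append(t)
            a, b = t + 1, t + 1 + 2 * margin + grow
        else:                         # no change point: grow the interval
            b += grow
    return found

# Three 3-D Gaussian segments; true change points at frames 120 and 240.
x = np.vstack([rng.normal(0, 1, (120, 3)),
               rng.normal(5, 1, (120, 3)),
               rng.normal(-5, 1, (120, 3))])
cps = multiple_change_points(x)
```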
Enhanced Implementation Algorithm
- Original multiple change point detection algorithm
- Starts detecting change points within 2 frames
- Increases the investigation interval by 1 each time
- Enhanced implementation algorithm
- The minimum processing interval used in our engine is 100 frames
- Increases the investigation interval by 100 each time
Enhanced Implementation Algorithm (cont)
- Why do we choose to increase the interval by 100 frames?
- If the increase is too large, a scene change may be missed
- It must be smaller than 170 frames because there are around 170 frames in 1 second
- If the increase is too small, processing is too slow
Enhanced Implementation Algorithm (cont)
- Advantage: speed-up
- Trade-off: the change point we detect is not very accurate
- To compensate:
- Investigate the frames around the detected change point again
- The investigation interval is incremented by 1 to locate a more accurate change point
Training and Modeling
- Before performing the various identifications, training and modeling are needed
- A probability-based model, the Gaussian Mixture Model (GMM), is used
- GMM is used for language identification, gender identification, and speaker identification
- A GMM is composed of many different Gaussian distributions
- A Gaussian distribution is represented by its mean and variance
Gaussian Mixture Model (GMM)
Model for Speaker i
- To train a model is to calculate the mean, variance, and weight for each of the Gaussian distributions
Training of speaker GMMs
- Collect sound clips that are long enough for each speaker (e.g. 20-minute sound clips)
- Steps for training one speaker model:
- Step 1: Start with an initial model λ
- Step 2: Calculate the new mean, variance, and weighting (a new λ) by training
- Step 3: Use the new λ if it represents the model better than the old λ
- Step 4: Repeat Steps 2 to 3
- Finally, we get a λ that can represent the model
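The training loop above corresponds to EM for a GMM. A minimal 1-D sketch on synthetic data (note that EM never decreases the likelihood, so Step 3's comparison always accepts the re-estimated λ):

```python
import numpy as np

rng = np.random.default_rng(2)

def train_gmm(x, k=2, iters=50):
    """Minimal 1-D EM: the model (lambda) is (weights, means, variances).
    Step 1: initial model; Steps 2-4: re-estimate and repeat."""
    n = len(x)
    w = np.full(k, 1.0 / k)                          # mixture weights
    mu = np.quantile(x, np.linspace(0.05, 0.95, k))  # spread-out initial means
    var = np.full(k, np.var(x))
    for _ in range(iters):
        # E-step: responsibility of each Gaussian for each sample
        dens = (w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: new weights, means, variances (the new lambda)
        nk = resp.sum(axis=0)
        w = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

# Synthetic "speaker" features: two clear clusters around 0 and 10, 40/60 mix.
x = np.concatenate([rng.normal(0, 1, 400), rng.normal(10, 1, 600)])
w, mu, var = train_gmm(x)
```

Real speaker models use multivariate MFCC vectors and diagonal-covariance components, but the update equations are the same shape.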
Applications
- Video scene change and speaker tracking
- Speaker Identification
- Telephony message notification
Video Scene Change and Speaker Tracking
(Flow chart: Video Clip → AdvAIR core segmentation → Timing and Speaker Information → Video Playing Mechanism, which combines the Speaker Index Information into a Multimedia Presentation.)
Usage
- Speaker tracking enhances data mining about a particular person (e.g. a politician in a conference)
- Audio information indexing and sorting for audio library storage
- An auxiliary tool for video cutting and editing applications
Screenshot
(Screenshot: input clip, multimedia player, time information and indexing)
Speaker Identification
(Flow chart. Training stage: Preprocessed Speaker Clip → GMM Model Training → Speaker Models Database. Testing stage: Sound Source → Speaker Comparison Mechanism, scored against the Speaker Models Database → Speaker Identity.)
Usage
- Security authentication
- Speaker identification for telephone-based systems
- Criminal investigation (for example, used similarly to fingerprints)
Screenshot
(Screenshot: input source, flexible-length comparison, media player for visual verification, speaker identity)
Telephony Message Notification
(Flow chart: when the user can't listen, the caller's message is recorded, passed through AdvAIR segmentation and GMM model comparison against the desired-group model database, and classified into the desired or non-desired group; a messaging API then notifies the user via the short message system or the e-mail system.)
Threshold-free BIC criterion
| Test | Wave length | Actual turning points | False alarms | Missed points | Time used |
|---|---|---|---|---|---|
| 1 | 9 seconds | 2 | 0 | 0 | 2 seconds |
| 2 | 12 seconds | 4 | 0 | 0 | 4 seconds |
| 3 | 25 seconds | 3 | 0 | 0 | 8 seconds |
| 4 | 120 seconds | 8 | 1 | 0 | 134 seconds |
| 5 | 540 seconds | 12 | 8 | 0 | 1200 seconds |

- Background noise affects accuracy
Enhanced Implementation
| Test | Method | Wave length | Actual turning points | False alarms | Missed points | Time used |
|---|---|---|---|---|---|---|
| 1 | Old | 9 seconds | 2 | 0 | 0 | 10 seconds |
| 1 | New | 9 seconds | 2 | 0 | 0 | 2 seconds |
| 2 | Old | 12 seconds | 4 | 0 | 0 | 40 seconds |
| 2 | New | 12 seconds | 4 | 0 | 0 | 4 seconds |
| 3 | Old | 25 seconds | 3 | 1 | 0 | 1300 seconds |
| 3 | New | 25 seconds | 3 | 2 | 0 | 8 seconds |
| 4 | Old | 540 seconds | 18 | 7 | 2 | over 1 day |
| 4 | New | 540 seconds | 18 | 8 | 2 | 1200 seconds |

- The speed enhancement depends on the number of change points relative to the clip length
GMM model closed-set speaker identification
- Training stage
- 10 speakers
- 5 males, 5 females
- 20 minutes for each speaker
- Testing stage
- 50 sound clips of 5 seconds duration
- 48 sound clips are correct, i.e. 96%
GMM model open-set speaker identification
- Accept or reject as the result
- Same setting as the closed-set experiment
- i.e. 10 speakers, 20 minutes each
- Correct: 45/50 = 90%
- False reject: 3/50 = 6%
- False accept: 2/50 = 4%
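The open-set accept/reject decision can be sketched as follows, with each speaker model reduced to a single diagonal Gaussian for brevity (the threshold value and speaker names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def loglik(x, mu, var):
    """Average per-frame log-likelihood under a 1-D Gaussian, standing in
    for a full GMM score."""
    return float(np.mean(-0.5 * (np.log(2 * np.pi * var)
                                 + (x - mu) ** 2 / var)))

def identify(x, models, threshold=-4.0):
    """Open-set decision: score the clip against every enrolled model, pick
    the best, and reject when even the best score falls below the threshold."""
    scores = {name: loglik(x, mu, var) for name, (mu, var) in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None   # None = reject

models = {"alice": (0.0, 1.0), "bob": (6.0, 1.0)}        # hypothetical speakers
clip_bob = rng.normal(6.0, 1.0, 200)       # in-set speaker: should be accepted
clip_unknown = rng.normal(30.0, 1.0, 200)  # impostor far from both models
```

Short clips give noisy averages, which is one reason the open-set decision function loses accuracy on short durations.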
Problems and limitations
- Accuracy is affected by background noise
- Some speakers have very similar sound features
- The open-set speaker identification decision function is not very accurate if the duration is short
- Segmentation is still a time-consuming process
Future Work
- Speaker gender identification
- Robust open-set speaker identification
- Speech content recognition
- Music pattern matching
- Distributed system for segmentation
Q & A