Title: AdvAIR
AdvAIR
An Advanced Audio Information Retrieval System
- Supervised by Prof. Michael R. Lyu
- Prepared by Alex Fok, Shirley Ng
- 2002 Fall
Outline
- Introduction
- System Overview
- Applications
- Experiment
- Future Work
- Q & A
Motivation
- Rapid expansion of audio information due to the blooming of the Internet
- Little attention paid to audio mining
- Lack of a framework for generic audio information processing
Targets
- An open platform that can provide a basis for various voice-oriented applications
- Enhanced audio information retrieval performance with guaranteed accuracy
- Generic speech analysis tools for data mining
Approaches
- A robust low-level sound information preprocessing module
- Speed-oriented yet accurate algorithms
- A generalized model concept for various usages
- A visual framework for presentation
System Flow Chart
(Flow chart. Core platform: Audio Signal → Preprocessing → Features Extraction → Segmentation and Clustering → Training and Modeling → Speaker Identification / Linguistic Identification → Database Storage. Extended tools: Scene Cutting, which implements Video Scene Change and Speaker Tracking.)
Features Extraction
- Energy measurement
- Zero crossing rate
- Pitch
- Humans resolve frequencies non-linearly across the audio spectrum
- MFCC approach
- Simulates the vocal tract shape
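Two of the low-level features above can be computed directly from the waveform. A minimal sketch assuming NumPy (the frame length and sampling rate are illustrative, not values from the system):

```python
import numpy as np

def frame_features(frame):
    """Short-time energy and zero-crossing rate for one audio frame."""
    energy = float(np.sum(frame ** 2))              # energy measurement
    signs = np.sign(frame)
    signs[signs == 0] = 1                           # count exact zeros as positive
    zcr = float(np.mean(signs[1:] != signs[:-1]))   # fraction of sign flips
    return energy, zcr

# A 440 Hz tone, 20 ms at an assumed 8 kHz sampling rate.
tone = np.sin(2 * np.pi * 440 * np.arange(160) / 8000)
energy, zcr = frame_features(tone)
```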
Features Extraction (cont)
- The idea of a filter bank, which approximates the non-linear frequency resolution
- Bins hold a weighted sum representing the spectral magnitude of the channels
- Lower and upper frequency cut-offs
(Figure: triangular filter bank — magnitude vs. frequency)
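The filter-bank idea can be sketched as code: triangular bins spaced uniformly on a perceptual (mel) scale between the lower and upper cut-offs, each holding a weighted sum of spectral magnitudes. The mel formula and all parameter values below are common defaults, not values taken from the slides:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000, f_lo=0.0, f_hi=8000.0):
    """Triangular filters equally spaced on the mel scale between the
    lower and upper frequency cut-offs."""
    mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                       # rising edge of triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                       # falling edge of triangle
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

fb = mel_filterbank()
```

Multiplying a power spectrum by `fb.T` yields one weighted sum per channel, which is then logged and decorrelated (DCT) to obtain MFCCs.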
Segmentation
- Segmentation cuts the audio stream at the acoustic change points
- BIC (Bayesian Information Criterion) is used
- It is threshold-free and robust
- The input audio stream is modeled as Gaussians
(Figure: a Gaussian distribution and its mean)
Segmentation (cont)
- Notation for an audio stream
- N: the number of frames
- X = {x_i : i = 1, 2, …, N}: a set of feature vectors
- µ is the mean
- Σ is the full covariance matrix
Segmentation for single change pt.
- Assume the change point is at frame i
- H0, H1: two different models
- H0 models the data as one Gaussian
- x1 … xN ~ N(µ, Σ)
- H1 models the data as two Gaussians
- x1 … xi ~ N(µ1, Σ1)
- x(i+1) … xN ~ N(µ2, Σ2)
Segmentation for single change pt. (cont)
- The maximum likelihood ratio statistic is
- R(i) = N log|Σ| - N1 log|Σ1| - N2 log|Σ2|
(Figure: audio stream from frame 1 to frame N, with the change point at frame i)
Segmentation for single change pt. (cont)
- BIC(i) = R(i) - λP, where P is the model complexity penalty
- BIC(i) is +ve → i is the change point
- BIC(i) is -ve → i is not the change point
- Which model fits the data better: a single Gaussian (H0) or two Gaussians (H1)?
Segmentation for single change pt. (cont)
- To detect a single change point, we calculate BIC(i) for all i = 1, 2, …, N
- The frame i with the largest BIC value is the change point
- O(N) to detect a single change point
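The single change point search can be sketched as follows. Only the overall form BIC(i) = R(i) − λP comes from the slides; the ½ factors on the log-likelihood terms and the penalty P = ½(d + d(d+1)/2) log N are the usual Gaussian-BIC choices and are assumptions here:

```python
import numpy as np

rng = np.random.default_rng(0)

def logdet_cov(x):
    """log|S| for the full covariance matrix of frames x (rows = frames)."""
    _, val = np.linalg.slogdet(np.atleast_2d(np.cov(x, rowvar=False)))
    return val

def delta_bic(x, i, lam=1.0):
    """BIC(i) = R(i) - lam * P for a candidate change point after frame i."""
    n, d = x.shape
    r = 0.5 * (n * logdet_cov(x)
               - i * logdet_cov(x[:i])
               - (n - i) * logdet_cov(x[i:]))
    p = 0.5 * (d + d * (d + 1) / 2) * np.log(n)   # assumed penalty form
    return r - lam * p

def single_change_point(x, margin=5):
    """Frame i with the largest BIC(i); None when no BIC(i) is positive."""
    scores = {i: delta_bic(x, i) for i in range(margin, len(x) - margin)}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Two synthetic 3-D Gaussian segments with a change point at frame 100.
x = np.vstack([rng.normal(0.0, 1.0, (100, 3)),
               rng.normal(4.0, 1.0, (80, 3))])
cp = single_change_point(x)
```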
Segmentation for multiple change pt.
- Step 1: Initialize the interval [a, b]; set a = 1, b = 2
- Step 2: Detect a change point in interval [a, b] through the BIC single change point detection algorithm
- Step 3: If there is no change point in interval [a, b], then set b = b + 1; else let t be the change point detected, and set a = t + 1, b = t + 2
- Step 4: Go to Step 2
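Steps 1-4 above can be sketched as a growing-window loop. The BIC scoring inside is the same assumed Gaussian form as before, and `grow` generalizes the step size (the slide's +1; the enhanced engine later uses +100):

```python
import numpy as np

rng = np.random.default_rng(1)

def logdet(x):
    """log|S| of the full covariance of frames x (rows = frames)."""
    _, v = np.linalg.slogdet(np.atleast_2d(np.cov(x, rowvar=False)))
    return v

def best_bic(x, lam=1.0, margin=10):
    """(i, BIC(i)) for the best candidate change point inside x."""
    n, d = x.shape
    p = 0.5 * (d + d * (d + 1) / 2) * np.log(n)   # assumed penalty form
    ld = logdet(x)
    scores = [(i, 0.5 * (n * ld - i * logdet(x[:i]) - (n - i) * logdet(x[i:]))
                  - lam * p)
              for i in range(margin, n - margin)]
    return max(scores, key=lambda s: s[1])

def multiple_change_points(x, grow=20, margin=10):
    """Widen [a, b) until a change point is confirmed, then restart past it."""
    a = 0
    b = 2 * margin + grow
    found = []
    while b <= len(x):
        i, score = best_bic(x[a:b], margin=margin)
        if score > 0:                 # change point confirmed at a + i
            t = a + i
            found.append(t)
            a, b = t + 1, t + 1 + 2 * margin + grow
        else:                         # no change point: grow the interval
            b += grow
    return found

# Three 3-D Gaussian segments; true change points at frames 120 and 240.
x = np.vstack([rng.normal(0, 1, (120, 3)),
               rng.normal(5, 1, (120, 3)),
               rng.normal(-5, 1, (120, 3))])
cps = multiple_change_points(x)
```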
Enhanced Implementation Algorithm
- Original multiple change point detection algorithm
- Starts detecting change points within 2 frames
- Increases the investigation interval by 1 each time
- Enhanced implementation algorithm
- The minimum processing interval used in our engine is 100 frames
- Increases the investigation interval by 100 each time
Enhanced Implementation Algorithm (cont)
- Why do we choose to increase the interval by 100 frames?
- If the increase is too large, a scene change may be missed
- It must be smaller than 170 frames because there are around 170 frames in 1 second
- If the increase is too small, processing is too slow
Enhanced Implementation Algorithm (cont)
- Advantage: speed-up
- Trade-off: the change point we detect is not very accurate
- To compensate:
- Investigate the frames around the detected change point again
- The investigation interval is incremented by 1 to locate a more accurate change point
Training and Modeling
- Before performing the various identifications, training and modeling are needed
- A probability-based model, the Gaussian Mixture Model (GMM), is used
- GMM is used for language identification, gender identification, and speaker identification
- A GMM is composed of many different Gaussian distributions
- A Gaussian distribution is represented by its mean and variance
Gaussian Mixture Model (GMM)
Model for Speaker i
- To train a model is to calculate the mean, variance, and weight for each of the Gaussian distributions
Training of speaker GMMs
- Collect sound clips that are long enough for each speaker (e.g. 20-minute sound clips)
- Steps for training one speaker model:
- Step 1: Start with an initial model λ
- Step 2: Calculate the new mean, variance, and weighting (a new λ) by training
- Step 3: Use the new λ if it represents the model better than the old λ
- Step 4: Repeat Steps 2 to 3
- Finally, we get a λ that can represent the model
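The training loop above corresponds to EM for a GMM. A minimal 1-D sketch on synthetic data (note that EM never decreases the likelihood, so Step 3's comparison always accepts the re-estimated λ):

```python
import numpy as np

rng = np.random.default_rng(2)

def train_gmm(x, k=2, iters=50):
    """Minimal 1-D EM: the model (lambda) is (weights, means, variances).
    Step 1: initial model; Steps 2-4: re-estimate and repeat."""
    n = len(x)
    w = np.full(k, 1.0 / k)                          # mixture weights
    mu = np.quantile(x, np.linspace(0.05, 0.95, k))  # spread-out initial means
    var = np.full(k, np.var(x))
    for _ in range(iters):
        # E-step: responsibility of each Gaussian for each sample
        dens = (w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: new weights, means, variances (the new lambda)
        nk = resp.sum(axis=0)
        w = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

# Synthetic "speaker" features: two clear clusters around 0 and 10, 40/60 mix.
x = np.concatenate([rng.normal(0, 1, 400), rng.normal(10, 1, 600)])
w, mu, var = train_gmm(x)
```

Real speaker models use multivariate MFCC vectors and diagonal-covariance components, but the update equations are the same shape.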
Applications
- Video scene change and speaker tracking
- Speaker Identification
- Telephony message notification
Video Scene Change and Speaker Tracking
(Flow chart: Video Clip → AdvAIR core segmentation → Timing and Speaker Information → Video Playing Mechanism, which combines the Speaker Index Information into a Multimedia Presentation.)
Usage
- Speaker tracking enhances data mining about a particular person (e.g. a politician in a conference)
- Audio information indexing and sorting for audio library storage
- An auxiliary tool for video cutting and editing applications
Screenshot
(Screenshot: input clip, multimedia player, time information and indexing)
Speaker Identification
(Flow chart. Training stage: Preprocessed Speaker Clip → GMM Model Training → Speaker Models Database. Testing stage: Sound Source → Speaker Comparison Mechanism, scored against the Speaker Models Database → Speaker Identity.)
Usage
- Security authentication
- Speaker identification for telephone-based systems
- Criminal investigation (for example, used similarly to fingerprints)
Screenshot
(Screenshot: input source, flexible-length comparison, media player for visual verification, speaker identity)
Telephony Message Notification
(Flow chart: when the user can't listen, the caller's message is recorded, passed through AdvAIR segmentation and GMM model comparison against the desired-group model database, and classified into the desired or non-desired group; a messaging API then notifies the user via the short message system or the e-mail system.)
Threshold-free BIC criterion
| Test | Wave length | Actual turning points | False alarms | Missed points | Time used |
|---|---|---|---|---|---|
| 1 | 9 seconds | 2 | 0 | 0 | 2 seconds |
| 2 | 12 seconds | 4 | 0 | 0 | 4 seconds |
| 3 | 25 seconds | 3 | 0 | 0 | 8 seconds |
| 4 | 120 seconds | 8 | 1 | 0 | 134 seconds |
| 5 | 540 seconds | 12 | 8 | 0 | 1200 seconds |

- Background noise affects accuracy
Enhanced Implementation
| Test | Method | Wave length | Actual turning points | False alarms | Missed points | Time used |
|---|---|---|---|---|---|---|
| 1 | Old | 9 seconds | 2 | 0 | 0 | 10 seconds |
| 1 | New | 9 seconds | 2 | 0 | 0 | 2 seconds |
| 2 | Old | 12 seconds | 4 | 0 | 0 | 40 seconds |
| 2 | New | 12 seconds | 4 | 0 | 0 | 4 seconds |
| 3 | Old | 25 seconds | 3 | 1 | 0 | 1300 seconds |
| 3 | New | 25 seconds | 3 | 2 | 0 | 8 seconds |
| 4 | Old | 540 seconds | 18 | 7 | 2 | over 1 day |
| 4 | New | 540 seconds | 18 | 8 | 2 | 1200 seconds |

- The speed enhancement depends on the number of change points relative to the clip length
GMM model closed-set speaker identification
- Training stage
- 10 speakers
- 5 males, 5 females
- 20 minutes for each speaker
- Testing stage
- 50 sound clips of 5 seconds duration
- 48 sound clips are correct, i.e. 96%
GMM model open-set speaker identification
- Accept or reject as the result
- Same setting as the closed-set experiment
- i.e. 10 speakers, 20 minutes each
- Correct: 45/50 = 90%
- False reject: 3/50 = 6%
- False accept: 2/50 = 4%
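The open-set accept/reject decision can be sketched as follows, with each speaker model reduced to a single diagonal Gaussian for brevity (the threshold value and speaker names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def loglik(x, mu, var):
    """Average per-frame log-likelihood under a 1-D Gaussian, standing in
    for a full GMM score."""
    return float(np.mean(-0.5 * (np.log(2 * np.pi * var)
                                 + (x - mu) ** 2 / var)))

def identify(x, models, threshold=-4.0):
    """Open-set decision: score the clip against every enrolled model, pick
    the best, and reject when even the best score falls below the threshold."""
    scores = {name: loglik(x, mu, var) for name, (mu, var) in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None   # None = reject

models = {"alice": (0.0, 1.0), "bob": (6.0, 1.0)}        # hypothetical speakers
clip_bob = rng.normal(6.0, 1.0, 200)       # in-set speaker: should be accepted
clip_unknown = rng.normal(30.0, 1.0, 200)  # impostor far from both models
```

Short clips give noisy averages, which is one reason the open-set decision function loses accuracy on short durations.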
Problems and limitations
- Accuracy is affected by background noise
- Some speakers have very similar sound features
- The open-set speaker identification decision function is not very accurate if the duration is short
- Segmentation is still a time-consuming process
Future Work
- Speaker gender identification
- Robust open-set speaker identification
- Speech content recognition
- Music pattern matching
- Distributed system for segmentation
Q & A