Title: METISS
1METISS
Modélisation et Expérimentation pour le Traitement des Informations et des Signaux Sonores
(Modelling and Experimentation for the Processing of Sound Information and Signals)
Audio speech processing
INRIA-Rennes
Scientific leader: Frédéric BIMBOT
Overview of activities 2002-2005
2Introduction
3Framework and foundations
analysis, processing, modelling, representation, description, decomposition, detection, classification, recognition
of
audio, speech, music, multimedia
signals, recordings, streams, tracks
Audio scene analysis, description and recognition
- Scientific foundations
- Probabilistic models and statistical estimation
- Redundant systems and adaptive representations
4Scientific objectives
- to design generic, robust, fast and flexible approaches to a variety of problems in speech and audio segmentation, detection and classification, operating in the probabilistic framework
- to investigate the theoretical properties and practical applications of adaptive representations and sparseness criteria, with the purpose of advanced processing and structured description of audio signals
- to extend and adapt approaches classically used in the context of speech processing to other classes of signals and problems
- to study the convergence between statistical approaches and adaptive decomposition within a common framework embedding signal representations and classification
5Application domain and focus
- Applicative fields
- Security, verification, authentication, rights management
- Rich audio transcription, content-based indexing, multi-purpose navigation, information retrieval and summarization
- Advanced audio processing: segmentation, separation, spatialisation, sound object extraction, music modeling
- Audio and audio-visual authoring, production and repurposing
- Education and entertainment
- Primary focuses
- Speaker characterisation
- Audio structuring and indexing
- Sparse representations: theory and applications
- Audio source separation (under-determined case)
6Team composition
Years: 2002, 2003, 2004, 2005
Permanent researchers (CR, CNRS or INRIA): 3
Non-permanent staff (engineers, ATER, post-docs): 2
PhD students, 100% with METISS: 3
PhD students, 50% with METISS: 2
Marie-Noëlle Georgeault: administrative assistant (25%)
7Probabilistic modeling of audio signals
8Probabilistic modeling (1)
- 1 audio class or 1 sound object → a variety of observations
- 1 family of sounds → 1 probabilistic model
- 1 probability density function → 1 likelihood function
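As a minimal sketch of the "one family of sounds, one probabilistic model" idea, the Python example below fits one diagonal Gaussian per class and labels a sequence of frames by maximum log-likelihood. The single Gaussian is a deliberately simple stand-in for the GMMs or HMMs normally used, and the class names, feature dimension and synthetic data are illustrative assumptions, not METISS material.

```python
import numpy as np

def fit_diag_gaussian(X):
    """Estimate per-dimension mean and variance from observations X of shape (n, d)."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6  # small variance floor for stability

def log_likelihood(X, mean, var):
    """Total log-likelihood of the frames in X under a diagonal Gaussian."""
    ll = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var)
    return ll.sum()

def classify(X, models):
    """Return the class whose model gives X the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(X, *models[name]))

# Synthetic training data for two hypothetical sound classes.
rng = np.random.default_rng(0)
train = {
    "speech": rng.normal(0.0, 1.0, size=(500, 4)),
    "music": rng.normal(3.0, 1.0, size=(500, 4)),
}
models = {name: fit_diag_gaussian(X) for name, X in train.items()}

# Frames drawn from the "music" distribution should be labelled as music.
test_frames = rng.normal(3.0, 1.0, size=(50, 4))
predicted = classify(test_frames, models)
```

The same likelihood function serves detection, classification and verification; only the decision rule applied to the scores changes.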
9Probabilistic modeling (2)
Know-how: probabilistic modeling, statistical estimation, state-sequence decoding, Bayesian decision
→ Detection, classification, verification, segmentation
Probabilistic models offer a well-understood
generic inter-operable framework for the
description and the classification of audio and
speech signals
- Dominant position of Hidden Markov Models (HMM) (and variants)
- Highly competitive field in speech processing (research and industry)
- More open in audio indexing (additional factors of complexity)
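State-sequence decoding in an HMM is usually done with the Viterbi algorithm. The sketch below is a generic log-domain implementation applied to a hypothetical two-state model (for example, state 0 = speech, state 1 = music); all probabilities are made-up values chosen for illustration.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence of an HMM.
    log_pi: (S,) initial log-probabilities
    log_A:  (S, S) transition log-probabilities (row = from, column = to)
    log_B:  (T, S) per-frame emission log-likelihoods"""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: best path ending in i, then i -> j
        back[t] = scores.argmax(axis=0)      # best predecessor of each state j
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack from the best final state
        path.append(int(back[t, path[-1]]))
    return path[::-1]

log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
log_B = np.log([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]])
segmentation = viterbi(log_pi, log_A, log_B)  # -> [0, 0, 1, 1, 1]
```

The decoded path doubles as a segmentation: the change of state between frames 1 and 2 marks the boundary between the two classes.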
10Challenges and positioning
- Generalisation to wider classes of signals with an audio component: multiple scales, multiple sources, multiple structures, multiple sensors, multiple levels of underlying processes, heterogeneous streams (audio-visual), external sources of knowledge
- Robustness: to unseen acoustic conditions, to scarce training data, to poorly representative samples, to missing observations, ...
- Implementability: size, speed, scalability, distribution, etc.
METISS positioning
- robust training and test methods
- compact distributed algorithms
- versatility / migration of formalism
- methodology and evaluation
→ speaker verification, audio segmentation, broad sound-class indexing (→ speech recognition)
11Adaptive representations
12Adaptive representations (1)
- Audio signal: diversity of structures (time, frequency, statistics, ...), superimposition of objects (notes, sources, tracks, ...)
Redundant system (dictionary of atoms)
- Large set of vectors with various scales, time structures, frequency structures, phases, statistical properties, ...
Adaptive decomposition
- Selection of the best decomposition according to a given criterion: sparsity, perception criterion, separability, conditional entropy, ...
13Adaptive representations (2)
Sparsity criteria
Decomposition
- ℓ2 (quadratic norm) → maximizes dispersion
- ℓ0 (minimum number of non-zero coefficients) → NP-complete
- ℓ1 → tractable compromise
→ Pursuit algorithms (Matching Pursuit)
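A minimal sketch of plain Matching Pursuit follows: at each step it picks the atom most correlated with the residual and subtracts its contribution. The dictionary (Dirac basis plus DCT basis) and the two-atom test signal are assumptions chosen for illustration; this is not the MPTK implementation.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_iter):
    """Greedy decomposition of `signal` over unit-norm atoms (columns of `dictionary`)."""
    residual = signal.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_iter):
        correlations = dictionary.T @ residual
        k = int(np.argmax(np.abs(correlations)))       # best-matching atom
        coeffs[k] += correlations[k]
        residual -= correlations[k] * dictionary[:, k]  # remove its contribution
    return coeffs, residual

# Redundant dictionary: Dirac basis plus an orthonormal DCT-II basis (2N atoms in dimension N).
N = 64
n = np.arange(N)
dct = np.cos(np.pi * (n[:, None] + 0.5) * n[None, :] / N)
dct /= np.linalg.norm(dct, axis=0)
D = np.hstack([np.eye(N), dct])

# Test signal: one spike plus one cosine atom, i.e. 2 non-zero coefficients out of 128.
x = 2.0 * D[:, 10] + 1.5 * D[:, N + 5]
coeffs, residual = matching_pursuit(x, D, n_iter=20)
```

On this example the residual energy decays geometrically and the two dominant coefficients land on the spike and the cosine atom, which is the sparse description a single orthonormal basis could not give.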
14Ongoing scientific issues
- Optimality and convergence of adaptive decompositions
- Dictionary design (knowledge-based, data-driven, ...)
- Deformable, stochastic, multi-dimensional, ... atoms
- Efficient decomposition algorithms and implementations
- Application scope
- Recent fast-growing field
- High applicative potential
- Intense emerging competition
15Achievements 2002-2005 and selected results
- Speaker characterisation
- Audio structuring and indexing
- Sparse representations: theory and applications
- Audio source separation (under-determined case)
16Speaker characterisation
- CART trees for scalable and distributable speaker verification
- Model-based metrics and normalisations for speaker verification
- Structural adaptation of speaker models (hierarchical Bayesian networks)
- Methodology and algorithms for optimizing the coverage of a speaker database
- Relative speaker space and metrics for efficient speaker indexing and retrieval (ongoing)
17CART based speaker verification
Blouet, Bimbot, Gonon, et al.
Direct score function assignment
CART trees used as a family of approximating functions
(Figure: binary tree with YES/NO tests at each node and leaf scores of -0.8, -0.4, 0.3, 0.7 and 0.9)
Extension to oblique trees
(Figure: oblique tree with YES/NO tests and leaf scores of -0.5, 0.3 and 0.9)
Complexity reduced by a factor of 200, error rate up by only 33%
EU-IST INSPIRED Project
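The idea of a tree that assigns a verification score directly can be sketched as follows. The tree structure, the thresholds and the averaging of per-frame scores are assumptions made for this example (only the leaf values echo those on the slide); the actual trees of Blouet et al. are not reproduced here.

```python
import numpy as np

class Leaf:
    def __init__(self, score):
        self.score = score

class Node:
    """Binary test w . x <= threshold. An axis-aligned split has a one-hot w,
    an oblique split uses a general weight vector."""
    def __init__(self, w, threshold, yes, no):
        self.w = np.asarray(w, dtype=float)
        self.threshold = threshold
        self.yes, self.no = yes, no

def tree_score(tree, x):
    """Walk from the root to a leaf and return the score stored there."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.w @ x <= tree.threshold else tree.no
    return tree.score

def verify(tree, frames, decision_threshold=0.0):
    """Average the per-frame scores and accept if the mean exceeds the threshold."""
    mean_score = float(np.mean([tree_score(tree, x) for x in frames]))
    return mean_score, mean_score > decision_threshold

# Hypothetical tree over 2-D features: an axis-aligned root and an oblique child.
tree = Node([1, 0], 0.0,
            yes=Leaf(-0.8),
            no=Node([1, -1], 0.5, yes=Leaf(0.3), no=Leaf(0.9)))
frames = np.array([[1.0, 0.2], [2.0, 0.5], [-1.0, 0.0]])
score, accepted = verify(tree, frames)
```

Scoring a frame costs only a handful of comparisons or dot products, which is consistent with the large complexity reduction reported on the slide.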
18Speaker recognition in the model space (1)
Ben, Bimbot et al.
Formal links between LLR and KL divergence
Mean-only adaptation training procedure
Likelihood ratio test → Euclidean distance in the model space
19Speaker recognition in the model space (2)
Ben, Bimbot et al.
Consequences
- faster score computation procedure (at least -50%)
- simpler normalization schemes (M-Norm): no need for additional development data, with no performance degradation
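The slide's derivation is not reproduced in this transcript. As an illustration of the underlying idea: for two Gaussians sharing a covariance, the KL divergence is half the squared Mahalanobis distance between the means, and for two GMMs that share weights and covariances (mean-only adaptation) the weighted sum of such terms is a standard upper bound on the KL divergence. The sketch below computes the corresponding distance, with made-up model parameters and hypothetical function names.

```python
import numpy as np

def model_space_distance(means_a, means_b, weights, variances):
    """Weighted Euclidean distance between two GMMs that share weights and
    diagonal covariances and differ only in their means.
    means_*: (M, d), weights: (M,), variances: (M, d)"""
    diff2 = (means_a - means_b) ** 2 / variances
    return float(np.sqrt(np.sum(weights[:, None] * diff2)))

def supervector(means, weights, variances):
    """Stack normalised means so that the plain Euclidean distance between
    two supervectors equals model_space_distance."""
    return (np.sqrt(weights)[:, None] * means / np.sqrt(variances)).ravel()

weights = np.array([0.5, 0.5])
variances = np.array([[1.0, 4.0], [1.0, 1.0]])
world = np.array([[0.0, 0.0], [3.0, 3.0]])      # background model means
speaker = np.array([[1.0, 2.0], [3.0, 4.0]])    # mean-adapted speaker model

d = model_space_distance(speaker, world, weights, variances)
d_sv = float(np.linalg.norm(supervector(speaker, weights, variances)
                            - supervector(world, weights, variances)))
```

Because the distance depends only on model parameters, it can be computed without passing over the test frames again, which is one plausible reading of the faster score computation mentioned above.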
20Audio indexing
- HMM-based audio and audio-visual structuring (applied to sports programmes)
- Audio segmentation and tracking using probabilistic models and statistical tests
- Detection of simultaneous events in audio tracks
- Granular models of audio signals using deformable atoms
- Comparison and evaluation of beam-search techniques and hypothesis rescoring using external sources of knowledge (ongoing)
- Algebraic representations and statistical modeling of formal music (ongoing)
21Multi-stream HMM modeling (1) of a tennis match
Kijak et al. (with TMM)
Multi-level state-sequence representation of a tennis match, inspired by and adapted from speech recognition paradigms
→ multi-stream audio-visual HMM
22Multi-stream HMM modeling (2)
Delakis, Gravier et al. (with TexMex)
- segmental models → relaxed synchrony constraints
Video only, shot-based: C = 77%
Video + audio, shot-based segmental: C = 85%
23Sparse representations
- Mathematical test for the optimality of a sparse representation
- Matching pursuit made tractable (1 hour → 0.25 x RT)
- Structured matching pursuit incorporating explicit signal family models
- Adaptive computational strategies
- Beyond sparsity: recovering structured representations
- Learning shift-invariant atoms (MoTIF algorithms) (ongoing)
24Sparse solutions to inverse linear problems
Gribonval et al.
- In the under-determined case:
If a sparse representation is sparse enough, then it is the sparsest one
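The formula that accompanied this statement did not survive extraction. One classical sufficient condition of this type, given here as an illustration rather than as the exact statement on the slide, uses the mutual coherence of a dictionary D with unit-norm atoms d_i:

```latex
\mu(\mathbf{D}) = \max_{i \neq j} \left| \langle d_i, d_j \rangle \right|,
\qquad
\mathbf{s} = \mathbf{D}\mathbf{x}
\ \text{ and } \
\|\mathbf{x}\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right)
\;\Longrightarrow\;
\mathbf{x} \text{ is the unique sparsest representation of } \mathbf{s}.
```

For the union of the Dirac and Fourier bases in dimension N, the coherence is 1/sqrt(N), so any representation with fewer than (1 + sqrt(N))/2 non-zero coefficients is guaranteed to be the sparsest.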
25Matching Pursuit made tractable
Gribonval, Krstulovic et al.
MPTK: C++ toolkit, GPL licence
Flexible operation, reproducible results
For a 1-hour audio signal, processing time reduced from 20 h → 0.25 h
Usable in other fields: medical signals, seismology, etc.
26Source separation (with primary focus on underdetermined problems)
- Statistical schemes and adaptive training for single-channel separation
- Source separation approaches using multi-channel Matching Pursuit in the underdetermined case
- Contributions to evaluation methodology: task definition, performance measurements
- Speech denoising using underdetermined source separation techniques
- Dictionary design methods for source separation (ongoing)
- DEMIX: a robust algorithm to estimate the number of sources using clustering techniques (ongoing)
27Single sensor audio source separation
Benaroya, Bimbot, Gribonval, Ozerov (with FTRD)
(Diagram: observed signal (voice + music) → Wiener filter → estimated voice signal; the voice GMM and the music GMM are combined into a factorial GMM that drives the filter)
Use of a factorial GMM to build a time-varying Wiener filter
Article in IEEE Trans. SAP 2006; new results to come
- innovative scheme for underdetermined source separation
- compatibility with speech processing state-of-the-art
- strong links with sparse decomposition problems
- versatile and efficient for a range of audio description tasks
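One way to read "factorial GMM driving a time-varying Wiener filter" is sketched below: each pair of (voice state, music state) defines a Wiener gain, and the gains are averaged with the posterior probability of each pair given the observed frame. The spectral templates, the priors and the posterior-mean combination are assumptions made for this example; the published scheme may differ in its details.

```python
import numpy as np

def wiener_gain(power_frame, voice_psds, music_psds, voice_w, music_w):
    """Time-varying Wiener gain for one frame.
    power_frame: (F,) observed power spectrum |X(f)|^2
    voice_psds: (K, F), music_psds: (L, F) spectral templates of each GMM state
    voice_w: (K,), music_w: (L,) prior state weights"""
    # Variance of the mixture for every state pair (k, l): shape (K, L, F).
    total = voice_psds[:, None, :] + music_psds[None, :, :]
    # Log-likelihood of the frame under a zero-mean complex Gaussian per bin.
    loglik = -(np.log(np.pi * total) + power_frame / total).sum(axis=-1)
    logpost = loglik + np.log(voice_w)[:, None] + np.log(music_w)[None, :]
    post = np.exp(logpost - logpost.max())
    post /= post.sum()                       # posterior over state pairs (k, l)
    # Posterior-weighted average of the per-pair Wiener gains.
    gains = voice_psds[:, None, :] / total
    return np.einsum("kl,klf->f", post, gains)

# Two hypothetical voice states (low-pitched, high-pitched) and one flat music state.
voice_psds = np.array([[4.0, 1.0, 0.1], [0.1, 1.0, 4.0]])
music_psds = np.array([[1.0, 1.0, 1.0]])
gain = wiener_gain(np.array([5.0, 2.0, 1.1]), voice_psds, music_psds,
                   np.array([0.5, 0.5]), np.array([1.0]))
# The estimated voice spectrum for this frame would be gain * X(f).
```

Because the posterior is recomputed for every frame, the filter follows the signal over time even though each GMM state is itself a fixed spectral shape.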
28Underdetermined stereophonic source separation using sparse methods
Lesage, Gribonval et al.
Mixing matrix
Separation
Audio examples available: least squares, sparsity
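The contrast between a least-squares and a sparsity-based separation can be illustrated on a toy stereo mixture with a known mixing matrix. The sketch below is not the method of Lesage, Gribonval et al.: the sparse variant simply assumes one dominant source per sample, works on raw samples rather than time-frequency coefficients, and uses made-up mixing angles and sources.

```python
import numpy as np

def separate_least_squares(X, A):
    """Minimum l2-norm solution: spreads each mixture sample over all sources."""
    return np.linalg.pinv(A) @ X

def separate_sparse(X, A):
    """Assume one dominant source per sample: pick the mixing direction that
    best matches the sample and assign its projection to that source."""
    norms = np.linalg.norm(A, axis=0)
    proj = (A / norms).T @ X                 # (n_sources, n_samples)
    best = np.argmax(np.abs(proj), axis=0)   # index of the best-matching direction
    S = np.zeros_like(proj)
    cols = np.arange(X.shape[1])
    S[best, cols] = proj[best, cols] / norms[best]
    return S

# Hypothetical stereo mixture of 3 sources that are never active together.
angles = np.array([0.2, 0.8, 1.3])
A = np.vstack([np.cos(angles), np.sin(angles)])   # 2 x 3 mixing matrix
S_true = np.zeros((3, 6))
S_true[0, [0, 3]] = [1.0, -2.0]
S_true[1, [1, 4]] = [0.5, 1.5]
S_true[2, [2, 5]] = [-1.0, 2.0]
X = A @ S_true

err_ls = np.linalg.norm(separate_least_squares(X, A) - S_true)
err_sparse = np.linalg.norm(separate_sparse(X, A) - S_true)
```

With two channels and three sources the least-squares solution cannot isolate any source and leaks each sample into all three estimates, while the sparsity assumption recovers this toy mixture exactly.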
29Collaborations, Dissemination and Visibility
- Privileged cooperation with the TEXMEX group at IRISA (and VISTA)
- Consistent network of academic and industrial partners outside IRISA
- Regular participation in collaborative projects (EU-IST, RNRT, bilateral partnerships, ...)
- Strong involvement in concerted research actions (ESTER, MathSTIC, GDR-ISIS, NIST evaluations, ...)
- Visible participation in and production of free software: ELISA platform, AudioSeg, MPTK, SIROCCO, BSS-EVAL
- Sustained effort of publication and dissemination of the group's research results
- Additional visibility through responsibilities taken in scientific societies, workshop organisation and editorial boards
30Summary 2002-2005, strategy and perspectives 2006-2010
31Achievements 2002-2005 (1)
- solid contributions to the state of the art on several topics related to speaker and audio class modelling and recognition
- key extension, experimentation and validation of the Hidden Markov Model framework for joint audio and video modelling and structuring
- major theoretical and experimental progress in the field of sparse representations and adaptive decomposition
- pioneering work in mono- and multi-channel source separation in the underdetermined case
32Achievements 2002-2005 (2)
- strategic improvement in the efficiency of pursuit algorithms, both in terms of search strategy and implementation
- development of usable know-how in keyword spotting and speech recognition
- sustained activities in assessment methodology, resource distribution and evaluation campaigns
- scientific objective 4 needs consolidation
33Strategy 2006-2010
- To keep our position in our initial field of expertise: models, algorithms and tools for automatic processing of audio and speech signals
- To push our advantage in the field of sparse representations, both from the theoretical and the applicative viewpoint
- To extend our scope towards more powerful approaches for the representation and modeling of audio and multi-modal signals with an audio component
- To step in and progress in the area of compressing large-scale high-dimensional multi-modal data
34Scientific challenges
- Probabilistic multi-level multi-stream dependency models for the representation of multiple sources and the integration of heterogeneous levels of knowledge in audio(-visual) streams → Bayesian networks
- Data-driven representations, model discovery and self-structuring of information in audio and audio-visual streams and contents → theoretical consolidation
- Experimental platforms and numerically efficient algorithms for large-scale data and near real-time processing → engineering work
- Deeper understanding of the links between theoretical concepts of adaptive representation, sparse decomposition, multi-scale analysis and practical implications in terms of robustness, separability and adaptability → potential links with SVM
- Compressing large-scale high-dimensional multimodal data for storage, description and classification → compressed sensing
35Questions