Locating Singing Voice Segments Within Music Signals

About This Presentation

Slides: 21

Provided by: adamber

Learn more at: https://www.ee.columbia.edu

Transcript and Presenter's Notes

1
Locating Singing Voice Segments Within Music Signals

Adam Berenzweig and Daniel P.W. Ellis
LabROSA, Columbia University
alb63_at_columbia.edu, dpwe_at_ee.columbia.edu

2
LabROSA

What
Where
Who
Why you love us

3
The Future as We Hear It

Online Digital Music Libraries
The Coming Age of Streaming Music Services
Information Retrieval: How do we find what we want?
Recommendation: How do we know what we want to find?
Collaborative Filtering vs. Content-Based
What is Quality?

4
Motivation

Lyrics Recognition: Baby Steps
Segmentation
Forced Alignment
A Corpus
Song structure through singing structure?
Fingerprinting
Retrieval
Feature for similarity measures

5
Lyrics Recognition: Can YOU do it?

Notoriously hard, even for humans.
amIright.com, kissThisGuy.com
Why so hard?
Noise, music, whatever.
Singing is not speech: voice transformations
Strange word sequences (poetry)
Need a corpus

6
History of the Problem

Segmentation for Speech Recognition: Music/Speech
Scheirer & Slaney
Forced Alignment - Karaoke
Cano et al. REF NEEDED
Acoustic feature design: Custom job or Kitchen Sink?
Idea! Use a speech recognizer: PPF (Posterior Probability Features)
Williams & Ellis
Ultimately: Source separation, CASA

7
A Peek at the End

8
Architecture Overview

[Block diagram; labels recovered from the slide: Audio, PLP cepstra, Speech Recognizer (Neural Net), posteriogram, Feature Calculation (Entropy H, H/h, Dynamism D, P(h)), Time-averaging, Gaussian Model (x2), Segmentation (HMM)]

9
Architecture Overview

[Block diagram; labels recovered from the slide: Audio, PLP cepstra, Speech Recognizer (Neural Net), posteriogram, Neural Net (x2), Segmentation (HMM)]

10
So how's that working out for you, being clever?

Entropy
Entropy excluding background
Dynamism
Background probability
Distribution Match: Likelihoods under single Gaussian model
Cepstra
PPF
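
The slide names these posteriogram features without defining them, so the following is only a minimal sketch under stated assumptions: each frame is a list of class posteriors that sums to 1, the background class sits at a known index (`bg_index`, a hypothetical parameter), "entropy excluding background" renormalizes the remaining posteriors before taking the entropy, and dynamism is taken to be the mean squared frame-to-frame change in the posteriors. The definitions actually used in the talk may differ in detail.

```python
import math

def frame_entropy(posteriors):
    """Entropy (in nats) of one frame of class posteriors."""
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)

def entropy_excluding_background(posteriors, bg_index):
    """Entropy over the non-background classes only.

    Assumption: the remaining posteriors are renormalized to sum to 1.
    """
    rest = [p for i, p in enumerate(posteriors) if i != bg_index]
    total = sum(rest)
    if total <= 0.0:
        return 0.0
    return frame_entropy([p / total for p in rest])

def dynamism(posteriogram):
    """Mean squared frame-to-frame change in the posteriors (assumed definition)."""
    if len(posteriogram) < 2:
        return 0.0
    diffs = [
        sum((cur_p - prev_p) ** 2 for prev_p, cur_p in zip(prev, cur))
        for prev, cur in zip(posteriogram, posteriogram[1:])
    ]
    return sum(diffs) / len(diffs)

def background_probability(posteriogram, bg_index):
    """Per-frame posterior of the background class."""
    return [frame[bg_index] for frame in posteriogram]

if __name__ == "__main__":
    # Toy posteriogram: 3 frames, 4 classes, class 0 treated as background.
    toy = [
        [0.70, 0.10, 0.10, 0.10],
        [0.10, 0.80, 0.05, 0.05],
        [0.10, 0.10, 0.70, 0.10],
    ]
    print([frame_entropy(f) for f in toy])
    print([entropy_excluding_background(f, 0) for f in toy])
    print(dynamism(toy))
    print(background_probability(toy, 0))
```

A confident recognizer output gives low entropy, while a flat, confused output gives high entropy, which is the intuition behind using these statistics to separate singing from instrumental music.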

11
Recovering context with the HMM

Transition probabilities: inverse average segment duration
Emission probabilities: Gaussian fit to time-averaged distribution
Segmentation: the Viterbi path
Evaluation: frame error rate (no boundary consideration)
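
The HMM step above can be sketched roughly as follows. This is an illustration, not the authors' implementation: it assumes a single scalar feature per frame, one 1-D Gaussian per state, a uniform initial state distribution, and average segment durations given in frames (greater than 1). All function names are hypothetical. The probability of leaving a state is set to the inverse of its average duration, and the frame error rate simply counts mismatched frames with no tolerance around boundaries.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def transition_log_probs(avg_durations):
    """Log transition matrix: P(leave state) = 1 / average duration in frames."""
    n = len(avg_durations)
    log_a = []
    for i, dur in enumerate(avg_durations):
        leave = 1.0 / dur  # assumes dur > 1 so that staying has non-zero probability
        row = [
            math.log(1.0 - leave) if i == j else math.log(leave / (n - 1))
            for j in range(n)
        ]
        log_a.append(row)
    return log_a

def viterbi(log_emissions, log_a):
    """Most likely state sequence (uniform initial distribution assumed)."""
    n = len(log_a)
    score = list(log_emissions[0])
    backptrs = []
    for frame in log_emissions[1:]:
        new_score, ptr = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: score[i] + log_a[i][j])
            ptr.append(best)
            new_score.append(score[best] + log_a[best][j] + frame[j])
        score = new_score
        backptrs.append(ptr)
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def frame_error_rate(predicted, reference):
    """Fraction of frames labelled differently from the reference."""
    errors = sum(1 for p, r in zip(predicted, reference) if p != r)
    return errors / len(reference)

if __name__ == "__main__":
    # State 0 = no vocals, state 1 = vocals; toy feature values with two outliers.
    obs = [0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
    models = [(0.0, 0.25), (1.0, 0.25)]  # (mean, variance) per state, made up
    log_em = [[gaussian_logpdf(x, m, v) for m, v in models] for x in obs]
    path = viterbi(log_em, transition_log_probs([20.0, 20.0]))
    reference = [0] * 6 + [1] * 6
    print(path, frame_error_rate(path, reference))
```

With long average durations the transition penalty outweighs a single outlier frame, so the Viterbi path ignores isolated glitches that a frame-by-frame decision would follow.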

12
Results

Table, figures
Listen!
Good, bad
trigger stick
genre effects?

13
Results
14

E = .075
P(h) in effect

E = .68
P(h) gone bad

16
m,n
uw
ey

E = .61
Strong phones trigger, but can't hold it
Production quality effect?

17
s

E = .25
Trigger and Stick

18
bcl,dcl,b, d
l,r

E = .54
False phones

E = .20
Genre effect?

20
Discussion

The Moral of the Story: Just give it the data
PPF is better than cepstra. The speech recognizer is pretty powerful.
Why does the extra Gaussian model help PPF but not cepstra?
Time averaging helps PPF: proves that it's using the overall distribution, not short-time detail (at least, when modelled by single Gaussians)
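
To make the time-averaging point concrete, here is a small sketch, assuming a single scalar feature per frame, a centred moving-average window, and one 1-D Gaussian per class fitted by maximum likelihood. The window length, the variance floor, and all function names are illustrative choices and do not come from the talk.

```python
import math

def moving_average(values, window):
    """Centred moving average; the window shrinks at the edges of the signal."""
    half = window // 2
    out = []
    for t in range(len(values)):
        lo, hi = max(0, t - half), min(len(values), t + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def fit_gaussian(samples):
    """Maximum-likelihood mean and variance, with a small variance floor."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, max(var, 1e-6)

def log_likelihood(x, mean, var):
    """Log density of x under a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_likelihood_ratio(x, vocal_model, music_model):
    """Positive when x is better explained by the vocal model."""
    return log_likelihood(x, *vocal_model) - log_likelihood(x, *music_model)

if __name__ == "__main__":
    # Made-up training values for a feature such as per-frame entropy.
    vocal_train = [1.0, 1.4, 0.6, 1.2, 0.8, 1.1, 0.9]
    music_train = [2.0, 2.6, 1.5, 2.2, 1.8, 2.4, 1.9]
    vocal_model = fit_gaussian(moving_average(vocal_train, 3))
    music_model = fit_gaussian(moving_average(music_train, 3))
    test_frames = moving_average([1.0, 2.3, 0.9, 1.1, 2.1, 2.0, 2.2], 3)
    print([log_likelihood_ratio(x, vocal_model, music_model) for x in test_frames])
```

Averaging discards frame-level detail and keeps only the local mean of the feature, so if classification improves after averaging, the Gaussian models are relying on the overall distribution and not on short-time structure.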
