Course presentation: Speech Recognition

1
Robust Methods for Automatic Transcription and
Alignment of Speech Signals
  • Course presentation: Speech Recognition
  • Leif Grönqvist (leifg@ling.gu.se)
  • Växjö University (Mathematics and Systems
    Engineering)
  • GSLT (Graduate School of Language Technology)
  • Göteborg University (Department of Linguistics)

2
Introduction: GSLC
  • GSLC (Göteborg Spoken Language Corpus): a
    multimodal corpus
  • Video and/or audio recording
  • GTS (Göteborg Transcription Standard)
  • Overlaps on word level, background information,
    and comments relevant for interaction
  • MSO (Modified Standard Orthography)
  • Closer to speech than written language
  • NOT phonetic
  • Keeps possibilities to compare to written
    language
  • Designed for studies of natural speech in various
    activities
  • 25 social activity types, 200 hours, 360
    recordings, 1.3 million running words
  • Recordings and transcriptions are aligned only for
    a few recordings, and not word by word

3
Transcription example
  • L heej
  • G heej
  • L heej hur haru haft det i veckan < / >
  • @ <event: applause>
  • G jättebra jag tycker inte att du är ett
    svart hål
  • L tycker 4 du inte det va bra 4
  • V 4 joe kolla lungan 4
  • G jag tycker att du e0 5 blå å0 (...) 5 jo
    det kan du väl ändå 6 tycka tycker ja 6
  • L 5 ja tycker inte att du e0 en röd stjärna 5
  • L 6 nä 6
  • V vad tycker 7 ni själva rå1 7
  • L 7 röd ja 7 stjärna ja
  • G kan du öh sluta avbryta 8 oss vi håller 8
  • V 8 vad e0 ni vad tycker ni själva att
    ni 8 e0
  • L (...)
  • G va

4
MultiTool
  • Prerelease 0.7
  • Browsing, searching, coding, counting
  • Easy navigation through recordings
  • Search in transcription, partiture, media file,
    or time scale
  • Only manual alignment
  • Partial alignment of specific events would help a
    lot!

6
What can speech technology do for MultiTool?
  • A lot of research I didn't know about
  • Question: should we use the transcription or not?
  • Yes: automatic forced alignment on word level
  • No: speech recognition and alignment
  • Yes, find the time for:
  • Utterance start and end points
  • Non speech annotations (coughing, whispering,
    click, loud, high pitch, glottalization, etc) and
    silent sections
  • Easy-to-recognize speech sounds or words
  • Find out if two utterances are uttered by the
    same person
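
One way to look for silent sections, and from them candidate utterance start and end points, is a simple energy threshold. The following is a minimal sketch only: the function names, the 10 ms frame length, the threshold, and the 200 ms minimum duration are all assumptions made for illustration, and a fixed threshold is unlikely to be enough for noisy recordings.

```python
import math

def frame_rms(samples, frame_len):
    """Root-mean-square energy of consecutive, non-overlapping frames."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def silent_sections(samples, sample_rate, frame_ms=10, threshold=0.01, min_ms=200):
    """Return (start_s, end_s) pairs for runs of low-energy frames.

    All parameter defaults are illustrative assumptions, not tuned values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    quiet = [energy < threshold for energy in frame_rms(samples, frame_len)]
    sections, run_start = [], None
    # The trailing False acts as a sentinel that closes a run at the end.
    for i, is_quiet in enumerate(quiet + [False]):
        if is_quiet and run_start is None:
            run_start = i
        elif not is_quiet and run_start is not None:
            if (i - run_start) * frame_ms >= min_ms:
                sections.append((run_start * frame_ms / 1000, i * frame_ms / 1000))
            run_start = None
    return sections
```

The end of one silent section and the start of the next then bracket a stretch of sound that may correspond to one or more utterances.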

7
Challenging task
  • Speech recognition and alignment work best with
    high-quality sound signals
  • Recordings of spontaneous speech in natural
    situations have some unwanted properties
  • Long distance between microphone and speaker
  • Many speakers in the same signal
  • Overlapped speech
  • Unlimited vocabulary
  • Whatever you call it: disfluencies, repairs,
    repetitions, deletions, fragmentary speech
  • Various background noise
  • Will any of the existing methods work here?

8
Existing research
  • The Production of Speech Corpora (Schiel et al.):
    fully automatic methods with usable results
  • Segmentation into words, given a known vocabulary
    and not very spontaneous speech
  • Markup of prosodic features
  • Time alignment of phonemes and probabilistic
    pronunciation rules give word alignment

9
Research, cont.
  • Sentence boundary tagging (Stolcke & Shriberg
    1996)
  • Probabilities for boundaries between words
  • HMM with Viterbi decoding
  • POS tags improve the results
  • Good sound quality
  • Interesting, but sentences are not utterances
  • Inter-word event tagging (Stolcke et al. 1998)
  • Events are disfluencies in general
  • Input is forced alignment and acoustic features
  • Not directly usable, but a similar model and
    acoustic features may be useful for other events
    as well
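
The "HMM with Viterbi" idea behind boundary tagging can be illustrated with a toy model. This is a sketch under stated assumptions, not the model from the cited papers: the two states (B = a boundary follows the word, N = no boundary), the single "pause"/"none" cue per word, and every probability below are hypothetical.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most probable state sequence for a discrete HMM (log probabilities).

    Expects a non-empty observation sequence.
    """
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def logs(table):
    """Convert a dict of probabilities to log probabilities."""
    return {key: math.log(p) for key, p in table.items()}

# Hypothetical toy model; all numbers are invented for illustration.
STATES = ("N", "B")
LOG_START = logs({"N": 0.8, "B": 0.2})
LOG_TRANS = {"N": logs({"N": 0.8, "B": 0.2}), "B": logs({"N": 0.95, "B": 0.05})}
LOG_EMIT = {
    "N": logs({"pause": 0.1, "none": 0.9}),
    "B": logs({"pause": 0.8, "none": 0.2}),
}

cues = ["none", "none", "pause", "none", "pause"]  # one cue per word
print(viterbi(cues, STATES, LOG_START, LOG_TRANS, LOG_EMIT))
```

In a real tagger the emission scores would come from word n-grams, POS tags, or acoustic features rather than a single pause cue.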

10
HMM-based segmentation and alignment
  • Find the most probable alignment for a sequence
    of words
  • Sjölander (2003) describes an interesting system
  • Very interesting!
  • Reports correct alignment for 85.5% of boundaries
    within 20 ms
  • Will it work on noisy signals?
  • A result of, say, 5% would be very useful
  • I have tried to get the system
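
A figure such as "boundaries within 20 ms" can be computed by comparing automatically placed boundaries with a manually aligned reference. The sketch below assumes both lists hold times in seconds and are already paired one to one; the function name and this pairing are assumptions, and the evaluation procedure in Sjölander (2003) may differ.

```python
def boundary_accuracy(reference, predicted, tolerance=0.020):
    """Fraction of predicted boundaries within `tolerance` seconds of the reference.

    Assumes reference[i] and predicted[i] describe the same boundary.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("at least one boundary is required")
    hits = sum(abs(r - p) <= tolerance for r, p in zip(reference, predicted))
    return hits / len(reference)
```

For example, with reference boundaries at 0.10, 0.50, 1.00 and 1.40 s and predictions at 0.11, 0.53, 1.00 and 1.39 s, three of the four boundaries fall within 20 ms.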

11
Related tasks
  • Intensity discrimination
  • Easy to measure
  • Useful as indicator for phoneme changes, etc.
  • Voicing Determination and Fundamental Frequency
  • Many methods: cepstrum, probabilities based on
    weighted features
  • Voicing patterns could give good hints when
    specific words occur.
  • Glottalization and impulse detection
  • Intensity and sudden f0 decrease could be used
  • Glottalization is marked in the transcription!
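
As an illustration of voicing determination and f0 estimation, here is a minimal autocorrelation sketch. Note that this is a different technique from the cepstrum method named above, chosen only because it is short. The function name, the 75 to 400 Hz search range, and the voicing threshold of 0.5 are assumptions, and no windowing or octave-error handling is included.

```python
def estimate_f0(frame, sample_rate, fmin=75.0, fmax=400.0, voicing_threshold=0.5):
    """Return an f0 estimate in Hz, or None if the frame looks unvoiced.

    Picks the lag with the highest normalized autocorrelation inside the
    search range. All defaults are illustrative assumptions.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return None  # digital silence: nothing to measure
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_corr = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    if best_lag is None or best_corr < voicing_threshold:
        return None  # periodicity too weak: treat as unvoiced
    return sample_rate / best_lag
```

Running this frame by frame gives a voiced/unvoiced pattern and an f0 track; a sudden drop in that track, together with a drop in intensity, is the kind of cue the glottalization bullet refers to.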

12
Robust alignment
  • How could the algorithm used by Sjölander be
    revised for more robustness?
  • f0 (voicing) and glottalization detection, in
    addition to ordinary probabilities for phonemes,
    could help
  • Problem: the speech models will not give
    probabilities for phonemes in simultaneous speech
  • Problem 2: GSLC does not contain phonetic
    transcription
  • Would training on letters work?
  • My guess: this will not work well enough
  • Better approach: identify things that could be
    recognized, since word-by-word alignment is not
    necessary

13
Conclusion
  • First thing to try: Sjölander's aligner
  • Second: spoken event tagger
  • Identify events that could be recognized
  • Identify useful acoustic features
  • Could, for example, a decision tree help to
    recognize the events?
  • Lots of tests and experiments will be needed if
    the forced alignment doesn't give usable results
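
To make the decision tree question concrete, the smallest possible tree is a single threshold on one acoustic feature (a "decision stump"). The sketch below is purely illustrative: the function name is hypothetical, and the feature could be anything measurable per frame, such as an intensity drop. A real experiment would use several features and a proper tree learner.

```python
def fit_stump(samples):
    """Fit a one-level decision tree on (feature_value, is_event) pairs.

    Returns (threshold, error_count) for the rule "event if value >= threshold",
    or None if there are fewer than two distinct feature values.
    """
    values = sorted({value for value, _ in samples})
    # Candidate thresholds lie midway between neighbouring feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in candidates:
        errors = sum((value >= threshold) != label for value, label in samples)
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best
```

The error count on held-out recordings would show whether a feature separates the event from everything else at all, before any larger model is trained.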

14
The End!
  • Thank you for listening
  • ??? !!!