1 RUNDKAST: An Annotated Norwegian Broadcast News Speech Corpus
- LREC 2008
- Ingunn Amdal, Ole Morten Strand, Jørn Almberg, and Torbjørn Svendsen
2 Overview
- Purpose of Rundkast
- An overview of the database Rundkast
- Structure of annotation
- Orthographic transcription
- Broad phonetic annotation
3 Purpose of Rundkast
- Databases of broadcast news can be used for a number of research topics in speech technology, such as:
- Supplement to existing databases of read speech for training and testing automatic speech recognition and speaker adaptation.
- Research on recognition of spontaneous speech.
- Research on automatic indexing of audio data.
- Research on topic and/or speaker segmentation.
- Research on speech/non-speech detection (e.g. background music).
- International research cooperation involving speech technology for broadcast news applications.
- A corpus of this kind is necessary for language technology research, but has not been available for Norwegian.
4 Overview of Rundkast
http://www.iet.ntnu.no/projects/rundkast/
- Database of 77 hours of radio broadcast news from the Norwegian Broadcasting Corporation (NRK)
- Read and spontaneous speech, as well as spontaneous dialogs and multi-party discussions
- There is large variation between speakers, speaking styles and topics
- Speaker turns may be rapid and several speakers may talk simultaneously
- Recording quality ranges from studio to telephone (mobile, satellite, etc.)
- Frequent occurrences of background noise, jingles, music and audio illustrations
- Funded by the Norwegian University of Science and Technology (NTNU)
5 Structure of annotation
- Rundkast is hierarchically organized and orthographically annotated
- Name of programme, type and date
- Name of speaker (if known) and dialect (5 regions)
- Type of speech: spontaneity, channel, recording quality
- Segmented into speaker turns of approx. 2-5 seconds
- Orthographic transcription (standard Norwegian)
- Labels for noise (speaker noise, background noise etc.)
- Labels for pronunciation mistakes, foreign words, unintelligible speech etc.
- 70 hours of work per hour of recording
- Transcriber, a standard tool, was used for annotation
6 Hierarchy of annotation levels
- Levels: (1) section, (2) speaker turn, and (3) segment
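
The three levels can be pictured as nested containers. The following is a minimal sketch only; the class names, field names and toy data are assumptions made for illustration and do not reflect the actual Transcriber file format or the corpus contents.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """Level 3: a short stretch of speech with its orthographic transcription."""
    start: float  # seconds from the start of the recording
    end: float
    text: str


@dataclass
class SpeakerTurn:
    """Level 2: the speech of one speaker, split into segments."""
    speaker: Optional[str]  # None when the speaker is unknown
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Section:
    """Level 1: a section of a programme, holding speaker turns."""
    label: str
    turns: List[SpeakerTurn] = field(default_factory=list)


def iter_segments(sections):
    """Yield (section label, speaker, segment) for every segment, in order."""
    for section in sections:
        for turn in section.turns:
            for segment in turn.segments:
                yield section.label, turn.speaker, segment


# Invented toy data, not taken from the corpus.
demo = [
    Section("report", [
        SpeakerTurn("speaker_a", [Segment(0.0, 2.5, "god morgen"),
                                  Segment(2.5, 5.0, "her er nyhetene")]),
        SpeakerTurn(None, [Segment(5.0, 8.0, "takk")]),
    ])
]
rows = list(iter_segments(demo))
```

Walking the hierarchy top-down like this keeps the section and speaker information attached to every segment, which is what the higher annotation levels are for.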
7 Orthographic transcription
- The lowest level in the annotation hierarchy, the segment, is transcribed orthographically.
- Orthographic transcription of spoken language is a challenge, especially for Norwegian. Using dialect is more and more accepted, also in official circumstances.
- The majority of RUNDKAST does not conform to any standard pronunciation.
- The aim of the conventions for the orthographic transcription in RUNDKAST is to minimize uncertainty about pronunciations and facilitate consistency.
8 Orthographic transcription: Main conventions
- Words are transcribed with the written forms closest to actual pronunciations. A limited number of interjections are allowed.
- Text codes are used to mark mispronunciations, truncations, and unknown words.
- Numbers and symbols are written out as words.
- Abbreviations are not used.
- Punctuation marks are restricted to comma, period, and question mark.
- Space is used between spelled letters, also when acronyms have spelled pronunciation.
- Capital letters are used in proper names, spellings, and acronyms, but not at the start of sentences.
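
Two of the conventions above (numbers written out as words, punctuation restricted to comma, period and question mark) lend themselves to a mechanical check. The sketch below is an illustration under assumptions: the function name is hypothetical, it expects text from which any text codes have already been removed (the format of those codes is not given here), and it treats every other non-letter character, including hyphens, as disallowed.

```python
ALLOWED_PUNCTUATION = set(",.?")


def check_segment(text):
    """Return a list of (position, character, reason) for convention violations."""
    problems = []
    for pos, ch in enumerate(text):
        if ch.isdigit():
            # Numbers should be written out as words.
            problems.append((pos, ch, "number not written out"))
        elif ch.isalpha() or ch.isspace() or ch in ALLOWED_PUNCTUATION:
            # Letters (including æ, ø, å), whitespace and , . ? are fine.
            continue
        else:
            problems.append((pos, ch, "punctuation not allowed"))
    return problems


ok = check_segment("klokka er ti, tror jeg.")
bad = check_segment("klokka er 10!")
```

Conventions such as capitalization of proper names cannot be checked this simply, since they depend on knowing which words are names.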
9 Example annotation in Transcriber
10 Broad phonetic annotation
- Part of the data was to be phonetically annotated
- Use for low-level experiments in ASR (new methods), a smaller Norwegian counterpart to TIMIT
- Auto-segmentation for e.g. unit selection TTS
- Annotation to be based on existing standards with necessary adjustments
- Exploit experience and specifications from development of Norwegian speech synthesis databases
- Suitable level of detail: acoustic boundaries should be labeled, but more phonemic than phonetic
- Consistency of utmost importance!
11 Broad phonetic annotation: Selected data
- 10 speakers (5 male and 5 female)
- Amount of speech per speaker:
- approx. 5 min planned speech and 1 min spontaneous speech
- discard noisy parts (as far as possible)
- from more than one programme
- use turn segmentation from orthographic annotation
- All in all 1 hour of speech
- Approximately 1000 hours of work
12 Broad phonetic annotation: Main principles
- The annotation is mainly phonemic, using the phoneme symbols closest to the perceived sound
- Acoustic boundaries should be marked; some acoustically motivated symbols are included
- A transcription as close as possible to the citation form is preferred
- Norwegian standard SAMPA is preferred
- Some English phonemes are included, as well as dialect variants
- Example: 3 variants of the /r/-sound: /r/ (tap/trill), /R/ (uvular fricative), /r\/ (approximant)
13 Broad phonetic annotation: Annotation procedure
- Conversion of orthographic transcription to a format suitable for automatic transcription.
- Automatic segmentation with a phonotypical transcription using a speech recognizer.
- Manual correction of both segments and labels by four phonetics students using Praat.
- Format check.
- Check of all annotation by one supervisor.
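
What the format check covers is not specified here. As a hedged illustration, a check of this kind might verify that the phoneme intervals are contiguous and that every label belongs to the agreed symbol set. In the sketch below the function name, the in-memory (start, end, label) representation and the toy inventory are assumptions; only the three /r/ symbols are taken from the principles above.

```python
def check_tier(intervals, allowed_labels, tolerance=1e-6):
    """Check (start, end, label) intervals; return a list of error messages."""
    errors = []
    previous_end = None
    for index, (start, end, label) in enumerate(intervals):
        if end <= start:
            errors.append(f"interval {index}: end {end} not after start {start}")
        if previous_end is not None and abs(start - previous_end) > tolerance:
            # A gap or overlap between neighbouring intervals.
            errors.append(
                f"interval {index}: starts at {start}, previous ended at {previous_end}"
            )
        if label not in allowed_labels:
            errors.append(f"interval {index}: unknown label {label!r}")
        previous_end = end
    return errors


# Toy inventory: the /r/ variants plus two placeholder symbols (hypothetical).
TOY_INVENTORY = {"r", "R", "r\\", "a", "t"}

good = [(0.0, 0.1, "t"), (0.1, 0.25, "a"), (0.25, 0.3, "r\\")]
bad = [(0.0, 0.1, "t"), (0.12, 0.2, "x")]
good_errors = check_tier(good, TOY_INVENTORY)
bad_errors = check_tier(bad, TOY_INVENTORY)
```

Running such a check before the supervisor's pass would leave the supervisor free to concentrate on judgments that cannot be automated.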
14 Broad phonetic annotation: Comments on deviations
- There are always cases of uncertainty, so a log is needed for these.
- Problem: will the log be read?
- Solution: codes for deviations!
- Additional Praat tier for deviations
- Synchronous with the phoneme tier
- Easy to utilize automatically
- Examples:
- creaky voice
- unexpected voiced/unvoiced
- uncertain boundary or symbol
- In addition, a log file for any remaining deviations
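
Because the deviation tier shares its boundaries with the phoneme tier, the two can be read in lockstep. The sketch below shows one possible automatic use: keeping only the phonemes that carry no deviation code. The function name, the (start, end, label) representation and the code string "creaky" are assumptions for illustration; the actual codes are not listed here.

```python
def clean_phonemes(phoneme_tier, deviation_tier):
    """Return phoneme intervals whose deviation label is empty.

    Both tiers are lists of (start, end, label) with identical boundaries,
    which is the assumption that makes a simple zip sufficient.
    """
    if len(phoneme_tier) != len(deviation_tier):
        raise ValueError("tiers are not synchronous: different number of intervals")
    kept = []
    for (p_start, p_end, phone), (d_start, d_end, code) in zip(phoneme_tier, deviation_tier):
        if (p_start, p_end) != (d_start, d_end):
            raise ValueError(f"boundary mismatch at {p_start}-{p_end}")
        if code == "":
            kept.append((p_start, p_end, phone))
    return kept


# Invented toy tiers; "creaky" is a hypothetical deviation code.
phonemes = [(0.0, 0.1, "t"), (0.1, 0.25, "a"), (0.25, 0.3, "R")]
deviations = [(0.0, 0.1, ""), (0.1, 0.25, "creaky"), (0.25, 0.3, "")]
kept = clean_phonemes(phonemes, deviations)
```

The same pairing could equally be used the other way round, for example to collect all intervals marked with a given code for closer study.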
15 Example annotation in Praat
16 Concluding remarks
- Availability
- Planned to be included for non-commercial use in a future Norwegian language bank
- Will complement other corpora also intended to be included
- To be validated by Spex
- Planned use at NTNU: SIRKUS project
- Investigation of new paradigms for ASR
- Low-level phone recognition experiments initially
- Multilingual aspects
- Spoken information retrieval