1 Budapest University of Technology and Economics (BME), Dept. of Telecommunications and Media Informatics (TMIT)
Integration of Supra-segmental Features into Speech Synthesis and Speech Recognition in SPEECH LABORATORIES
Klára VICSI
vicsi_at_tmit.bme.hu, Web: http://alpha.ttt.bme.hu/speech
INTERNATIONAL COST2102 WORKSHOP, VIETRI 2007
2 Topics
- Short introduction of the Speech Laboratories of BME TMIT
- Integration of prosodic features into speech recognition in the case of fixed-stress languages, like Hungarian
- Perception of emotion in Hungarian speech
- Emotion in speech synthesis for Hungarian
3 Speech Laboratories of the Dept. of Telecommunications and Media Informatics (TMIT)
- Speech Technology Lab (STL): head Németh Géza PhD; 3 research workers, 3 engineers, 2 PhD students; speech synthesis
- Telecommunications Signal Processing Lab (TSP): head Tatai Péter MSc; 2 research workers, 4 engineers, 1 PhD student; ASR through telecomm. systems
- Laboratory of Speech Acoustics (LSA): head Vicsi Klára D.Sc.; 3 research workers, 1 engineer, 3 PhD students; ASR for PCs, database collection
Education: 306 students in the Speech Information Systems subject (2006)
4 Speech Technology Laboratory
Head: Dr. Géza Németh (http://speechlab.ttt.bme.hu/postnuke/index.php)
Multilingual Text-to-Speech Synthesis
- MULTIVOX: formant synthesis text-to-speech system, multilingual, prosody modelling, supporting 10 languages (1986-1996)
- PROFIVOX: waveform, diphone/triphone based TTS development environment, scalable from mobile phones to PCs, multilingual, for Hungarian, German, Polish and Spanish (1994- )
- PROFIBOX: corpus-based, domain-specific TTS technology (ready for Hungarian weather forecasts; work on prompt generation and railway timetables) (2004- )
- Emotion expression in speech (2002- )
- Speaking styles in speech synthesis (e.g. book-reading, news, etc., 2000- )
- Text analysis and processing for multilingual speech synthesis (e.g. language detection, accent regeneration, ..., 1998- )
5 Computer Telephony Integration (CTI), Dialogue Systems
- Hungarian fully automatic medicine information dialogue system (telephone, WAP and Web interfaces, www.gyogyszervonal.hu, 2004-6)
- Hungarian SMS-reading system, the phone name and address reader (together with AITIA Corp., M.I.T. Systems Ltd., T-Com Hungary, T-Mobile Hungary, 2004-5)
- World's 1st Symbian mobile phone based SMS-reader (together with M.I.T. Systems Ltd., English product name SMSrapper, 2003- )
- 1st Hungarian VoiceXML browser (based on OpenVXI, 2002-3)
Applications of speech synthesis for the Disabled and the Elderly
- Speaking systems included in screen readers for the blind (e.g. Hungarian Jaws for Windows, Magic for Windows, Pac-Mate, etc.), in speaking aids for the speech impaired and as aids for speech perception problems (e.g. aphasia, dyslexia...) (1984- )
6 Laboratory of Speech Acoustics
Head: Dr. Klára Vicsi (http://alpha.tmit.bme.hu/speech/)
Speech recognition
- Continuous speech recognizers: a development tool (MKBF) under Windows XP for constructing a middle-sized speech recognizer with a vocabulary of 1,000-20,000 words. The system is based on a statistical approach (HMM phoneme models and bigram word/morpheme language models) and works in real time. New solutions for acoustical pre-processing, for model building of phonemes, and for the language model (2002- )
- Speech Recognizer for Preparing Medical Reports (2004-2005)
- A prosodic recognizer has been developed. With this recognizer, a cross-lingual study was prepared for agglutinative, fixed-stress languages such as Hungarian and Finnish, on the word-level segmentation of continuous speech by examination of supra-segmental parameters. A word-level segmentation aligner has been developed which can indicate the word boundaries (2002- )
- Speaker-independent, robust isolated-word speech recognizer (1989- )
- Examination of paralingual features of speech: emotion, crying, laughter (2005- )
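As an illustration of the bigram language models mentioned on this slide, the following is a minimal sketch of a maximum-likelihood bigram estimate. The function name `bigram_probability`, the sentence markers and the toy corpus are assumptions made for illustration; the slides do not describe how the MKBF language models are actually trained or smoothed.

```python
from collections import Counter
from typing import List


def bigram_probability(corpus: List[List[str]], prev: str, word: str) -> float:
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev).

    Sentences are padded with <s> and </s> markers (an assumption of this sketch).
    No smoothing is applied, so unseen bigrams get probability 0.
    """
    history = Counter()
    pairs = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            history[a] += 1
            pairs[(a, b)] += 1
    return pairs[(prev, word)] / history[prev] if history[prev] else 0.0


# Toy corpus, illustration only.
toy_corpus = [["jó", "napot"], ["jó", "estét"], ["jó", "napot", "kívánok"]]
p = bigram_probability(toy_corpus, "jó", "napot")
print(p)
```

For an agglutinative language the same counting could be done over morphemes instead of words, which is presumably why the slide mentions word/morpheme models.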
7 Database collection
- BABEL: free field; read speech, continuous, sound connections, words; multi-level segmentation; 100 speakers; ELRA, EU INCO-COPERNICUS project (1995-99)
- SPEECHDAT_E: fixed network; read speech, continuous, words; orthography, noise annotation; 2000 speakers (1999-2000)
- MTBA (SPEECHDAT-like): GSM, fixed network; continuous speech, words; orthography, noise annotation, phoneme segmentation and labelling; 500 speakers (2002-2004)
- MRBA: 2-channel high-bandwidth, office environment; continuous, read speech; orthography, noise annotation, phoneme segmentation and labelling; 500 speakers (2004-2005)
- BROADCAST NEWS: audio-visual recordings; reports, news, etc.; 3 hours; orthography, noise and music annotation with pronunciation variation (2002-2004)
- SPECO: children's speech; high-bandwidth application; orthography, noise annotation, phoneme segmentation (1998-2001)
- http://alpha.tmit.bme.hu/speech/databases.php
Speech processing for the speech handicapped
- A multilingual pronunciation teaching and training method within the EU Copernicus program, entitled SPECO (1998-2001)
- Interactive hearing and speech perception therapy through the Internet (2003)
Statistical examination of the Hungarian language
8 Telecommunications Signal Processing Laboratory (http://www.tmit.bme.hu/labgroup!hun)
Head: Péter Tatai
Presently the TSP Lab research activities are concentrated on telecommunication testing (traffic and protocol analysis, mobile services) and on the following areas.
Speech quality measurement
- Subjective testing tools: absolute and comparison tests
- Objective testing of mobile and VoIP channels
Speech processing for telecommunication
- Improving the robustness of the recognizer front end
- Word spotting for indexing large speech databases
- Speech recognizer for portable (PDA) devices
The results of the research are incorporated into products (such as voice controlled call centers, voice portals, location based services, etc.) and distributed by Aitia International, Inc.
9 Integration of Prosodic Features into Speech Recognition in Case of Fixed Stress Languages, like Hungarian
György Szaszák, Klára Vicsi
Laboratory of Speech Acoustics
10 The aim: to increase the robustness of speech recognition by detecting word boundaries (word boundary segmenter) and, by integrating this segmenter into a speech recognizer, to decrease the search space during the decoding process. This search space reduction is especially important in the case of agglutinative languages.
11 Linguistic Description of the Finno-Ugrian Language Family
- These languages are highly agglutinative.
- The number of different word forms is in the hundreds of millions.
- Word forms are composed of an oblique stem and suffixes. In addition, suffixes influence the form of the stem in many cases.
- Words are characterized by a longer average word length than in English, and by a relatively free word order.
- Due to this, almost all words have some stress (stronger or slighter, depending on the syntactic structure), normally on the first syllable (fixed stress), except in the case of conjunctions or articles.
12 Word Level Prosodic Segmentation
- The peaks of energy and fundamental frequency mark the first syllables of the words well.
[Figure: fundamental frequency and energy levels measured at the middle of vowels, and duration of the vowels in the syllables, in the Hungarian sentence "titkArul sErz2tete O f2konzul lAnyAt". The syllable sequence numbers are shown on the X axis.]
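The observation on this slide can be illustrated with a minimal sketch, assuming one F0 value and one energy value per syllable, measured at the vowel midpoint. The function name `stress_candidates` and all numbers are hypothetical. The talk itself uses HMM-based segmentation; this sketch only shows simple local-peak picking as a stand-in.

```python
from typing import List


def stress_candidates(f0: List[float], energy: List[float]) -> List[int]:
    """Return syllable indices where both F0 and energy are local maxima.

    A value is a local maximum if it is strictly higher than its immediate
    neighbours; the first and last syllables have only one neighbour.
    Both lists are assumed to have the same length.
    """
    def is_peak(values: List[float], i: int) -> bool:
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i < len(values) - 1 else float("-inf")
        return values[i] > left and values[i] > right

    return [i for i in range(len(f0)) if is_peak(f0, i) and is_peak(energy, i)]


# Hypothetical per-syllable measurements (Hz and dB), illustration only.
f0_hz = [210.0, 180.0, 170.0, 205.0, 175.0, 160.0, 190.0, 150.0]
energy_db = [72.0, 66.0, 64.0, 70.0, 65.0, 63.0, 68.0, 60.0]
candidates = stress_candidates(f0_hz, energy_db)
print(candidates)  # [0, 3, 6]
```

In a fixed-stress language the returned indices would be read as candidate word-initial syllables.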
13 Database Processing for HMM Provided Word Level Prosodic Segmentation
6 different HMM prosodic models were constructed, representing 6 types of intonation curve: fall, rise, floating, rise-fall and jumping intonation curves, and silence.
[Figure: training examples for the different models, labelled fall, silence, fall, floating, jumping]
14 Word Level Prosodic Segmentation Versus Expert's Hand Segmentation on a Passage of 3 Hungarian Sentences
[Figure: "s" denotes word boundaries, "k" phrase boundaries and "sil" silence]
Exactness: 80%
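The slide does not define how exactness is computed. One plausible reading, shown here purely as an assumption, is the fraction of hand-labelled boundaries that have an automatically detected boundary within a small time tolerance. The function name `boundary_agreement`, the 50 ms tolerance and the boundary times are all hypothetical.

```python
from typing import List


def boundary_agreement(predicted: List[float], reference: List[float],
                       tolerance: float = 0.05) -> float:
    """Fraction of reference boundaries matched by a predicted boundary.

    A reference boundary is matched if the closest still-unused predicted
    boundary lies within `tolerance` seconds. Each predicted boundary can
    match at most one reference boundary.
    """
    unused = sorted(predicted)
    hits = 0
    for ref in sorted(reference):
        best = min(unused, key=lambda p: abs(p - ref), default=None)
        if best is not None and abs(best - ref) <= tolerance:
            hits += 1
            unused.remove(best)
    return hits / len(reference) if reference else 0.0


# Hypothetical boundary times in seconds, illustration only.
hand = [0.40, 0.95, 1.50, 2.10, 2.80]
auto = [0.42, 0.92, 1.70, 2.12, 2.79]
agreement = boundary_agreement(auto, hand)
print(agreement)
```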
15 Integration of Prosodic Segmenter into Speech Recognition
[Block diagram]
TRADITIONAL RECOGNIZER
- Segmental preprocessing: 12 MFCC + E + 1st and 2nd order deltas
- 37 phoneme HMM models
- Bigram language model (the bigram weights can even be set to an equal value)
- Output: N-best lattice of word or word-chain candidates
PROSODIC SEGMENTER
- Supra-segmental preprocessing: F0 + E + 1st and 2nd order deltas
- 7 prosodic HMM models, sentence model
- Output: predicted boundaries
Rescored lattice, FINAL RECOGNITION
The principle of rescoring is to reward word or word-chain candidates whose boundaries match the prosodic segmentation.
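The rescoring principle on this slide can be sketched as follows. This is a simplified illustration over an N-best list rather than a full lattice, and the `Candidate` type, the `rescore` function, the additive bonus and the tolerance are all assumptions; the slides do not give the actual scoring formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    words: List[str]
    boundaries: List[float]  # word boundary times in seconds
    score: float             # first-pass log score, higher is better


def rescore(cands: List[Candidate], prosodic: List[float],
            bonus: float = 1.0, tolerance: float = 0.05) -> List[Candidate]:
    """Reward candidates whose word boundaries match predicted boundaries.

    Every word boundary lying within `tolerance` seconds of some predicted
    prosodic boundary adds `bonus` to the candidate's score. The result is
    sorted with the best candidate first.
    """
    rescored = []
    for c in cands:
        matches = sum(
            any(abs(b - p) <= tolerance for p in prosodic) for b in c.boundaries
        )
        rescored.append(Candidate(c.words, c.boundaries, c.score + bonus * matches))
    return sorted(rescored, key=lambda c: c.score, reverse=True)


# Hypothetical N-best list and predicted boundaries, illustration only.
nbest = [
    Candidate(["a", "b"], [0.30, 0.90], -10.0),
    Candidate(["c", "d"], [0.52, 0.91], -10.5),
]
best = rescore(nbest, prosodic=[0.50, 0.90])[0]
print(best.words, best.score)
```

Here the second candidate overtakes the first because both of its boundaries agree with the prosodic segmenter, even though its first-pass score was lower.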
16 Conclusion
Word error rate decreased by 4%. By allowing communication between the prosodic segmenter and the first-pass speech recognizer modules (i.e. by using phoneme, and hence syllable, alignment information in the prosodic segmenter), this error can be reduced further.
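For reference, word error rate is conventionally computed as the word-level edit distance between the reference and the recognized word sequence, divided by the number of reference words. The sketch below shows that standard computation; the function name `word_error_rate` and the example sentences are illustrative and not taken from the experiments reported here.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference).

    Uses the standard Levenshtein dynamic programme over words.
    The reference is assumed to be non-empty.
    """
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all reference words
    for j in range(cols):
        dist[0][j] = j  # insert all hypothesis words
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[-1][-1] / len(reference)


# Illustrative sentences: one substitution and one deletion in six words.
ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
wer = word_error_rate(ref, hyp)
print(wer)
```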
17 Perception of Emotion in Hungarian Speech
Klára Vicsi, Tóth Szabolcs Levente
Laboratory of Speech Acoustics
18 Listening tests
7 emotions: Happiness, Sadness, Anger, Surprise, Scorn (disgust), Fear, Nervous-excited
Test 1 (ACTORS): 3 professional actors spoke 3 sentences (without emotional meaning) with the 7 emotions.
Test 2 (HUMAN): 8 colleagues from the lab
a. spoke 3 sentences (without emotional meaning) with the 7 emotions
b. spoke 3 sentences (without emotional meaning) with the 7 emotions, each preceded by a neutral sentence
Test 3 (TV): samples from TV talk shows
a. 4 samples of each emotion (32 sentences)
b. 4 samples of each emotion (32 sentences), each preceded by a neutral sentence
19 Listened to by 20 students
[Figure panels: Test 1, Test 2, Test 3]
20 Listened to by 20 students
21 Emotion in Speech Synthesis for Hungarian
- Márk Fék, Csaba Zainkó, Gábor Olaszy, and Géza
Németh - fek,zainko,olaszy,nemeth_at_tmit.bme.hu
- Speech Technology Laboratory
22 Settings of the first Experiment
- 2 actors (1 male, 1 female)
- 7 emotions (surprise, anger, sadness, scorn, fear, relief, happiness) + neutral
- 2 sentences
- Copying the prosody curves (F0, sound durations, intensity) of emotional samples to neutral (natural) samples
- Voice quality is not modified
- Listening test: 20 listeners, forced choice (1 emotion of 7)
23 Results of the first Experiment
[Figure panels: anger, surprise, fear]
24 Settings of the second Experiment
- 1 actress
- 6 basic emotions (surprise, anger, sadness, fear, disgust, happiness) + neutral
- 1 sentence
- Copying the prosody curves (F0, sound durations, intensity) of emotional samples to synthesized speech (PROFIBOX)
- 7 different databases with 7 voice qualities corresponding to the emotions
- Listening test: 80 listeners, forced choice (1 emotion of 7)
25 Results of the second Experiment
[Figure: recognition ratio (in %) per emotion]
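A recognition ratio per emotion in a forced-choice listening test can be computed as the percentage of responses in which the listener chose the intended emotion. The sketch below assumes responses are stored as (intended, chosen) pairs; the function name `recognition_ratio` and the votes are hypothetical and do not reproduce the experimental results.

```python
from collections import Counter
from typing import Dict, List, Tuple


def recognition_ratio(responses: List[Tuple[str, str]]) -> Dict[str, float]:
    """Percentage of correct choices for each intended emotion.

    `responses` holds (intended emotion, emotion chosen by the listener).
    """
    total = Counter(intended for intended, _ in responses)
    correct = Counter(
        intended for intended, chosen in responses if intended == chosen
    )
    return {emotion: 100.0 * correct[emotion] / total[emotion] for emotion in total}


# Hypothetical forced-choice votes, illustration only.
votes = [
    ("anger", "anger"), ("anger", "anger"), ("anger", "fear"), ("anger", "anger"),
    ("fear", "fear"), ("fear", "sadness"), ("fear", "surprise"), ("fear", "fear"),
]
ratios = recognition_ratio(votes)
print(ratios)
```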