Nincs diacm - PowerPoint PPT Presentation

About This Presentation

Title:

Nincs diacm

Slides: 28

Provided by: erdohegy

Transcript and Presenter's Notes

Title: Nincs diacm

1
Budapest University of Technology and Economics (BME)
Dept. of Telecommunications and Media Informatics (TMIT)
Integration of Supra-segmental Features into Speech Synthesis and Speech Recognition in SPEECH LABORATORIES
Klara VICSI, vicsi_at_tmit.bme.hu
Web: http://alpha.ttt.bme.hu/speech
INTERNATIONAL COST2102 WORKSHOP, VIETRI 2007
2

Topics
Short introduction of the Speech Laboratories of BME TMIT
Integration of prosodic features into speech recognition in the case of fixed-stress languages, like Hungarian
Perception of emotion in Hungarian speech
Emotion in speech synthesis for Hungarian

3
Speech Laboratories of the Dept. of Telecommunications & Media Informatics (TMIT)
Speech Technology Lab (STL): head Németh Géza PhD; 3 research workers, 3 engineers, 2 PhD students; speech synthesis
Telecommunications Signal Processing Lab (TSP): head Tatai Péter MSc; 2 research workers, 4 engineers, 1 PhD student; ASR through telecomm. systems
Laboratory of Speech Acoustics (LSA): head Vicsi Klára D.Sc.; 3 research workers, 1 engineer, 3 PhD students; ASR for PCs, database collection
Education: 306 students in the Speech Information Systems subject (2006)
4
Speech Technology Laboratory
Head: Dr. Géza Németh (http://speechlab.ttt.bme.hu/postnuke/index.php)
Multilingual Text-to-Speech Synthesis
MULTIVOX: formant synthesis text-to-speech system, multilingual, prosody modelling, supporting 10 languages (1986-1996)
PROFIVOX: waveform, diphone/triphone based TTS development environment, scalable from mobile phones to PCs, multilingual, for Hungarian, German, Polish and Spanish (1994- )
PROFIBOX: corpus-based domain-specific TTS technology (ready for Hungarian weather forecasts; work on prompt generation and railway timetables) (2004- )
Emotion expression in speech (2002- )
Speaking styles in speech synthesis (e.g. book-reading, news, etc., 2000- )
Text analysis and processing for multilingual speech synthesis (e.g. language detection, accent regeneration, ..., 1998- )
5
Computer Telephony Integration (CTI), Dialogue Systems
Hungarian fully automatic medicine information dialogue system (telephone, WAP and Web interfaces, www.gyogyszervonal.hu) (2004-6)
Hungarian SMS-reading system, the phone name and address reader (together with AITIA Corp., M.I.T. Systems Ltd., T-Com Hungary, T-Mobile Hungary, 2004-5)
World's 1st Symbian mobile phone based SMS-reader (together with M.I.T. Systems Ltd., English product name SMSrapper) (2003- )
1st Hungarian VoiceXML browser (based on OpenVXI, 2002-3)
Applications of speech synthesis for the Disabled and the Elderly
Speaking systems included in screen readers for the blind (e.g. Hungarian Jaws for Windows, Magic for Windows, Pac-Mate, etc.), in speaking aids for the speech impaired and as aids for speech perception problems (e.g. aphasia, dyslexia...) (1984-)
6
Laboratory of Speech Acoustics
Head: Dr. Klára Vicsi (http://alpha.tmit.bme.hu/speech/)
Speech recognition
Continuous speech recognizers: a development tool (MKBF) under Windows XP to construct a middle-sized speech recognizer with a vocabulary of 1,000-20,000 words. The system is based on a statistical approach (HMM phoneme models and bigram word/morpheme language models) and works in real time. New solutions for acoustical pre-processing, for model building of phonemes, and for the language model (2002- )
Speech Recognizer for Preparing Medical Reports (2004-2005)
A prosodic recognizer has been developed. A cross-lingual study of agglutinative, fixed-stress languages like Hungarian and Finnish was carried out with this recognizer, on the word-level segmentation of continuous speech by examination of supra-segmental parameters. A word-level segmentation aligner has been developed which can indicate the word boundaries (2002- ).
Speaker-independent isolated-word robust speech recognizer (1989- )
Examination of paralingual features of speech: emotion, crying, laughter (2005- )
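As an illustration of the bigram language models mentioned on this slide, here is a minimal sketch of how bigram probabilities can be estimated from text. It assumes whitespace-separated tokens (words or morphemes) and add-one smoothing. The function name `train_bigram` and the sample sentences are hypothetical, and nothing here reflects how the MKBF tool itself is implemented.

```python
from collections import Counter


def train_bigram(sentences):
    """Estimate add-one smoothed bigram probabilities from tokenized text.

    Returns (prob, vocab), where prob(prev, word) approximates P(word | prev).
    Hypothetical helper for illustration only.
    """
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        # Every token except the final </s> acts as a history.
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(prev, word):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))

    return prob, vocab


# Hypothetical toy corpus (not from the presentation).
prob, vocab = train_bigram(["open the door", "open the window"])
print(prob("the", "door"))
```

For an agglutinative language the tokens would more plausibly be morphemes than full word forms, which is presumably why the slide lists word/morpheme models.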
7
Database collection
BABEL: free field, read speech, continuous, sound connections, words, multi-level segmentation, 100 speakers. ELRA, EU INCO-COPERNICUS project (1995-99).
SPEECHDAT_E: fixed network, read speech, continuous, words, orthography, noise annotation, 2000 speakers (1999-2000).
MTBA (SPEECHDAT-like): GSM, fixed network, continuous speech, words, orthography, noise annotation, phoneme segmentation and labelling, 500 speakers (2002-2004).
MRBA: 2-channel high-bandwidth, office environment, continuous, read speech, orthography, noise annotation, phoneme segmentation and labelling, 500 speakers (2004-2005).
BROADCAST NEWS: audio-visual recordings, reports, news, etc., 3 hours, orthography, noise and music annotation with pronunciation variation (2002-2004).
SPECO: children's speech, high-bandwidth application, orthography, noise annotation, phoneme segmentation (1998-2001).
http://alpha.tmit.bme.hu/speech/databases.php

Speech processing for speech handicapped
A multilingual pronunciation teaching and training method within the EU Copernicus program, entitled SPECO (1998-2001).
Interactive hearing and speech perception therapy through the Internet (2003).
Statistical examination of the Hungarian language
8
Telecommunications Signal Processing Laboratory (http://www.tmit.bme.hu/labgroup!hun)
Head: Péter Tatai
Presently the TSP Lab research activities are concentrated on telecommunication testing (traffic and protocol analysis, mobile services) and speech quality measurement.
Speech quality measurement: subjective testing tools (absolute and comparison tests); objective testing of mobile and VoIP channels.
Speech processing for telecommunication: improving the robustness of the recognizer front end; word spotting for indexing large speech databases; speech recognizer for portable (PDA) devices.
The results of the research are incorporated into products (such as voice-controlled call centers, voice portals, location-based services, etc.) and distributed by Aitia International, Inc.
9
Budapest University of Technology and Economics (BME)
Dept. of Telecommunications and Media Informatics (TMIT)
Integration of Prosodic Features into Speech Recognition in Case of Fixed Stress Languages, like Hungarian
György Szaszák - Klára Vicsi
Laboratory of Speech Acoustics
10
The aim: to increase the robustness of speech recognition by detecting word boundaries with a word boundary segmenter and, by integrating this segmenter into a speech recognizer, to decrease the search space during the decoding process.
This search space reduction is especially important in the case of agglutinative languages.
11
Linguistic Description of the Finno-Ugrian Language Family
These languages are highly agglutinative.
The number of different word forms is in the hundreds of millions.
Word forms are composed of an oblique stem and suffixes. In addition, suffixes influence the form of the stem in many cases.
Words are characterized by a longer average word length than in English, and by a relatively free word order.
Due to this, almost all words carry some stress (stronger or weaker depending on the syntactic structure), normally on the first syllable (fixed stress), except for conjunctions and articles.
12

Word Level Prosodic Segmentation
The peaks of energy and fundamental frequency mark the first syllables of the words well.

Figure: fundamental frequency and energy levels measured at the middle of the vowels, and the duration of the vowels in the syllables, in a Hungarian sentence titkArul sErz2tete O f2konzul lAnyAt. The syllable numbers are shown on the X axis.
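The observation on this slide (first syllables coincide with peaks of energy and fundamental frequency) can be illustrated with a simple peak-picking step. This is only a minimal sketch: it assumes per-syllable F0 and energy values have already been measured at the vowel midpoints, the function names are hypothetical, and the numbers are invented rather than read off the figure. The segmenter described in the following slides is HMM based, not a peak picker.

```python
from typing import List, Sequence


def local_peaks(values: Sequence[float]) -> List[int]:
    """Return indices whose value is strictly greater than both neighbours.

    The first and last items are compared with their single neighbour only.
    """
    peaks = []
    for i, v in enumerate(values):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i < len(values) - 1 else float("-inf")
        if v > left and v > right:
            peaks.append(i)
    return peaks


def stressed_syllable_candidates(f0: Sequence[float], energy: Sequence[float]) -> List[int]:
    """Syllables where both F0 and energy peak: candidate word-initial syllables."""
    return sorted(set(local_peaks(f0)) & set(local_peaks(energy)))


# Hypothetical per-syllable measurements for a 7-syllable utterance.
f0_hz = [180, 150, 140, 170, 145, 160, 130]
energy_db = [72, 65, 63, 70, 64, 68, 60]
candidates = stressed_syllable_candidates(f0_hz, energy_db)
print(candidates)
```

Requiring both cues to peak is a deliberately conservative choice for the illustration; a real system would weigh the cues statistically.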
13
Database Processing for HMM Provided Word Level Prosodic Segmentation
6 different HMM prosodic models were constructed, representing 6 types of intonation curve: fall, rise, floating, rise-fall and jumping intonation curves, and silence.
Figure: training examples for different models (panel labels: fall, silent, fall, floating, jumping)
14
Word Level Prosodic Segmentation Versus Expert Hand Segmentation on a Passage of 3 Hungarian Sentences
s denotes word boundaries, k phrase boundaries, and sil silence.
Exactness: 80%
15
Integration of Prosodic Segmenter into Speech Recognition
TRADITIONAL RECOGNIZER
Segmental preprocessing: 12 MFCC + E + 1st and 2nd order deltas
37 phoneme HMM models
Bigram language model (the bigram weights may even be set to an equal value)
Output: N-best lattice of word or word-chain candidates
PROSODIC SEGMENTER
Supra-segmental preprocessing: F0 + E + 1st and 2nd order deltas
7 prosodic HMM models, sentence model
Output: predicted boundaries
Rescored lattice: FINAL RECOGNITION
The principle of rescoring is to reward word or word-chain candidates whose boundaries match the prosodic segmentation.
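The rescoring principle on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on a flat N-best list rather than a lattice, adds a fixed bonus per matching boundary, and uses a 50 ms tolerance. The `Candidate` class, the `bonus` and `tolerance` parameters, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    words: List[str]
    boundaries: List[float]  # word start times in seconds
    score: float             # first-pass log score (higher is better)


def count_matches(boundaries: Sequence[float],
                  prosodic: Sequence[float],
                  tolerance: float) -> int:
    """Count word boundaries lying within `tolerance` of a prosodic boundary."""
    return sum(
        1 for b in boundaries
        if any(abs(b - p) <= tolerance for p in prosodic)
    )


def rescore(candidates: Sequence[Candidate],
            prosodic: Sequence[float],
            bonus: float = 1.0,
            tolerance: float = 0.05) -> List[Candidate]:
    """Sort candidates by first-pass score plus a bonus per matched boundary."""
    def new_score(c: Candidate) -> float:
        return c.score + bonus * count_matches(c.boundaries, prosodic, tolerance)
    return sorted(candidates, key=new_score, reverse=True)


# Hypothetical example: the second candidate has the lower first-pass score
# but agrees with all three predicted prosodic boundaries.
prosodic_boundaries = [0.0, 0.42, 0.90]
n_best = [
    Candidate(words=["w1", "w2"], boundaries=[0.0, 0.60], score=-10.0),
    Candidate(words=["w1", "w3", "w4"], boundaries=[0.0, 0.40, 0.88], score=-10.5),
]
ranked = rescore(n_best, prosodic_boundaries)
print(ranked[0].words)
```

In this toy case the second candidate overtakes the first after rescoring, which is the effect the slide describes: boundary agreement compensates for a slightly worse first-pass score.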
16
Conclusion
Word error rate decreased by 4%, but by allowing communication between the prosodic segmenter and the first-pass speech recognizer modules (i.e. by using phoneme, and hence syllable, alignment information in the prosodic segmenter), this error can be further reduced.
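For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference, divided by the number of reference words. A minimal sketch of that standard computation follows; the function name and sample strings are hypothetical, and the slide does not say which scoring tool the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over four words.
wer = word_error_rate("a b c d", "a x c")
print(wer)
```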
17
Budapest University of Technology and Economics (BME)
Dept. of Telecommunications and Media Informatics (TMIT)
Perception of Emotion in Hungarian Speech
Klara Vicsi, Tóth Szabolcs Levente
Laboratory of Speech Acoustics
18
Listening tests
7 emotions: Happiness, Sadness, Anger, Surprise, Scorn (disgust), Fear, Nervous-excited
Test 1. 3 professional actors (ACTORS) spoke 3 sentences (without emotional meaning) with the 7 emotions.
Test 2. 8 colleagues from the lab (HUMAN):
a. spoke 3 sentences (without emotional meaning) with the 7 emotions
b. spoke 3 sentences (without emotional meaning) with the 7 emotions, each preceded by a neutral sentence
Test 3. From a TV talk show (TV):
a. 4 samples of each emotion (32 sentences)
b. 4 samples of each emotion (32 sentences), each preceded by a neutral sentence
19
Listened to by 20 students
Figure: results for Test 1, Test 2, Test 3
20
Listened to by 20 students
21
Budapest University of Technology and Economics (BME)
Dept. of Telecommunications and Media Informatics (TMIT)
Emotion in Speech Synthesis for Hungarian

Márk Fék, Csaba Zainkó, Gábor Olaszy, and Géza Németh
fek,zainko,olaszy,nemeth_at_tmit.bme.hu
Speech Technology Laboratory

22
Settings of the first Experiment

2 actors (1 male, 1 female)
7 emotions (surprise, anger, sadness, scorn, fear, relief, happiness) + neutral
2 sentences
Copying the prosody curves (F0, sound durations, intensity) of emotional samples to neutral (natural) samples
Voice quality is not modified
Listening test: 20 listeners, forced choice (1 emotion of 7)
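Results of a forced-choice listening test like this one are usually summarized as a confusion matrix (intended emotion versus chosen emotion) and a per-emotion recognition rate. The sketch below shows one way to tabulate them. The function names and the sample responses are hypothetical and do not reproduce any result from the experiment.

```python
from collections import Counter, defaultdict

EMOTIONS = ["surprise", "anger", "sadness", "scorn", "fear", "relief", "happiness"]


def confusion_matrix(responses):
    """Tally (intended, chosen) pairs into a nested count table."""
    matrix = defaultdict(Counter)
    for intended, chosen in responses:
        if intended not in EMOTIONS or chosen not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {intended!r}, {chosen!r}")
        matrix[intended][chosen] += 1
    return matrix


def recognition_rates(matrix):
    """Fraction of responses that matched the intended emotion, per emotion."""
    return {
        intended: row[intended] / sum(row.values())
        for intended, row in matrix.items()
    }


# Hypothetical listener responses as (intended emotion, chosen emotion).
responses = [
    ("anger", "anger"), ("anger", "anger"), ("anger", "anger"), ("anger", "scorn"),
    ("fear", "fear"), ("fear", "sadness"),
]
rates = recognition_rates(confusion_matrix(responses))
print(rates)
```

With 7 answer options, chance level in such a test is 1/7 (about 14%), which is the baseline against which recognition rates are read.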

23
Results of the first Experiment
Figure: results shown for anger, surprise, fear
24
Settings of the second Experiment

1 actress
6 basic emotions (surprise, anger, sadness, fear, disgust, happiness) + neutral
1 sentence
Copying the prosody curves (F0, sound durations, intensity) of emotional samples to synthesized speech (PROFIBOX)
7 different databases with 7 voice qualities corresponding to the emotions
Listening test: 80 listeners, forced choice (1 emotion of 7)