Create Photo-Realistic Talking Face

Slides: 36

Provided by: chang

Learn more at: http://www.cs.cmu.edu

1
Create Photo-Realistic Talking Face

Changbo Hu
2001.11.26
This work was done while visiting Microsoft
Research China, with Baining Guo and Bo Zhang

2
Outline

Introduction to talking faces
Motivations
System overview
Techniques
Conclusions

3
Introduction

What is a talking face
Face (lip) animation, driven by voice
Applications
The talking-face process
Face model
Motion capture
Mapping between audio and video
Rendering
Photo-realistic?

4
Literature

Waters, 93: DECface, 2D wireframe model
Terzopoulos, 95: Skin and muscle model
Bregler, 97: Video Rewrite, sample-image based
T. S. Huang, 98: Mesh model from range data
Poggio, 98: MikeTalk, viseme morphing
Guenter, 99: Making Faces, 3D from multiple cameras
Zhengyou Zhang, 00: 3D face modeling from video through epipolar constraint
Cosatto, 00: Planar quads model
5
Some Face models
6
Motivations

Aim: a graphical interface for a conversational agent
Photo-realistic
Driven by Chinese speech
Smooth connection between sentences
Extended from Video Rewrite

7
System overview: Pipeline of the system (1)
8
System overview: Pipeline of the system (2)
New text
TTS system
Wav sound
Segmentation
Triphone sequence
Train database
Synthesized triphone sequence
Lip motion sequence
Background sequence
Rewrite to faces
9
Techniques

Analysis
Audio process
Image process
Synthesis
Lip image
Background image
Stitch together

10
Audio part: Sound Segmentation

Given the wav file and the script
Train the segmentation system using HMMs
Segment the wav file into a phoneme sequence
Example of the segmentation result

SILOPEN 0 23  SILOPEN 24 42  s 43 61  if4 62 74  j 75 80
ia1 81 97  sh 98 109  ang1 110 121  y 122 130  e4 131 133
y 134 145  in2 146 154  h 155 164  ang2 165 194
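
The segmentation result is a flat run of label, start, end triples. A minimal sketch of how such a line could be parsed is shown here. The names `Segment` and `parse_segmentation` are hypothetical, and the slide does not state the unit of the start and end values, so they are kept as plain integers.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    label: str  # phoneme label, e.g. "sh" or "ang1"
    start: int  # first index of the segment (unit not stated in the slides)
    end: int    # last index of the segment, inclusive


def parse_segmentation(text: str) -> List[Segment]:
    """Parse 'label start end label start end ...' into Segment records."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected label/start/end triples")
    return [
        Segment(tokens[i], int(tokens[i + 1]), int(tokens[i + 2]))
        for i in range(0, len(tokens), 3)
    ]


example = (
    "SILOPEN 0 23 SILOPEN 24 42 s 43 61 if4 62 74 j 75 80 ia1 81 97 "
    "sh 98 109 ang1 110 121 y 122 130 e4 131 133 y 134 145 in2 146 154 "
    "h 155 164 ang2 165 194"
)
segments = parse_segmentation(example)
for seg in segments:
    print(seg.label, seg.start, seg.end)
```

In this example every segment starts one index after the previous one ends, so the labels cover the utterance without gaps.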
11
Annotation with Phonemes

Use phonemes to annotate video frames
Each phoneme in a sentence corresponds to a short
span of the video sequence

12
Phoneme Distance Analysis

Phoneme and triphone basics
Chinese phonemes vs. English phonemes
Distance metric definitions
Results

13
Phoneme Basics

Phonemes represent the basic elements of speech.
All possible speech can be represented by
combinations of phonemes.
Examples: CH, JH, S, EH, EY, OY, AE, SIL
A triphone is three consecutive phonemes. It
represents not only pronunciation characteristics
but also context information.
Examples: T-IY-P, IY-P-AA, P-AA-T
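
As a small illustration, the triphones of an utterance can be read off with a sliding window of width three over its phoneme sequence. The function name `triphones` is hypothetical; the input sequence is chosen so that it reproduces the three triphones listed on this slide.

```python
from typing import List, Sequence


def triphones(phonemes: Sequence[str]) -> List[str]:
    """Return every run of three consecutive phonemes, joined with '-'."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


result = triphones(["T", "IY", "P", "AA", "T"])
print(result)
```

Each triphone shares two phonemes with its neighbour, which is where the context information comes from.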

14
Chinese Phoneme vs. English

Chinese phonemes fall into two basic groups: Initials
and Finals.
Initials: B, P, M, F, ...
Finals: a3, o1, e2, eng3, iang4, ue5, ...
Each Chinese final has 5 tones: 1, 2, 3, 4, 5.
Different tones: a1, a2, a3, a4, a5.
Chinese finals are not actually basic elements
of speech.
For example: iang1, iao1, uang1, iong1
The Chinese phoneme set is much larger than the English one.

15
Phoneme Distance Analysis

Define the distance between any two phonemes.
Since we synthesize only video, not sound,
tone is ignored
Lip shape motion is the core element of the distance
metric.
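
Ignoring tone can be sketched as dropping the trailing tone digit (1 to 5) from a final before any distance lookup. This assumes the labels follow the notation used in these slides, where the tone is the last character; `strip_tone` is a hypothetical helper name.

```python
import re


def strip_tone(phoneme: str) -> str:
    """Remove a trailing tone digit 1-5, leaving toneless labels unchanged."""
    return re.sub(r"[1-5]$", "", phoneme)


labels = ["sh", "ang1", "ang2", "iang4", "SILOPEN"]
toneless = [strip_tone(p) for p in labels]
print(toneless)
```

With tone removed, ang1 and ang2 map to the same label, so they share one row of the distance matrix.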

16
Phoneme Distance Analysis
Phoneme 1: Video 1, Video 2, Video 3, Video 4 -> Video Average
Phoneme 2: Video 1, Video 2 -> Video Average
Time-align the videos to a uniform length
Average the videos to get an average video
By comparing the aligned average videos of each
pair of phonemes, we generate the distance matrix
for the whole phoneme set.
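
The align, average, and compare steps can be sketched as follows. This is only an illustration under stated assumptions: each video is taken to be a list of per-frame lip-shape feature vectors, time alignment is done by linear interpolation, and the distance between two average videos is the mean Euclidean distance between corresponding frames. The slides do not specify any of these choices, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Frame = List[float]   # assumed: one lip-shape feature vector per frame
Video = List[Frame]


def resample(video: Video, length: int) -> Video:
    """Time-align a video to `length` frames by linear interpolation."""
    if len(video) == 1 or length == 1:
        return [list(video[0]) for _ in range(length)]
    out = []
    for k in range(length):
        pos = k * (len(video) - 1) / (length - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(video) - 1)
        t = pos - lo
        out.append([(1 - t) * a + t * b for a, b in zip(video[lo], video[hi])])
    return out


def average_video(videos: List[Video], length: int) -> Video:
    """Align all videos to a uniform length and average them frame by frame."""
    aligned = [resample(v, length) for v in videos]
    dims = len(aligned[0][0])
    return [
        [sum(v[k][d] for v in aligned) / len(aligned) for d in range(dims)]
        for k in range(length)
    ]


def video_distance(a: Video, b: Video) -> float:
    """Mean Euclidean distance between corresponding frames."""
    return sum(math.dist(fa, fb) for fa, fb in zip(a, b)) / len(a)


def distance_matrix(samples: Dict[str, List[Video]],
                    length: int) -> Dict[Tuple[str, str], float]:
    """Distance between the average videos of every pair of phonemes."""
    avg = {p: average_video(vs, length) for p, vs in samples.items()}
    return {(p, q): video_distance(avg[p], avg[q]) for p in avg for q in avg}


# Toy data with one-dimensional "lip shapes".
samples = {
    "a": [[[0.0], [2.0]], [[0.0], [1.0], [2.0]]],
    "b": [[[1.0], [3.0]]],
}
matrix = distance_matrix(samples, length=3)
print(matrix[("a", "b")])
```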
17
Image part: Pose Tracking

Assume a plane model for the face
Standard minimization method to find the transform
matrix (affine transform) (Black, 95)
A mask is used to restrict tracking to the part of
the face of interest

Template Picture
Mask Image
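
To make the role of the mask concrete, here is a sketch of the kind of objective such a tracker could minimize: the sum of squared differences between the template and the affinely warped image, counted only where the mask is set. This is a plain least-squares stand-in and not the method of the cited work; nearest-neighbour sampling, the zero value used outside the image, and the names `affine` and `masked_ssd` are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]


def affine(params: Sequence[float], x: float, y: float) -> Tuple[float, float]:
    """Apply a 2D affine transform with parameters (a, b, tx, c, d, ty)."""
    a, b, tx, c, d, ty = params
    return a * x + b * y + tx, c * x + d * y + ty


def masked_ssd(template: Image, image: Image, mask: List[List[int]],
               params: Sequence[float]) -> float:
    """Sum of squared differences over the masked template pixels."""
    total = 0.0
    for y, row in enumerate(template):
        for x, t in enumerate(row):
            if not mask[y][x]:
                continue  # pixel lies outside the region of interest
            u, v = affine(params, x, y)
            ui, vi = int(round(u)), int(round(v))
            inside = 0 <= vi < len(image) and 0 <= ui < len(image[0])
            value = image[vi][ui] if inside else 0.0  # assumed: black outside
            total += (value - t) ** 2
    return total


template = [[1.0, 2.0], [3.0, 4.0]]
image = [[0.0, 1.0, 2.0], [0.0, 3.0, 4.0]]   # template shifted right by one
mask = [[1, 1], [1, 1]]

identity = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0)
shift_x = (1.0, 0.0, 1.0, 0.0, 1.0, 0.0)
print(masked_ssd(template, image, mask, identity))
print(masked_ssd(template, image, mask, shift_x))
```

A tracker would search over the six parameters for the lowest value; in this toy case the one-pixel shift gives a perfect match.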
18
Pose tracking

Motion prediction using parameters with physical
meaning

19
Pose Tracking

Some tracking results

20
Lip Motion Tracking

Using Eigen Points (Covell, 91)
Feature points include jaw, lips, and teeth
Training database specified manually
Auto tracking through all pose-tracked images

21
Lip motion tracking
22
Lip Motion Tracking
Train Database (hand-labeled)
Auto Tracking Results
23
Synthesizing new sentences

New text is converted to wav by the TTS system
The wav is segmented into a phoneme sequence
Use DP to find an optimal video sequence in
the training database
Time-align the triphone videos and stitch them
together.
Transform the lip sequence and paste it onto
the background faces.

24
Lip sequence synthesis
Figure labels: New phoneme sequences; Optimal phoneme sequences;
Triphone 1 to Triphone 9, Triphone A to Triphone C
25
Dynamic Programming
Figure labels: Begin; Triphone1 to Triphone5; End
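
A minimal sketch of this kind of dynamic programming search follows, assuming that each position of the new sentence has a list of candidate triphone clips from the training database, that `node_cost` scores how well a candidate fits its position, and that `edge_cost` scores the join between consecutive candidates. The split into node and edge costs, the function name `best_path`, and the toy numbers are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def best_path(candidates: Sequence[Sequence[str]],
              node_cost: Callable[[int, str], float],
              edge_cost: Callable[[str, str], float]) -> Tuple[float, List[str]]:
    """Pick one candidate per position so that the total cost is minimal."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate")
    # cost[j]: cheapest total cost of any path ending at candidate j
    cost = [node_cost(0, c) for c in candidates[0]]
    back: List[List[int]] = []
    for i in range(1, len(candidates)):
        prev = candidates[i - 1]
        new_cost, pointers = [], []
        for c in candidates[i]:
            j = min(range(len(prev)), key=lambda k: cost[k] + edge_cost(prev[k], c))
            new_cost.append(cost[j] + edge_cost(prev[j], c) + node_cost(i, c))
            pointers.append(j)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path back from End to Begin.
    j = min(range(len(cost)), key=cost.__getitem__)
    total = cost[j]
    path = [candidates[-1][j]]
    for i in range(len(candidates) - 1, 0, -1):
        j = back[i - 1][j]
        path.append(candidates[i - 1][j])
    path.reverse()
    return total, path


# Toy costs: picking the cheapest first clip (A1) leads to a worse total.
node = {"A1": 0.0, "A2": 1.0, "B1": 2.0, "B2": 0.0}
edge = {("A1", "B1"): 0.0, ("A1", "B2"): 5.0,
        ("A2", "B1"): 0.0, ("A2", "B2"): 0.0}
total, path = best_path([["A1", "A2"], ["B1", "B2"]],
                        lambda i, c: node[c],
                        lambda a, b: edge[(a, b)])
print(total, path)
```

The toy example shows why a global search is used: a greedy choice of A1 costs at least 2.0 overall, while the path through A2 costs 1.0.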
26
Edge Cost Definition

Two parts:
Phoneme distance: the three phoneme distances added
together
Lip shape distance over the overlapping portion of
the triphone videos
The two parts are combined as a weighted sum
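
The two-part cost can be sketched as below. The slide gives no weights and no formula for the lip shape distance, so the weights `w_phoneme` and `w_lip`, the use of mean Euclidean distance over the overlapping frames, and the toy phoneme distance are all assumptions; the function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Frame = List[float]  # assumed: lip-shape feature vector of one video frame


def phoneme_cost(target: Sequence[str], candidate: Sequence[str],
                 dist: Callable[[str, str], float]) -> float:
    """Add up the distances of the three phoneme pairs of two triphones."""
    return sum(dist(t, c) for t, c in zip(target, candidate))


def lip_overlap_cost(prev_tail: List[Frame], next_head: List[Frame]) -> float:
    """Mean lip-shape distance over the frames where two clips overlap."""
    pairs = list(zip(prev_tail, next_head))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def edge_cost(target: Sequence[str], candidate: Sequence[str],
              dist: Callable[[str, str], float],
              prev_tail: List[Frame], next_head: List[Frame],
              w_phoneme: float = 1.0, w_lip: float = 0.5) -> float:
    """Weighted sum of the phoneme part and the lip shape part."""
    return (w_phoneme * phoneme_cost(target, candidate, dist)
            + w_lip * lip_overlap_cost(prev_tail, next_head))


def toy_dist(a: str, b: str) -> float:
    # Placeholder for a lookup in the phoneme distance matrix.
    return 0.0 if a == b else 1.0


cost = edge_cost(("sh", "ang", "y"), ("s", "ang", "y"), toy_dist,
                 prev_tail=[[0.0, 0.0], [1.0, 0.0]],
                 next_head=[[3.0, 4.0], [1.0, 0.0]])
print(cost)
```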

27
Background video generation

The background is a video sequence of the virtual
character speaking something else
Similarity measurement of the background
Select a standard frame:
the frame with the maximal number of frames similar
to it
Filter out jerky frames
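
Selecting the standard frame can be sketched as counting, for each frame, how many other frames lie within a similarity threshold, and keeping the frame with the highest count. The feature vector per frame, the Euclidean distance, and the `threshold` value are assumptions; the slides do not say how similarity is measured. `select_standard_frame` is a hypothetical name.

```python
import math
from typing import List

Frame = List[float]  # assumed: per-frame feature vector, e.g. pose parameters


def select_standard_frame(frames: List[Frame], threshold: float) -> int:
    """Return the index of the frame with the most similar frames."""
    best_index, best_count = 0, -1
    for i, fi in enumerate(frames):
        count = sum(
            1 for j, fj in enumerate(frames)
            if j != i and math.dist(fi, fj) <= threshold
        )
        if count > best_count:
            best_index, best_count = i, count
    return best_index


frames = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0]]
standard = select_standard_frame(frames, threshold=0.15)
print(standard)
```

Frames far from the standard frame, such as the last one in the toy data, are natural candidates for filtering.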

29
Stitch the time-aligned result onto background faces

Write back with a mask
Transform the synthesized lips onto the background
face
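
The write-back step can be sketched as a per-pixel blend, assuming the synthesized lip image has already been transformed into the coordinates of the background frame and that the mask holds values between 0 (keep background) and 1 (take lips). Grayscale images as nested lists and the name `write_back` are choices made for this example only.

```python
from typing import List

Image = List[List[float]]


def write_back(background: Image, lips: Image, mask: Image) -> Image:
    """Blend the lip image into the background wherever the mask is set."""
    if not (len(background) == len(lips) == len(mask)):
        raise ValueError("images and mask must have the same height")
    return [
        [m * l + (1.0 - m) * b for b, l, m in zip(brow, lrow, mrow)]
        for brow, lrow, mrow in zip(background, lips, mask)
    ]


background = [[10.0, 10.0], [10.0, 10.0]]
lips = [[200.0, 200.0], [200.0, 200.0]]
mask = [[0.0, 1.0], [0.5, 0.0]]
result = write_back(background, lips, mask)
print(result)
```

Fractional mask values near the border of the mouth region give a soft transition between the pasted lips and the original frame.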

30
Mask image for write-back operation
Original background frame
Write-back result of the same frame
31
More video results
32
More video results
33
Conclusion and Future Work