Create Photo-Realistic Talking Face

About This Presentation
Provided by: chang
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes


1
Create Photo-Realistic Talking Face
  • Changbo Hu
  • 2001.11.26
  • This work was done while visiting Microsoft
    Research China, with Baining Guo and Bo Zhang

2
Outline
  • Introduction of talking face
  • Motivations
  • System overview
  • Techniques
  • Conclusions

3
Introduction
  • What is a talking face
  • Face (lip) animation, driven by voice
  • Applications
  • The process of talking face
  • Face model
  • Motion capture
  • Mapping between audio and video
  • Rendering
  • Photo-realistic?

4
Literature
  • Waters, 93, DECface, 2D wire-frame model
  • Terzopoulos, 95, Skin and muscle model
  • Bregler, 97, Video Rewrite, sample-image based
  • T. S. Huang, 98, Mesh model from range data
  • Poggio, 98, MikeTalk, viseme morphing
  • Guenter, 99, Making Faces, 3D from multiple cameras
  • Zhengyou Zhang, 00, 3D face modeling from video
    through the epipolar constraint
  • Cosatto, 00, Planar quads model

5
Some Face models
6
Motivations
  • Aim: a graphical interface for a conversational agent
  • Photo-realistic
  • Driven by Chinese speech
  • Smooth connection between sentences
  • Extended from Video Rewrite

7
System overview: Pipeline of the system (1)
8
System overview: Pipeline of the system (2)
New text
TTS system
Wav sound
Segmentation
Triphone sequence
Train database
Synthesized triphone sequence
Lip motion sequence
Background sequence
Rewrite to faces
9
Techniques
  • Analysis
  • Audio process
  • Image process
  • Synthesis
  • Lip image
  • Background image
  • Stitch together

10
Audio part: Sound Segmentation
  • Given the wav file and the script
  • Use an HMM to train the segmentation system
  • Segment the wav file into a phoneme sequence
  • Example of the segmentation result

SILOPEN   0  23
SILOPEN  24  42
s        43  61
if4      62  74
j        75  80
ia1      81  97
sh       98 109
ang1    110 121
y       122 130
e4      131 133
y       134 145
in2     146 154
h       155 164
ang2    165 194
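
The segmentation result above reads as a flat run of label, start, end triples. A minimal Python sketch of turning such output into structured segments follows; the function name is hypothetical, and the slides do not state the unit of the start and end values, so they are kept as plain integers.

```python
def parse_segmentation(text):
    """Parse whitespace-separated 'label start end' triples.

    Assumption: the segmenter emits exactly three tokens per segment,
    as in the example on the slide. Units of start/end are not given
    in the slides and are left uninterpreted.
    """
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected label/start/end triples")
    segments = []
    for i in range(0, len(tokens), 3):
        label = tokens[i]
        start = int(tokens[i + 1])
        end = int(tokens[i + 2])
        segments.append((label, start, end))
    return segments


# First six segments of the example on the slide.
sample = "SILOPEN 0 23 SILOPEN 24 42 s 43 61 if4 62 74 j 75 80 ia1 81 97"
segments = parse_segmentation(sample)
for label, start, end in segments:
    print(label, start, end)
```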
11
Annotation with Phonemes
  • Use phonemes to annotate video frames
  • Each phoneme in a sentence corresponds to a short
    span of the video sequence
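
One way to picture this annotation step is to give every video frame the label of the phoneme segment that covers the frame's centre. The sketch below is only an illustration: the slides give neither the unit of the segmentation times nor the video frame rate, so `units_per_second=100` and `fps=25` are assumptions, and the function name is hypothetical.

```python
def label_video_frames(segments, units_per_second=100, fps=25):
    """Return one phoneme label per video frame.

    segments: list of (label, start, end) with inclusive integer times.
    units_per_second and fps are assumed values, not taken from the slides.
    Integer arithmetic is used so the result does not depend on float rounding.
    """
    total_units = segments[-1][2] + 1
    n_frames = total_units * fps // units_per_second
    labels = []
    for f in range(n_frames):
        # Time of the frame centre, expressed in segmentation units.
        unit = (2 * f + 1) * units_per_second // (2 * fps)
        for label, start, end in segments:
            if start <= unit <= end:
                labels.append(label)
                break
        else:
            labels.append(None)  # no segment covers this frame
    return labels


frame_labels = label_video_frames([("s", 0, 9), ("a", 10, 19)])
print(frame_labels)
```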

12
Phoneme Distance Analysis
  • Phoneme and triphone basics
  • Chinese phonemes vs. English phonemes
  • Distance metric definitions
  • Results

13
Phoneme Basics
  • Phonemes are the basic elements of speech.
    All possible speech can be represented by
    combinations of phonemes.
  • Examples: CH, JH, S, EH, EY, OY, AE, SIL
  • A triphone is three consecutive phonemes. It
    represents not only pronunciation characteristics
    but also context information.
  • Examples: T-IY-P, IY-P-AA, P-AA-T
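
As a small illustration of the triphone idea, a sliding window of three over a phoneme sequence yields the triphones. The input sequence T, IY, P, AA, T below is an assumption chosen so that the output matches the triphones listed on the slide.

```python
def triphones(phonemes):
    """Return every run of three consecutive phonemes, joined with '-'."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


result = triphones(["T", "IY", "P", "AA", "T"])
print(result)  # ['T-IY-P', 'IY-P-AA', 'P-AA-T']
```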

14
Chinese Phoneme vs. English
  • Chinese phonemes fall into two basic groups:
    initials and finals.
  • Initials: B, P, M, F, ...
  • Finals: a3, o1, e2, eng3, iang4, ue5, ...
  • Each Chinese final has 5 tones: 1, 2, 3, 4, 5.
  • Different tones: a1, a2, a3, a4, a5.
  • Chinese finals are not actually basic elements
    of speech.
  • For example: iang1, iao1, uang1, iong1
  • The Chinese phoneme set is much larger than the
    English one.

15
Phoneme Distance Analysis
  • Define the distance between any two phonemes.
  • Since we synthesize only video, not sound, tone
    is ignored.
  • Lip shape motion is the core element of the
    distance metric.
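
Ignoring tone can be sketched as dropping the trailing tone digit from a final before looking up distances. This is an illustrative helper, not the authors' code; it assumes tones are written as a single trailing digit 1 to 5, as in the examples on the earlier slide.

```python
def strip_tone(phoneme):
    """Drop a trailing tone digit (1-5); labels without one are unchanged."""
    if len(phoneme) > 1 and phoneme[-1] in "12345":
        return phoneme[:-1]
    return phoneme


toneless = [strip_tone(p) for p in ["sh", "ang1", "ang2", "e4", "y"]]
print(toneless)
```

With tone removed, ang1 and ang2 map to the same entry, which shrinks the set of phonemes that need a pairwise distance.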

16
Phoneme Distance Analysis
Phoneme 1: Video 1, Video 2, Video 3, Video 4 -> Video Average
Phoneme 2: Video 1, Video 2 -> Video Average
Time-align the videos to a uniform length
Average the videos to get an average video
By comparing the two aligned average videos, we
generate the distance matrix of the whole
phoneme set.
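
The procedure on this slide (time-align, average, compare) can be sketched as follows. Several details are assumptions, since the slides do not specify them: each video is represented as a list of per-frame lip-shape feature vectors, time alignment is done by linear interpolation, and two average videos are compared by the mean Euclidean distance between corresponding frames. All function names are hypothetical.

```python
import math


def resample(video, length):
    """Linearly resample a list of per-frame feature vectors to `length` frames."""
    if len(video) == 1 or length == 1:
        return [list(video[0]) for _ in range(length)]
    out = []
    for k in range(length):
        pos = k * (len(video) - 1) / (length - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(video) - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(video[lo], video[hi])])
    return out


def average_video(videos, length):
    """Time-align all videos to `length` frames and average them frame by frame."""
    aligned = [resample(v, length) for v in videos]
    dims = len(aligned[0][0])
    return [[sum(v[t][d] for v in aligned) / len(aligned) for d in range(dims)]
            for t in range(length)]


def video_distance(a, b):
    """Mean Euclidean distance between corresponding frames (assumed metric)."""
    return sum(math.dist(fa, fb) for fa, fb in zip(a, b)) / len(a)


def distance_matrix(examples, length=5):
    """examples maps a phoneme to its list of example videos."""
    names = sorted(examples)
    avg = {p: average_video(examples[p], length) for p in names}
    matrix = [[video_distance(avg[p], avg[q]) for q in names] for p in names]
    return names, matrix


# Toy data: one-dimensional "lip opening" per frame.
examples = {
    "a": [[[0.0], [2.0]]],
    "b": [[[1.0], [1.0], [1.0]]],
}
names, matrix = distance_matrix(examples, length=3)
print(names, matrix)
```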
17
Image part: Pose Tracking
  • Assume a planar model for the face
  • Standard minimization method to find the transform
    matrix (affine transform) (Black, 95)
  • A mask is used to constrain tracking to the part of
    the face of interest

Template Picture
Mask Image
18
Pose tracking
  • Motion prediction using parameters with physical
    meaning

19
Pose Tracking
  • Some tracking results

20
Lip Motion Tracking
  • Using Eigen Points (Covell, 91)
  • Feature points include the jaw, lips, and teeth
  • Training database specified manually
  • Auto tracking through all pose-tracked images

21
Lip motion tracking
22
Lip Motion Tracking
Train Database (hand-labeled)
Auto Tracking Results
23
Synthesizing new sentences
  • New text is converted to wav by the TTS system
  • The wav is segmented into a phoneme sequence
  • Use DP (dynamic programming) to find an optimal
    video sequence from the training database
  • Time-align the triphone videos and stitch them
    together.
  • Transform the lip sequence and paste it onto the
    background faces.

24
Lip sequence synthesis
New phoneme sequences
Optimal phoneme sequences
New phoneme sequences
Triphone 1
Triphone 4
Triphone 7
Triphone A
Triphone 2
Triphone 5
Triphone 8
Triphone B
Triphone 3
Triphone 6
Triphone 9
Triphone C
25
Dynamic Programming
Begin
End
Triphone1
Triphone3
Triphone2
Triphone4
Triphone5
26
Edge Cost Definition
  • Two parts
  • Phoneme distance: the 3 phoneme distances added
    together
  • Lip shape distance for the overlap portion of
    the triphone videos
  • The two parts are combined as a weighted sum
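
A minimal sketch of the dynamic-programming search with this two-part cost is given below. It is an illustration under stated assumptions, not the authors' implementation: each target triphone has a list of candidate clips from the database, the phoneme part is charged per chosen clip, the lip-shape part is charged between consecutive clips, and the names `best_path`, `w_phone`, `w_lip`, and the clip fields are all hypothetical.

```python
def triphone_distance(target, candidate, phoneme_dist):
    """Sum of the three per-phoneme distances."""
    return sum(phoneme_dist(t, c) for t, c in zip(target, candidate))


def best_path(targets, candidates, phoneme_dist, lip_dist, w_phone=1.0, w_lip=1.0):
    """Pick one clip per target triphone with minimal total weighted cost.

    candidates[i] is the list of database clips considered for targets[i];
    each clip is a dict with at least a 'triphone' entry.
    """
    # cost[i][j]: cheapest cost of any path ending at candidate j of position i.
    cost = [[w_phone * triphone_distance(targets[0], c["triphone"], phoneme_dist)
             for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[i]:
            node = w_phone * triphone_distance(targets[i], c["triphone"], phoneme_dist)
            steps = [cost[i - 1][k] + w_lip * lip_dist(prev, c)
                     for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(steps)), key=steps.__getitem__)
            row.append(steps[k_best] + node)
            ptr.append(k_best)
        cost.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][j]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


# Toy example: 0/1 phoneme distance, lip distance from scalar boundary shapes.
def toy_phoneme_dist(a, b):
    return 0.0 if a == b else 1.0


def toy_lip_dist(prev, nxt):
    return abs(prev["end_shape"] - nxt["start_shape"])


targets = [("a", "b", "c"), ("b", "c", "d")]
candidates = [
    [{"id": 1, "triphone": ("a", "b", "c"), "start_shape": 0.0, "end_shape": 5.0},
     {"id": 2, "triphone": ("a", "b", "x"), "start_shape": 0.0, "end_shape": 1.0}],
    [{"id": 3, "triphone": ("b", "c", "d"), "start_shape": 1.0, "end_shape": 0.0}],
]
path, total = best_path(targets, candidates, toy_phoneme_dist, toy_lip_dist)
print([clip["id"] for clip in path], total)
```

In the toy data the exact phoneme match (clip 1) loses to a near match (clip 2) because its lip shape joins the next clip far more smoothly, which is the trade-off the weighted sum is meant to control.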

27
Background video generation
  • The background is a video sequence in which the
    virtual character was saying something else
  • Similarity measurement of background
  • Select a standard frame
  • The frame with the maximal number of frames
    similar to it
  • Filter out jerky frames
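
The standard-frame rule ("the frame with the maximal number of frames similar to it") can be sketched as a simple count. The slides do not define the similarity measure, so this example assumes each frame is summarized by a small pose vector and that two frames are similar when their Euclidean distance is within a threshold; the function name and the threshold value are hypothetical.

```python
import math


def select_standard_frame(frames, threshold):
    """Return (index, count) of the frame with the most similar frames.

    frames: list of per-frame pose vectors (assumed representation).
    A frame counts as similar to itself; ties go to the earliest frame.
    """
    counts = [sum(1 for g in frames if math.dist(f, g) <= threshold)
              for f in frames]
    best = max(range(len(frames)), key=counts.__getitem__)
    return best, counts[best]


poses = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
index, count = select_standard_frame(poses, threshold=0.2)
print(index, count)
```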

29
Stitch the time-aligned result to background faces
  • Write back with a mask
  • Transform the synthesized lip to the background
    face
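
Writing back with a mask can be illustrated as a per-pixel blend of the synthesized lip image into the background frame. This sketch assumes grayscale images stored as nested lists, a mask with values between 0 (keep background) and 1 (use lip image), and a lip image that has already been transformed into the background face's pose; none of these details are given in the slides.

```python
def write_back(background, lip, mask):
    """Blend `lip` into `background`: out = mask * lip + (1 - mask) * background."""
    return [[m * l + (1.0 - m) * b for b, l, m in zip(brow, lrow, mrow)]
            for brow, lrow, mrow in zip(background, lip, mask)]


background = [[10.0, 10.0, 10.0]]
lip = [[50.0, 50.0, 50.0]]
mask = [[0.0, 0.5, 1.0]]
blended = write_back(background, lip, mask)
print(blended)
```

Fractional mask values near the boundary give a soft edge, which helps hide the seam between the pasted lip region and the background face.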

30
Mask image for write-back operation
Original background frame
Write-back result of the same frame
31
More video results
32
More video results
33
Conclusion and Future Work
  • Pose tracking and lip motion tracking
  • Size of the training database
  • Talking face with expression
  • Real-time generation?
  • Fast modeling for different people

34
Animation
35
Thank you