Title: Create Photo-Realistic Talking Face
1Create Photo-Realistic Talking Face
- Changbo Hu
- 2001.11.26
- This work was done during a visit to Microsoft
Research China, with Baining Guo and Bo Zhang
2Outline
- Introduction to the talking face
- Motivations
- System overview
- Techniques
- Conclusions
3Introduction
- What is a talking face
- Face (lip) animation, driven by voice
- Applications
- The process of talking face
- Face model
- Motion capture
- Mapping between audio and video
- Rendering
- Photo-realistic?
4Literature
- Waters, 93, DECface, 2D wireframe model
- Terzopoulos, 95, Skin and muscle model
- Bregler, 97, Video Rewrite, Sample-image based
- T. S. Huang, 98, Mesh model from range data
- Poggio, 98, MikeTalk, Viseme morphing
- Guenter, 99, Making Faces, 3D from multiple cameras
- Zhengyou Zhang, 00, 3D face modeling from video through epipolar constraint
- Cosatto, 00, Planar quads model
5Some Face models
6Motivations
- Aim: a graphical interface for a conversational agent
- Photo-realistic
- Driven by Chinese
- Smooth connection between sentences
- Extended from Video Rewrite
7System overview: Pipeline of the system (1)
8System overview: Pipeline of the system (2)
[Pipeline diagram: New text -> TTS system -> Wav sound -> Segmentation -> Triphone sequence -> Synthesized triphone sequence (from the train database) -> Lip motion sequence, together with a Background sequence -> Rewrite to faces]
9Techniques
- Analysis
- Audio process
- Image process
- Synthesis
- Lip image
- Background image
- Stitch together
10Audio part: Sound Segmentation
- Input: the wav file and its script
- An HMM is used to train the segmentation system
- The wav file is segmented into a phoneme sequence
- Example of the segmentation result
SILOPEN 0 23
SILOPEN 24 42
s 43 61
if4 62 74
j 75 80
ia1 81 97
sh 98 109
ang1 110 121
y 122 130
e4 131 133
y 134 145
in2 146 154
h 155 164
ang2 165 194
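A minimal sketch of how a segmentation result like the one above could be read into a program. It assumes the output is a flat run of whitespace-separated "label start end" triples; the name `parse_segmentation` and the `Segment` type are hypothetical, and the units of the start and end values are not stated on the slide. The example string is the slide's result with the numbers that were split by line wrapping rejoined.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    label: str  # phoneme label, e.g. "ang1" or "SILOPEN"
    start: int  # first index of the segment (units as in the segmenter output)
    end: int    # last index of the segment, inclusive


def parse_segmentation(text: str) -> List[Segment]:
    """Parse a flat 'label start end label start end ...' string (assumed format)."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected a multiple of three tokens (label, start, end)")
    return [
        Segment(tokens[i], int(tokens[i + 1]), int(tokens[i + 2]))
        for i in range(0, len(tokens), 3)
    ]


example = (
    "SILOPEN 0 23 SILOPEN 24 42 s 43 61 if4 62 74 j 75 80 ia1 81 97 "
    "sh 98 109 ang1 110 121 y 122 130 e4 131 133 y 134 145 in2 146 154 "
    "h 155 164 ang2 165 194"
)
segments = parse_segmentation(example)
```

Each segment starts one index after the previous one ends, so the list covers the utterance without gaps.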
11Annotation with Phoneme
- Use phonemes to annotate video frames
- Each phoneme in a sentence corresponds to a short
span of the video sequence
12Phoneme Distance Analysis
- Phoneme/triphone basics
- Chinese phonemes vs. English phonemes
- Distance metric definitions
- Results
13Phoneme Basics
- Phonemes are the basic elements of speech.
All possible speech can be represented by a
combination of phonemes.
- Examples: CH, JH, S, EH, EY, OY, AE, SIL
- A triphone is three consecutive phonemes. It
represents not only pronunciation characteristics but
also context information.
- Examples: T-IY-P, IY-P-AA, P-AA-T
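As an illustration of the triphone definition, the sketch below slides a window of three over a phoneme sequence. The function name `triphones` is hypothetical; the input sequence is chosen so that it yields the three triphones listed on the slide.

```python
from typing import List, Tuple


def triphones(phonemes: List[str]) -> List[Tuple[str, ...]]:
    """Return every run of three consecutive phonemes, in order."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


sequence = ["T", "IY", "P", "AA", "T"]
names = ["-".join(t) for t in triphones(sequence)]
# names is ["T-IY-P", "IY-P-AA", "P-AA-T"]
```

A sequence of n phonemes gives n - 2 overlapping triphones, which is why neighbouring triphones share context.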
14Chinese Phoneme vs. English
- Chinese phonemes fall into two basic groups: Initials
and Finals.
- Initials: B, P, M, F, ...
- Finals: a3, o1, e2, eng3, iang4, ue5, ...
- Each Chinese final has 5 tones: 1, 2, 3, 4, 5.
- Different tones: a1, a2, a3, a4, a5.
- Chinese finals are not actually basic elements
of speech.
- For example: iang1, iao1, uang1, iong1
- The Chinese phoneme set is much larger than the English one.
15Phoneme Distance Analysis
- Define the distance between any two phonemes.
- Since we synthesize only video, not sound,
tone is ignored.
- Lip shape motion is the core element of the distance
metric.
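One way to ignore tone when comparing phonemes is to drop the trailing tone digit from a final before looking it up. This is only a sketch: the helper name `strip_tone` is hypothetical, and it assumes labels carry the tone as a final digit from 1 to 5, as in a1 to a5.

```python
def strip_tone(phoneme: str) -> str:
    """Remove a trailing tone digit (1-5), leaving toneless labels unchanged."""
    if len(phoneme) > 1 and phoneme[-1] in "12345":
        return phoneme[:-1]
    return phoneme


toneless = [strip_tone(p) for p in ["a1", "a3", "ang2", "sh", "SILOPEN"]]
# toneless is ["a", "a", "ang", "sh", "SILOPEN"]
```

After this step a1 and a3 map to the same entry, so the distance only has to be defined on the smaller toneless set.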
16Phoneme Distance Analysis
Phoneme 1: Video 1, Video 2, Video 3, Video 4 -> Video Average
- Time-align the videos to a uniform length
- Average the videos to get an average video
Phoneme 2: Video 1, Video 2 -> Video Average
By comparing the two aligned average videos, we
generate the distance matrix of the whole
phoneme set.
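The procedure on this slide can be sketched as follows. Several details are assumptions rather than something the slides state: each frame is represented by a short vector of lip-shape features, time alignment is done by linear interpolation, and two average videos are compared by the mean Euclidean distance between corresponding frames. All function names are hypothetical.

```python
import math
from typing import Dict, List

Frame = List[float]   # lip-shape features of one frame (assumed representation)
Video = List[Frame]


def resample(video: Video, length: int) -> Video:
    """Time-align a video to `length` frames by linear interpolation."""
    if not video or length < 1:
        raise ValueError("need a non-empty video and a positive length")
    if len(video) == 1 or length == 1:
        return [list(video[0]) for _ in range(length)]
    out = []
    for k in range(length):
        pos = k * (len(video) - 1) / (length - 1)   # position in the source video
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(video) - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(video[lo], video[hi])])
    return out


def average_video(videos: List[Video], length: int) -> Video:
    """Average several example videos of one phoneme after alignment."""
    aligned = [resample(v, length) for v in videos]
    dims = len(aligned[0][0])
    return [
        [sum(v[k][d] for v in aligned) / len(aligned) for d in range(dims)]
        for k in range(length)
    ]


def video_distance(a: Video, b: Video) -> float:
    """Mean per-frame Euclidean distance between two aligned videos."""
    return sum(math.dist(fa, fb) for fa, fb in zip(a, b)) / len(a)


def distance_matrix(examples: Dict[str, List[Video]], length: int = 8):
    """Distance between the average videos of every pair of phonemes."""
    avg = {p: average_video(vs, length) for p, vs in examples.items()}
    return {p: {q: video_distance(avg[p], avg[q]) for q in avg} for p in avg}


examples = {
    "a": [[[0.0], [1.0]], [[0.0], [0.5], [1.0]]],   # two example videos
    "o": [[[1.0], [2.0]]],                          # one example video
}
matrix = distance_matrix(examples, length=3)
```

The resulting matrix is symmetric with zeros on the diagonal, so only one triangle needs to be stored.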
17Image part: Pose Tracking
- Assume a planar model for the face
- Standard minimization method to find the transform
matrix (affine transform) [Black, 95]
- A mask is used to restrict the fit to the region of interest on the
face
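The quantity being minimized can be illustrated with a masked sum of squared differences between the template and the image under a candidate affine transform. This is a sketch of the cost only, not of the minimization method cited on the slide; the function name `masked_ssd`, the six-parameter layout and the nearest-pixel sampling are all assumptions.

```python
from typing import List, Sequence

Image = List[List[float]]   # grayscale image as rows of pixel values


def masked_ssd(template: Image, image: Image, mask: Image,
               affine: Sequence[float]) -> float:
    """Sum of squared differences over masked template pixels.

    `affine` = (a, b, tx, c, d, ty) maps template (x, y) to image
    (a*x + b*y + tx, c*x + d*y + ty). Pixels that land outside the
    image are skipped (a simplification).
    """
    a, b, tx, c, d, ty = affine
    total = 0.0
    for y, row in enumerate(template):
        for x, value in enumerate(row):
            if not mask[y][x]:
                continue   # outside the region of interest
            u = round(a * x + b * y + tx)
            v = round(c * x + d * y + ty)
            if 0 <= v < len(image) and 0 <= u < len(image[0]):
                total += (image[v][u] - value) ** 2
    return total


template = [[1.0, 2.0], [3.0, 4.0]]
image = [[0.0, 1.0, 2.0], [0.0, 3.0, 4.0]]   # template shifted right by one pixel
mask = [[1, 1], [1, 1]]
identity_cost = masked_ssd(template, image, mask, (1, 0, 0, 0, 1, 0))
shifted_cost = masked_ssd(template, image, mask, (1, 0, 1, 0, 1, 0))
```

A tracker would search the six parameters for the lowest cost; here the one-pixel shift gives a cost of zero while the identity transform does not.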
Template Picture
Mask Image
18Pose tracking
- Motion prediction using parameters with physical
meaning
19Pose Tracking
20Lip Motion Tracking
- Using Eigen Points (Covell, 91)
- Feature points include jaw, lips and teeth
- Training database specified manually
- Auto tracking through all pose-tracked images
21Lip motion tracking
22Lip Motion Tracking
Train Database (hand-labeled)
Auto Tracking Results
23Synthesis of new sentences
- New text is converted to wav by the TTS system
- The wav is segmented into a phoneme sequence
- DP is used to find an optimal video sequence from
the training database
- Time-align the triphone videos and stitch them
together.
- Transform the lip sequence and paste it onto the
background faces.
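The DP step can be sketched as a shortest-path search over candidate triphone videos, one set of candidates per position in the new sentence. This is a generic Viterbi-style sketch under assumptions: the name `best_path` is hypothetical, and the whole cost is passed in as a single `edge_cost` function between neighbouring candidates.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def best_path(candidates: Sequence[Sequence[T]],
              edge_cost: Callable[[T, T], float]) -> List[T]:
    """Choose one candidate per position so the summed edge cost is minimal."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate")
    # cost[j]: cheapest total cost of a path ending at candidate j
    cost = [0.0 for _ in candidates[0]]
    back: List[List[int]] = []   # back[i-1][k]: best predecessor of candidate k at i
    for i in range(1, len(candidates)):
        new_cost, pointers = [], []
        for cur in candidates[i]:
            totals = [cost[j] + edge_cost(prev, cur)
                      for j, prev in enumerate(candidates[i - 1])]
            j_best = min(range(len(totals)), key=totals.__getitem__)
            new_cost.append(totals[j_best])
            pointers.append(j_best)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]


# Toy example: candidates are numbers and the cost is their difference.
chosen = best_path([[0, 10], [9, 1], [2, 7]], lambda a, b: abs(a - b))
# chosen is [0, 1, 2]
```

The search costs time proportional to the number of positions times the square of the candidates per position, instead of enumerating every combination.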
24Lip sequence synthesis
[Diagram: the new phoneme sequence is matched against candidate triphones from the database (Triphone 1 to 9 and A to C), from which the optimal phoneme sequence is selected]
25Dynamic Programming
[Diagram: graph from Begin to End through candidate nodes Triphone1 to Triphone5]
26Edge Cost Definition
- Two parts
- Phoneme distance: the distances of the 3 phonemes added
together
- Lip shape distance for the overlapping portion of the
triphone videos
- The two parts are combined as a weighted sum
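A sketch of the two-part cost follows. The slide does not say which triphones the phoneme distance compares or how lip shapes are represented, so this assumes the phoneme part compares a target triphone with a candidate triphone position by position, and that the lip part is the mean Euclidean distance between lip feature vectors in the overlapping frames. The names and the default weights are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Triphone = Tuple[str, str, str]
Frames = Sequence[Sequence[float]]   # lip feature vectors, one per frame


def edge_cost(target: Triphone, candidate: Triphone,
              phoneme_distance: Callable[[str, str], float],
              overlap_prev: Frames, overlap_cur: Frames,
              w_phoneme: float = 1.0, w_lip: float = 1.0) -> float:
    """Weighted sum of the phoneme distance and the lip shape distance."""
    # Part 1: three phoneme distances added together.
    phoneme_part = sum(phoneme_distance(t, c) for t, c in zip(target, candidate))
    # Part 2: lip shape distance over the frames where the two videos overlap.
    pairs = list(zip(overlap_prev, overlap_cur))
    lip_part = sum(math.dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return w_phoneme * phoneme_part + w_lip * lip_part


def same_or_not(a: str, b: str) -> float:
    """Placeholder phoneme distance: 0 for identical labels, 1 otherwise."""
    return 0.0 if a == b else 1.0


cost = edge_cost(("T", "IY", "P"), ("D", "IY", "P"), same_or_not,
                 [[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 2.0]],
                 w_phoneme=1.0, w_lip=2.0)
```

In practice `phoneme_distance` would look up the phoneme distance matrix rather than the 0/1 placeholder used here, and the weights would be tuned by inspection of the results.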
27Background video generation
- The background is a video sequence recorded while the virtual
character was saying something else
- Similarity measurement of background
- Select standard frame
- The frame with the maximal number of frames similar
to it
- Filter out the frames with jerkiness
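Selecting the standard frame can be sketched as counting, for each frame, how many other frames lie within a similarity threshold. The representation is an assumption: each background frame is summarized by a small feature vector (for example its pose parameters), similarity is Euclidean distance, and the name `select_standard_frame` is hypothetical.

```python
import math
from typing import Sequence


def select_standard_frame(frames: Sequence[Sequence[float]],
                          threshold: float) -> int:
    """Return the index of the frame with the most frames similar to it."""
    best_index, best_count = 0, -1
    for i, frame in enumerate(frames):
        # Count the other frames whose distance to this one is within the threshold.
        count = sum(
            1 for j, other in enumerate(frames)
            if j != i and math.dist(frame, other) <= threshold
        )
        if count > best_count:
            best_index, best_count = i, count
    return best_index


frames = [[0.0], [0.1], [0.2], [5.0]]
standard = select_standard_frame(frames, threshold=0.15)
# standard is 1: the frame at 0.1 has two neighbours within the threshold
```

Ties are resolved in favour of the earliest frame, which is an arbitrary choice for this sketch.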
29Stitch the time-aligned result to background faces
- Write back with a mask
- Transform the synthesized lips to fit the background
face
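Writing back with a mask can be sketched as a per-pixel blend between the lip image and the background frame. The sketch assumes grayscale images stored as rows of numbers, a mask with values between 0 and 1, and a lip image that has already been transformed to the pose of the background face; the name `write_back` is hypothetical.

```python
from typing import List

Image = List[List[float]]


def write_back(background: Image, lips: Image, mask: Image) -> Image:
    """Blend lips into the background: mask * lips + (1 - mask) * background."""
    if not (len(background) == len(lips) == len(mask)):
        raise ValueError("images and mask must have the same height")
    result = []
    for bg_row, lip_row, mask_row in zip(background, lips, mask):
        if not (len(bg_row) == len(lip_row) == len(mask_row)):
            raise ValueError("images and mask must have the same width")
        result.append([
            m * lip + (1 - m) * bg
            for bg, lip, m in zip(bg_row, lip_row, mask_row)
        ])
    return result


background = [[10.0, 10.0], [10.0, 10.0]]
lips = [[50.0, 50.0], [50.0, 50.0]]
mask = [[0.0, 1.0], [0.5, 0.0]]
blended = write_back(background, lips, mask)
# blended is [[10.0, 50.0], [30.0, 10.0]]
```

Intermediate mask values near the border of the mouth region give a soft transition instead of a visible seam.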
30Mask image for write-back operation
Original background frame
Write-back result of the same frame
31More video results
32More video results
33Conclusion and Future Work
- Pose tracking and lip motion tracking
- Size of the train database
- Talking face with expression
- Real-time generation?
- Fast modeling for different people
34Animation
35Thank you