Title: Face Animation Overview with Shameless Bias Toward MPEG-4 Face Animation Tools
1. Face Animation Overview with Shameless Bias Toward MPEG-4 Face Animation Tools
Dr. Eric Petajan, Chief Scientist and Founder, face2face animation, inc. (eric_at_f2f-inc.com)
2. Computer-generated Face Animation Methods
- Morph targets/key frames (traditional)
- Speech articulation model (TTS)
- Facial Action Coding System (FACS)
- Physics-based (skin and muscle models)
- Marker-based (dots glued to face)
- Video-based (surface features)
3. Morph targets/key frames
- Advantages
- Complete manual control of each frame
- Good for exaggerated expressions
- Disadvantages
- Hard to achieve good lip sync without manual tweaking
- Morph targets must be downloaded to the terminal for streaming animation (delay)
4. Speech articulation model
- Advantages
- High level control of face
- Enables TTS
- Disadvantages
- Robotic character
- Hard to sync with real voice
5. Facial Action Coding System
- Advantages
- Very high level control of face
- Maps to morph targets
- Explicit specification of emotional states
- Disadvantages
- Not good for speech
- Not quantified
6. Physics-based
- Advantages
- Good for realistic skin, muscle and fat
- Collision detection
- Disadvantages
- High complexity
- Must be driven by high-level articulation parameters (TTS)
- Hard to drive with motion capture data
7. Marker-based
- Advantages
- Can provide accurate motion data from most of the face
- Face models can be animated directly from surface feature point motion
- Disadvantages
- Dots glued to face
- Dots must be manually registered
- Not good for accurate inner lip contour or eyelid tracking
8. Video-based
- Advantages
- Simple to capture video of face
- Face models can be animated directly from surface feature motion
- Disadvantages
- Must have good view of face
9. What is MPEG-4 Multimedia?
- Natural audio and video objects
- 2D and 3D graphics (based on VRML)
- Animation (virtual humans)
- Synthetic speech and audio
10. Samples versus Objects
- Traditional video coding is sample-based (blocks of pixels are compressed)
- MPEG-4 provides visual object representation for better compression and new functionalities
- Objects are rendered in the terminal after decoding object descriptors
11. Object-based Functionalities
- User can choose display of content layers
- Individual objects (text, models) can be searched or stored for later use
- Content is independent of display resolution
- Content can be easily repurposed by the provider for different networks and users
12. MPEG-4 Object Composition
- Objects are organized in a scene graph
- Scene graphs are specified using a binary format called BIFS (based on VRML)
- Both 2D and 3D objects, properties and transforms are specified in BIFS
- BIFS allows objects to be transmitted once and instanced repeatedly in the scene after transformations (see the sketch below)
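A minimal sketch of the transmit-once/instance-repeatedly idea, using a plain Python structure in place of real BIFS nodes (the node names and fields here are illustrative, not the BIFS encoding):

```python
# One subtree is defined once and referenced twice, analogous to
# VRML/BIFS DEF/USE: the geometry is transmitted a single time and
# composed into the scene under two different transforms.
chair = {"node": "Shape", "geometry": "chair_mesh"}

scene = {"node": "Group", "children": [
    {"node": "Transform", "translation": (0.0, 0.0, 0.0), "children": [chair]},
    {"node": "Transform", "translation": (2.0, 0.0, 0.0), "children": [chair]},
]}

# Both transforms point at the same object in memory, mirroring how a
# BIFS scene graph reuses a transmitted node rather than resending it.
assert scene["children"][0]["children"][0] is scene["children"][1]["children"][0]
```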
13. MPEG-4 Operation Sequence
15. Faces are Special
- Humans are hard-wired to respond to faces
- The face is the primary communication interface
- Human faces can be automatically analyzed and parameterized for a wide variety of applications
16. MPEG-4 Face and Body Animation Coding
- Face animation is in MPEG-4 version 1
- Body animation is in MPEG-4 version 2
- Face animation parameters displace feature points from neutral position
- Body animation parameters are joint angles
- Face and body animation parameter sequences are compressed to low bitrates
17. Neutral Face Definition
- Head axes parallel to the world axes
- Gaze is in the direction of the Z axis
- Eyelids tangent to the iris
- Pupil diameter is one third of iris diameter
- Mouth is closed and the upper and lower teeth are touching
- Tongue is flat and horizontal, with the tip of the tongue touching the boundary between the upper and lower teeth (two of these conditions are checked in the sketch below)
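A minimal sketch that checks two of the quantitative neutral-face conditions above; the measurement record and its field names are hypothetical, not part of the MPEG-4 spec:

```python
from dataclasses import dataclass

@dataclass
class FaceMeasurements:
    pupil_diameter: float
    iris_diameter: float
    upper_teeth_y: float   # vertical position of the upper-teeth edge
    lower_teeth_y: float   # vertical position of the lower-teeth edge

def is_neutral(face: FaceMeasurements, tol: float = 1e-3) -> bool:
    # Pupil diameter must be one third of the iris diameter.
    pupil_ok = abs(face.pupil_diameter - face.iris_diameter / 3.0) <= tol
    # Upper and lower teeth must be touching (mouth closed).
    teeth_ok = abs(face.upper_teeth_y - face.lower_teeth_y) <= tol
    return pupil_ok and teeth_ok
```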
18. Face Feature Points
[Diagram: annotated face and teeth showing the MPEG-4 feature points; the legend distinguishes feature points affected by FAPs from other feature points]
19. Face Animation Parameter Normalization
- Face Animation Parameters (FAPs) are normalized to facial dimensions
- Each FAP is measured as a fraction of neutral-face mouth width, mouth-nose distance, eye separation, or iris diameter (see the sketch below)
- 3 head and 2 eyeball rotation FAPs are Euler angles
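A minimal sketch of this normalization, assuming the common FAP-unit (FAPU) convention of 1/1024 of each neutral-face distance; the class and field names are mine, not the spec's:

```python
from dataclasses import dataclass

@dataclass
class NeutralFace:
    mouth_width: float      # distance between mouth corners (MW)
    mouth_nose: float       # mouth-to-nose distance (MNS)
    eye_separation: float   # distance between pupil centers (ES)
    iris_diameter: float    # iris diameter (IRISD)

    def fapu(self, unit: str) -> float:
        # One FAP unit is assumed here to be 1/1024 of the
        # corresponding neutral-face distance.
        base = {"MW": self.mouth_width, "MNS": self.mouth_nose,
                "ES": self.eye_separation, "IRISD": self.iris_diameter}
        return base[unit] / 1024.0

def normalize_fap(displacement: float, face: NeutralFace, unit: str) -> float:
    """Express a feature-point displacement (model units) in FAPUs, so the
    value is independent of the absolute size of the face model."""
    return displacement / face.fapu(unit)

# Example: a 0.6-unit lower-lip drop on a face with a 6.0-unit-wide
# neutral mouth is 0.6 / (6.0 / 1024), about 102.4 FAPUs.
face = NeutralFace(6.0, 1.8, 6.2, 1.2)
print(normalize_fap(0.6, face, "MW"))
```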
20. Neutral Face Dimensions for FAP Normalization
21. FAP Groups
22. Lip FAPs
- Mouth is closed if the sum of the upper and lower lip FAPs is 0 (see the sketch below)
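One plausible reading of that condition as code, pairing each upper-lip FAP with its opposing lower-lip FAP (the pairing itself is an assumption here):

```python
def mouth_closed(upper_lip_faps, lower_lip_faps, tol=1e-6):
    # Mouth is closed when each opposing upper/lower lip FAP pair
    # sums to zero, i.e. the displacements cancel and the lips meet.
    return all(abs(u + l) <= tol
               for u, l in zip(upper_lip_faps, lower_lip_faps))
```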
23. Face Model Independence
- FAPs are always normalized for model independence
- FAPs (and BAPs) can be used without MPEG-4 systems/BIFS
- Private face models can be accurately animated with FAPs
- Face models can be simple or complex depending on terminal resources
24. MPEG-4 BIFS Face Node
- Face node contains FAP node, Face scene graph, Face Definition Parameters (FDP), FIT, and FAT
- FIT (Face Interpolation Table) specifies interpolation of FAPs in the terminal
- FAT (Face Animation Table) maps FAPs to face model deformation
- FDP information includes face feature point positions and texture map
25. Face Model Download
- 3D graphical models (e.g. faces) can be downloaded to the terminal with MPEG-4
- 3D model specification is based on VRML
- Face Animation Table (FAT) maps FAPs to face model vertex displacements (see the sketch below)
- Appearance and animation of downloaded face models are exactly predictable
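A minimal sketch of a FAT-style lookup, simplified to one linear segment per vertex (the real table allows piecewise-linear intervals, and the FAP id and displacement values here are hypothetical):

```python
import numpy as np

# For each FAP id, a list of (vertex_index, displacement_per_FAPU) pairs.
FAT = {
    51: [(120, np.array([0.0, -0.004, 0.0])),   # hypothetical lower-lip vertices
         (121, np.array([0.0, -0.003, 0.0]))],
}

def apply_faps(neutral_vertices: np.ndarray, faps: dict) -> np.ndarray:
    """Displace the neutral-face mesh by the active FAPs: each vertex
    moves by (FAP value) x (its per-FAPU displacement from the table)."""
    out = neutral_vertices.copy()
    for fap_id, value in faps.items():
        for vertex_index, per_unit in FAT.get(fap_id, []):
            out[vertex_index] += value * per_unit
    return out
```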
26. FAP Compression
- FAPs are adaptively quantized to desired quality level
- Quantized FAPs are differentially coded
- Adaptive arithmetic coding further reduces bitrate
- Typical compressed FAP bitrate is less than 2 kilobits/second
27. FAP Predictive Coding
[Block diagram: FAP(t) minus the frame-delayed reconstruction is quantized (Q) and arithmetic coded into the bitstream; an inverse quantizer (Q-1) and frame delay close the prediction loop. A code sketch of this loop follows.]
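A minimal sketch of the quantize-and-predict loop from the diagram, omitting the arithmetic coder (the quantized symbols would be entropy-coded in the real codec); the step size and values are illustrative:

```python
def encode_faps(fap_sequence, step):
    """DPCM encoder: quantize the difference between each FAP value and
    the reconstruction of the previous frame, as in the diagram."""
    symbols = []
    recon_prev = 0.0
    for x in fap_sequence:
        q = round((x - recon_prev) / step)  # quantized prediction error
        symbols.append(q)                   # would feed the arithmetic coder
        recon_prev += q * step              # decoder-matched reconstruction
    return symbols

def decode_faps(symbols, step):
    out, recon = [], 0.0
    for q in symbols:
        recon += q * step
        out.append(recon)
    return out

values = [0.0, 0.75, 1.75, 2.0, 1.25]
assert decode_faps(encode_faps(values, 0.25), 0.25) == values
```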
28. Face Analysis System
- MPEG-4 does not specify analysis systems
- The face2face face analysis system tracks nostrils for robust operation
- Inner lip contour is estimated using adaptive color thresholding and lip modeling (see the sketch below)
- Eyelids, eyebrows and gaze direction are also estimated
29. Nostril Tracking
30. Inner Lip Contour Estimation
31. FAP Estimation Algorithm
- Head scale is normalized based on neutral mouth (closed mouth) width
- Head pitch is approximated based on vertical nostril deviation from neutral head position
- Head roll is computed from smoothed eye or nostril orientation, depending on availability (see the sketch below)
- Inner lip FAPs are measured directly from the inner lip contour as deviations from the neutral lip position (closed mouth)
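A minimal sketch of the roll step only, assuming 2D image coordinates for a tracked left/right point pair (eye centers or nostrils); the smoothing and the eye/nostril fallback logic are omitted:

```python
import math

def head_roll_degrees(left_pt, right_pt):
    """Roll angle of the line through two tracked feature points;
    zero when the line is horizontal in the image."""
    dx = right_pt[0] - left_pt[0]
    dy = right_pt[1] - left_pt[1]
    return math.degrees(math.atan2(dy, dx))

print(head_roll_degrees((100, 200), (160, 210)))  # about 9.5 degrees
```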
32. FAP Sequence Smoothing
33. MPEG-4 Visemes and Expressions
- A weighted combination of 2 visemes and 2 facial expressions for each frame (see the sketch below)
- Decoder is free to interpret the effect of visemes and expressions after FAPs are applied
- Definitions of visemes and expressions using FAPs can also be downloaded
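A minimal sketch of the per-frame weighted combination; the two example viseme FAP vectors are made up for illustration:

```python
def blend(fap_vectors, weights):
    """Weighted combination of FAP vectors, e.g. two visemes (or two
    facial expressions) blended for a single frame."""
    out = [0.0] * len(fap_vectors[0])
    for vec, w in zip(fap_vectors, weights):
        for i, v in enumerate(vec):
            out[i] += w * v
    return out

viseme_pbm = [0.0, -40.0, 35.0]   # hypothetical mouth FAPs for /p, b, m/
viseme_fv  = [5.0, -10.0, 60.0]   # hypothetical mouth FAPs for /f, v/
print(blend([viseme_pbm, viseme_fv], [0.7, 0.3]))  # [1.5, -31.0, 42.5]
```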
34. Visemes
35. Facial Expressions
36. Free Face Model Software
- Wireface is an OpenGL-based, MPEG-4 compliant face model
- Good starting point for building high-quality face models for web applications
- Reads a FAP file and a raw audio file
- Renders face and audio in real time
- Wireface source is freely available
37. Body Animation
- Harmonized with the VRML H-Anim spec
- Body Animation Parameters (BAPs) are humanoid skeleton joint Euler angles
- Body Animation Table (BAT) can be downloaded to map BAPs to skin deformation
- BAPs can be highly compressed for streaming
38. Body Animation Parameters (BAPs)
- 186 humanoid skeleton Euler angles (see the sketch below)
- 110 free parameters for use with a downloaded body surface mesh
- Coded using the same codecs as FAPs
- Typical bitrate for coded BAPs is 5-10 kbits/second
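A minimal sketch of turning three joint Euler angles into a rotation matrix, using a Z-Y-X convention chosen for illustration (the standard fixes the actual axis order per joint):

```python
import math

def rotation_zyx(yaw, pitch, roll):
    """Rotation matrix R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in
    radians; one joint's BAP triple would feed a function like this."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ]
```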
39. Body Definition Parameters (BDPs)
- Humanoid joint center positions
- Names and hierarchy harmonized with the VRML/Web3D H-Anim working group
- Default positions in the standard for broadcast applications
- Download just the BDPs to accurately animate an unknown body model
40. Faces Enhance the User Experience
- Virtual call center agents
- News readers (e.g. Ananova)
- Story tellers for the child in all of us
- eLearning
- Program guide
- Multilingual (same face different voice)
- Entertainment animation
- Multiplayer games
41. Visual Content for the Practical Internet
- Broadband deployment is happening slowly
- DSL availability is limited and cable is shared
- Talking heads need high frame-rate
- Consumer graphics hardware is cheap and powerful
- MPEG-4 SNHC/FBA tools are matched to available bandwidth and terminals
42. Visual Speech Processing
- FAPs can be used to improve speech recognition accuracy
- Text-to-speech systems can use FAPs to animate face models
- FAPs can be used in computer-human dialogue systems to communicate emotions, intentions and speech, especially in noisy environments
43. Video-driven Face Animation
- Facial expressions, lip movements and head motion transferred to face model
- FAPs extracted from talking head video with special computer vision system
- No face markers or lipstick is required
- Normal lighting is used
- Communicates lip movements and facial expressions with visual anonymity
44. Automatic Face Animation Demonstration
- FAPs extracted from camcorder video
- FAPs compressed to less than 2 kbits/sec
- 30 frames/sec animation generated automatically
- Face models animated with bones rig or fixed deformable mesh (real-time)
46. What is easy, solved, or almost solved
- Can we do photorealistic non-animated face models? YES
- Can we do near-real-time lip-syncing that is indistinguishable from a human? NO
47. What is really hard
- Synthesizing human speech and facial expressions
- Hair
48. What we have assumed someone else is solving
- Graphics acceleration
- Video camera cost and resolution
- Multimedia communication infrastructure
49. Where we need help
- We have a face with 68 parameters, but we need the psychologists to tell us how to drive it autonomously
- We need to embody our agents into graphical models that have a couple of thousand parameters to control gaze, gesture, body language, and do collision detection -> NEED MORE SPEED
50. Core functionality of the face
- Speech
- Lips, teeth, tongue
- Emotional expressions
- Gaze, eyebrow, eyelids, head pose
- Non-verbal communication
- Sensory responsivity
- Technical requirements
- Framerate
- Synchronization
- Latency
- Bitrate
- Spatial resolution
- Complexity
- Common framework with body
- Interaction
- Different faces should respond similarly to common commands
- Accessible to everyone
51. Interaction with other components
- Language and discourse
- Phoneme-to-viseme mapping (see the sketch after this list)
- Given/new
- Action in the environment
- Global information
- Emotional state
- Personality
- Culture
- World knowledge
- Central time-base and timestamps
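A minimal sketch of a phoneme-to-viseme lookup. MPEG-4 defines 14 visemes plus a neutral value; the grouping shown here is a small illustrative subset, not the normative table:

```python
# Many phonemes map to one viseme because they look alike on the lips.
PHONEME_TO_VISEME = {
    "p": "pbm", "b": "pbm", "m": "pbm",   # bilabials share one mouth shape
    "f": "fv",  "v": "fv",                # labiodentals
    "s": "sz",  "z": "sz",                # alveolar fricatives
}

def to_visemes(phonemes):
    """Map a phoneme sequence to viseme labels, defaulting to neutral."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

print(to_visemes(["h", "e", "l", "p"]))  # ['neutral', 'neutral', 'neutral', 'pbm']
```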
52. Open questions
- Central vs. peripheral functionality
- Degree of interface commonality
- Degree of agent autonomy
- What should the virtual human (VH) be capable of?