Face Animation Overview with Shameless Bias Toward MPEG-4 Face Animation Tools

1
Face Animation Overview with Shameless Bias
Toward MPEG-4 Face Animation Tools
Dr. Eric Petajan Chief Scientist and
Founder face2face animation, inc. eric_at_f2f-inc.com
2
Computer-generated Face Animation Methods
  • Morph targets/key frames (traditional)
  • Speech articulation model (TTS)
  • Facial Action Coding System (FACS)
  • Physics-based (skin and muscle models)
  • Marker-based (dots glued to face)
  • Video-based (surface features)

3
Morph targets/key frames
  • Advantages
  • Complete manual control of each frame
  • Good for exaggerated expressions
  • Disadvantages
  • Hard to achieve good lip sync without manual
    tweaking
  • Morph targets must be downloaded to terminal for
    streaming animation (delay)

4
Speech articulation model
  • Advantages
  • High level control of face
  • Enables TTS
  • Disadvantages
  • Robotic character
  • Hard to sync with real voice

5
Facial Action Coding System
  • Advantages
  • Very high level control of face
  • Maps to morph targets
  • Explicit specification of emotional states
  • Disadvantages
  • Not good for speech
  • Not quantified

6
Physics-based
  • Advantages
  • Good for realistic skin, muscle and fat
  • Collision detection
  • Disadvantages
  • High complexity
  • Must be driven by high level articulation
    parameters (TTS)
  • Hard to drive with motion capture data

7
Marker-based
  • Advantages
  • Can provide accurate motion data from most of the
    face
  • Face models can be animated directly from surface
    feature point motion
  • Disadvantages
  • Dots glued to face
  • Dots must be manually registered
  • Not good for accurate inner lip contour or eyelid
    tracking

8
Video-based
  • Advantages
  • Simple to capture video of face
  • Face models can be animated directly from surface
    feature motion
  • Disadvantages
  • Must have good view of face

9
What is MPEG-4 Multimedia?
  • Natural audio and video objects
  • 2D and 3D graphics (based on VRML)
  • Animation (virtual humans)
  • Synthetic speech and audio

10
Samples versus Objects
  • Traditional video coding is sample based (blocks
    of pixels are compressed)
  • MPEG-4 provides visual object representation for
    better compression and new functionalities
  • Objects are rendered in the terminal after
    decoding object descriptors

11
Object-based Functionalities
  • User can choose display of content layers
  • Individual objects (text, models) can be searched
    or stored for later use
  • Content is independent of display resolution
  • Content can be easily repurposed by provider for
    different networks and users

12
MPEG-4 Object Composition
  • Objects are organized in a scene graph
  • Scene graphs are specified using a binary format
    called BIFS (based on VRML)
  • Both 2D and 3D objects, properties and transforms
    are specified in BIFS
  • BIFS allows objects to be transmitted once and
    instanced repeatedly in the scene under different
    transformations (see the sketch below)
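The define-once, instance-many behavior is easiest to see in code. A minimal sketch of scene-graph instancing follows; the class and field names are illustrative, not actual BIFS or VRML node types:

```python
# Sketch of scene-graph instancing: geometry is defined once and
# referenced by two transforms, mirroring VRML/BIFS reuse of a
# transmitted object. Class and field names are illustrative, not
# actual BIFS or VRML node types.

class Geometry:
    def __init__(self, name, vertices):
        self.name = name
        self.vertices = vertices        # transmitted once

class Transform:
    def __init__(self, translation, child):
        self.translation = translation  # (x, y, z) offset
        self.child = child              # shared reference, not a copy

face_mesh = Geometry("face", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])

scene = [
    Transform((-1.0, 0.0, 0.0), face_mesh),  # left instance
    Transform((+1.0, 0.0, 0.0), face_mesh),  # right instance
]

# Both instances share the same geometry object in memory.
assert scene[0].child is scene[1].child
```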

13
MPEG-4 Operation Sequence
14
(No Transcript)
15
Faces are Special
  • Humans are hard-wired to respond to faces
  • The face is the primary communication interface
  • Human faces can be automatically analyzed and
    parameterized for a wide variety of applications

16
MPEG-4 Face and Body Animation Coding
  • Face animation is in MPEG-4 version 1
  • Body animation is in MPEG-4 version 2
  • Face animation parameters displace feature points
    from neutral position
  • Body animation parameters are joint angles
  • Face and body animation parameter sequences are
    compressed to low bitrates

17
Neutral Face Definition
  • Head axes parallel to the world axes
  • Gaze is in direction of Z axis
  • Eyelids tangent to the iris
  • Pupil diameter is one third of iris diameter
  • Mouth is closed and the upper and lower teeth are
    touching
  • Tongue is flat and horizontal, with the tip of the
    tongue touching the boundary between the upper and
    lower teeth

18
Face Feature Points
[Figure: face feature point layout, including the teeth, distinguishing feature points affected by FAPs from other feature points]
19
Face Animation Parameter Normalization
  • Face Animation Parameters (FAPs) are normalized
    to facial dimensions
  • Each FAP is measured as a fraction of the neutral
    face's mouth width, mouth-nose distance, eye
    separation, or iris diameter (see the sketch below)
  • Three head-rotation and two eyeball-rotation FAPs
    are Euler angles
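A minimal sketch of the normalization, assuming illustrative feature-point names and the MPEG-4 convention of dividing each neutral distance by 1024 to obtain a FAP unit (FAPU):

```python
import math

# Sketch of FAP normalization: a feature-point displacement is expressed
# in FAP units (FAPUs) derived from neutral-face dimensions, so the same
# FAP values animate any face model. The /1024 scaling follows the
# MPEG-4 FAPU convention; the feature-point names are illustrative.

def fap_units(neutral):
    """neutral: dict of 3D feature-point positions on the neutral face."""
    return {
        "MW":    math.dist(neutral["mouth_left"], neutral["mouth_right"]) / 1024,
        "MNS":   math.dist(neutral["nose_tip"], neutral["mouth_top"]) / 1024,
        "ES":    math.dist(neutral["eye_left"], neutral["eye_right"]) / 1024,
        "IRISD": math.dist(neutral["iris_top"], neutral["iris_bottom"]) / 1024,
    }

def normalize(displacement, unit_name, fapu):
    """Convert a displacement in model space to an integer FAP value."""
    return round(displacement / fapu[unit_name])

neutral = {
    "mouth_left": (-30.0, 0.0, 0.0), "mouth_right": (30.0, 0.0, 0.0),
    "nose_tip": (0.0, 20.0, 0.0),    "mouth_top": (0.0, 5.0, 0.0),
    "eye_left": (-32.0, 50.0, 0.0),  "eye_right": (32.0, 50.0, 0.0),
    "iris_top": (0.0, 56.0, 0.0),    "iris_bottom": (0.0, 44.0, 0.0),
}
fapu = fap_units(neutral)
print(normalize(3.0, "MNS", fapu))  # a lip displacement in MNS units
```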

20
Neutral Face Dimensions for FAP Normalization
21
FAP Groups
22
Lip FAPs
  • The mouth is closed when each pair of opposing
    upper and lower lip FAPs sums to 0 (see the check
    below)
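As a quick illustration (the upper/lower pairing below is illustrative, not the normative FAP index table):

```python
# Mouth-closure test from the slide: the mouth is closed when each pair
# of opposing upper/lower lip FAPs sums to zero. The pairing here is
# illustrative, not the normative FAP index table.
def mouth_closed(upper_lip_faps, lower_lip_faps, tol=0):
    return all(abs(u + l) <= tol
               for u, l in zip(upper_lip_faps, lower_lip_faps))

print(mouth_closed([120, 80, 80], [-120, -80, -80]))  # True
```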

23
Face Model Independence
  • FAPs are always normalized for model independence
  • FAPs (and BAPs) can be used without MPEG-4
    systems/BIFS
  • Private face models can be accurately animated
    with FAPs
  • Face models can be simple or complex depending on
    terminal resources

24
MPEG-4 BIFS Face Node
  • The Face node contains the FAP node, the face scene
    graph, Face Definition Parameters (FDP), FIT, and
    FAT (see the sketch below)
  • FIT (Face Interpolation Table) specifies
    interpolation of FAPs in the terminal
  • FAT (Face Animation Table) maps FAPs to face
    model deformation
  • FDP information includes face feature point
    positions and a texture map
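A rough Python rendering of that containment; the field names follow the slide, not the normative BIFS node syntax:

```python
from dataclasses import dataclass, field

# Containment sketch of the BIFS Face node described above. Field names
# follow the slide; this is not the normative BIFS node syntax.

@dataclass
class FDP:
    feature_points: dict = field(default_factory=dict)  # name -> (x, y, z)
    texture_map: bytes = b""

@dataclass
class FIT:
    # Rules for interpolating unsent FAPs in the terminal.
    rules: list = field(default_factory=list)

@dataclass
class FAT:
    # Per-FAP tables mapping FAP values to vertex displacements.
    tables: dict = field(default_factory=dict)

@dataclass
class FaceNode:
    fap: list            # current FAP values
    scene_graph: object  # face model geometry
    fdp: FDP
    fit: FIT
    fat: FAT
```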

25
Face Model Download
  • 3D graphical models (e.g. faces) can be
    downloaded to the terminal with MPEG-4
  • 3D model specification is based on VRML
  • The Face Animation Table (FAT) maps FAPs to face
    model vertex displacements
  • Appearance and animation of downloaded face
    models are exactly predictable

26
FAP Compression
  • FAPs are adaptively quantized to desired quality
    level
  • Quantized FAPs are differentially coded
  • Adaptive arithmetic coding further reduces
    bitrate
  • Typical compressed FAP bitrate is less than 2
    kilobits/second

27
FAP Predictive Coding

[Block diagram: FAP(t) minus the prediction is quantized (Q) and fed to the arithmetic coder, producing the bitstream; the prediction is the previous reconstructed value, formed by inverse quantization (Q^-1) followed by a one-frame delay]
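A minimal Python sketch of that DPCM loop, with an illustrative fixed step size and the arithmetic coder stubbed out (MPEG-4 uses adaptive quantization and adaptive arithmetic coding, as the previous slide notes):

```python
# DPCM sketch of the FAP predictive coder in the diagram above: each FAP
# is predicted by the previous reconstructed value, the residual is
# quantized, and the quantizer index is entropy-coded. The arithmetic
# coder is stubbed out; the step size is illustrative, not adaptive.

def encode_fap_track(values, step=4):
    prev_recon = 0
    indices = []
    for v in values:
        residual = v - prev_recon   # "-" junction in the diagram
        q = round(residual / step)  # Q
        indices.append(q)           # -> arithmetic coder (stub)
        prev_recon += q * step      # Q^-1 plus one-frame delay
    return indices

def decode_fap_track(indices, step=4):
    prev_recon = 0
    out = []
    for q in indices:
        prev_recon += q * step
        out.append(prev_recon)
    return out

track = [0, 10, 22, 30, 29, 25]
coded = encode_fap_track(track)
print(decode_fap_track(coded))  # matches the input to within one step
```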
28
Face Analysis System
  • MPEG-4 does not specify analysis systems
  • The face2face face analysis system tracks nostrils
    for robust operation
  • Inner lip contour estimated using adaptive color
    thresholding and lip modeling
  • Eyelids, eyebrows, and gaze direction are also
    estimated

29
Nostril Tracking
30
Inner Lip Contour Estimation
31
FAP Estimation Algorithm
  • Head scale is normalized based on neutral mouth
    (closed mouth) width
  • Head pitch is approximated based on vertical
    nostril deviation from neutral head position
  • Head roll is computed from the smoothed eye or
    nostril orientation, depending on availability (see
    the sketch below)
  • Inner lip FAPs are measured directly from the
    inner lip contour as deviations from the neutral
    (closed-mouth) lip position
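As a concrete example of one step, head roll can be read off the axis joining the two tracked eye (or nostril) centers; a minimal sketch, with the smoothing factor as an assumption:

```python
import math

# Head-roll estimate from tracked eye (or nostril) centers, as in the
# roll bullet above: the roll angle is the tilt of the line joining the
# two points, smoothed over time. The smoothing factor is an assumption.

def roll_from_points(left, right):
    """Angle (radians) of the left->right axis relative to horizontal."""
    dx = right[0] - left[0]
    dy = right[1] - left[1]
    return math.atan2(dy, dx)

def smooth(prev_angle, new_angle, alpha=0.3):
    """Exponential smoothing of successive roll estimates."""
    return (1 - alpha) * prev_angle + alpha * new_angle

angle = 0.0
for left, right in [((100, 200), (164, 198)), ((100, 201), (164, 196))]:
    angle = smooth(angle, roll_from_points(left, right))
print(math.degrees(angle))
```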

32
FAP Sequence Smoothing
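The deck presents smoothing as a figure and does not name the filter; the sketch below assumes a simple centered moving average over each FAP track:

```python
# FAP sequence smoothing sketch. The deck shows this slide as a figure
# and does not specify the filter; a centered moving average is assumed.

def smooth_track(values, window=5):
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        out.append(sum(chunk) / len(chunk))
    return out

print(smooth_track([0, 40, 10, 50, 20, 60, 30]))
```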
33
MPEG-4 Visemes and Expressions
  • Each frame specifies a weighted combination of 2
    visemes and 2 facial expressions (see the sketch
    below)
  • The decoder is free to interpret the effect of
    visemes and expressions after FAPs are applied
  • Definitions of visemes and expressions in terms of
    FAPs can also be downloaded
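A sketch of the per-frame blend; the FAP offset tables stand in for downloaded viseme/expression definitions and are illustrative:

```python
# Per-frame blend from the slide: two visemes and two expressions, each
# with a weight, are combined on top of the base FAP vector. The offset
# tables are illustrative stand-ins for downloaded definitions.

def blend(base_faps, visemes, expressions):
    """visemes/expressions: lists of (fap_offsets, weight) pairs, two each."""
    out = list(base_faps)
    for offsets, weight in visemes + expressions:
        for i, d in enumerate(offsets):
            out[i] += weight * d
    return out

base = [0, 0, 0, 0]
viseme_a = ([10, -10, 0, 0], 0.7)   # e.g. a bilabial closure shape
viseme_b = ([4, -2, 0, 0], 0.3)
expr_joy = ([0, 0, 8, 8], 0.5)
expr_neutral = ([0, 0, 0, 0], 0.5)

print(blend(base, [viseme_a, viseme_b], [expr_joy, expr_neutral]))
```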

34
Visemes
35
Facial Expressions
36
Free Face Model Software
  • Wireface is an OpenGL-based, MPEG-4-compliant
    face model
  • Good starting point for building high-quality
    face models for web applications
  • Reads a FAP file and a raw audio file (see the
    reader sketch below)
  • Renders face and audio in real time
  • Wireface source is freely available
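Wireface's actual parser is not shown in the deck; the sketch below assumes a common text .fap layout (comment lines, a header line, then a mask line and a value line per frame), so treat the format details as assumptions to verify against real files:

```python
# Minimal text .fap reader. Assumed layout (common in FAP tools, but not
# verified against the normative spec -- check your files): '#' lines
# are comments; a header line gives version, name, frame rate, and frame
# count; each frame is a 0/1 mask line over the 68 FAPs followed by a
# line of values for the FAPs whose mask bit is set. Some variants also
# prepend the frame number to the value line.

def read_fap_file(path):
    with open(path) as f:
        lines = [s for s in (ln.strip() for ln in f)
                 if s and not s.startswith("#")]
    version, name, fps, nframes = lines[0].split()
    frames = []
    for i in range(1, len(lines), 2):
        mask = [int(x) for x in lines[i].split()]
        values = [float(x) for x in lines[i + 1].split()]
        frame, vi = {}, 0
        for fap_index, bit in enumerate(mask, start=1):
            if bit:
                frame[fap_index] = values[vi]
                vi += 1
        frames.append(frame)
    return float(fps), frames
```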

37
Body Animation
  • Harmonized with the VRML H-Anim spec
  • Body Animation Parameters (BAPs) are humanoid
    skeleton joint Euler angles
  • Body Animation Table (BAT) can be downloaded to
    map BAPs to skin deformation
  • BAPs can be highly compressed for streaming

38
Body Animation Parameters (BAPs)
  • 186 humanoid skeleton Euler angles (see the
    rotation sketch below)
  • 110 free parameters for use with a downloaded body
    surface mesh
  • Coded using the same codecs as FAPs
  • Typical bitrate for coded BAPs is 5-10 kbps
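Since a BAP triple is a set of joint Euler angles, applying one in the terminal is a rotation composition; a minimal sketch (the Z-Y-X order here is an assumption, not the normative H-Anim/MPEG-4 ordering):

```python
import math

# Applying a BAP triple as a joint rotation. BAPs are Euler angles per
# humanoid joint; the Rz @ Ry @ Rx composition used here (X applied
# first) is an assumption, not the normative H-Anim/MPEG-4 ordering.

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def joint_rotation(rx, ry, rz):
    """Compose per-axis BAP angles (radians) into one joint rotation."""
    return matmul(rot_z(rz), matmul(rot_y(ry), rot_x(rx)))

R = joint_rotation(0.1, 0.0, math.pi / 2)  # e.g. an elbow BAP triple
```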

39
Body Definition Parameters (BDPs)
  • Humanoid joint center positions
  • Names and hierarchy harmonized with VRML/Web3D
    H-Anim working group
  • Default positions in standard for broadcast
    applications
  • Download just the BDPs to accurately animate an
    unknown body model

40
Faces Enhance the User Experience
  • Virtual call center agents
  • News readers (e.g. Ananova)
  • Story tellers for the child in all of us
  • eLearning
  • Program guide
  • Multilingual (same face different voice)
  • Entertainment animation
  • Multiplayer games

41
Visual Content for the Practical Internet
  • Broadband deployment is happening slowly
  • DSL availability is limited and cable is shared
  • Talking heads need high frame-rate
  • Consumer graphics hardware is cheap and powerful
  • MPEG-4 SNHC/FBA tools are matched to available
    bandwidth and terminals

42
Visual Speech Processing
  • FAPs can be used to improve speech recognition
    accuracy
  • Text-to-speech systems can use FAPs to animate
    face models
  • FAPs can be used in computer-human dialogue
    systems to communicate emotions, intentions and
    speech especially in noisy environments

43
Video-driven Face Animation
  • Facial expressions, lip movements and head motion
    transferred to face model
  • FAPs extracted from talking head video with
    special computer vision system
  • No face markers or lipstick are required
  • Normal lighting is used
  • Communicates lip movements and facial expressions
    with visual anonymity

44
Automatic Face Animation Demonstration
  • FAPs extracted from camcorder video
  • FAPs compressed to less than 2 kbits/sec
  • 30 frames/sec animation generated automatically
  • Face models animated with a bones rig or a fixed
    deformable mesh (real time)

45
(No Transcript)
46
What is easy, solved, or almost solved
  • Can we do photorealistic non-animated face
    models? YES
  • Can we do near-real-time lip-syncing that is
    indistinguishable from a human? NO

47
What is really hard
  • Synthesizing human speech and facial expressions
  • Hair

48
What we have assumed someone else is solving
  • Graphics acceleration
  • Video camera cost and resolution
  • Multimedia communication infrastructure

49
Where we need help
  • We have a face with 68 parameters but we need
    the psychologists to tell us how to drive it
    autonomously
  • We need to embody our agents into graphical
    models that have a couple of thousand parameters
    to control gaze, gesture, body language, and do
    collision detection -> NEED MORE SPEED

50
Core functionality of the face
  • Speech
      • Lips, teeth, tongue
  • Emotional expressions
      • Gaze, eyebrows, eyelids, head pose
  • Non-verbal communication
  • Sensory responsivity
  • Technical requirements
      • Frame rate
      • Synchronization
      • Latency
      • Bitrate
      • Spatial resolution
      • Complexity
      • Common framework with body
  • Interaction
      • Different faces should respond similarly to
        common commands
      • Accessible to everyone

51
Interaction with other components
  • Language and discourse
      • Phoneme-to-viseme mapping
      • Given/new
  • Action in the environment
  • Global information
      • Emotional state
      • Personality
      • Culture
      • World knowledge
  • Central time-base and timestamps

52
Open questions
  • Central vs peripheral functionality
  • Degree of interface commonality
  • Degree of agent autonomy
  • What should the virtual human (VH) be capable of?