Title: How Machines Learn to Talk
1. How Machines Learn to Talk
- Amitabha Mukerjee
- IIT Kanpur
- Work done with:
- Computer Vision: Profs. C. Venkatesh, Pabitra Mitra; Prithvijit Guha, A. Ramakrishna Rao, Pradeep Vaghela
- Natural Language: Prof. Achla Raina, V. Shreeniwas
2. Robotics Collaborations
- IGCAR Kalpakkam
- Sanjay Gandhi PG Medical Hospital
3. Visual Robot Navigation
Time-to-collision-based robot navigation
4. Hyper-Redundant Manipulators
The same manipulator can work in changing workspaces
- Reconfigurable workspaces / emergency access
- Optimal design of hyper-redundant systems
SCARA and 3D
5. Planar Hyper-Redundancy
4-link Planar Robot
Motion Planning
6. Micro-Robots
- Micro Soccer Robots (1999-)
- 8 cm Smart Surveillance Robot (1 m/s)
- Autonomous Flying Robot (2004)
- Omni-directional platform (2002)
Omni-Directional Robot
Sponsor: DST (amit_at_iitk.ac.in)
7. Flying Robot
Start-up at IIT Kanpur: Whirligig Robotics
Test flight of UAV. Inertial Measurement Unit (IMU) under commercial production.
8. Tracheal Intubation Device
Assists the surgeon while inserting a breathing tube during general anaesthesia
- Ball-and-socket joint
- Aperture for fibre-optic video cable
- Aperture for oxygenation tube
- Endotracheal tube aperture
- Hole for suction tube
- Control cable attachment points
Device for intubation during general anaesthesia
Sponsor: DST / SGPGM (sens_at_iitk.ac.in)
9. Draupadi's Swayamvar
Can the arrow hit the rotating mark?
Sponsor: Media Lab Asia
10. High-DOF Motion Planning
- Accessing hard-to-reach spaces
- Design of hyper-redundant systems
- Parallel manipulators
- Sponsor: BRNS / MHRD
- dasgupta_at_iitk.ac.in
10-link 3D Robot Optimal Design
11. Multimodal Language Acquisition
- Consider a child observing a scene together with adults talking about it
- Grounded language: symbols are grounded in perceptual signals
- Use of simple videos with boxes and simple shapes, as standardly used in social psychology
12. Objective
- To develop a computational framework for multimodal language acquisition
- Acquiring the perceptual structure corresponding to verbs
- Using Recurrent Neural Networks as a biologically plausible model for temporal abstraction
- Adapting the learned model to interpret activities in real videos
13. (No transcript)
14. Alternate Views of Two Subjects

| Start Frame | End Frame | Subject One | Subject Two |
|---|---|---|---|
| 10 | 57 | the door closed | the square closes the other, the smaller square |
| 172 | 177 | the square and the circle move into the screen | two more objects came in, a circle and a square |
| 387 | 398 | the big square moved out of the corner | the square is moving around |
| 487 | 540 | the big square opened the door | he's trying to push his way out of the square |
| 617 | 635 | the little square hit the big square | they're hitting each other |
| 805 | 848 | the big square hit the little square | and they keep hitting each other |
| 852 | 1100 | the big square hit the little square again; the little circle moves to the door; the big square threatens the little circle | now the circle is blocking the entrance for the big square; now the circle is inside the square |
| 1145 | 1202 | the big square goes inside the box (and) the door closes | another square went inside the big square |
| 1270 | 1630 | the big square approaches the little circle; the little square opens the door | the little square's trying to get inside the big square |
| 1550 | 1658 | the square scares the little circle in the corner | and the objects inside are moving closer together |
| 1752 | 1796 | the little circle goes out of the room | the circle just went out of the square |
| 1780 | 1853 | the door closes | the little circle, the little square closes the box |
| 1967 | 1996 | the big square hits the door down | the big square just got out of the box |
| 2197 | 2207 | the little square and the circle go off the screen | they left the picture |
| 2207 | 2292 | the big square circled around | now the big square is there with the square box |
| 2471 | 2501 | big square knocks the door down | (the big square) tried to force his way in |
15. Plausible Biological Framework
16. Visually Grounded Corpus
- Two psychological research films, one based on the classic Heider and Simmel (1944), the other based on Hide & Seek
- These animations portray motion paths of geometric figures (Big Square, Small Square, Circle)
- Chase Alt
17. Cognate Clustering
- Similarity clustering: different expressions for the same action, e.g. "move away from center" vs. "go to a corner"
- Frequency: remove infrequent lexical units
- Synonymy: sets of lexical units used consistently in the same intervals, to mark the same action, for the same set of agents
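The synonymy criterion above can be sketched as interval-overlap clustering: lexical units whose usage intervals largely coincide get merged. This is an illustrative reconstruction, not the authors' code; the function names, the threshold, and the toy usage data are all invented.

```python
# Hypothetical sketch of synonymy-based cognate clustering: lexical units
# used over (nearly) the same video intervals are grouped together.

def jaccard(a, b):
    """Jaccard overlap of two sets of half-open (start, end) frame intervals."""
    fa = {f for s, e in a for f in range(s, e)}
    fb = {f for s, e in b for f in range(s, e)}
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

def cluster_cognates(usage, threshold=0.5):
    """Greedy single-link grouping of words by interval overlap (illustrative)."""
    clusters = []
    for word, intervals in usage.items():
        for cl in clusters:
            if any(jaccard(intervals, usage[w]) >= threshold for w in cl):
                cl.append(word)
                break
        else:
            clusters.append([word])
    return clusters

# Toy data: "hit" and "strike" annotate the same events with different words.
usage = {
    "hit": [(10, 30), (50, 60)],
    "strike": [(12, 28), (51, 59)],
    "chase": [(100, 160)],
}
```

On this toy data `cluster_cognates(usage)` groups `hit` with `strike` and leaves `chase` alone; real transcripts would first need the frequency filter from the slide.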
18. Clusters
Chase
Hide & Seek
19. Event Structure: Input
Chase
Hide & Seek
Actions "Come Together" and "Move Away" from the Hide & Seek video
20. Perceptual Process
21. Design of Feature Set
- The features selected here relate to spatial aspects of conceptual primitives in children, such as position, relative pose, velocity, etc.
- Use features that are kinematic in nature: temporal derivatives or simple transforms of the basic ones.
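A minimal sketch of such kinematic features, assuming per-frame bounding-box centroids for two tracked objects (the feature names and sample trajectory are illustrative, not the paper's exact feature set):

```python
import numpy as np

# Monadic and dyadic kinematic features from centroid tracks of two objects.

def kinematic_features(a, b, dt=1.0):
    """a, b: (T, 2) arrays of centroid positions over T frames."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    vel_a = np.gradient(a, dt, axis=0)       # monadic: velocity of object a
    speed_a = np.linalg.norm(vel_a, axis=1)
    dist = np.linalg.norm(a - b, axis=1)     # dyadic: separation distance
    approach = np.gradient(dist, dt)         # negative while coming closer
    return speed_a, dist, approach

# Object a moves toward a stationary b along the x axis.
a = [(t, 0.0) for t in range(5)]
b = [(10.0, 0.0)] * 5
speed, dist, approach = kinematic_features(a, b)
```

Here the constant negative `approach` derivative is exactly the kind of signal a "come closer" detector would latch onto.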
22. Monadic Features
23. Dyadic Predicates
24. VIdeo and Commentary for Event Structures (VICES)
25. The Classification Problem
- The problem is one of time-series classification
- Possible methodologies include:
- Logic-based methods
- Hidden Markov Models
- Recurrent Neural Networks
26. Elman Network
- Commonly a two-layer network with feedback from the first-layer output to the first-layer input
- Elman networks detect and generate time-varying patterns
- They are also able to learn spatial patterns
27. Feature Extraction in Abstract Videos
- Each image is read into a 2D matrix
- Connected component analysis is performed
- A bounding box is computed for each connected component
- Dynamic tracking is used to keep track of each object
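The labeling and bounding-box steps above can be sketched with a plain flood fill (a toy reconstruction; a real pipeline would use a library routine such as `scipy.ndimage.label`):

```python
from collections import deque

# Label 4-connected components in a binary frame and return a bounding box
# (min_row, min_col, max_row, max_col) for each component.

def connected_components(grid):
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                q, box = deque([(r, c)]), [r, c, r, c]
                seen[r][c] = True
                while q:                       # BFS flood fill over the blob
                    i, j = q.popleft()
                    box = [min(box[0], i), min(box[1], j),
                           max(box[2], i), max(box[3], j)]
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and grid[ni][nj] and not seen[ni][nj]):
                            seen[ni][nj] = True
                            q.append((ni, nj))
                boxes.append(tuple(box))
    return boxes

frame = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 1]]
print(connected_components(frame))   # two components, two boxes
```

Tracking then amounts to matching each frame's boxes to the previous frame's objects, e.g. by nearest centroid.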
28. (No transcript)
29. Working with Real Videos
- Challenges:
  - Noise in real-world videos
  - Illumination changes
  - Occlusions
  - Extracting depth information
- Our setup:
  - Camera is fixed at head height
  - Angle of depression is 0 degrees (approx.)
- Video
30. Background Subtraction
- Learn on still background images
- Find pixel intensity distributions
- Classify each pixel as background if its intensity fits the learned distribution
- Remove shadows: a special case of reduced illumination, S = kP where k < 1.0
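A hedged sketch of this scheme, with all parameter values invented for illustration: learn a per-pixel mean and standard deviation from still background frames, call a pixel background when it lies within c standard deviations of its mean, and treat proportionally darkened pixels (S = kP, k < 1) as shadow.

```python
import numpy as np

def learn_background(frames):
    """Per-pixel mean and std over a stack of still background frames."""
    stack = np.stack(frames).astype(float)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6

def classify(frame, mean, std, c=2.5, k_low=0.5, k_high=0.95):
    """Split a frame into foreground and shadow masks (toy thresholds)."""
    frame = np.asarray(frame, float)
    background = np.abs(frame - mean) < c * std
    ratio = frame / np.maximum(mean, 1e-6)          # S / P
    shadow = ~background & (ratio > k_low) & (ratio < k_high)
    foreground = ~background & ~shadow
    return foreground, shadow

# Still background at intensity 100 with slight frame-to-frame variation.
bg_frames = [np.full((4, 4), 100.0) + d for d in (-2.0, -1.0, 0.0, 1.0, 2.0)]
mean, std = learn_background(bg_frames)

frame = np.full((4, 4), 100.0)
frame[0, 0] = 200.0   # bright foreground pixel
frame[0, 1] = 70.0    # shadowed pixel, S = 0.7 * P
fg, sh = classify(frame, mean, std)
```

The shadow test is exactly the slide's S = kP rule: the intensity ratio to the learned background must fall in a band below 1.0.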
31. Contd.
- Extract human blobs by connected component analysis
- A bounding box is computed for each person
- Track human blobs: each object is tracked using a mean-shift tracking algorithm
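The core mean-shift step can be sketched as follows (an illustrative simplification, not OpenCV's implementation): repeatedly move a fixed-size window to the centroid of the foreground mass it contains until it stops moving.

```python
import numpy as np

def mean_shift(mask, window, max_iter=20):
    """mask: 2D array of foreground weights; window: (row, col, height, width).
    Returns the converged top-left corner of the window."""
    r, c, h, w = window
    for _ in range(max_iter):
        patch = mask[r:r + h, c:c + w]
        total = patch.sum()
        if total == 0:
            break                              # no foreground under the window
        ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
        # Shift = centroid of mass in the window minus the window center.
        dy = (ys * patch).sum() / total - (patch.shape[0] - 1) / 2
        dx = (xs * patch).sum() / total - (patch.shape[1] - 1) / 2
        # Round (values are non-negative here) and clip to the image.
        nr = min(max(int(r + dy + 0.5), 0), mask.shape[0] - h)
        nc = min(max(int(c + dx + 0.5), 0), mask.shape[1] - w)
        if (nr, nc) == (r, c):
            break                              # converged
        r, c = nr, nc
    return r, c

mask = np.zeros((10, 10))
mask[5:8, 6:9] = 1.0                  # a person-blob in the foreground mask
print(mean_shift(mask, (3, 4, 4, 4)))  # window climbs onto the blob
```

In a real tracker the mask would be a color-histogram back-projection rather than a binary foreground map, but the hill-climbing step is the same.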
32. Contd.
33. Depth Estimation
- Two approximations: using Gibson's affordances, and camera geometry
- Affordances as visual cues: an action of a human is triggered by the environment itself
- A floor offers "walk-on" ability
- Every object affords certain actions to perceive, along with anticipated effects
- A cup's handle affords grasping-lifting-drinking
34. Contd.
- Gibson's model: the horizon is fixed at the head height of the observer
- Monocular depth cues:
  - Interposition: an object that occludes another is closer
  - Height in the visual field: the higher an object appears, the further away it is
35. Depth Estimation
- Pinhole camera model, mapping (X, Y, Z) to (x, y):
  - x = X f / Z
  - y = Y f / Z
- For the point of contact with the ground:
  - Z ∝ 1 / y
  - X ∝ x / y
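These relations can be made concrete: with the camera at head height H and zero depression, a person's ground-contact point at image coordinates (x, y), measured downward from the horizon, gives Z = fH/y and X = xH/y. The focal length, height, and sample coordinates below are made-up values for illustration.

```python
def ground_point_to_xz(x, y, f=500.0, H=1.6):
    """Foot point in image coords (origin at the horizon / principal point,
    y positive downward) -> (X, Z) on the ground plane, in meters."""
    if y <= 0:
        raise ValueError("ground contact must lie below the horizon (y > 0)")
    Z = f * H / y      # depth: Z proportional to 1/y
    X = x * H / y      # lateral offset: X proportional to x/y
    return X, Z

# A foot point twice as far down the image is half as far away:
print(ground_point_to_xz(0.0, 100.0))   # (0.0, 8.0)
print(ground_point_to_xz(0.0, 200.0))   # (0.0, 4.0)
```

This is why the feet, not the head, are used: only the ground-contact point satisfies Y = H, which lets the two pinhole equations be inverted for depth.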
36. Depth plot for "A chase B"
37. Results (contd.)
38. Results (contd.)
39. Results (contd.)
40. Results
- Separate-SRN-for-each-action:
  - Trained and tested on different parts of the abstract video
  - Trained on the abstract video and tested on the real video
- Single-SRN-for-all-actions:
  - Trained on the synthetic video and tested on the real video
41. Basis for Comparison
Let the total time of the visual sequence for each verb be t time units.
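One plausible per-frame computation behind the result tables on the following slides is sketched here; the data are invented, and the paper's exact normalization of the percentage figures may differ (here everything is reported as a percentage of the full sequence length t).

```python
# Compare predicted and ground-truth per-frame labels for one verb.

def frame_metrics(pred, truth):
    t = len(pred)
    tp = sum(p and g for p, g in zip(pred, truth))        # both active
    fp = sum(p and not g for p, g in zip(pred, truth))    # spurious detection
    fn = sum(g and not p for p, g in zip(pred, truth))    # missed detection
    tn = t - tp - fp - fn
    return {
        "tp_pct": 100 * tp / t,
        "fp_pct": 100 * fp / t,
        "fn_pct": 100 * fn / t,
        "accuracy": 100 * (tp + tn) / t,
        "precision": 100 * tp / (tp + fp) if tp + fp else 0.0,
        "recall": 100 * tp / (tp + fn) if tp + fn else 0.0,
    }

truth = [0] * 5 + [1] * 10 + [0] * 5   # verb active for frames 5..14
pred = [0] * 7 + [1] * 10 + [0] * 3    # detection lags by two frames
m = frame_metrics(pred, truth)
```

Precision and recall here are the standard retrieval definitions (TP over retrieved, TP over relevant), matching the columns of the real-video tables.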
42. Separate SRN for Each Action Framework: Abstract Video

| Verb | True Positives | False Positives | False Negatives | Focus Mismatches | Accuracy |
|---|---|---|---|---|---|
| hit | 46.02 | 3.06 | 53.98 | 2.4 | 92.37 |
| chase | 24.44 | 0 | 75.24 | 0.72 | 93.71 |
| come closer | 25.87 | 14.61 | 73.26 | 16.77 | 63.66 |
| move away | 46.34 | 7.21 | 52.33 | 15.95 | 73.37 |
| spins | 82.54 | 0 | 16.51 | 24.7 | 97.03 |
| moves | 68.24 | 0.12 | 31.76 | 1.97 | 77.33 |

| Verb | True Positives | False Positives | False Negatives | Focus Mismatches |
|---|---|---|---|---|
| hit | 3 | 3 | 1 | 1 |
| chase | 6 | 0 | 3 | 4 |
| come closer | 6 | 20 | 7 | 24 |
| move away | 8 | 3 | 0 | 14 |
| spins | 22 | 0 | 1 | 9 |
| moves | 5 | 1 | 2 | 7 |
43. Timeline Comparison for "Chase"
44. Separate SRN for Each Action: Real Video (action recognition only)

| Verb | Retrieved | Relevant | True Positives | False Positives | False Negatives | Precision | Recall |
|---|---|---|---|---|---|---|---|
| A chase B | 237 | 140 | 135 | 96 | 5 | 58.4 | 96.4 |
| B chase A | 76 | 130 | 76 | 0 | 56 | 100 | 58.4 |
45. Single SRN for All Actions Framework: Real Video

| Verb | Retrieved | Relevant | True Positives | False Positives | False Negatives | Precision | Recall |
|---|---|---|---|---|---|---|---|
| chase | 239 | 270 | 217 | 23 | 5 | 91.2 | 80.7 |
| going away | 21 | 44 | 13 | 8 | 31 | 61.9 | 29.5 |
46. Conclusions and Future Work
- The sparse nature of the video provides for ease of visual analysis
- Directly learning event structures from the perceptual stream
- Extensions: learn the fine nuances between event structures of related action words
- Learn the morphological variations
- Extend the work towards using Long Short-Term Memory (LSTM)
- Hierarchical acquisition of higher-level action verbs