Title: How Machines Learn to Talk
1. How Machines Learn to Talk
- Amitabha Mukerjee
- IIT Kanpur
- Work done with:
- Computer Vision: Profs. C. Venkatesh, Pabitra Mitra; Prithvijit Guha, A. Ramakrishna Rao, Pradeep Vaghela
- Natural Language: Prof. Achla Raina, V. Shreeniwas
2. Robotics Collaborations
- IGCAR Kalpakkam
- Sanjay Gandhi PG Medical Hospital
3. Visual Robot Navigation
Time-to-collision-based robot navigation
4. Hyper-Redundant Manipulators
The same manipulator can work in changing workspaces
- Reconfigurable workspaces / emergency access
- Optimal design of hyper-redundant systems
SCARA and 3D
5. Planar Hyper-Redundancy
4-link Planar Robot
Motion Planning
6. Micro-Robots
- Micro Soccer Robots (1999-)
- 8 cm Smart Surveillance Robot (1 m/s)
- Autonomous Flying Robot (2004)
- Omni-directional platform (2002)
Omni-Directional Robot
Sponsor: DST (amit_at_iitk.ac.in)
7. Flying Robot
Start-up at IIT Kanpur: Whirligig Robotics
Test flight of UAV. Inertial Measurement Unit (IMU) under commercial production.
8. Tracheal Intubation Device
Assists the surgeon while inserting a breathing tube during general anaesthesia
- Ball-and-socket joint
- Aperture for fibre-optic video cable
- Aperture for oxygenation tube
- Endotracheal tube aperture
- Hole for suction tube
- Control cable attachment points
Device for intubation during general anaesthesia
Sponsor: DST / SGPGM (sens_at_iitk.ac.in)
9. Draupadi's Swayamvar
Can the arrow hit the rotating mark?
Sponsor: Media Lab Asia
10. High-DOF Motion Planning
- Accessing hard-to-reach spaces
- Design of hyper-redundant systems
- Parallel manipulators
- Sponsor: BRNS / MHRD
- dasgupta_at_iitk.ac.in
10-link 3D Robot Optimal Design
11. Multimodal Language Acquisition
- Consider a child observing a scene together with adults talking about it
- Grounded language: symbols are grounded in perceptual signals
- Use of simple videos with boxes and simple shapes, as standardly used in social psychology
12. Objective
- To develop a computational framework for multimodal language acquisition
- Acquiring the perceptual structure corresponding to verbs
- Using Recurrent Neural Networks as a biologically plausible model for temporal abstraction
- Adapting the learned model to interpret activities in real videos
13. (No transcript)
14. Alternate Views of Two Subjects

| Start Frame | End Frame | Subject One | Subject Two |
|---|---|---|---|
| 10 | 57 | the door closed | the square closes the other, the smaller square |
| 172 | 177 | the square and the circle move into the screen | two more objects came in, a circle and a square |
| 387 | 398 | the big square moved out of the corner | the square is moving around |
| 487 | 540 | the big square opened the door | he's trying to push his way out of the square |
| 617 | 635 | the little square hit the big square | they're hitting each other |
| 805 | 848 | the big square hit the little square | and they keep hitting each other |
| 852 | 1100 | the big square hit the little square again; the little circle moves to the door; the big square threatens the little circle | now the circle is blocking the entrance for the big square; now the circle is inside the square |
| 1145 | 1202 | the big square goes inside the box (and) the door closes | another square went inside the big square |
| 1270 | 1630 | the big square approaches the little circle; the little square opens the door | the little square's trying to get inside the big square |
| 1550 | 1658 | the square scares the little circle in the corner | and the objects inside are moving closer together |
| 1752 | 1796 | the little circle goes out of the room | the circle just went out of the square |
| 1780 | 1853 | the door closes | the little circle, the little square closes the box |
| 1967 | 1996 | the big square hits the door down | the big square just got out of the box |
| 2197 | 2207 | the little square and the circle go off the screen | they left the picture |
| 2207 | 2292 | the big square circled around | now the big square is there with the square box |
| 2471 | 2501 | big square knocks the door down | (the big square) tried to force his way in |
15. Plausible Biological Framework
16. Visually Grounded Corpus
- Two psychological research films, one based on the classic Heider and Simmel (1944), the other based on Hide & Seek
- These animations portray motion paths of geometric figures (Big Square, Small Square, Circle)
- Chase Alt
17. Cognate Clustering
- Similarity clustering: different expressions for the same action, e.g. "move away from center" vs. "go to a corner"
- Frequency: remove infrequent lexical units
- Synonymy: sets of lexical units used consistently in the same intervals, to mark the same action, for the same set of agents
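The synonymy criterion above can be sketched as interval-overlap clustering: lexical units whose usage intervals largely coincide get merged. This is an illustrative reconstruction, not the authors' code; the function names, the threshold, and the toy usage data are all invented.

```python
# Hypothetical sketch of synonymy-based cognate clustering: lexical units
# used over (nearly) the same video intervals are grouped together.

def jaccard(a, b):
    """Jaccard overlap of two sets of half-open (start, end) frame intervals."""
    fa = {f for s, e in a for f in range(s, e)}
    fb = {f for s, e in b for f in range(s, e)}
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

def cluster_cognates(usage, threshold=0.5):
    """Greedy single-link grouping of words by interval overlap (illustrative)."""
    clusters = []
    for word, intervals in usage.items():
        for cl in clusters:
            if any(jaccard(intervals, usage[w]) >= threshold for w in cl):
                cl.append(word)
                break
        else:
            clusters.append([word])
    return clusters

# Toy data: "hit" and "strike" annotate the same events with different words.
usage = {
    "hit": [(10, 30), (50, 60)],
    "strike": [(12, 28), (51, 59)],
    "chase": [(100, 160)],
}
```

On this toy data `cluster_cognates(usage)` groups `hit` with `strike` and leaves `chase` alone; real transcripts would first need the frequency filter from the slide.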
18. Clusters
Chase
Hide & Seek
19. Event Structure: Input
Chase
Hide & Seek
Actions "Come Together" and "Move Away" from the Hide & Seek video
20. Perceptual Process
21. Design of Feature Set
- The features selected here relate to spatial aspects of conceptual primitives in children, such as position, relative pose, velocity, etc.
- Use features that are kinematic in nature: temporal derivatives or simple transforms of the basic ones.
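A minimal sketch of such kinematic features, assuming per-frame bounding-box centroids for two tracked objects (the feature names and sample trajectory are illustrative, not the paper's exact feature set):

```python
import numpy as np

# Monadic and dyadic kinematic features from centroid tracks of two objects.

def kinematic_features(a, b, dt=1.0):
    """a, b: (T, 2) arrays of centroid positions over T frames."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    vel_a = np.gradient(a, dt, axis=0)       # monadic: velocity of object a
    speed_a = np.linalg.norm(vel_a, axis=1)
    dist = np.linalg.norm(a - b, axis=1)     # dyadic: separation distance
    approach = np.gradient(dist, dt)         # negative while coming closer
    return speed_a, dist, approach

# Object a moves toward a stationary b along the x axis.
a = [(t, 0.0) for t in range(5)]
b = [(10.0, 0.0)] * 5
speed, dist, approach = kinematic_features(a, b)
```

Here the constant negative `approach` derivative is exactly the kind of signal a "come closer" detector would latch onto.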
22. Monadic Features
23. Dyadic Predicates
24. VIdeo and Commentary for Event Structures (VICES)
25. The Classification Problem
- The problem is one of time-series classification
- Possible methodologies include:
- Logic-based methods
- Hidden Markov Models
- Recurrent Neural Networks
26. Elman Network
- Commonly a two-layer network with feedback from the first-layer output to the first-layer input
- Elman networks detect and generate time-varying patterns
- They are also able to learn spatial patterns
27. Feature Extraction in Abstract Videos
- Each image is read into a 2D matrix
- Connected component analysis is performed
- A bounding box is computed for each connected component
- Dynamic tracking is used to keep track of each object
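The labeling and bounding-box steps above can be sketched with a plain flood fill (a toy reconstruction; a real pipeline would use a library routine such as `scipy.ndimage.label`):

```python
from collections import deque

# Label 4-connected components in a binary frame and return a bounding box
# (min_row, min_col, max_row, max_col) for each component.

def connected_components(grid):
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                q, box = deque([(r, c)]), [r, c, r, c]
                seen[r][c] = True
                while q:                       # BFS flood fill over the blob
                    i, j = q.popleft()
                    box = [min(box[0], i), min(box[1], j),
                           max(box[2], i), max(box[3], j)]
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and grid[ni][nj] and not seen[ni][nj]):
                            seen[ni][nj] = True
                            q.append((ni, nj))
                boxes.append(tuple(box))
    return boxes

frame = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 1]]
print(connected_components(frame))   # two components, two boxes
```

Tracking then amounts to matching each frame's boxes to the previous frame's objects, e.g. by nearest centroid.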
28. (No transcript)
29. Working with Real Videos
- Challenges:
  - Noise in real-world videos
  - Illumination changes
  - Occlusions
  - Extracting depth information
- Our setup:
  - Camera is fixed at head height
  - Angle of depression is 0 degrees (approx.)
- Video
30. Background Subtraction
- Learn on still background images
- Find pixel intensity distributions
- Classify each pixel as background if its intensity fits the learned distribution
- Remove shadows: a special case of reduced illumination, S = kP where k < 1.0
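A hedged sketch of this scheme, with all parameter values invented for illustration: learn a per-pixel mean and standard deviation from still background frames, call a pixel background when it lies within c standard deviations of its mean, and treat proportionally darkened pixels (S = kP, k < 1) as shadow.

```python
import numpy as np

def learn_background(frames):
    """Per-pixel mean and std over a stack of still background frames."""
    stack = np.stack(frames).astype(float)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6

def classify(frame, mean, std, c=2.5, k_low=0.5, k_high=0.95):
    """Split a frame into foreground and shadow masks (toy thresholds)."""
    frame = np.asarray(frame, float)
    background = np.abs(frame - mean) < c * std
    ratio = frame / np.maximum(mean, 1e-6)          # S / P
    shadow = ~background & (ratio > k_low) & (ratio < k_high)
    foreground = ~background & ~shadow
    return foreground, shadow

# Still background at intensity 100 with slight frame-to-frame variation.
bg_frames = [np.full((4, 4), 100.0) + d for d in (-2.0, -1.0, 0.0, 1.0, 2.0)]
mean, std = learn_background(bg_frames)

frame = np.full((4, 4), 100.0)
frame[0, 0] = 200.0   # bright foreground pixel
frame[0, 1] = 70.0    # shadowed pixel, S = 0.7 * P
fg, sh = classify(frame, mean, std)
```

The shadow test is exactly the slide's S = kP rule: the intensity ratio to the learned background must fall in a band below 1.0.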
31. Contd.
- Extract human blobs by connected component analysis
- A bounding box is computed for each person
- Track human blobs: each object is tracked using a mean-shift tracking algorithm
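The core mean-shift step can be sketched as follows (an illustrative simplification, not OpenCV's implementation): repeatedly move a fixed-size window to the centroid of the foreground mass it contains until it stops moving.

```python
import numpy as np

def mean_shift(mask, window, max_iter=20):
    """mask: 2D array of foreground weights; window: (row, col, height, width).
    Returns the converged top-left corner of the window."""
    r, c, h, w = window
    for _ in range(max_iter):
        patch = mask[r:r + h, c:c + w]
        total = patch.sum()
        if total == 0:
            break                              # no foreground under the window
        ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
        # Shift = centroid of mass in the window minus the window center.
        dy = (ys * patch).sum() / total - (patch.shape[0] - 1) / 2
        dx = (xs * patch).sum() / total - (patch.shape[1] - 1) / 2
        # Round (values are non-negative here) and clip to the image.
        nr = min(max(int(r + dy + 0.5), 0), mask.shape[0] - h)
        nc = min(max(int(c + dx + 0.5), 0), mask.shape[1] - w)
        if (nr, nc) == (r, c):
            break                              # converged
        r, c = nr, nc
    return r, c

mask = np.zeros((10, 10))
mask[5:8, 6:9] = 1.0                  # a person-blob in the foreground mask
print(mean_shift(mask, (3, 4, 4, 4)))  # window climbs onto the blob
```

In a real tracker the mask would be a color-histogram back-projection rather than a binary foreground map, but the hill-climbing step is the same.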
32. Contd.
33. Depth Estimation
- Two approximations: using Gibson's affordances, and camera geometry
- Affordances as visual cues: an action of a human is triggered by the environment itself
- A floor offers "walk-on" ability
- Every object affords certain actions to perceive, along with anticipated effects
- A cup's handle affords grasping-lifting-drinking
34. Contd.
- Gibson's model: the horizon is fixed at the head height of the observer
- Monocular depth cues:
  - Interposition: an object that occludes another is closer
  - Height in the visual field: the higher an object appears, the further away it is
35. Depth Estimation
- Pinhole camera model, mapping (X, Y, Z) to (x, y):
  - x = X f / Z
  - y = Y f / Z
- For the point of contact with the ground:
  - Z ∝ 1 / y
  - X ∝ x / y
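These relations can be made concrete: with the camera at head height H and zero depression, a person's ground-contact point at image coordinates (x, y), measured downward from the horizon, gives Z = fH/y and X = xH/y. The focal length, height, and sample coordinates below are made-up values for illustration.

```python
def ground_point_to_xz(x, y, f=500.0, H=1.6):
    """Foot point in image coords (origin at the horizon / principal point,
    y positive downward) -> (X, Z) on the ground plane, in meters."""
    if y <= 0:
        raise ValueError("ground contact must lie below the horizon (y > 0)")
    Z = f * H / y      # depth: Z proportional to 1/y
    X = x * H / y      # lateral offset: X proportional to x/y
    return X, Z

# A foot point twice as far down the image is half as far away:
print(ground_point_to_xz(0.0, 100.0))   # (0.0, 8.0)
print(ground_point_to_xz(0.0, 200.0))   # (0.0, 4.0)
```

This is why the feet, not the head, are used: only the ground-contact point satisfies Y = H, which lets the two pinhole equations be inverted for depth.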
36. Depth plot for "A chase B"
37. Results (contd.)
38. Results (contd.)
39. Results (contd.)
40. Results
- Separate-SRN-for-each-action:
  - Trained and tested on different parts of the abstract video
  - Trained on the abstract video and tested on the real video
- Single-SRN-for-all-actions:
  - Trained on the synthetic video and tested on the real video
41. Basis for Comparison
Let the total time of the visual sequence for each verb be t time units.
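One plausible per-frame computation behind the result tables on the following slides is sketched here; the data are invented, and the paper's exact normalization of the percentage figures may differ (here everything is reported as a percentage of the full sequence length t).

```python
# Compare predicted and ground-truth per-frame labels for one verb.

def frame_metrics(pred, truth):
    t = len(pred)
    tp = sum(p and g for p, g in zip(pred, truth))        # both active
    fp = sum(p and not g for p, g in zip(pred, truth))    # spurious detection
    fn = sum(g and not p for p, g in zip(pred, truth))    # missed detection
    tn = t - tp - fp - fn
    return {
        "tp_pct": 100 * tp / t,
        "fp_pct": 100 * fp / t,
        "fn_pct": 100 * fn / t,
        "accuracy": 100 * (tp + tn) / t,
        "precision": 100 * tp / (tp + fp) if tp + fp else 0.0,
        "recall": 100 * tp / (tp + fn) if tp + fn else 0.0,
    }

truth = [0] * 5 + [1] * 10 + [0] * 5   # verb active for frames 5..14
pred = [0] * 7 + [1] * 10 + [0] * 3    # detection lags by two frames
m = frame_metrics(pred, truth)
```

Precision and recall here are the standard retrieval definitions (TP over retrieved, TP over relevant), matching the columns of the real-video tables.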
42. Separate SRN for Each Action Framework: Abstract Video

| Verb | True Positives | False Positives | False Negatives | Focus Mismatches | Accuracy |
|---|---|---|---|---|---|
| hit | 46.02 | 3.06 | 53.98 | 2.4 | 92.37 |
| chase | 24.44 | 0 | 75.24 | 0.72 | 93.71 |
| come closer | 25.87 | 14.61 | 73.26 | 16.77 | 63.66 |
| move away | 46.34 | 7.21 | 52.33 | 15.95 | 73.37 |
| spins | 82.54 | 0 | 16.51 | 24.7 | 97.03 |
| moves | 68.24 | 0.12 | 31.76 | 1.97 | 77.33 |

| Verb | True Positives | False Positives | False Negatives | Focus Mismatches |
|---|---|---|---|---|
| hit | 3 | 3 | 1 | 1 |
| chase | 6 | 0 | 3 | 4 |
| come closer | 6 | 20 | 7 | 24 |
| move away | 8 | 3 | 0 | 14 |
| spins | 22 | 0 | 1 | 9 |
| moves | 5 | 1 | 2 | 7 |
43. Timeline Comparison for "Chase"
44. Separate SRN for Each Action: Real Video (action recognition only)

| Verb | Retrieved | Relevant | True Positives | False Positives | False Negatives | Precision | Recall |
|---|---|---|---|---|---|---|---|
| A chase B | 237 | 140 | 135 | 96 | 5 | 58.4 | 96.4 |
| B chase A | 76 | 130 | 76 | 0 | 56 | 100 | 58.4 |
45. Single SRN for All Actions Framework: Real Video

| Verb | Retrieved | Relevant | True Positives | False Positives | False Negatives | Precision | Recall |
|---|---|---|---|---|---|---|---|
| chase | 239 | 270 | 217 | 23 | 5 | 91.2 | 80.7 |
| going away | 21 | 44 | 13 | 8 | 31 | 61.9 | 29.5 |
46. Conclusions and Future Work
- The sparse nature of the video provides for ease of visual analysis
- Directly learning event structures from the perceptual stream
- Extensions: learn the fine nuances between event structures of related action words
- Learn the morphological variations
- Extend the work towards using Long Short-Term Memory (LSTM)
- Hierarchical acquisition of higher-level action verbs