Transcript and Presenter's Notes

Title: How Machines Learn to Talk


1
How Machines Learn to Talk
  • Amitabha Mukerjee
  • IIT Kanpur
  • Work done with:
  • Computer Vision: Profs. C. Venkatesh, Pabitra Mitra; Prithvijit Guha, A. Ramakrishna Rao, Pradeep Vaghela
  • Natural Language: Prof. Achla Raina, V. Shreeniwas

2
Robotics Collaborations: IGCAR Kalpakkam, Sanjay Gandhi PG Medical Hospital
3
Visual Robot Navigation
Time-to-Collision-based Robot Navigation
4
Hyper-Redundant Manipulators
The same manipulator can work in changing workspaces
  • Reconfigurable Workspaces / Emergency Access
  • Optimal Design of Hyper-Redundant Systems: SCARA and 3D

5
Planar Hyper-Redundancy
4-link Planar Robot
Motion Planning
6
Micro-Robots
  • Micro Soccer Robots (1999-)
  • 8 cm Smart Surveillance Robot, 1 m/s
  • Autonomous Flying Robot (2004)
  • Omni-directional platform (2002)

Omni-Directional Robot
Sponsor: DST. Contact: amit@iitk.ac.in
7
Flying Robot
Start-up at IIT Kanpur: Whirligig Robotics
Test flight of UAV. Inertial Measurement Unit (IMU) under commercial production.
8
Tracheal Intubation Device
Assists the surgeon while inserting the breathing tube during general anaesthesia.
Diagram labels:
  • Ball-socket joint
  • Aperture for fibre-optic video cable
  • Aperture for oxygenation tube
  • Endotracheal tube aperture
  • Hole for suction tube
  • Control cable attachment points
Sponsor: DST / SGPGM. Contact: sens@iitk.ac.in
9
Draupadi's Swayamvar
Can the arrow hit the rotating mark?
Sponsor: Media Lab Asia
10
High DOF Motion Planning
  • Accessing hard-to-reach spaces
  • Design of Hyper-Redundant Systems
  • Parallel Manipulators
  • Sponsor: BRNS / MHRD
  • Contact: dasgupta@iitk.ac.in

10-link 3D Robot Optimal Design
11
Multimodal Language Acquisition
  • Consider a child observing a scene together with adults talking about it
  • Grounded Language: symbols are grounded in perceptual signals
  • Use of simple videos with boxes and simple shapes, as standardly used in social psychology

12
Objective
  • To develop a computational framework for Multimodal Language Acquisition:
  • acquiring the perceptual structure corresponding to verbs
  • using Recurrent Neural Networks as a biologically plausible model for temporal abstraction
  • Adapt the learned model to interpret activities in real videos

13
(No Transcript)
14
Alternate Views of Two Subjects

Start Frame | End Frame | Subject One | Subject Two
10   | 57   | the door closed | the square closes the other, the smaller square
172  | 177  | the square and the circle move into the screen | two more objects came in, a circle and a square
387  | 398  | the big square moved out of the corner | the square is moving around
487  | 540  | the big square opened the door | he's trying to push his way out of the square
617  | 635  | the little square hit the big square | they're hitting each other
805  | 848  | the big square hit the little square | and they keep hitting each other
852  | 1100 | the big square hit the little square again; the little circle moves to the door; the big square threatens the little circle | now the circle is blocking the entrance for the big square; now the circle is inside the square
1145 | 1202 | the big square goes inside the box (and) the door closes | another square went inside the big square
1270 | 1630 | the big square approaches the little circle in the ...; the little square opens the door | the little square's trying to get inside the big square
1550 | 1658 | the square scares the little circle in the corner | and the objects inside are moving closer together
1752 | 1796 | the little circle goes out of the room | the circle just went out of the square
1780 | 1853 | the door closes | the little circle, the little square closes box
1967 | 1996 | the big square hits the door down | the big square just got out of the box
2197 | 2207 | the little square and the circle go off the screen | they left the picture
2207 | 2292 | the big square circled around | now the big square is there with the square box
2471 | 2501 | big square knocks the door down | (the big square) tried to force his way in
15
Plausible Biological Framework
16
Visually Grounded Corpus
  • Two psychological research films: one based on the classic Heider & Simmel (1944), the other based on Hide & Seek
  • These animations portray motion paths of geometric figures (Big Square, Small Square, Circle)
  • Chase / Alt

17
Cognate clustering
  • Similarity Clustering: different expressions for the same action, e.g. "move away from center" vs. "go to a corner"
  • Frequency: remove infrequent lexical units
  • Synonymy: sets of lexical units used consistently in the same intervals, to mark the same action, for the same set of agents (see the sketch below)
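
The synonymy criterion can be operationalised as interval overlap. Below is a minimal sketch (not the authors' code) that merges lexical units whose usage intervals largely coincide; the Jaccard measure, the 0.5 threshold, and the toy usage data are all illustrative assumptions.

```python
# Illustrative sketch: cluster lexical units whose usage intervals overlap,
# as a proxy for the synonymy criterion above.
from itertools import combinations

def jaccard(a, b):
    """Framewise overlap of two sets of (start, end) frame intervals."""
    frames_a = set(f for s, e in a for f in range(s, e + 1))
    frames_b = set(f for s, e in b for f in range(s, e + 1))
    return len(frames_a & frames_b) / len(frames_a | frames_b)

# Hypothetical usage data: lexical unit -> intervals where subjects used it
usage = {
    "move away":      [(387, 398), (805, 848)],
    "go to a corner": [(390, 400), (800, 845)],
    "hit":            [(617, 635)],
}

# Greedy single-link clustering with an assumed similarity threshold
clusters = [{u} for u in usage]
for u, v in combinations(usage, 2):
    if jaccard(usage[u], usage[v]) > 0.5:      # threshold is an assumption
        cu = next(c for c in clusters if u in c)
        cv = next(c for c in clusters if v in c)
        if cu is not cv:
            cu |= cv
            clusters.remove(cv)
print(clusters)   # -> [{'move away', 'go to a corner'}, {'hit'}]
```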

18
Clusters
Chase
Hide & Seek
19
Event Structure - Input
Chase
Hide & Seek
Actions "Come Together" and "Move Away" from the video Hide & Seek
20
Perceptual Process
21
Design of Feature Set
  • The features selected here relate to spatial aspects of conceptual primitives in children, such as position, relative pose, and velocity.
  • Use features that are kinematic in nature: temporal derivatives or simple transforms of the basic ones (see the sketch below).
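
As a rough illustration of such kinematic features, the sketch below derives per-frame monadic features (position, velocity, speed) and a dyadic feature (inter-object distance and its rate of change) from tracked centroids; the frame rate and the exact feature definitions are assumptions, not values from the talk.

```python
# Illustrative kinematic features from per-frame bounding-box centroids.
import numpy as np

def monadic_features(track, fps=25.0):
    """track: (T, 2) array of centroid positions. Returns position,
    velocity and speed per frame (velocity is a temporal derivative)."""
    pos = np.asarray(track, dtype=float)
    vel = np.gradient(pos, axis=0) * fps          # pixels per second
    speed = np.linalg.norm(vel, axis=1)
    return pos, vel, speed

def dyadic_features(track_a, track_b, fps=25.0):
    """Relative features for a pair of objects: distance and its rate of
    change (negative rate = 'coming closer', positive = 'moving away')."""
    dist = np.linalg.norm(np.asarray(track_a) - np.asarray(track_b), axis=1)
    ddist = np.gradient(dist) * fps
    return dist, ddist
```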

22
Monadic Features
23
Dyadic Predicates
24
VIdeo and Commentary for Event Structures VICES
25
The classification problem
  • The problem is one of time-series classification
  • Possible methodologies include:
  • Logic-based methods
  • Hidden Markov Models
  • Recurrent Neural Networks

26
Elman Network
  • Commonly a two-layer network with feedback from the first-layer output to the first-layer input
  • Elman networks detect and generate time-varying patterns
  • They are also able to learn spatial patterns (a minimal sketch follows)
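
A minimal NumPy sketch of such a simple recurrent network (SRN) forward pass is shown below; the layer sizes, weight initialisation, and sigmoid output are illustrative assumptions, not the configuration used in the talk.

```python
# Minimal Elman (simple recurrent) network forward pass in NumPy.
import numpy as np

class ElmanSRN:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_xh = rng.normal(0, 0.1, (n_hidden, n_in))     # input -> hidden
        self.W_hh = rng.normal(0, 0.1, (n_hidden, n_hidden)) # context -> hidden
        self.W_hy = rng.normal(0, 0.1, (n_out, n_hidden))    # hidden -> output
        self.b_h = np.zeros(n_hidden)
        self.b_y = np.zeros(n_out)

    def forward(self, sequence):
        """sequence: (T, n_in). The hidden state is fed back as context,
        which is what lets the network detect time-varying patterns."""
        h = np.zeros(self.b_h.shape)
        outputs = []
        for x in sequence:
            h = np.tanh(self.W_xh @ x + self.W_hh @ h + self.b_h)
            y = 1.0 / (1.0 + np.exp(-(self.W_hy @ h + self.b_y)))  # sigmoid
            outputs.append(y)
        return np.array(outputs)

# One output unit per action verb: the unit's activation over time marks
# the intervals where that action is detected.
net = ElmanSRN(n_in=6, n_hidden=12, n_out=1)
activations = net.forward(np.random.default_rng(1).normal(size=(100, 6)))
```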

27
Feature Extraction in Abstract Videos
  • Each image is read into a 2D matrix
  • Connected Component Analysis is performed
  • A bounding box is computed for each such connected component
  • Dynamic tracking is used to keep track of each object (see the sketch below)
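
A compact version of this per-frame pipeline, assuming SciPy's connected-component labelling (the slides do not name the actual tool chain):

```python
# Per-frame blob extraction via connected-component labelling.
import numpy as np
from scipy import ndimage

def extract_bounding_boxes(frame, background_value=255):
    """frame: 2D grayscale array. Foreground = anything that is not the
    background colour. Returns one (row_slice, col_slice) box per blob."""
    foreground = frame != background_value
    labels, n_blobs = ndimage.label(foreground)   # connected components
    return ndimage.find_objects(labels)           # bounding boxes

# Each box can then be matched to the nearest box in the previous frame,
# a simple form of the dynamic tracking mentioned above.
```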

28
(No Transcript)
29
Working with Real Videos
  • Challenges:
  • Noise in real-world videos
  • Illumination changes
  • Occlusions
  • Extracting depth information
  • Our setup:
  • Camera is fixed at head height
  • Angle of depression is 0 degrees (approx.)
  • Video

30
Background Subtraction
  • Background Subtraction:
  • Learn on still background images
  • Find pixel intensity distributions
  • Classify each pixel as background if it fits the learned intensity distribution
  • Remove shadows:
  • Special case of reduced illumination
  • S = kP, where k < 1.0 (see the sketch below)
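
A sketch of this pipeline under a per-pixel Gaussian background model; the 3-sigma background test and k = 0.6 are assumed values, since the slides leave the thresholds unspecified.

```python
# Per-pixel background modelling with the shadow test S = k * P, k < 1.
import numpy as np

def learn_background(still_frames):
    """still_frames: (N, H, W) grayscale stack of background-only images."""
    stack = np.asarray(still_frames, dtype=float)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6

def classify(frame, mean, std, k_shadow=0.6):
    frame = np.asarray(frame, dtype=float)
    background = np.abs(frame - mean) <= 3.0 * std  # fits learned distribution
    # Shadow: same surface under reduced illumination, S = k * P with k < 1
    shadow = (~background) & (frame >= k_shadow * mean) & (frame < mean)
    foreground = ~background & ~shadow
    return foreground
```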

31
Contd..
  • Extract human blobs:
  • By Connected Component Analysis
  • A bounding box is computed for each person
  • Track human blobs:
  • Each object is tracked using a mean-shift tracking algorithm (see the sketch below)
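
The mean-shift step can be sketched with OpenCV's standard recipe (hue-histogram back-projection followed by cv2.meanShift); the window handling and termination criteria below are assumptions, not details from the talk.

```python
# Mean-shift tracking of a detected person blob with OpenCV.
import cv2

def make_tracker(first_frame, box):
    """box: (x, y, w, h) bounding box of a person blob in the first frame."""
    x, y, w, h = box
    roi = first_frame[y:y + h, x:x + w]
    hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])  # hue model
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

    def track(frame, window):
        """window: (x, y, w, h) from the previous frame; returns the update."""
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        _, window = cv2.meanShift(back_proj, window, criteria)
        return window

    return track

# Usage: tracker = make_tracker(frame0, box); box = tracker(frame, box) per frame.
```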

32
Contd..
33
Depth Estimation
  • Two approximations:
  • Using Gibson's affordances
  • Camera geometry
  • Affordances as visual cues:
  • The action of a human is triggered by the environment itself
  • A floor offers "walk-on" ability
  • Every object affords certain actions to perceive, along with anticipated effects
  • A cup's handle affords grasping-lifting-drinking

34
Contd..
  • Gibson's model:
  • The horizon is fixed at the head height of the observer
  • Monocular depth cues:
  • Interposition: an object that occludes another is closer
  • Height in the visual field: the higher the object, the further it is

35
Depth Estimation
  • Pinhole Camera Model
  • Mapping (X, Y, Z) to (x, y):
  •   x = X f / Z
  •   y = Y f / Z
  • For the point of contact with the ground, Y is fixed at the camera height, so:
  •   Z ∝ 1 / y
  •   X ∝ x / y  (see the sketch below)
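
Combining the two proportionalities: with the camera at head height H and zero depression, a blob's ground-contact point at image height y below the horizon gives Z = fH/y and X = xZ/f. A small sketch follows; the focal length f and height H are illustrative values, not calibration data from the talk.

```python
# Ground-plane localisation from the pinhole relations above.
def ground_position(x_img, y_img, f=500.0, H=1.7):
    """(x_img, y_img): image coordinates of the foot point, measured from
    the principal point, with y_img positive below the horizon line.
    Returns (X, Z): lateral offset and depth in the same units as H."""
    if y_img <= 0:
        raise ValueError("foot point must lie below the horizon")
    Z = f * H / y_img     # from y = f * Y / Z with Y = H  =>  Z ∝ 1/y
    X = x_img * Z / f     # from x = f * X / Z             =>  X ∝ x/y
    return X, Z
```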

36
Depth plot for "A Chase B"
  • Top view (Z-X plane)

37
Results (contd..)
38
Results (contd..)
39
Results (contd..)
40
Results
  • Separate-SRN-for-each-action:
  • Trained and tested on different parts of the abstract video
  • Trained on the abstract video and tested on real video
  • Single-SRN-for-all-actions:
  • Trained on synthetic video and tested on real video

41
Basis for Comparison
Let the total time of the visual sequence for each verb be t time units; the percentage scores in the tables that follow are measured over these t units (a scoring sketch is given below).
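
A sketch of how such time-based scores can be computed, assuming boolean per-time-unit detections; the focus-mismatch column depends on agent labels and is omitted here.

```python
# Framewise scoring of a verb detector against ground truth, matching the
# precision/recall/accuracy columns in the tables that follow.
import numpy as np

def framewise_scores(predicted, truth):
    """predicted, truth: boolean arrays of length t (one flag per time unit,
    true where the verb is asserted). Returns precision, recall, accuracy."""
    predicted, truth = np.asarray(predicted), np.asarray(truth)
    tp = np.sum(predicted & truth)
    fp = np.sum(predicted & ~truth)
    fn = np.sum(~predicted & truth)
    tn = np.sum(~predicted & ~truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(truth)
    return precision, recall, accuracy
```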
42
Separate-SRN-for-each-action framework: Abstract video

Percentage of total time t:

Verb        | True Positives | False Positives | False Negatives | Focus Mismatches | Accuracy
hit         | 46.02          | 3.06            | 53.98           | 2.4              | 92.37
chase       | 24.44          | 0               | 75.24           | 0.72             | 93.71
come closer | 25.87          | 14.61           | 73.26           | 16.77            | 63.66
move away   | 46.34          | 7.21            | 52.33           | 15.95            | 73.37
spins       | 82.54          | 0               | 16.51           | 24.7             | 97.03
moves       | 68.24          | 0.12            | 31.76           | 1.97             | 77.33

Counts:

Verb        | True Positives | False Positives | False Negatives | Focus Mismatches
hit         | 3              | 3               | 1               | 1
chase       | 6              | 0               | 3               | 4
come closer | 6              | 20              | 7               | 24
move away   | 8              | 3               | 0               | 14
spins       | 22             | 0               | 1               | 9
moves       | 5              | 1               | 2               | 7
43
Timeline comparison for Chase
44
Separate SRN for each action: Real video (action recognition only)

Verb      | Retrieved | Relevant | True Positives | False Positives | False Negatives | Precision (%) | Recall (%)
A Chase B | 237       | 140      | 135            | 96              | 5               | 58.4          | 96.4
B Chase A | 76        | 130      | 76             | 0               | 56              | 100           | 58.4
45
Single SRN for all actions: Real video

Verb       | Retrieved | Relevant | True Positives | False Positives | False Negatives | Precision (%) | Recall (%)
Chase      | 239       | 270      | 217            | 23              | 5               | 91.2          | 80.7
Going Away | 21        | 44       | 13             | 8               | 31              | 61.9          | 29.5
46
Conclusions & Future Work
  • The sparse nature of the video provides for ease of visual analysis
  • Directly learning event structures from the perceptual stream
  • Extensions: learn the fine nuances between event structures of related action words
  • Learn the morphological variations
  • Extend the work towards using Long Short-Term Memory (LSTM)
  • Hierarchical acquisition of higher-level action verbs