Title: Computational Architectures in Biological Vision, USC
1 Computational Architectures in Biological Vision, USC
- Lecture 13: Scene Perception
- Reading assignments: none
3 How much can we remember?
- Incompleteness of memory:
- how many domes are on the Taj Mahal?
- despite our conscious experience of picture-perfect, iconic memorization.
5 Change blindness
- Rensink, O'Regan & Clark, 1996
- See the demo!
9 But...
- We can recognize complex scenes that we have seen before.
- So, we do have some form of iconic memory.
- In this lecture, we examine how we perceive scenes:
- what is the representation (that can be memorized)?
- what are the mechanisms?
10 Extended Scene Perception
- Attention-based analysis: scan the scene with attention, accumulating evidence from detailed local analysis at each attended location.
- Main issues:
- what is the internal representation?
- how detailed is memory?
- do we really have a detailed internal representation at all!?
- Gist: we can very quickly (120 ms) classify entire scenes or do simple recognition tasks, yet we can only shift attention twice in that much time!
11 Accumulating Evidence
- Combine information across multiple eye fixations.
- Build a detailed representation of the scene in memory.
12 Eye Movements
- 1) Free examination
- 2) Estimate the material circumstances of the family
- 3) Give the ages of the people
- 4) Surmise what the family had been doing before the arrival of the unexpected visitor
- 5) Remember the clothes worn by the people
- 6) Remember the positions of the people and objects
- 7) Estimate how long the unexpected visitor has been away from the family
13 Clinical Studies
- Studies of patients with visual deficits strongly argue that tight interaction between the 'where' and 'what' visual streams is necessary for scene interpretation.
- Visual agnosia: patients can see objects, copy drawings of them, etc., but cannot recognize or name them!
- Dorsal agnosia: cannot recognize objects if more than two are presented simultaneously; a problem with localization.
- Ventral agnosia: cannot identify objects.
14 These studies suggest
- We bind features of objects into objects (feature binding).
- We bind objects in space into some arrangement (space binding).
- We perceive the scene.
- Feature binding: 'what' stream.
- Space binding: 'where'/'how' stream.
15 Schema-based Approaches
- A schema (Arbib, 1989) describes objects in terms of their physical properties and spatial arrangements.
- Abstract representation of scenes, objects, actions, and other brain processes; an intermediate level between neural firing and overall behavior.
- Schemas both cooperate and compete in describing the visual world.
17 VISOR
- Leow & Miikkulainen, 1994: low-level features -> sub-schema activity maps (coarse description of the components of objects) -> competition across several candidate schemas -> one schema wins and is the percept.
18 Biologically-Inspired Models
- Rybak et al., Vision Research, 1998.
- 'What' + 'Where'.
- Feature-based frame of reference.
20 Algorithm
- At each fixation, extract the central edge orientation, as well as a number of context edges.
- Transform those low-level features into more invariant second-order features, represented in a frame of reference attached to the central edge.
- Learning: manually select fixation points; store the sequence of second-order features found at each fixation into the 'what' memory; also store the vector to the next fixation, based on context points and expressed in the second-order frame of reference.
21 Algorithm
- As a result, the sequence of retinal images is stored in the 'what' memory, and the corresponding sequence of attentional shifts in the 'where' memory.
22 Algorithm
- Search mode: look for an image patch that matches one of the patches stored in the 'what' memory.
- Recognition mode: reproduce the scanpath stored in memory and determine whether we have a match (both modes are sketched below).
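The storage and matching machinery of slides 20-22 can be compressed into a short Python sketch. This is a toy reconstruction under stated assumptions, not Rybak et al.'s code: feature extraction and the edge-attached frame of reference are hidden inside a user-supplied extract_features function, and matching is cosine similarity against a fixed threshold.

```python
import numpy as np

class WhatWhereMemory:
    """Toy scanpath memory: 'what' holds per-fixation features,
    'where' holds the shift vector to the next fixation."""

    def __init__(self):
        self.what = []   # second-order feature vector at each fixation
        self.where = []  # attentional shift to the next fixation

    def learn(self, features, shifts):
        # Learning mode: store one (features, next-shift) pair per fixation.
        self.what = [np.asarray(f, float) for f in features]
        self.where = [np.asarray(s, float) for s in shifts]

    @staticmethod
    def _sim(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    def search(self, patch, thresh=0.9):
        # Search mode: does this patch match any stored fixation?
        sims = [self._sim(f, patch) for f in self.what]
        best = int(np.argmax(sims))
        return best if sims[best] > thresh else None

    def recognize(self, extract_features, start_xy, thresh=0.9):
        # Recognition mode: replay the stored scanpath from a candidate
        # start point; report a match only if every fixation matches.
        xy = np.asarray(start_xy, float)
        for f, shift in zip(self.what, self.where):
            if self._sim(f, extract_features(xy)) <= thresh:
                return False
            xy = xy + shift  # next fixation, dictated by 'where' memory
        return True
```

Because the shifts are expressed in the frame attached to the central edge, the replayed scanpath rotates and scales with the image, which is where the invariances noted on the next slide come from.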
23 Robust to variations in scale, rotation, and illumination, but not to 3D pose.
24 Schill et al., JEI, 2001
26 Dynamic Scenes
- Extension to moving objects and dynamic environments.
- Rizzolatti: mirror neurons in monkey area F5 respond when the monkey observes an action (e.g., grasping an object) as well as when it executes the same action.
- Computer vision models decompose complex actions using grammars of elementary actions and precise composition rules (a toy sketch follows); this resembles a temporal extension of schema-based systems. Is this what the brain does?
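As a concrete illustration of such a grammar, the sketch below rewrites composite actions into sequences of elementary ones. The primitives and rules are invented for illustration, not taken from any specific model; real systems add precise temporal composition rules on top of this.

```python
# Toy action grammar: composite actions rewrite to sequences of
# elementary actions (all rules here are invented for illustration).
GRAMMAR = {
    "grasp": ["reach", "preshape_hand", "close_hand"],
    "drink": ["grasp", "lift", "tilt_toward_mouth"],
}

def expand(action, grammar=GRAMMAR):
    """Recursively rewrite an action into elementary actions."""
    if action not in grammar:
        return [action]  # already elementary
    out = []
    for part in grammar[action]:
        out.extend(expand(part, grammar))
    return out

print(expand("drink"))
# ['reach', 'preshape_hand', 'close_hand', 'lift', 'tilt_toward_mouth']
```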
27 Several Problems
- ...with the progressive visual buffer hypothesis:
- Change blindness: attention seems to be required for us to perceive change in images, whereas such changes could easily be detected in a visual buffer!
- The amount of memory required is huge!
- Interpretation of the buffer contents by high-level vision is very difficult if the buffer contains very detailed representations (Tsotsos, 1990)!
28 The World as an Outside Memory
- Kevin O'Regan, early 90s: why build a detailed internal representation of the world?
- too complex
- not enough memory
- and useless?
- The world is the memory. Attention and the eyes are a look-up tool!
29 The Attention Hypothesis
- Rensink, 2000
- No integrative buffer.
- Early processing extracts information up to proto-object complexity in a massively parallel manner.
- Attention is necessary to bind the different proto-objects into complete objects, as well as to bind object and location.
- Once attention leaves an object, the binding dissolves. Not a problem: it can be formed again whenever needed, by shifting attention back to the object.
- Only a rather sketchy 'virtual' representation is kept in memory, and attention/eye movements are used to gather details as needed.
34 Back to accumulated evidence!
- Hollingworth et al., 2000 argue against the disintegration of coherent visual representations as soon as attention is withdrawn.
- Experiment:
- line drawings of natural scenes;
- change one object (the target) during a saccadic eye movement away from that object;
- instruct subjects to examine the scene, telling them they will later be asked questions about what was in it;
- also instruct subjects to monitor for object changes and press a button as soon as a change is detected.
- Hypothesis:
- It is known that attention precedes eye movements, so the change occurs outside the focus of attention. If subjects can notice it, it means that some detailed memory of the object is retained.
35 Hollingworth et al., 2000
- Subjects can see the change (26% correct overall).
- Even if they only notice it a long time afterwards, at their next visit of the object.
36 Hollingworth et al.
- These results suggest that the online representation of a scene can retain detailed visual information in memory from previously attended objects.
- Contrary to the proposal of the attention hypothesis (see Rensink, 2000), the results indicate that visual object representations do not disintegrate upon the withdrawal of attention.
37 Gist of a Scene
- Biederman, 1981: from a very brief exposure to a scene (120 ms or less), we can already extract a lot of information about its global structure, its category (indoors, outdoors, etc.), and some of its components.
- 'Riding the first spike': 120 ms is the time it takes the first spike to travel from the retina to IT!
- Thorpe, VanRullen: very fast classification (down to 27 ms exposure, no mask), e.g., for tasks such as 'was there an animal in the scene?'
47 Gist of a Scene
- Oliva & Schyns, Cognitive Psychology, 2000
- Investigate the effect of color on fast scene perception.
- Idea: rather than looking at the properties of the constituent objects in a given scene, look at the global effect of color on recognition.
- Hypothesis: 'diagnostic' colors (predictive of the scene category) will help recognition.
48 Color Gist
49 Color Gist
51 Color Gist
- Conclusion from the Oliva & Schyns study: colored blobs at a coarse spatial scale concur with luminance cues to form the relevant spatial layout that mediates express scene recognition.
53 Combining saliency and gist
- Torralba, JOSA-A, 2003
- Idea: when looking for a specific object, gist may combine with saliency in guiding attention (see the sketch below).
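A minimal NumPy sketch of that combination, assuming the simplest possible reading: a pointwise product of the two maps. The hand-built 'beach' prior is purely illustrative; Torralba's actual model learns p(location | object class, gist) from global image features.

```python
import numpy as np

def guided_attention(saliency, gist_prior):
    """Pointwise combination of bottom-up saliency with a gist-based
    spatial prior; returns a normalized attention map."""
    m = saliency * gist_prior
    return m / (m.sum() + 1e-12)

# Illustrative example: looking for people in a beach scene.
rng = np.random.default_rng(0)
saliency = rng.random((48, 64))        # stand-in bottom-up saliency map
gist_prior = np.zeros((48, 64))
gist_prior[24:40, :] = 1.0             # gist: people stand on the sand,
                                       # i.e., in the lower image half
attn = guided_attention(saliency, gist_prior)
y, x = np.unravel_index(np.argmax(attn), attn.shape)  # first attended spot
```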
61 Application: Beobots
68 Outlook
- It seems unlikely that we perceive scenes by building a progressive buffer and accumulating detailed evidence into it: it would take too many resources and be too complex to use.
- Rather, we may only have an illusion of detailed representation, plus the availability of our eyes/attention to get the details whenever they are needed: the world as an outside memory.
- In addition to attention-based scene analysis, we are able to extract the gist of a scene very rapidly, much faster than we can shift attention around.
- This gist may be constructed by fairly simple processes that operate in parallel. It can then be used to prime memory and attention.
69 Goal-oriented scene understanding?
- Question: describe what is happening in the video clip shown on the following slide.
71 Goal for our algorithms
- Extract the 'minimal subscene', that is, the smallest set of actors, objects, and actions that describes the scene under a given task definition.
- E.g.:
- if the task is 'who is doing what and to whom?'
- and the input is the boy-on-scooter video clip,
- then the minimal subscene is 'a boy with a red shirt rides a scooter around'.
72 Challenge
- The minimal subscene in our example has about 10 words, but
- the video clip has over 74 million different pixel values: at 24 bits per pixel, that is about 1.8 billion bits once uncompressed and displayed (though with high spatial and temporal correlation).
73 Starting point
- We can attend to salient locations.
- Can we identify those locations?
- Can we evaluate the task-relevance of those locations, based on some general symbolic knowledge about how various entities relate to each other?
75 Task influences eye movements
- Yarbus, 1967:
- given one image,
- an eye tracker,
- and seven sets of instructions given to seven observers,
- Yarbus observed widely different eye-movement scanpaths depending on the task.
76 Yarbus, 1967: task influences human eye movements
[1] A. Yarbus, Eye Movements and Vision, Plenum Press, New York, 1967.
77 Towards a computational model
- Consider the following scene (next slide).
- Let's walk through a schematic (partly hypothetical, partly implemented) diagram of the sequence of steps that may be triggered during its analysis.
79 Two streams
- Not 'where'/'what',
- but attentional/non-attentional:
- Attentional: local analysis of the details of various objects.
- Non-attentional: rapid global analysis yields a coarse identification of the setting (rough semantic category for the scene, e.g., indoors vs. outdoors, rough layout, etc.).
80 Setting pathway vs. attentional pathway (diagram; Itti 2002, also see Rensink, 2000)
81 Step 1: eyes closed
- Given a task, determine the objects that may be relevant to it, using symbolic LTM (long-term memory), and store the collection of relevant objects in symbolic WM (working memory).
- E.g., if the task is to find a stapler, symbolic LTM may inform us that a desk is relevant.
- Then, prime the visual system for the features of the most relevant entity, as stored in visual LTM (see the sketch below).
- E.g., if the most relevant entity is a red object, boost red-selective neurons.
- Cf. guided search, top-down attentional modulation of early vision.
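A sketch of this priming step, assuming a simple gain-modulation scheme. The channel names and gain values are hypothetical; in the model, the gains would come from the visual-LTM entry of the most relevant entity.

```python
import numpy as np

rng = np.random.default_rng(1)
# Responses of a few early feature channels (stand-ins for real filters).
feature_maps = {
    "red":            rng.random((48, 64)),
    "green":          rng.random((48, 64)),
    "vertical_edges": rng.random((48, 64)),
}

# Visual LTM says the most relevant entity is red: boost red-selective
# channels, suppress the others (gain values are hypothetical).
gains = {"red": 3.0, "green": 0.3, "vertical_edges": 1.0}

# Biased saliency map: gain-weighted sum of the feature channels.
biased_saliency = sum(g * feature_maps[name] for name, g in gains.items())
```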
82 Navalpakkam & Itti, in press: 1. Eyes closed
83 Step 2: attend
- The biased visual system yields a saliency map (biased toward the features of the most relevant entity).
- See Itti & Koch, 1998-2003; Navalpakkam & Itti, 2003.
- The setting yields a spatial prior on where this entity may be, based on very rapid and very coarse global scene analysis; here we use this prior as an initializer for our task-relevance map, a spatial pointwise filter that is applied to the saliency map (see the sketch below).
- E.g., if the scene is a beach and we are looking for humans, look around where the sand is, not in the sky!
- See Torralba, 2003 for a computer implementation.
84 2. Attend
85 3. Recognize
- Once the most (salient × relevant) location has been selected, it is fed (through Rensink's 'nexus' or Olshausen et al.'s 'shifter circuit') to object recognition.
- If the recognized entity was not already in WM, it is added (a matching sketch follows).
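The matching itself could be as simple as a nearest-prototype lookup against visual LTM. The cosine-similarity rule below is an assumption; the actual model matches hierarchical feature descriptions (see slides 92-94).

```python
import numpy as np

def recognize(local_features, visual_ltm):
    """Return the visual-LTM label whose prototype best matches the
    features extracted at the attended location."""
    v = local_features / (np.linalg.norm(local_features) + 1e-9)
    best_label, best_sim = None, -1.0
    for label, proto in visual_ltm.items():
        p = proto / (np.linalg.norm(proto) + 1e-9)
        sim = float(v @ p)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim

rng = np.random.default_rng(3)
visual_ltm = {"stapler": rng.random(16), "desk": rng.random(16)}
label, sim = recognize(rng.random(16), visual_ltm)
```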
86 3. Recognize
87 4. Update
- As an entity is recognized, its relationships to the other entities in WM are evaluated, and the relevance of all WM entities is updated.
- The task-relevance map (TRM) is also updated with the computed relevance of the currently-fixated entity. This ensures that we will later come back regularly to that location, if it is relevant, or largely ignore it, if it is irrelevant (see the sketch below).
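One way to cash out this update step; the scoring rule below is an assumption for illustration, whereas the model's actual relevance computation uses its symbolic ontology.

```python
import numpy as np

def update(wm, trm, fixation, label, relations):
    """Sketch: the recognized entity inherits relevance from the WM
    entities it relates to (a desk matters because staplers sit on
    desks), and its relevance is written into the TRM at the fixation."""
    inherited = max(
        (relations.get((label, entity), 0.0) * rel
         for entity, rel in wm.items()),
        default=0.0,
    )
    wm[label] = max(wm.get(label, 0.0), inherited)
    y, x = fixation
    trm[y, x] = wm[label]   # revisit if relevant, ignore if not
    return wm, trm

wm = {"stapler": 1.0}                     # entity -> current relevance
relations = {("desk", "stapler"): 0.8}    # hypothetical symbolic-LTM edge
trm = np.zeros((48, 64))
wm, trm = update(wm, trm, (30, 12), "desk", relations)
```

Looping attend -> recognize -> update (steps 2-4) then grows the WM and the TRM into the first approximation of the minimal subscene described on the next slide.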
88 4. Update
89 Iterate
- The system keeps looping through steps 2-4.
- The current WM and TRM are a first approximation to what may constitute the minimal subscene:
- a set of relevant spatial locations with attached object labels (see 'object files'), and
- a set of relevant symbolic entities with attached relevance values.
90 Prototype Implementation
91 Symbolic LTM
92 Simple hierarchical representation of the visual features of objects
93 The visual features of objects in visual LTM are used to bias attention top-down
94 Once a location is attended to, its local visual features are matched to those in visual LTM, to recognize the attended object
95 Learning object features and using them for biasing: naïve (looking for salient objects) vs. biased (looking for a Coca-Cola can)
97 Exercising the model by requesting that it find several objects
98 Learning the TRM through sequences of attention and recognition
99 Outlook
- Open architecture: the model is not in any way dedicated to a specific task, environment, knowledge base, etc., just like our brain probably did not evolve to allow us to drive cars.
- Task-dependent learning: in the TRM, the knowledge base, the object recognition system, etc., guided by an interaction between attention, recognition, and symbolic knowledge to evaluate the task-relevance of attended objects.
- Hybrid neuro/AI architecture: interplay between rapid/coarse learnable global analysis (gist), symbolic knowledge-based reasoning, and local/serial trainable attention and object recognition.
- Key new concepts:
- Minimal subscene: the smallest task-dependent set of actors, objects, and actions that concisely summarizes the scene contents.
- Task-relevance map: a spatial map that helps focus computational resources on task-relevant scene portions.