Title: Computational Architectures in Biological Vision, USC
1 Computational Architectures in Biological Vision, USC
- Lecture 13: Scene Perception
- Reading assignments: none
3 How much can we remember?
- Incompleteness of memory:
- how many domes are on the Taj Mahal?
- despite our conscious experience of picture-perfect, iconic memorization.
5 Change blindness
- Rensink, O'Regan & Clark, 1996
- See the demo!
9 But...
- We can recognize complex scenes that we have seen before.
- So, we do have some form of iconic memory.
- In this lecture, we examine how we perceive scenes:
- what is the representation (that can be memorized)?
- what are the mechanisms?
10 Extended Scene Perception
- Attention-based analysis: scan the scene with attention, accumulating evidence from detailed local analysis at each attended location.
- Main issues:
- what is the internal representation?
- how detailed is memory?
- do we really have a detailed internal representation at all!?
- Gist: we can very quickly (120 ms) classify entire scenes or do simple recognition tasks, yet we can only shift attention twice in that much time!
11 Accumulating Evidence
- Combine information across multiple eye fixations.
- Build a detailed representation of the scene in memory.
12 Eye Movements
- 1) Free examination
- 2) Estimate the material circumstances of the family
- 3) Give the ages of the people
- 4) Surmise what the family had been doing before the arrival of the unexpected visitor
- 5) Remember the clothes worn by the people
- 6) Remember the positions of the people and objects
- 7) Estimate how long the unexpected visitor has been away from the family
13 Clinical Studies
- Studies of patients with visual deficits strongly argue that tight interaction between the 'where' and 'what' visual streams is necessary for scene interpretation.
- Visual agnosia: patients can see objects, copy drawings of them, etc., but cannot recognize or name them!
- Dorsal agnosia: cannot recognize objects if more than two are presented simultaneously; a problem with localization.
- Ventral agnosia: cannot identify objects.
14 These studies suggest
- We bind features of objects into objects (feature binding).
- We bind objects in space into some arrangement (space binding).
- We perceive the scene.
- Feature binding: 'what' stream.
- Space binding: 'where'/'how' stream.
15 Schema-based Approaches
- A schema (Arbib, 1989) describes objects in terms of their physical properties and spatial arrangements.
- Abstract representation of scenes, objects, actions, and other brain processes; an intermediate level between neural firing and overall behavior.
- Schemas both cooperate and compete in describing the visual world.
17 VISOR
- Leow & Miikkulainen, 1994: low-level features -> sub-schema activity maps (coarse description of the components of objects) -> competition across several candidate schemas -> one schema wins and is the percept.
18 Biologically-Inspired Models
- Rybak et al., Vision Research, 1998.
- 'What' + 'Where'.
- Feature-based frame of reference.
20 Algorithm
- At each fixation, extract the central edge orientation, as well as a number of context edges.
- Transform those low-level features into more invariant second-order features, represented in a frame of reference attached to the central edge.
- Learning: manually select fixation points; store the sequence of second-order features found at each fixation into the 'what' memory; also store the vector to the next fixation, based on context points and expressed in the second-order frame of reference.
21 Algorithm
- As a result, the sequence of retinal images is stored in the 'what' memory, and the corresponding sequence of attentional shifts in the 'where' memory.
22 Algorithm
- Search mode: look for an image patch that matches one of the patches stored in the 'what' memory.
- Recognition mode: reproduce the scanpath stored in memory and determine whether we have a match (both modes are sketched below).
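The storage and matching machinery of slides 20-22 can be compressed into a short Python sketch. This is a toy reconstruction under stated assumptions, not Rybak et al.'s code: feature extraction and the edge-attached frame of reference are hidden inside a user-supplied extract_features function, and matching is cosine similarity against a fixed threshold.

```python
import numpy as np

class WhatWhereMemory:
    """Toy scanpath memory: 'what' holds per-fixation features,
    'where' holds the shift vector to the next fixation."""

    def __init__(self):
        self.what = []   # second-order feature vector at each fixation
        self.where = []  # attentional shift to the next fixation

    def learn(self, features, shifts):
        # Learning mode: store one (features, next-shift) pair per fixation.
        self.what = [np.asarray(f, float) for f in features]
        self.where = [np.asarray(s, float) for s in shifts]

    @staticmethod
    def _sim(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    def search(self, patch, thresh=0.9):
        # Search mode: does this patch match any stored fixation?
        sims = [self._sim(f, patch) for f in self.what]
        best = int(np.argmax(sims))
        return best if sims[best] > thresh else None

    def recognize(self, extract_features, start_xy, thresh=0.9):
        # Recognition mode: replay the stored scanpath from a candidate
        # start point; report a match only if every fixation matches.
        xy = np.asarray(start_xy, float)
        for f, shift in zip(self.what, self.where):
            if self._sim(f, extract_features(xy)) <= thresh:
                return False
            xy = xy + shift  # next fixation, dictated by 'where' memory
        return True
```

Because the shifts are expressed in the frame attached to the central edge, the replayed scanpath rotates and scales with the image, which is where the invariances noted on the next slide come from.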
23 Robust to variations in scale, rotation, and illumination, but not to 3D pose.
24 Schill et al., JEI, 2001
26 Dynamic Scenes
- Extension to moving objects and dynamic environments.
- Rizzolatti: mirror neurons in monkey area F5 respond when the monkey observes an action (e.g., grasping an object) as well as when it executes the same action.
- Computer vision models decompose complex actions using grammars of elementary actions and precise composition rules (a toy sketch follows); this resembles a temporal extension of schema-based systems. Is this what the brain does?
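As a concrete illustration of such a grammar, the sketch below rewrites composite actions into sequences of elementary ones. The primitives and rules are invented for illustration, not taken from any specific model; real systems add precise temporal composition rules on top of this.

```python
# Toy action grammar: composite actions rewrite to sequences of
# elementary actions (all rules here are invented for illustration).
GRAMMAR = {
    "grasp": ["reach", "preshape_hand", "close_hand"],
    "drink": ["grasp", "lift", "tilt_toward_mouth"],
}

def expand(action, grammar=GRAMMAR):
    """Recursively rewrite an action into elementary actions."""
    if action not in grammar:
        return [action]  # already elementary
    out = []
    for part in grammar[action]:
        out.extend(expand(part, grammar))
    return out

print(expand("drink"))
# ['reach', 'preshape_hand', 'close_hand', 'lift', 'tilt_toward_mouth']
```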
27 Several Problems
- ...with the progressive visual buffer hypothesis:
- Change blindness: attention seems to be required for us to perceive change in images, whereas such changes could easily be detected in a visual buffer!
- The amount of memory required is huge!
- Interpretation of the buffer contents by high-level vision is very difficult if the buffer contains very detailed representations (Tsotsos, 1990)!
28 The World as an Outside Memory
- Kevin O'Regan, early 90s: why build a detailed internal representation of the world?
- too complex
- not enough memory
- and useless?
- The world is the memory. Attention and the eyes are a look-up tool!
29 The Attention Hypothesis
- Rensink, 2000
- No integrative buffer.
- Early processing extracts information up to proto-object complexity in a massively parallel manner.
- Attention is necessary to bind the different proto-objects into complete objects, as well as to bind object and location.
- Once attention leaves an object, the binding dissolves. Not a problem: it can be formed again whenever needed, by shifting attention back to the object.
- Only a rather sketchy 'virtual' representation is kept in memory, and attention/eye movements are used to gather details as needed.
34 Back to accumulated evidence!
- Hollingworth et al., 2000 argue against the disintegration of coherent visual representations as soon as attention is withdrawn.
- Experiment:
- line drawings of natural scenes;
- change one object (the target) during a saccadic eye movement away from that object;
- instruct subjects to examine the scene, telling them they will later be asked questions about what was in it;
- also instruct subjects to monitor for object changes and press a button as soon as a change is detected.
- Hypothesis:
- It is known that attention precedes eye movements, so the change occurs outside the focus of attention. If subjects can notice it, it means that some detailed memory of the object is retained.
35 Hollingworth et al., 2000
- Subjects can see the change (26% correct overall).
- Even if they only notice it a long time afterwards, at their next visit of the object.
36 Hollingworth et al.
- These results suggest that the online representation of a scene can retain detailed visual information in memory from previously attended objects.
- Contrary to the proposal of the attention hypothesis (see Rensink, 2000), the results indicate that visual object representations do not disintegrate upon the withdrawal of attention.
37 Gist of a Scene
- Biederman, 1981: from a very brief exposure to a scene (120 ms or less), we can already extract a lot of information about its global structure, its category (indoors, outdoors, etc.), and some of its components.
- 'Riding the first spike': 120 ms is the time it takes the first spike to travel from the retina to IT!
- Thorpe, VanRullen: very fast classification (down to 27 ms exposure, no mask), e.g., for tasks such as 'was there an animal in the scene?'
47 Gist of a Scene
- Oliva & Schyns, Cognitive Psychology, 2000
- Investigate the effect of color on fast scene perception.
- Idea: rather than looking at the properties of the constituent objects in a given scene, look at the global effect of color on recognition.
- Hypothesis: 'diagnostic' colors (predictive of the scene category) will help recognition.
48 Color Gist
49 Color Gist
51 Color Gist
- Conclusion from the Oliva & Schyns study: colored blobs at a coarse spatial scale concur with luminance cues to form the relevant spatial layout that mediates express scene recognition.
53 Combining saliency and gist
- Torralba, JOSA-A, 2003
- Idea: when looking for a specific object, gist may combine with saliency in guiding attention (see the sketch below).
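A minimal NumPy sketch of that combination, assuming the simplest possible reading: a pointwise product of the two maps. The hand-built 'beach' prior is purely illustrative; Torralba's actual model learns p(location | object class, gist) from global image features.

```python
import numpy as np

def guided_attention(saliency, gist_prior):
    """Pointwise combination of bottom-up saliency with a gist-based
    spatial prior; returns a normalized attention map."""
    m = saliency * gist_prior
    return m / (m.sum() + 1e-12)

# Illustrative example: looking for people in a beach scene.
rng = np.random.default_rng(0)
saliency = rng.random((48, 64))        # stand-in bottom-up saliency map
gist_prior = np.zeros((48, 64))
gist_prior[24:40, :] = 1.0             # gist: people stand on the sand,
                                       # i.e., in the lower image half
attn = guided_attention(saliency, gist_prior)
y, x = np.unravel_index(np.argmax(attn), attn.shape)  # first attended spot
```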
61 Application: Beobots
68 Outlook
- It seems unlikely that we perceive scenes by building a progressive buffer and accumulating detailed evidence into it: it would take too many resources and be too complex to use.
- Rather, we may only have an illusion of detailed representation, plus the availability of our eyes/attention to get the details whenever they are needed: the world as an outside memory.
- In addition to attention-based scene analysis, we are able to extract the gist of a scene very rapidly, much faster than we can shift attention around.
- This gist may be constructed by fairly simple processes that operate in parallel. It can then be used to prime memory and attention.
69 Goal-oriented scene understanding?
- Question: describe what is happening in the video clip shown on the following slide.
71 Goal for our algorithms
- Extract the 'minimal subscene', that is, the smallest set of actors, objects, and actions that describes the scene under a given task definition.
- E.g.:
- if the task is 'who is doing what and to whom?'
- and the input is the boy-on-scooter video clip,
- then the minimal subscene is 'a boy with a red shirt rides a scooter around'.
72 Challenge
- The minimal subscene in our example has about 10 words, but
- the video clip has over 74 million different pixel values: at 24 bits per pixel, that is about 1.8 billion bits once uncompressed and displayed (though with high spatial and temporal correlation).
73 Starting point
- We can attend to salient locations.
- Can we identify those locations?
- Can we evaluate the task-relevance of those locations, based on some general symbolic knowledge about how various entities relate to each other?
75 Task influences eye movements
- Yarbus, 1967:
- given one image,
- an eye tracker,
- and seven sets of instructions given to seven observers,
- Yarbus observed widely different eye-movement scanpaths depending on the task.
76 Yarbus, 1967: task influences human eye movements
[1] A. Yarbus, Eye Movements and Vision, Plenum Press, New York, 1967.
77 Towards a computational model
- Consider the following scene (next slide).
- Let's walk through a schematic (partly hypothetical, partly implemented) diagram of the sequence of steps that may be triggered during its analysis.
79 Two streams
- Not 'where'/'what',
- but attentional/non-attentional:
- Attentional: local analysis of the details of various objects.
- Non-attentional: rapid global analysis yields a coarse identification of the setting (rough semantic category for the scene, e.g., indoors vs. outdoors, rough layout, etc.).
80 Setting pathway vs. attentional pathway (diagram; Itti 2002, also see Rensink, 2000)
81 Step 1: eyes closed
- Given a task, determine the objects that may be relevant to it, using symbolic LTM (long-term memory), and store the collection of relevant objects in symbolic WM (working memory).
- E.g., if the task is to find a stapler, symbolic LTM may inform us that a desk is relevant.
- Then, prime the visual system for the features of the most relevant entity, as stored in visual LTM (see the sketch below).
- E.g., if the most relevant entity is a red object, boost red-selective neurons.
- Cf. guided search, top-down attentional modulation of early vision.
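A sketch of this priming step, assuming a simple gain-modulation scheme. The channel names and gain values are hypothetical; in the model, the gains would come from the visual-LTM entry of the most relevant entity.

```python
import numpy as np

rng = np.random.default_rng(1)
# Responses of a few early feature channels (stand-ins for real filters).
feature_maps = {
    "red":            rng.random((48, 64)),
    "green":          rng.random((48, 64)),
    "vertical_edges": rng.random((48, 64)),
}

# Visual LTM says the most relevant entity is red: boost red-selective
# channels, suppress the others (gain values are hypothetical).
gains = {"red": 3.0, "green": 0.3, "vertical_edges": 1.0}

# Biased saliency map: gain-weighted sum of the feature channels.
biased_saliency = sum(g * feature_maps[name] for name, g in gains.items())
```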
82 Navalpakkam & Itti, in press: 1. Eyes closed
83 Step 2: attend
- The biased visual system yields a saliency map (biased toward the features of the most relevant entity).
- See Itti & Koch, 1998-2003; Navalpakkam & Itti, 2003.
- The setting yields a spatial prior on where this entity may be, based on very rapid and very coarse global scene analysis; here we use this prior as an initializer for our task-relevance map, a spatial pointwise filter that is applied to the saliency map (see the sketch below).
- E.g., if the scene is a beach and we are looking for humans, look around where the sand is, not in the sky!
- See Torralba, 2003 for a computer implementation.
84 2. Attend
85 3. Recognize
- Once the most (salient × relevant) location has been selected, it is fed (through Rensink's 'nexus' or Olshausen et al.'s 'shifter circuit') to object recognition.
- If the recognized entity was not already in WM, it is added (a matching sketch follows).
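The matching itself could be as simple as a nearest-prototype lookup against visual LTM. The cosine-similarity rule below is an assumption; the actual model matches hierarchical feature descriptions (see slides 92-94).

```python
import numpy as np

def recognize(local_features, visual_ltm):
    """Return the visual-LTM label whose prototype best matches the
    features extracted at the attended location."""
    v = local_features / (np.linalg.norm(local_features) + 1e-9)
    best_label, best_sim = None, -1.0
    for label, proto in visual_ltm.items():
        p = proto / (np.linalg.norm(proto) + 1e-9)
        sim = float(v @ p)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim

rng = np.random.default_rng(3)
visual_ltm = {"stapler": rng.random(16), "desk": rng.random(16)}
label, sim = recognize(rng.random(16), visual_ltm)
```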
86 3. Recognize
87 4. Update
- As an entity is recognized, its relationships to the other entities in WM are evaluated, and the relevance of all WM entities is updated.
- The task-relevance map (TRM) is also updated with the computed relevance of the currently-fixated entity. This ensures that we will later come back regularly to that location, if it is relevant, or largely ignore it, if it is irrelevant (see the sketch below).
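One way to cash out this update step; the scoring rule below is an assumption for illustration, whereas the model's actual relevance computation uses its symbolic ontology.

```python
import numpy as np

def update(wm, trm, fixation, label, relations):
    """Sketch: the recognized entity inherits relevance from the WM
    entities it relates to (a desk matters because staplers sit on
    desks), and its relevance is written into the TRM at the fixation."""
    inherited = max(
        (relations.get((label, entity), 0.0) * rel
         for entity, rel in wm.items()),
        default=0.0,
    )
    wm[label] = max(wm.get(label, 0.0), inherited)
    y, x = fixation
    trm[y, x] = wm[label]   # revisit if relevant, ignore if not
    return wm, trm

wm = {"stapler": 1.0}                     # entity -> current relevance
relations = {("desk", "stapler"): 0.8}    # hypothetical symbolic-LTM edge
trm = np.zeros((48, 64))
wm, trm = update(wm, trm, (30, 12), "desk", relations)
```

Looping attend -> recognize -> update (steps 2-4) then grows the WM and the TRM into the first approximation of the minimal subscene described on the next slide.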
88 4. Update
89 Iterate
- The system keeps looping through steps 2-4.
- The current WM and TRM are a first approximation to what may constitute the minimal subscene:
- a set of relevant spatial locations with attached object labels (see 'object files'), and
- a set of relevant symbolic entities with attached relevance values.
90 Prototype Implementation
91 Symbolic LTM
92 Simple hierarchical representation of the visual features of objects
93 The visual features of objects in visual LTM are used to bias attention top-down
94 Once a location is attended to, its local visual features are matched to those in visual LTM, to recognize the attended object
95 Learning object features and using them for biasing: naïve (looking for salient objects) vs. biased (looking for a Coca-Cola can)
97 Exercising the model by requesting that it find several objects
98 Learning the TRM through sequences of attention and recognition
99 Outlook
- Open architecture: the model is not in any way dedicated to a specific task, environment, knowledge base, etc., just like our brain probably did not evolve to allow us to drive cars.
- Task-dependent learning: in the TRM, the knowledge base, the object recognition system, etc., guided by an interaction between attention, recognition, and symbolic knowledge to evaluate the task-relevance of attended objects.
- Hybrid neuro/AI architecture: interplay between rapid/coarse learnable global analysis (gist), symbolic knowledge-based reasoning, and local/serial trainable attention and object recognition.
- Key new concepts:
- Minimal subscene: the smallest task-dependent set of actors, objects, and actions that concisely summarizes the scene contents.
- Task-relevance map: a spatial map that helps focus computational resources on task-relevant scene portions.