Computational Architectures in Biological Vision, USC - PowerPoint PPT Presentation

Description:

Title: USC Brain Project Specific Aims
Author: Michael A. Arbib
Last modified by: Laurent Itti
Created Date: 3/18/1998 2:41:19 PM
Document presentation format: PowerPoint PPT presentation

Slides: 100
Provided by: MichaelA212
Learn more at: http://ilab.usc.edu

Transcript and Presenter's Notes


1
Computational Architectures in Biological Vision,
USC
  • Lecture 13. Scene Perception
  • Reading Assignments: None

2
(No Transcript)
3
How much can we remember?
  • Incompleteness of memory
  • how many domes in the Taj Mahal?
  • despite conscious experience of picture-perfect,
    iconic memorization.

4
(No Transcript)
5
Change blindness
  • Rensink, O'Regan & Clark, 1996
  • See the demo!

6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
But
  • We can recognize complex scenes which we have
    seen before.
  • So, we do have some form of iconic memory.
  • In this lecture, we examine:
  • how we can perceive scenes
  • what the representation is (that can be
    memorized)
  • what the mechanisms are

10
Extended Scene Perception
  • Attention-based analysis: scan the scene with
    attention, accumulate evidence from detailed
    local analysis at each attended location.
  • Main issues:
  • what is the internal representation?
  • how detailed is memory?
  • do we really have a detailed internal
    representation at all?
  • Gist: we can very quickly (120 ms) classify
    entire scenes or do simple recognition tasks,
    yet we can only shift attention twice in that
    much time!

11
Accumulating Evidence
  • Combine information across multiple eye
    fixations.
  • Build detailed representation of scene in memory.

12
Eye Movements
  • 1) Free examination
  • 2) estimate material circumstances of family
  • 3) give ages of the people
  • 4) surmise what family has been doing before
    arrival of unexpected visitor
  • 5) remember clothes worn by the people
  • 6) remember position of people and objects
  • 7) estimate how long the unexpected visitor has
    been away from family

13
Clinical Studies
  • Studies of patients with visual deficits
    strongly argue that tight interaction between
    the "where" and "what" visual streams is
    necessary for scene interpretation.
  • Visual agnosia: can see objects, copy drawings of
    them, etc., but cannot recognize or name them!
  • Dorsal agnosia: cannot recognize objects if more
    than two are presented simultaneously; problem
    with localization.
  • Ventral agnosia: cannot identify objects.

14
These studies suggest
  • We bind features of objects into objects (feature
    binding)
  • We bind objects in space into some arrangement
    (space binding)
  • We perceive the scene.
  • Feature binding: "what" stream
  • Space binding: "where/how" stream

15
Schema-based Approaches
  • Schema (Arbib, 1989): describes objects in terms
    of their physical properties and spatial
    arrangements.
  • Abstract representation of scenes, objects,
    actions, and other brain processes. Intermediate
    level between neural firing and overall behavior.
  • Schemas both cooperate and compete in describing
    the visual world.

16
(No Transcript)
17
VISOR
  • Leow & Miikkulainen, 1994: low-level ->
    sub-schema activity maps (coarse description of
    components of objects) -> competition across
    several candidate schemas -> one schema wins and
    is the percept.
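The competition stage can be illustrated with a toy sketch. This is not the VISOR network itself: the schema names, the sub-schema activity values, and the simple averaging rule used to score each schema are all assumptions made for illustration.

```python
# Toy illustration of schema competition (not the actual VISOR network).
# All schema names and activity values below are hypothetical.

def schema_support(activity, parts):
    """Average activity of the sub-schemas that a schema expects."""
    return sum(activity.get(part, 0.0) for part in parts) / len(parts)

def compete(activity, schemas):
    """Return the winning schema name and the support of every candidate."""
    support = {name: schema_support(activity, parts)
               for name, parts in schemas.items()}
    winner = max(support, key=support.get)
    return winner, support

# Coarse evidence for object components, standing in for activity maps.
activity = {"roof": 0.9, "wall": 0.8, "door": 0.7, "wheel": 0.1, "window": 0.6}

schemas = {
    "house": ["roof", "wall", "door", "window"],
    "car": ["wheel", "door", "window"],
}

winner, support = compete(activity, schemas)
print(winner, support)
```

Here both candidates share the "door" and "window" evidence, so they genuinely compete; the schema whose expected components are best supported becomes the percept.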

18
Biologically-Inspired Models
  • Rybak et al, Vision Research, 1998.
  • "What" & "Where".
  • Feature-based frame of reference.

19
(No Transcript)
20
Algorithm
  • At each fixation, extract the central edge
    orientation, as well as a number of context
    edges.
  • Transform those low-level features into more
    invariant second-order features, represented in
    a reference frame attached to the central edge.
  • Learning: manually select fixation points; store
    the sequence of second-order features found at
    each fixation into the "what" memory; also store
    the vector to the next fixation, based on
    context points and expressed in the second-order
    reference frame.
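One way to picture the second-order features is as context edges re-expressed relative to the central edge. The sketch below is an illustration under stated assumptions, not the published model: the (x, y, orientation) edge encoding, the function names, and the choice of the mean context distance as the scale unit are all hypothetical.

```python
import math

def second_order_features(central, context):
    """Express context edges relative to the central edge.

    central: (x, y, orientation) of the edge at the fixation point.
    context: list of (x, y, orientation) for the surrounding edges.
    Returns (relative orientation, relative direction, relative distance)
    for each context edge.
    """
    cx, cy, c_ori = central
    dists = [math.hypot(x - cx, y - cy) for x, y, _ in context]
    scale = sum(dists) / len(dists)  # assumed scale unit: mean distance
    features = []
    for (x, y, ori), d in zip(context, dists):
        rel_ori = (ori - c_ori) % math.pi  # orientation is defined modulo pi
        rel_dir = (math.atan2(y - cy, x - cx) - c_ori) % (2 * math.pi)
        features.append((rel_ori, rel_dir, d / scale))
    return features

def transform(edge, angle, zoom):
    """Rotate an edge about the origin by angle and scale it by zoom."""
    x, y, ori = edge
    return (zoom * (x * math.cos(angle) - y * math.sin(angle)),
            zoom * (x * math.sin(angle) + y * math.cos(angle)),
            ori + angle)

central = (0.0, 0.0, 0.3)
context = [(2.0, 1.0, 0.8), (-1.0, 3.0, 1.5), (1.0, -2.0, 2.0)]

original = second_order_features(central, context)
moved = second_order_features(transform(central, 0.7, 2.5),
                              [transform(e, 0.7, 2.5) for e in context])

# The relative description should not change under rotation and scaling.
same = all(abs(a - b) < 1e-9
           for f, g in zip(original, moved) for a, b in zip(f, g))
print(same)
```

Because every quantity is measured relative to the central edge, rotating or rescaling the whole pattern leaves the description unchanged, which is the kind of invariance the slide refers to.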

21
Algorithm
  • As a result, the sequence of retinal images is
    stored in the "what" memory, and the
    corresponding sequence of attentional shifts in
    the "where" memory.

22
Algorithm
  • Search mode: look for an image patch that
    matches one of the patches stored in the "what"
    memory.
  • Recognition mode: reproduce the scanpath stored
    in memory and determine whether we have a match.
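The two modes can be sketched as a search followed by a scanpath replay. This is a simplified illustration, not the authors' implementation: patches are reduced to string labels, shifts are plain image-coordinate offsets rather than vectors in the feature-based frame, the search only starts from the first stored patch, and the 0.8 acceptance threshold is an arbitrary assumption.

```python
def recognize(scene, what_memory, where_memory, threshold=0.8):
    """Search for a starting patch, then replay the stored scanpath.

    scene: dict mapping (x, y) -> patch descriptor (hypothetical encoding).
    what_memory: patch descriptors stored at successive fixations.
    where_memory: (dx, dy) shifts between successive fixations.
    Returns (matched, best fraction of fixations that agreed).
    """
    best = 0.0
    for start, patch in scene.items():
        if patch != what_memory[0]:      # search mode: find a matching patch
            continue
        x, y = start
        hits = 1
        for (dx, dy), expected in zip(where_memory, what_memory[1:]):
            x, y = x + dx, y + dy        # recognition mode: follow the shift
            if scene.get((x, y)) == expected:
                hits += 1
        best = max(best, hits / len(what_memory))
    return best >= threshold, best

# Stored scanpath over a schematic face.
what_memory = ["eye", "eye", "nose", "mouth"]
where_memory = [(2, 0), (-1, 1), (0, 1)]

face = {(3, 3): "eye", (5, 3): "eye", (4, 4): "nose", (4, 5): "mouth",
        (9, 9): "tree"}
partial = {(3, 3): "eye", (5, 3): "eye", (4, 4): "nose"}

print(recognize(face, what_memory, where_memory))
print(recognize(partial, what_memory, where_memory))
```

The second scene lacks the patch expected at the last fixation, so the replay agrees on only three of four fixations and falls below the assumed threshold.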

23
  • Robust to variations in scale, rotation,
    illumination, but not 3D pose.

24
Schill et al, JEI, 2001
25
(No Transcript)
26
Dynamic Scenes
  • Extension to moving objects and dynamic
    environment.
  • Rizzolatti: mirror neurons in monkey area F5
    respond when the monkey observes an action (e.g.,
    grasping an object) as well as when it executes
    the same action.
  • Computer vision models decompose complex actions
    using grammars of elementary actions and precise
    composition rules. Resembles temporal extension
    of schema-based systems. Is this what the brain
    does?

27
Several Problems
  • with the "progressive visual buffer" hypothesis:
  • Change blindness
  • Attention seems to be required for us to perceive
    changes in images, whereas such changes could be
    easily detected in a visual buffer!
  • Amount of memory required is huge!
  • Interpretation of buffer contents by high-level
    vision is very difficult if buffer contains very
    detailed representations (Tsotsos, 1990)!

28
The World as an Outside Memory
  • Kevin O'Regan, early 90s
  • why build a detailed internal representation of
    the world?
  • too complex
  • not enough memory
  • and useless?
  • The world is the memory. Attention and the eyes
    are a look-up tool!

29
The Attention Hypothesis
  • Rensink, 2000
  • No integrative buffer
  • Early processing extracts information up to
    proto-object complexity in massively parallel
    manner
  • Attention is necessary to bind the different
    proto-objects into complete objects, as well as
    to bind object and location
  • Once attention leaves an object, the binding
    dissolves. This is not a problem: it can be
    formed again whenever needed, by shifting
    attention back to the object.
  • Only a rather sketchy virtual representation is
    kept in memory, and attention/eye movements are
    used to gather details as needed

34
Back to accumulated evidence!
  • Hollingworth et al, 2000 argue against the
    disintegration of coherent visual representations
    as soon as attention is withdrawn.
  • Experiment
  • line drawings of natural scenes
  • change one object (target) during a saccadic eye
    movement away from that object
  • instruct subjects to examine the scene, telling
    them that they will later be asked questions
    about what was in it
  • also instruct subjects to monitor for object
    changes and press a button as soon as a change
    is detected
  • Hypothesis
  • It is known that attention will precede eye
    movements. So the change is outside the focus of
    attention. If subjects can notice it, it means
    that some detailed memory of the object is
    retained.

35
  • Hollingworth et al, 2000
  • Subjects can see the change (26% correct
    overall)
  • Even if they only notice it a long time
    afterwards, at their next visit of the object

36
Hollingworth et al
  • So, these results suggest that
  • the online representation of a scene can contain
    detailed visual information in memory from
    previously attended objects.
  • Contrary to the proposal of the attention
    hypothesis (see Rensink, 2000), the results
    indicate that visual object representations do
    not disintegrate upon the withdrawal of
    attention.

37
Gist of a Scene
  • Biederman, 1981
  • from very brief exposure to a scene (120ms or
    less), we can already extract a lot of
    information about its global structure, its
    category (indoors, outdoors, etc) and some of its
    components.
  • "Riding the first spike": 120 ms is the time it
    takes the first spike to travel from the retina
    to IT!
  • Thorpe, van Rullen
  • very fast classification (down to 27 ms exposure,
    no mask), e.g., for tasks such as "was there an
    animal in the scene?"

38
  • demo

47
Gist of a Scene
  • Oliva & Schyns, Cognitive Psychology, 2000
  • Investigate the effect of color on fast scene
    perception.
  • Idea: rather than looking at the properties of
    the constituent objects in a given scene, look at
    the global effect of color on recognition.
  • Hypothesis:
  • diagnostic colors (predictive of scene category)
    will help recognition.

48
Color Gist
49
Color Gist
50
(No Transcript)
51
Color Gist
  • Conclusion from the Oliva & Schyns study:
  • colored blobs at a coarse spatial scale concur
    with luminance cues to form the relevant spatial
    layout that mediates express scene recognition.
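To make "colored blobs at a coarse spatial scale" concrete, the sketch below reduces an image to a coarse grid of mean colors. Block averaging is used here only as a simple stand-in for low-pass filtering; the function name, the RGB tuple encoding, and the toy image are assumptions and do not reproduce the stimuli or the color space used in the study.

```python
def coarse_color_layout(image, block):
    """Average RGB over non-overlapping block x block tiles.

    image: list of rows, each a list of (r, g, b) tuples; height and
    width are assumed to be multiples of block.
    """
    layout = []
    for top in range(0, len(image), block):
        row = []
        for left in range(0, len(image[0]), block):
            tile = [image[y][x]
                    for y in range(top, top + block)
                    for x in range(left, left + block)]
            row.append(tuple(sum(p[ch] for p in tile) / len(tile)
                             for ch in range(3)))
        layout.append(row)
    return layout

# Toy 4x4 "beach": blue sky on top, yellow sand below.
SKY, SAND = (0, 0, 255), (255, 255, 0)
image = [[SKY] * 4, [SKY] * 4, [SAND] * 4, [SAND] * 4]

layout = coarse_color_layout(image, block=2)
print(layout)
```

Even at this 2x2 resolution the blue-over-yellow arrangement survives, which is the sort of coarse, diagnostic color layout the conclusion describes.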

52
(No Transcript)
53
Combining saliency and gist
  • Torralba, JOSA-A, 2003
  • Idea: when looking for a specific object, gist
    may combine with saliency in guiding attention.

61
Application: Beobots
68
Outlook
  • It seems unlikely that we perceive scenes by
    building a progressive buffer and accumulating
    detailed evidence into it. It would take too many
    resources and be too complex to use.
  • Rather, we may only have an illusion of detailed
    representation, and the availability of our
    eyes/attention to get the details whenever they
    are needed. The world as an outside memory.
  • In addition to attention-based scene analysis, we
    are able to very rapidly extract the gist of a
    scene, much faster than we can shift attention
    around.
  • This gist may be constructed by fairly simple
    processes that operate in parallel. It can then
    be used to prime memory and attention.

69
Goal-oriented scene understanding?
  • Question: describe what is happening in the video
    clip shown in the following slide.

70
(No Transcript)
71
Goal for our algorithms
  • Extract the "minimal subscene", that is, the
    smallest set of actors, objects and actions that
    describe the scene under a given task definition.
  • E.g.,
  • If the task is "who is doing what and to whom?"
  • And the input is the boy-on-scooter video clip
  • Then the minimal subscene is "a boy with a red
    shirt rides a scooter around"

72
Challenge
  • The minimal subscene in our example has
  • 10 words, but
  • The video clip has over 74 million different
    pixel values (about 1.8 billion bits once
    uncompressed and displayed, though with high
    spatial and temporal correlation)
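The figures above can be checked with a short calculation. The slide does not state the bit depth; the sketch assumes 24 bits per pixel value (8 bits per RGB channel), which is consistent with the quoted 1.8 billion bits, and 8 bits per character for the verbal description.

```python
# Back-of-the-envelope comparison. The 24 bits per pixel value and the
# 8 bits per character are assumptions, not figures from the slide.
pixel_values = 74_000_000
bits_per_pixel_value = 24
video_bits = pixel_values * bits_per_pixel_value

description = "a boy with a red shirt rides a scooter around"
word_count = len(description.split())
description_bits = len(description) * 8

print(video_bits)   # 1776000000, i.e. about 1.8 billion bits
print(word_count)   # 10
print(video_bits // description_bits)
```

Under these assumptions the raw clip carries several million times more bits than the ten-word description that summarizes it for the task.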

73
Starting point
  • Can attend to salient locations
  • Can identify those locations?
  • Can evaluate the task-relevance of those
    locations, based on some general symbolic
    knowledge about how various entities relate to
    each other?

74
(No Transcript)
75
Task influences eye movements
  • Yarbus, 1967
  • Given one image,
  • An eye tracker,
  • And seven sets of instructions given to seven
    observers,
  • Yarbus observed widely different eye movement
    scanpaths depending on task.

76
Yarbus, 1967: Task influences human eye movements
[1] A. Yarbus, Plenum Press, New York, 1967.
77
Towards a computational model
  • Consider the following scene (next slide)
  • Let's walk through a schematic (partly
    hypothetical, partly implemented) diagram of the
    sequence of steps that may be triggered during
    its analysis.

78
(No Transcript)
79
Two streams
  • Not where/what
  • But attentional/non-attentional
  • Attentional: local analysis of details of various
    objects
  • Non-attentional: rapid global analysis yields
    coarse identification of the setting (rough
    semantic category for the scene, e.g., indoors
    vs. outdoors, rough layout, etc)

80
Setting pathway
Attentional pathway
Itti 2002, also see Rensink, 2000
81
Step 1: eyes closed
  • Given a task, determine objects that may be
    relevant to it, using symbolic LTM (long-term
    memory), and store collection of relevant objects
    in symbolic WM (working memory).
  • E.g., if task is to find a stapler, symbolic LTM
    may inform us that a desk is relevant.
  • Then, prime visual system for the features of the
    most-relevant entity, as stored in visual LTM.
  • E.g., if most relevant entity is a red object,
    boost red-selective neurons.
  • Cf. guided search, top-down attentional
    modulation of early vision.
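Step 1 can be sketched with two small lookup tables. Everything in the example is hypothetical: the entities, the relation strengths, the feature gains, and the rule that relevance multiplies along a chain of relations are assumptions chosen for illustration, not the knowledge base or the update rule of the actual model.

```python
# Hypothetical symbolic LTM: entity -> {related entity: relation strength}.
SYMBOLIC_LTM = {
    "stapler": {"desk": 0.9, "paper": 0.6},
    "desk": {"office": 0.8},
    "paper": {},
    "office": {},
}

# Hypothetical visual LTM: entity -> gains for early feature channels.
VISUAL_LTM = {
    "stapler": {"red": 2.0, "horizontal": 1.5},
    "desk": {"brown": 1.8, "horizontal": 1.2},
}

def relevant_entities(target, ltm, min_relevance=0.5):
    """Fill symbolic WM with entities related to the task target.

    Relevance is assumed to multiply along a chain of relations.
    """
    wm = {target: 1.0}
    frontier = [target]
    while frontier:
        entity = frontier.pop()
        for other, strength in ltm.get(entity, {}).items():
            relevance = wm[entity] * strength
            if relevance >= min_relevance and relevance > wm.get(other, 0.0):
                wm[other] = relevance
                frontier.append(other)
    return wm

def top_down_gains(wm, visual_ltm):
    """Feature gains of the most relevant entity with known visual features."""
    for entity in sorted(wm, key=wm.get, reverse=True):
        if entity in visual_ltm:
            return entity, visual_ltm[entity]
    return None, {}

wm = relevant_entities("stapler", SYMBOLIC_LTM)
primed, gains = top_down_gains(wm, VISUAL_LTM)
print(wm)
print(primed, gains)
```

With the task "find a stapler", the desk enters working memory through its relation to the stapler, and the feature channels of the most relevant entity are the ones that get boosted before the eyes open.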

82
Navalpakkam & Itti, in press
1. Eyes closed
83
Step 2: attend
  • The biased visual system yields a saliency map
    (biased for features of most relevant entity)
  • See Itti & Koch, 1998-2003; Navalpakkam & Itti,
    2003
  • The setting yields a spatial prior of where this
    entity may be, based on very rapid and very
    coarse global scene analysis; here we use this
    prior as an initializer for our "task-relevance
    map", a spatial pointwise filter that will be
    applied to the saliency map
  • E.g., if scene is a beach and looking for humans,
    look around where the sand is, not in the sky!
  • See Torralba, 2003 for computer implementation.
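The pointwise filtering described in Step 2 can be sketched on a tiny grid. The map values and the 3x4 beach layout are made up for illustration, and plain Python lists stand in for the image-sized maps a real system would use.

```python
def attention_guidance(saliency, trm):
    """Pointwise product of the saliency map and the task-relevance map."""
    return [[s * r for s, r in zip(s_row, r_row)]
            for s_row, r_row in zip(saliency, trm)]

def most_active(guidance):
    """(row, col) of the maximum of a 2D map."""
    cells = ((r, c) for r, row in enumerate(guidance) for c in range(len(row)))
    return max(cells, key=lambda rc: guidance[rc[0]][rc[1]])

# Toy 3x4 beach scene: row 0 is sky, rows 1-2 are sand (hypothetical values).
saliency = [
    [0.9, 0.2, 0.1, 0.3],   # a bright cloud is the most salient item
    [0.1, 0.4, 0.7, 0.2],
    [0.3, 0.1, 0.2, 0.5],
]
# Spatial prior from the setting: people are expected on the sand.
trm = [
    [0.1, 0.1, 0.1, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

print(most_active(saliency))                           # (0, 0)
print(most_active(attention_guidance(saliency, trm)))  # (1, 2)
```

Without the prior, attention goes to the cloud; once the task-relevance map is applied, the winning location moves to the most salient spot on the sand.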

84
2. Attend
85
3. Recognize
  • Once the most (salient and relevant) location has
    been selected, it is fed (through Rensink's
    "nexus" or Olshausen et al.'s "shifter circuit")
    to object recognition.
  • If the recognized entity was not in WM, it is
    added.

86
3. Recognize
87
4. Update
  • As an entity is recognized, its relationships to
    other entities in the WM are evaluated, and the
    relevance of all WM entities is updated.
  • The task-relevance map (TRM) is also updated with
    the computed relevance of the currently fixated
    entity. This ensures that we will later come
    back regularly to that location, if relevant, or
    largely ignore it, if irrelevant.
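A minimal sketch of the update step follows. It simplifies the description above: only the newly recognized entity is scored (the relevance of the other WM entities is left unchanged), and its relevance is assumed to be its strongest relation to anything already in WM. The relation table and function names are hypothetical.

```python
def update(wm, trm, location, entity, relation_strength):
    """Add the recognized entity to WM and write its relevance into the TRM.

    relation_strength: hypothetical function (entity, other) -> [0, 1].
    Returns the relevance stored for the entity.
    """
    # Assumed rule: strongest relation to an entity already held in WM.
    relevance = max((wm[other] * relation_strength(entity, other)
                     for other in wm if other != entity), default=0.0)
    wm[entity] = max(wm.get(entity, 0.0), relevance)
    row, col = location
    trm[row][col] = wm[entity]
    return wm[entity]

# Hypothetical symbolic knowledge about how entities relate.
RELATIONS = {("desk", "stapler"): 0.9}

def relation_strength(a, b):
    return RELATIONS.get((a, b), RELATIONS.get((b, a), 0.0))

wm = {"stapler": 1.0}                  # the task target
trm = [[0.5] * 3 for _ in range(2)]    # neutral initial relevance

print(update(wm, trm, (1, 2), "desk", relation_strength))  # 0.9
print(update(wm, trm, (0, 0), "sky", relation_strength))   # 0.0
print(trm)
```

The desk location keeps a high value in the map and will attract later fixations, while the sky location is driven to zero and is largely ignored afterwards.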

88
4. Update
89
Iterate
  • The system keeps looping through steps 2-4.
  • The current WM and TRM are a first approximation
    to what may constitute the "minimal subscene":
  • A set of relevant spatial locations with attached
    object labels (see "object files"), and
  • A set of relevant symbolic entities with attached
    relevance values
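The loop over steps 2-4 can be put together in a compact sketch. Object recognition and symbolic reasoning are replaced by lookup tables, and a visited set provides a crude inhibition of return; both are assumptions added to keep the example self-contained, as are all map values and labels.

```python
def analyze(saliency, trm, labels, relevance_of, n_fixations=3):
    """Loop over attend (2), recognize (3) and update (4) on a small grid.

    labels: dict (row, col) -> object label, standing in for recognition.
    relevance_of: dict label -> relevance, standing in for reasoning.
    Returns working memory (label -> relevance) and the updated TRM.
    """
    wm = {}
    trm = [row[:] for row in trm]
    visited = set()  # crude inhibition of return (added assumption)
    for _ in range(n_fixations):
        # Step 2: attend to the maximum of saliency * task relevance.
        candidates = [(saliency[r][c] * trm[r][c], (r, c))
                      for r in range(len(saliency))
                      for c in range(len(saliency[0]))
                      if (r, c) not in visited]
        if not candidates:
            break
        _, (r, c) = max(candidates)
        visited.add((r, c))
        # Step 3: recognize the attended entity.
        label = labels.get((r, c), "unknown")
        # Step 4: update WM and TRM with the entity's relevance.
        relevance = relevance_of.get(label, 0.0)
        wm[label] = relevance
        trm[r][c] = relevance
    return wm, trm

saliency = [[0.9, 0.1, 0.2],
            [0.3, 0.8, 0.4]]
trm0 = [[1.0] * 3 for _ in range(2)]
labels = {(0, 0): "cloud", (1, 1): "boy", (1, 2): "scooter"}
relevance_of = {"boy": 1.0, "scooter": 0.9, "cloud": 0.0}

wm, trm = analyze(saliency, trm0, labels, relevance_of)
subscene = [label for label, rel in wm.items() if rel > 0.5]
print(subscene)  # ['boy', 'scooter']
```

After three fixations the working memory holds the labels with their relevance values and the map marks which locations matter, which together approximate the minimal subscene for this toy input.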

90
Prototype Implementation
91
Symbolic LTM
92
Simple hierarchical representation of visual
features of objects
93
The visual features of objects in visual LTM are
used to bias attention top-down
94
Once a location is attended to, its local visual
features are matched to those in visual LTM, to
recognize the attended object
95
Learning object features and using them
for biasing
Naïve: looking for salient objects
Biased: looking for a Coca-Cola can
96
(No Transcript)
97
Exercising the model by requesting that it find
several objects
98
Learning the TRM through sequences of attention
and recognition
99
Outlook
  • Open architecture: the model is not in any way
    dedicated to a specific task, environment,
    knowledge base, etc., just like our brain
    probably has not evolved to allow us to drive
    cars.
  • Task-dependent learning: in the TRM, the
    knowledge base, the object recognition system,
    etc., guided by an interaction between attention,
    recognition, and symbolic knowledge to evaluate
    the task-relevance of attended objects
  • Hybrid neuro/AI architecture: interplay between
    rapid/coarse learnable global analysis (gist),
    symbolic knowledge-based reasoning, and
    local/serial trainable attention and object
    recognition
  • Key new concepts:
  • Minimal subscene: smallest task-dependent set of
    actors, objects and actions that concisely
    summarize scene contents
  • Task-relevance map: spatial map that helps focus
    computational resources on task-relevant scene
    portions