Title: Computer Vision and Human Perception
1. Computer Vision and Human Perception
- A brief intro from an AI perspective
2. Computer Vision and Human Perception
- What are the goals of CV?
- What are the applications?
- How do humans perceive the 3D world via images?
- Some methods of processing images.
3. Goal of computer vision
- Make useful decisions about real physical objects
and scenes based on sensed images.
- An alternative goal (Aloimonos and Rosenfeld) is
the construction of scene descriptions from
images.
- How do you find the door to leave?
- How do you determine if a person is friendly or
hostile? .. an elder? .. a possible mate?
4. Critical Issues
- Sensing: how do sensors obtain images of the
world?
- Information: how do we obtain color, texture,
shape, motion, etc.?
- Representations: what representations should/does
a computer or brain use?
- Algorithms: what algorithms process image
information and construct scene descriptions?
5. Images: 2D projections of 3D
- The 3D world has color, texture, surfaces, volumes,
light sources, objects, motion, betweenness,
adjacency, connections, etc.
- A 2D image is a projection of a scene from a
specific viewpoint; many 3D features are
captured, some are not.
- Brightness or color: g(x,y) or f(row, column)
at a certain instant of time
- Images indicate familiar people, moving objects
or animals, health of people or machines
6. Image receives reflections
- Light reaches surfaces in 3D
- Surfaces reflect
- Sensor element receives light energy
- Intensity matters
- Angles matter
- Material matters
7. CCD camera has discrete elements
- Lens collects light rays
- CCD elements replace the chemicals of film
- Number of elements less than with film (so far)
8. Intensities near the center of the eye
9. Camera, programs, display
- Camera inputs to frame buffer
- Program can interpret data
- Program can add graphics
- Program can add imagery
10. Some image format issues
- Spatial resolution
- Intensity resolution
- Image file format
11. Resolution is pixels per unit of length
- Resolution decreases by one half in cases at
left
- Human faces can be recognized at 64 x 64 pixels
per face
12. Features detected depend on the resolution
- Can tell hearts from diamonds
- Can tell face value
- Generally need 2 pixels across line or small
region (such as eye)
13. Human eye as a spherical camera
- 100M sensing elements in the retina
- Rods sense intensity
- Cones sense color
- Fovea has tightly packed elements, more cones
- Periphery has more rods
- Focal length is about 20mm
- Pupil/iris controls light entry
- Eye scans, or saccades, to put image details on
the fovea
- 100M sensing cells funnel to 1M optic nerve
connections to the brain
14. Image processing operations
- Thresholding
- Edge detection
- Motion field computation
15. Find regions via thresholding
- Region has brighter or darker or redder color,
etc.
- If pixel ≥ threshold
- then pixel ← 1, else pixel ← 0
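The thresholding rule can be sketched in Python (a toy illustration; the function name, sample image, and threshold of 128 are made up):

```python
def threshold_image(image, threshold):
    """Produce a binary image: 1 where pixel >= threshold, else 0."""
    return [[1 if pixel >= threshold else 0 for pixel in row]
            for row in image]

# Toy 3x3 grayscale image (values 0-255); threshold of 128 is arbitrary.
gray = [[ 10, 200,  50],
        [130, 255,  90],
        [  0, 140, 128]]
binary = threshold_image(gray, 128)
# binary == [[0, 1, 0], [1, 1, 0], [0, 1, 1]]
```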
16. Example: red blood cell image
- Many blood cells are separate objects
- Many touch, which is bad!
- Salt-and-pepper noise from thresholding
- How usable is this data?
17. Robot vehicle must see stop sign
sign = imread('Images/stopSign.jpg', 'jpg');
red = (sign(:,:,1) > 120) & (sign(:,:,2) < …) & (sign(:,:,3) < …);
imwrite(red, 'Images/stopRed120.jpg', 'jpg');
18. Thresholding is usually not trivial
19. Can cluster pixels by color similarity and by
adjacency
Original RGB Image
Color Clusters by K-Means
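K-means clustering in RGB space can be sketched in pure Python (a toy example with four made-up pixels; real use clusters all image pixels, with adjacency handled separately):

```python
import math
import random

def kmeans_colors(pixels, k, iters=10, seed=0):
    """Cluster RGB pixels by color similarity with basic k-means."""
    random.seed(seed)
    centers = random.sample(pixels, k)
    for _ in range(iters):
        # Assign each pixel to its nearest center (Euclidean distance in RGB).
        clusters = [[] for _ in range(k)]
        for p in pixels:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Move each center to the mean of its assigned pixels.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(ch) / len(cl) for ch in zip(*cl))
    return centers

# Two reddish and two bluish pixels: expect one red and one blue center.
pixels = [(250, 10, 10), (240, 20, 5), (10, 10, 250), (20, 5, 240)]
centers = kmeans_colors(pixels, k=2)
```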
20. Some image processing ops
- Finding contrast in an image using neighborhoods
of pixels
- Detecting motion across 2 images
21. Differentiate to find object edges
- For each pixel, compute its contrast
- Can use max difference of its 8 neighbors
- Detects intensity change across boundary of
adjacent regions
LOG filter later on
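The max-difference-over-8-neighbors contrast measure can be sketched as follows (Python; the function name and step-edge test image are made up):

```python
def contrast_image(image):
    """For each interior pixel, contrast = max |pixel - neighbor|
    over its 8 neighbors (border pixels left at 0)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = max(abs(image[r][c] - image[r + dr][c + dc])
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                            if (dr, dc) != (0, 0))
    return out

# A vertical step edge: contrast is high along the boundary columns.
img = [[10, 10, 90, 90],
       [10, 10, 90, 90],
       [10, 10, 90, 90]]
edges = contrast_image(img)
# edges[1] == [0, 80, 80, 0]
```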
22. 4- and 8-neighbors of a pixel
- 4-neighbors are at multiples of 90 degrees
. N .
W * E
. S .
- 8-neighbors are at every multiple of 45 degrees
NW N NE
W  *  E
SW S SE
23. Detect motion via subtraction
- Constant background
- Moving object
- Produces pixel differences at boundary
- Reveals moving object and its shape
Differences computed over time rather than over
space
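Frame subtraction can be sketched as follows (Python; the object values and threshold are made up):

```python
def motion_mask(frame_a, frame_b, threshold):
    """Mark pixels whose intensity changed by more than threshold
    between two frames; a constant background subtracts to zero."""
    return [[1 if abs(a - b) > threshold else 0
             for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]

# An "object" (value 200) moves one pixel right over a background of 50.
frame1 = [[50, 200, 50, 50]]
frame2 = [[50, 50, 200, 50]]
mask = motion_mask(frame1, frame2, threshold=20)
# mask == [[0, 1, 1, 0]]  (differences appear at the moving object's boundary)
```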
24. Two frames of aerial imagery
Video frames N and N+1 show slight movement: most
pixels are the same, just in different locations.
25. Best matching blocks between video frames N+1
and N (motion vectors)
The bulk of the vectors show the true motion of
the airplane taking the pictures. The long
vectors are incorrect motion vectors, but they do
work well for compression of image I2!
Best matches from the 2nd to the 1st image, shown
as vectors overlaid on the 2nd image. (Work by
Dina Eldin.)
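Block matching is typically done by minimizing a block difference score such as SAD (sum of absolute differences). A minimal Python sketch, with a made-up 1x2 block and search positions (the exact score used in the work above is not stated in the slides):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
               for a, b in zip(ra, rb))

def best_match(frame, block, search_positions):
    """Find the (row, col) in frame where block matches best (minimum SAD)."""
    h, w = len(block), len(block[0])
    def block_at(r, c):
        return [row[c:c + w] for row in frame[r:r + h]]
    return min(search_positions, key=lambda rc: sad(block, block_at(*rc)))

# A 1x2 block from frame N, searched for in frame N+1 (shifted right by 1).
block = [[9, 7]]
frame_next = [[0, 9, 7, 0]]
pos = best_match(frame_next, block, [(0, 0), (0, 1), (0, 2)])
# pos == (0, 1): the block moved one pixel to the right.
```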
26. Gradient from 3x3 neighborhood
Estimate both magnitude and direction of the edge.
27. Prewitt versus Sobel masks
The Sobel mask uses weights of 1,2,1 and -1,-2,-1
in order to give more weight to the center
estimate. The scaling factor is thus 1/8 and not 1/6.
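A rough Python sketch of the Sobel estimate with the 1/8 scaling described above (function name and test image are made up):

```python
def sobel(image, r, c):
    """Sobel gradient estimate at interior pixel (r, c).
    Returns (gx, gy), each scaled by 1/8."""
    n = [[image[r + dr][c + dc] for dc in (-1, 0, 1)] for dr in (-1, 0, 1)]
    gx = ((n[0][2] + 2 * n[1][2] + n[2][2]) -
          (n[0][0] + 2 * n[1][0] + n[2][0])) / 8.0
    gy = ((n[2][0] + 2 * n[2][1] + n[2][2]) -
          (n[0][0] + 2 * n[0][1] + n[0][2])) / 8.0
    return gx, gy

# Vertical step edge from 0 to 8: gx is strong, gy is zero.
img = [[0, 8, 8],
       [0, 8, 8],
       [0, 8, 8]]
gx, gy = sobel(img, 1, 1)
# gx == 4.0, gy == 0.0
```

Edge magnitude and direction then follow as sqrt(gx² + gy²) and atan2(gy, gx), matching the magnitude/direction estimate of the preceding slide.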
28. Computational shortcuts
29. 2 rows of intensity vs difference
30. Masks show how to combine neighborhood values
Multiply the mask by the image neighborhood to
get first derivatives of intensity versus x and
versus y
31. Curves of contrasting pixels
32. Boundaries are not always found well
33. Canny boundary operator
34. LOG filter creates zero crossings at step edges
(2nd derivative of Gaussian)
3x3 mask applied at each image position
Detects spots
Detects step edges
Marr-Hildreth theory of edge detection
35. ∇²G(x,y): the Mexican hat filter
36. Positive center, negative surround
37. Properties of the LOG filter
- Has zero response on constant region
- Has zero response on intensity ramp
- Exaggerates a step edge by making it larger
- Responds to a step in any direction across the
receptive field.
- Responds to a spot about the size of the center
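The first three properties can be checked numerically. The sketch below uses a common 3x3 center-surround mask (positive center, negative surround) as a stand-in for the LOG; the exact mask values are an assumption, not from the slides:

```python
# 3x3 center-surround mask: positive center, negative surround.
MASK = [[-1, -1, -1],
        [-1,  8, -1],
        [-1, -1, -1]]

def apply_mask(neigh):
    """Sum of mask values times the 3x3 image neighborhood."""
    return sum(MASK[i][j] * neigh[i][j] for i in range(3) for j in range(3))

constant = [[5, 5, 5]] * 3               # constant region
ramp = [[1, 2, 3]] * 3                   # intensity ramp across columns
spot = [[0, 0, 0], [0, 9, 0], [0, 0, 0]] # bright spot at the center

# Zero response on a constant region and on a ramp; strong response on a spot.
print(apply_mask(constant))  # 0
print(apply_mask(ramp))      # 0
print(apply_mask(spot))      # 72
```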
38. Human receptive field is analogous to a mask
Xj are the image intensities.
Wj are the gains (weights) in the mask.
39. Human receptive fields amplify contrast
40. 3D neural network in the brain
Level j
Level j+1
41. Mach band effect shows human bias
42. Human bias and illusions support the receptive
field theory of edge detection
43. Human brain as a network
- 100B neurons, or nodes
- Half are involved in vision
- 10 trillion connections
- Neuron can have fanout of 10,000
- Visual cortex highly structured to process 2D
signals in multiple ways
44. Color and shading
- Used heavily in human vision
- Color is a pixel property, making some
recognition problems easy
- Visible spectrum for humans is 400 nm (blue) to
700 nm (red)
- Machines can see much more, e.g. X-rays,
infrared, radio waves
45. Imaging Process (review)
46. Factors that Affect Perception
- Light: the spectrum of energy that illuminates
the object surface
- Reflectance: ratio of reflected light to
incoming light
- Specularity: highly specular (shiny) vs.
matte surface
- Distance: distance to the light source
- Angle: angle between surface normal and light
source
- Sensitivity: how sensitive is the sensor
47. Some physics of color
- White light is composed of all visible
wavelengths (400-700 nm)
- Ultraviolet and X-rays are of much smaller
wavelength
- Infrared and radio waves are of much longer
wavelength
48. Models of Reflectance
We need to look at models for the physics of
illumination and reflection that will
1. help computer vision algorithms extract
information about the 3D world, and
2. help computer graphics algorithms render
realistic images of model scenes.
Physics-based vision is the subarea of computer
vision that uses physical models to understand
image formation in order to better analyze
real-world images.
49. The Lambertian Model: Diffuse Surface Reflection
A diffuse reflecting surface reflects light
uniformly in all directions.
Uniform brightness for all viewpoints of a planar
surface.
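Under the Lambertian model, reflected intensity follows Lambert's cosine law: it depends on the angle between the surface normal and the light direction, not on the viewpoint. A minimal Python sketch (function name and example vectors are illustrative):

```python
import math

def lambertian_intensity(normal, light_dir, albedo=1.0):
    """Lambert's cosine law: intensity = albedo * cos(angle between
    surface normal and light direction), clamped at zero."""
    def norm(v):
        m = math.sqrt(sum(x * x for x in v))
        return tuple(x / m for x in v)
    n, l = norm(normal), norm(light_dir)
    return albedo * max(0.0, sum(a * b for a, b in zip(n, l)))

# Light from straight above a horizontal surface: full brightness.
print(lambertian_intensity((0, 0, 1), (0, 0, 1)))  # 1.0
# Light at 60 degrees from the normal: cos(60 deg) = 0.5.
angle = math.radians(60)
print(lambertian_intensity((0, 0, 1), (math.sin(angle), 0, math.cos(angle))))
```

Note that the viewing direction never appears: this is exactly the "uniform brightness for all viewpoints" property stated above.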
50. Real matte objects
Light from ring around camera lens
51. Specular reflection is highly directional and
mirrorlike.
R is the ray of reflection. V is the direction
from the surface toward the viewpoint. α is the
shininess parameter.
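The R, V, shininess form above matches the Phong specular term (R · V) raised to the shininess power; the slides do not name the exact model, so treating it as Phong is an assumption. A minimal Python sketch:

```python
import math

def phong_specular(reflect_dir, view_dir, shininess):
    """Specular term (R . V)^shininess: bright only when the viewing
    direction V is close to the reflection ray R."""
    def norm(v):
        m = math.sqrt(sum(x * x for x in v))
        return tuple(x / m for x in v)
    r, v = norm(reflect_dir), norm(view_dir)
    return max(0.0, sum(a * b for a, b in zip(r, v))) ** shininess

# Viewer exactly on the reflection ray: maximum highlight.
mirror = phong_specular((0, 0, 1), (0, 0, 1), shininess=50)
# Viewer slightly off the reflection ray: highlight falls off sharply.
off = phong_specular((0, 0, 1), (0.5, 0, 1), shininess=50)
```

A larger shininess exponent makes the highlight narrower, which is why shiny surfaces show small, bright specularities while matte surfaces do not.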
52. CV: Perceiving 3D from 2D
- Many cues from 2D images enable interpretation of
the structure of the 3D world producing them
53. Many 3D cues
How can humans and other machines reconstruct the
3D nature of a scene from 2D images?
What other world knowledge needs to be added in
the process?
54. Labeling image contours interprets the 3D scene
structure
The logo on the cup is a mark on the material; the
shadow relates to illumination, not material.
An egg and a thin cup on a table top, lighted from
the top right.
55. Intrinsic image stores 3D info in pixels, not
intensity.
For each point of the image, we want depth to the
3D surface point, surface normal at that point,
albedo of the surface material, and illumination
of that surface point.
56. 3D scene versus 2D image
- Creases
- Corners
- Faces
- Occlusions (for some viewpoint)
- Edges
- Junctions
- Regions
- Blades, limbs, Ts
57. Labeling of simple polyhedra
Labeling of a block floating in space. BJ and KI
are convex creases. Blades AB, BC, CD, etc. model
the occlusion of the background. Junction K is a
convex trihedral corner. Junction D is a
T-junction modeling the occlusion of blade CD by
blade JE.
58. Trihedral blocks world image junctions: only 16
cases!
Only 16 possible junctions in 2D are formed by
viewing 3D corners formed by 3 planes from a
general viewpoint! From top to bottom:
L-junctions, arrows, forks, and T-junctions.
59. How do we obtain the catalog?
- think about solid/empty assignments to the 8
octants about the X-Y-Z origin
- think about non-accidental viewpoints
- account for all possible topologies of junctions
and edges
- then handle T-junction occlusions
60. Blocks world labeling
Left: block floating in space
Right: block glued to a wall at the back
61. Try labeling these: interpret the 3D structure,
then label the parts
What does it mean if we can't label them? If we
can label them?
62. 1975: researchers very excited
- very strong constraints on interpretations
- several hundred junctions in the catalogue when
cracks and shadows are allowed (Waltz); the
algorithm works very well with them
- but the world is not made of blocks!
- later on, curved blocks world work was done, but
it was not as interesting
63. Backtracking or interpretation tree
64. Necker cube has multiple interpretations
Label the different interpretations.
A human staring at one of these cubes typically
experiences changing interpretations. The
interpretation of the two forks (G and H)
flip-flops between front corner and back
corner. What is the explanation?
65. Depth cues in 2D images
66. Interposition cue
Def: Interposition occurs when one object
occludes another object, thus indicating that the
occluding object is closer to the viewer than the
occluded object.
67. Interposition
- T-junctions indicate occlusion: the top is the
occluding edge while the bar is the occluded edge
- Bench occludes lamp post
- Leg occludes bench
- Lamp post occludes fence
- Railing occludes trees
- Trees occlude steeple
68.
- Perspective scaling: the railing looks smaller at
the left; the bench looks smaller at the right; 2
steeples are far away
- Foreshortening: the bench is sharply angled
relative to the viewpoint; image length is
affected accordingly
69. Texture gradient reveals surface orientation
(In East Lansing, we call it corn, not maize.)
Note also that the rows appear to converge in 2D.
Texture gradient: a change of image texture along
some direction, often corresponding to a change
in distance or orientation in the 3D world
containing the objects creating the texture.
70. 3D cues from perspective
71. 3D cues from perspective
72. More 3D cues
Virtual lines
Falsely perceived interposition
73. Irving Rock, The Logic of Perception (1982)
- Summarized an entire career in visual psychology
- Concluded that the human visual system acts as a
problem-solver
- Triangle unlikely to be accidental; must be an
object in front of the background; must be
brighter since it's closer
74. More 3D cues
2D alignment usually means 3D alignment
2D image curves create perception of a 3D surface
75. Structured light can enhance surfaces in
industrial vision
Potatoes with light stripes
Sculpted object
76. Models of Reflectance
We need to look at models for the physics of
illumination and reflection that will
1. help computer vision algorithms extract
information about the 3D world, and
2. help computer graphics algorithms render
realistic images of model scenes.
Physics-based vision is the subarea of computer
vision that uses physical models to understand
image formation in order to better analyze
real-world images.
77. The Lambertian Model: Diffuse Surface Reflection
A diffuse reflecting surface reflects light
uniformly in all directions.
Uniform brightness for all viewpoints of a planar
surface.
78. Shape (normals) from shading
Clearly, intensity encodes shape in this case.
Cylinder with white paper and pen stripes
Intensities plotted as a surface
79. Shape (normals) from shading
Plot of intensity of one image row reveals the 3D
shape of these diffusely reflecting objects.
80. Specular reflection is highly directional and
mirrorlike.
R is the ray of reflection. V is the direction
from the surface toward the viewpoint. α is the
shininess parameter.
81. What about models for recognition?
- recognition: to know again
- How does memory store models of faces, rooms,
chairs, etc.?
82. Human capability is extensive
- A child of age 6 might recognize 3000 words
- And 30,000 objects
- A junkyard robot must recognize nearly all objects
- Hundreds of styles of lamps, chairs, tools, …
83. Some recognition methods
- Via geometric alignment to CAD models
- Via a trained neural net
- Via parts of objects and how they join
- Via the function/behavior of an object
84. Side view classes of Ford Taurus (Chen and
Stockman)
These were made in the PRIP Lab from a scale
model. Viewpoints in between can be generated
from x and y curvature stored on the boundary.
Viewpoints are matched to real image boundaries
via optimization.
85. Matching image edges to model limbs
Could recognize car model at stoplight or gate.
86. Object as parts and relations
- Parts have size, color, shape
- Parts connect together at concavities
- Relations are connect, above, right of, inside
of, …
87. Functional models
- Inspired by J.J. Gibson's Theory of Affordances
- An object is what an object does
- container: holds stuff
- club: hits stuff
- chair: supports humans
88. Louise Stark chair model
- Dozens of CAD models of chairs
- Program analyzed each for
- a stable pose
- a seat of the right size
- height off the ground of the right size
- no obstruction to the body on the seat
- the program would accept a trash can
- (which could also pass as a container)
89. Minsky's theory of frames (Schank's theory of
scripts)
- Frames are learned expectations: a frame for a
room, a car, a party, an argument, …
- Frame is evoked by the current situation; how?
(hard)
- Human fills in the details of the current frame
(easier)
90. Summary
- Images have many low-level features
- Can detect uniform regions and contrast
- Can organize regions and boundaries
- Human vision uses several simultaneous channels:
color, edge, motion
- Use of models/knowledge is diverse and difficult
- The last 2 issues are difficult in computer vision