Title: Photo Tourism and IM2GPS: 3D Reconstruction and Geolocalization from Internet Photo Collections

1. Photo Tourism and IM2GPS: 3D Reconstruction and Geolocalization from Internet Photo Collections
- Noah Snavely, Cornell University
- James Hays, CMU; MIT (Fall 2009); Brown (Spring 2010-)
CVPR 09, June 20, 2009
2. The world in photos
- There are billions of photos online
- A photographic record of the surface of the earth
- Photo sharing on a massive scale
3. Flickr
[Chart: Flickr's photo count growing past 1, 2, and 3 billion photos]
- > 3.6 billion photos on Flickr, > 7.2 billion on Photobucket, > 15 billion on Facebook
4. Applications of Internet photo collections
- Hays and Efros, Scene completion using millions of photographs
- Crandall et al., Mapping the world's photos
- Simon et al., Scene summarization
5. Applications of Internet photo collections
- Simon and Seitz, Scene segmentation
- Kuthirummal et al., Camera calibration
- See more cool work tomorrow at the Internet Vision Workshop
6. Today's agenda
- Part I: Photo Tourism (3D reconstruction and visualization from Internet photo collections)
- Part II: IM2GPS (the Internet as a data source: automatic geolocation of single images)
7. Rough schedule
- 8:30 - 8:40 Introduction
- Part I: Photo Tourism
  - 8:40 - 10:00 Image matching and structure from motion (Snavely)
  - 10:00 - 10:20 Break
  - 10:20 - 11:00 3D visualization of photo collections (Snavely)
- Part II: IM2GPS
  - 11:00 - 12:30 Geolocalization of images: use of the Internet as a data source (Hays)
8. Part I: Photo Tourism
9. Traditional structure from motion
- Input: video sequence (handheld, mounted to a mechanical arm, or attached to a robot)
- Output: 3D model
Beardsley et al., 3D Model Acquisition from Extended Image Sequences, ECCV 96
Pollefeys et al., Visual Modeling with a Handheld Camera, IJCV 04
David Nister, Ph.D. Thesis
10. Traditional structure from motion
- Input: video sequence (handheld, mounted to a mechanical arm, or attached to a robot)
- Output: 3D model
Commercial SfM software from 2d3
11. Traditional structure from motion
- Input: video sequence (handheld, mounted to an arm, or attached to a robot)
- Output: 3D model
- Input video characteristics:
  - Images are taken by a single camera
  - In a short amount of time
  - Moving continuously
  - Given in a logical temporal order
12. Internet structure from motion
- Input: collection of photos resulting from Internet search
13. Internet structure from motion
- Input: collection of photos resulting from Internet search
- Input characteristics:
  - Taken by many different people and cameras (from a Motorola RAZR to a Nikon D3)
14. Internet structure from motion
- Input: collection of photos resulting from Internet search
- Input characteristics:
  - Taken at many different times of day, year, century
15. Internet structure from motion
- Input: collection of photos resulting from Internet search
- Input characteristics:
  - Given in essentially random order
16. SfM for unordered photo collections
- Very different from traditional video sequences
- Early work in this area by Schaffalitzky and Zisserman: Multi-view Matching for Unordered Image Sets, or "How Do I Organize My Holiday Snaps?", ECCV 02
17. SfM for unordered photo collections
- Vergauwen and Van Gool, Web-based Reconstruction Service, Machine Vision and Applications 2006 (http://www.arc3d.be/)
- Brown and Lowe, Unsupervised 3D Object Recognition and Reconstruction in Unordered Datasets, 3DIM 06
18. Two important breakthroughs
- Advances in wide-baseline feature matching (e.g., SIFT)
- Advances in multi-view geometry techniques
19. Overview of Part I
- Basic SfM pipeline:
  - Feature detection
  - Feature matching and track generation
  - Structure from motion (SfM)
- Faster matching and SfM
- Problem cases
20. Feature detection
Detect features using SIFT [Lowe, IJCV 2004]
21. Feature detection
Detect features using SIFT [Lowe, IJCV 2004]
22. Feature detectors
- SIFT [Lowe, IJCV 04]
  - Binary available at http://www.cs.ubc.ca/~lowe/keypoints/
  - C implementation (by Andrea Vedaldi) available at http://www.vlfeat.org/ (also implements MSER)
  - Other implementations: http://people.csail.mit.edu/albert/ladypack/wiki/index.php/Known_implementations_of_SIFT
- SURF [Bay et al., CVIU 08]
  - http://www.vision.ee.ethz.ch/~surf/
- Many others
23. Feature detection
Detect features using SIFT [Lowe, IJCV 2004]
24. Wide-baseline feature matching
- Match features between each pair of images
25. Wide-baseline feature matching
- Standard approach for pairwise matching: for each feature in image A, find the feature with the closest descriptor in image B
(from Schaffalitzky and Zisserman 02)
26. Wide-baseline feature matching
- Compare the distance to the closest feature against the distance to the second-closest feature
- If the ratio of distances is less than a threshold, keep the match
- Why the ratio test?
  - It eliminates hard-to-match repeated features
  - Distances in SIFT space seem to be non-uniform
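The ratio test can be sketched with brute-force numpy matching. This is illustrative only: a real pipeline would use approximate nearest neighbors (discussed on the next slide), and the 0.8 threshold is a common choice rather than a value taken from this tutorial.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """For each descriptor in A, find its two nearest neighbors in B and
    keep the match only if the nearest is much closer than the runner-up."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # distances to all of B
        j1, j2 = np.argsort(dists)[:2]               # nearest, second nearest
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, int(j1)))
    return matches
```

A repeated structure (e.g., identical windows) produces two near-equal distances, so the ratio is close to 1 and the ambiguous match is rejected.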
27. Wide-baseline feature matching
- Because of the high dimensionality of the features, approximate nearest neighbor search is necessary for efficient performance
- See the ANN package by Mount and Arya: http://www.cs.umd.edu/~mount/ANN/
28. Wide-baseline feature matching
- Refine the matching using RANSAC with the 8-point algorithm to estimate fundamental matrices between pairs
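As a rough sketch of the estimation step, here is a normalized 8-point algorithm in numpy. In the actual pipeline this linear estimate would sit inside a RANSAC loop over random 8-match samples with an inlier test; the loop is omitted here for brevity.

```python
import numpy as np

def eight_point(x1, x2):
    """Linear 8-point estimate of the fundamental matrix.
    x1, x2: Nx2 arrays of corresponding pixel coordinates (N >= 8)."""
    def normalize(pts):
        # Hartley normalization: center on the centroid, scale so the
        # mean distance from the origin is sqrt(2) (improves conditioning)
        c = pts.mean(axis=0)
        s = np.sqrt(2) / np.sqrt(((pts - c) ** 2).sum(axis=1)).mean()
        T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1.0]])
        ph = np.hstack([pts, np.ones((len(pts), 1))]) @ T.T
        return ph, T
    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # Each correspondence gives one row of the homogeneous system A f = 0,
    # where f is F flattened row-major and x2^T F x1 = f . kron(x2, x1)
    A = np.stack([np.kron(p2[i], p1[i]) for i in range(len(p1))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint on F
    U, S, Vt2 = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt2
    F = T2.T @ F @ T1            # undo the normalization
    return F / np.linalg.norm(F)
```

With RANSAC, one would repeatedly fit F to 8 random matches and keep the F with the most matches whose symmetric epipolar distance falls below a pixel threshold.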
29. The power of SIFT
30. Image connectivity graph
(graph layout produced using the Graphviz toolkit: http://www.graphviz.org/)
31. From pairwise matches to tracks
- Once we have pairwise matches, the next step is to link up the matches to form tracks
- Each track is a connected component of the pairwise feature match graph
- Each track will eventually grow up to become a 3D point
[Figure: matched features linked across Images 1, 2, and 3]
32. From pairwise matches to tracks
- Once we have pairwise matches, the next step is to link up the matches to form tracks
- Some tracks might be inconsistent (e.g., contain two different features from the same image)
- We remove the features from the troublesome images
[Figure: an inconsistent track spanning Images 1, 2, and 3]
33. Image connectivity post track generation
- Raw image matches vs. image matches after track generation
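Since tracks are connected components of the match graph, they can be generated with a union-find structure over (image, feature) nodes. This sketch simply drops inconsistent tracks entirely, a simplification relative to the procedure above, which removes only the features from the troublesome images.

```python
class UnionFind:
    """Union-find over arbitrary hashable nodes (here, (image, feature))."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def build_tracks(pairwise_matches):
    """pairwise_matches: iterable of ((img_i, feat_i), (img_j, feat_j)).
    Returns consistent tracks as sorted lists of (image, feature) pairs."""
    uf = UnionFind()
    for a, b in pairwise_matches:
        uf.union(a, b)
    groups = {}
    for node in list(uf.parent):
        groups.setdefault(uf.find(node), []).append(node)
    tracks = []
    for members in groups.values():
        imgs = [img for img, _ in members]
        if len(imgs) == len(set(imgs)):   # consistent: one feature per image
            tracks.append(sorted(members))
    return tracks
```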
34. The power of transitivity
35. ...but most tracks are short
- Example: image collection with 3,000 images
- 1,546,612 total tracks:
  - 79% have length 2
  - 90% have length ≤ 3
  - 98% have length ≤ 10
  - Longest track: 385 features
36. The story so far
- Input images → feature detection → matching and track generation → images with feature correspondence
37. The story so far
- Next step: use structure from motion to solve for the geometry (cameras and points)
- First: what are cameras and points?
38. Points and cameras
- Point: a 3D position in space
- Camera:
  - a 3D position
  - a 3D orientation
  - intrinsic parameters (focal length, aspect ratio, ...)
  - 7 parameters (3 + 3 + 1) in total
39. Structure from motion
[Figure: Cameras 1-3, with parameters (R1, c1, f1), (R2, c2, f2), (R3, c3, f3), observing a set of 3D points]
40. Solving structure from motion
- Inputs: feature tracks
- Outputs: 3D cameras and points
- How do we solve the SfM problem?
- Challenges:
  - Large number of parameters (1000s of cameras, millions of points)
  - Very non-linear objective function
41. Solving structure from motion
- Important tool: bundle adjustment [Triggs et al. 00]
  - Joint non-linear optimization of both cameras and points
  - Very powerful, elegant tool
- The bad news:
  - Starting from a random initialization is very likely to give the wrong answer
  - It is difficult to initialize all the cameras at once
42. Solving structure from motion
- The good news:
  - Structure from motion with two cameras is (relatively) easy
  - Once we have an initial model, it's easy to add new cameras
- Idea: start with a small seed reconstruction, and grow
43. Incremental SfM
- Automatically select an initial pair of images
44. Incremental SfM
45. Incremental SfM
46. Incremental SfM: algorithm
1. Pick a strong initial pair of images
2. Initialize the model using two-frame SfM
3. While there are connected images remaining:
   a. Pick the image which sees the most existing 3D points
   b. Estimate the pose of that camera
   c. Triangulate any new points
   d. Run bundle adjustment
47. Step 1: Picking the initial pair
- We want a pair with many matches, but with as large a baseline as possible
[Figure: large baseline, very few matches; small baseline, lots of matches; large baseline, lots of matches]
48. Step 1: Picking the initial pair
- Many possible heuristics
- Ours: choose the pair with at least 100 matches such that the fraction of matches consistent with a homography is as small as possible
- A homography will be a bad fit if there is sufficient parallax (and the scene is not planar)
49. Step 2: Two-frame reconstruction
- Input: two images with correspondence
- Output: camera parameters, 3D points
- In general, there can be ambiguities if the cameras are uncalibrated (camera intrinsics are unknown)
- We assume that the only unknown intrinsic parameter is the focal length
50. Finding calibration information
- Many cameras list the focal length of a photo in its Exif metadata:

    File size        : 85111 bytes
    File date        : 2005:12:16 04:17:12
    Camera make      : Panasonic
    Camera model     : DMC-FZ20
    Date/Time        : 2005:03:19 12:52:33
    Resolution       : 450 x 600
    Flash used       : No
    Focal length     : 6.0mm
    Exposure time    : 0.0012 s (1/800)
    Aperture         : f/5.6
    ISO equiv.       : 80
    Whitebalance     : Auto
    Metering Mode    : matrix
    Exposure program : (auto)

51. http://www.dpreview.com/reviews/specs/Panasonic/panasonic_dmcfz20.asp
52. Finding calibration information
- From the Exif tags and the sensor size listed in the spec sheet (5.75 mm):
  Focal length (pixels) = Focal length (mm) × Image width (pixels) / Sensor size (mm)
                        = 6.0 mm × 600 pixels / 5.75 mm ≈ 626.1 pixels
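The conversion above is easy to script. The helper below is a hypothetical utility, not part of any released tool, and it assumes the image width corresponds to the sensor width dimension.

```python
def focal_px(focal_mm, image_width_px, sensor_width_mm):
    """Convert an Exif focal length in mm to a focal length in pixels,
    given the physical sensor width from a camera spec database."""
    return focal_mm * image_width_px / sensor_width_mm
```

For the DMC-FZ20 example, `focal_px(6.0, 600, 5.75)` reproduces the roughly 626.1 pixels computed on the slide.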
53. Step 2: Two-view reconstruction
- Two-view SfM: given two calibrated images with corresponding points, compute the camera and point positions
- Solved by finding the essential matrix between the images
- Best approach is the 5-point algorithm (as opposed to the 6-, 7-, or 8-point algorithms)
54. Five-point algorithm
[Figure: epipolar geometry between Image 1 (Camera 1) and Image 2 (Camera 2)]
55. Five-point algorithm
- First practical solution: Nister, An Efficient Solution to the Five-Point Relative Pose Problem, PAMI 04
- See also: Li and Hartley, Five-Point Motion Estimation Made Easy, ICPR 06
56. Two-view reconstruction
[Figure: Cameras 1 and 2 with triangulated points]
57. Two-view reconstruction
[Figure: Cameras 1 and 2 with triangulated points]
58. Steps 3b, 3c: Pose estimation and triangulation
- Next step: grow the reconstruction by adding another image and triangulating new points
[Figure: n-view triangulation]
59. Steps 3b, 3c: Pose estimation and triangulation
- Next step: grow the reconstruction by adding another image and triangulating new points
- Both of these problems can be solved approximately using linear systems (the Direct Linear Transform, DLT)
60. Step 3b: Pose estimation
- Choose the image with the most matches to existing 3D points
- Linear 6-point algorithm for finding the 3x4 projection matrix P
- P can then be decomposed into K[R t] (intrinsics, rotation, and translation) using RQ decomposition
- Use non-linear polishing to snap the camera into place
- For calibrated cameras, there is also a 3-point algorithm
61. Step 3c: n-view triangulation
- Objective function: sum of squared reprojection errors
- Also solvable (approximately) using a simple linear system
- Follow with non-linear polishing
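The linear (DLT) triangulation step can be sketched as follows. Each view contributes two rows of a homogeneous system, and the least-squares solution is the singular vector with the smallest singular value. Note that this minimizes an algebraic error, not the reprojection error, which is why a non-linear polish follows.

```python
import numpy as np

def triangulate(Ps, xs):
    """n-view linear (DLT) triangulation.
    Ps: list of 3x4 projection matrices; xs: list of (u, v) observations.
    Each view contributes two rows of the homogeneous system A X = 0."""
    rows = []
    for P, (u, v) in zip(Ps, xs):
        rows.append(u * P[2] - P[0])   # u * (row 3) - (row 1)
        rows.append(v * P[2] - P[1])   # v * (row 3) - (row 2)
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                         # null-space direction
    return X[:3] / X[3]                # dehomogenize
```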
62. Steps 3b, 3c: Pose estimation and triangulation
- In practice, multiple images can be added at once
- If the highest-matching image has N matches, add all images with at least 0.75N matches (or at least 500 matches)
63. Step 3d: Bundle adjustment
[Figure: Cameras 1-3, with parameters (R1, c1, f1), (R2, c2, f2), (R3, c3, f3), and the reconstructed points]
64. Step 3d: Bundle adjustment
- Given:
  - vectors of cameras C and 3D points X
  - a set of observed point projections: q_ij is the observed 2D location of point j in image i
- Adjust the cameras and points to minimize g, the sum of squared reprojection errors
65. Reprojection error
- Objective function: g(C, X) = Σ_i Σ_j w_ij ||P(c_i, X_j) - q_ij||²
  - P(c_i, X_j) is the projection of point X_j into camera i
  - w_ij is an indicator variable: 1 if point j is visible in camera i, 0 otherwise
[Figure: point X_j projects into camera i; the reprojection error is the 2D distance between the projection and the observation q_ij]
66. Objective function
- Projection equation (simplified version): P(c, X) projects point X into the camera by rotating X into the camera's coordinate frame, performing perspective division, and scaling by the focal length f
67. Bundle adjustment
- Minimizing g is a sparse non-linear least squares problem
- Usual approach: approximate P with a linear function, minimize using linear least squares, and repeat until convergence
68. Bundle adjustment
- Usual approach: approximate P by linearizing around a current guess (C0, X0):
  - P(C0 + δC, X0 + δX) ≈ P(C0, X0) + J δ, where J is the Jacobian (matrix of partial derivatives) and δ = (δC, δX)
69. Bundle adjustment
- Linearized problem: find the step δ that minimizes ||J δ - r||, where r is the vector of current residuals
- Then set (C0, X0) ← (C0 + δC, X0 + δX) and repeat
70. Bundle adjustment
- How do we minimize ||J δ - r||?
- Least-squares solution to the overconstrained linear system J δ = r
71. Bundle adjustment
- (Over-constrained as long as 2 × numObservations > 7 × numCameras + 3 × numPoints)
- Solved using the normal equations: JᵀJ δ = Jᵀr
72. Bundle adjustment
- Guess an answer
- Linearize and compute an optimal step
- Relinearize and repeat
- This algorithm is known as Gauss-Newton
- In practice, a modified algorithm known as Levenberg-Marquardt is used
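The Gauss-Newton loop can be illustrated on a toy curve-fitting problem, with a small exponential model standing in for the projection function P. The structure is the same: linearize, solve the normal equations, step, repeat. A real bundle adjuster would add Levenberg-Marquardt damping and exploit the sparsity of J.

```python
import numpy as np

def gauss_newton(f, jac, theta, y, iters=20):
    """Generic Gauss-Newton: linearize the residual around the current
    guess, solve the normal equations J^T J d = J^T r, step, repeat."""
    for _ in range(iters):
        r = y - f(theta)                       # current residuals
        J = jac(theta)                         # Jacobian at the guess
        d = np.linalg.solve(J.T @ J, J.T @ r)  # normal equations
        theta = theta + d
    return theta

# Toy stand-in for P: model y = exp(a*x) + b, parameters theta = (a, b)
x = np.linspace(0.0, 1.0, 50)
y = np.exp(1.5 * x) + 0.3                      # synthetic observations
f = lambda th: np.exp(th[0] * x) + th[1]
jac = lambda th: np.column_stack([x * np.exp(th[0] * x), np.ones_like(x)])
theta = gauss_newton(f, jac, np.array([1.0, 0.0]), y)
```

Just as the slides warn for SfM, this iteration only converges from a reasonable initial guess, which is why the pipeline grows the reconstruction incrementally instead of starting from scratch.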
73. Bundle adjustment
- Example: 7 points, 3 cameras, 21 observations
- 21 + 21 = 42 variables; 21 × 2 = 42 equations
76. Typical problem (6 cameras, 100 points)
77. Other tricks
- Many approaches to bundle adjustment use the Schur complement to reduce the size of the linear system
- The Schur complement factors out the points to form a reduced system that is just the size of the number of camera parameters
- Bundle adjustment then takes time O(n³) in the number of cameras (less if the reduced camera system is sparse)
- See Triggs et al., Bundle Adjustment: A Modern Synthesis, 00 for more details
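The Schur trick above can be demonstrated on a small dense system with the camera/point block structure. This is a toy: in a real bundle adjuster the point block C is block-diagonal (3x3 per point), so inverting it is cheap, whereas here it is inverted densely for illustration.

```python
import numpy as np

# Toy normal-equations system [[B, E], [E^T, C]] [dc; dp] = [v; w]:
# B is the camera block, C the point block, E the coupling terms.
rng = np.random.default_rng(1)
nc, npts = 4, 10                                 # toy block sizes
A = rng.normal(size=(nc + npts, nc + npts))
H = A @ A.T + (nc + npts) * np.eye(nc + npts)    # SPD, well-conditioned
b = rng.normal(size=nc + npts)
B, E, C = H[:nc, :nc], H[:nc, nc:], H[nc:, nc:]
v, w = b[:nc], b[nc:]

# Eliminate the points: solve only a camera-sized reduced system
Cinv = np.linalg.inv(C)
S = B - E @ Cinv @ E.T                           # Schur complement of C
dc = np.linalg.solve(S, v - E @ Cinv @ w)        # camera step
dp = Cinv @ (w - E.T @ dc)                       # back-substitute points
```

Substituting dp = C⁻¹(w - Eᵀdc) into the first block row gives (B - E C⁻¹ Eᵀ) dc = v - E C⁻¹ w, which is exactly the reduced camera system solved above.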
78. Other tricks
- Many packages use direct methods (e.g., Cholesky factorization, QR factorization) to solve the linear system
- Recently, we've been trying iterative methods (e.g., conjugate gradient) to good effect (faster, smaller memory footprint)
79. Sparse bundle adjustment packages
- Sparse Bundle Adjustment (SBA): Lourakis and Argyros, http://www.ics.forth.gr/~lourakis/sba/
- Simple Sparse Bundle Adjustment (SSBA): Christopher Zach, http://www.cs.unc.edu/~cmzach/opensource.html
80. The problem of outliers
- In spite of our best efforts to get clean matches, outliers remain
- The sum-of-squared-residuals objective function is statistically correct given a Gaussian noise model
- Unfortunately, outliers break the Gaussian assumption
81. The problem of outliers
- Possible solutions:
  - After each run of bundle adjustment, remove outliers and rerun
  - Use a robust objective function
(credit: Triggs et al., Bundle Adjustment: A Modern Synthesis)
82. Radial distortion
- In practice, radial distortion is a significant issue
83. Radial distortion
- Typically modeled as a low-order polynomial in the distance from a pixel to the center of distortion (often assumed to be the image center)
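A low-order polynomial model of this kind can be sketched as below. The two-parameter form (1 + k1·r² + k2·r⁴) is a common convention, not necessarily the exact parameterization used by this system.

```python
import numpy as np

def distort(xy, k1, k2):
    """Apply a two-parameter polynomial radial distortion model.
    xy: Nx2 coordinates relative to the distortion center. Each point is
    scaled radially by (1 + k1*r^2 + k2*r^4)."""
    r2 = (xy ** 2).sum(axis=1, keepdims=True)    # squared radius per point
    return xy * (1.0 + k1 * r2 + k2 * r2 ** 2)
```

The distortion coefficients are simply two more parameters per camera that bundle adjustment can optimize along with rotation, position, and focal length.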
84. Radial distortion
89. Timing information
90. Timing breakdown
- Matching: O(n²) in the number of input images (but easily parallelizable)
- SfM: worst case O(n⁴) in the number of reconstructed images
91. SfM complexity
- Dominated by the cost of bundle adjustment
- If we add a constant number k of images in each round, then we do work proportional to k³ + (2k)³ + (3k)³ + ... + n³ (a sum of n/k terms whose total grows like n⁴/k), i.e., O(n⁴)
92. Timing: historical comparison
- Ours: about 0.002 frames per second
(from David Nister's CVPR 2005 tutorial on real-time 3D reconstruction)
93. Faster image matching
- Recent techniques are based on ideas from text retrieval (applying Google to images):
  - Create a vocabulary of visual features ("words")
  - Given a database of images, represent each image as a collection of visual words (or a histogram of word frequencies)
  - Compute an inverted file mapping visual words -> images
  - Compute histogram distances using the inverted file
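The inverted-file idea can be sketched with plain dictionaries. This toy version scores images by shared-word counts; real systems weight words (e.g., by tf-idf) and quantize descriptors into words with a learned vocabulary, both omitted here.

```python
from collections import defaultdict, Counter

def build_inverted_file(image_words):
    """Map each visual word id to the set of image ids that contain it."""
    inv = defaultdict(set)
    for img, words in image_words.items():
        for w in words:
            inv[w].add(img)
    return inv

def query(inv, words):
    """Rank database images by how many query words they share."""
    votes = Counter()
    for w in set(words):
        for img in inv.get(w, ()):
            votes[img] += 1
    return votes.most_common()
```

Because only images sharing at least one word with the query are ever touched, scoring is far cheaper than comparing the query against every database image.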
94. Faster image matching
- The idea first appeared in Sivic and Zisserman, Video Google: A Text Retrieval Approach to Object Matching in Videos, ICCV 03
95. Faster image matching
- Nister and Stewenius, Scalable Recognition with a Vocabulary Tree, CVPR 06: real-time visual image search with a 50,000-image database
- Chum et al., Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval, ICCV 07: introduced the idea of query expansion for increasing recall
96. Faster SfM
- SfM is also very computationally intensive
- How can we make it faster? We need either:
  - faster algorithms, or
  - fewer images
- Observation: Internet collections represent very non-uniform samplings of viewpoint
- Idea: remove redundant images [Snavely, Seitz, Szeliski, CVPR 2008]
97. The Pantheon
98. Stonehenge
99. Stonehenge
- Full graph vs. skeletal graph
100. Skeletal set
- Goal: given an image graph G, select a small set S of important images to reconstruct, bounding the loss in quality of the reconstruction
- Reconstruct the skeletal set S
- Estimate the remaining images with much faster pose estimation steps
101. Properties of the skeletal set
- Should touch all parts of G: a dominating set
- Should form a single reconstruction: a connected dominating set
- Should result in an accurate reconstruction
102. Representing information in a graph
103. Representing information in a graph
104. Representing information in a graph
105. Representing information in a graph
- Want to find a subgraph with:
  - many leaves
  - small growth in estimated uncertainty between any pair of nodes
106. t-spanner problem
- Given a graph G, find a spanning subgraph G' such that, for every pair of vertices (P, Q), the distance between P and Q in G' is at most t times the distance between P and Q in G
- t is called the stretch factor
- Applications in wireless ad hoc networking [Peleg and Schäffer 1989; Althöfer et al. 1993; Li et al. 2000; Alzoubi 2003]
[Figure: a 4-spanner and a 3-spanner of an example graph]
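For intuition, here is the classic greedy spanner construction for an unweighted graph: keep an edge only if its endpoints are currently more than t hops apart in the spanner built so far. This illustrates the t-spanner definition only; it is not the skeletal-sets algorithm, which optimizes an uncertainty-based edge weight rather than hop count.

```python
from collections import deque

def bfs_dist(adj, s, g):
    """Hop distance from s to g in the current spanner (inf if disconnected)."""
    dist = {s: 0}
    q = deque([s])
    while q:
        x = q.popleft()
        if x == g:
            return dist[x]
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    return float("inf")

def greedy_spanner(n, edges, t):
    """Greedy t-spanner of an unweighted graph on vertices 0..n-1."""
    adj = [[] for _ in range(n)]
    kept = []
    for u, v in edges:
        if bfs_dist(adj, u, v) > t:        # edge needed to meet the stretch bound
            adj[u].append(v)
            adj[v].append(u)
            kept.append((u, v))
    return kept
```

On the complete graph K4 with t = 2, the construction keeps only a 3-edge star: every remaining pair is already within 2 hops, so all other edges are redundant, mirroring how the skeletal graph discards redundant images.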
107. Stonehenge
- Full graph vs. skeletal graph (t = 16, leaves omitted)
108. Properties of the approach
- Results in a connected reconstruction (when possible)
- Bounds the expected increase in uncertainty of the reconstructed model (the bound is defined by t)
- The remaining information can be used to refine the model after the initial reconstruction
109. Results
110. Pantheon
- Full graph vs. skeletal graph (t = 16)
111. Skeletal reconstruction: 101 images; after adding leaves: 579 images; after final optimization: 579 images
112. Pisa
- 1093 images registered (352 in skeletal set)
113. Trafalgar Square
- 2973 images registered (277 in skeletal set)
115. Statue of Liberty
- 7834 images registered (322 in skeletal set)
117. Running time
[Chart: skeletal-set reconstructions finish in hours, versus estimated full-set times of 10 and 50 days]
118. Structure from motion: failure cases
- Images too far apart
- Some points need to be successfully matched in at least three images (the "Rule of 3")
(images courtesy Yasutaka Furukawa)
119. Structure from motion: failure cases
120. SfM failure cases
121. SfM failure cases
122. Gauge ambiguity
- Without extra information, we can only reconstruct the scene up to an unknown similarity transform (translation, rotation, and scale)
- We don't know where the scene is located, how it is oriented, or how big it is (is the cube 10 cm across or 1,000,000 km?)
- (im2gps will help with this)
123. Gauge ambiguity
- 7 points, 3 cameras, 21 observations
- 21 + 21 = 42 variables
- 21 + 21 - 7 = 35 variables once the 7-parameter gauge is fixed
124. Gauge ambiguity
- It is often possible to estimate one of these parameters (the up vector) after reconstruction:
  - many cameras are parallel to a ground plane
  - most people capture images with little camera twist
125. How good are Exif tags?
126. Dense 3D modeling
- Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, Steve Seitz, ICCV 2007
127. References for Part I
- Code available at http://phototour.cs.washington.edu/bundler
- Image matching:
  - F. Schaffalitzky and A. Zisserman. Multi-view Matching for Unordered Image Sets, or "How Do I Organize my Holiday Snaps?" ECCV 02.
  - J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 03.
  - D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. CVPR 06.
  - O. Chum et al. Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval. ICCV 07.
128. References for Part I
- Code available at http://phototour.cs.washington.edu/bundler
- Structure from motion:
  - N. Snavely, S. Seitz, R. Szeliski. Modeling the World from Internet Photo Collections. IJCV 08.
  - N. Snavely, S. Seitz, R. Szeliski. Skeletal Sets for Efficient Structure from Motion. CVPR 08.
  - B. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon. Bundle Adjustment: A Modern Synthesis. ECCV 00.
129. Part I: Photo Tourism (continued)
131. Photo Tourism
132. Prague Old Town Square
133. Rendering
- What can we use for rendering?
  - A sparse set of points
  - A sparse set of images
- This representation is too sparse for traditional 3D rendering algorithms (geometry too sparse) or image-based rendering (images too sparse)
- Our approach:
  - Assume that the scene consists of 3D planes; treat the images as projectors onto these planes
134. Rendering transitions
135. Rendering transitions
136. Rendering transitions
137. Rendering transitions
- For each image / pair of images (Cameras A and B), the projection plane is computed as a best-fit plane to the set of points
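The best-fit plane above can be computed from the 3D points by SVD of the centered point set (the normal is the direction of least variance). This is a minimal sketch; a production system might fit the plane robustly to cope with outlier points.

```python
import numpy as np

def best_fit_plane(pts):
    """Least-squares plane through a set of 3D points (Nx3).
    Returns (centroid, unit normal); the plane minimizes the sum of
    squared orthogonal point-to-plane distances."""
    c = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - c)   # rows of Vt: principal directions
    return c, Vt[-1]                    # smallest-variance direction
```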
138. Yosemite
139. 3D navigation: Photo Tourism
- Demo
140. Continuous navigation
- Demo