Title: Parallel Image Registration using Graphical Processing Unit
CS 387 Semester Project
Srinivasa Shivakar Vulli Singiresu Dheeraj
Problem Statement
Image Registration
- Image registration, also called image mosaicing or image stitching, involves estimating the transformation from one image coordinate system to another.
- Transformation elements could include any or all of the following:
  - Translation
  - Rotation
  - Scale
  - Shear
  - Warp
Data Description
Data Collection
Step-Stare Pattern
Multi Spectral Imagery (MSI) Data
Bands: blue, near-infrared (NIR), green, red
Medium Wave Infrared (MWIR) Data
Each image is called a frame
Data files
Each file, called a Segment, consists of 21 contiguous frames
Motivation
Need to move to a real-time system
- Existing code base is in Matlab.
- Registering a segment takes around 3-5 minutes.
- Registration is done as post-processing, long after the data is collected.
- Need to speed up the registration process for onboard data processing.
- Data is collected at the rate of 2 segments (1 IR and 1 MSI) per 3 sec.
- Real-time requirement is therefore 7 frames per sec for each sensor (21 frames every 3 sec).
Registration Procedure
Phase correlation
- Used frequently for estimating the translation between two images
- Advantages
  - Operation performed in the frequency domain, potentially faster than spatial processing
  - Provides an approximate translation between image coordinates
  - Reduces the effort needed for other estimations (rotation, scaling, feature tracking)
- Disadvantages
  - Not very robust, sensitive to low frequency noise
  - Bad results when signal-to-noise ratio is low (plain featureless backgrounds)
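The frequency-domain operation listed above is commonly written as a normalized cross-power spectrum. The formulation below is the standard textbook one and may differ in detail from the variant used in this project; the symbols $f_1$, $f_2$ (the two frames), $F_1$, $F_2$, $R$ and $r$ are notation introduced here, and the second frame is assumed to be a pure (circular) translation of the first, $f_2(x,y) = f_1(x-\Delta x,\, y-\Delta y)$.

```latex
\begin{aligned}
F_1 &= \mathcal{F}\{f_1\}, \qquad F_2 = \mathcal{F}\{f_2\} \\
R(u,v) &= \frac{\overline{F_1(u,v)}\,F_2(u,v)}{\left|\overline{F_1(u,v)}\,F_2(u,v)\right|} \\
r(x,y) &= \mathcal{F}^{-1}\{R\}(x,y) \\
(\Delta x, \Delta y) &= \arg\max_{(x,y)}\, r(x,y)
\end{aligned}
```

Because every frequency is normalized to unit magnitude, frequencies dominated by noise weigh as much as informative ones, which is consistent with the sensitivity to low signal-to-noise ratio noted in the disadvantages.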
Feature selection
RX
Select features from ROI
Feature Tracking
- Selected features are tracked in the second frame.
- The Sum of Absolute Differences (SAD) block matching algorithm is used.
- This is the most computationally expensive task.
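SAD block matching can be sketched in plain C as an exhaustive search around each feature. This is a minimal serial illustration, not the project's CUDA kernel: the function names, the 8-bit pixel type, the square block size `bs` and the search `radius` are all assumptions made for the example.

```c
#include <assert.h>
#include <limits.h>

/* Sum of absolute differences between the bs x bs block of `ref` at (rx, ry)
 * and the bs x bs block of `cur` at (cx, cy). Images are row-major, width w. */
static long sad_block(const unsigned char *ref, const unsigned char *cur,
                      int w, int rx, int ry, int cx, int cy, int bs)
{
    long sum = 0;
    for (int j = 0; j < bs; ++j)
        for (int i = 0; i < bs; ++i) {
            int d = (int)ref[(ry + j) * w + rx + i]
                  - (int)cur[(cy + j) * w + cx + i];
            sum += d < 0 ? -d : d;
        }
    return sum;
}

/* Exhaustive search: tries every offset within +/- radius of the feature at
 * (fx, fy) and stores the top-left corner of the best-matching block in
 * (*bx, *by). Returns the SAD of the best match. */
long track_feature(const unsigned char *ref, const unsigned char *cur,
                   int w, int h, int fx, int fy, int bs, int radius,
                   int *bx, int *by)
{
    assert(fx >= 0 && fy >= 0 && fx + bs <= w && fy + bs <= h);
    long best = LONG_MAX;
    *bx = fx;
    *by = fy;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx) {
            int cx = fx + dx, cy = fy + dy;
            if (cx < 0 || cy < 0 || cx + bs > w || cy + bs > h)
                continue;  /* candidate block would leave the image */
            long s = sad_block(ref, cur, w, fx, fy, cx, cy, bs);
            if (s < best) {
                best = s;
                *bx = cx;
                *by = cy;
            }
        }
    return best;
}
```

The cost grows with (2 * radius + 1)^2 * bs^2 per feature, which is one way to see why this step dominates the run time; each candidate offset is independent of the others, so the search maps naturally onto GPU threads.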
Refining Locations
- Bad pairs should be removed to avoid mis-registration.
- A Euclidean distance measure is used to weed out bad pairs.
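The slides do not say what the Euclidean distance is measured against. One plausible reading, assumed here, is that each pair's displacement is compared with the approximate translation obtained from phase correlation, and pairs that deviate by more than a threshold are dropped. The `Pair` type, `refine_pairs` and `max_dist` are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* A feature location in frame 1 and its tracked match in frame 2. */
typedef struct { float x1, y1, x2, y2; } Pair;

/* Keeps only the pairs whose displacement lies within max_dist (Euclidean)
 * of the expected translation (tx, ty). Compacts the array in place and
 * returns the number of pairs kept. */
size_t refine_pairs(Pair *pairs, size_t n, float tx, float ty, float max_dist)
{
    assert(max_dist >= 0.0f);
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        float ex = (pairs[i].x2 - pairs[i].x1) - tx;
        float ey = (pairs[i].y2 - pairs[i].y1) - ty;
        /* compare squared distances so no square root is needed */
        if (ex * ex + ey * ey <= max_dist * max_dist)
            pairs[kept++] = pairs[i];
    }
    return kept;
}
```

The loop is short and its output size depends on the data, which fits the later decision to leave this step on the CPU.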
Affine Transform
- Affine transform is calculated from the refined location pairs.
- Takes into account translation, rotation, scale and shear
- Does not take warping into account; warping is not handled in the current project
- Singular Value Decomposition (SVD) is used to calculate the pseudo-inverse
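A least-squares affine fit from point pairs can be sketched as follows. Note the substitution: for brevity this example solves the 3 x 3 normal equations with Cramer's rule instead of forming an SVD-based pseudo-inverse as the project does; the two give the same least-squares solution when the points are not collinear. The coefficient layout `a[0..5]` and the function names are assumptions for the example.

```c
#include <assert.h>

/* Determinant of a row-major 3 x 3 matrix. */
static double det3(const double *m)
{
    return m[0] * (m[4] * m[8] - m[5] * m[7])
         - m[1] * (m[3] * m[8] - m[5] * m[6])
         + m[2] * (m[3] * m[7] - m[4] * m[6]);
}

/* Solves m * p = r with Cramer's rule. Returns 0 on success, -1 if m is
 * (numerically) singular, e.g. when all points are collinear. */
static int solve3(const double *m, const double *r, double *p)
{
    double d = det3(m);
    if (d > -1e-12 && d < 1e-12)
        return -1;
    for (int c = 0; c < 3; ++c) {
        double t[9];
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                t[3 * i + j] = (j == c) ? r[i] : m[3 * i + j];
        p[c] = det3(t) / d;
    }
    return 0;
}

/* Fits x2 = a[0]*x1 + a[1]*y1 + a[2] and y2 = a[3]*x1 + a[4]*y1 + a[5]
 * to n >= 3 point pairs in the least-squares sense. */
int fit_affine(const double *x1, const double *y1,
               const double *x2, const double *y2, int n, double a[6])
{
    assert(n >= 3);
    double m[9] = {0}, rx[3] = {0}, ry[3] = {0};
    for (int k = 0; k < n; ++k) {
        double v[3] = { x1[k], y1[k], 1.0 };
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j)
                m[3 * i + j] += v[i] * v[j];   /* accumulate V^T V    */
            rx[i] += v[i] * x2[k];             /* accumulate V^T x2   */
            ry[i] += v[i] * y2[k];             /* accumulate V^T y2   */
        }
    }
    if (solve3(m, rx, a) != 0)
        return -1;
    return solve3(m, ry, a + 3);
}
```

An SVD-based pseudo-inverse is the more robust choice when the point set is close to degenerate, which may be why the project uses it.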
Serial Code
- Serial version of the code is written using the best available tools for image processing on the CPU.
- Intel Integrated Performance Primitives (IPP) is used for low level routines.
  - Provides most of the image processing functions such as FFT, convolution, matrix multiplication, matrix conjugate multiplication and element-wise division.
- OpenCV is used for high level functions like GUI display and window handling.
- The serial code was tested on a Pentium processor with 4 GB RAM.
Parallel Code
- Parallel version of the code is written in C/CUDA.
- OpenGL is used for GUI and window handling.
- The code was tested on an Nvidia GTX 280, which has 240 stream processors and 1 GB of GDDR3 memory.
Graphics Processing Unit (GPU)
- A Graphics Processing Unit (GPU) is a dedicated graphics rendering device used to offload work from the CPU.
- Traditionally, GPU hardware had vertex shaders to process 3D geometry and pixel shaders to handle scene lighting and color toning.
- A modern GPU has as many as 240 stream processors, which can be used as either pixel or vertex shaders, thus increasing hardware utilization.
Compute Unified Device Architecture (CUDA)
- CUDA is a parallel programming model and software environment designed to use GPUs for general purpose computing.
- Stream processors on the GPU can be programmed in C using the APIs provided.
- GPUs with the G80 or a newer core architecture can be programmed using CUDA.
Task Partitioning
- Not all tasks can be parallelized.
- Tasks done on CPU
  - Refining of correlation pairs
  - Affine Transformation
- Tasks done on GPU
  - Phase correlation
  - Feature selection
  - Feature tracking
  - Image transformation
Data Partitioning
- Each thread handles one pixel.
- Each multiprocessor is allotted 64 threads (8 threads/proc)
- Number of thread blocks = width * height / 64
  - 5120 thread blocks
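The block count above can be checked with a small helper. With one thread per pixel and 64 threads per block, 5120 blocks correspond to 327,680 pixels, which would match, for example, a 640 x 512 frame; that frame size is an assumption, since the slides do not state it. The function name is hypothetical, and the rounding up is added so that sizes not divisible by the block size are still covered.

```c
#include <assert.h>

/* Number of thread blocks needed so that every pixel gets one thread.
 * Rounds up so a partially filled last block covers the remaining pixels. */
unsigned int num_thread_blocks(unsigned int width, unsigned int height,
                               unsigned int threads_per_block)
{
    assert(threads_per_block > 0);
    unsigned long long pixels = (unsigned long long)width * height;
    return (unsigned int)((pixels + threads_per_block - 1) / threads_per_block);
}
```

For the assumed 640 x 512 frame, `num_thread_blocks(640, 512, 64)` gives 5120, in agreement with the slide.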
Libraries Used
- CUFFT, CUDA FFT
  - Powers of 2 not required.
  - Faster when the size of the input matrix is a power of a single factor
- CUDPP, CUDA Data Parallel Primitives
  - Has routines for parallel sorting and parallel reduction
  - Used to calculate image mean
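The idea behind computing a mean by parallel reduction can be illustrated in plain C. This sketch does not call CUDPP; it only mimics the pairwise (tree) pattern that a GPU reduction follows, with the passes executed serially. The function name is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Pairwise (tree) reduction. In each pass, element i accumulates element
 * i + half; all additions within one pass are independent, so on a GPU each
 * could run in its own thread. Overwrites buf and returns the mean. */
float mean_by_reduction(float *buf, size_t n)
{
    assert(n > 0);
    for (size_t len = n; len > 1; ) {
        size_t half = (len + 1) / 2;   /* size of the surviving front part */
        for (size_t i = 0; i + half < len; ++i)
            buf[i] += buf[i + half];
        len = half;
    }
    return buf[0] / (float)n;
}
```

The number of passes grows with log2(n) rather than n, which is what makes the pattern attractive when many threads are available.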
Sample Code
    texture<float, 2, cudaReadModeElementType> tex;
    tex.addressMode[0] = cudaAddressModeWrap;
    tex.addressMode[1] = cudaAddressModeWrap;
    tex.filterMode = cudaFilterModeLinear;
    tex.normalized = true;  // access with normalized texture coordinates
Sample Code
    dim3 dimBlock(8, 16, 1);
    dim3 dimGrid(fwidth / dimBlock.x, fheight / dimBlock.y, 1);
    CUDA_SAFE_CALL(cudaMemcpyToArray(cuarray2, 0, 0, img2,
                                     width * height * sizeof(float),
                                     cudaMemcpyDeviceToDevice));
    registr<<<dimGrid, dimBlock, 0>>>(final, affMatD,
                                      min(dl.x, ul.x), min(dl.y, dr.y),
                                      fx, width, height);
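Sample Code
    // NOTE: the arithmetic operators are reconstructed from context;
    // in particular the '+' in (fx + width) is an assumption.
    __global__ void registr(float *final, float *affMatD,
                            int xshift, int yshift,
                            int fx, int width, int height)
    {
        // output pixel handled by this thread
        unsigned int x = blockIdx.x * blockDim.x + threadIdx.x + xshift;
        unsigned int y = blockIdx.y * blockDim.y + threadIdx.y + yshift;
        // scale factors to normalized texture coordinates
        float ux = 1 / (float)(width);
        float uy = 1 / (float)(height);
        // apply the affine matrix to get the source location
        ux = (affMatD[1] * y + affMatD[1 + 3] * x + affMatD[1 + 6] + 1) * ux;
        uy = (affMatD[0] * y + affMatD[0 + 3] * x + affMatD[0 + 6] + 1) * uy;
        // write only where the source location falls inside the image
        if (ux < 1 && ux > 0 && uy < 1 && uy > 0)
            final[y * (fx + width) + x] = tex2D(tex2, ux, uy);
    }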
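For comparison, the same image transformation step can be sketched on the CPU: every output pixel is mapped to a source location with the affine coefficients and the source is sampled with bilinear interpolation, which is the job the texture unit does in the kernel via `cudaFilterModeLinear`. This is a simplified analogue, not a port: it uses pixel coordinates instead of normalized ones, does not reproduce wrap addressing or the texture unit's texel-center conventions, and the coefficient layout `a[0..5]` and function names are assumptions.

```c
#include <assert.h>

/* Bilinear sample of a row-major w x h image at (sx, sy), where
 * 0 <= sx <= w - 1 and 0 <= sy <= h - 1. */
static float sample_bilinear(const float *img, int w, int h, float sx, float sy)
{
    int x0 = (int)sx, y0 = (int)sy;      /* truncation equals floor for values >= 0 */
    int x1 = x0 + 1 < w ? x0 + 1 : x0;   /* clamp the right/bottom neighbours */
    int y1 = y0 + 1 < h ? y0 + 1 : y0;
    float ax = sx - (float)x0, ay = sy - (float)y0;
    float top = img[y0 * w + x0] * (1.0f - ax) + img[y0 * w + x1] * ax;
    float bot = img[y1 * w + x0] * (1.0f - ax) + img[y1 * w + x1] * ax;
    return top * (1.0f - ay) + bot * ay;
}

/* For every output pixel (x, y), computes the source location
 *   sx = a[0]*x + a[1]*y + a[2],  sy = a[3]*x + a[4]*y + a[5]
 * and writes the interpolated value. Output pixels that map outside the
 * source image are left untouched. */
void warp_affine(const float *src, int sw, int sh,
                 float *dst, int dw, int dh, const float a[6])
{
    assert(sw > 0 && sh > 0 && dw > 0 && dh > 0);
    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x) {
            float sx = a[0] * x + a[1] * y + a[2];
            float sy = a[3] * x + a[4] * y + a[5];
            if (sx >= 0.0f && sx <= (float)(sw - 1) &&
                sy >= 0.0f && sy <= (float)(sh - 1))
                dst[y * dw + x] = sample_bilinear(src, sw, sh, sx, sy);
        }
}
```

Each iteration of the double loop touches only its own output pixel, which is why the GPU version can assign one thread per pixel.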
Results
Comparison
Conclusions
- We presented parallel implementations of image registration and anomaly detection tools using the GPU
- Parallel code on the GPU was significantly faster than C/IPP for all the applications
- Current implementation of the registration tool is not as robust as the existing Matlab code
- Need to survey new data structures and algorithms for faster processing on GPUs
Thank you
Questions