Title: A Parallel Implementation of MSER detection
1. A Parallel Implementation of MSER Detection
- GPGPU Final Project
- Lin Cao
2. Review
- MSER (Maximally Stable Extremal Regions) denotes a set of stable connected components detected in a grayscale image
- Invariant to affine transformations such as rotation, translation, and scale change
3. Review
- An MSER is a stable connected component of a thresholded image
- All pixels inside the MSER have either higher or lower intensity than the pixels in the surrounding region
- Regions are selected to be stable over a range of intensity thresholds
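As a minimal illustration of the definition above, the sketch below thresholds a small grayscale grid and labels the connected components of the pixels at or below the threshold. The function name `components_at_threshold` and the choice of 4-connectivity are assumptions made for this example, not part of the project's code; the 4×4 grid is the one shown on the buildDirectedGraph slides.

```python
from collections import deque

def components_at_threshold(image, t):
    """Return the 4-connected components of pixels with value <= t."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] > t:
                continue
            # Breadth-first flood fill from an unvisited pixel below the threshold.
            queue = deque([(r, c)])
            seen[r][c] = True
            region = []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and image[ny][nx] <= t:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(sorted(region))
    return components

image = [
    [75, 78, 56, 62],
    [50, 58, 55, 53],
    [80, 65, 64, 60],
    [65, 55, 50, 55],
]
# Component sizes at a few thresholds; a region whose size barely changes
# over a range of thresholds is a candidate MSER.
for t in (50, 55, 80):
    print(t, [len(c) for c in components_at_threshold(image, t)])
```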
4. Sequential and Parallel Approach
Sequential:
- bucketSort()
- Find()
- Union()
- Update() // regions are already obtained here
- GetRegion()
- computeVariation()
- leastVariation()
Parallel:
- buildDirectedGraph()
- blockReduction()
- parentCompression()
- computeVariation()
- findRoot()
- leastVariation()
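The sequential column follows the familiar pattern of sorting pixels by intensity and merging them with a union-find structure. The sketch below shows one way that pattern could look. The names `bucket_sort_pixels`, `find`, and `component_areas` are hypothetical, and the sketch assumes 8-bit intensities, 4-connectivity, and regions that grow from dark to bright.

```python
def bucket_sort_pixels(image, levels=256):
    """Counting (bucket) sort of flat pixel indices by intensity."""
    cols = len(image[0])
    buckets = [[] for _ in range(levels)]
    for r, row in enumerate(image):
        for c, v in enumerate(row):
            buckets[v].append(r * cols + c)
    return [i for bucket in buckets for i in bucket]

def find(parent, i):
    """Find: return the representative of i, halving the path on the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def component_areas(image):
    """Area of the component each pixel belongs to right after it is added."""
    rows, cols = len(image), len(image[0])
    n = rows * cols
    parent = list(range(n))
    area = [1] * n
    processed = [False] * n
    area_when_added = [0] * n
    for i in bucket_sort_pixels(image):
        processed[i] = True
        r, c = divmod(i, cols)
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and processed[nr * cols + nc]:
                a = find(parent, i)
                b = find(parent, nr * cols + nc)
                if a != b:
                    parent[b] = a        # Union: attach the neighbour's set
                    area[a] += area[b]   # Update: accumulate the region area
        area_when_added[i] = area[find(parent, i)]
    return area_when_added

image = [
    [75, 78, 56, 62],
    [50, 58, 55, 53],
    [80, 65, 64, 60],
    [65, 55, 50, 55],
]
print(component_areas(image))
```

Every pixel is visited in intensity order, so each step depends on the ones before it; this ordering is the data dependency a parallel version has to work around.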
5. buildDirectedGraph
75 78 56 62
50 58 55 53
80 65 64 60
65 55 50 55
The value of each pixel's parent should be no less than the pixel's own value.
Memory usage: local memory (visited, members), shared memory
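One possible reading of the parent rule is sketched below: each pixel points to its lowest 4-neighbor that is no lower than itself, and a pixel with no such neighbor stays its own parent. The function name `build_directed_graph`, the 4-connectivity, and the tie-break by pixel index (used so that equal-valued pixels cannot form a cycle) are assumptions of this sketch, not details taken from the slides.

```python
def build_directed_graph(image):
    """Return (flat values, parent indices) for a small grayscale grid."""
    rows, cols = len(image), len(image[0])
    flat = [v for row in image for v in row]

    def key(i):
        # Order pixels by (value, index) so every pixel has a unique rank.
        return (flat[i], i)

    parent = list(range(rows * cols))
    for i in range(rows * cols):
        r, c = divmod(i, cols)
        higher = [
            nr * cols + nc
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if 0 <= nr < rows and 0 <= nc < cols and key(nr * cols + nc) > key(i)
        ]
        if higher:
            parent[i] = min(higher, key=key)  # closest neighbour from above
    return flat, parent

image = [
    [75, 78, 56, 62],
    [50, 58, 55, 53],
    [80, 65, 64, 60],
    [65, 55, 50, 55],
]
flat, parent = build_directed_graph(image)
roots = [i for i, p in enumerate(parent) if p == i]
print(parent)
print(roots)  # local maxima remain roots until trees are merged
```

In this sketch each pixel only reads its own neighborhood, so the loop body could run as one thread per pixel. Local maxima are left as separate roots, which a later merging step would have to connect.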
6. buildDirectedGraph
75 78 56 62
50 58 55 53
80 65 64 60
65 55 50 55
Memory usage: local memory (visited, members), shared memory
Also process the edges for the next step (block reduction)
7-9. Block Reduction
16×16, 8×8
10. Block Reduction
$\log_2 4 = 2$
$\log_2 2 = 1$
In total, $2 + 1 = 3$ iterations are needed
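The iteration count can be checked with a small helper. This sketch assumes that the 4 and the 2 are the numbers of blocks along the two image axes and that each iteration merges neighboring blocks pairwise along one axis; `reduction_iterations` is a hypothetical name.

```python
def reduction_iterations(blocks_x, blocks_y):
    """Pairwise-merge iterations needed to reduce a grid of blocks to one."""
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1.
    return (blocks_x - 1).bit_length() + (blocks_y - 1).bit_length()

print(reduction_iterations(4, 2))  # the 4-by-2 case from the slide
```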
11. Block Reduction
Load edge information to each pixel
[Figure: example pixel grid with the values of pixels along the block edges]
If (horizontal_pixelUpdate)
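How two blocks might be joined across one pair of edge pixels can be sketched as a merge of two ancestor chains, each sorted by value. This is an illustration under assumptions, not the project's kernel: `merge_across_edge` is a hypothetical name, every pixel is treated as its own tree node, and ties are broken by pixel index.

```python
def merge_across_edge(parent, value, a, b):
    """Merge the ancestor chains of adjacent pixels a and b in place."""
    def key(i):
        return (value[i], i)          # unique rank: value first, then index

    while a != b:
        if key(a) > key(b):
            a, b = b, a               # keep a as the lower of the two nodes
        p = parent[a]
        if p == a:                    # a is a root: hang it under b and stop
            parent[a] = b
            return
        if key(p) > key(b):           # b belongs between a and its old parent
            parent[a] = b
            a, b = b, p               # continue with b's chain and the rest of a's
        else:
            a = p                     # climb a's chain towards b's level

# Two 1-D blocks of three pixels each, already reduced to one chain per block.
value = [10, 30, 50, 20, 40, 60]
parent = [1, 2, 2, 4, 5, 5]
merge_across_edge(parent, value, 2, 3)   # pixels 2 and 3 touch at the block edge
print(parent)
```

After the call every pixel still has a parent whose value is no less than its own, and both blocks share a single root.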
12. Block Reduction
History buffer
13. Parent Compression
75 78 56 62
50 58 56 58
80 58 54 58
65 55 58 55
Shared memory based on parent locality
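The slide does not spell out the compression rule, so the sketch below shows one plausible variant: every pixel is re-pointed to a representative of its own intensity level, and each representative is re-pointed to the representative of the next level up. The names `level_root` and `compress_parents` are hypothetical.

```python
def level_root(parent, value, i):
    """Follow parents while they have the same value as pixel i."""
    while parent[i] != i and value[parent[i]] == value[i]:
        i = parent[i]
    return i

def compress_parents(parent, value):
    """Return a new parent array with equal-valued chains collapsed."""
    compressed = []
    for i in range(len(parent)):
        root = level_root(parent, value, i)
        if root != i:
            compressed.append(root)    # point straight at the level's representative
        elif parent[i] == i:
            compressed.append(i)       # the tree root keeps pointing to itself
        else:
            # A representative points at the representative of the next level.
            compressed.append(level_root(parent, value, parent[i]))
    return compressed

value = [5, 5, 5, 7, 7, 9]
parent = [1, 2, 3, 4, 5, 5]
print(compress_parents(parent, value))
```

Shorter chains mean fewer dependent memory reads when the tree is walked later, which matters when parents are not nearby in memory.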
14. FindRegion
- FindRoot, so that each region's tree can be processed separately
- Find each region's parent and child based on delta, so that the variation can be computed
- var = (area(parent) - area(child)) / area(current region)
- Send the region information to the CPU
- Scan every region's tree and find the regions with minimal variation; these are the MSERs
- Filter the regions
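The variation formula and the search for its minimum can be sketched on a single branch of the tree. The sketch assumes `area[i]` holds the region's area at intensity level `i`, takes the parent at level `i + delta` and the child at level `i - delta`, and treats a strict local minimum of the variation as a stable level; the function names and the sample areas are made up for illustration.

```python
def variations(area, delta):
    """var[i] = (area[i + delta] - area[i - delta]) / area[i], where defined."""
    return {
        i: (area[i + delta] - area[i - delta]) / area[i]
        for i in range(delta, len(area) - delta)
    }

def least_variation_levels(area, delta):
    """Levels whose variation is a strict local minimum along the branch."""
    var = variations(area, delta)
    stable = []
    for i in sorted(var):
        left = var.get(i - 1, float("inf"))
        right = var.get(i + 1, float("inf"))
        if var[i] < left and var[i] < right:
            stable.append(i)
    return stable

# Illustrative areas of one region as the threshold level rises.
area = [1, 2, 10, 11, 12, 13, 30, 60]
print(variations(area, 1))
print(least_variation_levels(area, 1))
```

The region barely grows around level 4 in this example, so that level has the smallest variation. A final filtering step would then discard regions that fail whatever additional criteria are applied.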
15-17. Performance Analysis
- Why is 8×8 better than 16×16?
- local memory usage
- recursion times
- block execution
- block reduction times
- parent locality
18. Performance Analysis
- GPU vs CPU timing
- intermediate values
- Synchronization
- record information
- memory transfer
19. Conclusion
- The data dependency is very large, but the problem can still be solved in parallel.
- The approach should suit multicore microprocessors, whose individual cores are much stronger than a single GPU thread.
- The bottleneck is still memory.
20. Future Work
[Figure: example pixel grid with the values of pixels along the block edges]
- More efficient block reduction (decoder and encoder)
- Random memory access
- GPU code efficiency