Title: A Music-driven Video Summarization System Using Content-aware Mechanisms
1 A Music-driven Video Summarization System Using Content-aware Mechanisms
- CMLab of CSIE, NTU
2 Outline
- Introduction
- The problem / Proposed solution
- Related works
- System framework
- Media analysis
- Audio/video analysis
- Importance function
- Synchronization (combining video with audio)
- Profiles: Rhythmic / Medium
- Parameters: Sequential / non-Sequential
- Demonstration
- Experimental results
- Conclusions and future work
3 Introduction: The Problem
- Digital video capture devices such as DVs have become more affordable for end users.
- There is still a tremendous barrier between amateurs (home users) and powerful video editing software (Adobe Premiere, CyberLink PowerDirector).
- It is interesting to shoot videos but frustrating to edit them.
- Finally, people leave their precious shots in piles of DV tapes without editing or management.
4 Introduction: Users' Impatience
- According to a survey on DVworld, there is a relation between video length and how many times users will review a video after some days.
- Video clips of no more than 5 minutes are best for human concentration.
- People are impatient with videos that have no scenario or voice-over, especially those with no music.
http://www.DVworld.com.tw/
5 Introduction: Proposed Solution
- The music-driven video as summarization
- One study at MIT showed that improved soundtrack quality improves perceived video image quality.
- Synchronizing video and audio segments enhances the perception of both.
- Proposed solution
- Create a musical video from home videos.
- The synchronization is done by making the rhythm of the video fit that of the audio.
- Because of users' direct sympathetic response to music, the created musical video is professional looking and more entertaining.
6 Introduction: Related Works
- Literature
- Jonathan Foote, Matthew D. Cooper, and Andreas Girgensohn, "Creating music videos using automatic media analysis," ACM Multimedia 2002, pp. 553-560.
- A consumer product called muvee autoProducer has been announced to ease the burden of professional video editing.
- Content-analysis technologies have been developed for years; can we use them to help the auto-creation of musical videos?
- The content-aware mechanisms
7 System Framework
[Figure: proposed framework. Audio features: volume, ZCR, brightness, bandwidth. Video features: human face, flashlight, motion strength, color variance, camera operation, ...]
8 Media Analysis: Audio Features
- Frame-level features
- Time-domain features
- Volume: defined as the MSR of the audio samples in each frame
- ZCR: the number of times the audio waveform crosses the zero axis in each frame
- Frequency-domain features
- Brightness: the centroid of the frequency spectrum
- Bandwidth: the standard deviation of the frequency spectrum
[Figure: frame-level audio feature curves from 0s to 90s]
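As a rough sketch (not the authors' exact implementation), the four frame-level features could be computed per short frame as follows; interpreting volume as RMS energy and weighting the centroid/deviation by the normalized spectrum are assumptions.

```python
import numpy as np

def frame_audio_features(samples, sr, frame_len=0.02):
    """Frame-level audio features: volume (RMS), ZCR, brightness, bandwidth."""
    n = int(sr * frame_len)                              # samples per frame
    feats = []
    for start in range(0, len(samples) - n, n):
        frame = samples[start:start + n].astype(np.float64)
        volume = np.sqrt(np.mean(frame ** 2))            # RMS energy
        zcr = np.sum(np.abs(np.diff(np.sign(frame)))) / 2  # zero crossings per frame
        spec = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(n, d=1.0 / sr)
        p = spec / (spec.sum() + 1e-12)                   # normalized spectrum
        brightness = np.sum(freqs * p)                    # spectral centroid
        bandwidth = np.sqrt(np.sum(((freqs - brightness) ** 2) * p))
        feats.append((volume, zcr, brightness, bandwidth))
    return np.array(feats)
```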
9 Media Analysis: Audio Analysis
- Generally the brightness curve is almost the same as the ZCR curve, so here we use the ZCR feature only.
- Bandwidth is an important audio feature, but we cannot easily tell the real physical meaning in the music when the bandwidth reaches its high/low values.
- Furthermore, the relation between musical perception and bandwidth values is neither clear nor regular.
[Figure: brightness and ZCR curves over 12s-34s]
10 Media Analysis: Audio Segmentation
- First we cut the input audio into clips where the volume changes dramatically.
- For each clip, we define a burst of ZCR as an attack, which may be the beat of a bass drum or the voice of a singer.
11 Media Analysis: Audio Segmentation
- The dramatic volume change defines the audio clip boundary, while the bursts of ZCR (attacks) in each clip define the granular sub-segments within it.
- Besides, we define the dynamics of an audio clip as our clip-level feature.
- Faster-tempo music usually has clips with higher audio dynamics.
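A minimal sketch of this segmentation, assuming precomputed frame-level volume and ZCR arrays; the thresholds and the "attacks per second" definition of clip dynamics are illustrative assumptions, since the slides do not give the exact rules.

```python
import numpy as np

def segment_audio(volume, zcr, vol_jump=2.0, zcr_burst=2.0, frames_per_sec=50):
    """Cut audio into clips at dramatic volume changes and mark ZCR attacks."""
    # Clip boundaries: frames where volume jumps well above its recent mean.
    boundaries = [0]
    for i in range(1, len(volume)):
        recent = np.mean(volume[max(0, i - frames_per_sec):i]) + 1e-9
        if volume[i] > vol_jump * recent:
            boundaries.append(i)
    boundaries.append(len(volume))

    clips = []
    for a, b in zip(boundaries[:-1], boundaries[1:]):
        mean_zcr = np.mean(zcr[a:b]) + 1e-9
        attacks = [i for i in range(a, b) if zcr[i] > zcr_burst * mean_zcr]
        # Clip dynamics approximated as attacks per second (an assumption).
        dynamics = len(attacks) / ((b - a) / frames_per_sec)
        clips.append({"start": a, "end": b, "attacks": attacks, "dynamics": dynamics})
    return clips
```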
12 Media Analysis: Video Analysis
- First we apply shot change detection to segment the video into shots.
- Here we use a combination of the pixel MAD (Mean Absolute Difference) and pixel histogram difference methods to detect shot changes.
- The hybrid method performs well for home videos!
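A hedged sketch of such a hybrid detector: a cut is declared only when both the pixel MAD and the histogram difference are large. The thresholds and histogram binning are illustrative, not the paper's values.

```python
import cv2
import numpy as np

def detect_shot_changes(video_path, mad_thresh=30.0, hist_thresh=0.5):
    """Hybrid shot-change detection combining pixel MAD and histogram difference."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_gray, prev_hist, idx = [], None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_gray is not None:
            mad = np.mean(cv2.absdiff(gray, prev_gray))        # pixel-wise MAD
            hdiff = 0.5 * np.sum(np.abs(hist - prev_hist))     # histogram difference
            if mad > mad_thresh and hdiff > hist_thresh:       # require both cues
                cuts.append(idx)
        prev_gray, prev_hist, idx = gray, hist, idx + 1
    cap.release()
    return cuts
```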
13 Media Analysis: Video Analysis
- Shot heterogeneity
- Here we use the MPEG-7 ColorLayout descriptor to measure each frame's similarity.
- Used to measure a video shot's variety.
[Figure: example shots with high vs. low heterogeneity]
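A simplified, non-conformant approximation of the ColorLayout idea (8x8 downsample, YCrCb, a few DCT coefficients without the zig-zag scan), with shot heterogeneity taken as the spread of frame descriptors; both simplifications are assumptions.

```python
import cv2
import numpy as np

def color_layout(frame, coeffs=6):
    """Approximate ColorLayout: 8x8 downsample, YCrCb, first DCT coefficients."""
    small = cv2.resize(frame, (8, 8), interpolation=cv2.INTER_AREA)
    ycc = cv2.cvtColor(small, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    desc = []
    for c in range(3):
        d = cv2.dct(ycc[:, :, c])
        desc.extend(d.flatten()[:coeffs])
    return np.array(desc)

def shot_heterogeneity(frames):
    """Heterogeneity as the mean distance of frame descriptors to their centroid."""
    descs = np.array([color_layout(f) for f in frames])
    centroid = descs.mean(axis=0)
    return float(np.mean(np.linalg.norm(descs - centroid, axis=1)))
```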
14 Media Analysis: Camera Operation
- Camera operations such as pan or zoom are widely used in amateur home videos. Detecting those camera operations helps capture the video taker's intention.
- Our camera operation detection is performed on the basis of block-based motion vectors.
- This method is simple and efficient.
- If neither a pan nor a zoom pattern is found in the motion vectors, the frame is labeled as having no camera operation.
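One plausible realization of this rule on a grid of block motion vectors: a strong consistent global direction suggests a pan, motion diverging from (or converging to) the frame center suggests a zoom, otherwise no camera operation. The thresholds are illustrative assumptions.

```python
import numpy as np

def classify_camera_operation(mv, pan_thresh=1.0, zoom_thresh=1.0):
    """Classify a frame's camera operation from block motion vectors.

    mv: array of shape (H, W, 2) with one (dx, dy) vector per block."""
    h, w, _ = mv.shape
    mean_v = mv.reshape(-1, 2).mean(axis=0)
    if np.linalg.norm(mean_v) > pan_thresh:      # consistent global translation
        return "pan"
    # Radial component: projection of each vector on its direction from the center.
    ys, xs = np.mgrid[0:h, 0:w]
    radial = np.stack([xs - (w - 1) / 2.0, ys - (h - 1) / 2.0], axis=-1)
    radial /= (np.linalg.norm(radial, axis=-1, keepdims=True) + 1e-9)
    divergence = np.mean(np.sum(mv * radial, axis=-1))
    if abs(divergence) > zoom_thresh:            # vectors point in/out of the center
        return "zoom"
    return "none"                                # otherwise, no camera operation
```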
15 Media Analysis: Video Features
- High-level features
- Human face feature
- Use the face detector in the OpenCV library
- Face feature ratio
- Flashlight feature
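A minimal sketch of the face feature using OpenCV's bundled Haar cascade; interpreting the "face feature ratio" as face area over frame area is an assumption.

```python
import cv2

# Haar cascade face detector shipped with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_feature_ratio(frame):
    """Fraction of the frame area covered by detected faces (assumed definition)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    frame_area = frame.shape[0] * frame.shape[1]
    face_area = sum(w * h for (x, y, w, h) in faces)
    return min(face_area / frame_area, 1.0)
```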
16 Media Analysis: Video Features
- Medium-level features
- Medium-level features represent frames that are dynamic (higher motion activity) in nature.
- Motion strength
- Static frames tend to make people lose their patience when watching videos.
- Camera motion types
- None, Pan, Zoom
- Importance: Zoom > Pan > None
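The slides do not give an exact motion-strength measure; one common proxy, shown here as an assumption, is the mean magnitude of dense optical flow between consecutive frames.

```python
import cv2
import numpy as np

def motion_strength(prev_frame, frame):
    """Motion strength as the mean dense optical-flow magnitude (assumed proxy)."""
    g0 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    return float(np.mean(np.linalg.norm(flow, axis=2)))
```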
17 Media Analysis: Video Features
- Low-level features
- Modeling frames which are better to be seen, i.e., used for selecting high-quality frames in the final production.
- Frame brightness (luminance)
18 Media Analysis: Video Features
- Color variance
- We use histogram distributions to model the color variance.
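A small sketch covering both low-level features; computing the color variance from the hue histogram is an assumed interpretation of "histogram distributions".

```python
import cv2
import numpy as np

def low_level_features(frame):
    """Low-level quality features: mean luminance and histogram-based color variance."""
    ycc = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    brightness = float(np.mean(ycc[:, :, 0]))              # frame luminance
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [32], [0, 180]).flatten()
    p = hist / (hist.sum() + 1e-9)                          # hue distribution
    mean_bin = np.sum(p * np.arange(32))
    color_variance = float(np.sum(p * (np.arange(32) - mean_bin) ** 2))
    return brightness, color_variance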
19 Media Analysis: Importance Functions
- Video frame-level importance
A scaling factor, Sa, is defined from the accompanying audio clip's dynamics (Adynamic).
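The slide shows the actual importance formula only as an image; the sketch below is a hypothetical weighted sum of the features above, with Sa scaling the motion term by the audio dynamics (see the later synchronization slide). All weights are assumptions.

```python
def frame_importance(face_ratio, flashlight, motion, camera_op, brightness,
                     color_variance, audio_dynamics,
                     weights=(0.3, 0.1, 0.25, 0.15, 0.1, 0.1)):
    """Hypothetical frame-level importance; features assumed normalized to [0, 1]."""
    camera_score = {"zoom": 1.0, "pan": 0.5, "none": 0.0}[camera_op]
    s_a = min(max(audio_dynamics, 0.0), 1.0)   # Sa derived from Adynamic
    return (weights[0] * face_ratio + weights[1] * flashlight +
            weights[2] * s_a * motion +        # motion counts less for slow music
            weights[3] * camera_score +
            weights[4] * brightness + weights[5] * color_variance)
```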
20 Media Analysis: Importance Functions
- Video segments with higher scores may contain human faces, have higher motion strength, or contain zooms and pans, depending on which features make them reach high values.
21 Media Analysis: Importance Functions
- Shot-level importance
- The shot-level importance is motivated by the following observations:
- Shots with larger motion intensity take longer duration.
- The presence of a face attracts the viewer.
- Shots of higher heterogeneity can take longer playing time.
- Shots with more camera operations are more important.
- Of course, shots with longer length are more important.
- Static shots take shorter playing time, while dynamic shots can take longer; this gives better results after editing.
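A hypothetical combination of these observations into a single shot-level score; the weights and the linear form are assumptions, not the paper's formula.

```python
def shot_importance(motion_intensity, face_presence, heterogeneity,
                    camera_op_count, shot_length,
                    weights=(0.25, 0.25, 0.2, 0.15, 0.15)):
    """Hypothetical shot-level importance; inputs assumed normalized to [0, 1]
    (shot_length relative to the longest shot in the video)."""
    return (weights[0] * motion_intensity + weights[1] * face_presence +
            weights[2] * heterogeneity + weights[3] * camera_op_count +
            weights[4] * shot_length)
```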
22 Synchronization: Profiles
- General properties of home videos
- The proposed four profiles
23 Synchronization: Mechanisms
- Before we talk about the synchronization process, first we introduce the video reduction rate Rva.
- Basic synchronization mechanisms
[Figure: original video timeline, n shots]
24 Synchronization: Rhythmic Profile
- A basic synchronization unit (BSU)
- consists of a starting time and a stopping time in the audio
- e.g., an audio segment from the 25th second to the 31st second
- In the medium profile, we use the LBSU (Larger BSU), which may be 2 or 3 BSUs in length.
25 Synchronization: Rhythmic Profile
- For each BSU, the starting and stopping points of the BSU are projected back to the video timeline.
- Search the projected range to find candidate shots with the same length as the BSU.
- We apply an audio scaling coefficient in the synchronization stage. The weight of the motion intensity of video shots is decreased when aligned with a slow audio clip, while it is nearly preserved when synchronized with a fast audio clip.
[Figure: BSU boundaries projected onto the video timeline]
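A sketch of this projection-and-search step, assuming a frame-level importance array, BSU boundaries in seconds, and Rva as the factor that maps audio time back to the original video timeline; the window search and scoring are assumptions.

```python
def synchronize_rhythmic(bsus, frame_scores, r_va, fps=30):
    """For each BSU, pick the BSU-length video window with the highest
    average frame importance inside the projected range (sketch)."""
    selections = []
    for start_s, stop_s in bsus:                     # BSU boundaries in seconds
        need = int((stop_s - start_s) * fps)         # frames required for this BSU
        lo = int(start_s * r_va * fps)               # projected range on the video
        hi = min(int(stop_s * r_va * fps), len(frame_scores))
        best, best_score = lo, float("-inf")
        for s in range(lo, max(hi - need, lo) + 1):
            score = sum(frame_scores[s:s + need]) / need
            if score > best_score:
                best, best_score = s, score
        selections.append((best, best + need))       # chosen video frame range
    return selections
```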
26 Synchronization: Medium Profile
- Each shot is reassigned a new length according to its shot importance; shots may become longer or shorter in proportion to the total length.
- After projecting to the video space, the length budget is calculated according to the reduction rate, and then the budget is allocated to each inner shot according to its length.
- If the allocated shot length is too short (in frames), its budget is transferred to neighboring shots.
[Figure: shot length budget allocated between the video and audio timelines]
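A sketch of the budget allocation under stated assumptions: the minimum length threshold is illustrative, and a shot that falls below it is dropped and its budget redistributed among the remaining shots (the slide transfers it to neighboring shots specifically).

```python
def allocate_shot_lengths(shot_lengths, target_total, min_len=15):
    """Split the target length budget among shots in proportion to their
    (importance-reassigned) lengths; reclaim budget from too-short shots."""
    active = list(range(len(shot_lengths)))
    alloc = [0] * len(shot_lengths)
    while active:
        total_len = sum(shot_lengths[i] for i in active) or 1.0
        proposal = {i: target_total * shot_lengths[i] / total_len for i in active}
        too_short = [i for i in active if proposal[i] < min_len]
        if not too_short:
            for i in active:
                alloc[i] = int(round(proposal[i]))
            break
        # Drop too-short shots; their budget goes back to the remaining shots.
        active = [i for i in active if i not in too_short]
    return alloc
```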
27 Demonstration: Sample Videos
28 Experimental Results
The users' patience test result
- We invited 20 people to join this subjective test; 10 of them have a computer science background and 10 do not.
The performance result of the music-driven summarization
29 Experimental Results
Answers about the comparison of the rhythmic and medium profiles
Answers about the matching of video with audio tempos
30 Conclusions
- We have proposed and implemented a music-driven video summarization system that can help home users post-process their creations in a fully automatic way.
- Many content-aware mechanisms are also proposed to analyze the input media. We combine the input video and audio according to their content features to form our musical videos.
- According to our subjective tests, all of the testers were amazed by our system and found it very impressive. Most of the testers would be glad to have such a tool to help them edit their creations.
- Besides, our proposed system and content-aware mechanisms have also been adopted by CyberLink Corp., and commercialization is planned.
31 Future Work
- It would be better to have users' feedback telling us which shots are "must have", which shots are "better to have", and which shots should be dropped.
- In our work, we include proper transition effects between video shots. But we think the transition effects should consider the characteristics of both the accompanying audio clip and the video content.
- By exploiting more audio and video features and gaining more understanding of digital content semantics, we can get even better results, and the automatic video editing system can get closer to professional editors.