We propose a novel method that exploits the correlation between the audio and visual dynamics of a video to segment and localize the objects that are the dominant source of audio. Our approach consists of a two-step spatiotemporal segmentation mechanism that relies on the velocity and acceleration of moving objects as visual features. Each frame of the video is segmented into regions based on motion and appearance cues using the QuickShift algorithm, and these regions are then clustered over time using K-means to obtain a spatiotemporal video segmentation. The video is represented by motion features computed over individual segments. The Mel-Frequency Cepstral Coefficients (MFCC) of the audio signal and their first-order derivatives are used to represent the audio. The proposed framework assumes there is a non-trivial correlation between these audio features and the velocity and acceleration of the moving and sounding objects. Canonical correlation analysis (CCA) is used to identify the moving objects that are most correlated with the audio signal. In addition to moving-sounding object identification, the same framework is also used to solve the problem of audio-video synchronization and to aid interactive segmentation. We evaluate the performance of our proposed method on challenging videos. Our experiments demonstrate a significant increase in performance over the state of the art, both qualitatively and quantitatively, and validate the feasibility and superiority of our approach.
A very simplified algorithmic overview of the proposed approach can be seen in the figure below, and involves the following key steps:
(1) Given a video, optical flow and its temporal derivative are first computed.
(2) For each frame, the image intensity (RGB), the optical flow, and its derivative are used to segment the frame by clustering pixels with the QuickShift algorithm.
(3) Segments thus obtained are then clustered over multiple frames to obtain spatiotemporal regions. The K-means algorithm is used for this step.
(4) The video is then represented as a matrix with as many columns as there are frames and as many rows as there are spatiotemporal regions. The element (i, j) of the matrix is a concatenation of the per-frame mean velocity and acceleration of the ith spatiotemporal region in the jth frame.
(5) The MFCC features and their derivatives are similarly computed from the audio signal and represented as a matrix whose columns correspond to frames and whose rows correspond to the MFCCs and their derivatives.
(6) Canonical correlation analysis is performed between the video and audio matrices to obtain the visual and audio canonical bases that maximize the correlation between the two signals.
(7) The elements of the first visual canonical basis vector are used to identify spatiotemporal regions that are highly correlated with the audio signal.
(8) The same process is also used to find the offset between audio and video signals of unsynchronized videos by choosing the offset which maximizes the canonical correlation between them.
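Steps (4) and (5) can be illustrated with a minimal NumPy sketch. It is not the original implementation: the function names `region_motion_matrix` and `audio_matrix` are hypothetical, the spatiotemporal label maps, optical flow and MFCCs are assumed to be precomputed, acceleration is approximated by a finite difference of the flow over time, and each region is assumed to occupy four consecutive rows (vx, vy, ax, ay) of the video matrix.

```python
import numpy as np

def region_motion_matrix(labels, flow):
    """labels: (T, H, W) integer region ids in [0, R).
    flow:   (T, H, W, 2) per-pixel optical flow (velocity).
    Returns a (4 * R, T) matrix; rows 4*i .. 4*i+3 hold the per-frame
    mean (vx, vy, ax, ay) of region i."""
    T = labels.shape[0]
    R = int(labels.max()) + 1
    # Assumption: acceleration is the temporal finite difference of the flow.
    accel = np.gradient(flow, axis=0)
    feats = np.concatenate([flow, accel], axis=-1)  # (T, H, W, 4)
    out = np.zeros((R, 4, T))
    for t in range(T):
        lab = labels[t].ravel()
        # Pixel count per region; regions absent from a frame keep a zero mean.
        counts = np.maximum(np.bincount(lab, minlength=R), 1)
        for c in range(4):
            sums = np.bincount(lab, weights=feats[t, :, :, c].ravel(), minlength=R)
            out[:, c, t] = sums / counts
    return out.reshape(R * 4, T)

def audio_matrix(mfcc):
    """mfcc: (n_coeff, T) precomputed MFCCs, one column per video frame.
    Returns (2 * n_coeff, T): the MFCCs stacked on their first-order derivatives."""
    return np.vstack([mfcc, np.gradient(mfcc, axis=1)])

# Toy example: two regions (left and right half); the right one speeds up linearly.
T, H, W = 5, 4, 4
labels = np.zeros((T, H, W), dtype=int)
labels[:, :, W // 2:] = 1
flow = np.zeros((T, H, W, 2))
flow[:, :, W // 2:, 0] = np.arange(T)[:, None, None]

V = region_motion_matrix(labels, flow)
A = audio_matrix(np.arange(15, dtype=float).reshape(3, T))
```

Both matrices have one column per frame, which is the layout the canonical correlation step expects.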
Canonical Correlation Analysis (CCA) proposed by Hotelling determines the correlation between two multi-dimensional random variables by finding a linear transformation of the first variable that is most correlated to some linear transformation of the second variable. As illustrated in the figure below, the model reveals how well two random variables can be transformed to a common source. We use CCA to find pairs of canonical bases in visual and auditory domains respectively, that maximize the correlation between their respective projections.
CCA yields not only the maximum correlation between the two signals, but also the corresponding vectors that project these signals into a space where the correlation is maximized. This is important in our case because our goal is not the correlation value itself, but the specific objects (or regions) that are actually related to (or contribute towards) the audio signal.
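For illustration, the first pair of canonical basis vectors and the associated correlation can be computed with a few lines of NumPy. This is a textbook formulation (whitening followed by a singular value decomposition), not necessarily the solver used in the original work. The function name `first_canonical_pair` and the small ridge term `reg` are assumptions added for this sketch.

```python
import numpy as np

def _inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(C)
    return (Q / np.sqrt(w)) @ Q.T

def first_canonical_pair(X, Y, reg=1e-8):
    """X: (p, T), Y: (q, T), one column per frame.
    Returns (wx, wy, rho): the first canonical basis vectors and their correlation."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    n = X.shape[1] - 1
    # Covariance estimates; `reg` keeps them invertible (an assumption of this sketch).
    Cxx = X @ X.T / n + reg * np.eye(X.shape[0])
    Cyy = Y @ Y.T / n + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T / n
    Kx, Ky = _inv_sqrt(Cxx), _inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return Kx @ U[:, 0], Ky @ Vt[0], s[0]

# Toy example: one row of X and one row of Y share a common source s.
rng = np.random.default_rng(0)
T = 500
s = rng.standard_normal(T)
X = np.vstack([s + 0.1 * rng.standard_normal(T), rng.standard_normal(T)])
Y = np.vstack([rng.standard_normal(T), 2 * s + 0.1 * rng.standard_normal(T)])
wx, wy, rho = first_canonical_pair(X, Y)
```

In this toy example the largest entries of `wx` and `wy` fall on the rows that carry the shared source, which is the property the localization step relies on.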
Specifically, audio source localization is performed by considering the first visual canonical basis. The elements of the basis vector correspond to the different spatiotemporal regions in the video. The higher the value of an element, the more the corresponding region contributes to the canonical correlation. By thresholding these values, we can identify the spatiotemporal regions, and consequently the pixels in the video, that are highly likely to be the sources of the audio signal.
The video below shows multiple examples of audio source localization along with the corresponding ground truths.
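The thresholding step might look like the sketch below. Several details are assumptions and are not taken from the original work: the function name `localize_sources`, the layout of four consecutive basis entries per region, the use of the Euclidean norm to score each region, and a threshold expressed as a fraction (`rel_thresh`) of the largest score.

```python
import numpy as np

def localize_sources(w_visual, labels, n_feats=4, rel_thresh=0.5):
    """w_visual: first visual canonical basis vector, n_feats entries per region.
    labels:   (T, H, W) integer map of spatiotemporal region ids.
    Returns the selected region ids and a boolean pixel mask shaped like labels."""
    # One score per region: magnitude of its block of basis entries (assumption).
    score = np.linalg.norm(np.asarray(w_visual, dtype=float).reshape(-1, n_feats), axis=1)
    # Keep regions whose score reaches a fraction of the best one (assumption).
    selected = np.flatnonzero(score >= rel_thresh * score.max())
    mask = np.isin(labels, selected)
    return selected, mask

# Toy example with three regions; region 1 dominates the basis vector.
w = np.array([0.1, 0.0, 0.0, 0.0,
              0.9, 0.2, 0.0, 0.0,
              0.05, 0.0, 0.0, 0.0])
labels = np.array([[[0, 1, 2],
                    [1, 1, 0]]])
selected, mask = localize_sources(w, labels)
```

The returned mask marks every pixel of the selected regions in every frame, which can then be compared against a ground-truth segmentation.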
The following video contains examples of unsynchronized clips whose audio-video offset has been automatically determined using the proposed method.
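The offset search of step (8) can be sketched as an exhaustive scan over candidate shifts, keeping the shift with the largest first canonical correlation. This is only a minimal illustration: the names `first_canonical_corr` and `estimate_offset`, the search range `max_shift`, and the QR-based computation of the canonical correlation are assumptions of this sketch, and the feature matrices are assumed to have full row rank.

```python
import numpy as np

def first_canonical_corr(X, Y):
    """Largest canonical correlation between X (p, T) and Y (q, T)."""
    Xc = (X - X.mean(axis=1, keepdims=True)).T
    Yc = (Y - Y.mean(axis=1, keepdims=True)).T
    # Orthonormal bases of the centered data; the singular values of
    # Qx^T Qy are the canonical correlations.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0]

def estimate_offset(V, A, max_shift):
    """Pairs video frame t with audio frame t + d for each candidate d and
    returns the d with the highest canonical correlation, plus all scores."""
    T = V.shape[1]
    scores = {}
    for d in range(-max_shift, max_shift + 1):
        if d >= 0:
            Vd, Ad = V[:, :T - d], A[:, d:]
        else:
            Vd, Ad = V[:, -d:], A[:, :T + d]
        scores[d] = first_canonical_corr(Vd, Ad)
    return max(scores, key=scores.get), scores

# Toy example: the audio lags the video by 3 frames.
rng = np.random.default_rng(1)
T = 300
s = rng.standard_normal(T)
V = np.vstack([s, rng.standard_normal(T)])
A = np.vstack([np.roll(s, 3) + 0.1 * rng.standard_normal(T), rng.standard_normal(T)])
offset, scores = estimate_offset(V, A, max_shift=10)
```

Each candidate shift uses a slightly different number of overlapping frames, so for short clips the scores of large shifts are less reliable than those of small ones.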
Some of the quantitative results of the proposed method are shown in the figure below. These have been generated by comparing the output of each experiment with the corresponding ground truth. More details can be found in the related publication linked below.
Correlation Sequences [87 MB]
Audio Video Correlation [65 MB]