Video Classification using Semantic Concept Co-occurrences
Classification of complex videos is an active area of research in computer vision. Despite the complicated nature of unconstrained videos, they can be described as a collection of simpler lower-level concepts, such as candle blowing, walking, clapping, etc. Therefore, a typical approach to video categorization is to first apply concept detectors to different segments of the test video and form a histogram of the concepts occurring therein. Next, a trained classifier determines which class the histogram belongs to. In this paper, we propose an approach to complex video classification that models the context using the pairwise co-occurrence of concepts. The Generalized Maximum Clique Problem (GMCP) is useful in situations where there are multiple potential solutions for a number of subproblems, along with a global criterion to satisfy. We use GMCP to select a set of concepts in different clips of the video such that they are holistically in agreement. Thus, a concept that is out of context in the whole video does not appear in our results, whereas such concepts are common when concept detection is done on each clip individually. We also propose a new solution to GMCP using Mixed Binary Integer Programming. We develop a class-specific co-occurrence model and propose a method which uses GMCP as the classifier and the class-specific co-occurrence models learnt from a training set as the representation of the classes. We argue that this representation is more semantically meaningful and faster to compute than traditional representations, such as the collection of concept histograms of class videos. We show that the proposed classification method significantly outperforms the baseline, particularly for videos which include enough contextual cues. We classify a video directly by discovering the underlying co-occurrence pattern therein and fitting it to the learnt co-occurrence patterns of different classes.
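To make the selection idea concrete, here is a minimal sketch that picks one concept per clip so that the sum of detector confidences and pairwise co-occurrence weights is maximal. It is not the Mixed Binary Integer Programming solution proposed in the paper: it enumerates every combination by brute force, which is only feasible for toy inputs. The function name `select_concepts`, the additive scoring, and all numbers are assumptions made for illustration.

```python
from itertools import product

def select_concepts(unary, cooc):
    """Pick one concept per clip, maximizing unary plus pairwise scores.

    unary: list with one dict per clip, {concept: detector confidence}
    cooc:  dict {(concept_a, concept_b): co-occurrence weight}
    Brute-force enumeration (illustrative only, exponential in the clip count).
    """
    def pair(a, b):
        # Look the pair up in either order; unseen pairs contribute nothing.
        return cooc.get((a, b), cooc.get((b, a), 0.0))

    best_score, best_labels = float("-inf"), None
    for labels in product(*(clip.keys() for clip in unary)):
        score = sum(clip[c] for clip, c in zip(unary, labels))
        score += sum(pair(labels[i], labels[j])
                     for i in range(len(labels))
                     for j in range(i + 1, len(labels)))
        if score > best_score:
            best_score, best_labels = score, labels
    return list(best_labels), best_score

# Hypothetical confidences for three clips of one video.
unary = [
    {"blowing candles": 0.6, "welding": 0.7},
    {"clapping": 0.8, "sawing": 0.3},
    {"singing": 0.5, "hammering": 0.4},
]
# Hypothetical co-occurrence weights.
cooc = {
    ("blowing candles", "clapping"): 0.9,
    ("blowing candles", "singing"): 0.8,
    ("clapping", "singing"): 0.7,
    ("welding", "sawing"): 0.6,
    ("welding", "hammering"): 0.6,
    ("sawing", "hammering"): 0.7,
}
labels, score = select_concepts(unary, cooc)
print(labels)
```

In this toy input the detector alone would label the first clip "welding" (0.7 versus 0.6), but the joint selection prefers "blowing candles" because it agrees with the concepts chosen in the other clips.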
We migrate from the conventional vector representations to the richer matrix representation, which is fundamental to the rest of our clique-based framework. Our method not only incorporates the relationship of all concepts in one clip, but also fuses the information among different clips of the video. This is particularly important for contextual concept detection in long videos. Here you can see the class representation of a few events.
TRECVID11-MED and TRECVID12-MED are currently among the most challenging datasets of complex events. We evaluate the proposed framework on the EC11, EC12 and DEVT datasets. DEVT (8100 videos) is part of TRECVID-MED 2011 and contains fifteen complex events: Boarding trick, Feeding animal, Landing fish, Wedding, Wood working project, Birthday party, Changing tire, Flash mob, Vehicle unstuck, Grooming animal, Making sandwich, Parade, Parkour, Repairing appliance, and Sewing project. EC11 and EC12 are subsets of the TRECVID-MED 2011 and 2012 datasets and include 2,062 videos with annotated clips; in each video, the beginning and end of the video segments in which one of our 93 concepts occurs are marked manually, resulting in a total of 10,950 annotated clips. EC12 includes ten additional events from TRECVID-MED 2012. Note that the annotated clips (shots) are used only for training concept detectors and evaluating the concept detection results. The annotated clips in query videos are not used during test; we employ a sliding window approach for detecting the concepts in them (see sec. 5.2). In order to train concept detectors, we extracted Motion Boundary Histogram (MBH) features from the annotated clips and computed a histogram of visual words for each. Then, we trained 93 binary SVMs with a χ² kernel using the computed histograms of visual words.
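As a rough illustration of the kernel step, the sketch below compares two histograms of visual words with a χ² kernel. The text names only a "χ² kernel"; the exponential form used here, the `gamma` parameter, and the helper name `chi2_kernel` are assumptions, and the histograms are toy values rather than real MBH features.

```python
import math

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel between two non-negative histograms.

    k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))
    Bins where both histograms are zero are skipped to avoid dividing by zero.
    """
    dist = 0.0
    for a, b in zip(x, y):
        if a + b > 0:
            dist += (a - b) ** 2 / (a + b)
    return math.exp(-gamma * dist)

# Toy normalized histograms over three visual words (illustrative values).
h1 = [0.5, 0.5, 0.0]
h2 = [0.5, 0.0, 0.5]
k12 = chi2_kernel(h1, h2)
k11 = chi2_kernel(h1, h1)
print(k12, k11)
```

A kernel matrix built this way over the training histograms could then be passed to any SVM implementation that accepts precomputed kernels, with one binary classifier per concept.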
We evaluated the proposed GMCP-based concept detection method on EC11 and EC12 using a 10-fold cross-validation scenario. We extracted the reference co-occurrence matrix using the annotated clips of 9 folds and used the rest of the videos for testing.
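A co-occurrence matrix of this kind can be sketched as follows, assuming each training video is reduced to the set of concepts annotated in it. The row normalization (dividing by the number of videos that contain the row concept, giving an estimate of a conditional probability) is an assumption, as are the function name `cooccurrence_matrix` and the toy annotations. In the cross-validation setting, the list passed in would hold the videos of the nine training folds.

```python
from itertools import combinations

def cooccurrence_matrix(videos, concepts):
    """Estimate pairwise concept co-occurrence from annotated videos.

    videos:   list of sets, the concepts annotated in each training video
    concepts: ordered list of concept names (defines row/column order)
    Returns a list of rows; entry [i][j] is the fraction of videos containing
    concept i that also contain concept j (an assumed normalization).
    """
    index = {c: i for i, c in enumerate(concepts)}
    n = len(concepts)
    counts = [[0] * n for _ in range(n)]
    for annotated in videos:
        present = sorted(index[c] for c in annotated if c in index)
        for i in present:
            counts[i][i] += 1          # videos containing concept i
        for i, j in combinations(present, 2):
            counts[i][j] += 1          # videos containing both i and j
            counts[j][i] += 1
    matrix = []
    for i in range(n):
        total = counts[i][i]
        matrix.append([counts[i][j] / total if total else 0.0
                       for j in range(n)])
    return matrix

# Toy annotations for four training videos (illustrative only).
concepts = ["blowing candles", "clapping", "singing"]
videos = [
    {"blowing candles", "clapping"},
    {"blowing candles", "singing"},
    {"blowing candles", "clapping", "singing"},
    {"clapping"},
]
matrix = cooccurrence_matrix(videos, concepts)
for name, row in zip(concepts, matrix):
    print(name, row)
```

With this normalization the matrix is generally not symmetric: here every video with "singing" also contains "blowing candles", but not the other way round.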
In this experiment, we evaluate the performance of the method described in section 3, where both concept detection and event classification are performed by GMCP. In order to keep the test and training sets totally disjoint, we extracted the event-specific co-occurrence matrices (samples shown in fig. 4) from the annotations of EC11 clips, and used DEVT videos as the test set. We applied the SVM concept detectors to sliding windows of 180 frames (the average size of clips in the annotated set) with a displacement of 30 frames in each step. Therefore, each uniform clip of 180 frames has over 50% overlap with six windows on which the SVM concept detectors were applied. We pick the window in which the highest SVM confidence value falls to represent the clip. We employ this approach since the beginning and end of the concept shots in a test video are unknown, and they are often overlapping. We ignore the clips for which the highest SVM confidence is less than 10%.
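The windowing procedure can be sketched as below, assuming detector confidences are already available per window as values in [0, 1]. The helper names, the dictionary layout of `window_scores`, and the toy confidences are assumptions; only the window size (180), the step (30), the more-than-50% overlap criterion, and the 10% cut-off come from the text. How many windows qualify for a given clip depends on how the clip aligns with the window grid.

```python
def window_starts(num_frames, size=180, step=30):
    """Start frames of all full sliding windows in a video."""
    return list(range(0, max(num_frames - size, 0) + 1, step))

def overlap(a_start, b_start, size=180):
    """Number of frames shared by two equal-length segments."""
    return max(0, size - abs(a_start - b_start))

def represent_clips(num_frames, window_scores, size=180, min_conf=0.10):
    """Choose one window per uniform clip, or drop the clip.

    window_scores: dict {window start: {concept: confidence in [0, 1]}}
    A window is a candidate if it overlaps the clip by more than half its
    length; the candidate holding the single highest confidence represents
    the clip. Clips whose best confidence is below min_conf are ignored.
    """
    clips = []
    for clip_start in range(0, num_frames - size + 1, size):
        candidates = [s for s in window_scores
                      if overlap(clip_start, s, size) > size / 2]
        if not candidates:
            continue
        best = max(candidates, key=lambda s: max(window_scores[s].values()))
        if max(window_scores[best].values()) < min_conf:
            continue
        clips.append((clip_start, best, window_scores[best]))
    return clips

# Toy confidences for a 360-frame video with a single concept.
starts = window_starts(360)
toy_conf = [0.2, 0.6, 0.4, 0.9, 0.05, 0.08, 0.03]
window_scores = {s: {"clapping": c} for s, c in zip(starts, toy_conf)}
clips = represent_clips(360, window_scores)
print(clips)
```

In this toy run the first clip is represented by the window starting at frame 30, while the second clip is dropped because none of its candidate windows reaches the 10% threshold. The window at frame 90 overlaps both clips by exactly half and is therefore not a candidate for either.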
Video Classification using Semantic Concept Co-occurrences,
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. [Pdf] [BibTeX]