Center for Research in Computer Vision



Machine Vision and Applications (MVA)

Volume 25, Issue 1


This issue features the following special issue and original papers.



Multimedia Event Detection
Thomas B. Moeslund, Omar Javed, Yu-Gang Jiang, R. Manmatha

Editorial.



Special Issue Paper
E-LAMP: integration of innovative ideas for multimedia event detection
Wei Tong, Yi Yang, Lu Jiang, Shoou-I Yu, ZhenZhong Lan, Zhigang Ma, Waito Sze, Ehsan Younessian, Alexander G. Hauptmann

Detecting multimedia events in web videos is an emerging hot research area in the fields of multimedia and computer vision. In this paper, we introduce the core methods and technologies of the framework we developed recently for our Event Labeling through Analytic Media Processing (E-LAMP) system to deal with different aspects of the overall problem of event detection. More specifically, we have first developed efficient methods for feature extraction so that we are able to handle large collections of video data with thousands of hours of video. Second, we represent the extracted raw features in a spatial bag-of-words model with more effective tilings such that the spatial layout information of different features and different events can be better captured, and thus the overall detection performance can be improved. Third, different from the widely used early and late fusion schemes, a novel algorithm is developed to learn a more robust and discriminative intermediate feature representation from multiple features so that better event models can be built upon it. Finally, to tackle the additional challenge of event detection with only very few positive exemplars, we have developed a novel algorithm that can effectively adapt the knowledge learnt from auxiliary sources to assist event detection. Both our empirical results and the official evaluation results on TRECVID MED'11 and MED'12 demonstrate the excellent performance of the integration of these ideas.
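To make the spatial bag-of-words step concrete, the following minimal Python sketch quantizes local descriptors against a codebook and pools per-word histograms over grid tilings before concatenation. The codebook, tiling layout, and toy data are illustrative assumptions, not the E-LAMP configuration.

```python
# Minimal sketch of a spatial bag-of-words representation with grid tilings.
import numpy as np

def spatial_bow(descriptors, positions, codebook, tilings=((1, 1), (2, 2))):
    """descriptors: (N, D) local features; positions: (N, 2) normalized (x, y)
    in [0, 1); codebook: (K, D) visual words. Returns concatenated per-tile
    histograms over all tilings, L1-normalized per tile."""
    # Assign each descriptor to its nearest visual word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    K = codebook.shape[0]
    hists = []
    for (nx, ny) in tilings:
        # Tile index of each descriptor from its normalized position.
        tx = np.minimum((positions[:, 0] * nx).astype(int), nx - 1)
        ty = np.minimum((positions[:, 1] * ny).astype(int), ny - 1)
        tile = tx * ny + ty
        for t in range(nx * ny):
            h = np.bincount(words[tile == t], minlength=K).astype(float)
            hists.append(h / max(h.sum(), 1.0))
    return np.concatenate(hists)

# Toy usage: 500 random SIFT-like descriptors, 64-word codebook.
rng = np.random.default_rng(0)
rep = spatial_bow(rng.normal(size=(500, 128)), rng.random((500, 2)),
                  rng.normal(size=(64, 128)))
print(rep.shape)  # (64 * (1 + 4),) = (320,)
```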



Special Issue Paper
Evaluating multimedia features and fusion for example-based event detection
Gregory K. Myers, Ramesh Nallapati, Julien van Hout, Stephanie Pancoast, Ramakant Nevatia, Chen Sun, Amirhossein Habibian, Dennis C. Koelma, Koen E. A. van de Sande, Arnold W. M. Smeulders, Cees G. M. Snoek

Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as the arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME's performance in the 2012 TRECVID MED evaluation was one of the best reported.
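The arithmetic-mean late fusion found competitive here can be sketched in a few lines. The per-classifier z-normalization and the classifier names below are illustrative assumptions, not the SESAME implementation.

```python
# Minimal sketch of arithmetic-mean late fusion of detection scores.
import numpy as np

def arithmetic_mean_fusion(score_lists):
    """score_lists: dict of {classifier_name: (num_videos,) raw scores}.
    Scores are z-normalized per classifier, then averaged across classifiers."""
    normalized = []
    for name, s in score_lists.items():
        s = np.asarray(s, dtype=float)
        normalized.append((s - s.mean()) / (s.std() + 1e-8))
    return np.mean(normalized, axis=0)

scores = {"visual_bow": [0.2, 0.9, 0.4],
          "motion_bow": [0.1, 0.8, 0.6],
          "asr":        [0.0, 0.7, 0.2]}
print(arithmetic_mean_fusion(scores))  # one fused score per video
```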



Special Issue Paper
Discovering joint audio–visual codewords for video event detection
I-Hong Jhuo, Guangnan Ye, Shenghua Gao, Dong Liu, Yu-Gang Jiang, D. T. Lee, Shih-Fu Chang

Detecting complex events in videos is intrinsically a multimodal problem, since both the audio and visual channels provide important clues. While conventional methods fuse both modalities at a superficial level, in this paper we propose a new representation, called bi-modal words, to explore representative joint audio–visual patterns. We first build a bipartite graph to model the relations across the quantized words extracted from the visual and audio modalities. Partitioning over the bipartite graph is then applied to produce the bi-modal words that reveal the joint patterns across modalities. Different pooling strategies are then employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal bag-of-words representations. Since it is difficult to predict the suitable number of bi-modal words, we generate bi-modal words at different levels (i.e., codebooks with different sizes), and use multiple kernel learning to combine the resulting multiple representations during event classifier learning. Experimental results on three popular datasets show that the proposed method achieves statistically significant performance gains over methods using the individual visual and audio features alone and over existing popular multi-modal fusion methods. We also find that average pooling is particularly suitable for the bi-modal representation, and that using multiple kernel learning to combine multi-modal representations at various granularities is helpful.
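As a rough illustration of the bi-modal word idea, the sketch below partitions a visual-word by audio-word co-occurrence matrix with scikit-learn's spectral co-clustering, which stands in here for the paper's bipartite graph partitioning; the matrix contents and the number of bi-modal words are toy assumptions.

```python
# Rough sketch: discover bi-modal words by co-clustering the bipartite
# visual-word x audio-word co-occurrence matrix.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
# Co-occurrence counts: rows = 100 visual words, columns = 50 audio words,
# accumulated over short temporal windows of a video collection (toy data here).
cooccurrence = rng.poisson(1.0, size=(100, 50)).astype(float) + 1e-6

n_bimodal_words = 10
model = SpectralCoclustering(n_clusters=n_bimodal_words, random_state=0)
model.fit(cooccurrence)

# Each bi-modal word groups visual and audio words that co-occur frequently.
visual_assignment = model.row_labels_     # bi-modal word index per visual word
audio_assignment = model.column_labels_   # bi-modal word index per audio word
print(np.bincount(visual_assignment), np.bincount(audio_assignment))
```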



Special Issue Paper
Multimedia event detection with multimodal feature fusion and temporal concept localization
Sangmin Oh, Scott McCloskey, Ilseo Kim, Arash Vahdat, Kevin J. Cannons, Hossein Hajimirsadeghi, Greg Mori, A. G. Amitha Perera, Megha Pandey, Jason J. Corso

We present a system for multimedia event detection. The developed system characterizes complex multimedia events based on a large array of multimodal features, and classifies unseen videos by effectively fusing the diverse responses. We present three major technical innovations. First, we explore novel visual and audio features across multiple semantic granularities, including building, often in an unsupervised manner, mid-level and high-level features upon low-level features to enable semantic understanding. Second, we show a novel latent SVM model that learns and localizes discriminative high-level concepts in cluttered video sequences. In addition to improving detection accuracy beyond existing approaches, it enables a unique summary for every retrieval through its use of high-level concepts and temporal evidence localization. The resulting summary provides some transparency into why the system classified the video as it did. Finally, we present novel fusion learning algorithms and our methodology to improve fusion learning under limited training data conditions. Thorough evaluation on a large TRECVID MED 2011 dataset showcases the benefits of the presented system.
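The scoring-and-localization step of such a latent temporal model can be sketched as a search over temporal windows, with the best-scoring window serving as the localized evidence. The window length, model weights, and per-frame concept scores below are toy assumptions; full latent SVM training would alternate this inference with standard SVM weight updates.

```python
# Small sketch of latent temporal evidence localization for event scoring.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_concepts, win = 120, 10, 16
frame_scores = rng.random((n_frames, n_concepts))   # per-frame concept detections
frame_scores[50:66, 3] += 2.0                        # a burst of concept 3
w = np.zeros(n_concepts); w[3] = 1.0                 # toy event model weights

def score_and_localize(frame_scores, w, win):
    """Return the best window response of a linear model and its location."""
    best_score, best_t = -np.inf, 0
    for t in range(frame_scores.shape[0] - win + 1):
        s = w @ frame_scores[t:t + win].mean(axis=0)  # pooled window feature
        if s > best_score:
            best_score, best_t = s, t
    return best_score, (best_t, best_t + win)

score, (start, end) = score_and_localize(frame_scores, w, win)
print(f"event score {score:.2f}, localized evidence in frames [{start}, {end})")
```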



Special Issue Paper
Human interaction categorization by using audio-visual cues
M. J. Marín-Jiménez, R. Muñoz-Salinas, E. Yeguas-Bolivar, N. Pérez de la Blanca

Human Interaction Recognition (HIR) in uncontrolled TV video material is a very challenging problem because of the huge intra-class variability of the classes (due to large differences in the way actions are performed, lighting conditions and camera viewpoints, amongst others) as well as the small inter-class variability (e.g., the visual difference between hug and kiss is very subtle). Most previous works have focused only on visual information (i.e., the image signal), thus missing an important source of information present in human interactions: the audio. So far, such approaches have not shown to be discriminative enough. This work proposes the use of an Audio-Visual Bag of Words (AVBOW) as a more powerful mechanism to approach the HIR problem than the traditional Visual Bag of Words (VBOW). We show in this paper that the combined use of video and audio information yields better classification results than video alone. Our approach has been validated on the challenging TVHID dataset, showing that the proposed AVBOW provides statistically significant improvements over the VBOW employed in the related literature.



Special Issue Paper
Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos
G. J. Burghouts, K. Schutte, H. Bouma, R. J. M. den Hollander

In this paper, a system is presented that can detect 48 human actions in realistic videos, ranging from simple actions such as ‘walk’ to complex actions such as ‘exchange’. We propose a method that yields a major improvement in performance. The reason for this major improvement is a different approach to three themes: sample selection, two-stage classification, and the combination of multiple features. First, we show that the sampling can be improved by smart selection of the negatives. Second, we show that exploiting all 48 actions’ posteriors in a two-stage classification greatly improves detection. Third, we show how low-level motion and high-level object features should be combined. Together, these three yield a performance improvement of a factor of 2.37 for human action detection on the visint.org test set of 1,294 realistic videos. In addition, we demonstrate that selective sampling and the two-stage setup improve on standard bag-of-feature methods on the UT-Interaction dataset, and that our method outperforms the state of the art on the IXMAS dataset.
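A small sketch of the two-stage idea: stage-one classifiers produce posteriors for all 48 actions, and a stage-two classifier for a target action is trained on the full posterior vector so that correlations between actions are exploited. The data, feature dimensions, and use of logistic regression are illustrative assumptions, not the paper's classifiers.

```python
# Sketch of two-stage classification over all actions' posteriors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_videos, n_feat, n_actions = 200, 64, 48
X = rng.normal(size=(n_videos, n_feat))              # per-video features (toy)
Y = rng.integers(0, 2, size=(n_videos, n_actions))   # per-action labels (toy)

# Stage 1: one binary classifier per action, each producing posteriors.
stage1 = [LogisticRegression(max_iter=1000).fit(X, Y[:, a])
          for a in range(n_actions)]
posteriors = np.column_stack([c.predict_proba(X)[:, 1] for c in stage1])

# Stage 2: detect a target action from the vector of all 48 posteriors.
target = 0
stage2 = LogisticRegression(max_iter=1000).fit(posteriors, Y[:, target])
print(stage2.predict_proba(posteriors)[:5, 1])
```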



Special Issue Paper
A rule-based event detection system for real-life underwater domain
Concetto Spampinato, Emmanuelle Beauxis-Aussalet, Simone Palazzo, Cigdem Beyan, Jacco van Ossenbruggen, Jiyin He, Bas Boom, Xuan Huang

Understanding and analyzing fish behaviour is a fundamental task for biologists that study marine ecosystems, because changes in animal behaviour reflect environmental conditions such as pollution and climate change. To support investigators in addressing these complex questions, underwater cameras have recently been used. They can continuously monitor marine life while having almost no influence on the environment under observation, which is not the case with observations made by divers, for instance. However, the huge quantity of recorded data makes manual video analysis practically impossible. Thus, machine vision approaches are needed to distill the information to be investigated. In this paper, we propose an automatic event detection system able to identify solitary and pairing behaviours of the most common fish species of the Taiwanese coral reef. More specifically, the proposed system employs robust low-level processing modules for fish detection, tracking and recognition that extract the raw data used in the event detection process. Each fish trajectory is then modeled and classified using hidden Markov models. The events of interest are detected by integrating end-user rules, specified through an ad hoc user interface, with the analysis of fish trajectories. The system was tested on 499 events of interest, divided into solitary and pairing events for each fish species. It achieved an average accuracy of 0.105, expressed in terms of normalized detection cost. The obtained results are promising, especially given the difficulties of underwater environments. Moreover, the system allows marine biologists to speed up the behaviour analysis process and to carry out their investigations reliably.
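The trajectory-classification step can be sketched as one hidden Markov model per behaviour class, with a new trajectory assigned to the class whose HMM gives the highest log-likelihood. The 2D position features, the hmmlearn usage, and the toy trajectories below are illustrative assumptions about one reasonable realization, not the paper's exact models.

```python
# Sketch: per-class Gaussian HMMs over fish trajectories, max-likelihood decision.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def make_trajectories(n, drift):
    """Toy 2D trajectories (each 30 x 2) with a class-specific drift."""
    return [np.cumsum(rng.normal(drift, 0.5, size=(30, 2)), axis=0) for _ in range(n)]

train = {"solitary": make_trajectories(20, drift=0.1),
         "pairing":  make_trajectories(20, drift=0.6)}

models = {}
for label, trajs in train.items():
    X = np.vstack(trajs)
    lengths = [len(t) for t in trajs]
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                        n_iter=50, random_state=0)
    m.fit(X, lengths)
    models[label] = m

test_traj = make_trajectories(1, drift=0.55)[0]
scores = {label: m.score(test_traj) for label, m in models.items()}
print(max(scores, key=scores.get))  # predicted behaviour class
```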



Special Issue Paper
Charting-based subspace learning for video-based human action classification
Vijay John, Emanuele Trucco

We use charting, a non-linear dimensionality reduction algorithm, for articulated human motion classification in multi-view sequences or 3D data. Charting automatically estimates the intrinsic dimensionality of the latent subspace and preserves the local neighbourhood and global structure of high-dimensional data. We classify human action sub-sequences of varying lengths of skeletal poses, adopting a multi-layered subspace classification scheme with layered pruning and search. The sub-sequences of varying lengths of skeletal poses can be extracted using either markerless articulated tracking algorithms or markerless motion capture systems. We present a qualitative and quantitative comparison of single-subspace and multiple-subspace classification algorithms. We also identify the minimum length of action skeletal poses required for accurate classification, using competing classification systems as the baseline. We test our motion classification framework on the HumanEva, CMU, HDM05 and ACCAD mocap datasets and achieve similar or better classification accuracy than various comparable systems.
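A much-simplified sketch of subspace classification for pose sub-sequences: one low-dimensional subspace is learnt per action class, and a query is assigned to the class whose subspace reconstructs it best. PCA stands in here for the charting embedding, and the pose vectors are synthetic; both are illustrative assumptions.

```python
# Sketch: per-class subspaces, classification by reconstruction error.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def make_class_data(n, centre, dim=30):
    """Toy 'flattened pose sub-sequence' vectors clustered around a class centre."""
    return centre + 0.3 * rng.normal(size=(n, dim))

centres = {"walk": rng.normal(size=30), "jump": rng.normal(size=30)}
train = {c: make_class_data(50, centres[c]) for c in centres}

# One low-dimensional subspace (PCA model) per action class.
subspaces = {c: PCA(n_components=5).fit(X) for c, X in train.items()}

def classify(x):
    errors = {}
    for c, p in subspaces.items():
        recon = p.inverse_transform(p.transform(x[None, :]))[0]
        errors[c] = np.linalg.norm(x - recon)
    return min(errors, key=errors.get)

query = make_class_data(1, centres["jump"])[0]
print(classify(query))  # expected: "jump"
```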



Special Issue Paper
Hierarchical abnormal event detection by real time and semi-real time multi-tasking video surveillance system
Sung Chun Lee, Ram Nevatia

In this paper, we describe how to detect abnormal human activities taking place in an outdoor surveillance environment. Human tracks are provided in real time by the baseline video surveillance system. Given the trajectory information, the event analysis module attempts to determine whether or not a suspicious activity is currently being observed. However, due to real-time processing constraints, there might be false alarms generated by video image noise or non-human objects. Further intensive examination is required to filter out false event detections, and this can be processed in an off-line fashion. We propose a hierarchical abnormal event detection system that handles the real-time and semi-real-time analyses as multi-tasking. In the low-level task, a trajectory-based method processes trajectory data and detects abnormal events in real time. In the high-level task, an intensive video analysis algorithm checks whether the detected abnormal event is triggered by actual humans or not.



Special Issue Paper
Key observation selection-based effective video synopsis for camera network
Xiaobin Zhu, Jing Liu, Jinqiao Wang, Hanqing Lu

Nowadays, a tremendous amount of video is captured endlessly by an ever-increasing number of video cameras distributed around the world. Since needless information is abundant in the raw videos, video browsing and retrieval are inefficient and time-consuming. Video synopsis is an effective way to browse and index such video, by producing a short video representation while keeping the essential activities of the original video. However, video synopsis for a single camera is limited in its view scope, while understanding and monitoring the overall activity of large scenarios is valuable and demanding. To solve the above issues, we propose a novel video synopsis algorithm for a partially overlapping camera network. Our main contributions reside in three aspects. First, our algorithm can generate video synopses for large scenarios, which facilitates understanding the overall activities. Second, for generating the overall activity, we adopt a novel unsupervised graph matching algorithm to associate trajectories across cameras. Third, a novel multiple kernel similarity is adopted in selecting key observations for eliminating content redundancy in the video synopsis. We have demonstrated the effectiveness of our approach on real surveillance videos captured by our camera network.



Action recognition using 3D DAISY descriptor
Xiaochun Cao, Hua Zhang, Chao Deng, Qiguang Liu, Hanyu Liu

In this paper we propose a novel spatio-temporal descriptor for action recognition. We extend a recent image local descriptor, DAISY, to three dimensions to deal with the information in the additional temporal domain in videos. The new 3D DAISY descriptor is both functionally discriminative and computationally efficient. We use the bag-of-words framework and a non-linear SVM for classification. Experiments on the public action datasets KTH, WEIZMANN, YouTube, and UT-Interaction demonstrate the promising results of our method.



Active tracking and pursuit under different levels of occlusion: a two-layer approach
Tomer Baum, Idan Izhaki, Ehud Rivlin, Gadi Katzir

We present an algorithm for real-time, robust, vision-based active tracking and pursuit. The algorithm was designed to overcome problems arising from active vision-based pursuit, such as target occlusion. Our method employs two layers to deal with occlusions of different lengths. The first layer is for short- or medium-term occlusions: those where a known method, such as mean shift combined with a Kalman filter, fails. For this layer we designed the hybrid filter for active pursuit (HAP). HAP utilizes a Kalman filter modified to respond to two different modes of action: one in which the target is positively identified and one in which the target identification is uncertain. For long-term occlusions we use the second layer. This layer is a decision algorithm that follows a learning procedure and is based on game-theory-related reinforcement (Cesa-Bianchi and Lugosi, Prediction, Learning, and Games, 2006). The learning process is based on trial and error and is designed to perform adequately with a small number of samples. The algorithm produces a data structure that can be shared among agents or sent to a central control of a multi-agent system. The learning process is designed so that agents perform tasks according to their skills: an efficient agent will pursue targets while an inefficient agent will search for entering targets. These capacities make this system well suited for embedding in a multi-agent control system.
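The two modes of the hybrid filter can be illustrated with a minimal constant-velocity Kalman filter: a normal correction when the target is positively identified, and prediction only when identification is uncertain. The matrices and noise levels below are toy assumptions, not the HAP parameters.

```python
# Minimal constant-velocity Kalman filter with a "trusted detection" switch.
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
Q = 0.01 * np.eye(4)   # process noise
R = 1.0 * np.eye(2)    # measurement noise

x = np.zeros(4)        # state: [px, py, vx, vy]
P = np.eye(4)

def step(x, P, z, target_identified):
    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q
    if target_identified and z is not None:
        # Correct only when the detection is trusted.
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
    return x, P

# Toy run: two trusted detections, a short occlusion, then reacquisition.
for t, (z, ok) in enumerate([(np.array([1.0, 1.0]), True),
                             (np.array([2.1, 1.9]), True),
                             (None, False), (None, False),
                             (np.array([5.2, 4.8]), True)]):
    x, P = step(x, P, z, ok)
    print(t, np.round(x[:2], 2))
```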



Image-based magnification calibration for electron microscope
Koichi Ito, Ayako Suzuki, Takafumi Aoki, Ruriko Tsuneta

Magnification calibration is a crucial task for the electron microscope to achieve accurate measurement of the target object. In general, magnification calibration is performed to obtain the correspondence between the scale of the electron microscope image and the actual size of the target object using standard calibration samples. However, the current magnification calibration method mentioned above may include a maximum of 5 % scale error, since an alternative method has not yet been proposed. Addressing this problem, this paper proposes an image-based magnification calibration method for the electron microscope. The proposed method employs a multi-stage scale estimation approach using phase-based correspondence matching. Consider a sequence of microscope images of the same target object, where the image magnification is gradually increased so that the final image has a very large scale factor S (e.g., S=1,000) with respect to the initial image. The problem considered in this paper is to estimate the overall scale factor S of the given image sequence. The proposed scale estimation method provides a new methodology for high-accuracy magnification calibration of the electron microscope. This paper also proposes a quantitative performance evaluation method of scale estimation algorithms using Mandelbrot images, which are precisely scale-controlled images. Experimental evaluation using Mandelbrot images shows that the proposed scale estimation algorithm can estimate the overall scale factor S=1,000 with approximately 0.1 % scale error. Also, a set of experiments using image sequences taken by an actual scanning transmission electron microscope (STEM) demonstrates that the proposed method is more effective for magnification calibration of a STEM compared with a conventional method.
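The multi-stage composition at the heart of this approach can be sketched schematically: the overall factor S is the product of small pairwise scale factors between consecutive images, each of which is easier to estimate accurately. The estimate_pairwise_scale function below is a hypothetical placeholder for the paper's phase-based correspondence matching and simply returns a perturbed known ratio so the composition logic can run.

```python
# Schematic sketch of multi-stage scale estimation: compose pairwise factors.
import numpy as np

rng = np.random.default_rng(0)

def estimate_pairwise_scale(img_a, img_b, true_ratio):
    # Hypothetical placeholder for phase-based correspondence matching:
    # return the known toy ratio perturbed by a small per-stage error.
    return true_ratio * (1.0 + rng.normal(0.0, 1e-3))

# Ten magnification steps of roughly 2x each compose to an overall factor of 1,000.
ratios = [2.0] * 9 + [1.953125]           # 2**9 * 1.953125 == 1000
images = [None] * (len(ratios) + 1)       # stand-ins for the microscope image sequence
S = 1.0
for i, r in enumerate(ratios):
    S *= estimate_pairwise_scale(images[i], images[i + 1], true_ratio=r)
print(f"estimated overall scale factor S = {S:.1f}")  # close to 1,000
```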



Evaluating the effect of diffuse light on photometric stereo reconstruction
Maria E. Angelopoulou, Maria Petrou

Photometric stereo surface reconstruction requires each input image to be associated with a particular 3D illumination vector. This signifies that the subject should be illuminated in turn by various directional illumination sources. In real life, this directionality may be reduced by ambient illumination, which is typically present as a diffuse component of the incident light. This work assesses the photometric stereo reconstruction quality for various ratios of ambient to directional illuminance and provides a reference for the robustness of photometric stereo with respect to that illuminance ratio. In our analysis, we focus on the face reconstruction application of photometric stereo, as faces are convex objects with rich surface variation, thus providing a suitable platform for photometric stereo reconstruction quality evaluation. Results demonstrate that photometric stereo renders realistic reconstructions of the given surface for ambient illuminance as high as nine times the illuminance of the directional light component.
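A minimal Lambertian photometric stereo sketch with an ambient term illustrates the setup under evaluation: each pixel's intensities under several directional lights are explained by the albedo times the normal-light dot product plus a diffuse ambient contribution, and the normals are recovered by least squares. The light directions, ambient level, and synthetic normals are illustrative assumptions, not the paper's experimental configuration.

```python
# Minimal photometric stereo with an ambient term, solved by least squares.
import numpy as np

rng = np.random.default_rng(0)

# Four directional lights (unit vectors) and a global ambient intensity.
L = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.866],
              [0.0, 0.5, 0.866],
              [-0.5, -0.5, 0.707]])
ambient = 0.2

# Synthetic scene: one unit normal and one albedo value per pixel.
n_pix = 1000
N_true = rng.normal(size=(n_pix, 3))
N_true[:, 2] = np.abs(N_true[:, 2])                  # normals face the camera
N_true /= np.linalg.norm(N_true, axis=1, keepdims=True)
albedo = rng.uniform(0.5, 1.0, size=n_pix)

# Rendered intensities I = albedo * (N . L) + ambient (attached shadows ignored).
I = albedo[:, None] * (N_true @ L.T) + ambient

# Recover scaled normals by least squares, augmenting the light matrix with a
# column of ones so the ambient term is estimated jointly per pixel.
A = np.hstack([L, np.ones((L.shape[0], 1))])         # (num_lights, 4)
G, *_ = np.linalg.lstsq(A, I.T, rcond=None)          # (4, n_pix): [albedo*N; ambient]
N_est = G[:3].T
N_est /= np.linalg.norm(N_est, axis=1, keepdims=True)

err = np.degrees(np.arccos(np.clip((N_est * N_true).sum(axis=1), -1.0, 1.0)))
print("mean angular error (deg):", err.mean())       # ~0 in this noise-free toy case
```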



Real-time landing place assessment in man-made environments
Xiaolu Sun, C. Mario Christoudias, Vincent Lepetit, Pascal Fua

We propose a novel approach to real-time landing site detection and assessment in unconstrained man-made environments using passive sensors. Because this task must be performed in a few seconds or less, existing methods are often limited to simple local intensity and edge variation cues. By contrast, we show how to efficiently take into account the potential sites’ global shape, which is a critical cue in man-made scenes. Our method relies on a new segmentation algorithm and a shape regularity measure to look for polygonal regions in video sequences. In this way, we enforce both temporal consistency and geometric regularity, resulting in very reliable and consistent detections. We demonstrate our approach on the detection of landable sites such as rural fields, building rooftops and runways from color and infrared monocular sequences, significantly outperforming the state of the art.



Summarizing high-level scene behavior
Kevin Streib, James W. Davis

We present several novel techniques to summarize the high-level behavior in surveillance video. Our proposed methods can employ either optical flow or trajectories as input, and incorporate spatial and temporal information together, which improve upon existing approaches for summarization. To begin, we extract common pathway regions by performing graph-based clustering on similarity matrices describing the relationships between location/orientation states. We then employ the activities along the pathway regions to extract the aggregate behavioral patterns throughout scenes. We show how our summarization methods can be applied to detect anomalies, retrieve video clips of interest, and generate adaptive-speed summary videos. We examine our approaches on multiple complex urban scenes and present experimental results.



Thermal cameras and applications: a survey
Rikke Gade, Thomas B. Moeslund

Thermal cameras are passive sensors that capture the infrared radiation emitted by all objects with a temperature above absolute zero. This type of camera was originally developed as a surveillance and night vision tool for the military, but recently the price has dropped significantly, opening up a broader field of applications. Deploying this type of sensor in vision systems eliminates the illumination problems of normal greyscale and RGB cameras. This survey provides an overview of the current applications of thermal cameras. Applications span animals, agriculture, buildings, gas detection, and industrial and military settings, as well as the detection, tracking, and recognition of humans. Moreover, the survey describes the nature of thermal radiation and the technology of thermal cameras.



Biometric template protection with DCT-based watermarking
Mita Paunwala, S. Patnaik

In this paper we address two major issues in the design of a multimodal system: template protection and fusion strategy. A robust biometric watermarking algorithm is proposed for biometric template protection. The fingerprint feature vector and iris features are used as the watermark. The proposed DCT-based watermarking technique embeds the watermark in low-frequency AC coefficients of selected 8×8 DCT smoother blocks. Blocks are classified based on the human visual system. The robustness of the proposed algorithm is compared with several state-of-the-art methods from the literature when the watermarked image is subjected to possible channel attacks. A decision-level fusion strategy is used to improve the overall performance of the multimodal system. This is achieved by conditionally limiting the threshold of the fingerprint system to a maximum value, obtained by projecting 50 % of the crossover error rate onto the FRR curve of the iris system.
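A toy sketch of embedding watermark bits into a low-frequency AC coefficient of 8×8 DCT blocks, in the spirit of the scheme described above: here every block is used and one bit is embedded per block by quantization-index modulation, whereas the paper selects smoother blocks via an HVS-based classification; the coefficient choice and step size are simplified assumptions.

```python
# Toy DCT-block watermark embedding and extraction (one bit per 8x8 block).
import numpy as np
from scipy.fft import dctn, idctn

def embed_bits(image, bits, coeff=(0, 1), step=16.0):
    """Quantization-index-modulation embedding of one bit per 8x8 block."""
    img = image.astype(float).copy()
    h, w = img.shape
    k = 0
    for by in range(0, h - 7, 8):
        for bx in range(0, w - 7, 8):
            if k >= len(bits):
                return img
            block = dctn(img[by:by + 8, bx:bx + 8], norm="ortho")
            c = block[coeff]
            # Quantize the low-frequency AC coefficient to an even or odd
            # multiple of `step` depending on the bit value.
            q = np.round(c / step)
            if int(q) % 2 != bits[k]:
                q += 1 if c >= q * step else -1
            block[coeff] = q * step
            img[by:by + 8, bx:bx + 8] = idctn(block, norm="ortho")
            k += 1
    return img

def extract_bits(image, n_bits, coeff=(0, 1), step=16.0):
    bits, k = [], 0
    h, w = image.shape
    for by in range(0, h - 7, 8):
        for bx in range(0, w - 7, 8):
            if k >= n_bits:
                return bits
            c = dctn(image[by:by + 8, bx:bx + 8].astype(float), norm="ortho")[coeff]
            bits.append(int(np.round(c / step)) % 2)
            k += 1
    return bits

rng = np.random.default_rng(0)
host = rng.integers(0, 256, size=(64, 64)).astype(float)
payload = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed_bits(host, payload)
print(extract_bits(marked, len(payload)))  # should match the payload
```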