Center for Research in Computer Vision



Machine Vision and Applications (MVA)

Volume 25, Issue 7


This issue features the following special issue papers and original papers.



Contextual Vision Computing
Richang Hong, Qi Tian, Nicu Sebe

The popularity of Web 2.0 content has brought a proliferation of social media in recent years. Social media are intrinsically designed to facilitate interactive information sharing, interoperability, and collaboration on the Internet. As a result, web images and videos are generally accompanied by user-contributed contextual information such as tags and comments. The massive amount of emerging social media data offers new opportunities for resolving long-standing challenges in computer vision. For example, how can the visual content and user annotations of multimedia data be represented jointly, and how can video indexing and search benefit from contextual information? Research on contextual vision computing therefore presents both challenges and opportunities. This special issue was organized to introduce novel research on contextual vision computing. Submissions were solicited through an open call for papers. With the assistance of professional referees, ten of the seventeen submissions were accepted after two rounds of rigorous review. These papers cover a wide range of subtopics of contextual vision computing, including visual representation, image classification, tag localization, saliency detection, pedestrian detection, and so on.



Special Issue Paper
Semi-supervised Unified Latent Factor learning with multi-view data
Yu Jiang, Jing Liu, Zechao Li, Hanqing Lu

Multimedia resources are generated on the web at an explosive rate and can typically be regarded as multi-view data in nature. In this paper, we present a Semi-supervised Unified Latent Factor learning approach (SULF) to learn a predictive unified latent representation by leveraging both the complementary information among multiple views and the supervision from partially labeled data. On one hand, SULF employs a collaborative Nonnegative Matrix Factorization formulation to discover a unified latent space shared across multiple views. On the other hand, SULF adopts a regularized regression model to minimize a prediction loss on the partially labeled data with the latent representation. Consequently, the obtained parts-based representation has more discriminating power. In addition, we develop a mechanism to learn the weights of the different views automatically. To solve the proposed optimization problem, we design an effective iterative algorithm. Extensive experiments are conducted for both classification and clustering tasks on three real-world datasets, and the comparative results demonstrate the superiority of our approach.
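A minimal sketch of the unsupervised core of such a model is shown below: multi-view NMF with a shared latent representation H, followed by a ridge regression on the latent codes of the labeled samples. This is a simplification, not the authors' SULF formulation (which couples the two terms in one objective and learns view weights automatically); all names, sizes and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

def multiview_nmf(views, k, n_iter=200, eps=1e-9, seed=0):
    """views: list of nonnegative (d_v x n) matrices; returns per-view bases W_v and shared H."""
    n = views[0].shape[1]
    rng = np.random.default_rng(seed)
    Ws = [rng.random((X.shape[0], k)) for X in views]
    H = rng.random((k, n))
    for _ in range(n_iter):
        # multiplicative updates minimizing sum_v ||X_v - W_v H||_F^2
        for v, X in enumerate(views):
            Ws[v] *= (X @ H.T) / (Ws[v] @ H @ H.T + eps)
        num = sum(W.T @ X for W, X in zip(Ws, views))
        den = sum(W.T @ W for W in Ws) @ H + eps
        H *= num / den
    return Ws, H

# toy usage: two views of 100 samples, the first 20 of them labeled
X1 = np.abs(np.random.randn(50, 100))
X2 = np.abs(np.random.randn(30, 100))
Ws, H = multiview_nmf([X1, X2], k=10)
labeled_idx, y = np.arange(20), np.random.randint(0, 3, 20)
clf = RidgeClassifier().fit(H[:, labeled_idx].T, y)   # prediction loss on labeled latent codes
pred = clf.predict(H.T)                               # predicted labels for all samples
```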



Special Issue Paper
Inductive hierarchical nonnegative graph embedding for “verb–object” image classification
Chao Sun, Bing-Kun Bao, Changsheng Xu

Most existing image classification algorithms focus on images with only “object” concepts. In real-world cases, however, a great variety of images contain “verb–object” concepts rather than only “object” ones. The hierarchical structure embedded in these “verb–object” concepts can help to enhance classification, but traditional feature representation methods cannot utilize it. To tackle this problem, we present in this paper a novel approach called inductive hierarchical nonnegative graph embedding. By assuming that “verb–object” concept images which share the same “object” part but differ in the “verb” part have a specific hierarchical structure, we integrate this hierarchical structure into the nonnegative graph embedding technique, together with the definition of an inductive matrix, to (1) conduct effective feature extraction from the hierarchical structure, (2) easily transfer each new testing sample into its low-dimensional nonnegative representation, and (3) perform image classification of “verb–object” concept images. Extensive experiments compared with state-of-the-art algorithms on nonnegative data factorization demonstrate the classification power of the proposed approach on “verb–object” concept image classification.



Special Issue Paper
Localizing relevant frames in web videos using topic model and relevance filtering
Haojie Li, Lei Yi, Bin Liu, Yi Wang

Numerous web videos associated with rich metadata are available on the Internet today. While metadata such as video tags facilitate video search and multimedia content understanding, challenges also arise because tags are usually annotated at the video level, whereas many tags actually describe only parts of the video content. How to localize the relevant parts or frames of a web video for given tags is key to many applications and research tasks. In this paper, we propose combining a topic model and relevance filtering to localize relevant frames. Our method proceeds in three steps. First, we apply relevance filtering to assign relevance scores to video frames, and a raw relevant frame set is obtained by selecting the top-ranked frames. Then, we separate the frames into topics by mining the underlying semantics using latent Dirichlet allocation, and use the raw relevant set as a validation set to select relevant topics. Finally, the topical relevances are used to refine the raw relevant frame set and obtain the final results. Experimental results on two real web video databases validate the effectiveness of the proposed approach.
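The three steps can be illustrated with the following sketch, which uses scikit-learn's LDA as the topic model. The frame features, relevance scores and thresholds are placeholders, not the paper's actual settings.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def localize_relevant_frames(frame_bows, relevance_scores, top_k=50,
                             n_topics=10, topic_overlap_thresh=0.3):
    """frame_bows: (n_frames x vocab) bag-of-visual-words counts;
    relevance_scores: per-frame score for the query tag from a relevance filter."""
    # Step 1: raw relevant set = top-ranked frames from relevance filtering
    raw_set = set(np.argsort(relevance_scores)[::-1][:top_k])
    # Step 2: mine latent topics over all frames; assign each frame to its dominant topic
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    frame_topics = lda.fit_transform(frame_bows).argmax(axis=1)
    # keep topics whose member frames overlap sufficiently with the raw (validation) set
    relevant_topics = set()
    for t in range(n_topics):
        members = np.where(frame_topics == t)[0]
        if len(members) and len(raw_set.intersection(members)) / len(members) >= topic_overlap_thresh:
            relevant_topics.add(t)
    # Step 3: refine -- keep frames that are both top-ranked and in a relevant topic
    return sorted(i for i in raw_set if frame_topics[i] in relevant_topics)
```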



Special Issue Paper
Image visual attention computation and application via the learning of object attributes
Junwei Han, Dongyang Wang, Ling Shao, Xiaoliang Qian, Gong Cheng, Jungong Han

Visual attention aims at selecting a salient subset of the visual input for further processing while ignoring redundant data. The dominant view of the computation of visual attention is based on the assumption that bottom-up visual saliency, such as local contrast and interest points, drives the allocation of attention in scene viewing. In this paper, however, we advocate that the deployment of attention is primarily and directly guided by objects, and we thus propose a novel framework to explore image visual attention via the learning of object attributes from eye-tracking data. We mainly aim to solve three problems: (1) pixel-level visual attention computation (the saliency map); (2) image-level visual attention computation; (3) the application of the computational model to image categorization. We first adopt the object bank algorithm to acquire the responses of a number of object detectors at each location in an image, forming a feature descriptor that indicates the occurrences of various objects at a pixel or in an image. Next, we integrate the inference of interesting objects from fixations in eye-tracking data with the competition among surrounding objects to solve the first problem. We further propose a computational model to solve the second problem and estimate the interestingness of each image via the mapping between object attributes and the inter-observer visual congruency obtained from eye-tracking data. Finally, we apply the proposed pixel-level visual attention model to the image categorization task. Comprehensive evaluations on publicly available benchmarks and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed models.



Special Issue Paper
A new closed loop method of super-resolution for multi-view images
Jing Zhang, Yang Cao, Zhigang Zheng, Changwen Chen, Zengfu Wang

In this paper, we propose a closed-loop method for the multi-view super-resolution problem. For the mixed-resolution multi-view case, where the input is one high-resolution view along with its neighboring low-resolution views, our method produces super-resolution results and obtains a high-quality depth map simultaneously. The closed-loop method consists of two parts: part I, stereo matching and depth map fusion; and part II, super-resolution. Under the guidance of the estimated depth information, the super-resolution problem can be formulated as an optimization problem. It can be solved approximately by a three-step method involving disparity-based pixel mapping, nonlocal construction and final fusion. Based on the super-resolution results, we can update the disparity maps and fuse them into a more reliable depth map. We repeat the loop several times until stable super-resolution results and depth maps are obtained. Experimental results on a public dataset show that the proposed method achieves high-quality performance at different scale factors.



Special Issue Paper
Pedestrian detection based on sparse coding and transfer learning
Feidie Liang, Sheng Tang, Yongdong Zhang, Zuoxin Xu, Jintao Li

Pedestrian detection is a fundamental problem in video surveillance and has achieved great progress in recent years. However, training a generic detector that performs well in a great variety of scenes has proved to be very difficult. On the other hand, exhaustive manual labeling for each specific scene to achieve high detection accuracy is not acceptable, especially for video surveillance applications. To alleviate the manual labeling effort without sacrificing detection accuracy, we propose a transfer learning framework based on sparse coding for pedestrian detection. In our method, a generic detector is used to obtain the initial target samples, and several filters are then used to select from them a small set of samples (called target templates) whose labels and confidence values we are very sure about. The relevancy between source samples and target templates and the relevancy between target samples and target templates are estimated by sparse coding and later used to calculate weights for the source and target samples. By adding the sparse-coding-based weights to all these samples during the re-training process, we can not only exclude outliers in the source samples, but also tackle the drift problem in the target samples, and thus obtain a good scene-specific pedestrian detector. Our experiments on two public datasets show that our trained scene-specific pedestrian detector performs well and is comparable with a detector trained on a large number of training samples manually labeled from the target scene.
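A rough sketch of sparse-coding-based sample weighting is given below: source samples are coded over a dictionary of target templates, and samples that are well reconstructed receive larger re-training weights. The sigma and alpha hyperparameters are illustrative, and the exact weighting scheme is not taken from the paper.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def sparse_coding_weights(samples, target_templates, alpha=0.1, sigma=1.0):
    """samples: (n x d) feature vectors; target_templates: (m x d) dictionary atoms."""
    coder = SparseCoder(dictionary=target_templates,
                        transform_algorithm='lasso_lars',
                        transform_alpha=alpha)
    codes = coder.transform(samples)              # (n x m) sparse coefficients
    recon = codes @ target_templates              # reconstructions from target templates
    errors = np.sum((samples - recon) ** 2, axis=1)
    return np.exp(-errors / (2 * sigma ** 2))     # high weight = well explained by the target scene

# Such weights, computed for source and target samples, can multiply the
# per-sample loss when re-training the detector, down-weighting outliers and drifted samples.
```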



Special Issue Paper
Context-based person identification framework for smart video surveillance
Liyan Zhang, Dmitri V. Kalashnikov, Sharad Mehrotra, Ronen Vaisenberg

Smart video surveillance (SVS) applications enhance situational awareness by allowing domain analysts to focus on the events of higher priority. SVS approaches operate by trying to extract and interpret higher, “semantic”-level events that occur in video. One of the key challenges of SVS is person identification, where the task is to identify, for each subject that appears in a video shot, the person it corresponds to. The problem of person identification is especially challenging in resource-constrained environments, where transmission delay, bandwidth restriction, and packet loss may prevent the capture of high-quality data. Conventional person identification approaches, which are primarily based on analyzing facial features, are often not sufficient to deal with poor-quality data. To address this challenge, we propose a framework that leverages heterogeneous contextual information together with facial features to handle person identification for low-quality data. We first investigate appropriate methods to utilize heterogeneous context features including clothing, activity, human attributes, gait, people co-occurrence, and so on. We then propose a unified approach for person identification that builds on top of our generic entity resolution framework called RelDC, which can integrate all these context features to improve the quality of person identification. This work thus links the well-known problem of person identification from the computer vision area (which deals with video/images) with the well-recognized challenge of entity resolution from the database and AI/ML areas (which deals with textual data). We apply the proposed solution to a real-world dataset consisting of several weeks of surveillance videos. The results demonstrate the effectiveness and efficiency of our approach even on low-quality video data.



Special Issue Paper
A refined particle filter based on determined level set model for robust contour tracking
Xin Sun, Hongxun Yao

Traditional particle filters, which use simple geometric shapes for representation, cannot accurately track objects with complex shapes. In this paper, we propose a refined particle filter method for contour tracking based on a determined binary level set model (DBLSM). In contrast with previous work, the computational efficiency is greatly improved due to the simple form of the level set function. The DBLSM adds prior knowledge of the target model to the curve evolution, which improves the evolution behavior and ensures more accurate convergence to the target. Finally, we perform curve evolution in the update step of the particle filter to make good use of the observation at the current time. Appearance information is considered together with the energy function to compute particle weights, which helps identify the target more accurately. Experimental results on several challenging video sequences verify that the proposed algorithm is efficient and effective in many complicated scenes.



Special Issue Paper
Free-viewpoint video relighting from multi-view sequence under general illumination
Guannan Li, Yebin Liu, Qionghai Dai

We propose an approach to create plausible free-viewpoint relighting video using a multi-view camera array under general illumination. Given a multi-view video dataset recorded with a set of industrial cameras under general, uncontrolled and unknown illumination, we first reconstruct a 3D model of the captured target using an existing multi-view stereo approach. Using this coarse geometry reconstruction, we estimate the spatially varying surface reflectance in the spherical harmonics domain, taking spatial and temporal coherence into account. With the estimated geometry and reflectance, the 3D target is relit under novel illumination using the environment map of the target environment. The relit performance is enhanced using a flow- and quotient-based transfer strategy to achieve detailed and plausible performance relighting. Finally, the free-viewpoint video is generated using a view-dependent rendering strategy. Experimental results on various datasets show that our approach enables plausible free-view relighting and opens up a path towards relightable free-viewpoint video using less complex acquisition setups.



Special Issue Paper
Detail-generating geometry completion for point-sampled geometry
Ren-fang Wang, Yun-peng Liu, De-chao Sun, Hui-xia Xu, Ji-fang Li

In this paper, we present a novel method for detail-generating geometry completion over point-sampled geometry. The main idea is to convert context-based geometry completion into detail-based texture completion on the surface. According to the influence region of the boundary points surrounding a hole, a smooth patch covering the hole is first constructed using radial basis functions. By applying region-growing clustering to the patch, the patching units for further completion with geometry details are then produced, and using the trilateral filtering operator we formulate, the geometry-detail texture of each sample point on the input geometry is determined. The geometry details on the smooth completed patch are finally generated by optimizing a constrained global texture energy function on the point-sampled surfaces. Experimental results demonstrate that the method can produce completed patches that not only conform to their boundaries, but also contain plausible 3D surface details.
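A minimal sketch of the first step (a smooth patch over the hole built from its boundary points with radial basis functions) is shown below, under the simplifying assumption that the hole region can be parameterized as a height field z = f(x, y); the paper's detail-synthesis step is not reproduced.

```python
import numpy as np
from scipy.interpolate import Rbf

def smooth_hole_patch(boundary_pts, grid_res=30):
    """boundary_pts: (n x 3) points around the hole; returns (grid_res^2 x 3) patch samples."""
    x, y, z = boundary_pts[:, 0], boundary_pts[:, 1], boundary_pts[:, 2]
    rbf = Rbf(x, y, z, function='thin_plate')          # thin-plate spline RBF fit to the boundary
    gx, gy = np.meshgrid(np.linspace(x.min(), x.max(), grid_res),
                         np.linspace(y.min(), y.max(), grid_res))
    gz = rbf(gx, gy)                                   # smooth surface evaluated over the hole
    return np.column_stack([gx.ravel(), gy.ravel(), gz.ravel()])
```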



A computationally efficient importance sampling tracking algorithm
Rana Farah, Qifeng Gan, J. M. Pierre Langlois, Guillaume-Alexandre Bilodeau, Yvon Savaria

This paper proposes a computationally efficient importance sampling algorithm applicable to computer vision tracking. The algorithm is based on the CONDENSATION algorithm, but it avoids operations that are costly in real-time embedded systems. It also includes a method that reduces the number of particles during execution and a new resampling scheme. Our experiments demonstrate that the proposed algorithm is as accurate as the CONDENSATION algorithm. Depending on the processed sequence, the acceleration with respect to CONDENSATION can reach 7× for 50 particles, 12× for 100 particles and 58× for 200 particles.
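For reference, a generic CONDENSATION-style tracking step with systematic resampling looks roughly as follows; the paper's embedded-friendly optimizations, adaptive particle count and new resampling scheme are not reproduced here.

```python
import numpy as np

def systematic_resample(weights, rng):
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    return np.searchsorted(np.cumsum(weights), positions)

def particle_filter_step(particles, weights, likelihood_fn, motion_std, rng):
    """particles: (n x d) states; likelihood_fn(states) -> per-particle observation likelihood."""
    idx = systematic_resample(weights, rng)                              # resample
    particles = particles[idx]
    particles = particles + rng.normal(0, motion_std, particles.shape)   # predict (diffuse)
    weights = likelihood_fn(particles)                                   # measure
    weights = weights / weights.sum()
    estimate = (weights[:, None] * particles).sum(axis=0)                # weighted mean state
    return particles, weights, estimate
```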



Discriminative vessel segmentation in retinal images by fusing context-aware hybrid features
Erkang Cheng, Liang Du, Yi Wu, Ying J. Zhu, Vasileios Megalooikonomou, Haibin Ling

Vessel segmentation is an important problem in medical image analysis and is often challenging due to large variations in vessel appearance and profiles, as well as image noise. To address these challenges, we propose a solution that combines heterogeneous context-aware features with a discriminative learning framework. Our solution is characterized by three key ingredients: First, we design a hybrid feature pool containing recently invented descriptors, including the stroke width transform (SWT) and Weber’s local descriptors (WLD), as well as classical local features including intensity values, Gabor responses and vesselness measurements. Second, we encode context information by sampling the hybrid features from an orientation-invariant local context. Third, we treat pixel-level vessel segmentation as a discriminative classification problem and use a random forest to fuse the rich information encoded in the hybrid context-aware features. For evaluation, the proposed method is applied to retinal vessel segmentation using three publicly available benchmark datasets. On the DRIVE and STARE datasets, our approach achieves average classification accuracies of 0.9474 and 0.9633, respectively. On the high-resolution dataset HRFID, our approach achieves average classification accuracies of 0.9647, 0.9561 and 0.9634 on three different categories, respectively. Experiments are also conducted to validate the superiority of hybrid feature fusion over each individual component.
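A schematic example of the discriminative fusion step is shown below: per-pixel hybrid features are stacked and fed to a random forest. The feature maps (SWT, WLD, Gabor, vesselness, intensity) are assumed to be precomputed arrays of the same spatial size as the image; context sampling is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_vessel_classifier(feature_maps, vessel_mask, n_trees=100):
    """feature_maps: list of (H x W) arrays; vessel_mask: (H x W) binary ground truth."""
    X = np.stack([f.ravel() for f in feature_maps], axis=1)   # one hybrid vector per pixel
    y = vessel_mask.ravel().astype(int)
    clf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1, random_state=0)
    clf.fit(X, y)
    return clf

def segment(clf, feature_maps, shape):
    X = np.stack([f.ravel() for f in feature_maps], axis=1)
    return clf.predict_proba(X)[:, 1].reshape(shape)          # per-pixel vessel probability map
```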



Realistic human action recognition by Fast HOG3D and self-organization feature map
Nijun Li, Xu Cheng, Suofei Zhang, Zhenyang Wu

Nowadays, local features are very popular in vision-based human action recognition, especially for “wild” or unconstrained videos. This paper proposes a novel framework that combines Fast HOG3D and a self-organization feature map (SOM) network for action recognition from unconstrained videos, bypassing demanding preprocessing such as human detection, tracking or contour extraction. The contributions of our work lie not only in creating a more compact and computationally efficient local feature descriptor than the original HOG3D, but also in being the first to successfully apply SOM to a realistic action recognition task and to study the influence of its training parameters. We mainly test our approach on the UCF-YouTube dataset with 11 realistic sport actions, achieving promising results that outperform a local feature-based support vector machine and are comparable with bag-of-words. Experiments are also carried out on the KTH and UT-Interaction datasets for comparison. Results on all three datasets confirm that our approach has comparable, if not better, performance compared with the state of the art.
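A compact SOM trainer in plain NumPy is sketched below as a stand-in for the SOM stage described above; the Fast HOG3D descriptors are assumed to be precomputed feature vectors, and the grid size, learning rate and neighborhood schedule are illustrative rather than the paper's settings.

```python
import numpy as np

def train_som(data, grid=(8, 8), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """data: (n x d) descriptors; returns SOM weights of shape (grid[0], grid[1], d)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing='ij'), axis=-1)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        dist = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(dist.argmin(), dist.shape)        # best matching unit
        lr = lr0 * np.exp(-t / n_iter)                           # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)                     # shrinking neighborhood
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
        influence = np.exp(-grid_dist ** 2 / (2 * sigma ** 2))[..., None]
        weights += lr * influence * (x - weights)                # pull the neighborhood toward x
    return weights
```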



Normalized Cut optimization based on color perception findings. A comparative study
Aurora Sáez, Carmen Serrano, Begoña Acha

This paper proposes a methodology for a fully automatic color segmentation algorithm based on the Normalized Cut (Ncut) proposed by Shi and Malik, using recent findings in color perception. We propose a weighting matrix computed using a perceptually uniform color space (CIE L∗a∗b∗) and color-difference formulae correlated with visually perceived color differences (CIE94 and CIEDE2000), a stopping condition related to perceptual criteria, and an automatic setting of the parameters required to compute the affinity matrix. To test the proposed methodology, an extensive study of the influence of the color space choice, different stopping conditions, and different similarity measurements is carried out. These alternatives are exhaustively evaluated using perception-related measurements (S-CIELAB) and general segmentation evaluation metrics applied to the 500 images of the Berkeley database. The results show that the proposed method outperforms Ncut based on other color spaces, similarity measures or stopping conditions. Furthermore, the usability of the method is increased by replacing the manual parameter setting with an automatic one.
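As a rough illustration, a perceptual affinity matrix built from CIEDE2000 distances in CIE L*a*b* can be fed to spectral clustering (a standard relaxation of the Normalized Cut). The sigma value and number of segments below are illustrative and do not correspond to the paper's automatic settings, and the dense pairwise affinity restricts this sketch to small images.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000
from sklearn.cluster import SpectralClustering

def ncut_color_segmentation(rgb_small, n_segments=4, sigma=10.0):
    """rgb_small: (H x W x 3) float image in [0, 1]; keep it small -- the affinity is dense."""
    lab = rgb2lab(rgb_small).reshape(-1, 3)
    # pairwise CIEDE2000 color differences -> Gaussian affinity
    d = deltaE_ciede2000(lab[:, None, :], lab[None, :, :])
    W = np.exp(-(d ** 2) / (2 * sigma ** 2))
    labels = SpectralClustering(n_clusters=n_segments, affinity='precomputed',
                                random_state=0).fit_predict(W)
    return labels.reshape(rgb_small.shape[:2])
```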



Factored particle filtering with dependent and constrained partition dynamics for tracking deformable objects
M. Taner Eskil

In particle filtering, the dimensionality of the state space can be reduced by tracking control (or feature) points as independent objects, traditionally named partitions. Two critical decisions have to be made when implementing reduced state-space dimensionality. The first is how to construct a dynamic (transition) model for partitions that are inherently dependent. The second is how to filter partition states such that a viable and likely object state is achieved. In this study, we present a correlation-based transition model and a proposal function that incorporate partition dependency into particle filtering in a computationally tractable manner. We test our algorithm on challenging examples of occlusion, clutter and drastic changes in the relative speeds of partitions. Our successful results with as few as 10 particles per partition indicate that the proposed algorithm is both robust and efficient.



Automatic inpainting by removing fence-like structures in RGBD images
Qin Zou, Yu Cao, Qingquan Li, Qingzhou Mao, Song Wang

Recent inpainting techniques usually require human interaction, which is labor intensive and depends on user experience. In this paper, we introduce an automatic inpainting technique to remove undesired fence-like structures from images. Specifically, the proposed technique works on RGBD images, which have recently become cheaper and easier to obtain using the Microsoft Kinect. The basic idea is to segment and remove the undesired fence-like structures using both depth and color information, and then adapt an existing inpainting algorithm to fill the holes resulting from the structure removal. We found that it is difficult to achieve a satisfactory segmentation of such structures using only the depth channel. Instead, we use the depth information to help identify a set of foreground and background strokes, with which we apply a graph-cut algorithm on the color channels to obtain a more accurate segmentation for inpainting. We demonstrate the effectiveness of the proposed technique through experiments on a set of Kinect images.
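A simplified version of this pipeline can be sketched with OpenCV: depth gives confident foreground/background strokes, GrabCut refines the fence mask on the color channels, and the hole is filled with a generic inpainting call (the paper adapts a more sophisticated inpainting algorithm). The depth thresholds are hypothetical inputs.

```python
import cv2
import numpy as np

def remove_fence(color_bgr, depth, fence_near, fence_far):
    """color_bgr: uint8 3-channel image; depth: float array of the same size;
    fence_near/fence_far: assumed depth range of the fence-like structure."""
    mask = np.full(depth.shape, cv2.GC_PR_BGD, np.uint8)             # default: probably background
    mask[(depth > fence_near) & (depth < fence_far)] = cv2.GC_FGD    # confident fence strokes
    mask[depth > fence_far * 1.5] = cv2.GC_BGD                       # confident scene background
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    cv2.grabCut(color_bgr, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    fence = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
    return cv2.inpaint(color_bgr, fence, 5, cv2.INPAINT_TELEA)       # fill the removed structure
```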



Multi-scale patch-based sparse appearance model for robust object tracking
Chengjun Xie, Jieqing Tan, Peng Chen, Jie Zhang, Lei He

When objects undergo large pose changes, illumination variation or partial occlusion, most existing visual tracking algorithms tend to drift away from the targets and may even fail to track them. To address this issue, in this paper we propose a multi-scale patch-based appearance model with sparse representation and provide an efficient scheme for collaboration between multi-scale patches encoded by sparse coefficients. The key idea of our method is to model the appearance of an object with patches at different scales, which are represented by sparse coefficients over dictionaries of the corresponding scales. The model exploits both partial and spatial information of targets based on multi-scale patches. A similarity score for each candidate target is then fed into a particle filter framework to estimate the target state sequentially over time. Additionally, to decrease the visual drift caused by frequent model updates, we present a novel two-step object tracking method that exploits both the ground-truth information of the target labeled in the first frame and the target obtained online with the multi-scale patch information. Experiments on publicly available benchmark video sequences show that the similarity measure involving complementary information can locate targets more accurately and that the proposed tracker is more robust and effective than others.



3D Hough transform for sphere recognition on point clouds
Marco Camurri, Roberto Vezzani, Rita Cucchiara

Three-dimensional object recognition on range data and 3D point clouds is becoming increasingly important. Since many real objects have shapes that can be approximated by simple primitives, robust pattern recognition can be used to search for primitive models. For example, the Hough transform is a well-known technique widely adopted in the 2D image space. In this paper, we systematically analyze different probabilistic/randomized Hough transform algorithms for spherical object detection in dense point clouds. In particular, we study and compare four variants characterized by the number of points drawn together to compute a surface in the parametric space, and we formally discuss their models. We also propose a new method that combines the advantages of both single-point and multi-point approaches for faster and more accurate detection. The methods are tested on synthetic and real datasets.
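A toy randomized Hough transform for spheres is sketched below: a sphere is fit to four sampled points by solving a small linear system, and votes accumulate in a coarsely quantized (center, radius) space. The quantization step and number of draws are illustrative, and this corresponds to a multi-point variant rather than any specific method from the paper.

```python
import numpy as np
from collections import Counter

def sphere_from_points(p):
    """p: (4 x 3) non-coplanar points; returns (center, radius)."""
    # |x|^2 = 2 x.c + (r^2 - |c|^2)  ->  linear in c and d = r^2 - |c|^2
    A = np.hstack([2 * p, np.ones((4, 1))])
    b = (p ** 2).sum(axis=1)
    sol = np.linalg.solve(A, b)
    center = sol[:3]
    radius = np.sqrt(sol[3] + center @ center)
    return center, radius

def hough_spheres(points, n_draws=5000, q=0.05, seed=0):
    """points: (n x 3) cloud; returns the best (cx, cy, cz, r) cell and its vote count."""
    rng = np.random.default_rng(seed)
    votes = Counter()
    for _ in range(n_draws):
        sample = points[rng.choice(len(points), 4, replace=False)]
        try:
            c, r = sphere_from_points(sample)
        except np.linalg.LinAlgError:            # degenerate (coplanar) draw
            continue
        votes[tuple(np.round(np.append(c, r) / q).astype(int))] += 1
    cell, count = votes.most_common(1)[0]
    return np.array(cell) * q, count
```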



Exploiting street-level panoramic images for large-scale automated surveying of traffic signs
Lykele Hazelhoff, Ivo M. Creusen, Peter H. N. de With

Accurate and up-to-date inventories of traffic signs contribute to efficient road maintenance and high road safety. This paper describes a system for the automated surveying of road signs from street-level images. This is an extremely challenging task, as the images are sparsely sampled and captured under a wide range of weather conditions, and signs may be distorted. The described system is designed in a generic, learning-based fashion, which enables the recognition of different sign appearance classes with the same algorithms, based on class-specific training data. The system starts with detection of the signs visible within each image, using a detection cascade. Next, the 3D positions of the signs detected in consecutive capturings are calculated. Afterwards, each positioned road sign is classified to retrieve its sign type, thereby exploiting all detections used during positioning of the respective sign. The presented system is intended for large-scale application and currently supports 11 sign appearance classes, containing 176 different sign types. Performance evaluations conducted on a large, real-world dataset (68,010 images) show that our approach accurately positions 95.5 % of the 3,385 present signs, and 96.3 % of them are also correctly classified. Furthermore, our system localized 98.5 % of the signs in at least a single image. Our system design allows for appending a limited manual correction stage to attain a very high performance, so that sign inventories can be created cost-effectively.
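As an illustration of the positioning step, two viewing rays toward the same detected sign (from consecutive capturings with known camera poses) can be triangulated by taking the midpoint of the closest points on the two rays. This is a generic helper, not the paper's actual positioning algorithm; calibration and detection association are assumed to be available.

```python
import numpy as np

def triangulate_rays(c1, d1, c2, d2):
    """c1, c2: camera centers (3,); d1, d2: ray direction vectors toward the detected sign."""
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    w = c1 - c2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w, d2 @ w
    denom = a * c - b * b
    if abs(denom) < 1e-9:                        # near-parallel rays: position unreliable
        return None
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    p1, p2 = c1 + t1 * d1, c2 + t2 * d2          # closest points on each ray
    return (p1 + p2) / 2                         # estimated 3D sign position
```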