Center for Research in Computer Vision



MVA

Volume 25, Issue 4


This issue features the following papers.



Attributed hypergraph matching on a Riemannian manifold
J. M. Wang, S. W. Chen, C. S. Fuh

If we consider a matching that preserves high-order relationships among points in the same set, we can introduce a hypergraph-matching technique to search for correspondences according to high-order feature values. While graph matching has been widely studied, there is limited research on hypergraph matching. In this paper, we formulate hypergraph matching in terms of tensors. Then, we reduce the hypergraph matching to a bipartite matching problem that can be solved in polynomial time. We then extend this hypergraph matching to attributed hypergraph matching using a combination of different attributes with different orders. We perform analyses that demonstrate that this method is robust when handling noisy or missing data and can achieve inexact graph matching. To the best of our knowledge, while attributed graph matching and hypergraph matching have been heavily researched, methods for attributed hypergraph matching have not been proposed before.
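
As an illustration of the reduction described in the abstract, the following minimal Python sketch accumulates sampled third-order (triangle) affinities into a pairwise score matrix and solves the resulting bipartite matching with the Hungarian algorithm. The angle-based affinity and the random sampling scheme are illustrative assumptions, not the authors' exact tensor formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def triangle_angles(tri):
    """Interior angles of a triangle given its 3 vertices (rotation/scale invariant)."""
    def ang(p, q, r):  # angle at vertex p
        u, v = q - p, r - p
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
        return np.arccos(np.clip(c, -1.0, 1.0))
    a, b, c = tri
    return np.array([ang(a, b, c), ang(b, a, c), ang(c, a, b)])

def third_order_affinity(tri_x, tri_y):
    """High-order feature: similarity of the two triangles' angle vectors."""
    return np.exp(-np.linalg.norm(triangle_angles(tri_x) - triangle_angles(tri_y)))

def hypergraph_match(X, Y, n_triples=200, rng=np.random.default_rng(0)):
    """Accumulate sampled third-order affinities into a pairwise score matrix,
    then solve the resulting bipartite matching in polynomial time."""
    n, m = len(X), len(Y)
    S = np.zeros((n, m))
    for _ in range(n_triples):
        ti = rng.choice(n, 3, replace=False)
        tj = rng.choice(m, 3, replace=False)
        a = third_order_affinity(X[ti], Y[tj])
        S[ti, tj] += a                      # marginalize the tensor onto pairs
    row, col = linear_sum_assignment(-S)    # Hungarian algorithm maximizes score
    return list(zip(row, col))
```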



Face relighting using discriminative 2D spherical spaces for face recognition
Amr Almaddah, Sadi Vural, Yasushi Mae, Kenichi Ohara, Tatsuo Arai

As part of the face recognition task in a robust security system, we propose a novel approach for the illumination recovery of faces with cast shadows and specularities. Given a single 2D face image, we relight the face object by extracting the nine spherical harmonic bases and the face's spherical illumination coefficients using the properties of the face spherical spaces. First, an illumination training database is generated by computing the properties of the spherical spaces from face albedo and normal values estimated from 2D training images. The training database is then divided discriminatively along two directions, in terms of the illumination quality and light direction of each image. Based on the generated multi-level illumination discriminative training space, we analyze the target face pixels and compare them with the appropriate training subspace using pre-generated tiles. When designing the framework, practical real-time processing speed and small image size were considered. In contrast to other approaches, our technique requires neither 3D face models nor restricted illumination conditions for the training process. Furthermore, the proposed approach uses a single face image to estimate the face albedo and face spherical spaces. In this work, we also provide the results of a series of experiments performed on publicly available databases to show the significant improvements in the face recognition rates.
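
The nine spherical harmonic bases mentioned above are standard; the sketch below evaluates them from per-pixel surface normals and solves for the nine illumination coefficients by least squares. This is a generic SH-lighting sketch under a Lambertian assumption, not the paper's full discriminative training pipeline.

```python
import numpy as np

def sh_basis(normals):
    """Nine real spherical harmonic basis values for unit normals (..., 3).
    Constants follow the standard real SH normalization (orders 0..2)."""
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    c0 = 1.0 / np.sqrt(4 * np.pi)
    c1 = np.sqrt(3.0 / (4 * np.pi))
    c2 = np.sqrt(15.0 / (4 * np.pi))
    c3 = np.sqrt(5.0 / (16 * np.pi))
    c4 = np.sqrt(15.0 / (16 * np.pi))
    return np.stack([
        c0 * np.ones_like(x),        # Y_00
        c1 * y, c1 * z, c1 * x,      # Y_1-1, Y_10, Y_11
        c2 * x * y, c2 * y * z,      # Y_2-2, Y_2-1
        c3 * (3 * z**2 - 1),         # Y_20
        c2 * x * z,                  # Y_21
        c4 * (x**2 - y**2),          # Y_22
    ], axis=-1)                       # shape (..., 9)

def lighting_coefficients(image, albedo, normals):
    """Least-squares estimate of the 9 illumination coefficients L in
    I ~ albedo * (B @ L), given per-pixel albedo and unit normals."""
    B = sh_basis(normals).reshape(-1, 9)
    A = albedo.reshape(-1, 1) * B
    L, *_ = np.linalg.lstsq(A, image.reshape(-1), rcond=None)
    return L
```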



Fully automatic expression-invariant face correspondence
Augusto Salazar, Stefanie Wuhrer, Chang Shu, Flavio Prieto

We consider the problem of computing accurate point-to-point correspondences among a set of human face scans with varying expressions. Our fully automatic approach does not require any manually placed markers on the scan. Instead, the approach learns the locations of a set of landmarks present in a database and uses this knowledge to automatically predict the locations of these landmarks on a newly available scan. The predicted landmarks are then used to compute point-to-point correspondences between a template model and the newly available scan. To accurately fit the expression of the template to the expression of the scan, we use a blendshape model as the template. Our algorithm was tested on a database of human faces of different ethnic groups with strongly varying expressions. Experimental results show that the obtained point-to-point correspondence is both highly accurate and consistent for most of the tested 3D face models.
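
The final fitting step can be pictured as a constrained least-squares problem: choose blendshape weights so that the deformed template's landmark vertices match the predicted landmarks. A minimal sketch, with hypothetical array shapes and nonnegative weights as an assumption:

```python
import numpy as np
from scipy.optimize import nnls

def fit_blendshape_weights(neutral, blendshapes, target_landmarks, lm_idx):
    """Solve min_w || (neutral + sum_k w_k * d_k)[lm] - target ||^2 with w >= 0.
    neutral: (V, 3) template vertices; blendshapes: (K, V, 3) per-shape offsets;
    target_landmarks: (L, 3) predicted landmark positions on the scan;
    lm_idx: indices of the template vertices corresponding to the landmarks."""
    K = blendshapes.shape[0]
    A = blendshapes[:, lm_idx, :].reshape(K, -1).T    # (3L, K) design matrix
    b = (target_landmarks - neutral[lm_idx]).ravel()  # residual to explain
    w, _ = nnls(A, b)                                 # nonnegative weights
    return w

def deformed_template(neutral, blendshapes, w):
    """Template deformed to match the scan's expression."""
    return neutral + np.tensordot(w, blendshapes, axes=1)
```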



Fusing the information in visible light and near-infrared images for iris recognition
Faranak Shamsafar, Hadi Seyedarabi, Ali Aghagolzadeh

Automated human identification is a significant issue in real and virtual societies, and the iris is a suitable biometric for this goal. In this paper, we present an iris recognition system that uses images acquired under both near-infrared (NIR) and visible light (VL). These two types of images reveal different textural information of the iris tissue, and we demonstrate the necessity of processing both VL and NIR images to recognize irides. The proposed system exploits two feature extraction algorithms: one based on the 1D log-Gabor wavelet, which gives a detailed representation of the iris region, and the other based on the 1D Haar wavelet, which represents a coarse model of the iris. The Haar wavelet algorithm, proposed in this paper, produces smaller iris templates than the 1D log-Gabor approach yet achieves an appropriate recognition rate. We performed the fusion at the match-score level and examined the performance of the system in both verification and identification modes. The UTIRIS database was used to evaluate the method. The results, compared with other approaches, show better recognition accuracy even though no image enhancement technique is applied prior to the feature extraction stage. Furthermore, we demonstrated that fusion can compensate for the lack of input image information, which can be beneficial in reducing computational complexity and handling non-cooperative iris images.
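
A minimal sketch of the 1D Haar wavelet encoding described above: sign-binarized detail coefficients of each row of the unwrapped iris form a compact binary template, templates are compared with the Hamming distance, and NIR and VL scores are fused at the match-score level by a weighted sum. The decomposition depth and fusion weight are illustrative assumptions.

```python
import numpy as np

def haar_1d(signal, levels=3):
    """Multi-level 1D Haar decomposition; returns the detail coefficients."""
    details, s = [], signal.astype(float)
    for _ in range(levels):
        if len(s) % 2:
            s = s[:-1]
        avg = (s[0::2] + s[1::2]) / 2.0
        details.append((s[0::2] - s[1::2]) / 2.0)
        s = avg
    return np.concatenate(details)

def iris_template(norm_iris, levels=3):
    """Binary template: signs of the Haar detail coefficients for each row
    of the unwrapped (normalized) iris image."""
    rows = [haar_1d(r, levels) for r in norm_iris]
    return (np.concatenate(rows) > 0).astype(np.uint8)

def hamming(t1, t2):
    """Fraction of disagreeing bits between two templates."""
    return np.count_nonzero(t1 != t2) / t1.size

def fused_score(score_nir, score_vl, w=0.5):
    """Match-score-level fusion: weighted sum of the two pipelines'
    distances (lower = better match)."""
    return w * score_nir + (1 - w) * score_vl
```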



OPTIMUS: online persistent tracking and identification of many users for smart spaces
Donghoon Lee, Inhwan Hwang, Songhwai Oh

A smart space, which is embedded with networked sensors and smart devices, can provide various useful services to its users. For the success of a smart space, the problem of tracking and identification of smart space users is of paramount importance. We propose a system, called Optimus, for persistent tracking and identification of users in a smart space, which is equipped with a camera network. We assume that each user carries a smartphone in a smart space. A camera network is used to solve the problem of tracking multiple users in a smart space and information from smartphones is used to identify tracks. For robust tracking, we first detect human subjects from images using a head detection algorithm based on histograms of oriented gradients. Then, human detections are combined to form tracklets and delayed track-level association is used to combine tracklets to build longer trajectories of users. Finally, accelerometers in smartphones are used to disambiguate identities of trajectories. By linking identified trajectories, we show that the average length of a track can be extended by more than six times. The performance of the proposed system is evaluated extensively in realistic scenarios.
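
The accelerometer-based disambiguation step can be illustrated as an assignment problem: correlate each phone's acceleration magnitude with the acceleration magnitude of each visual trajectory, then match identities one-to-one. A hedged sketch, assuming time-aligned signals and a known frame rate:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def accel_magnitude(track_xy, fps):
    """Acceleration magnitude of a 2D trajectory sampled at `fps`."""
    v = np.gradient(track_xy, 1.0 / fps, axis=0)
    a = np.gradient(v, 1.0 / fps, axis=0)
    return np.linalg.norm(a, axis=1)

def zncc(a, b):
    """Zero-mean normalized cross-correlation of equal-length signals."""
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    return float(np.mean(a * b))

def assign_identities(tracks, phone_accels, fps=15):
    """tracks: list of (T, 2) trajectories; phone_accels: list of (T,)
    accelerometer magnitudes, one per user, time-aligned with the video."""
    C = np.zeros((len(tracks), len(phone_accels)))
    for i, tr in enumerate(tracks):
        va = accel_magnitude(tr, fps)
        for j, pa in enumerate(phone_accels):
            n = min(len(va), len(pa))
            C[i, j] = -zncc(va[:n], pa[:n])   # negated: Hungarian minimizes
    rows, cols = linear_sum_assignment(C)
    return dict(zip(rows, cols))               # track index -> user index
```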



Family verification based on similarity of individual family member’s facial segments
Mohammad Ghahramani, Wei-Yun Yau, Eam Khwang Teoh

Humans process faces to recognize family resemblance and act accordingly. Undoubtedly, they are capable of recognizing their kin and family members. In this paper, we study the facts and valid assumptions about facial resemblance in family members' facial segments. Our analysis and psychological studies show that facial resemblance differs from member to member and depends on the image segments. First, we estimate the degree of resemblance of each member's image segment. Then, we propose a novel method to fuse the similarity of each member's facial image segments to perform family verification. Employing the proposed approach on the collected 5,400-sample family database achieves considerable improvement compared to the state-of-the-art fusion rule in three designated test scenarios. Experimental results also show that the proposed approach can estimate the similarity slightly more accurately than human perception. We believe the public availability of the database may advance development in this domain.



Kernelized pyramid nearest-neighbor search for object categorization
Hong Cheng, Rongchao Yu, Zicheng Liu, Lu Yang, Xue-wen Chen

Nearest-neighbor-based image classification has drawn considerable attention in the past several years thanks to its simplicity and efficiency. Recently, a kernelized version of the Naive-Bayes Nearest-Neighbor (KNBNN) approach was proposed to combine nearest-neighbor-based approaches with other bag-of-features (BoF) based kernels. However, like an orderless BoF image representation, the KNBNN ignores global geometric correspondence. In this paper, our contributions are threefold. First, we present a technique to exploit global geometric correspondence in a kernelized NBNN classifier framework, dividing an image into increasingly fine sub-regions as in the spatial pyramid matching (SPM) approach. Second, we introduce a pyramid nearest-neighbor kernel that measures the local similarity in each pyramid window. Third, to better calibrate the outputs of each window, we fit a sigmoid function that converts its SVM outputs into posterior probabilities, and then weight the outputs of all windows. The sigmoid parameters and weight values are learned in a class-dependent and window-dependent manner; by doing so, we learn a class-specific geometric correspondence. Finally, the proposed approach is evaluated on two public datasets, Scene-15 and Caltech-101. We reach an 85.2% recognition rate on Scene-15 and 73.3% on Caltech-101 using only a single descriptor. The experimental results show that our approach significantly outperforms existing techniques.
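
The sigmoid calibration in the third contribution is essentially Platt scaling. A minimal sketch of fitting the sigmoid to a window's SVM outputs and fusing the calibrated windows with weights (the class- and window-dependent weight learning itself is omitted here):

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, labels):
    """Platt scaling: fit p(y=1|s) = 1 / (1 + exp(A*s + B)) by minimizing
    the negative log-likelihood on held-out SVM decision values."""
    y = (labels > 0).astype(float)
    def nll(params):
        A, B = params
        p = np.clip(1.0 / (1.0 + np.exp(A * scores + B)), 1e-12, 1 - 1e-12)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return minimize(nll, x0=[-1.0, 0.0], method="Nelder-Mead").x

def window_posterior(score, platt_params):
    """Calibrated posterior for one pyramid window's SVM output."""
    A, B = platt_params
    return 1.0 / (1.0 + np.exp(A * score + B))

def fuse_windows(posteriors, weights):
    """Weighted combination of the calibrated pyramid-window outputs;
    the weights would be learned per class and per window."""
    return float(np.dot(weights, posteriors) / np.sum(weights))
```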



A natural and synthetic corpus for benchmarking of hand gesture recognition systems
Javier Molina, José A. Pajuelo, Marcos Escudero-Viñolo, Jesús Bescós, José M. Martínez

The use of hand gestures offers an alternative to commonly used human–computer interfaces (e.g., keyboard, mouse, gamepad, voice), providing a more intuitive way of navigating among menus and in multimedia applications. This paper presents a dataset for the evaluation of hand gesture recognition approaches in human–computer interaction scenarios. It includes natural and synthetic data from several state-of-the-art dictionaries. The dataset considers single-pose and multiple-pose gestures, as well as gestures defined by pose and motion or by motion alone. Data types include static pose videos and gesture execution videos (performed by a set of eleven users and recorded with a time-of-flight camera) and synthetically generated gesture images. A novel collection of critical factors involved in the creation of a hand gesture dataset is proposed: capture technology, temporal coherence, nature of gestures, representativeness, pose issues and scalability. Special attention is given to the scalability factor: we propose a simple method for the synthetic generation of depth images of gestures, making it possible to extend a dataset with new dictionaries and gestures without the need to recruit new users, and providing more flexibility in point-of-view selection. The method is validated on the presented dataset. Finally, a separability study of the pose-based gestures of a dictionary is performed. The resulting corpus, which exceeds existing datasets in representativeness and scalability, provides a significant evaluation scenario for different kinds of hand gesture recognition solutions.



A complete system for garment segmentation and color classification
Marco Manfredi, Costantino Grana, Simone Calderara, Rita Cucchiara

In this paper, we propose a general approach for automatic segmentation, color-based retrieval and classification of garments in fashion store databases, exploiting shape and color information. The garment segmentation is automatically initialized by learning geometric constraints and shape cues, and is then performed by modeling both skin and accessory colors with Gaussian Mixture Models. For color similarity retrieval and classification, a color histogram with an optimized binning strategy, learned on the given color classes, is introduced to adapt the color description to users' perception and company marketing directives; it is combined with HOG features for garment classification. Experiments validating the proposed strategy, and a free-to-use dataset publicly available for scientific purposes, are finally detailed.
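
The skin-modeling component can be sketched with an off-the-shelf Gaussian Mixture Model on chromaticity values; the choice of the YCrCb chroma plane and the 0.5 threshold below are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np
import cv2
from sklearn.mixture import GaussianMixture

def fit_skin_gmm(skin_samples_bgr, n_components=3):
    """Fit a GMM to skin pixels in YCrCb chroma (Cr, Cb) coordinates."""
    ycrcb = cv2.cvtColor(skin_samples_bgr, cv2.COLOR_BGR2YCrCb)
    chroma = ycrcb[..., 1:3].reshape(-1, 2).astype(float)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full", random_state=0).fit(chroma)

def skin_probability_map(image_bgr, gmm):
    """Per-pixel skin log-likelihood, rescaled to [0, 1]."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    chroma = ycrcb[..., 1:3].reshape(-1, 2).astype(float)
    ll = gmm.score_samples(chroma).reshape(image_bgr.shape[:2])
    return (ll - ll.min()) / (ll.max() - ll.min() + 1e-12)

# Usage: exclude likely-skin pixels before segmenting the garment, e.g.
# garment_candidates = skin_probability_map(img, gmm) < 0.5
```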



When standard RANSAC is not enough: cross-media visual matching with hypothesis relevancy
Tal Hassner, Liav Assif, Lior Wolf

The same scene can be depicted in multiple visual media. For example, the same event can be captured by a comic image or a movie frame; the same object can be represented by a photograph or by a 3D computer graphics model. In order to extract the visual analogies that are at the heart of cross-media analysis, spatial matching is required. This matching is commonly achieved by extracting key points and scoring multiple, randomly generated mapping hypotheses: the more consensus a hypothesis can draw, the higher its score. In this paper, we go beyond the conventional set-size measure of match quality and present a more general hypothesis score that attempts to reflect how likely each hypothesized transformation is to be the correct one for the matching task at hand. This is achieved by considering additional, contextual cues for the relevance of a hypothesized transformation. This context changes from one matching task to another and reflects different properties of the match beyond the size of the consensus set. We demonstrate that by learning how to correctly score each hypothesis based on these features, we can deal much more robustly with the challenges of cross-media analysis, obtaining correct matches where conventional methods fail.
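
A hedged sketch of the idea: a RANSAC loop in which each homography hypothesis is scored by a weighted combination of contextual features rather than by inlier count alone. The three features shown are illustrative stand-ins for the paper's learned relevancy cues, and the weight vector would be learned offline for the matching task at hand.

```python
import numpy as np
import cv2

def hypothesis_features(H, src, dst, inlier_thresh=3.0):
    """Contextual features for one hypothesized homography H (illustrative)."""
    pts = src.reshape(-1, 1, 2).astype(np.float64)
    proj = cv2.perspectiveTransform(pts, H).reshape(-1, 2)
    errs = np.linalg.norm(proj - dst, axis=1)
    return np.array([
        float(np.mean(errs < inlier_thresh)),   # classic consensus ratio
        -float(np.median(errs)),                # tightness of the fit
        -abs(float(np.log(abs(np.linalg.det(H[:2, :2])) + 1e-12))),  # plausible scale
    ])

def ransac_with_relevancy(src, dst, weights, iters=500,
                          rng=np.random.default_rng(0)):
    """Score each hypothesis by weights . features instead of inlier count."""
    best_H, best_score = None, -np.inf
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        try:
            H = cv2.getPerspectiveTransform(src[idx].astype(np.float32),
                                            dst[idx].astype(np.float32))
        except cv2.error:            # degenerate (e.g., collinear) sample
            continue
        score = float(np.dot(weights, hypothesis_features(H, src, dst)))
        if score > best_score:
            best_H, best_score = H, score
    return best_H, best_score
```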



Image forgery detection using steerable pyramid transform and local binary pattern
Ghulam Muhammad, Munner H. Al-Hammadi, Muhammad Hussain, George Bebis

In this paper, a novel image forgery detection method is proposed based on the steerable pyramid transform (SPT) and the local binary pattern (LBP). First, given a color image, we transform it into the YCbCr color space and apply the SPT to the chrominance channels Cb and Cr, yielding a number of multi-scale and multi-oriented subbands. Then, we describe the texture in each SPT subband using LBP histograms. The histograms from all subbands are concatenated to produce a feature vector. Finally, a support vector machine uses the feature vector to classify images as forged or authentic. The proposed method has been evaluated on three publicly available image databases. Our experimental results demonstrate the effectiveness of the proposed method and its superiority over some other recent methods.
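
A minimal sketch of the descriptor stage, with one substitution: a plain Gaussian pyramid stands in for the steerable pyramid so the example stays short, while the LBP-histograms-on-chrominance-and-concatenate structure follows the abstract.

```python
import numpy as np
import cv2
from skimage.feature import local_binary_pattern

def lbp_hist(channel, P=8, R=1.0):
    """Uniform LBP histogram of one (sub)band."""
    codes = local_binary_pattern(channel, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def forgery_feature_vector(image_bgr, n_scales=3):
    """Concatenated LBP histograms of the Cb and Cr channels across scales.
    A Gaussian pyramid stands in here for the paper's steerable pyramid."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    feats = []
    for ch in (2, 1):                 # Cb, Cr (OpenCV stores Y, Cr, Cb)
        band = ycrcb[..., ch]
        for _ in range(n_scales):
            feats.append(lbp_hist(band))
            band = cv2.pyrDown(band)  # next coarser scale
    return np.concatenate(feats)

# A linear SVM (e.g., sklearn.svm.SVC) would then classify the vector
# as forged or authentic.
```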



Anisotropic diffusion algorithm based on Weber local descriptor for illumination-invariant face verification
Weihong Li, Ting Kuang, Weiguo Gong

The anisotropic diffusion (AD) algorithm is well known for illumination-invariant feature extraction from face images. The performance of the AD algorithm depends on its conduction function and discontinuity measure. In the traditional AD algorithm, however, the discontinuity measure usually adopts the spatial gradient or in-homogeneity. Though these describe the local variation of the image well, they cannot reflect the variation relative to its background, and relative variations better reflect the degree of local image variation. In this paper, we propose an improved AD algorithm that uses the Weber local descriptor (WLD), a powerful and robust local descriptor, as the discontinuity measure. We also introduce a centre-symmetric logarithmic transformation to eliminate the effect of shadow boundaries. Experiments are conducted with our proposed illumination-invariant face verification scheme on the CMU PIE and CAS-PEAL databases and a self-built real-life face database. The results demonstrate that the proposed method outperforms several typical methods on face databases with large illumination variations.
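
A hedged sketch of the core idea: a Perona-Malik style diffusion loop whose conduction is driven by WLD's differential excitation (the arctan of summed neighbor differences relative to the centre pixel) instead of the spatial gradient. Border handling and parameters are simplified assumptions, not the paper's exact scheme.

```python
import numpy as np

def weber_excitation(img, eps=1.0):
    """WLD differential excitation: arctan of the summed 8-neighbour
    differences relative to the centre pixel intensity."""
    pad = np.pad(img, 1, mode="edge")
    H, W = img.shape
    diff_sum = np.zeros_like(img, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            diff_sum += pad[1 + dy:1 + dy + H, 1 + dx:1 + dx + W] - img
    return np.arctan(diff_sum / (img + eps))

def wld_anisotropic_diffusion(img, n_iter=20, k=0.5, lam=0.2):
    """Perona-Malik style diffusion with the discontinuity measure swapped
    for the Weber excitation instead of the spatial gradient."""
    u = img.astype(float)
    for _ in range(n_iter):
        exc = weber_excitation(u)
        g = np.exp(-(exc / k) ** 2)          # conduction from the Weber measure
        # 4-neighbour differences (periodic borders via np.roll; fine for a sketch)
        dn = np.roll(u, -1, 0) - u
        ds = np.roll(u, 1, 0) - u
        de = np.roll(u, -1, 1) - u
        dw = np.roll(u, 1, 1) - u
        u = np.clip(u + lam * g * (dn + ds + de + dw), 0, None)
    return u
```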



Silhouette analysis for human action recognition based on maximum spatio-temporal dissimilarity embedding
Jian Cheng, Haijun Liu, Hongsheng Li

In this paper, we present a human action recognition method for human silhouette sequences. Inspired by the locality preserving projection and its variants, a novel manifold embedding method, maximum spatio-temporal dissimilarity embedding, is proposed to embed each action frame into a manifold, where frames from different action classes can be well separated. Unlike existing methods that incorporate both inter-class and intra-class information in the embedding process, our proposed method focuses on maximizing distances between frames that are similar in appearance but are from different classes and takes the temporal information into consideration. A variant of the Hausdorff distance is introduced for frame and sequence classification. Extensive experimental results and comparison with state-of-the-art methods demonstrate the effectiveness and robustness of the proposed method for human action silhouette analysis.
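
The sequence classification step can be sketched directly; the mean-of-minima modification shown below is one common Hausdorff variant and is an assumption, not necessarily the exact variant used in the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def modified_hausdorff(A, B):
    """Mean-of-minima Hausdorff variant between two frame sets, where
    embedded frames are the rows of A (m, d) and B (n, d)."""
    D = cdist(A, B)
    return max(D.min(axis=1).mean(), D.min(axis=0).mean())

def classify_sequence(query, train_seqs, train_labels):
    """Assign the label of the nearest training sequence under the
    Hausdorff variant; each sequence is an array of embedded frames."""
    dists = [modified_hausdorff(query, s) for s in train_seqs]
    return train_labels[int(np.argmin(dists))]
```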



Forest species recognition using macroscopic images
Pedro L. Paula Filho, Luiz S. Oliveira, Silvana Nisgoski, Alceu S. Britto Jr.

The recognition of forest species is a very challenging task that generally requires well-trained human specialists. However, training such specialists takes considerable time, and too few reach good classification accuracy to meet industry demand. Computer vision systems are a very interesting alternative in this case. The construction of a reliable classification system is not a trivial task, though: in the case of forest species, one must deal with great intra-class variability and the lack of a publicly available database for training and testing the classifiers. To cope with this variability, we propose a two-level divide-and-conquer classification strategy in which the image is first divided into several sub-images that are classified independently. At the lower level, the decisions of the different classifiers, trained with different features, are combined through a fusion rule to generate a decision for each sub-image. The higher-level fusion then combines these partial decisions to produce a final decision. Besides the classification system, we also extended our previous database, which now comprises 41 species of Brazilian flora and is available upon request for research purposes. A series of experiments shows that the proposed strategy achieves compelling results. Compared to the best single classifier, an SVM trained with a texture-based feature set, the divide-and-conquer strategy improves the recognition rate by about 9 percentage points, while the mean improvement observed over SVMs trained on different descriptors was about 19 percentage points. The best recognition rate achieved in this work was 97.77%.
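
A minimal sketch of the two-level strategy: tile the image, fuse the per-feature classifiers' posteriors for each tile (a product rule is assumed here for illustration), then let the tiles vote on the final label.

```python
import numpy as np

def tile(image, rows=3, cols=3):
    """Split an image array (H, W, ...) into a grid of sub-images."""
    H, W = image.shape[:2]
    hs, ws = H // rows, W // cols
    return [image[r * hs:(r + 1) * hs, c * ws:(c + 1) * ws]
            for r in range(rows) for c in range(cols)]

def classify_image(image, classifiers, n_classes):
    """Lower level: product-rule fusion of per-feature classifier posteriors
    for each sub-image. Higher level: majority vote over sub-images.
    Each classifier is a callable sub_image -> (n_classes,) posterior."""
    votes = np.zeros(n_classes)
    for sub in tile(image):
        fused = np.ones(n_classes)
        for clf in classifiers:
            fused *= clf(sub)          # product rule across feature sets
        votes[int(np.argmax(fused))] += 1
    return int(np.argmax(votes))
```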



Recognizing interactions between human performers by ‘Dominating Pose Doublet’
Snehasis Mukherjee, Sujoy Kumar Biswas, Dipti Prasad Mukherjee

A graph theoretic approach is proposed to recognize interactions (e.g., handshaking, punching) between two human performers in a video. Pose descriptors corresponding to each performer in the video are generated and clustered to form initial codebooks of human poses. Compact codebooks of dominating poses for each of the two performers are created by ranking the poses of the initial codebooks using two different methods. First, an average centrality measure of graph connectivity is introduced, where poses are nodes in the graph. The dominating poses are graph nodes sharing a close semantic relationship with all other pose nodes and hence are expected to lie at the central part of the graph. Second, a novel similarity measure is introduced for ranking dominating poses. The 'pose doublets', all possible combinations of dominating poses of the two performers, are ranked using an improved centrality measure on a bipartite graph. The set of 'dominating pose doublets' that best represents the corresponding interaction is selected using a perceptual analysis technique. The recognition results on standard interaction datasets show the efficacy of the proposed approach compared to the state-of-the-art.
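
The first ranking method can be sketched with a simple average-centrality measure on a pose-similarity graph: a pose's centrality is its mean similarity to all other poses, and the most central poses form the compact codebook. The construction of the similarity matrix itself is assumed given, and this is an illustrative reading of the measure, not the authors' exact definition.

```python
import numpy as np

def average_centrality(S):
    """S: symmetric (n, n) pose-similarity matrix with poses as graph nodes.
    Centrality of a pose = its mean similarity to all other poses."""
    n = S.shape[0]
    off = S.copy()
    np.fill_diagonal(off, 0.0)        # ignore self-similarity
    return off.sum(axis=1) / (n - 1)

def dominating_poses(S, k=20):
    """Keep the k most central poses as the compact codebook."""
    cent = average_centrality(S)
    return np.argsort(cent)[::-1][:k]
```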



Temporal synchronization in mobile sensor networks using image sequence analysis
Darlan N. Brito, Flávio L. C. Pádua, Guilherme A. S. Pereira

This paper addresses the problem of estimating temporal synchronization in mobile sensor networks by using image sequence analysis of their corresponding scene dynamics. Existing methods are frequently based on adaptations of techniques originally designed for wired networks with static topologies, or on solutions specially designed for ad hoc wireless sensor networks that suffer from high energy consumption and low scalability in the number of sensors. In contrast, this work proposes a novel approach that reduces the problem of synchronizing a general number N of sensors to the robust estimation of a single line in ℝ^(N+1). This line captures all temporal relations between the sensors and can be computed without any prior knowledge of these relations. It is assumed that (1) the network's mobile sensors cross the field of view of a stationary calibrated camera that operates at a constant frame rate and (2) the sensors' trajectories are estimated with limited error at a constant sampling rate, both in the world coordinate system and in the camera's image plane. Experimental results with real-world and synthetic scenarios demonstrate that our method can successfully determine the temporal alignment in mobile sensor networks.
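
The core estimation step, robustly fitting a single line to points in ℝ^(N+1), can be sketched with a two-point RANSAC followed by an SVD refit on the consensus set. The threshold and iteration counts are illustrative, and the construction of the (N+1)-dimensional points from corresponding observations is assumed done.

```python
import numpy as np

def fit_line(P):
    """Best-fit line through points P (m, d): centroid + principal direction."""
    c = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - c, full_matrices=False)
    return c, Vt[0]                       # point on the line, unit direction

def point_line_dist(P, c, d):
    """Orthogonal distances of points P to the line (c, d)."""
    r = P - c
    return np.linalg.norm(r - np.outer(r @ d, d), axis=1)

def ransac_line(P, iters=1000, thresh=0.5, rng=np.random.default_rng(0)):
    """Robustly estimate the single line in R^(N+1) that captures the
    temporal relations between the sensors and the camera."""
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(P), 2, replace=False)
        c, d = fit_line(P[idx])
        inl = point_line_dist(P, c, d) < thresh
        if best_inliers is None or inl.sum() > best_inliers.sum():
            best_inliers = inl
    return fit_line(P[best_inliers])      # refit on the consensus set
```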



Shape from interaction
Damien Michel, Xenophon Zabulis, Antonis A. Argyros

We present “shape from interaction” (SfI), an approach to the problem of acquiring 3D representations of rigid objects through observing the activity of a human who handles a tool. SfI relies on the fact that two rigid objects cannot share the same physical space. The 3D reconstruction of the unknown object is achieved by tracking the known 3D tool and by carving out the space it occupies as a function of time. Due to this indirection, SfI reconstructs rigid objects regardless of their material and appearance properties and proves particularly useful for the cases of textureless, transparent, translucent, refractive and specular objects for which there exists no practical vision-based 3D reconstruction method. Additionally, object concavities that are not directly observable can also be reconstructed. The 3D tracking of the tool is formulated as an optimization problem that is solved based on visual input acquired by a multicamera system. Experimental results from a prototype implementation of SfI support qualitatively and quantitatively the effectiveness of the proposed approach.
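
The carving step admits a compact sketch: given the tracked tool poses, remove every voxel of the candidate volume that the tool's known geometry occupies at some time. The grid layout and the point-sampled tool model are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def carve(volume_origin, voxel_size, occupancy, tool_points, poses):
    """Carve the voxels swept by the tool out of the candidate volume.
    occupancy: boolean (X, Y, Z) grid, True = possibly part of the object;
    tool_points: (P, 3) points sampled on the tool's known 3D model;
    poses: list of (R, t) rigid transforms from the 3D tracking over time."""
    for R, t in poses:
        pts = tool_points @ R.T + t                  # tool in world coordinates
        idx = np.floor((pts - volume_origin) / voxel_size).astype(int)
        valid = np.all((idx >= 0) & (idx < occupancy.shape), axis=1)
        ix, iy, iz = idx[valid].T
        occupancy[ix, iy, iz] = False                # tool was here: not object
    return occupancy
```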



Real-time moustache detection by combining image decolorization and texture detection with applications to facial gender recognition
Jian-Gang Wang, Wei-Yun Yau

There are still many challenging problems in facial gender recognition, mainly due to the complex variations of face appearance. Although there has been tremendous research effort to develop robust gender recognition over the past decade, none has explicitly exploited domain knowledge of the difference in appearance between male and female. The moustache contributes substantially to the facial appearance difference between male and female and could be a good feature to incorporate into facial gender recognition; yet little work on moustache segmentation has been reported in the literature. In this paper, a novel real-time moustache detection method is proposed that combines face feature extraction, image decolorization and texture detection. Image decolorization, which converts a color image to grayscale, aims to enhance the color contrast while preserving the grayscale appearance. A moustache, on the other hand, normally appears gray and is surrounded by skin-colored facial tissue. Hence, decolorization offers a fast and efficient way to segment the moustache. To make the algorithm robust to variations of illumination and head pose, an adaptive decolorization segmentation is proposed in which both the segmentation threshold selection and the moustache region following are guided by special regions defined by their geometric relationship with the salient facial features. Furthermore, a texture-based moustache classifier is developed to complement the decolorization-based segmentation, which could otherwise mistake darker skin or shadows around the mouth, caused by smile lines or thicker skin, for a moustache. A face is verified as containing a moustache only when (1) a sufficiently large moustache region is found by the decolorization segmentation and (2) the segmented region is confirmed as moustache by the texture-based detector. Experimental results on the color FERET database show that the proposed approach can achieve an 89% moustache face detection rate at a 0.1% false acceptance rate. By incorporating the moustache detector into a facial gender recognition system, the gender recognition accuracy on a large database was improved from 91% to 93.5%.