Associate Professor Jian Zhang

Biography

Since August 2011, A/Prof. Zhang has been with the Advanced Analytics Institute (AAi) and has recently joined the Global Big Data Technologies Centre (GBDTC), Faculty of Engineering and Information Technology (FEIT) at UTS, Sydney, where he leads the Multimedia and Data Analytics research lab. His research focuses on industry management data analytics, surveillance video content analytics and social media analytics. He has actively engaged in research projects, supervised PhD students and developed new analytics courses. As lead chief investigator, he has led more than 10 research projects with industry partners including Microsoft Research, Nokia Research Centre and Huawei Technologies in the US, Finland, Australia and China. In addition to more than 100 papers and book chapters, he is a co-author of more than ten patents filed in the US, UK, Japan and Australia, including six issued US patents and one China patent.

From January 2004 to July 2011, Dr. Zhang was a Principal Researcher with National ICT Australia (NICTA) and a Conjoint Associate Professor in the School of Computer Science & Engineering, the University of New South Wales. As a research project leader at the NICTA Sydney Lab on the UNSW Kensington campus, he led several NICTA research projects in the areas of computer vision, surveillance video content analysis, and human action classification and recognition.

From June 1997 to December 2003, Dr. Zhang was with the Visual Information Processing Lab, Motorola Labs in Sydney, as a Senior Research Engineer, later becoming a Principal Research Engineer and founding manager of the Visual Communications Research Team. He completed several technology transfers to Motorola product groups. While at Motorola Labs, he worked on diverse research projects including image processing, video coding and communication, image segmentation and multimedia content adaptation.

PhD supervisions:

As a principal supervisor, Dr. Jian Zhang has supervised 15 PhD and MSc students, including 7 who have completed their degrees.

Dr Zhang earned his Bachelor of Science in Electronic Engineering from East China Normal University, China, in 1982; a Master of Science in Computer Science from Flinders University, South Australia, in 1994; and a PhD from the School of Information Technology and Electrical Engineering, UNSW-ADFA, the University of New South Wales, in 1999.

Professional

Jian Zhang is an IEEE Senior Member and a member of the Multimedia Signal Processing Technical Committee of the IEEE Signal Processing Society. He was Technical Program Chair of the 2008 IEEE Multimedia Signal Processing Workshop, and serves as Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT) and Associate Editor of the EURASIP Journal on Image and Video Processing. Dr Zhang was Guest Editor of the T-CSVT Special Issue (March 2007) on the Convergence of Knowledge Engineering, Semantics and Signal Processing in Audiovisual Information Retrieval, and was General Co-Chair of the IEEE International Conference on Multimedia and Expo (ICME 2012) in Melbourne, Australia.

Professional Activities

Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT)

Associate Editor of International Journal of Image and Video Processing (EURASIP_JIVP)

Senior member of the IEEE and its Communications, Computer, and Signal Processing Societies.

1. Microsoft External Collaboration Project (pilot funded project): Advanced 3D Deformable Surface Reconstruction and Tracking through RGB-D Cameras. The aim of this project is to develop novel computer vision technology for real-time modeling and tracking of 3D dense and deformable surfaces using general RGB-D cameras. The expected outcomes will add significant value to the current RGB-D camera platform in the common scenario where the RGB-D camera does not move but the deformable objects of interest do.

My Research Students (as the Principal Supervisor)

Mr. Yucheng Wang PhD Candidate (funded by Microsoft Research Project and UTS International Research Scholarship)

Mr. Shangrong Huang PhD Candidate (funded by Microsoft Research Project and UTS International Research Scholarship)

Conventional scene flow, containing only translational vectors, cannot properly model 3D motion with rotation. Moreover, the accuracy of 3D motion estimation is restricted by several challenges such as large displacement, noise, and missing data (caused by sensing techniques or occlusion). Existing solutions fall into two kinds of approaches: local and global. However, local approaches cannot generate a smooth motion field, and global approaches have difficulty handling large-displacement motion. In this paper, a complete dense scene flow framework is proposed, which models both rotation and translation for general motion estimation. It combines a local method and a global method, exploiting their complementary characteristics to handle large-displacement motion and enforce smoothness respectively. The proposed framework is applied in RGB-D image space, which further improves computational efficiency. In a quantitative evaluation based on the Middlebury dataset, our method outperforms other published methods. The improved performance is further confirmed on real data acquired by a Kinect sensor.

Multiple kernel learning (MKL) optimally combines the multiple channels of each sample to improve classification performance. However, existing MKL algorithms cannot effectively handle the situation where some channels are missing, which is common in practical applications. This paper proposes an absent MKL (AMKL) algorithm to address this issue. Different from existing approaches, where missing channels are first imputed and a standard MKL algorithm is then deployed on the imputed data, our algorithm directly classifies each sample using its observed channels. Specifically, we define a margin for each sample in its own relevant space, which corresponds to the observed channels of that sample. The proposed AMKL algorithm then maximizes the minimum of all sample-based margins, which leads to a difficult optimization problem. We show that this problem can be reformulated as a convex one by applying the representer theorem, making it readily solvable via existing convex optimization packages. Extensive experiments are conducted on five MKL benchmark data sets to compare the proposed algorithm with existing imputation-based methods. Our algorithm achieves superior performance, and the improvement becomes more significant as the missing ratio increases.
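The core idea of the abstract above, that each sample's margin is computed only over the channels it actually observes rather than over imputed values, can be illustrated with a toy sketch. The linear scoring rule and all names below are hypothetical stand-ins, not the paper's actual kernel formulation:

```python
# Toy sketch of the "absent channels" idea: each sample's margin uses only
# its observed channels, with no imputation of missing ones.

def channel_score(x, w):
    """Linear score for one channel (a stand-in for a kernel expansion)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sample_margin(sample, label, weights):
    """Margin of one sample computed only over its observed channels.

    sample  : dict  channel_name -> feature vector (missing channels absent)
    label   : +1 or -1
    weights : dict  channel_name -> weight vector for that channel
    """
    score = sum(channel_score(x, weights[ch]) for ch, x in sample.items())
    return label * score

weights = {"colour": [0.5, -0.2], "texture": [1.0]}
s1 = {"colour": [1.0, 2.0], "texture": [0.3]}
s2 = {"colour": [2.0, 0.5]}            # "texture" channel is absent

m1 = sample_margin(s1, +1, weights)    # uses both channels
m2 = sample_margin(s2, -1, weights)    # uses only the observed channel
print(m1, m2)
```

The paper's contribution is then to maximize the minimum of such per-sample margins jointly, which this sketch does not attempt.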

Friend recommendation is an important recommender application in social media. Major social websites such as Twitter and Facebook are all capable of recommending friends to individuals. However, friend recommendation is a difficult task, and most social websites use simple friend recommendation algorithms based on similarity and popularity, whose accuracy does not satisfy the majority of users.
In this paper we propose a two-stage procedure for more accurate friend recommendation. In the first stage, based on the relationship between different social networks, the Flickr tag network and contact network are aligned to generate a "possible friend list". In the second stage, under the assumption that a friend's friends also tend to be friends, co-clustering is applied to the tag and image information of the list to refine the first-stage recommendation. Experimental results show that the proposed method achieves good performance and that every stage contributes to the recommendation.
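The two-stage flow above can be sketched in miniature. Stage one is shown as a simple intersection of candidates from two aligned networks; stage two is stubbed with a shared-tag count rather than the paper's co-clustering, and all users and tags below are invented:

```python
# Hypothetical sketch of the two-stage friend recommendation flow.

def possible_friends(user, tag_net, contact_net):
    """Stage 1: candidates present in both aligned networks."""
    return tag_net.get(user, set()) & contact_net.get(user, set())

def refine(user, candidates, user_tags):
    """Stage 2 (stub): rank candidates by shared-tag count."""
    me = user_tags.get(user, set())
    return sorted(candidates,
                  key=lambda c: len(me & user_tags.get(c, set())),
                  reverse=True)

tag_net     = {"alice": {"bob", "carol", "dave"}}
contact_net = {"alice": {"bob", "dave", "erin"}}
user_tags   = {"alice": {"sydney", "street"},
               "bob":   {"sydney"},
               "dave":  {"sydney", "street"}}

cands = possible_friends("alice", tag_net, contact_net)   # {'bob', 'dave'}
ranked = refine("alice", cands, user_tags)
print(ranked)   # 'dave' shares more tags with 'alice' than 'bob' does
```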

Stereo matching methods based on PatchMatch obtain good results in complex texture regions but perform poorly in low-texture regions. In this paper, a new method that integrates PatchMatch and graph cuts (GC) is proposed to achieve good results in both complex and low-texture regions. A label is randomly assigned to each pixel and optimized through a propagation process; all these labels constitute a label space for each GC iteration. A Ground Control Points (GCPs) constraint term is also added to the GC to overcome the disadvantages of PatchMatch stereo in low-texture regions. The proposed method combines the spatial propagation of PatchMatch with the global property of GC. Experimental results on the Middlebury evaluation system show that our method outperforms all other PatchMatch-based methods.

Dense correspondence computation is a critical computer vision task with many applications. Most existing dense correspondence methods consider all the neighbors connected to the center pixels and use a local support region. However, such an approach may only achieve a locally optimal solution. In this paper, we propose a non-local dense correspondence computation method that calculates the match cost on a tree structure. It is non-local because all other nodes on the tree contribute to the match cost computation for the current node. The proposed method consists of three steps: 1) DAISY descriptor computation, 2) edge-preserving segmentation and forest construction, and 3) PatchMatch fast search. We test our algorithm on the Middlebury and Moseg datasets. The results show that the proposed method outperforms state-of-the-art methods in dense correspondence computation and has low computational complexity.

Previous work on human action analysis mainly focuses on designing hand-crafted local features and combining their context information. In this paper, we propose supervised feature learning as a way to learn spatio-temporal features. More specifically, a modified hidden conditional random field is applied to learn two high-level features conditioned on a certain action label: individual features that describe the appearance of local parts, and interaction features that capture their spatial constraints. To make the best of what has been learned, a new categorization model is proposed for action matching. It is inspired by the Deformable Part Model, with the intuition that actions can be modeled by local features under changeable spatial and temporal dependencies. Experimental results show that our algorithm can successfully recognize human actions with high accuracy on both simple atomic action databases (KTH and Weizmann) and a complex interaction activity database (CASIA).

In this paper, we present a depth super-resolution framework that fuses depth imaging and stereo vision to produce high-resolution and high-accuracy depth maps. Depth cameras and stereo vision each have their own limitations, but their range-sensing characteristics are complementary; combining both approaches can therefore produce more satisfactory results than either one alone. Unlike previous fusion methods, we initially take the noisy depth observation from the depth camera as prior information about scene structure. This prior is also used to infer structurally determinant information, such as depth discontinuity and occlusion, which is essential for improving the quality of the depth map in the fusion process. Subsequently, the prior knowledge helps to overcome difficulties of intensity inconsistency in the image observations from the stereo vision component. Experimental results demonstrate the effectiveness of the proposed framework.

Object registration has been widely discussed with the development of various range sensing technologies. In most work, however, the point clouds of reference and target are generated by the same technology, such as a Kinect range camera, LiDAR sensor, or Structure from Motion (SfM) technique. Cases in which reference and target point clouds are generated by different technologies are rarely discussed. Due to the significant differences across point cloud data in terms of density, sensing noise, scale, occlusion, etc., object registration between such different point clouds becomes extremely difficult. In this study, we address for the first time an even more challenging case in which the differently-sourced point clouds are acquired from a real street view: one is generated from an image sequence through the SfM process, and the other is produced directly by a LiDAR system. We propose a two-stage matching and registration algorithm to achieve object registration between these two different point clouds. Experiments based on real building object point cloud data demonstrate the effectiveness and efficiency of the proposed solution, which can be further developed to contribute to several related applications, such as Location-Based Services.

Saliency plays a vital role in various image analysis tasks, such as content-aware image retargeting, image retrieval and object detection. It is generally accepted that saliency detection can benefit from the integration of multiple visual features. However, most existing work fuses multiple features at the saliency map level without considering cross-feature information, i.e., it generates a saliency map from several maps each computed from an individual feature. In this paper, we propose a Multiple Feature Distance Preserving (MFDP) model to seamlessly integrate multiple visual features through an alternating optimization process. Our method outperforms state-of-the-art methods on saliency detection. Saliency detected by our method is further combined with the seam carving algorithm and significantly improves image retargeting performance.

Existing multiple kernel learning (MKL) algorithms indiscriminately apply the same set of kernel combination weights to all samples. However, the utility of base kernels can vary across samples, and a base kernel useful for one sample can become noisy for another. In this case, rigidly applying the same set of kernel combination weights can adversely affect learning performance. To improve this situation, we propose a sample-adaptive MKL algorithm in which base kernels are allowed to be adaptively switched on or off with respect to each sample. We achieve this goal by assigning a latent binary variable to each base kernel when it is applied to a sample. The kernel combination weights and the latent variables are jointly optimized via the margin maximization principle. As demonstrated on five benchmark data sets, the proposed algorithm consistently outperforms comparable algorithms in the literature.

People counting is a topic with various practical applications. Over the last decade, two general approaches have been proposed to tackle this problem: a) counting based on individual human detection; b) counting by measuring the regression relation between crowd density and the number of people. Because the regression-based method avoids explicit people detection, which faces several well-known challenges, it is considered robust, particularly in complicated environments. An efficient regression-based method is proposed in this paper, which can be readily adopted into any existing video surveillance system. It uses color-based segmentation to extract foreground regions in images, and a regression is established between foreground density and the number of people. The method is fast and can deal with lighting condition changes. Experiments on public datasets and one captured dataset have shown the effectiveness and robustness of the method.
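The regression idea in the abstract above, relating foreground density to the number of people, can be illustrated with a minimal one-dimensional least-squares fit. The density/count pairs below are invented for illustration and are not from the paper's experiments:

```python
# Minimal sketch of the density-to-count regression idea.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# (foreground density, people count) pairs from hypothetical frames
density = [0.05, 0.10, 0.20, 0.40]
count   = [1, 2, 4, 8]

a, b = fit_line(density, count)
estimate = a * 0.30 + b        # predicted count for a new frame's density
print(round(estimate, 2))
```

A deployed system would of course fit on many frames and may need a more robust regressor, but the mapping from a cheap foreground measurement to a count is the essence of the approach.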

For a given data set, exploring its multi-view instances under a clustering framework is a practical way to boost clustering performance. This is because each view may reflect only partial information about the data. Furthermore, due to noise and other factors, exploring instances from different views enhances the mining of the real structure and feature information within the data set. In this paper, we propose a multiple kernel spectral clustering algorithm over the multi-view instances of a given data set. By combining kernel matrix learning and spectral clustering optimization into one framework, the algorithm can determine the kernel weights and cluster the multi-view data simultaneously. We compare the proposed algorithm with recently published methods on real-world datasets to show its efficiency.
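One building block of the approach above is combining per-view kernel matrices into a single matrix via learned weights. The sketch below shows only that combination step with made-up 2x2 kernels and fixed weights; the paper's joint optimization of weights and clusters is omitted:

```python
# Weighted combination of per-view kernel matrices: the single matrix that
# spectral clustering would subsequently operate on.

def combine_kernels(kernels, weights):
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

K1 = [[1.0, 0.8], [0.8, 1.0]]      # kernel from view 1 (hypothetical)
K2 = [[1.0, 0.2], [0.2, 1.0]]      # kernel from view 2 (hypothetical)

K = combine_kernels([K1, K2], [0.5, 0.5])
print(K)
```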

'Gait' is a person's manner of walking. Patients may have an abnormal gait due to a range of physical impairments or brain damage. Clinical gait analysis (CGA) is a technique for identifying the underlying impairments that affect a patient's gait pattern, and is critical for treatment planning. Essentially, CGA tries to use a patient's physical examination results, known as static data, to interpret the dynamic characteristics of an abnormal gait, known as dynamic data. This process is carried out by gait analysis experts, mainly based on their experience, which may lead to subjective diagnoses. To facilitate the automation of this process and form a relatively objective diagnosis, this paper proposes a new probabilistic correlated static-dynamic model (CSDM) to discover correlated relationships between the dynamic characteristics of gait and their root causes in the static data space. We propose an EM-based algorithm to learn the parameters of the CSDM. One of the main advantages of the CSDM is its ability to provide intuitive knowledge: for example, it can describe what kinds of static data will lead to what kinds of hidden gait patterns in the form of a decision tree, which helps us to infer dynamic characteristics from static data. Our initial experiments indicate that the CSDM is promising for discovering the correlated relationship between physical examination (static) and gait (dynamic) data.

In this work, we introduce a new edge feature, En-Contour, to improve head-shoulder detection performance. Since head-shoulder detection is highly vulnerable to vague contours, our new edge feature is designed to extract and enhance the head-shoulder contour while suppressing other contours. The basic idea is that the head-shoulder contour can be predicted by filtering the edge image with edge patterns, which are generated from edge fragments through a learning process. This edge feature significantly enhances object contours such as the human head and shoulders. To evaluate the performance of En-Contour, we combine it with HOG+LBP [1] as HOG+LBP+En-Contour. HOG+LBP is the state-of-the-art feature in pedestrian detection, and because human head-shoulder detection is a special case of pedestrian detection, we also use it as our baseline. Our experiments indicate that the new feature significantly improves on HOG+LBP.

Boosting algorithms have attracted great attention since Viola & Jones built the first real-time face detector by performing feature selection and strong classifier learning simultaneously. On the other hand, researchers have proposed decoupling these two procedures to improve the performance of boosting algorithms. Motivated by this, we propose a boosting-like algorithm framework that embeds semi-supervised subspace learning methods. It selects weak classifiers based on class separability, and the combination weights of the selected weak classifiers are obtained by subspace learning. Three typical algorithms are proposed under this framework and evaluated on public data sets. As shown by our experimental results, the proposed methods obtain superior performance over their supervised counterparts and AdaBoost.

Scalability to large numbers of classes is an important challenge for multi-class classification. Prediction can often be computationally infeasible at test time when it requires evaluating a classifier trained for every individual class. This paper proposes an attribute-based learning method to overcome this limitation. The first step is to define attributes and their associations with object classes automatically and simultaneously; such associations are learned with a greedy strategy under certain conditions. The second step is to learn a classifier for each attribute instead of each class. These trained classifiers are then used to predict classes based on their attribute representations. The proposed method also allows a trade-off between test-time complexity (which grows linearly with the number of attributes) and accuracy. Experiments on the Animals-with-Attributes and ILSVRC2010 datasets show that the performance of our method is promising compared with the state-of-the-art.
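The prediction step described above, mapping a vector of per-attribute classifier outputs to the class whose attribute signature matches best, can be sketched as follows. The animal classes, binary signatures and predicted attribute vector are invented toy data, not the datasets used in the paper:

```python
# Attribute-based class prediction: choose the class whose known attribute
# signature is closest (in Hamming distance) to the predicted attributes.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def predict_class(attr_pred, signatures):
    """Pick the class whose attribute signature best matches the prediction."""
    return min(signatures, key=lambda c: hamming(attr_pred, signatures[c]))

# class -> binary attribute signature, e.g. [striped, aquatic, has_hooves]
signatures = {"zebra":   [1, 0, 1],
              "dolphin": [0, 1, 0],
              "horse":   [0, 0, 1]}

attr_pred = [1, 0, 1]   # hypothetical output of the per-attribute classifiers
print(predict_class(attr_pred, signatures))
```

Note how test-time cost depends on the number of attributes and classes' signature comparisons, not on running one full classifier per class.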

We present a framework to simultaneously detect and segment pedestrians in images. Our work is based on a part-based method. We first segment the image into superpixels, then assemble superpixels into body part candidates by comparing the assembled shape with a pre-built template library. A structure-based shape matching algorithm is developed to measure shape similarity. All the body part candidates are input into our modified AND/OR graph to generate the most reasonable combination; the graph describes the possible variation of body configuration and models the constraint relationships between body parts. Comparison experiments on a public database show the effectiveness of our framework.

In this paper, we propose a novel unsupervised online learning trajectory analysis method based on a weighted directed graph. Each trajectory is represented as a sequence of key points. In the training stage, the unsupervised expectation-maximization (EM) algorithm is applied to the training data to cluster key points; each class is a Gaussian distribution and is treated as a node of the graph. From the classification of key points, we build a weighted directed graph to represent the trajectory network in the scene, where each path is a category of trajectories. In the test stage, we adopt an online EM algorithm to classify trajectories and update the graph. In our experiments, the approach obtains good performance compared with state-of-the-art approaches.
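The graph-building step described above can be sketched once key points have already been clustered: each trajectory becomes a sequence of cluster ids, and edge weights count observed transitions. The EM clustering itself is omitted, and the trajectories below are invented:

```python
# Build a weighted directed graph of key-point cluster transitions.

from collections import defaultdict

def build_graph(trajectories):
    """trajectories: lists of cluster ids; returns (u, v) -> transition count."""
    graph = defaultdict(int)
    for traj in trajectories:
        for u, v in zip(traj, traj[1:]):
            graph[(u, v)] += 1
    return graph

# Three trajectories already expressed as sequences of cluster ids
trajs = [[0, 1, 2],
         [0, 1, 3],
         [0, 1, 2]]

g = build_graph(trajs)
print(g[(0, 1)], g[(1, 2)], g[(1, 3)])
```

In the online setting, classifying a new trajectory and incrementing the corresponding edge counts is what "updating the graph" amounts to.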

Mobile visual search has attracted extensive attention for its huge potential in numerous applications. Research on this topic has focused on two schemes: sending query images, and sending compact descriptors extracted on mobile phones. The first scheme requires about 30-40 KB of data per query, while the second can reduce the bit rate by 10 times. In this paper, we propose a third scheme for extremely low bit rate mobile visual search, which sends compressed visual words consisting of a vocabulary tree histogram and descriptor orientations rather than the descriptors themselves. This scheme further reduces the bit rate with little extra computational cost on the client. Specifically, we store a vocabulary tree and extract visual descriptors on the mobile client. A light-weight pre-retrieval is performed to obtain the visited leaf nodes in the vocabulary tree. The orientation of each local descriptor and the tree histogram are then encoded and transmitted to the server. Our new scheme transmits less than 1 KB of data, reducing the bit rate of the second scheme by 3 times, and obtains about 30% improvement in search accuracy over the traditional Bag-of-Words baseline. The time cost is only 1.5 secs on the client and 240 msecs on the server.
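The client-side payload described above can be sketched with a toy quantizer: each local descriptor is assigned to its nearest visual word, and only the word histogram plus the descriptor orientations are sent. A flat list of words stands in for the vocabulary tree here, and all vectors are invented:

```python
# Toy sketch of the compact message: visual-word histogram + orientations.

def nearest_word(desc, vocab):
    """Index of the closest visual word by squared Euclidean distance."""
    return min(range(len(vocab)),
               key=lambda i: sum((d - v) ** 2 for d, v in zip(desc, vocab[i])))

def encode(descriptors, orientations, vocab):
    hist = [0] * len(vocab)
    for d in descriptors:
        hist[nearest_word(d, vocab)] += 1
    return hist, orientations      # the compact payload sent to the server

vocab = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # stand-in for tree leaves
descs = [[0.1, 0.1], [0.9, 1.1], [0.2, 0.0]]
hist, ors = encode(descs, [0.5, 1.2, 2.0], vocab)
print(hist)
```

Sending the histogram and a few quantized orientations instead of full descriptors is what pushes the payload under 1 KB in the paper's scheme.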

Image co-occurrence has shown great power in object classification because it simultaneously captures the characteristics of individual features and the spatial relationships between them. For example, the Co-occurrence Histogram of Oriented Gradients (CoHOG) has achieved great success in human detection. However, the gradient orientation in CoHOG is sensitive to noise, and CoHOG does not take gradient magnitude into account, which is a key component for reinforcing feature detection. In this paper, we propose a new LBP feature detector based on image co-occurrence, known as CoGMuLBP: building on uniform Local Binary Patterns, it detects co-occurrence orientation through gradient magnitude calculation. An extended version of CoGMuLBP is also presented. Experimental results on the UIUC car data set show that the proposed features outperform state-of-the-art methods.

Many approaches to image classification tend to transform an image into an unstructured set of numeric feature vectors obtained globally and/or locally, and as a result lose important relational information between regions. In order to encode the geometric relationships between image regions, we propose a variety of structural image representations that are not specialised for any particular image category. Besides the traditional grid-partitioning and global segmentation methods, we investigate the use of local scale-invariant region detectors. Regions are connected based not only upon nearest-neighbour heuristics, but also upon minimum spanning trees and Delaunay triangulation. In order to maintain the topological and spatial relationships between regions, and also to effectively process undirected connections represented as graphs, we utilise the recently-proposed graph neural network model. To the best of our knowledge, this is the first utilisation of the model to process graph structures based on local-sampling techniques, for the task of image classification. Our experimental results demonstrate great potential for further work in this domain.

Two main components of Procrustes Shape Analysis (PSA) are adopted and adapted specifically to address gait recognition under small viewing angle changes: 1) the Procrustes Mean Shape (PMS) for gait signature description; 2) the Procrustes Distance (PD) for similarity measurement. Pairwise Shape Configuration (PSC) is proposed as a shape descriptor in place of the Centroid Shape Configuration (CSC) used in conventional PSA, since PSC can better tolerate shape change caused by viewing angle variation. A small variation of viewing angle strongly affects only the global gait appearance; with no major impact on local spatio-temporal motion, PSC, which effectively embeds local shape information, can generate robust view-invariant gait features. To enhance recognition performance, a novel boundary re-sampling process is proposed. It provides only the necessary re-sampled points for the PSC description while efficiently solving the problems of boundary point correspondence, boundary normalization and boundary smoothness; this re-sampling process adopts prior knowledge of body pose structure. Comprehensive experiments are carried out on the CASIA gait database. Compared with other methods in the literature, the proposed method significantly improves gait recognition under small viewing angle changes without the additional requirements of supervised learning, a known viewing angle or a multi-camera system.

This paper presents an active learning approach for recognizing human actions in videos based on a multiple kernel method. We design the classifier based on Multiple Kernel Learning (MKL) through Gaussian Process (GP) regression, and train it in an active learning loop: in each iteration, one optimal sample is selected to be interactively annotated and incorporated into the training set, with the selection based on heuristic feedback from the GP classifier. To our knowledge, GP-regression MKL-based active learning has not previously been applied to human action recognition. We test this approach on standard benchmarks; it outperforms state-of-the-art techniques in accuracy while requiring significantly fewer training samples.

Pedestrian detection in thermal images is a difficult task due to intrinsic challenges: 1) low image resolution, 2) thermal noise, 3) polarity changes, and 4) lack of color, texture or depth information. To address these challenges, we propose a novel mid-level feature descriptor for pedestrian detection in the thermal domain, which combines a pixel-level Steering Kernel Regression Weights Matrix (SKRWM) with its corresponding covariances. The SKRWM properly captures the local structure of pixels, while the covariance computation further provides the correlations of low-level features. This mid-level feature descriptor not only captures pixel-level data differences and the spatial differences of local structure, but also explores the correlations among low-level features, allowing it to discriminatively distinguish pedestrians from complex backgrounds. To test the descriptor's performance, a popular classifier framework based on Principal Component Analysis (PCA) and a Support Vector Machine (SVM) is built. Overall, our experimental results show that the proposed approach overcomes the problems caused by background subtraction in [1] while attaining detection accuracy comparable to the state-of-the-art.

Walking speed change is considered a typical challenge hindering reliable human gait recognition. This paper proposes a novel method to extract speed-invariant gait features based on Procrustes Shape Analysis (PSA). Two major components of PSA, the Procrustes Mean Shape (PMS) and the Procrustes Distance (PD), are adopted and adapted specifically for speed-invariant gait recognition. One of our major contributions is that, instead of using the conventional Centroid Shape Configuration (CSC), which is not suitable for describing individual gait when body shape changes, particularly due to changes in walking speed, we propose a new descriptor named Higher-order derivative Shape Configuration (HSC), which generates robust speed-invariant gait features. From the first order to higher orders, derivative shape configurations contain gait shape information at different levels; intuitively, a higher order of derivative can describe gait with shape change caused by larger changes in walking speed. Encouraging experimental results show that our proposed method is efficient for speed-invariant gait recognition and evidently outperforms other existing methods in the literature.

There is an abundant literature on face detection due to its important role in many vision applications. Since Viola and Jones proposed the first real-time AdaBoost-based face detector, Haar-like features have been the method of choice for frontal face detection. In this work, we show that simple features other than Haar-like features can also be used to train an effective face detector. Since a single feature is not discriminative enough to separate faces from difficult non-faces, we further improve the generalization performance of our simple features by introducing feature co-occurrences. We demonstrate that our proposed features yield a performance improvement compared to Haar-like features, and our findings indicate that features play a crucial role in the ability of the system to generalize.

It has been shown that gait is an efficient biometric feature for identifying a person at a distance. However, it is challenging to obtain a reliable gait feature when the viewing angle changes, because body appearance can differ under various viewing angles. In this paper, the problem above is formulated as a regression problem in which a novel View Transformation Model (VTM) is constructed by adopting a Multilayer Perceptron (MLP) as the regression tool. It smoothly estimates the gait feature under an unknown viewing angle based on motion information in a well-selected Region of Interest (ROI) under other existing viewing angles. Thus, this proposal can normalize gait features under various viewing angles into a common viewing angle before gait similarity measurement is carried out. Encouraging experimental results have been obtained on a widely adopted benchmark database.

Gait is a well-recognized biometric feature that is used to identify a human at a distance. However, in real environments, appearance changes of individuals due to viewing angle changes cause many difficulties for gait recognition. This paper re-formulates this problem as a regression problem. A novel solution is proposed to create a View Transformation Model (VTM) across different viewpoints using Support Vector Regression (SVR). To facilitate the regression process, a new method is proposed to seek a local Region of Interest (ROI) under one viewing angle for predicting the corresponding motion information under another viewing angle. Thus, the well-constructed VTM is able to transfer gait information under one viewing angle into another, achieving view-independent gait recognition: it normalizes gait features under various viewing angles into a common viewing angle before similarity measurement is carried out. Extensive experimental results on a widely adopted benchmark dataset demonstrate that the proposed algorithm achieves significantly better performance than existing methods in the literature.
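The regression core of a VTM can be sketched as follows. A one-dimensional ordinary least-squares fit stands in for the paper's SVR, and the training pairs (an ROI feature value under one view, the corresponding feature value under another view) are synthetic; this is an illustrative simplification, not the paper's implementation.

```python
def fit_view_transform(roi_values, target_values):
    """Least-squares fit target ≈ w * roi + b via normal equations.
    A minimal stand-in for the SVR regressor used in the paper."""
    n = len(roi_values)
    sx, sy = sum(roi_values), sum(target_values)
    sxx = sum(x * x for x in roi_values)
    sxy = sum(x * y for x, y in zip(roi_values, target_values))
    w = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - w * sx) / n
    return w, b

def transform(roi_value, model):
    """Predict the gait feature value under the target view."""
    w, b = model
    return w * roi_value + b
```

In the full method one such regressor is trained per target-view feature dimension, so a probe gait feature can be mapped into the gallery's viewing angle before matching.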

Feature enhancement in an image reinforces certain extracted features so that they can be used for object classification and detection. As thermal images lack texture and color information, techniques for visual-image feature enhancement are insufficient when applied to thermal images. In this paper, we propose a new gradient-based approach for feature enhancement in thermal images. We use the statistical properties of the gradient of foreground object profiles, and formulate object features with gradient saliency. Empirical evaluation of the proposed approach shows a significant performance improvement on human contours, which can be used for detection and classification.

In this paper, we propose an adaptive cross-layer technique that optimally enhances the QoS of wireless video transmission in an IEEE 802.11e WLAN. The optimization takes into account the unequal error protection characteristics of video streaming, the IE

This paper presents a unified framework for human action classification and localization in video using structured learning of local space-time features. Each human action class is represented by its own compact set of local patches. In our appr

We develop a robust technique to find similar matches of human actions in video. Given a query video, Motion History Images (MHI) are constructed for consecutive keyframes. This is followed by dividing the MHI into local Motion-Shape regions, which allow

We present a new method for detecting pedestrians in thermal images. The method is based on the Shape Context Descriptor (SCD) within the AdaBoost cascade classifier framework. Compared with standard optical images, thermal imaging cameras offer a clear advantage for night-time video surveillance, and they are also robust to lighting changes in the daytime. Experiments show that shape context features with boosting classification provide a significant improvement on human detection in thermal images. In this work, we have also compared our proposed method with rectangle features on a public dataset of thermal imagery. Results show that shape context features clearly outperform the conventional rectangular features on this task.

Detection of duplicate or near-duplicate videos in large-scale databases plays an important role in video search. In this paper, we analyze the problem of near-duplicate detection and propose a practical and effective solution for real-time large-scale v

In this paper, we present a robust framework for action recognition in video that performs competitively against state-of-the-art methods, yet does not rely on sophisticated background subtraction preprocessing to remove background features.

This paper presents a unified framework for recognizing human action in video using human pose estimation. Due to the high variation of human appearance and noisy background context, accurate human pose analysis is hard to achieve and is rarely employed for action recognition. In our approach, we take advantage of the current success of human detection and the view invariance of local feature-based approaches to design a pose-based action recognition system. We begin with a frame-wise human detection step to initialize the search space for human local parts, then integrate the detected parts into a human kinematic structure using a tree-structured graphical model. The final human articulation configuration is eventually used to infer the action class being performed, based on each single part's behavior and the overall structure variation. In our work, we also show that even with imprecise pose estimation, accurate action recognition can still be achieved based on informative clues from the overall pose part configuration. The promising results obtained on an action recognition benchmark show that our proposed framework is comparable to existing state-of-the-art action recognition algorithms.

Human identification by recognizing spontaneous gait recorded in real-world settings is a tough and not yet fully resolved problem in biometrics research. Several issues contribute to the difficulty of this task, including various poses, different clothes, moderate to large changes of normal walking manner due to carrying diverse goods while walking, and the uncertainty of the environments in which people are walking. To achieve better gait recognition, this paper proposes a new method based on the Weighted Binary Pattern (WBP). WBP first constructs a binary pattern from a sequence of aligned silhouettes. Then, an adaptive weighting technique is applied to discriminate the significance of the bits in gait signatures. Compared with most existing methods in the literature, this method better captures gait frequency, local spatial-temporal human pose features, and global body shape statistics. The proposed method is validated on several well-known benchmark databases. The extensive and encouraging experimental results show that the proposed algorithm achieves high accuracy with low complexity and computational time.
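A minimal sketch of the WBP idea, under our own simplifying assumptions: bits are per-position silhouette values in a flattened signature, and a Bernoulli-variance heuristic stands in for the paper's adaptive weighting, which is not specified in the abstract.

```python
def bit_weights(gallery_signatures):
    """One weight per bit position: bits whose value varies across the gallery
    are treated as more discriminative (variance p(1-p) of a Bernoulli bit).
    This heuristic is an illustrative stand-in, not the paper's weighting."""
    n = len(gallery_signatures)
    weights = []
    for j in range(len(gallery_signatures[0])):
        p = sum(sig[j] for sig in gallery_signatures) / n
        weights.append(p * (1 - p))
    return weights

def weighted_distance(a, b, weights):
    """Weighted Hamming distance between two binary gait signatures."""
    return sum(w for w, x, y in zip(weights, a, b) if x != y)
```

Bits that are identical across the whole gallery receive zero weight, so matching concentrates on the positions that actually separate subjects.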

Gait is a well-recognized biometric that has been widely used for human identification. However, current gait recognition may encounter difficulties when the viewing angle changes, because the viewing angle under which the gait signature database was generated may not be the same as the viewing angle at which the probe data are obtained. This paper proposes a new multi-view gait recognition approach which tackles the problems mentioned above. Differently from other approaches of the same category, this new method creates a so-called View Transformation Model (VTM) based on the spatial-domain Gait Energy Image (GEI) by adopting the Singular Value Decomposition (SVD) technique. To further improve the performance of the proposed VTM, Linear Discriminant Analysis (LDA) is used to optimize the obtained GEI feature vectors. When implementing SVD there are a few practical problems, such as large matrix size and over-fitting; in this paper, reduced SVD is introduced to alleviate them. Using the generated VTM, the viewing angles of gallery gait data and probe gait data can be transformed into the same direction, so gait signatures can be compared without difficulty. Extensive experiments show that the proposed algorithm significantly improves multi-view gait recognition performance when compared to similar methods in the literature.
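The SVD-based VTM construction described above can be summarized by the standard factorization used in this line of work (notation ours, hedged against the abstract rather than the full paper):

```latex
% Stack GEI feature vectors g_{ij} (view i = 1..K, training subject j = 1..m):
G \;=\;
\begin{bmatrix}
g_{11} & \cdots & g_{1m}\\
\vdots &        & \vdots\\
g_{K1} & \cdots & g_{Km}
\end{bmatrix}
\;=\; U S V^{\top}
\;\approx\;
\begin{bmatrix} P_{1}\\ \vdots\\ P_{K} \end{bmatrix}
\begin{bmatrix} v_{1} & \cdots & v_{m} \end{bmatrix},
\qquad g_{ij} \approx P_{i}\, v_{j},
% P_i: view-specific block of US;  v_j: subject-specific column of V^T.
% A probe feature observed under view i is mapped to view k via
g_{kj} \;=\; P_{k}\, P_{i}^{+}\, g_{ij},
% where P_i^+ is a pseudo-inverse; the reduced SVD keeps only the leading
% singular values, which curbs the matrix-size and over-fitting problems.
```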

Recent efforts show that it is possible to calibrate a surveillance camera simply from observing a walking human. This procedure can be seen as a special application of the camera self-calibration technique. Several methods have been proposed along this

We present a two-layer night time vehicle detector in this work. At the first layer, vehicle headlight detection [1, 2, 3] is applied to find areas (bounding boxes) where the possible pairs of headlights locate in the image, the Haar feature based AdaBoo

Face detection plays an important role in many vision applications. Since Viola and Jones [1] proposed the first real-time AdaBoost based object detection system, much effort has been spent on improving the boosting method. In this work, we first show

This paper presents a new technique for action recognition in video using human body part-based approach, combining both local feature description of each body part, and global graphical model structure of the human action. The human body is divided into

Automated surveillance systems are becoming increasingly important, especially in the fields of computer vision and video processing. This paper describes a novel approach for improving the results of detecting foreground objects and their shadows in indoor

This paper presents an experimental study on pedestrian detection using state-of-the-art local feature extraction and support vector machine (SVM) classifiers. The performance of pedestrian detection using region covariance, histogram of oriented gradien

This paper presents a novel algorithm for robust object tracking based on the particle filtering method employed in recursive Bayesian estimation and image segmentation and optimisation techniques employed in active contour models and level set methods.

In wireless environments, video quality can be severely degraded due to channel errors. Improving error robustness against the impact of packet loss in error-prone networks is a critical concern in wireless video networking research. Data pa

Robust visual tracking has become an important topic of research in computer vision. A novel method for robust object tracking, GATE [11], improves object tracking in complex environments using the particle filtering and the level set-based active contou

A statistical and computer vision approach using tracked moving vehicle shapes for auto-calibrating traffic surveillance cameras is presented. The vanishing point of the traffic direction is obtained via linear regression over all tracked vehicle points. Pre

A robust framework to classify vehicles in nighttime traffic using vehicle eigenspaces and a support vector machine is presented. In this paper, a systematic approach has been proposed and implemented to classify vehicles from roadside camera video sequenc

Techniques for detecting pedestrians in still images have attracted considerable research interest due to their wide applications such as video surveillance and intelligent transportation systems. In this paper, we propose a novel, simpler pedestrian detector using state-of-the-art locally extracted features, namely covariance features. Covariance features were originally proposed in [1, 2]. Unlike the work in [2], where feature selection and weak classifier training are performed on the Riemannian manifold, we select features and train weak classifiers in Euclidean space for faster computation. To this end, AdaBoost with weighted Fisher linear discriminant analysis based weak classifiers is adopted. Multiple-layer boosting with heterogeneous features is constructed to exploit the efficiency of the Haar-like feature and the discriminative power of the covariance feature simultaneously. Extensive experiments show that by combining the Haar-like and covariance features, we speed up the original covariance feature detector [2] by up to an order of magnitude in processing time without compromising detection performance. For the first time, the proposed work enables covariance feature based pedestrian detection to run in real time.

This paper proposes an efficient method for detecting ghost and left objects in surveillance video, which, if not identified, may lead to errors or wasted computation in background modeling and object tracking in surveillance systems. This method contain

The ability to detect pedestrians is a first important step in many computer vision applications such as video surveillance. This paper presents an experimental study on pedestrian detection using state-of-the-art local feature extraction and support vec

With recent advances in computer vision, image processing and analysis, a retrieval process based on visual content has become a key component in achieving high-efficiency image queries for large multimedia databases. In this paper, we propose and develop

We describe a novel method to detect new stable objects in video. This includes detecting new objects that appear in a scene and remain stationary for a period of time. Examples include detecting a dropped bag or a parked car. Our method utilizes the sta

This paper describes a method of categorizing the moving objects using eigen-features and support vector machines. Eigen-features, generally used in face recognition and static image classification, are applied to classify the moving objects detected fro

Purpose: The aim of this study is to understand the knowledge sharing structure and co-production of trip-related knowledge through online travel forums.
Design/methodology/approach: The travel forum threads were collected from the TripAdvisor Sydney travel forum for the period from 2010 to 2014, comprising 115,847 threads from 8,346 conversations. The data analytical technique was based on a novel methodological approach, visual analytics, including semantic pattern generation and network analysis.
Findings: Findings indicate that the knowledge structure is created by community residents who camouflage as local experts and serve as ambassadors of a destination. The knowledge structure presents collective intelligence co-produced by community residents and tourists. Further findings reveal how these community residents associate with each other and form a knowledge repertoire covering various travel domain areas.
Practical implications: The study offers valuable insights to help destination management organizations and tour operators identify existing and emerging tourism issues to achieve a competitive destination advantage.
Originality/value: This study highlights the process of social media mediated travel knowledge co-production. It also discovers how community residents engage in reaching out to tourists by camouflaging as ordinary users.

Stochastic sampling based trackers have shown good performance for abrupt motion tracking and have thus gained popularity in recent years. However, conventional methods tend to use a two-stage sampling paradigm in which the search space needs to be uniformly explored with an inefficient preliminary sampling phase. In this paper, we propose a novel sampling-based method in the Bayesian filtering framework to address this problem. Within the framework, nearest neighbor field estimation is utilized to compute the importance proposal probabilities, which guide the Markov chain search towards promising regions and thus enhance sampling efficiency; given the motion priors, a smoothing stochastic sampling Monte Carlo algorithm is proposed to approximate the posterior distribution through a smoothing weight-updating scheme. Moreover, to track abrupt and smooth motions simultaneously, we develop an abrupt-motion detection scheme which can discover the presence of abrupt motions during online tracking. Extensive experiments on challenging image sequences demonstrate the effectiveness and robustness of our algorithm in handling abrupt motions.

The appearance gap between sketches and photo-realistic images is a fundamental challenge in sketch-based image retrieval (SBIR) systems. The existence of noisy edges on photo-realistic images is a key factor in the enlargement of the appearance gap and significantly degrades retrieval performance. To bridge the gap, we propose a framework consisting of a new line-segment-based descriptor named histogram of line relationship (HLR) and a new noise impact reduction algorithm known as object boundary selection. HLR treats sketches and extracted edges of photo-realistic images as a series of piece-wise line segments and captures the relationship between them. Based on the HLR, the object boundary selection algorithm aims to reduce the impact of noisy edges by selecting the shaping edges that best correspond to the object boundaries. Multiple hypotheses are generated for descriptors by hypothetical edge selection. The selection algorithm is formulated to find the best combination of hypotheses to maximize the retrieval score; a fast method is also proposed. To reduce the distraction of false matches in the scoring process, two constraints on spatial and coherent aspects are introduced. We tested the HLR descriptor and the proposed framework on public datasets and a new image dataset of three million images, which we recently collected for SBIR evaluation purposes. We compared the proposed HLR with state-of-the-art descriptors (SHoG, GF-HOG). The experimental results show that our HLR descriptor outperforms them. Combined with the object boundary selection algorithm, our framework significantly improves SBIR performance.
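A much-simplified sketch of the line-relationship idea behind HLR: the descriptor below reduces each pair of line segments to their relative angle and histograms it. The real HLR encodes richer inter-segment relations, so this is an illustrative assumption, not the paper's descriptor.

```python
import math
from itertools import combinations

def segment_angle(seg):
    """Undirected orientation of a 2-D line segment, in [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def hlr_histogram(segments, bins=8):
    """Simplified line-relationship histogram: bin the relative angle of every
    pair of segments, normalized to sum to one."""
    hist = [0] * bins
    for a, b in combinations(segments, 2):
        rel = abs(segment_angle(a) - segment_angle(b))
        rel = min(rel, math.pi - rel)                 # fold to [0, pi/2]
        idx = min(int(rel / (math.pi / 2) * bins), bins - 1)
        hist[idx] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]
```

Because only relative angles are used, the histogram is unchanged when sketch and image edges differ by a global rotation, which is one reason pairwise relations help bridge the sketch/photo gap.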

Integrating multi-source information has recently shown promising performance in predicting Alzheimer's disease (AD). Multiple kernel learning (MKL) plays an important role in this regard by learning the combination weights of a set of base kernels via the principle of margin maximisation. The latest research on MKL further incorporates the radius of the minimum enclosing ball (MEB) of the training data to improve kernel learning performance. However, we observe that directly applying these radius-incorporated MKL algorithms to AD prediction tasks does not necessarily improve, and sometimes even degrades, the prediction accuracy. In this paper, we propose an improved radius-incorporated MKL algorithm for AD prediction. First, we redesign the objective function by approximating the radius of the MEB with its upper bound, a linear function of the kernel weights. This approximation makes the resulting optimisation problem convex and globally solvable. Second, instead of using cross-validation, we model the regularisation parameter C of the SVM classifier as an extra kernel weight and automatically tune it in MKL. Third, we theoretically show that our algorithm can be reformulated into a similar form to the SimpleMKL algorithm and conveniently solved by off-the-shelf packages. We discuss the factors that contribute to the improved performance and apply our algorithm to discriminate different clinical groups in the benchmark ADNI data set. As experimentally demonstrated, our algorithm can better utilise the radius information and achieve higher prediction accuracy than comparable MKL methods in the literature. In addition, our algorithm demonstrates the highest computational efficiency among all comparable methods.
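The convexifying approximation mentioned in the abstract is, in our hedged reconstruction, the standard linear upper bound on the MEB radius of a combined kernel:

```latex
% For the combined kernel K(\mu) = \sum_m \mu_m K_m with \mu_m \ge 0 and
% \sum_m \mu_m = 1, the MEB radius satisfies the linear bound
R^{2}\!\big(K(\mu)\big) \;\le\; \sum_{m} \mu_m R_m^{2},
% where R_m is the MEB radius of base kernel K_m (computable once, offline).
% Substituting this bound for R^2 in the radius-margin objective removes the
% non-convex radius term, making the problem convex in \mu and globally solvable.
```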

BACKGROUND: Vision-based surveillance and monitoring is a potential alternative for the early detection of respiratory disease outbreaks in urban areas, complementing molecular diagnostics and hospital and doctor visit-based alert systems. Visible actions representing typical flu-like symptoms include sneezing and coughing, which are associated with changing patterns of hand-to-head distances, among others. The technical difficulties lie in the high complexity and large variation of those actions, as well as numerous similar background actions such as scratching one's head, cell phone use, eating and drinking. RESULTS: In this paper, we make a first attempt at the challenging problem of recognizing flu-like symptoms from videos. Since there was no related dataset available, we created a new public health dataset for action recognition that includes two major flu-like symptom related actions (sneeze and cough) and a number of background actions. We also developed a suitable novel algorithm by introducing two types of Action Matching Kernels, where both types aim to integrate two aspects of local features, namely the space-time layout and the Bag-of-Words representations. In particular, we show that the Pyramid Match Kernel and Spatial Pyramid Matching are both special cases of our proposed kernels. Besides experimenting on a standard testbed, the proposed algorithm is also evaluated on the new sneeze and cough set. Empirically, we observe that our approach achieves competitive performance compared to the state of the art, while recognition on the new public health dataset is shown to be a non-trivial task even with a simple single-person unobstructed view. CONCLUSIONS: Our sneeze and cough video dataset and newly developed action recognition algorithm are the first of their kind and aim to kick-start the field of action recognition of flu-like symptoms from videos. It will be challenging but necessary in future developments to consider more complex real-life scenario of detecting ...

The recent literature indicates that preserving global pairwise sample similarity is of great importance for feature selection and that many existing selection criteria essentially work in this way. In this paper, we argue that besides global pairwise sample similarity, the local geometric structure of data is also critical and that these two factors play different roles in different learning scenarios. In order to show this, we propose a global and local structure preservation framework for feature selection (GLSPFS) which integrates both global pairwise sample similarity and local geometric data structure to conduct feature selection. To demonstrate the generality of our framework, we employ methods that are well known in the literature to model the local geometric data structure and develop three specific GLSPFS-based feature selection algorithms. Also, we develop an efficient optimization algorithm with proven global convergence to solve the resulting feature selection problem. A comprehensive experimental study is then conducted in order to compare our feature selection algorithms with many state-of-the-art ones in supervised, unsupervised, and semisupervised learning scenarios. The result indicates that: 1) our framework consistently achieves statistically significant improvement in selection performance when compared with the currently used algorithms; 2) in supervised and semisupervised learning scenarios, preserving global pairwise similarity is more important than preserving local geometric data structure; 3) in the unsupervised scenario, preserving local geometric data structure becomes clearly more important; and 4) the best feature selection performance is always obtained when the two factors are appropriately integrated. In summary, this paper not only validates the advantages of the proposed GLSPFS framework but also gains more insight into the information to be preserved in different feature selection tasks.

With the development of image search technology, users are no longer satisfied with searching for images using just metadata and textual descriptions. Instead, more search demands are focused on retrieving images based on similarities in their contents (textures, colors, shapes, etc.). Nevertheless, one image may deliver rich or complex content and multiple interests. Sometimes users do not sufficiently define or describe their seeking demands for images even when general search interests appear, owing to a lack of specific knowledge to express their intents. A new form of information seeking activity, referred to as exploratory search, is emerging in the research community, which generally combines browsing and searching content together to help users gain additional knowledge and form accurate queries, thereby assisting the users with their seeking and investigation activities. However, there have been few attempts at addressing integrated exploratory search solutions when image browsing is incorporated into the exploring loop. In this work, we investigate the challenges of understanding users' search interests from the images being browsed and infer their actual search intentions. We develop a novel system to explore an effective and efficient way of allowing users to seamlessly switch between browse and search processes, and naturally complete visual-based exploratory search tasks. The system, called Browse-to-Search, enables users to specify their visual search interests by circling any visual objects in the webpages being browsed, and then the system automatically forms the visual entities to represent users' underlying intent. One visual entity is not limited to the original image content, but is also encapsulated by the textual-based browsing context and the associated heterogeneous attributes. We use large-scale image search technology to find the associated textual attributes from the repository. Users can then utilize the encapsulated visual entities to co...

Learning an optimal kernel plays a pivotal role in kernel-based methods. Recently, an approach called optimal neighborhood kernel learning (ONKL) has been proposed, showing promising classification performance. It assumes that the optimal kernel will reside in the neighborhood of a pre-specified kernel. Nevertheless, how to specify such a kernel in a principled way remains unclear. To solve this issue, this paper treats the pre-specified kernel as an extra variable and jointly learns it with the optimal neighborhood kernel and the structure parameters of support vector machines. To avoid trivial solutions, we constrain the pre-specified kernel with a parameterized model. We first discuss the characteristics of our approach and in particular highlight its adaptivity. After that, two instantiations are demonstrated by modeling the pre-specified kernel as a common Gaussian radial basis function kernel and a linear combination of a set of base kernels in the way of multiple kernel learning (MKL), respectively. We show that the optimization in our approach is a min-max problem and can be efficiently solved by employing the extended level method and Nesterov's method. Also, we give the probabilistic interpretation for our approach and apply it to explain the existing kernel learning methods, providing another perspective for their commonness and differences. Comprehensive experimental results on 13 UCI data sets and another two real-world data sets show that via the joint learning process, our approach not only adaptively identifies the pre-specified kernel, but also achieves superior classification performance to the original ONKL and the related MKL algorithms.

Integrating radius information has been demonstrated by recent work on multiple kernel learning (MKL) as a promising way to improve kernel learning performance. Directly integrating the radius of the minimum enclosing ball (MEB) into MKL as it is, however, not only incurs significant computational overhead but also possibly adversely affects the kernel learning performance due to the notorious sensitivity of this radius to outliers. Inspired by the relationship between the radius of the MEB and the trace of total data scattering matrix, this paper proposes to incorporate the latter into MKL to improve the situation. In particular, in order to well justify the incorporation of radius information, we strictly comply with the radius-margin bound of support vector machines (SVMs) and thus focus on the l2-norm soft-margin SVM classifier. Detailed theoretical analysis is conducted to show how the proposed approach effectively preserves the merits of incorporating the radius of the MEB and how the resulting optimization is efficiently solved. Moreover, the proposed approach achieves the following advantages over its counterparts: 1) more robust in the presence of outliers or noisy training samples; 2) more computationally efficient by avoiding the quadratic optimization for computing the radius at each iteration; and 3) readily solvable by the existing off-the-shelf MKL packages. Comprehensive experiments are conducted on University of California, Irvine, protein subcellular localization, and Caltech-101 data sets, and the results well demonstrate the effectiveness and efficiency of our approach.

Sparse coding, which encodes the natural visual signal into a sparse space for visual codebook generation and feature quantization, has been successfully utilized for many image classification applications. However, it has seldom been explored for video analysis tasks. In particular, the increased complexity of characterizing the visual patterns of diverse human actions, with both spatial and temporal variations, imposes more challenges on the conventional sparse coding scheme. In this paper, we propose an enhanced sparse coding scheme through learning a discriminative dictionary and optimizing the local pooling strategy. Localizing when and where a specific action happens in realistic videos is another challenging task. By utilizing the sparse coding based representations of human actions, this paper further presents a novel coarse-to-fine framework to localize the Volumes of Interest (VOIs) for the actions. Firstly, local visual features are transformed into the sparse signal domain through our enhanced sparse coding scheme. Secondly, in order to avoid an exhaustive scan of entire videos for VOI localization, we extend Spatial Pyramid Matching into the temporal domain, namely Spatial Temporal Pyramid Matching, to obtain the VOI candidates. Finally, a multi-level branch-and-bound approach is developed to refine the VOI candidates. The proposed framework is also able to avoid prohibitive computations in local similarity matching (e.g., nearest neighbor voting). Experimental results on two popular benchmark datasets (KTH and YouTube UCF) and the widely used localization dataset (MSR) demonstrate that our approach reduces computational cost significantly while maintaining classification accuracy comparable to that of the state-of-the-art methods.
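The Spatial Temporal Pyramid Matching step can be sketched as follows: sparse codes of local features are max-pooled over a grid of space-time cells at each pyramid level. The feature layout and pooling choice below are our own minimal assumptions; the paper's enhanced sparse coding and branch-and-bound refinement are not reproduced.

```python
def stpm_pool(coded_features, video_shape, levels=2):
    """Max-pool sparse codes over a spatio-temporal pyramid.
    coded_features: list of ((x, y, t), code_vector) pairs.
    video_shape: (width, height, duration)."""
    W, H, T = video_shape
    dim = len(coded_features[0][1])
    pooled = []
    for level in range(levels):
        cells = 2 ** level                      # cells per axis at this level
        grid = {}
        for (x, y, t), code in coded_features:
            key = (min(int(x / W * cells), cells - 1),
                   min(int(y / H * cells), cells - 1),
                   min(int(t / T * cells), cells - 1))
            cell = grid.setdefault(key, [0.0] * dim)
            for i, c in enumerate(code):        # max pooling within the cell
                cell[i] = max(cell[i], c)
        for cx in range(cells):                 # concatenate in fixed cell order
            for cy in range(cells):
                for ct in range(cells):
                    pooled.extend(grid.get((cx, cy, ct), [0.0] * dim))
    return pooled
```

Concatenating the per-cell vectors across levels yields a fixed-length video descriptor that preserves coarse space-time layout, which is what makes pyramid-based VOI candidate scoring possible without an exhaustive scan.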

Human gait is an important biometric feature which is able to identify a person remotely. However, change of view causes significant difficulties for recognizing gaits. This paper proposes a new framework to construct a view-invariant feature for cross-view gait recognition. Our view-normalization process is performed in the input layer (i.e., on gait silhouettes) to normalize gaits from arbitrary views. That is, each sequence of gait silhouettes recorded from a certain view is transformed onto a common canonical view by using the corresponding domain transformation obtained through transform invariant low-rank textures (TILT). Then, an improved scheme of Procrustes shape analysis (PSA) is proposed and applied to a sequence of the normalized gait silhouettes to extract a novel view-invariant gait feature based on the Procrustes mean shape (PMS), and gait similarity is consequently measured based on the Procrustes distance (PD). Comprehensive experiments were carried out on widely adopted gait databases. The performance of the proposed method is shown to be promising when compared with other existing methods in the literature.

Gait has been shown to be an efficient biometric feature for human identification at a distance. However, the performance of gait recognition can be affected by view variation, which makes cross-view gait recognition difficult. A novel method is proposed to solve this difficulty by using a view transformation model (VTM). The VTM is constructed through regression processes that adopt a multi-layer perceptron (MLP) as the regression tool. The VTM estimates the gait feature from one view using a well-selected region of interest (ROI) on the gait feature from another view. Thus, trained VTMs can normalize gait features across views into the same view before gait similarity is measured. Moreover, this paper proposes a new multi-view gait recognition method which estimates the gait feature for one view using selected gait features from several other views. Extensive experimental results demonstrate that the proposed method significantly outperforms other baseline methods in the literature for both cross-view and multi-view gait recognition. In our experiments, average accuracies of 99%, 98% and 93% are achieved for multi-view gait recognition using 5, 4 and 3 cameras respectively.
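The VTM idea, regressing a feature in one view from a feature in another so that similarity is measured in a common view, can be sketched with a linear least-squares regressor standing in for the paper's MLP, on synthetic features (the data, dimensions, and linear view relation here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data: gait features observed under two views,
# assumed related by an unknown linear map plus noise.
n_train, dim = 200, 10
true_map = rng.normal(size=(dim, dim))
feat_view_a = rng.normal(size=(n_train, dim))
feat_view_b = feat_view_a @ true_map + 0.01 * rng.normal(size=(n_train, dim))

# "Train" the view transformation model: regress view-B features from view-A ones.
vtm, *_ = np.linalg.lstsq(feat_view_a, feat_view_b, rcond=None)

# At test time, normalise a probe feature from view A into view B
# before matching it against a gallery recorded from view B.
probe_a = rng.normal(size=(1, dim))
probe_in_b = probe_a @ vtm
gallery_b = probe_a @ true_map            # the same subject seen from view B
err = np.linalg.norm(probe_in_b - gallery_b)
```

The multi-view extension would regress from concatenated features of several source views rather than a single one.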

It is well recognized that gait is an important biometric feature to identify a person at a distance, e.g., in video surveillance applications. However, in reality, changes of viewing angle pose a significant challenge for gait recognition. A novel approa

In this paper, we propose a framework for human action analysis from video footage. A video action sequence, in our perspective, is a dynamic structure of sparse local spatial-temporal patches termed action elements, so the problems of action analysis in video are addressed here based on the set of local characteristics as well as the global shape of a prescribed action. We first detect a set of action elements, the most compact entities of an action; then we extend the idea of the Implicit Shape Model to space-time, in order to properly integrate the spatial and temporal properties of these action elements. In particular, we consider two different recipes for constructing action elements: one uses a Sparse Bayesian Feature Classifier to choose action elements from all detected Spatial Temporal Interest Points, and these are termed discriminative action elements. The other detects affine invariant local features from holistic Motion History Images and picks action elements according to their compactness scores; these are called generative action elements. Action elements detected in either way are then used to construct a voting space based on their local feature representations as well as their global configuration constraints. Our approach is evaluated in the two main contexts of current human action analysis challenges: action retrieval and action classification. Comprehensive experimental results show that our proposed framework marginally outperforms all existing state-of-the-art techniques on a range of different datasets.
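The space-time voting step can be pictured as Hough-style accumulation: each detected action element casts a weighted vote for the action centre via a stored spatial-temporal offset, and peaks in the accumulator localise the action. A toy NumPy sketch (the detections and offsets below are invented for illustration):

```python
import numpy as np

def hough_vote(detections, grid_shape):
    """Implicit-Shape-Model-style voting: each action element casts a vote
    for the action centre through its learned (y, x, t) offset."""
    acc = np.zeros(grid_shape)
    for pos, offset, weight in detections:
        centre = tuple(np.add(pos, offset))
        if all(0 <= c < s for c, s in zip(centre, grid_shape)):
            acc[centre] += weight
    return acc

detections = [((4, 4, 2), (1, 1, 0), 1.0),    # (y, x, t) position + offset
              ((6, 2, 2), (-1, 3, 0), 1.0),   # agrees on the same centre
              ((0, 0, 0), (2, 2, 1), 0.5)]    # an inconsistent outlier
acc = hough_vote(detections, (10, 10, 5))
peak = np.unravel_index(np.argmax(acc), acc.shape)  # (5, 5, 2)
```

Two consistent elements reinforce each other at (5, 5, 2), while the outlier's vote stays isolated, which is exactly why configuration constraints suppress spurious matches.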

In this letter, a new scheme for generating local binary patterns (LBP) is presented. This Modified Symmetric LBP (MS-LBP) feature combines the advantages of LBP and gradient features. It is then applied in a boosted cascade framework for human detection. By combining MS-LBP with Haar-like features in the boosted framework, the performance of detectors based on heterogeneous features is evaluated for the best trade-off between accuracy and speed. Two feature training schemes, namely the Single AdaBoost Training Scheme (SATS) and the Dual AdaBoost Training Scheme (DATS), are proposed and compared. On top of AdaBoost, two multidimensional feature projection methods are described. A comprehensive experiment is presented. Apart from obtaining higher detection accuracy, detection based on DATS is 17 times faster than the HOG method.
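For readers unfamiliar with the base descriptor, the standard 3x3 LBP (which MS-LBP modifies) thresholds each pixel's eight neighbours at the centre value and packs the results into an 8-bit code. A minimal NumPy sketch of that baseline, not of the MS-LBP variant itself:

```python
import numpy as np

def lbp_code(patch):
    """Standard 3x3 local binary pattern: threshold the 8 neighbours at the
    centre value and read them off as an 8-bit code (clockwise from top-left)."""
    c = patch[1, 1]
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if n >= c else 0 for n in neighbours]
    return sum(b << i for i, b in enumerate(bits))

def lbp_image(img):
    """Apply lbp_code at every interior pixel of a grayscale image."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for y in range(h - 2):
        for x in range(w - 2):
            out[y, x] = lbp_code(img[y:y + 3, x:x + 3])
    return out

img = np.array([[9., 9., 9.],
                [1., 5., 1.],
                [1., 1., 1.]])
code = lbp_code(img)   # top row brighter than centre -> bits 0,1,2 set -> 7
```

Histograms of such codes over detection windows are what the boosted cascade would consume as weak-learner inputs.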

Human action recognition is a promising yet non-trivial computer vision field with many potential applications. Current advances in bag-of-feature approaches have brought significant insights into recognizing human actions within complex contexts. It is, however, a common practice in the literature to consider an action as merely an orderless set of local salient features. This representation has been shown to be oversimplified, which inherently prevents traditional approaches from robust deployment in real-life scenarios. In this work, we propose and show that, by taking into account the global configuration of local features, we can greatly improve recognition performance. We first introduce a novel feature selection process called the Sparse Hierarchical Bayes Filter to select only the most contributive features of each action type based on neighboring structure constraints. We then present the application of structured learning in human action analysis. That is, by representing a human action as a structured set of local features, we can incorporate different spatial and temporal feature constraints into the learning tasks of human action classification and localization. In particular, we tackle the problem of action localization in video using structured learning with two alternatives: one is the Dynamic Conditional Random Field from a probabilistic perspective; the other is the Structural Support Vector Machine from a max-margin point of view. We evaluate our modular classification-localization framework on various testbeds, on which it proves to be highly effective and robust compared with bag-of-feature methods.

Gait has been known as an effective biometric feature to identify a person at a distance. However, variation of walking speed may lead to significant changes in human walking patterns, which causes many difficulties for gait recognition. A comprehensive analysis has been carried out in this paper to identify such effects. Based on the analysis, Procrustes shape analysis is adopted for gait signature description and the relevant similarity measurement. To tackle the challenges raised by speed change, this paper proposes a higher-order shape configuration for gait shape description, which deliberately conserves discriminative information in the gait signatures while still tolerating varying walking speeds. Instead of simply measuring the similarity between two gaits by treating them as two unified objects, a differential composition model (DCM) is constructed. The DCM differentiates the effects of walking speed changes on various human body parts, and at the same time balances the different discriminabilities of each body part in the overall gait similarity measurement. In this model, the Fisher discriminant ratio is adopted to calculate weights for each body part. Comprehensive experiments on widely adopted gait databases demonstrate that our proposed method is efficient for cross-speed gait recognition and outperforms other state-of-the-art methods.
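The Fisher-ratio weighting can be made concrete: for each body part, compare the distribution of same-subject distances against different-subject distances, and weight parts by how well they separate the two. A small sketch on synthetic per-part distances (the part names and distributions are invented for illustration):

```python
import numpy as np

def fisher_ratio(same_pair_dists, diff_pair_dists):
    """Fisher discriminant ratio of two 1-D score distributions:
    (difference of means)^2 / (sum of variances)."""
    m1, m2 = np.mean(same_pair_dists), np.mean(diff_pair_dists)
    v1, v2 = np.var(same_pair_dists), np.var(diff_pair_dists)
    return (m1 - m2) ** 2 / (v1 + v2)

rng = np.random.default_rng(2)
# Hypothetical per-part Procrustes distances: "legs" separate identities well,
# "torso" barely does, so legs should receive the larger weight.
parts = {
    "torso": (rng.normal(1.0, 0.3, 500), rng.normal(1.1, 0.3, 500)),
    "legs":  (rng.normal(1.0, 0.3, 500), rng.normal(2.0, 0.3, 500)),
}
ratios = {p: fisher_ratio(s, d) for p, (s, d) in parts.items()}
total = sum(ratios.values())
weights = {p: r / total for p, r in ratios.items()}  # normalised part weights
```

A weighted sum of per-part distances with these weights is then a simple stand-in for the DCM's overall similarity.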

Packing circles into a circular container with an equilibrium constraint is an NP-hard layout optimization problem with broad applications in engineering. This paper studies such a two-dimensional constrained packing problem. Classical differential evolution easily falls into local optima when solving this problem, so an adaptive chaotic differential evolution algorithm is proposed to improve performance. The weighting parameters are dynamically adjusted by chaotic mutation during the search procedure, and the penalty factors of the fitness function are modified during iteration. To keep the diversity of the population, we limit the population's concentration; to enhance the local search capability, we adopt adaptive mutation of the global optimal individual. The improved algorithm maintains the basic algorithm's structure while extending the search scale, and holds the diversity of the population while increasing the search accuracy. Furthermore, it can escape from premature convergence and speed up convergence. Numerical examples indicate the effectiveness and efficiency of the proposed algorithm.
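The core mechanism, differential evolution with its scale factor driven by a chaotic map, can be sketched on a toy unconstrained objective. This is a bare DE/rand/1/bin loop with a logistic-map-perturbed F, not the paper's full algorithm (no penalty-factor schedule, concentration limit, or elite mutation), and the sphere function stands in for the packing objective:

```python
import numpy as np

def chaotic_de(objective, bounds, pop_size=30, n_iter=200, seed=0):
    """DE/rand/1/bin with the scale factor F perturbed by a logistic chaotic map,
    echoing the chaotic adjustment of weighting parameters described above."""
    rng = np.random.default_rng(seed)
    dim = len(bounds)
    lo = np.array([b[0] for b in bounds], dtype=float)
    hi = np.array([b[1] for b in bounds], dtype=float)
    pop = lo + rng.random((pop_size, dim)) * (hi - lo)
    fit = np.array([objective(p) for p in pop])
    z = 0.7                                    # logistic-map state
    for _ in range(n_iter):
        z = 4.0 * z * (1.0 - z)                # chaotic update of the parameter
        f_scale = 0.4 + 0.5 * z                # F wanders chaotically in [0.4, 0.9]
        for i in range(pop_size):
            a, b, c = rng.choice(pop_size, size=3, replace=False)
            mutant = pop[a] + f_scale * (pop[b] - pop[c])
            cross = rng.random(dim) < 0.9      # binomial crossover, CR = 0.9
            trial = np.clip(np.where(cross, mutant, pop[i]), lo, hi)
            f_trial = objective(trial)
            if f_trial < fit[i]:               # greedy one-to-one selection
                pop[i], fit[i] = trial, f_trial
    best = np.argmin(fit)
    return pop[best], fit[best]

sphere = lambda x: float(np.sum(x ** 2))
best_x, best_f = chaotic_de(sphere, [(-5, 5)] * 4)  # best_f close to 0
```

In the packing setting, the objective would combine container radius with penalised overlap and equilibrium violations.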

Real-time object detection has many computer vision applications. Since Viola and Jones proposed the first real-time AdaBoost-based face detection system, much effort has been spent on improving the boosting method. In this work, we first show that feature selection methods other than boosting can also be used for training an efficient object detector. In particular, we introduce greedy sparse linear discriminant analysis (GSLDA) for its conceptual simplicity and computational efficiency, and it achieves slightly better detection performance. Moreover, we propose a new technique, termed boosted greedy sparse linear discriminant analysis (BGSLDA), to efficiently train a detection cascade. BGSLDA exploits the sample reweighting property of boosting and the class-separability criterion of GSLDA. Experiments on highly skewed data distributions (e.g., face detection) demonstrate that classifiers trained with the proposed BGSLDA outperform AdaBoost and its variants. This finding shows that AdaBoost and similar approaches are not the only methods that can achieve high detection performance for real-time object detection.
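The greedy flavour of GSLDA can be conveyed with a simplified stand-in: forward feature selection that at each step adds the feature maximising a diagonal Fisher class-separation criterion. This ignores feature correlations that the full GSLDA criterion accounts for, and the data are synthetic:

```python
import numpy as np

def greedy_fisher_select(X, y, n_features):
    """Greedy forward selection by a (diagonal) Fisher class-separation score:
    at each step add the feature whose inclusion maximises between-class over
    within-class scatter on the selected subset."""
    pos, neg = X[y == 1], X[y == 0]
    def score(idx):
        sb = (pos[:, idx].mean(axis=0) - neg[:, idx].mean(axis=0)) ** 2
        sw = pos[:, idx].var(axis=0) + neg[:, idx].var(axis=0) + 1e-12
        return float(np.sum(sb / sw))
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        best = max(remaining, key=lambda j: score(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(3)
n = 400
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 8))
X[:, 2] += 3.0 * y        # only features 2 and 5 actually carry class signal
X[:, 5] += 2.0 * y
sel = greedy_fisher_select(X, y, 2)   # picks feature 2 first, then 5
```

BGSLDA would wrap such a selection step inside boosting's sample-reweighting loop so that later features focus on currently misclassified examples.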

The ability to efficiently and accurately detect objects plays a very crucial role in many computer vision tasks. Recently, offline object detectors have shown tremendous success. However, one major drawback of offline techniques is that a complete set of training data has to be collected beforehand. In addition, once learned, an offline detector cannot make use of newly arriving data. To alleviate these drawbacks, online learning has been adopted with the following objectives: 1) the technique should be computationally and storage efficient; 2) the updated classifier must maintain its high classification accuracy. In this paper, we propose an effective and efficient framework for learning an adaptive online greedy sparse linear discriminant analysis model. Unlike many existing online boosting detectors, which usually apply an exponential or logistic loss, our online algorithm makes use of the linear discriminant analysis learning criterion, which not only aims to maximize class separation but also incorporates the asymmetric property of training data distributions. We provide a better alternative to online boosting algorithms in the context of training a visual object detector. We demonstrate the robustness and efficiency of our methods on handwritten digit and face data sets. Our results confirm that object detection tasks benefit significantly when trained in an online manner.
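Any such online discriminant model needs running per-class statistics that are updated one sample at a time. Welford's incremental mean-and-scatter algorithm is a standard building block of this kind; the class below is a generic sketch, not the paper's update rule:

```python
import numpy as np

class OnlineClassStats:
    """Incremental per-class mean and variance (Welford's algorithm): the kind
    of running statistics an online LDA-style detector must maintain as
    training samples arrive one by one, in O(dim) time and memory per update."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)          # running sum of squared deviations
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    @property
    def var(self):
        return self.m2 / max(self.n - 1, 1)

rng = np.random.default_rng(4)
data = rng.normal(loc=2.0, scale=0.5, size=(1000, 3))
stats = OnlineClassStats(3)
for x in data:
    stats.update(x)                      # no need to store earlier samples
```

After each update, the class means and variances can be re-plugged into the discriminant criterion without revisiting old data, which is what makes the online setting storage-efficient.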

Detecting pedestrians accurately is the first fundamental step for many computer vision applications such as video surveillance, smart vehicles, intersection traffic analysis and so on. The authors present an experimental study on pedestrian detection us

Efficiently and accurately detecting pedestrians plays a very important role in many computer vision applications such as video surveillance and smart cars. In order to find the right feature for this task, we first present a comprehensive experimental s

This paper provides a novel approach to detect unattended packages in public venues. Different from previous works on this topic which are mostly limited to detecting static objects where no human is nearby, we provide a solution which can detect an unat

Audio-visual and other multimedia services are seen as important sources of traffic for future telecommunication networks, including wireless networks. A major drawback with some wireless networks is that they introduce a significant number of transmissi

The MPEG-2 video coding standard is being extensively used worldwide for the provision of digital video services. Many of these applications involve the transport of MPEG-2 video over cell-based (or packet) networks. Examples include the broadband integr

With increasing interest in the transport of video traffic over lossy networks, several techniques for improving the quality of video services in the presence of loss have been proposed, often using the MPEG 2 video coding algorithm as a basis. Many of t

Audio-visual and other multimedia services are seen as an important source of traffic for future telecommunications networks, including wireless networks. In this paper, we examine the impact of the properties of a 50 Mb/s asynchronous transfer mode (ATM

1. Microsoft External Collaboration Project (pilot funded project): Advanced 3D Deformable Surface Reconstruction and Tracking through RGB-D Cameras. The aim of this project is to develop novel computer vision technology for real-time modelling and tracking of 3D dense and deformable surfaces using general RGB-D cameras. The expected outcomes will add significant value to the current RGB-D camera platform in the common scenario in which the RGB-D camera is static while the deformable objects of interest are moving.