About this book

The two-volume set LNCS 9314 and 9315 constitutes the proceedings of the 16th Pacific-Rim Conference on Multimedia, PCM 2015, held in Gwangju, South Korea, in September 2015.

The total of 138 full and 32 short papers presented in these proceedings was carefully reviewed and selected from 224 submissions. The papers were organized in topical sections named: image and audio processing; multimedia content analysis; multimedia applications and services; video coding and processing; multimedia representation learning; visual understanding and recognition on big data; coding and reconstruction of multimedia data with spatial-temporal information; 3D image/video processing and applications; video/image quality assessment and processing; social media computing; human action recognition in social robotics and video surveillance; recent advances in image/video processing; new media representation and transmission technologies for emerging UHD services.

Table of Contents

Frontmatter

Image and Audio Processing

Recent brain theories indicate that visually perceiving an image is an active inference procedure of the brain using the Internal Generative Mechanism (IGM). Inspired by this theory, an IGM-based Otsu multilevel thresholding algorithm for medical images is proposed in this paper, in which the Otsu thresholding technique is applied to both the original image and a predicted version obtained by simulating the IGM on the original image. A regrouping measure is designed to refine the segmentation result. The proposed method takes into account the predicted visual information generated by the complex Human Visual System (HVS), as well as the image details. Experiments on medical MR-T2 brain images are conducted to demonstrate the effectiveness of the proposed method. The experimental results indicate that the IGM-based Otsu multilevel thresholding is superior to other multilevel thresholding methods.
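
The Otsu step at the core of such multilevel schemes can be sketched in a few lines. The following is an illustrative single-level Otsu implementation in Python/NumPy (a generic textbook version, not the paper's IGM-based variant): it picks the threshold that maximizes the between-class variance of the gray-level histogram.

```python
import numpy as np

def otsu_threshold(image, bins=256):
    """Classical single-level Otsu: choose the threshold that maximizes
    the between-class variance of the gray-level histogram."""
    hist, _ = np.histogram(image, bins=bins, range=(0, bins))
    p = hist.astype(float) / hist.sum()       # gray-level probabilities
    omega = np.cumsum(p)                      # class-0 probability per cut
    mu = np.cumsum(p * np.arange(bins))       # cumulative mean
    mu_t = mu[-1]                             # global mean
    # Between-class variance for every candidate threshold.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b)
    return int(np.argmax(sigma_b))

# Two well-separated clusters of gray levels (a toy bimodal image).
img = np.concatenate([np.full(500, 50), np.full(500, 200)])
t = otsu_threshold(img)
```

A multilevel variant applies the same variance criterion over several thresholds jointly.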

Recent years have witnessed great progress in image deblurring. However, as an important application case, the deblurring of face images has not been well studied. Most existing face deblurring methods rely on exemplar set construction and candidate matching, which not only consume considerable computation time but are also vulnerable to complex or exaggerated face variations. To address these problems, we propose a novel face deblurring method by integrating the classical L0 deblurring approach with face landmark detection. A carefully tailored landmark detector is used to detect the main face contours. The detected contours are then used as salient edges to guide the blind image deconvolution. Extensive experimental results demonstrate that the proposed method can better handle various complex face poses while greatly reducing computation time, as compared with state-of-the-art approaches.

This paper proposes a deblurring algorithm using an IMU sensor and a long/short-exposure-time image pair. First, we derive an initial blur kernel from the gyro data of the IMU sensor. Second, we refine the blur kernel by applying the Lucas-Kanade algorithm to the long/short-exposure-time image pair. Using residual deconvolution based on the non-uniform blur kernel, we synthesize the final image. Experimental results show that the proposed algorithm is superior to state-of-the-art methods in terms of subjective and objective visual quality.

Object searching is the identification of an object in an image or video. There are several approaches to object detection, including template matching in computer vision. Template matching uses a small image, or template, to find matching regions in a larger image. In this paper, we propose a robust object searching method based on adaptive combination template matching. We apply a partition search to resize the target image properly. During this process, each template is efficiently matched to the sub-images using either the normalized sum of squared differences or zero-mean normalized cross-correlation, depending on the class of the object location (corresponding, neighbor, or previous location). Finally, the template image is updated appropriately by an adaptive template algorithm. Experimental results show that the proposed method outperforms existing approaches in object searching.
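
As background for the matching criteria named above, a minimal zero-mean normalized cross-correlation (ZNCC) matcher can be sketched as follows. This is a generic brute-force illustration of the similarity measure, not the paper's adaptive combination scheme.

```python
import numpy as np

def zncc(window, template):
    """Zero-mean normalized cross-correlation between two equal-size patches.
    Returns a score in [-1, 1]; 1 indicates a perfect match up to
    brightness offset and contrast scaling."""
    w = window - window.mean()
    t = template - template.mean()
    denom = np.sqrt((w ** 2).sum() * (t ** 2).sum())
    return float((w * t).sum() / denom) if denom > 0 else 0.0

def match(image, template):
    """Exhaustively slide the template and return the best top-left corner."""
    ih, iw = image.shape
    th, tw = template.shape
    best, best_pos = -2.0, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            s = zncc(image[y:y + th, x:x + tw], template)
            if s > best:
                best, best_pos = s, (y, x)
    return best_pos, best

rng = np.random.default_rng(0)
image = rng.random((20, 20))
template = image[5:9, 7:11].copy()   # template cut from a known position
pos, score = match(image, template)
```

In practice the search is restricted to a neighborhood of the previous location, which is exactly the kind of location class the abstract distinguishes.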

Wisarut Chantara, Yo-Sung Ho

Multimedia Content Analysis

Greedy subspace clustering methods provide an efficient way to cluster large-scale multimedia datasets. However, these methods do not guarantee a global optimum, and their clustering performance mainly depends on their initializations. To alleviate this initialization problem, this paper proposes a two-step greedy strategy that explores proper neighbors spanning an initial subspace. Firstly, for each data point, we seek a sparse representation with respect to its nearest neighbors. The data points corresponding to nonzero entries in the learned representation form an initial subspace, which potentially rejects bad or redundant data points. Secondly, the subspace is updated by adding an orthogonal basis involving the newly added data points. Experimental results on real-world applications demonstrate that our method can significantly improve the clustering accuracy of greedy subspace clustering methods without sacrificing much computational time.
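
The sparse representation over nearest neighbors used in the first step can be illustrated with a generic l1 solver. Below is a minimal ISTA (iterative soft-thresholding) sketch in NumPy; the dictionary, data, and regularization weight are toy values, not the paper's setup.

```python
import numpy as np

def ista_lasso(D, y, lam=0.1, iters=2000):
    """Sparse coding by ISTA: min_c 0.5*||y - Dc||^2 + lam*||c||_1.
    The soft-thresholding step drives irrelevant coefficients to zero."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(iters):
        g = c - (D.T @ (D @ c - y)) / L    # gradient step on the smooth part
        c = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrinkage
    return c

# A point expressed over three candidate neighbors (columns): only the
# truly relevant neighbor should receive a nonzero coefficient.
D = np.array([[1.0, 0.0, 0.7],
              [0.0, 1.0, 0.7]])
y = np.array([1.0, 0.0])                   # the point equals neighbor 0
c = ista_lasso(D, y)
```

The nonzero pattern of `c` then selects the neighbors spanning the initial subspace.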

Sketch recognition is an important issue in human-computer interaction, especially in sketch-based interfaces. To provide a scalable and flexible tool for user-driven sketch recognition, this paper proposes an iterative sketch collection annotation method for classifier training that interleaves online metric learning, semi-supervised clustering and user intervention. It discovers the categories of the collections iteratively by combining online metric learning with semi-supervised clustering, and puts user intervention into the loop of each iteration. The features of our method lie in three aspects. Firstly, the unlabeled collections are annotated with less effort, group by group. Secondly, users can annotate the collections flexibly and freely to define the sketch recognition personally for different applications. Finally, a scalable collection can be annotated efficiently by combining dynamic processing and online learning. The experimental results prove the effectiveness of our method.

Categories of images are often arranged in a hierarchical structure based on their semantic meanings. Many existing approaches demonstrate that the hierarchical category structure can bolster the learning process for classification, but most of them are designed around a flat category structure and hence may not be appropriate for dealing with complex category structures and large numbers of categories. In this paper, given the hierarchical category structure, we propose to jointly learn a shared discriminative dictionary and corresponding level classifiers for visual categorization by making use of the relationship between the edges and the relationship between the layers. Specifically, we use the graph-guided fused-lasso penalty to embed the relationship between edges into the dictionary learning process. In addition, our approach not only learns the classifier for the basic-class level, but also learns the classifier for the super-class level, embedding the relationship between levels into the learning process. Experimental results on the Caltech256 dataset and its subset show that the proposed approach yields promising performance improvements over some state-of-the-art methods.

Person re-identification is a challenging problem due to large visual appearance changes caused by variations in viewpoint, lighting, background clutter and occlusion across different cameras. Recently, Mahalanobis metric learning methods, which aim to find a global, linear transformation of the feature space between cameras [1-4], have been widely used in person re-identification. In order to maximize the inter-class variation, general Mahalanobis metric learning methods usually push impostors (i.e., all negative samples that are nearer than the target neighbors) to a fixed threshold distance away, treating all impostors equally without considering their diversity. However, for person re-identification, the discrepancies among impostors are useful for refining the ranking list. Motivated by this observation, we propose an Adaptive Margin Nearest Neighbor (AMNN) method for person re-identification. AMNN treats each sample's impostors unequally by pushing them away to adaptive, variable margins. Extensive comparative experiments conducted on two standard datasets have confirmed the superiority of the proposed method.
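
For reference, the Mahalanobis distance underlying such metric learning methods is simply a quadratic form with a learned positive semi-definite matrix M; a minimal sketch:

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance d_M(x, y)^2 = (x - y)^T M (x - y),
    where M is a learned positive semi-definite matrix."""
    d = x - y
    return float(d @ M @ d)

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

# With M = I this reduces to the squared Euclidean distance; a learned
# M (e.g. M = L^T L for any real matrix L) reweights directions of the
# feature space, which is what the metric learning stage optimizes.
M = np.eye(2)
d2 = mahalanobis_sq(x, y, M)
```

The adaptive-margin idea then varies, per sample, how far impostors must be pushed under this distance.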

Camera motions seriously affect the accuracy of action recognition. Traditional methods address this issue by estimating and compensating camera motion based on optical flow in the pixel domain, but the high computational complexity of optical flow prevents these methods from being applied in real-time scenarios. In this paper, we propose an efficient camera motion estimation and compensation method for real-time action recognition that exploits motion vectors in the video compressed domain (a.k.a. compressed-domain global motion estimation, CGME). Taking advantage of the geometric symmetry and differential properties of motion vectors, we estimate the parameters of the camera affine transformation. These parameters are then used to compensate the initial motion vectors so as to retain the crucial object motions. Finally, we extract video features for action recognition based on the compensated motion vectors. Experimental results show that our method improves the speed of camera motion estimation by over 100 times with a minor reduction of about 4% in recognition accuracy compared with iDT.
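
An affine camera-motion model of the kind mentioned above can be fitted to block motion vectors by ordinary least squares. The following is a generic sketch with synthetic translation-only data, not the paper's CGME estimator.

```python
import numpy as np

def fit_affine(points, vectors):
    """Fit a 6-parameter affine motion model v = A p + t to block motion
    vectors, in the least-squares sense."""
    n = len(points)
    X = np.zeros((2 * n, 6))
    b = np.asarray(vectors, float).reshape(-1)
    for i, (px, py) in enumerate(points):
        X[2 * i]     = [px, py, 1, 0, 0, 0]   # vx = a11*px + a12*py + tx
        X[2 * i + 1] = [0, 0, 0, px, py, 1]   # vy = a21*px + a22*py + ty
    params, *_ = np.linalg.lstsq(X, b, rcond=None)
    return params  # [a11, a12, tx, a21, a22, ty]

# Synthetic global motion: pure translation (3, -2) at every block center.
pts = [(0, 0), (8, 0), (0, 8), (8, 8), (4, 4)]
mvs = [(3, -2)] * 5
p = fit_affine(pts, mvs)
```

Compensation then subtracts the model prediction from each raw motion vector, leaving the object motion.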

Huafeng Chen, Jun Chen, Hongyang Li, Zengmin Xu, Ruimin Hu

Image and Audio Processing

In this paper, we present a unified understanding of formal performance evaluation for image manipulation forensics techniques. Under a hypothesis testing model, security is quantified as the difficulty of defeating an existing forensics system by making it commit one of two types of forensic errors, i.e., missed detection and false alarm. We point out that security against false alarm risk, which is rarely addressed in the current literature, is equally significant for evaluating the performance of manipulation forensics techniques. Through a case study on a resampling-based composition forensics detector, both qualitative analyses and experimental results verify the correctness and rationality of our understanding of manipulation forensics security.

This paper presents a new label pruning method based on sparse representation for image inpainting. Here, a label is a small rectangular patch used to fill the missing regions. Global optimization-based image inpainting incurs heavy computational cost due to the large number of labels, so it is necessary to effectively prune redundant labels; at the same time, inappropriate label pruning can degrade the inpainting quality. In this paper, we adopt a sparse representation of labels to obtain a few reliable labels, and use it to prune the redundant ones. Sparsely represented labels, as well as non-zero sparse labels with high similarity to the target region, are used as reliable labels in global optimization-based image inpainting. Experimental results show that the proposed method achieves computational efficiency and structural consistency.

In this paper, we propose a novel interactive image segmentation method for RGB-D images using hierarchical Graph Cut. Considering the characteristics of RGB channels and depth channel in RGB-D image, we utilize Euclidean distance on RGB space and geodesic distance on 3D space to measure how likely a pixel belongs to foreground or background in color and depth respectively, and integrate the color cue and depth cue into a unified Graph Cut framework to obtain the optimal segmentation result. Moreover, to overcome the low efficiency problem of Graph Cut in handling high resolution images, we accelerate the proposed method with hierarchical strategy. The experimental results show that our method outperforms the state-of-the-art methods with high efficiency.

We present a novel approach to face alignment based on a two-layer shape regression framework. Traditional regression-based methods [4, 6, 7] regress all landmarks in a single shape without considering the differences between landmarks in biological properties and texture, which can lead to suboptimal predictions. Unlike previous regression-based approaches, we do not regress all landmarks in a holistic manner without any discrimination. We categorize the geometric constraints into two types, inter-component constraints and intra-component constraints. Corresponding to these two shape constraints, we design a two-layer shape regression framework that can be integrated with regression-based methods. We define "key points" of components to describe inter-component constraints and then determine the sub-shapes. We verify our two-layer shape regression framework on two widely used face alignment datasets (LFPW [10] and Helen [11]), and the experimental results demonstrate its improvements in accuracy.

When a conventional first-order Ambisonics system uses four loudspeakers in a platonic-solid layout to reconstruct a sound field, the 3D acoustic field effect is limited. A new signal distribution method is proposed to enhance the reproduced field without increasing the number of loudspeakers. First, a platonic solid is extended to obtain new vertexes; based on the traditional Ambisonics signal distribution method, the original field signal is distributed to loudspeakers at both the original and the new vertexes of the platonic solid. Second, the signals of the loudspeakers at the new vertexes are redistributed to the loudspeakers at the original vertexes by a new 3D panning method; the loudspeakers at the new vertexes are then removed, leaving only those at the original vertexes. The proposed method can improve the quality of the reconstructed sound field without increasing the complexity of the loudspeaker layout in practice. The results are verified through objective and subjective experiments.

Multimedia Applications and Services

Nowadays, dual-camera systems, which consist of a static camera and a pan-tilt-zoom (PTZ) camera, have become popular in video surveillance, since they can offer wide-area coverage and highly detailed images of the target of interest simultaneously. Unlike most previous multi-target tracking methods, which do not fuse information, we propose a multi-target tracking framework based on information fusion from the heterogeneous cameras. Specifically, a conservative online multi-target tracking method is introduced to generate reliable tracklets in both cameras in real time. A max-entropy target selection strategy is proposed to determine which target should be observed by the PTZ camera at a higher resolution to reduce the ambiguity of multi-target tracking. Finally, the information from the static camera and the PTZ camera is fused into a tracking-by-detection framework for more robust multi-target tracking. The proposed method is tested in an outdoor scene, and the experimental results show that our method significantly improves multi-target tracking performance.

Tracking an object over the long term is still a great challenge in computer vision. Appearance modeling is one of the keys to building a good tracker. Much research attention focuses on building an appearance model by employing special features and learning methods, especially online learning. However, one model is not enough to describe all the historical appearances of the tracking target during a long-term tracking task, because of viewpoint changes, illumination variation, camera switching, etc. We propose the Adaptive Multiple Appearance Model (AMAM) framework, which maintains not one model but a set of appearance models to solve this problem. Different appearance representations of the tracking target are grouped in an unsupervised manner and modeled automatically by a Dirichlet Process Mixture Model (DPMM). The tracking result is then selected from the candidate targets predicted by trackers based on those appearance models, via voting and a confidence map. Experimental results on multiple public datasets demonstrate better performance compared with state-of-the-art methods.

As intelligent devices and human-computer interaction modalities become more diverse, in-air writing is becoming popular as a very natural mode of interaction. Compared with online handwritten Chinese character recognition (OHCCR) based on a touch screen or writing board, research on in-air handwritten Chinese character recognition (IAHCCR) is still at an early stage. In this paper, we present an online sample generation method that enlarges the number of training instances through automatic synthesis. In our system, the in-air writing trajectory of the fingertip is first captured by a Leap Motion Controller. Then corner points are detected. Finally, the corner points, as well as the sampling points between them, are distorted to generate artificial patterns. Compared with previous sample generation methods, the proposed method focuses on distorting the inner structure of character patterns. We evaluate the proposed method on our in-air handwritten Chinese character dataset IAHCC-UCAS2014, which covers 3755 classes of Chinese characters. The experimental results demonstrate that the proposed approach achieves higher recognition accuracy at lower computational cost.

This article proposes a progressive image segmentation method that allows users to segment images according to their preferences without any tedious pre-labeling or training stages. We use an online learning method to train and update the segmentation model progressively. Users can scribble on the image to label initial samples or to correct falsely labeled regions of the result. To efficiently integrate the interaction with the learning and updating process, a three-level representation of images is built. The proposed method has three advantages. Firstly, the segmentation model can be learned online along with the user's manipulation, without any pre-labeling. Secondly, segmentations of varying granularity can be produced flexibly according to the user's preferences, and the more the system is used, the more accurate the segmentation becomes. Finally, the segmentation model can be updated online to meet the needs of users. The experimental results demonstrate these advantages.

Many communication models for communication arts and numerous interactive multimedia applications for computer science have been discussed over many decades. However, there has been little work giving an overview of recent integrated research on digital media and emerging trends, such as interactive multimedia experiences, from an interdisciplinary perspective. In this paper, we review and study recent interactive digital multimedia applications that use and apply these emerging trends. We provide a short blueprint for interactive digital multimedia research applying virtual reality, image processing, computer vision, real-time augmented reality, and interactive media to the senses of hearing and vision in virtual environments. An SMCR (Source-Message-Channel-Receiver) model for communicating via all human senses is also explained and linked to some recently presented interactive digital multimedia applications. After that, the senses of hearing and vision are discussed through related technologies. This overview will be of value to new researchers in this integrated emerging field of interactive digital multimedia.

Chutisant Kerdvibulvech

Video Coding and Processing

3D position tracking of the ball plays a crucial role in professional volleyball analysis. In volleyball games, the conditions that limit ball tracking performance include the fast, irregular movement of the ball, its small size, the complex background, and occlusions caused by players. This paper proposes a ball-size-adaptive (BSA) tracking window, a ball feature likelihood model, and an anti-occlusion likelihood measurement (AOLM) based on a particle filter to improve accuracy. By adaptively changing the tracking window according to the ball size, the ball can be tracked as its size changes across video frames. In addition, the ball feature likelihood enables stable tracking even against complex backgrounds. Furthermore, AOLM, based on a multiple-camera system, solves the occlusion problem, since it can eliminate the low likelihoods caused by occlusion. Experimental results on HDTV video sequences (2014 Inter High School Games of Men's Volleyball) captured by four cameras located at the corners of the court show that the success rate of the ball's 3D position tracking reaches 93.39%.

There are often many periodic motions in the background of surveillance videos, such as countdown traffic lights and LED billboards. The conventional motion-compensation scheme and the existing frame-based single-background-reference scheme cannot eliminate this kind of redundancy efficiently, especially when the cycle time exceeds the maximum GOP size. In this paper, we propose a block-based global and multiple-reference scheme to solve this problem. Firstly, the background is modeled on the basis of co-located blocks rather than frames, which makes adaptive block-level background updating possible. Secondly, multiple background blocks can be kept for one block location, which makes the scheme suitable for modeling periodic backgrounds. Thirdly, the scheme enables global reference, which further eliminates the extensive redundancies among GOPs in surveillance videos. Experimental results show that the proposed scheme achieves better rate-distortion performance than the existing frame-based single-background-reference scheme in most cases.

Scene surveillance video is video captured by a stationary camera over a long time in a specific surveillance scene. Due to the regular movement of vehicles with similar structures, models, and appearances, surveillance video contains a large amount of redundancy and needs to be efficiently coded for transmission and storage. In this study, we investigate the redundancy generation mechanism of scene surveillance video and identify a new redundancy type, Global Object Redundancy (GOR); it is shown that vehicles account for the largest proportion of this redundancy, caused by the large amount of vehicle movement. Secondly, aiming at global vehicle-object representation and GOR elimination, a global object representation scheme for scene surveillance video based on model and feature parameters is introduced: by establishing a global knowledge dictionary and feature parameter sets, low-bitrate, high-quality compression can be achieved, since only a few individual semantic and feature parameters of vehicle objects need to be transferred and coded. Finally, preliminary experiments in a simulation environment show that the object representation scheme can effectively improve the compression of long-term archival surveillance video while maintaining image quality.

Foreground detection is a fundamental task in video processing. Recently, many foreground detection methods based on background subspace estimation have been proposed. In this paper, a sparse error compensation based incremental principal component analysis method, which robustly updates the background subspace and estimates the foreground, is proposed for foreground detection. Our method has two notable features. First, a sparse error compensation process via a probability sampling procedure is designed for subspace updating, which reduces the interference of undesirable foreground signals. Second, the proposed foreground detection method can operate without an initial background subspace estimate, which enlarges its application scope. Extensive experiments on multiple real video sequences show the superiority of our method.

Ming Qin, Yao Lu, Huijun Di, Tianfei Zhou

Multimedia Representation Learning

The features extracted from convolutional neural networks (CNNs) are able to capture the discriminative parts of an image and have shown superior performance in visual recognition. Furthermore, it has been verified that CNN activations trained on large and diverse datasets can act as generic features and be transferred to other visual recognition tasks. In this paper, we aim to learn more from an image and present an effective method called Principal Pyramidal Convolution (PPC). The scheme first partitions the image into two levels, extracts CNN activations for each sub-region along with the whole image, and then aggregates them together. The concatenated feature is then reduced to the standard dimension using the Principal Component Analysis (PCA) algorithm, generating the refined CNN feature. When applied to image classification and retrieval tasks, the PPC feature consistently outperforms the conventional CNN feature, regardless of the network from which it derives. Specifically, PPC achieves state-of-the-art results on the MIT Indoor67 dataset, utilizing the activations from Places-CNN.
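
The PCA reduction step of such a pipeline is straightforward to sketch in NumPy; the feature sizes below are toy values standing in for concatenated CNN activations.

```python
import numpy as np

def pca_reduce(features, dim):
    """Project row-vector features onto their top `dim` principal components."""
    mu = features.mean(axis=0)
    X = features - mu
    # SVD of the centered data; rows of Vt are the principal directions,
    # ordered by decreasing explained variance.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:dim].T

rng = np.random.default_rng(1)
# Stand-in for whole-image + sub-region activations concatenated per image:
# 100 images, five 64-dim activation vectors each -> (100, 320).
concat = np.hstack([rng.random((100, 64)) for _ in range(5)])
reduced = pca_reduce(concat, 64)   # back to the "standard" dimension
```

In the real pipeline, the per-dimension sizes come from the CNN layer widths rather than these toy numbers.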

In this paper, we propose a novel gaze shifting kernel for scene image categorization, focusing on discovering the mechanism by which humans perceive visually/semantically salient regions in a scene. First, a weakly supervised embedding algorithm projects the local image descriptors (i.e., graphlets) into a pre-specified semantic space. Afterward, each graphlet can be represented by multiple visual features at both low level and high level. As humans typically attend to a small fraction of regions in a scene, a sparsity-constrained graphlet ranking algorithm is proposed to dynamically integrate both the low-level and the high-level visual cues. The top-ranked graphlets are either visually or semantically salient according to human perception. They are linked into a path to simulate human gaze shifting. Finally, we calculate the gaze shifting kernel (GSK) based on the discovered paths from a set of images. Experiments on the USC scene and the ZJU aerial image data sets demonstrate the competitiveness of our GSK, as well as the high consistency of the predicted paths with real human gaze shifting paths.

In this paper, we propose a two-phase representation based classification method called the two-phase linear reconstruction measure based classification (TPLRMC). It is inspired by the fact that the linear reconstruction measure (LRM) gauges the similarities among feature samples by decomposing each feature sample as a linear combination of the other feature samples with Lp-norm regularization. Since the linear reconstruction coefficients can fully reveal the feature's neighborhood structure hidden in the data, the similarity measures between the training samples and the query sample are well suited for classifier design. TPLRMC first coarsely seeks the K nearest neighbors of the query sample with LRM, and then finely represents the query sample as a linear combination of the determined K nearest neighbors and uses LRM to perform classification. Experimental results on face databases show that TPLRMC can significantly improve classification performance.
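
The two-phase idea can be illustrated with a toy sketch that uses an l2-regularized linear reconstruction in both phases; note that the paper's Lp-regularized LRM is replaced here by plain ridge regression for simplicity, and the data are synthetic.

```python
import numpy as np

def lrm_coefficients(query, dictionary, lam=0.01):
    """Ridge-regularized linear reconstruction: represent the query as a
    regularized linear combination of the dictionary columns and return
    the coefficient vector (a simplified stand-in for the Lp-LRM)."""
    D = dictionary                       # (d, n), columns are samples
    A = D.T @ D + lam * np.eye(D.shape[1])
    return np.linalg.solve(A, D.T @ query)

def two_phase_classify(query, X, labels, k=3):
    """Phase 1: keep the k training samples with the largest |coefficients|.
    Phase 2: re-represent the query over those k samples and assign the
    class with the smallest reconstruction residual."""
    c1 = lrm_coefficients(query, X)
    keep = np.argsort(np.abs(c1))[-k:]
    c2 = lrm_coefficients(query, X[:, keep])
    best, best_res = None, np.inf
    for cls in np.unique(labels[keep]):
        mask = labels[keep] == cls
        res = np.linalg.norm(query - X[:, keep][:, mask] @ c2[mask])
        if res < best_res:
            best, best_res = cls, res
    return best

# Two tight clusters in 2-D acting as two classes.
X = np.array([[0.0, 0.1, 0.2, 5.0, 5.1, 5.2],
              [0.0, 0.1, 0.0, 5.0, 5.0, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
pred = two_phase_classify(np.array([5.05, 5.02]), X, y, k=3)
```

A query near the second cluster selects class-1 neighbors in phase 1 and is reconstructed almost exactly by them in phase 2.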

Recently, deep architectures, such as stacked auto-encoders (SAEs), have been used to learn features from unlabeled data. However, it is difficult to obtain multi-level visual information from traditional deep architectures such as SAEs. In this paper, a feature representation method that concatenates Multiple Different Stacked Auto-Encoders (MDSAEs) is presented. The proposed method tries to imitate the human visual cortex by recognizing objects from different views. The output of the last hidden layer of each SAE can be regarded as one kind of feature. Several kinds of features are concatenated together to form a final representation according to their weights (the outputs of deeper architectures are assigned higher weights, and vice versa). In this way, the hierarchical structure of the human brain cortex can be simulated. Experimental results on the MNIST and CIFAR-10 classification datasets demonstrate superior performance.

Histogram of Oriented Gradients (HOG) features have laid a solid foundation for object detection in recent years thanks to both their accuracy and speed. However, the expressivity of HOG is limited because the simple gradient features may ignore important local information about objects, and HOG is data-independent. In this paper, we propose to replace HOG with a parts-based representation, Histogram of Local Parts (HLP), for object detection under the sliding window framework. HLP can capture richer and larger local patterns of objects and is more expressive than HOG. Specifically, we adopt Sparse Nonnegative Matrix Factorization to learn an over-complete parts-based dictionary from data. We then obtain the HLP representation of a local patch by aggregating the Local Parts coefficients of the pixels in that patch. Like DPM, we can train a supervised model with HLP given the latent positions of the roots and parts of objects. Extensive experiments on the INRIA and PASCAL datasets verify the superiority of HLP over state-of-the-art HOG-based methods for object detection, showing that HLP is more effective than HOG.
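
For reference, the orientation histogram at the heart of HOG can be computed per cell as follows; this is a textbook-style sketch without the block normalization and bin interpolation details of a full HOG implementation.

```python
import numpy as np

def hog_cell(patch, n_bins=9):
    """Orientation histogram of one cell: gradient magnitudes are
    accumulated into unsigned-orientation bins over [0, 180) degrees,
    then L2-normalized."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist = np.zeros(n_bins)
    bin_idx = (ang / (180.0 / n_bins)).astype(int) % n_bins
    for b, m in zip(bin_idx.ravel(), mag.ravel()):
        hist[b] += m
    return hist / (np.linalg.norm(hist) + 1e-6)

# A vertical step edge: all gradient energy is horizontal, so the
# histogram mass concentrates in the 0-degree bin.
patch = np.zeros((8, 8))
patch[:, 4:] = 1.0
h = hog_cell(patch)
```

HLP replaces these hand-designed gradient bins with learned parts coefficients, which is the point of the abstract's comparison.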

Chenjie Huang, Zheng Qin, Kaiping Xu, Guolong Wang, Tao Xu

Regular Poster Session

Due to under-sparsity or over-sparsity, the widely used regularization methods, such as ridge regression and sparse representation, lead to poor hallucination performance in the presence of noise. In addition, the regularized penalty function fails to consider the locality constraint between the observed image and the training images, thus reducing the accuracy and stability of the optimal solution. This paper proposes a locally weighted sparse regularization method that incorporates distance-inducing weights into the penalty function. This method accounts for the heteroskedasticity of representation coefficients and can be theoretically justified from a Bayesian inference perspective. Further, in view of the reduced sparseness of noisy images, a moderately sparse regularization method with a mixture of l1 and l2 norms is introduced to deal with noise-robust face hallucination. Various experimental results on public face databases validate the effectiveness of the proposed method.

In this paper, we present a robust object tracking method that fuses multiple correlation filters into a weighted sum of their classifier vectors. Unlike learning methods that use a sparse sampling mechanism to generate training samples, our method adopts a dense sampling strategy for both training and testing, which is effective yet efficient due to the highly structured kernel matrix. A correlation filter pool is built up from the correlation filters trained on historical frames as tracking proceeds. We take the weighted sum of these correlation filters as the final classifier to locate the object. We introduce a coefficient optimization scheme that balances the test errors over all correlation filters while emphasizing recent frames. A budget mechanism, which removes the filter whose removal causes the smallest change to the final correlation filter, is also introduced to prevent unlimited growth of the filter pool. Experiments comparing our method with three other state-of-the-art algorithms demonstrate the robust and encouraging performance of the proposed algorithm.
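
Training a single correlation filter in the Fourier domain, the building block such trackers combine, can be sketched MOSSE-style; the regularizer `lam` and the delta-function target below are illustrative choices, not the paper's configuration.

```python
import numpy as np

def train_filter(samples, target, lam=0.01):
    """Train one correlation filter in the Fourier domain (MOSSE-style):
    H = sum(G * conj(F_i)) / (sum(F_i * conj(F_i)) + lam)."""
    G = np.fft.fft2(target)
    num = np.zeros_like(G)
    den = np.full_like(G, lam)
    for s in samples:
        F = np.fft.fft2(s)
        num += G * np.conj(F)
        den += F * np.conj(F)
    return num / den

def respond(H, frame):
    """Correlation response map; its peak gives the object location."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(frame)))

# Desired response: a peak at (4, 4); train on the sample itself.
rng = np.random.default_rng(2)
sample = rng.random((16, 16))
target = np.zeros((16, 16))
target[4, 4] = 1.0
H = train_filter([sample], target)
resp = respond(H, sample)
peak = np.unravel_index(resp.argmax(), resp.shape)
```

The abstract's fusion step would then take a weighted sum over a pool of such filters.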

This paper considers a unified tone mapping operation (TMO) for HDR images. It covers not only floating-point data but also long-integer (i.e., longer than 8-bit) data as HDR image representations. A TMO generates a low dynamic range (LDR) image from a high dynamic range (HDR) image by compressing its dynamic range. A unified TMO can perform tone mapping for various HDR image formats with a single common TMO. An integer TMO that performs unified tone mapping by converting an input HDR image into an intermediate format was previously proposed; it can be executed efficiently with low memory on a low-performance processor. However, only floating-point HDR image formats have been considered in the unified TMO; long-integer formats, which are also HDR image formats, have not. This paper extends the unified TMO to long-integer formats, so that the unified TMO can be realized for all possible HDR image formats. The proposed method converts a long-integer number into a floating-point number and treats it as two 8-bit integer numbers corresponding to its exponent part and mantissa part. Tone mapping is applied to these two integer numbers separately. The experimental results show that the proposed method is effective for integer formats in terms of resources such as computational cost and memory cost.
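
The exponent/mantissa split described above can be illustrated with NumPy's `frexp`; the 8-bit quantization and the exponent bias below are illustrative choices, not the paper's exact intermediate format.

```python
import numpy as np

def split_hdr(v, bias=128):
    """Decompose positive values into an 8-bit exponent part and an 8-bit
    mantissa part: v = m * 2**e with m in [0.5, 1), quantized to bytes."""
    m, e = np.frexp(v)                         # v = m * 2**e
    mantissa = np.round(m * 255).astype(np.uint8)
    exponent = (e + bias).astype(np.uint8)     # biased so e >= -128 fits
    return exponent, mantissa

def merge_hdr(exponent, mantissa, bias=128):
    """Reassemble the approximate value from the two byte planes."""
    return (mantissa / 255.0) * np.exp2(exponent.astype(int) - bias)

v = np.array([0.75, 12.5, 3000.0])             # toy HDR pixel values
e, m = split_hdr(v)
back = merge_hdr(e, m)                         # close to v, small quant error
```

Each byte plane can then be tone mapped with ordinary 8-bit machinery, which is the efficiency argument the abstract makes.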

Human action recognition is an important research topic with many potential applications, such as video surveillance, human-computer interaction and virtual-reality combat training. However, much of the research on human action recognition has been performed with single-camera systems, whose performance suffers from vulnerability to partial occlusion. In this paper, we propose a human action recognition system using multiple Kinect sensors to overcome this limitation of conventional single-camera systems. To test the feasibility of the proposed system, we use snapshot and temporal features extracted from three-dimensional (3D) skeleton data sequences and apply a support vector machine (SVM) to classify human actions. The experimental results demonstrate the feasibility of the proposed system.
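
A minimal sketch of the snapshot/temporal feature idea on a skeleton sequence; the specific descriptors here (per-joint mean position and mean frame-to-frame motion) are our own illustrative stand-ins, not the paper's exact features:

```python
import numpy as np

def skeleton_features(seq):
    """seq: (T, J, 3) array of 3D joint positions over T frames.

    Snapshot features: per-joint mean position (pose summary).
    Temporal features: per-joint mean absolute frame-to-frame motion.
    The concatenation would then be fed to an SVM classifier."""
    snapshot = seq.mean(axis=0).ravel()                       # (J*3,)
    motion = np.abs(np.diff(seq, axis=0)).mean(axis=0).ravel()  # (J*3,)
    return np.concatenate([snapshot, motion])
```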

Home sound environments are becoming increasingly important to the entertainment and audio industries. Compared with single-zone soundfield reproduction, 3D spatial multizone soundfield reproduction with few loudspeakers is a more complex and challenging problem. In this paper, we introduce a simplification method based on least-squares sound pressure matching with which two separated zones can be reproduced accurately. For the NHK 22.2 system, fourteen loudspeaker arrangements ranging from 22 down to 8 channels are derived. Simulation results demonstrate favorable performance for two-zone soundfield reproduction; subjective evaluation shows that the soundfield around two listeners' heads can be reproduced very well down to 10 channels, and that even 8-channel systems keep distortions at the ears low. Compared with Ando's multichannel conversion method in subjective evaluation, our proposed method is very close to Ando's in terms of sound localization in the center zone; moreover, sound localization is improved significantly in the zone away from the center.
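
The least-squares pressure-matching core can be sketched as follows; the regularized normal-equation form is a standard formulation and the function name is our own:

```python
import numpy as np

def pressure_matching_weights(G, p_des, reg=1e-3):
    """Regularized least-squares loudspeaker weights:
    minimize ||G w - p_des||^2 + reg * ||w||^2, where G is the
    (control points x loudspeakers) acoustic transfer matrix and
    p_des holds the desired sound pressures in the listening zones."""
    L = G.shape[1]
    A = G.conj().T @ G + reg * np.eye(L)
    b = G.conj().T @ p_des
    return np.linalg.solve(A, b)
```

Dropping loudspeaker channels simply removes columns of G, which is how reduced arrangements (22 down to 8 channels) can be compared under the same criterion.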

To improve the spatial precision of three-dimensional (3D) audio, the bit rates of spatial parameters increase sharply. This paper presents a compression approach that decreases the bit rates of spatial parameters for 3D audio. Based on spatial direction filtering and spatial side-information clustering, a new multi-channel object-based spatial parameter compression approach (MOSPCA) is presented, through which the spatial parameters of different intra-frame frequency bands belonging to the same sound source can be compressed into one spatial parameter. Experiments show that the compression ratio of spatial parameters can reach 7:1, compared with the 1.4:1 of MPEG Surround and S

With the development of digital image processing technology, computer vision has been widely applied in many areas. Active vision is one of the main research fields of computer vision and can be used in different scenes, such as airports and ball games. FPGAs (Field Programmable Gate Arrays) are widely used in computer vision for their high speed and their ability to process large amounts of data. In this paper, a novel FPGA-based high-speed binocular active vision system for tracking a circle-shaped target is introduced. Specifically, our active vision system comprises three parts: target tracking, coordinate transformation, and pan-tilt control. The system can handle 1000 successive frames per second, tracking the target and keeping it at the center of the image for attention.

With the rapid growth of video data, it is difficult for people to quickly find a video they would like to watch. Existing video summarization methods can help viewers, but they mainly condense content from the start to the end of the whole video. Viewers are rarely interested in scanning such summary videos; they would rather see the interesting or exciting content in a shorter time. In this paper, we propose a video summarization approach that extracts powerful and attractive content based on deep learning features and implement it with a One-Class SVM (OCSVM). Extensive experiments demonstrate that our approach extracts powerful and attractive content effectively and performs well at generating attractive summary videos; at the same time, we provide a benchmark for powerful content extraction.

A new methodology for blur detection with multi-method fusion is presented in this paper. The research is motivated by the observation that no single method gives the best performance in all situations. We try to discover the underlying complementary performance patterns of several state-of-the-art methods, and then use the pattern specific to each image to obtain a better overall result. Specifically, a Conditional Random Field (CRF) framework is adopted for multi-method blur detection that models not only the contribution of each individual blur detection result but also the interrelation between neighbouring pixels. Considering that multi-method fusion depends on the specific image, we single out a subset of images similar to the input image from a training dataset and train the CRF-based multi-method fusion model only on this subset instead of the whole training dataset. The proposed multi-method fusion approach is shown to consistently outperform each individual blur detection method on public blur detection benchmarks.

Multiple-player tracking plays a key role in volleyball analysis. Developing effective tactics for professional events demands players' 3D information, such as speed and trajectory. Although 3D information can resolve occlusion relations, complete occlusion and similar appearance between players may still reduce tracking accuracy. This paper therefore proposes a particle filter based on motion vectors and player features for tracking multiple players in 3D space. For the prediction step, a motion vector prediction model combined with a Gaussian window model is proposed to predict a player's position after occlusion. For the likelihood estimation step, a 3D distance likelihood model is proposed to avoid tracking errors between two players, and a number detection likelihood model is used to distinguish players. With the proposed algorithm, not only can occlusion relations be resolved, but players' physical features in the real world can also be obtained. An experiment on an official volleyball match video (final game of the 2014 Japan Inter High School Games of Men's Volleyball in Tokyo Metropolitan Gymnasium) shows that our tracking algorithm achieves 91.9 % and 92.6 % success rates in the first and third sets.

Existing edit propagation methods take only simple constraints into consideration and neglect image structure information. We propose a new optimization framework based on L0 gradient minimization, which can globally satisfy user-specified edits while controlling the number of non-zero gradients. In this process, a modified affinity matrix approximation method that efficiently reduces randomness is introduced. We also introduce a self-adaptive re-parameterization scheme that controls the gradient counts based on both the original image and the user inputs. Our approach is demonstrated on image recoloring and tonal value adjustment. Numerous experiments show that our method significantly improves edit propagation via L0 gradient minimization.

Recently, many saliency detection models have used the image boundary as an effective prior on the image background for saliency extraction. However, these models may fail when the salient object overlaps the boundary. In this paper, we propose a novel saliency detection model that computes the contrast between superpixels under background priors and introduces a refinement method to address this problem. First, the SLIC (Simple Linear Iterative Clustering) method is used to segment the input image into superpixels. Then, the feature difference between superpixels is calculated based on color histograms, and the initial saliency value of each superpixel is computed as the sum of feature differences between that superpixel and the superpixels on the image boundary. Finally, a saliency map refinement method reassigns the saliency value of each image pixel to obtain the final saliency map. Compared with other state-of-the-art saliency detection methods, the proposed method provides better saliency predictions as measured by precision, recall and F-measure on two widely used datasets.
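
The boundary-prior contrast step can be sketched on a precomputed superpixel label map as follows; the histogram distance and normalization are illustrative choices, not necessarily the paper's exact ones:

```python
import numpy as np

def boundary_prior_saliency(labels, hists):
    """labels: (H, W) superpixel index map; hists: (N, B) per-superpixel
    color histograms (rows sum to 1). Saliency of a superpixel is its
    total histogram distance to the superpixels touching the image
    boundary, which the background prior assumes are non-salient."""
    border = np.unique(np.concatenate([labels[0], labels[-1],
                                       labels[:, 0], labels[:, -1]]))
    sal = np.zeros(len(hists))
    for i, h in enumerate(hists):
        sal[i] = sum(np.abs(h - hists[b]).sum() for b in border)
    sal /= sal.max() if sal.max() > 0 else 1.0
    return sal
```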

Position-patch based face hallucination methods reconstruct the high-resolution (HR) patch of each low-resolution (LR) input patch independently as an optimal linear combination of the training patches at the same position. Most current approaches directly use reconstruction weights learned from the LR training set to generate HR face images, without considering the structural difference between the LR and HR feature spaces. However, it is reasonable to assume that using HR images for weight learning would benefit the reconstruction process, because the HR feature space generally contains much more information. Therefore, in this paper, we propose a novel representation scheme, called High-resolution Reconstructed-weights Representation (HRR), that allows us to improve an intermediate HR image into a more accurate one. Here the HR reconstruction weights can be obtained effectively by solving a least-squares problem. Our evaluations on publicly available face databases demonstrate favorable performance compared to previous position-patch based methods.
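
The per-position least-squares weight learning can be sketched as below; the ridge regularizer and function names are our own additions for a well-posed, self-contained example:

```python
import numpy as np

def reconstruction_weights(patch, dictionary, reg=1e-6):
    """Least-squares combination weights of training patches (columns of
    `dictionary`) that best reconstruct `patch`; a small ridge term keeps
    the system well-posed when training patches outnumber pixels."""
    D = dictionary
    A = D.T @ D + reg * np.eye(D.shape[1])
    return np.linalg.solve(A, D.T @ patch)

def hallucinate(hr_dictionary, w):
    """Apply the same weights to the corresponding HR training patches
    to synthesize the HR output patch."""
    return hr_dictionary @ w
```

The HRR idea amounts to re-estimating `w` in the HR space, i.e. against an intermediate HR result rather than the LR input.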

This paper proposes a real-time system to render with layered materials by using a linearly filterable reflectance model. This model effectively captures both surface and subsurface reflections, and supports smooth transitions over different resolutions. In a preprocessing stage, we build mip-map structures for both surface and subsurface mesostructures via fitting their bumpiness with mixtures of von Mises Fisher (movMF) distributions. Particularly, a movMF convolution algorithm and a movMF reduction algorithm are provided to well-approximate the visually perceived bumpiness of the subsurface with controllable rendering complexity. Then, both surface and subsurface reflections are implemented on GPUs with real-time performance. Experimental results reveal that our approach enables aliasing-free illumination under environmental lighting at different scales.

We developed and evaluated different schemes for the real-time compression of multiple depth image streams. Our analysis suggests that a hybrid lossless-lossy compression approach provides a good tradeoff between quality and compression ratio. Lossless compression based on run-length encoding preserves the information in the highest bits of the depth image pixels, while the lowest 10 bits of a depth pixel value are directly encoded in the Y channel of a YUV image and compressed by an x264 codec. Our experiments show that the proposed method can encode and decode multiple depth image streams in less than 12 ms on average. Depending on the compression level, which can be adjusted at application runtime, we achieve compression ratios of about 4:1 to 20:1. Initial results indicate that the quality of 3D reconstructions is almost indistinguishable from the original at compression ratios of up to 10:1.
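
The bit-plane split and the lossless run-length path can be sketched as follows (the lossy x264 stage is omitted, so this round trip is exact; helper names are our own):

```python
import numpy as np

def rle_encode(a):
    """Run-length encode a 1-D array as (value, run_length) pairs."""
    idx = np.flatnonzero(np.diff(a)) + 1
    starts = np.concatenate([[0], idx])
    ends = np.concatenate([idx, [len(a)]])
    return [(int(a[s]), int(e - s)) for s, e in zip(starts, ends)]

def rle_decode(runs):
    return np.concatenate([np.full(n, v, dtype=np.uint8) for v, n in runs])

def split_depth(depth16):
    """Split 16-bit depth into high bits (run-length coded losslessly)
    and the lowest 10 bits destined for the Y channel of a YUV frame."""
    high = (depth16 >> 10).astype(np.uint8)       # top 6 bits
    low10 = (depth16 & 0x3FF).astype(np.uint16)   # fits a 10-bit Y plane
    return rle_encode(high.ravel()), low10

def merge_depth(runs, low10, shape):
    high = rle_decode(runs).reshape(shape).astype(np.uint16)
    return (high << 10) | low10
```

High bits change slowly across a depth map, so their run-length code stays small, while the noisy low bits are handed to the video codec.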

This paper presents a classification method, the MFRC, built on Linear Regression Classification (LRC). The MFRC minimizes the within-class compactness over the between-class separability to find an optimal embedding matrix for the LRC, so that the LRC in that subspace achieves high discrimination for classification. Specifically, the within-class compactness is measured as the sum of LRC distances between each sample and its neighbors within the same class, and the between-class separability is characterized as the sum of LRC distances between margin points and their neighboring points from different classes. The MFRC thus embodies the ideas of LRC, Fisher analysis and manifold learning. Experiments on the FERET, PIE and AR datasets demonstrate the effectiveness of the MFRC.

In a video coder, inter-frame prediction causes distortion to propagate among adjacent frames, and this distortion dependency is a crucial factor for rate control and video coding optimization. The macroblock tree (MBTree) is a typical temporal quantization control algorithm, in which a quantization offset δ is adjusted according to the amount of distortion propagation, measured by the relative propagation cost ρ. An appropriate δ-ρ model is the key to MBTree-like adaptive quantization algorithms. The default δ-ρ model in the MBTree algorithm is designed empirically, with rough model accuracy and insufficient generality across input sources. This paper focuses on this problem and applies a competitive decision mechanism to explore the optimal δ-ρ model.

For multi-label image classification, we use active learning to select example-label pairs whose labels are acquired from experts. The core of active learning is selecting the most informative examples to query. Most previous studies on active learning for multi-label classification have two shortcomings: they do not pay enough attention to label correlations, and existing example-label selection methods predict all the remaining labels of the selected example-label pair, which hurts classification performance when the number of labels is large. In this paper, we propose a semi-automatic labeling multi-label active learning (SLMAL) algorithm. First, SLMAL integrates uncertainty and label informativeness to select example-label pairs to query. It then chooses the most uncertain example-label pair and predicts its partial labels using its nearest neighbor. Our empirical results demonstrate that SLMAL outperforms state-of-the-art active learning methods for multi-label classification; it significantly reduces the labeling workload and improves the performance of the learned classifier.
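
The pair-selection step can be sketched with a standard margin-style uncertainty criterion (probability closest to 0.5); this is an illustrative stand-in for SLMAL's combined uncertainty/informativeness score:

```python
import numpy as np

def select_pair(probs, labeled_mask):
    """probs: (N, L) predicted label probabilities; labeled_mask marks
    example-label pairs already annotated. Returns the unlabeled
    (example, label) pair whose probability is closest to 0.5,
    i.e. the most uncertain one."""
    uncertainty = -np.abs(probs - 0.5)      # higher = more uncertain
    uncertainty = uncertainty.copy()
    uncertainty[labeled_mask] = -np.inf     # never re-query known pairs
    return np.unravel_index(np.argmax(uncertainty), probs.shape)
```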

The detection and recognition of racing bib numbers/text, printed on paper, cardboard tags, or t-shirts in natural images of marathons, races and sports, is challenging due to person movement, non-rigid surfaces, distortion under uneven illumination, severe occlusions, orientation variations, etc. In this paper, we present a multi-modal technique that combines both biometric and textual features to achieve good results for bib number/text detection. We exploit face and skin features in a new way to identify text candidate regions in input natural images. For each text candidate region, we apply text detection and recognition methods for detecting and recognizing bib numbers/text, respectively. To validate the usefulness of the proposed multi-modal technique, we conduct text detection and recognition experiments both before and after text candidate region detection in terms of recall, precision and f-measure. Experimental results show that the proposed multi-modal technique outperforms the existing bib number detection method.

Text detection and recognition in degraded video is complex and challenging due to lighting effects, sensor noise and motion blur. This paper presents a new method that derives multi-spectral images from each input video frame by studying non-linear intensity values in the Gray, R, G and B color spaces to increase the contrast of text pixels, yielding four respective multi-spectral images. We then propose multiple fusion criteria for the four multi-spectral images to enhance text information in degraded video frames, and apply a median operation to obtain a single image from the results of the multiple fusion criteria, which we name fusion-1. We further apply k-means clustering to the images fused by the multiple fusion criteria to classify text clusters, which yields binary images; the same median operation then fuses these binary images into a single image, which we name fusion-2. We evaluate the enhanced images at fusion-1 and fusion-2 using quality measures such as Mean Square Error, Peak Signal-to-Noise Ratio and Structural Similarity. Furthermore, the enhanced images are validated through text detection and recognition accuracy on video frames to show the effectiveness of the enhancement.
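
The median fusion step and two of the named quality measures can be sketched as follows (SSIM is omitted here as it is more involved; function names are our own):

```python
import numpy as np

def median_fuse(images):
    """Pixel-wise median over a stack of enhanced images, as used for
    both fusion-1 and fusion-2."""
    return np.median(np.stack(images, axis=0), axis=0)

def mse(a, b):
    """Mean Square Error between two images."""
    return float(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def psnr(a, b, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB; infinite for identical images."""
    m = mse(a, b)
    return float("inf") if m == 0 else 10.0 * np.log10(peak * peak / m)
```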

Video text is important semantic information that provides precise and meaningful clues for video indexing and retrieval. However, most previous approaches perform video text extraction and recognition separately, and the main difficulty, extraction and recognition against complex backgrounds, is not handled well. In this paper, we address this difficulty by combining text extraction and recognition and by using OCR feedback information. Our approach has the following features: (i) an efficient character image segmentation method is proposed that takes most of the available prior knowledge into consideration; (ii) text extraction is performed both on text rows and on segmented single-character images, since text-row-based extraction maintains the color consistency of characters and backgrounds while a single character has a simpler background; the best binary image is then chosen for recognition using OCR feedback; (iii) the K-means algorithm is used for extraction, which ensures that the best extraction result, the binary image with a clear separation of text strokes and background, is among the candidates. Finally, extensive experiments and empirical evaluations on several video text images demonstrate the satisfying performance of the proposed approach.

In active infrared vision, the inability to see physical colors has long been considered either a major drawback or simply something nobody paid attention to, until very recently. Looking at this color blindness from another perspective, we propose a novel medium whose visibility in both the visible and active infrared light spectrums can be controlled, enabling vision-based techniques to transform everyday printed media into smart, eco-friendly and sustainable monitor-like interactive displays.

To begin with, this paper examines the procedure most important to the success of this idea: estimating how physical colors look when seen by an active infrared camera. Two alternative methods are proposed and evaluated. The first uses a Bayesian classifier to find color-attribute combinations that precisely classify our sample data. The second relies on a simple weighted average and k-nearest-neighbor regression in two color models, RGB and CIE L*a*b*. The experimental results suggest that the second method is more practical and consistent at different distances, and indicate that the resulting model can likely estimate the infrared appearance of colors printed on different materials.

Modern audio coding technologies apply bandwidth extension (BWE) to represent audio data efficiently at low bitrates. An established method is the well-known spectral band replication (SBR), which provides very high sound quality with imperceptible artifacts, but at high bitrate and complexity. Another method is LPC-based BWE, part of the 3GPP AMR-WB+ codec; although its bitrate and complexity are distinctly lower, the sound quality it provides is unsatisfactory for music. In this paper, a novel bandwidth extension method is proposed that provides high sound quality close to eSBR at a bitrate of only 0.8 kbps. The proposed method predicts the fine structure of the high-frequency band from the low-frequency band with a deep auto-encoder, and extracts only the high-frequency envelope as side information. The performance evaluation demonstrates the advantage of the proposed method over the state of the art: compared with eSBR, the bitrate drops by about 63 % while the subjective listening quality remains close; compared with LPC-based BWE at the same bitrate, the subjective listening quality is better.

It is difficult to segment images of fine-grained objects due to the high variation of appearances. Common segmentation methods can hardly separate the part regions of an instance from the background with sufficient accuracy, yet these parts are crucial for fine-grained recognition. Observing that fine-grained objects share the same configuration of parts, we present a novel part-aware segmentation method that obtains a foreground segmentation from a bounding box while preserving semantic parts. We first design a hybrid part localization method that combines parametric and non-parametric models, and then iteratively update the segmentation outputs and the part proposals, which yields better foreground segmentation results. Experiments demonstrate the superiority of the proposed method compared to state-of-the-art approaches.

This paper presents a 3D soft-tissue surface reconstruction method based on improved compressed sensing and radial basis function interpolation for a small number of uniformly sampled points on a 3D surface. We adopt radial basis function interpolation to obtain the same number of data points as are to be reconstructed, and propose an improved compressed sensing method to reconstruct the 3D surface: we design a deterministic measurement matrix for signal observation, adopt the discrete cosine transform for the sparse representation of the 3D coordinates, and use a weak-selection regularized orthogonal matching pursuit algorithm for reconstruction. Experimental results show that the proposed algorithm improves both the resolution and the accuracy of the surface. The average maximum error is less than 0.9012 mm, and the surface is smooth enough to provide an accurate surface data model for a virtual-reality-based surgery system.

Videos are commonly used as course materials for e-learning. In most existing systems, the lecture videos are usually presented in a linear manner. Structuring the video corpus has proven an effective way for the learners to conveniently browse the video corpus and design their learning strategies. However, the content analysis of lecture videos is difficult due to the low recognition rate of speech and handwriting texts and the noisy information. In this paper, we explore the use of external domain knowledge from Wikipedia to construct learning maps for online learners. First, with the external knowledge, we filter the noisy texts extracted from videos to form a more precise and elegant representation of the video content. This facilitates us to construct a more accurate video map to represent the domain knowledge of the course. Second, by combining the video information and the external academic articles for the domain concepts, we construct a directed map to show the relationships between different concepts. This can facilitate online learners to design their learning strategies and search for the target concepts and related videos. Our experiments demonstrate that external domain knowledge can help organize the lecture video corpus and construct more comprehensive knowledge representations, which improves the learning experience of online learners.

In this paper, to track objects that undergo rotation and pose changes, we propose a novel algorithm that combines a discriminative global model and a generative local model. First, we exploit wavelet approximation coefficients and the completed local binary pattern (CLBP) to represent the object's global features. With the obtained global appearance descriptor, we use online discriminative metric learning to differentiate the target object from the background. To avoid the drift problem resulting from the global discriminative model, a novel generative, spatially geometric local model is introduced: based on SURF features, it quantizes geometric structure information in scale and angle. We then combine the global and local models so that they benefit each other. Compared with several other tracking algorithms, the experimental results demonstrate that the proposed algorithm tracks the target object reliably, especially under object pose changes and rotation.

In this study, we present a novel multi-cue active contour method for tracking target contours using edge, region, and shape information. To locate the target position, a contour-based mean-shift tracker is designed that combines both color and texture information. To reduce the adverse impact of a cluttered background and to accelerate curve evolution, we extract a rough target region from the incoming frame using the proposed target appearance model, which integrates both a discriminative pre-learning-based global layer and a voting-based local layer. To obtain detailed target boundaries, we embed edge, region, and shape information into a level-set-based multi-cue active contour model (MCAC). Experiments on seven video sequences demonstrate that the proposed method performs better than other competitive contour tracking methods under various tracking environments.

Person re-identification is the problem of recognising and associating persons across different cameras. Existing methods usually rely on visual appearance features, but such descriptions are sensitive to environmental variation; semantic attributes are comparatively more robust in complicated environments. Several attribute-based methods have therefore been introduced, but most of them ignore the diversity of different attributes. We characterize this diversity in two ways: the attribute confidence, which denotes descriptive power, and the attribute saliency, which expresses discriminative power. Specifically, the attribute confidence is determined by the performance of each attribute classifier, and the attribute saliency is defined by occurrence frequency, similar to the IDF (Inverse Document Frequency) [1] idea in information retrieval. Each attribute is then assigned an appropriate weight according to its saliency and confidence when calculating similarity distances. Based on these considerations, a novel person re-identification method is proposed. Experiments conducted on two benchmark datasets validate the effectiveness of the proposed method.
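
The IDF-style weighting and the weighted attribute distance can be sketched as below; the exact combination of saliency and confidence is our own illustrative choice:

```python
import numpy as np

def attribute_weights(occurrence, accuracy):
    """occurrence: fraction of gallery persons exhibiting each attribute;
    accuracy: per-attribute classifier accuracy (the confidence term).
    Saliency follows an IDF-style rule: rarer attributes discriminate
    more, so they receive larger weights."""
    saliency = np.log(1.0 / np.maximum(occurrence, 1e-6))
    w = saliency * accuracy
    return w / w.sum()

def weighted_distance(a1, a2, w):
    """Weighted Hamming-style distance between binary attribute vectors."""
    return float(np.sum(w * (a1 != a2)))
```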

Edit propagation algorithms are a powerful tool for performing complex edits with a few coarse strokes. However, current methods fail when dealing with light fields, since these methods do not account for view-consistency and due to the large size of data that needs to be handled. In this work we propose a new scalable algorithm for light field edit propagation, based on reparametrizing the input light field so that the coherence in the angular domain of the edits is preserved. Then, we handle the large size and dimensionality of the light field by using a downsampling-upsampling approach, where the edits are propagated in a reduced version of the light field, and then upsampled to the original resolution. We demonstrate that our method improves angular consistency in several experimental results.

This paper presents a novel interactive motion mapping system that maps human motion to virtual characters with different body part sizes, topologies and geometries. Our method is especially effective for characters whose bodies are disproportionate to the human structure. To achieve this, we propose an improved Embedded Deformation algorithm to control virtual characters in real time. In a preprocessing stage, we construct a deformation subgraph for each part and then merge them into a connected deformation graph; this step is entirely automatic and only has to be done once. At runtime, we use the Kinect to track human skeletal joints, iteratively solve for the rotation matrix and translation vector of each deformation graph node, and then update mesh vertex positions and normals. We demonstrate the flexibility and versatility of our method on a variety of virtual characters.

Hao Jiang, Lei Zhang

Visual Understanding and Recognition on Big Data

Frontmatter

Similarity search in graph databases has been widely studied in graph query processing in recent years. With the fast accumulation of graph databases, it is worthwhile to develop a fast algorithm that supports similarity search in large-scale graph databases. In this paper, we study the k-NN similarity search problem via locality sensitive hashing. We propose a fast graph search algorithm that first transforms complex graphs into vectorial representations based on prototypes in the database and then accelerates queries in Euclidean space by employing locality sensitive hashing. Additionally, a general retrieval framework is established in our approach. Experiments on three real datasets show that the presented algorithm achieves high performance in both accuracy and efficiency.
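
The hashing stage might look like the following sketch, using sign random projections as one common LSH family for Euclidean/cosine similarity; the class and method names are hypothetical, not from the paper:

```python
import numpy as np

class RandomProjectionLSH:
    """Sign-random-projection LSH over vectorial graph representations:
    vectors pointing in similar directions collide in the same bucket
    with high probability, giving fast approximate k-NN candidates."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = {}

    def hash(self, v):
        bits = (self.planes @ v) > 0                  # one sign bit per plane
        return int(bits.astype(np.int64) @ (1 << np.arange(len(bits))))

    def index(self, vectors):
        for i, v in enumerate(vectors):
            self.buckets.setdefault(self.hash(v), []).append(i)

    def candidates(self, v):
        """Bucket members are then re-ranked by exact Euclidean distance."""
        return self.buckets.get(self.hash(v), [])
```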

In this paper, we focus on English text localization in natural scene images. We propose a hierarchical localization framework that proceeds from characters to strings to words. Unlike existing methods, which either bet on sophisticated hand-crafted features or rely on heavy learning models, our approach favors simple but effective features and learning models. In this study, we introduce two-level character structure features in collaboration with Histogram of Oriented Gradients (HOG) and Convolutional Neural Network (CNN) features for character localization. For string localization, a nine-dimensional string feature is proposed for discriminative verification after grouping characters. For the final word localization, we learn an optimal splitting strategy based on interval cues to split strings into words. Experiments on the challenging ICDAR benchmark datasets demonstrate the effectiveness and superiority of our approach.

Most current action recognition approaches learn each action category separately. An important observation is that many action categories are correlated and can be clustered into groups, which is often ignored at the cost of recognition accuracy. In this paper, we employ a multi-task learning framework with group-structured regularization to share knowledge within category groups. First, we represent action data with Fisher Vectors, formed by concatenating gradients with respect to the mean vectors and covariance matrices of a GMM. Intuitively, action categories in the same group tend to relate closely to the same Gaussian components. The proposed method uses the one-vs-one SVM margin to measure the similarity between each pair of categories and obtains the implicit group structure by Affinity Propagation clustering. To encourage categories in the same group to share feature dimensions from the same Gaussian components, and vice versa, the implicit group structure is used as a prior regularization in multi-task learning. Our experiments on the large and realistic HMDB51 dataset show that the proposed method achieves comparable or even higher accuracy with fewer feature dimensions than several state-of-the-art approaches.

Human parsing is a challenging task because it is difficult to obtain accurate results for each part of the human body. Previous Boltzmann Machine based methods reach good segmentation results but express human parts poorly. In this paper, an approach is presented that exploits Shape Boltzmann Machine networks to improve the accuracy of human body parsing, and the proposed Curve Correction method refines the final segmentation results. Experimental results show that the proposed method achieves good body parsing performance, measured by Average Pixel Accuracy (aPA), against state-of-the-art methods on the Penn-Fudan Pedestrians dataset and the Pedestrian Parsing in Surveillance Scenes dataset.

With the popularity of 3D displays and the widespread use of depth cameras, 3D saliency detection has become feasible and significant. Different from 2D saliency detection, 3D saliency detection adds a depth channel, so the influence of depth and binocular parallax must be taken into account. In this paper, a new depth-based stereoscopic projection approach is proposed for 3D visual salient region detection. 3D images reconstructed from color and depth images are projected onto the XOZ and YOZ planes along specific directions. We find obvious characteristics in these projections that help us remove the background and progressive surfaces, whose depth increases gradually from near to far, so that salient regions are detected more accurately. A depth saliency map (DSM) is then created and combined with a 2D saliency map to obtain the final 3D saliency map. Our approach performs well in removing progressive surfaces and background, which are difficult to detect in 2D saliency detection.
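The projection step can be illustrated by accumulating depth-map pixels onto the XOZ plane as an occupancy count. This is a simplified sketch; `z_bins` and `z_max` are assumed quantization parameters, not values from the paper.

```python
def project_xoz(depth, z_bins, z_max):
    """Project a depth map (rows of per-pixel depth values) onto the
    XOZ plane as an occupancy count indexed by [depth bin][column x].
    """
    h, w = len(depth), len(depth[0])
    plane = [[0] * w for _ in range(z_bins)]
    for row in depth:
        for x, z in enumerate(row):
            # quantize depth z in [0, z_max] to a bin index
            zi = min(int(z / z_max * (z_bins - 1)), z_bins - 1)
            plane[zi][x] += 1
    return plane
```

On such a projection, a progressive ground surface traces a thin diagonal band (depth grows steadily with position), while a salient object occupies a narrow depth range and piles up as a compact peak, which is what makes the two separable.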

Hongyun Lin, Chunyu Lin, Yao Zhao, Jimin Xiao, Tammam Tillo

Coding and Reconstruction of Multimedia Data with Spatial-Temporal Information

Frontmatter

Due to limited network bandwidth, blurred and downsampled versions of high-resolution images are inevitably used for transmission over the internet, so single image super-resolution (SISR) algorithms play a vital role in reconstructing the lost spatial information of the low-resolution images. Recently, it has been recognized that the blur kernel is crucial to SISR performance. Most existing SISR methods assume the blur kernel is known, whereas in practice it is either fixed by the scaling factor or unknown; it is therefore of high value to investigate the relationship between blur kernels and reconstruction algorithms. In this paper, we first propose a fast and effective SISR method based on a mixture of experts and then give an empirical study of the sensitivity of different SISR algorithms to the blur kernel. Specifically, we find that different algorithms have different sensitivities to the blur kernel, and the most suitable blur kernel differs from algorithm to algorithm. Our findings highlight the importance of blur models for SISR algorithms and may benefit current spatial information coding methods in multimedia processing.
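The mixture-of-experts combination can be sketched generically as a softmax-gated blend of expert predictions. This is illustrative only: in the paper the experts are SISR regressors and the gating depends on image content, neither of which is modeled here.

```python
import math

def moe_predict(x, experts, gate_scores):
    """Combine expert outputs with softmax gating weights.

    experts: list of callables, each mapping an input to a prediction.
    gate_scores: unnormalized gating logits, one per expert.
    """
    m = max(gate_scores)                      # for numerical stability
    w = [math.exp(s - m) for s in gate_scores]
    z = sum(w)
    w = [v / z for v in w]                    # softmax weights
    return sum(wi * f(x) for wi, f in zip(w, experts))
```

With equal gate scores the prediction is the plain average of the experts; a dominant gate score lets one expert take over, which is how a kernel-aware gate could route a low-resolution patch to the expert trained on the matching blur model.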

Perceived audio quality is an important metric to measure the perceptual degradation of multi-channel audio signals, especially in coding and rendering systems. Conventional objective quality measurements such as PEAQ (Perceptual Evaluation of Audio Quality) are limited in describing both the basic audio quality and the spatial impression. A novel prediction model is proposed to predict the subjective quality of 5.1-channel audio systems. Two attributes are included in the evaluation: basic quality and surround effects. Multiple Linear Regression (MLR) combined with Principal Component Analysis (PCA) is used to establish the prediction model mapping objective parameters to subjective audio quality. The data set for model training and testing is obtained from formal listening tests under different coding conditions. Preliminary experimental results with 5.1-channel audio show that the proposed model predicts multi-channel audio quality more accurately than the conventional PEAQ method with respect to both basic audio quality and surround effects.
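The PCA-plus-regression pipeline can be sketched in its simplest form: project the objective parameters onto the leading principal component (found by power iteration) and fit least squares on the resulting score. The actual model would retain several components and multiple predictors; this one-component version is an assumption for illustration.

```python
def pca_mlr_fit(X, y):
    """Fit a one-component PCA + linear regression model.
    X: list of feature rows (objective parameters); y: subjective scores.
    Returns a predict(row) callable.
    """
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    # sample covariance matrix of the centered features
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n
          for b in range(d)] for a in range(d)]
    # power iteration for the leading eigenvector (1st principal axis)
    v = [1.0] * d
    for _ in range(100):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    scores = [sum(Xc[i][j] * v[j] for j in range(d)) for i in range(n)]
    # ordinary least squares on the 1-D score: y = a * score + b
    ms, my = sum(scores) / n, sum(y) / n
    a = (sum((s - ms) * (t - my) for s, t in zip(scores, y))
         / sum((s - ms) ** 2 for s in scores))
    b = my - a * ms

    def predict(row):
        s = sum((row[j] - means[j]) * v[j] for j in range(d))
        return a * s + b
    return predict
```

The PCA step decorrelates the objective parameters before regression, which is the usual reason to combine the two when predictors are numerous and collinear, as PEAQ-style model output variables tend to be.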

3D spatial sound effects can be achieved by amplitude panning over several loudspeakers, which produces the auditory event of a phantom source at an arbitrary location from loudspeakers placed at arbitrary locations in 3D space. Estimating the phantom source means estimating the signal and location of a single sound source that produces the same auditory perception as the phantom source produced by the loudspeakers. Several methods have been proposed to estimate phantom sources, but they cannot ensure the conservation of sound energy at the listening point, which comprises both kinetic energy (particle velocity) and potential energy (sound pressure), and estimation errors therefore arise. A new method to estimate the phantom source signal and position is proposed, based on the physical properties (particle velocity and sound pressure) of the loudspeaker-generated sound field at the listening point. Moreover, the proposed method is also applicable to arbitrarily, asymmetrically arranged loudspeakers. Experimental results show that, compared with current methods, the proposed method markedly reduces the estimation distortions of both the phantom source location and the signal superposed by the loudspeakers.
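The two quantities the method matches can be illustrated for a 2-D loudspeaker layout: the superposed sound pressure at the listening point is the sum of the gains, and the superposed particle-velocity vector points in the perceived direction. This is a simplified velocity-vector sketch, not the paper's full 3D formulation; `speaker_dirs` are assumed azimuths in radians.

```python
import math

def phantom_from_gains(gains, speaker_dirs):
    """Superpose loudspeaker contributions at the listening point.

    gains: amplitude panning gain per loudspeaker.
    speaker_dirs: azimuth (radians) of each loudspeaker seen from the
    listening point. Returns (pressure, phantom_azimuth).
    """
    pressure = sum(gains)  # potential-energy (sound pressure) term
    # kinetic-energy (particle velocity) vector components
    vx = sum(g * math.cos(a) for g, a in zip(gains, speaker_dirs))
    vy = sum(g * math.sin(a) for g, a in zip(gains, speaker_dirs))
    return pressure, math.atan2(vy, vx)
```

For equal gains on a symmetric pair at ±30°, the velocity vector points straight ahead (azimuth 0), matching the intuitive phantom-center case; a single equivalent source must reproduce both this direction and the summed pressure to conserve the energy at the listening point.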

In multi-source surveillance videos, a large number of moving objects are captured by different surveillance cameras. Although the regions covered by the individual cameras seldom overlap, similarities among objects across videos still result in tremendous global object redundancy. Coding each source independently is therefore inefficient, as it ignores the correlation among the videos. We thus propose a novel coding framework for multi-source surveillance videos based on a two-layer knowledge dictionary. By analyzing the characteristics of multi-source surveillance videos over large spatial and temporal scales, a two-layer dictionary is built to exploit the global object redundancy, and a dictionary-based coding method is developed for moving objects. For any object in the multi-source surveillance videos, only a few pose parameters and sparse coefficients are required for object representation and reconstruction. Experiments with two simulated surveillance videos demonstrate that the proposed coding scheme achieves better coding performance than the main profile of HEVC while preserving better visual quality.
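The sparse-coefficient representation over a dictionary can be sketched with plain matching pursuit. This is a generic stand-in: the paper's two-layer dictionary, pose parameters, and learning procedure are not modeled here.

```python
def matching_pursuit(signal, dictionary, n_iters=3):
    """Greedy sparse coding of a 1-D signal over a unit-norm dictionary.

    dictionary: list of atoms (each a list of floats, unit L2 norm).
    Returns (coefficients, residual).
    """
    residual = list(signal)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_iters):
        # pick the atom most correlated with the current residual
        dots = [sum(r * a for r, a in zip(residual, atom))
                for atom in dictionary]
        k = max(range(len(dots)), key=lambda i: abs(dots[i]))
        coeffs[k] += dots[k]
        # subtract the selected atom's contribution
        residual = [r - dots[k] * a
                    for r, a in zip(residual, dictionary[k])]
    return coeffs, residual
```

An object patch well covered by the dictionary is then transmitted as a handful of `(atom index, coefficient)` pairs instead of pixels, which is where the bitrate saving over independent per-camera coding comes from.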

Depth maps are widely exploited in 3D video coding and computer vision systems. In this paper, a novel depth map sequence coding method assisted by global motion information is proposed. The global motion of the depth camera is sampled synchronously to help the encoder improve depth map coding performance. The approach down-samples the frame rate at the encoder side; at the decoder side, each skipped frame is projected from its neighboring depth frames using the camera's global motion. Since fewer frames are encoded, the rate-distortion performance improves. Experimental results demonstrate that the proposed method enhances coding performance under various camera motion conditions, with gains of up to 2.04 dB.