Proceedings of Machine Learning Research
Proceedings of the Third Asian Conference on Machine Learning
Held in South Garden Hotels and Resorts, Taoyuan, Taiwan on 14-15 November 2011
Published as Volume 20 by the Proceedings of Machine Learning Research on 17 November 2011.
Volume Edited by:
Chun-Nan Hsu
Wee Sun Lee
Series Editors:
Neil D. Lawrence
http://proceedings.mlr.press/v20/

Unsupervised Multiple Kernel Learning
Traditional multiple kernel learning (MKL) algorithms are essentially supervised, in the sense that the kernel learning task requires the class labels of the training data. However, class labels may not always be available prior to the kernel learning task in some real-world scenarios, e.g., an early preprocessing step of a classification task or an unsupervised learning task such as dimension reduction. In this paper, we investigate the problem of Unsupervised Multiple Kernel Learning (UMKL), which does not require the class labels of training data that a conventional multiple kernel learning task needs. Since a kernel essentially defines pairwise similarity between any two examples, our unsupervised kernel learning method mainly follows two intuitive principles: (1) a good kernel should allow every example to be well reconstructed from its localized bases weighted by the kernel values; (2) a good kernel should induce kernel values that coincide with the local geometry of the data. We formulate the unsupervised multiple kernel learning problem as an optimization task and propose an efficient alternating optimization algorithm to solve it. Empirical results on both classification and dimension reduction tasks validate the efficacy of the proposed UMKL algorithm.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/zhuang11.html

A General Linear Non-Gaussian State-Space Model: Identifiability, Identification, and Applications
State-space modeling provides a powerful tool for system identification and prediction. In linear state-space models the data are usually assumed to be Gaussian and the models have certain structural constraints such that they are identifiable. In this paper we propose a non-Gaussian state-space model which does not have such constraints. We prove that this model is fully identifiable. We then propose an efficient two-step method for parameter estimation: one first extracts the subspace of the latent processes based on the temporal information of the data, and then performs multichannel blind deconvolution, making use of both the temporal information and non-Gaussianity. We conduct a series of simulations to illustrate the performance of the proposed method. Finally, we apply the proposed model and parameter estimation method to real data, including major world stock indices and magnetoencephalography (MEG) recordings. Experimental results are encouraging and show the practical usefulness of the proposed model and method.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/zhang11.html

Learning Attribute-weighted Voter Model over Social Networks
We propose an opinion formation model, an extension of the voter model that incorporates the strength of each node, which is modeled as a function of the node attributes. Then, we address the problem of estimating parameter values for these attributes that appear in the function from the observed opinion formation data and solve this by maximizing the likelihood using an iterative parameter value updating algorithm, which is efficient and is guaranteed to converge. We show that the proposed algorithm can correctly learn the dependency in our experiments on four real world networks for which we used the assumed attribute dependency. We further show that the influence degree of each node based on the extended voter model is substantially different from that obtained assuming a uniform strength (a naive model for which the influence degree is known to be proportional to the node degree), and is more sensitive to the node strength than the node degree even for a moderate value of the node strength.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/yamagishi11.html

Computationally Efficient Sufficient Dimension Reduction via Squared-Loss Mutual Information
The purpose of sufficient dimension reduction (SDR) is to find a low-dimensional expression of input features that is sufficient for predicting output values. In this paper, we propose a novel distribution-free SDR method called sufficient component analysis (SCA), which is computationally more efficient than existing methods. In our method, a solution is computed by iteratively performing dependence estimation and maximization: Dependence estimation is analytically carried out by recently-proposed least-squares mutual information (LSMI), and dependence maximization is also analytically carried out by utilizing the Epanechnikov kernel. Through large-scale experiments on real-world image classification and audio tagging problems, the proposed method is shown to compare favorably with existing dimension reduction approaches.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/yamada11.html

Mixed-Variate Restricted Boltzmann Machines
Modern datasets are becoming heterogeneous. To this end, we present in this paper Mixed-Variate Restricted Boltzmann Machines for simultaneously modelling variables of multiple types and modalities, including binary and continuous responses, categorical options, multicategorical choices, ordinal assessment and category-ranked preferences. Dependency among variables is modeled using latent binary variables, each of which can be interpreted as a particular hidden aspect of the data. The proposed model, similar to the standard RBMs, allows fast evaluation of the posterior for the latent variables. Hence, it is naturally suitable for many common tasks including, but not limited to, (a) as a pre-processing step to convert complex input data into a more convenient vectorial representation through the latent posteriors, thereby offering a dimensionality reduction capacity, (b) as a classifier supporting binary, multiclass, multilabel, and label-ranking outputs, or a regression tool for continuous outputs and (c) as a data completion tool for multimodal and heterogeneous data. We evaluate the proposed model on a large-scale dataset using the world opinion survey results on three tasks: feature extraction and visualization, data completion and prediction.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/tran11.html

Robust Generation of Dynamical Patterns in Human Motion by a Deep Belief Nets
We propose a Deep Belief Net model for robust motion generation, which consists of two layers of Restricted Boltzmann Machines (RBMs). The lower layer has multiple RBMs for encoding real-valued spatial patterns of motion frames into compact representations. The upper layer has one conditional RBM for learning temporal constraints on transitions between those compact representations. This separation of spatial and temporal learning makes it possible to reproduce many attractive dynamical behaviors such as walking by a stable limit cycle, a gait transition by bifurcation, synchronization of limbs by phase-locking, and easy top-down control. We trained the model with human motion capture data and the results of motion generation are reported here.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/sukhbaatar11.html

Mapping Kernels Defined Over Countably Infinite Mapping Systems and their Application
The mapping kernel is a generalization of Haussler’s convolution kernel, and has a wide range of applications, including kernels for higher degree structures such as trees. Like Haussler’s convolution kernel, a mapping kernel is a finite sum of values of a primitive kernel. One of the major reasons to use the mapping kernel template in engineering novel kernels is that a strong theorem is known for positive definiteness of the resulting mapping kernels: if the mapping kernel meets the transitivity condition and the primitive kernel is positive definite, the mapping kernel is also positive definite. In this paper, we generalize this theorem by showing that, even when we extend the definition of mapping kernels so that a mapping kernel can be a converging sum of countably infinite primitive kernel values, the transitivity condition is still a criterion for determining positive definiteness of mapping kernels according to the extended definition. Interestingly, this result is also useful for investigating positive definiteness of mapping kernels determined as finite sums when they do not meet the transitivity condition. For this purpose, we introduce a general method that we call the covering technique.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/shin11.html

Improving Policy Gradient Estimates with Influence Information
In reinforcement learning (RL) it is often possible to obtain sound, but incomplete, information about influences and independencies among problem variables and rewards, even when an exact domain model is unknown. For example, such information can be computed based on a partial, qualitative domain model, or via domain-specific analysis techniques. While, intuitively, such information appears useful for RL, there are no algorithms that incorporate it in a sound way. In this work, we describe how to leverage such information for improving the estimation of policy gradients, which can be used to speed up gradient-based RL. We prove general conditions under which our estimator is unbiased and show that it will typically have reduced variance compared to standard unbiased gradient estimates. We evaluate the approach in the domain of Adaptation-Based Programming where RL is used to optimize the performance of programs and independence information can be computed via standard program analysis techniques. Incorporating independence information produces a large speedup in learning on a variety of adaptive programs.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/pinto11.html

Microbagging Estimators: An Ensemble Approach to Distance-weighted Classifiers
Support vector machines (SVMs) have been the predominant approach to kernel-based classification. While SVMs have demonstrated excellent performance in many application domains, they are known to be sensitive to noise in their training dataset. Motivated by the equalizing effect of bagging classifiers, we present a novel approach to kernel-based classification that we call microbagging. This method bags all possible maximal-margin estimators between pairs of training points to create a novel linear kernel classifier with weights defined directly as functions of the pairwise distance matrix induced by the kernel function. We derive relationships between linear and distance-based classifiers and empirically compare microbagging to SVMs and robust SVMs on several datasets.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/nelson11.html

Learning to Locate Relative Outliers
Outliers usually spread across regions of low density. However, due to the absence or scarcity of outliers, designing a robust detector to sift outliers from a given dataset is still very challenging. In this paper, we consider identifying relative outliers in the target dataset with respect to another reference dataset of normal data. In particular, we employ Maximum Mean Discrepancy (MMD) for matching the distribution between these two datasets and present a novel learning framework to learn a relative outlier detector. The learning task is formulated as a Mixed Integer Programming (MIP) problem, which is computationally hard. To this end, we propose an effective procedure to find a largely violated labeling vector for identifying relative outliers from abundant normal patterns, and its convergence is also presented. Then, a set of largely violated labeling vectors is combined by multiple kernel learning methods to robustly locate relative outliers. Comprehensive empirical studies on real-world datasets verify that our proposed relative outlier detection outperforms existing methods.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/li11.html
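
As a rough illustration of the Maximum Mean Discrepancy mentioned in the abstract above, the following sketch computes the standard biased empirical estimate of squared MMD between a target sample and a reference sample. The Gaussian kernel, the `bandwidth` parameter, the function names and the toy data are assumptions made for this example; the sketch does not implement the paper's detector or its MIP formulation.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(target, reference, bandwidth=1.0):
    """Biased empirical estimate of squared MMD between two samples."""
    return (mean_kernel(target, target, bandwidth)
            + mean_kernel(reference, reference, bandwidth)
            - 2.0 * mean_kernel(target, reference, bandwidth))

# Toy data (hypothetical): a reference sample, an identical copy, and a shifted copy.
reference = [(0.0, 0.0), (0.1, -0.1), (-0.2, 0.1), (0.05, 0.2)]
same = list(reference)
shifted = [(x + 3.0, y + 3.0) for x, y in reference]

print(mmd_squared(same, reference))     # essentially zero: the samples coincide
print(mmd_squared(shifted, reference))  # clearly positive: the distributions differ
```

A value near zero suggests the two samples look alike under the chosen kernel, while a larger value signals a mismatch in distribution.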

Estimating Diffusion Probability Changes for AsIC-SIS Model from Information Diffusion Results
We address the problem of estimating changes in diffusion probability over a social network from the observed information diffusion results, where the changes are possibly caused by an unknown external situation change. For this problem, we focus on the asynchronous independent cascade (AsIC) model in the SIS (Susceptible/Infected/Susceptible) setting in order to meet more realistic situations such as communication in a blogosphere. This model is referred to as the AsIC-SIS model. We assume that the diffusion parameter changes are approximated by a series of step functions, and that their changes are reflected in the observed diffusion results. Thus, the problem is reduced to detecting how many step functions are needed, where in time each one starts and how long it lasts, and what the height of each one is. The method employs the derivative of the likelihood function of the observed data that are assumed to be generated from the AsIC-SIS model, adopts a divide-and-conquer type greedy recursive partitioning, and utilizes an MDL model selection measure to determine the adequate number of step functions. The results obtained using real world network structures confirmed that the method works as intended. The MDL criterion is useful to avoid overfitting, and the found pattern is not necessarily the same in terms of the number of step functions as the one assumed to be true, but the error is always reduced to a small value.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/koide11.html

Acceleration technique for boosting classification and its application to face detection
We propose an acceleration technique for boosting classification without any loss of classification accuracy and apply it to a face detection task. In classification tasks, much effort has been spent on improving the classification accuracy and the computational cost of training. In addition, the computational cost of classification itself can be critical in several applications, including face detection. In face detection, a celebrated work by Viola and Jones (2001) developed a significantly faster face detector that achieves accuracy competitive with all preceding face detectors. In their algorithm, the cascade structure of the boosting classifier plays an important role. In this paper, we propose an acceleration technique for boosting classifiers. The key idea of our proposal is the fact that, in general, one can determine the sign of the discriminant function before all weak learners are evaluated. One advantage is that our algorithm has no loss in classification accuracy. Another advantage is that our proposal is an unsupervised learning method, so it can treat a covariate shift situation. We also apply our proposal to each cascaded boosting classifier in a Viola and Jones type face detector. As a result, our proposal succeeds in reducing the classification cost by 20%.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/kawakita11.html
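
The key idea stated in the abstract above, that the sign of the discriminant function can be fixed before every weak learner is evaluated, can be illustrated with a minimal sketch. It assumes non-negative weights and weak learners that return -1 or +1, so the unevaluated learners can move the running score by at most the sum of their weights. The function names and the threshold stumps are hypothetical, and the sketch is not the authors' procedure, which may order or learn the evaluation differently.

```python
def classify_full(weights, weak_learners, x):
    """Reference classifier: evaluate every weak learner."""
    score = sum(alpha * h(x) for alpha, h in zip(weights, weak_learners))
    return 1 if score >= 0 else -1

def classify_with_early_exit(weights, weak_learners, x):
    """Return (sign, number of weak learners evaluated).

    Assumes weights >= 0 and weak learner outputs in {-1, +1}. Once the
    absolute running score exceeds the total weight still unevaluated,
    the final sign can no longer change, so evaluation stops.
    """
    remaining = sum(weights)
    score = 0.0
    evaluated = 0
    for alpha, h in zip(weights, weak_learners):
        score += alpha * h(x)
        remaining -= alpha
        evaluated += 1
        if abs(score) > remaining:
            break
    return (1 if score >= 0 else -1), evaluated

def stump(threshold):
    """Hypothetical weak learner: a threshold test on a scalar input."""
    return lambda x: 1 if x > threshold else -1

weights = [4.0, 2.0, 1.0, 0.5]
learners = [stump(0.0), stump(1.0), stump(2.0), stump(3.0)]

for x in (-1.0, 0.5, 5.0):
    sign, used = classify_with_early_exit(weights, learners, x)
    print(x, sign, used, classify_full(weights, learners, x))
```

Because the exit test is a strict inequality on a bound that always holds, the early-exit result matches the full evaluation; the saving depends on how dominant the first weights are.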

Bayesian inference for statistical abduction using Markov chain Monte Carlo
Abduction is one of the basic logical inferences (deduction, induction and abduction) and derives the best explanations for our observation. Statistical abduction attempts to define a probability distribution over explanations and to evaluate them by their probabilities. The framework of statistical abduction is general, since many well-known probabilistic models, e.g., BNs, HMMs and PCFGs, are formulated as statistical abduction. Logic-based probabilistic models (LBPMs) have been developed as a way to combine probabilities and logic, and they enable us to perform statistical abduction. However, most existing LBPMs impose restrictions on explanations (logical formulas) to realize efficient probability computation and learning. To relax those restrictions, we propose two MCMC (Markov chain Monte Carlo) methods for Bayesian inference on LBPMs using binary decision diagrams. The main advantage of our methods over existing methods is that they impose no restriction on formulas. In the context of statistical abduction with Bayesian inference, whereas our deterministic knowledge can be described by logical formulas as rules and facts, our non-deterministic knowledge such as frequency and preference can be reflected in a prior distribution in Bayesian inference. To illustrate our methods, we first formulate LDA (latent Dirichlet allocation), a well-known generative probabilistic model for bag-of-words, as a form of statistical abduction, and compare the learning result of our methods with that of an MCMC method called collapsed Gibbs sampling specialized for LDA. We also apply our methods to diagnosis of failure in a logic circuit and evaluate explanations using a posterior distribution approximated by our method. The experiment shows that Bayesian inference achieves better predictive accuracy than maximum likelihood estimation.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/ishihata11.html

Multi-label Active Learning with Auxiliary Learner
Multi-label active learning is an important problem because of the expensive labeling cost in multi-label classification applications. A state-of-the-art approach for multi-label active learning, maximum loss reduction with maximum confidence (MMC), heavily depends on the binary relevance support vector machine in both learning and querying. Nevertheless, it is not clear whether the heavy dependence is necessary or unrivaled. In this work, we extend MMC to a more general framework that removes the heavy dependence and clarifies the roles of each component in MMC. In particular, the framework is characterized by a major learner for making predictions, an auxiliary learner for helping with query decisions and a query criterion based on the disagreement between the two learners. The framework takes MMC and several baseline multi-label active learning algorithms as special cases. With the flexibility of the general framework, we design two criteria other than the one used by MMC. We also explore the possibility of using learners other than the binary relevance support vector machine for multi-label active learning. Experimental results demonstrate that a new criterion, soft Hamming loss reduction, is usually better than the original MMC criterion across different pairs of major/auxiliary learners, and validate the usefulness of the proposed framework.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/hung11.html

Preface
Preface to the Proceedings of the 3rd Asian Conference on Machine Learning, November 13-15, Taoyuan, Taiwan.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/hsu11.html

Summarization of Yes/No Questions Using a Feature Function Model
Answer summarization is an important problem in the study of Question Answering. In this paper, we deal with general questions with “Yes/No” answers in English. We design 1) a model to score the relevance of the answers and the questions, and 2) a feature function combining the relevance and opinion scores to classify each answer as “Yes”, “No” or “Neutral”. We combine the opinion features together with two weighting scores to solve this problem and conduct experiments on a real-world dataset. Given an input question, the system first detects whether it can be simply answered by “Yes/No” or not, and then outputs the resulting voting numbers of “Yes” answers and “No” answers to this question. We are also the first to propose accuracy, precision, and recall as measures for “Yes/No” answer detection.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/he11.html

Multi-label Classification with Error-correcting Codes
We formulate a framework for applying error-correcting codes (ECC) on multi-label classification problems. The framework treats some base learners as noisy channels and uses ECC to correct the prediction errors made by the learners. An immediate use of the framework is a novel ECC-based explanation of the popular random k-label-sets (RAKEL) algorithm using a simple repetition ECC. Using the framework, we empirically compare a broad spectrum of ECC designs for multi-label classification. The results not only demonstrate that RAKEL can be improved by applying some stronger ECC, but also show that the traditional Binary Relevance approach can be enhanced by learning more parity-checking labels. In addition, our study on different ECC helps understand the trade-off between the strength of ECC and the hardness of the base learning tasks.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/ferng11.html
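
To make the "simple repetition ECC" in the abstract above concrete, here is a small sketch of a repetition code applied to a binary label vector, with majority-vote decoding. In the framework described, each codeword bit would be predicted by a base learner acting as a noisy channel; in this sketch the noise is simulated by flipping bits by hand. The function names, the repeat count and the label vector are assumptions for illustration, and the sketch does not reproduce RAKEL or the paper's experiments.

```python
def encode_repetition(labels, repeats=3):
    """Repeat every label bit `repeats` times to form the codeword."""
    return [bit for bit in labels for _ in range(repeats)]

def decode_repetition(codeword, repeats=3):
    """Recover each label bit by majority vote over its block of copies."""
    decoded = []
    for start in range(0, len(codeword), repeats):
        block = codeword[start:start + repeats]
        decoded.append(1 if 2 * sum(block) > len(block) else 0)
    return decoded

labels = [1, 0, 1, 1]                 # hypothetical label vector
codeword = encode_repetition(labels)  # 12 bits, 3 copies of each label

# Simulate base-learner mistakes: one wrong bit in each of the first two blocks.
noisy = list(codeword)
noisy[0] ^= 1
noisy[4] ^= 1

print(decode_repetition(noisy) == labels)  # prints True
```

With three copies per label, one wrong prediction per block is corrected; two wrong predictions in the same block would flip the decoded label, which is the kind of strength versus difficulty trade-off the abstract refers to.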

Learning Rules from Incomplete Examples via Implicit Mention Models
We study the problem of learning general rules from concrete facts extracted from natural data sources such as newspaper stories and medical histories. Natural data sources present two challenges to automated learning, namely, radical incompleteness and systematic bias. In this paper, we propose an approach that combines simultaneous learning of multiple predictive rules with differential scoring of evidence which adapts to a presumed model of data generation. Learning multiple predicates simultaneously mitigates the problem of radical incompleteness, while the differential scoring helps reduce the effects of systematic bias. We evaluate our approach empirically on both textual and non-textual sources. We further present a theoretical analysis that elucidates our approach and explains the empirical results.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/doppa11.html

Learning low-rank output kernels
Output kernel learning techniques make it possible to simultaneously learn a vector-valued function and a positive semidefinite matrix which describes the relationships between the outputs. In this paper, we introduce a new formulation that imposes a low-rank constraint on the output kernel and operates directly on a factor of the kernel matrix. First, we investigate the connection between output kernel learning and a regularization problem for an architecture with two layers. Then, we show that a variety of methods such as nuclear norm regularized regression, reduced-rank regression, principal component analysis, and low rank matrix approximation can be seen as special cases of the output kernel learning framework. Finally, we introduce a block coordinate descent strategy for learning low-rank output kernels.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/dinuzzo11.html

Approximate Model Selection for Large Scale LSSVM
Model selection is critical to the least squares support vector machine (LSSVM). A major problem of existing model selection approaches for LSSVM is that the inverse of the kernel matrix needs to be calculated with O(n^3) complexity in each iteration, where n is the number of training examples. This is prohibitive for large-scale applications. In this paper, we propose an approximate approach to model selection of LSSVM. We use multilevel circulant matrices to approximate the kernel matrix so that the fast Fourier transform (FFT) can be applied to reduce the computational cost of matrix inversion. With such an approximation, we first design an efficient LSSVM algorithm with O(n log(n)) complexity and theoretically analyze the effect of kernel matrix approximation on the decision function of LSSVM. We further show that the approximate optimal model produced with the multilevel circulant matrix is consistent with the accurate one produced with the original kernel matrix. Under the guarantee of consistency, we present an approximate model selection scheme, whose complexity is significantly lower than that of previous approaches. Experimental results on benchmark datasets demonstrate the effectiveness of approximate model selection.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/ding11.html
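
The reason a circulant approximation makes the FFT applicable, as the abstract above states, is that a circulant matrix is diagonalized by the discrete Fourier transform, so a linear system C x = b can be solved by elementwise division in the frequency domain. The sketch below shows this for an ordinary (single-level) circulant matrix using a small recursive radix-2 FFT. It is not the paper's multilevel construction or its LSSVM algorithm, and the function names, the power-of-two length requirement and the example values are assumptions.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.
    The inverse transform is returned unnormalized (caller divides by n)."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def solve_circulant(first_column, rhs):
    """Solve C x = rhs, where C is circulant with the given first column.
    Assumes C is invertible, i.e. no DFT coefficient of first_column is zero."""
    n = len(first_column)
    eigenvalues = fft(first_column)  # eigenvalues of C
    rhs_hat = fft(rhs)
    x_hat = [b / lam for b, lam in zip(rhs_hat, eigenvalues)]
    return [v.real / n for v in fft(x_hat, inverse=True)]

def circulant_matvec(first_column, x):
    """Dense O(n^2) product C x, used here only to check the solution."""
    n = len(first_column)
    return [sum(first_column[(i - j) % n] * x[j] for j in range(n))
            for i in range(n)]

c = [4.0, 1.0, 0.0, 1.0]  # hypothetical first column of a symmetric circulant matrix
b = [1.0, 2.0, 3.0, 4.0]
x = solve_circulant(c, b)
print(circulant_matvec(c, x))  # close to b, up to floating-point error
```

The solve costs three transforms of length n plus n divisions, which is where the O(n log(n)) figure for circulant systems comes from, compared with O(n^3) for a general dense inverse.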

Continuous Rapid Action Value Estimates
In the last decade, Monte-Carlo Tree Search (MCTS) has revolutionized the domain of large-scale Markov Decision Process problems. MCTS most often uses the Upper Confidence Tree algorithm to handle the exploration versus exploitation trade-off, while a few heuristics are used to guide the exploration in large search spaces. Among these heuristics is Rapid Action Value Estimate (RAVE). This paper is concerned with extending the RAVE heuristics to continuous action and state spaces. The approach is experimentally validated on two artificial benchmark problems: the treasure hunt game, and a real-world energy management problem.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/couetoux11.html
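
For readers unfamiliar with how Upper Confidence Tree search handles the exploration versus exploitation trade-off mentioned in the abstract above, the following sketch shows the standard UCB1 selection rule for a node with a finite set of actions. It illustrates only that baseline rule; it does not implement RAVE or the continuous extension the paper proposes. The function names, the `exploration` constant and the statistics are assumptions for illustration.

```python
import math

def ucb1_score(mean_reward, visits, parent_visits, exploration=math.sqrt(2)):
    """UCB1 value: empirical mean plus an exploration bonus.
    Unvisited actions get an infinite score so they are tried first."""
    if visits == 0:
        return math.inf
    return mean_reward + exploration * math.sqrt(math.log(parent_visits) / visits)

def select_action(stats, exploration=math.sqrt(2)):
    """Pick the action with the highest UCB1 score.
    `stats` maps action -> (total_reward, visits)."""
    parent_visits = sum(visits for _, visits in stats.values())

    def score(action):
        total, visits = stats[action]
        mean = total / visits if visits else 0.0
        return ucb1_score(mean, visits, parent_visits, exploration)

    return max(stats, key=score)

# Hypothetical statistics for two actions at one tree node.
stats = {"left": (9.0, 10), "right": (1.0, 2)}
print(select_action(stats))                   # the rarely tried action wins the bonus
print(select_action(stats, exploration=0.0))  # pure exploitation picks the best mean
```

Setting `exploration` to zero reduces the rule to greedy selection, while larger values favor actions with few visits.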

Nonlinear Online Classification Algorithm with Probability Margin
Usually, it is necessary for nonlinear online learning algorithms to store a set of misclassified observed examples for computing kernel values. For large-scale problems, this is not only time consuming but also leads to an out-of-memory problem. In this paper, a nonlinear online classification algorithm with a probability margin is proposed to address the problem. In particular, the discriminant function is defined by the Gaussian mixture model with the statistical information of all the observed examples instead of data points. Then, the learnt model is used to train a nonlinear online classification algorithm with confidence such that the corresponding margin is defined by probability. When doing so, the internal memory is significantly reduced while the classification performance is kept. Also, we prove mistake bounds in terms of the generative model. Experiments carried out on one synthetic and two real large-scale data sets validate the effectiveness of the proposed approach.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/chi11.html

Support Vector Machines Under Adversarial Label Noise
In adversarial classification tasks like spam filtering and intrusion detection, malicious adversaries may manipulate data to thwart the outcome of an automatic analysis. Thus, besides achieving good classification performance, machine learning algorithms have to be robust against adversarial data manipulation to successfully operate in these tasks. While support vector machines (SVMs) have been shown to be a very successful approach in classification problems, their effectiveness in adversarial classification tasks has not been extensively investigated yet. In this paper we present a preliminary investigation of the robustness of SVMs against adversarial data manipulation. In particular, we assume that the adversary has control over some training data, and aims to subvert the SVM learning process. Within this assumption, we show that this is indeed possible, and propose a strategy to improve the robustness of SVMs to training data manipulation based on a simple kernel matrix correction.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/biggio11.html

Quadratic Weighted Automata: Spectral Algorithm and Likelihood Maximization
In this paper, we address the problem of non-parametric density estimation on a set of strings $\Sigma^*$. We introduce a probabilistic model - called quadratic weighted automaton, or QWA - and we present some methods which can be used in a density estimation task. A spectral analysis method leads to an effective regularization and a consistent estimate of the parameters. We provide a set of theoretical results on the convergence of this method. Experiments show that the combination of this method with likelihood maximization may be an interesting alternative to the well-known Baum-Welch algorithm.
Thu, 17 Nov 2011 00:00:00 +0000
http://proceedings.mlr.press/v20/bailly11.html