Accepted Papers

Long Presentation

[16] Enabling Hierarchical Dirichlet Processes to work better for short texts at large scale
Khai Mai, Sang Mai, Anh Nguyen, Linh Ngo, Khoat Than

Abstract: Analyzing texts from social media often encounters many challenges, including shortness, dynamics, and huge size. Short texts do not provide enough information, so statistical models often fail to work. In this paper, we show a very simple approach (namely, bag-of-biterms) that helps statistical models such as Hierarchical Dirichlet Processes (HDP) to work well with short texts. By using both terms (words) and biterms to represent documents, bag-of-biterms (BoB) provides significant benefits: (1) it naturally lengthens documents and thus helps us reduce the bad effects of shortness; (2) it enables the posterior inference in a large class of probabilistic models including HDP to be less intractable; (3) no modification of existing models/methods is necessary, and thus BoB can be easily employed in a wide class of statistical models. To evaluate those benefits of BoB, we consider Online HDP, which can deal with dynamic and massive text collections, and we do experiments on three large corpora of short texts crawled from Twitter, Yahoo Q&A, and New York Times. Extensive experiments show that BoB can help HDP work significantly better in both predictiveness and quality.
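The bag-of-biterms representation described in the abstract above can be illustrated with a minimal sketch. It assumes that a biterm is an unordered pair of tokens taken from two different positions of the same short text; the abstract does not define biterms precisely, and the function name `bag_of_biterms` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def bag_of_biterms(tokens):
    """Count terms and biterms of one tokenized short text.

    Assumption (not stated in the abstract): a biterm is an unordered
    pair of tokens taken from two different positions in the text.
    """
    bag = Counter(tokens)  # ordinary terms (words)
    for a, b in combinations(tokens, 2):
        bag[tuple(sorted((a, b)))] += 1  # biterm stored as a sorted tuple
    return bag


doc = ["cheap", "flight", "deal"]
bob = bag_of_biterms(doc)
# The 3-word text now has 3 terms plus 3 biterms.
print(sum(bob.values()))
```

Under this assumption a text of n tokens yields n + n(n-1)/2 counted units, which is how the representation lengthens short documents without changing the downstream model.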

Abstract: The L1 graph is an effective way to represent data samples in many graph-oriented machine learning applications. Its original construction algorithm is nonparametric, and the graphs it generates may have high sparsity. Meanwhile, the construction algorithm also requires many iterative convex optimization calculations and is very time-consuming. Such characteristics severely limit the application of the L1 graph in many real-world tasks. In this paper, we design a greedy algorithm to speed up the construction of the L1 graph. Moreover, we introduce the concept of a Ranked Dictionary for L1 minimization. This ranked dictionary not only preserves locality but also removes the randomness of neighborhood selection during graph construction. To demonstrate the effectiveness of our proposed algorithm, we present experimental results on several commonly used datasets using two different ranking strategies: one based on the Euclidean metric, and the other based on the diffusion metric.

Abstract: In this work, we exploit the emotional consistency between label information obtained by label propagation and distant supervision to improve tweet-level sentiment analysis. Existing methods rely heavily on either sufficient labeled data or sentiment lexicon resources, which are domain-specific in social media. We propose a unified three-phase framework to build a semi-supervised sentiment classifier for social media data. Our framework leverages labeled tweets, unlabeled tweets, and social relation graph data. First, we use label propagation to learn propagated labels for unlabeled tweets and partition all tweets into two clusters. Our label propagation is inspired by social science findings about the emotional behaviors of connected users, who may be more likely to hold similar opinions. Second, using sentiment lexicon resources, we use an unsupervised method to obtain noisy labels, which are used to train a distant supervision classifier. Next, we determine the relevance of each classifier to the unlabeled tweets, using the label consistency between the clustering given by the propagated tweet labels and the clustering given by these trained sentiment classifiers. Third, we trade off between using relevance-weighted trained classifiers and the labeled tweet data. Our method outperforms numerous baselines and a social networked sentiment classification method on two real-world Twitter datasets.

Abstract: Collaborative filtering techniques have been successfully applied in recommender systems recently. In order to improve recommendation accuracy for better user experience, review texts should be exploited because of their rich information about users’ explicit preferences and items’ features, which cannot be fully revealed by rating scores alone. In this paper, we propose an effective algorithm called LBMF to explore review texts and rating scores simultaneously. We directly correlate user and item latent dimensions with each word in review texts and ratings in our model, so semantic word vectors can be easily learned and effectively clustered based on rating values. On the other hand, the learned semantic word vectors can justify the rating values, which can promote better learning of user and item latent vectors for rating prediction. The latent dimensions learned by our model can reasonably explain why users rated items the way they did. This revelation can promote better modeling of user profiles and item information, and enable further analysis of user behaviors. Experimental results on several real-world datasets demonstrate the efficiency and effectiveness of LBMF compared to the state-of-the-art models.

Abstract: Image grouping in knowledge-rich domains is challenging, since domain knowledge and expertise are key to transforming image pixels into meaningful content. Manually marking and annotating images is not only labor-intensive but also ineffective. On the other hand, purely automated learning technology cannot bridge this gap because of the absence of experts’ input. We thus present an interactive machine learning paradigm that allows experts to become an integral part of the learning process. This paradigm is designed for automatically computing and quantifying interpretable groupings of dermatological images. In this way, the computational evolution of an image grouping model, its visualization, and expert interactions form a loop to improve image grouping. In our paradigm, dermatologists encode their domain knowledge about the medical images by grouping a small subset of images via an interface. Our learning algorithm automatically incorporates these manually specified connections as constraints for re-organizing the whole image dataset. Performance evaluation shows that this paradigm effectively improves image grouping based on expert knowledge.

[78] Denoising Time Series By Way of A Flexible Model For Phase Space Reconstruction
Minhzul Islam Sk, Arunava Banerjee

Abstract: We present a denoising technique in the domain of time series data that presumes a model for the uncorrupted underlying signal instead of a model for noise. Specifically, we show how the non-linear reconstruction of the underlying dynamical system by way of time delay embedding yields a new solution for denoising where the underlying dynamics is assumed to be highly non-linear yet low-dimensional. The model for the underlying data is recovered using a non-parametric Bayesian approach and is therefore very flexible. The proposed technique first clusters the reconstructed phase space through a Dirichlet Process Mixture of Exponential density, an infinite mixture model. Phase Space Reconstruction is accomplished by time delay embedding in the framework of Takens’ Embedding Theorem, with the underlying dimension being determined by the False Neighborhood method. Next, an Infinite Mixture of Linear Regressions via Dirichlet Process is used to non-linearly map the phase space data points to their respective temporally successive points in the phase space. Finally, a convex optimization based approach is used to restructure the dynamics by perturbing the phase space points to create the new denoised time series. We find that this method yields significantly better performance in power spectrum analysis, noise reduction, dimension recovery, and prediction accuracy of the phase space.
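The time delay embedding step mentioned in the abstract above can be sketched as follows. This is only the generic embedding construction, not the authors' denoising pipeline; the function name `delay_embed` and the parameter names `dim` (embedding dimension) and `tau` (delay in samples) are illustrative assumptions.

```python
def delay_embed(series, dim, tau):
    """Map a scalar time series to points in a dim-dimensional phase space.

    Point i is (x[i], x[i + tau], ..., x[i + (dim - 1) * tau]).
    """
    n_points = len(series) - (dim - 1) * tau
    if n_points <= 0:
        raise ValueError("series too short for this dim and tau")
    return [
        tuple(series[i + j * tau] for j in range(dim))
        for i in range(n_points)
    ]


points = delay_embed([0, 1, 2, 3, 4, 5], dim=3, tau=2)
print(points)
```

In this sketch `dim` is passed in by hand; in the abstract the dimension is chosen by the False Neighborhood method, which is not shown here.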

Abstract: This paper addresses the problem of feature extraction for estimating users’ transportation modes from their movement trajectories. Previous studies have adopted supervised learning approaches and relied on engineers’ skills to find effective features for accurate estimation. However, such hand-crafted features do not always work well because human behaviors are diverse and trajectories include noise due to measurement error. To compensate for the shortcomings of hand-crafted features, we propose a method that automatically extracts additional features using a deep neural network (DNN). So that a DNN can easily handle input trajectories, our method converts a raw trajectory data structure into an image data structure while maintaining effective spatio-temporal information. A classification model is constructed in a supervised manner using both the deep features and the hand-crafted features. We demonstrate the effectiveness of the proposed method through several experiments on two real datasets, including accuracy comparison with previous methods and feature visualization.

Abstract: Although the hyper-plane based One-Class Support Vector Machine (OCSVM) and the hyper-spherical based Support Vector Data Description (SVDD) algorithms have been shown to be very effective in detecting outliers, their performance on noisy and unlabeled training data has not been widely studied. Moreover, only a few heuristic approaches have been proposed to set the different parameters of these methods in an unsupervised manner. In this paper, we propose two unsupervised methods for estimating the optimal parameter settings to train OCSVM and SVDD models, based on analysing the structure of the data. We show that our heuristic is substantially faster than existing parameter estimation approaches while its accuracy is comparable with supervised parameter learning methods, such as grid-search with cross-validation on labeled data. In addition, our proposed approaches can be used to prepare a labeled data set for an OCSVM or an SVDD from unlabeled data.

Abstract: We present new initialization methods for the Expectation-Maximization algorithm for multivariate Gaussian mixture models. Our methods are adaptations of the well-known K-means++ initialization and the Gonzalez algorithm. We thereby aim to close the gap between simple random (e.g., uniform) methods and complex methods that crucially depend on the right choice of hyperparameters. Our extensive experiments on artificial as well as real-world data sets indicate the usefulness of our methods compared to common techniques, such as applying the original K-means++ and Gonzalez algorithms directly.

Abstract: Matrix decomposition methods are applied to a wide range of tasks, such as data denoising, dimensionality reduction, co-clustering and community detection. However, in the presence of boolean inputs, common methods either do not scale or do not provide a boolean reconstruction, which results in high reconstruction error and low interpretability of the decomposition. We propose a novel step decomposition of boolean matrices into non-negative factors with boolean reconstruction. By formulating the problem using threshold operators and through suitable relaxation of this problem, we provide a scalable algorithm that can be applied to boolean matrices with millions of non-zero entries. We show that our method achieves significantly lower reconstruction error when compared to standard state-of-the-art algorithms. We also show that the decomposition keeps its interpretability by analyzing communities in a flights dataset (where the matrix is interpreted as a graph in which nodes are airports) and in a movie-ratings dataset with 10 million non-zeros.

Abstract: Outlier detection consists of detecting anomalous observations in data. During the past decade, outlier detection methods were proposed using the concept of frequent patterns. Basically, such methods require mining all frequent patterns to compute the outlier factor of each transaction. This approach remains too expensive despite recent progress in the pattern mining field. In this paper, we provide exact and approximate methods for calculating the frequent pattern outlier factor (FPOF) without extracting any pattern or by extracting only a small sample. We propose an algorithm that returns the exact FPOF of each transaction without mining any pattern. Surprisingly, it works in polynomial time in the size of the dataset. We also present an approximate method where the end-user controls the maximum error on the estimated FPOF. An experimental study shows the value of both methods for very large datasets where exhaustive mining fails to provide the exact solution. The accuracy of our approximate method outperforms the baseline approach for the same budget.
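For orientation, the exhaustive baseline that the abstract above seeks to avoid can be sketched in a few lines. This is not the paper's algorithm: it assumes one common formulation of FPOF, in which the score of a transaction is the sum of the supports of the frequent patterns it contains divided by the number of frequent patterns, and it enumerates all itemsets by brute force. The function names and the `min_support` parameter are hypothetical.

```python
from itertools import combinations


def frequent_patterns(transactions, min_support):
    """Brute-force enumeration of itemsets with relative support >= min_support.

    Exponential in the number of items; usable only on toy data.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    patterns = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if set(cand) <= t) / n
            if support >= min_support:
                patterns[frozenset(cand)] = support
    return patterns


def fpof(transaction, patterns):
    """Assumed FPOF: summed support of contained patterns / number of patterns."""
    if not patterns:
        return 0.0
    contained = sum(s for p, s in patterns.items() if p <= transaction)
    return contained / len(patterns)


data = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"c"}]
pats = frequent_patterns(data, min_support=0.5)
scores = [fpof(t, pats) for t in data]
print(scores)  # a lower score marks a more outlier-like transaction
```

The exponential loop over candidate itemsets is exactly the cost that motivates computing FPOF without mining.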

Abstract: Video recommendation has become an essential part of online video services. Cold start, a problem relatively common in practical online video recommendation services, occurs when the user who needs video recommendation has no viewing history. A promising approach to resolve this problem is to capitalize on information in online social networks (OSNs): videos viewed by a user’s friends may be good candidates for recommendation. However, in practice, this information is also quite limited, either because the user has too few friends or because the friends lack sufficient viewing history. In this work, we utilize social groups with richer information to recommend videos. It is common for users to be affiliated with multiple groups in OSNs. Through members within the same group, we can reach a considerably larger set of users, and hence more candidate videos for recommendation. In this paper, by collaborating with Tencent Video, we propose a social-group-based algorithm to produce personalized video recommendations by ranking candidate videos from the groups a user is affiliated with. This algorithm was implemented and tested in the Tencent Video service system. Compared with two state-of-the-art methods, the proposed algorithm not only improves the click-through rate, but also recommends more diverse videos.

Abstract: As those in need increasingly ask for favors in online social services, a technique that accurately predicts whether their requests will be successful can immediately help them better formulate the requests. This paper aims to boost the accuracy of predicting the success of altruistic requests, following a setting similar to that of the state-of-the-art work ADJ [1]. While ADJ has unsatisfactory prediction accuracy and requires a large set of training data, we develop a novel request success prediction model, termed Graph-based Predictor for Request Success (GPRS). Our GPRS model learns the correlation between request success and the set of features extracted from the request, together with a label propagation-based optimization mechanism. In addition to the textual, social, and temporal features proposed by ADJ, we further propose three effective features, namely centrality, role, and topic features, to capture how users interacted in the past and how different topics affect the success of requests. Experiments conducted on the requests in the Random Acts of Pizza community of Reddit.com show that GPRS achieves AUC scores of around 0.81 and 0.68 using sufficient and limited training data respectively, significantly outperforming ADJ by 0.14 and 0.08 respectively.

Abstract: Modeling interactions in regression models poses both computational and statistical challenges: the computational resources and the amount of data required to solve them increase sharply with the size of the problem. We focus on logistic regression with categorical variables and propose a method for learning dependencies that are expressed as general Boolean formulas. The computational and statistical challenges are solved by applying a technique called transformed Lasso, which involves a matrix transformation of the original covariates. We compare the method to an earlier related method, LogicReg, and show that our method scales better in terms of the number of covariates as well as the order and complexity of the interactions.

Abstract: Recently, diverse variations of the large margin learning formalism have been proposed to improve the flexibility and the performance of classic discriminative models such as SVM. However, extra difficulties arise in optimizing non-convex learning objectives and selecting multiple hyperparameters. Existing training methods for non-convex problems are usually heuristics suffering from local minima, and traditional cross-validation based model (hyperparameter) selection is computationally intractable even for a small number of hyperparameters. Observing that many variations of large margin learning can be reformulated as jointly minimizing a parameterized quadratic objective, in this paper we propose a novel optimization framework, namely Parametric Dual sub-Gradient Descent Procedure (PDGDP), that produces a globally optimal training algorithm and an efficient model selection algorithm for two classes of large margin learning variations. The theoretical bases are a series of new results for parametric programs, which characterize the unique local and global structure of the dual optimum. The proposed algorithms are evaluated on two representative applications, i.e., the training of latent SVM and the model selection of cost-sensitive feature re-scaling SVM. The results show that PDGDP-based training and model selection achieve significant improvement over the state-of-the-art approaches.

Abstract: In this paper, a fuzzy classification with quantification algorithm is proposed for solving the air quality monitoring problem using e-noses. When e-noses are used in a dynamic outdoor environment, their performance suffers from noise, signal drift and the fast-changing natural environment. In such cases, the question is how to develop a prediction model capable of detecting as well as quantifying gases effectively and efficiently. Most current research has focused on either detection or quantification with gas sensors without taking those dynamic factors into account. In this paper, we propose a new model, namely the Fuzzy Classification with Quantification Model (FCQM), to cope with the above-mentioned challenges. To evaluate our model, extensive experiments are conducted on publicly available datasets generated over a three-year period, and the results demonstrate its superiority over other baseline methods. To our knowledge, gas type detection together with quantification is an unsolved challenge. Our paper provides the first solution of this kind.

Abstract: LASIK (Laser-Assisted in SItu Keratomileusis) surgeries have been quite popular for treatment of myopia (nearsightedness), hyperopia (farsightedness) and astigmatism over the past two decades. In the past decade, over 10 million LASIK procedures have been performed in the United States alone, with an average cost of approximately $2000 USD per surgery. While 99% of such surgeries are successful, the most common side effect is a residual refractive error and poor uncorrected visual acuity (UCVA). In this work, we aim at predicting the UCVA after LASIK surgery. We model the task as a regression problem and use the patient demography and pre-operative examination details as features. To the best of our knowledge, this is the first work to systematically explore this critical problem using machine learning methods. Further, LASIK surgery settings are often determined by practitioners using manually designed rules. We explore the possibility of determining such settings automatically to optimize for the best post-operative UCVA by including such settings as features in our regression model. Our experiments on a dataset of 791 surgeries yield an RMSE of 0.102, 0.094 and 0.074 for the predicted post-operative UCVA one day, one week and one month after the surgery respectively.

Abstract: One of the key challenges in large attributed graph clustering is how to select representative attributes. Previous studies introduce user-guided clustering methods by letting a user select samples based on his/her knowledge. However, due to knowledge limitation, a single user may only pick out the samples that s/he is familiar with while ignoring the others, such that the selected samples are often biased. We propose a framework to address the bias issue which allows multiple individuals to select samples for a specific clustering. With wider knowledge coming from multiple users, the selected samples can be more relevant to the clustering. The challenges of this study are twofold. Firstly, as user-selected samples are usually sparse and the graph can be large, it is non-trivial to effectively combine the annotations. Secondly, it is also difficult to design a scalable approach to cluster large graphs with millions of nodes. We propose the approach CGMA (Clustering Graphs with Multiple Annotations) to address these challenges. CGMA is able to combine the crowd’s consensus opinions in an unbiased way, and conducts an effective clustering with low time complexity. It scales well with graph size, because we perform a local clustering instead of a global partition. We show the effectiveness and efficiency of the proposed approach on real-world graphs, by comparing with existing attributed graph clustering approaches.

Abstract: Australia’s critical water pipes break on average 7,000 times per year. Being able to accurately identify which pipes are at risk of failure could save Australia’s water utilities and the community up to $700 million a year in reactive repairs and maintenance. However, ranking these water pipes according to their calculated risk has had mixed results due to their different types of attributes, data incompleteness and data imbalance. This paper describes our experience in improving the performance of classifying and ranking these data via local metric learning. Distance metric learning is a powerful tool that can improve the performance of similarity-based classification. In general, global metric learning techniques do not consider local data distributions, and hence do not perform well on complex or heterogeneous data. Local metric learning methods address this problem but are usually expensive in runtime and memory. This paper proposes a fuzzy-based local metric learning approach that outperforms recently proposed local metric methods, while still being faster than popular global metric learning methods in most cases. Extensive experiments on Australian water pipe datasets demonstrate the effectiveness and performance of our proposed approach.

Abstract: Many data mining applications potentially operate in an adversarial environment where adversaries adapt their behavior to evade detection. Typically, adversaries alter data under their control to cause a large divergence of distribution between training and test data. Existing state-of-the-art adversarial learning techniques address this problem only for the case in which there is a single type of adversary. In practice, a learner often has to face multiple types of adversaries that may employ different attack tactics. In this paper, we tackle the challenges of multiple types of adversaries with a nested Stackelberg game framework. We demonstrate the effectiveness of our framework with extensive empirical results on both synthetic and real data sets. Our results demonstrate that the nested game framework offers more reliable defense against multiple types of attackers.

Abstract: We introduce a new computational problem, the BackboneDiscovery problem, which encapsulates both functional and structural aspects of network analysis. While the topology of a typical road network has been available for a long time (e.g., through maps), it is only recently that fine-granularity functional (activity and usage) information about the network (like source-destination traffic information) is being collected and is readily available. The combination of functional and structural information provides an efficient way to explore and understand usage patterns of networks and aid in design and decision making. We propose efficient algorithms for the BackboneDiscovery problem including a novel use of edge centrality. We observe that for many real world networks, our algorithm produces a backbone with a small subset of the edges that support a large percentage of the network activity.

Abstract: Frequent pattern mining has been widely studied in the past decades and has been applied to many domains. In particular, numerical transaction databases, where not only the items but also the utility associated with them are available in user transactions, are useful for real applications. For example, customer mobile App traffic data collected by mobile service providers contains such information. In this paper, we aim to find frequent patterns that occupy a large portion of the total utility of the supporting transactions, to answer questions like "On which mobile Apps do the customers spend most of their data traffic?" Towards this goal, we define a measure called utility occupancy to measure the contribution of a pattern within a transaction. The challenge of discovering high utility occupancy itemsets is the lack of a monotone or anti-monotone property. We therefore derive an upper bound for utility occupancy and design an efficient mining algorithm called OCEAN based on a fast implementation of the utility list. Evaluations on real-world mobile App traffic data and three other datasets show that OCEAN is efficient and effective in finding frequent patterns with large utility occupancy.
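A small sketch may make the utility occupancy measure in the abstract above concrete. The abstract does not give formulas, so the definitions here are assumptions: the occupancy of a pattern in one transaction is taken as the pattern's utility divided by the transaction's total utility, and the occupancy of a pattern over a database is the mean over its supporting transactions. The function names and the toy data are hypothetical, and this is not the OCEAN algorithm.

```python
def utility_occupancy(pattern, transaction):
    """Share of a transaction's total utility contributed by the pattern's items.

    `transaction` maps item -> utility (e.g. app -> megabytes of traffic).
    """
    total = sum(transaction.values())
    return sum(transaction[item] for item in pattern) / total


def pattern_occupancy(pattern, database):
    """Return (mean occupancy over supporting transactions, support count)."""
    supporting = [t for t in database if all(item in t for item in pattern)]
    if not supporting:
        return 0.0, 0
    mean = sum(utility_occupancy(pattern, t) for t in supporting) / len(supporting)
    return mean, len(supporting)


# Toy traffic data: each transaction is one customer's usage per app.
db = [
    {"video": 80, "chat": 20},
    {"video": 60, "maps": 40},
    {"chat": 10, "maps": 90},
]
occupancy, support = pattern_occupancy({"video"}, db)
print(occupancy, support)
```

Under these assumed definitions, adding an item to a pattern can raise its per-transaction share while shrinking the set of supporting transactions, so the mean can move in either direction; this is consistent with the lack of a monotone property noted in the abstract.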

Abstract: Recently, social recommendation has become a hot research direction, which leverages social relations among users to alleviate data sparsity and cold-start problems in recommender systems. Social recommendation methods usually employ simple similarity information of users as social regularization on users. Unfortunately, the widely used social regularization has several shortcomings: (1) the similarity information of users only stems from users’ social relations; (2) it only imposes constraints on users; (3) it may not work well for users with low similarity. In order to overcome the shortcomings of social regularization, we propose a new dual similarity regularization to impose constraints on users and items with high and low similarities simultaneously. With the dual similarity regularization, we design an optimization function to integrate the similarity information of users and items, and a gradient descent solution is derived to optimize the objective function. Experiments on two real datasets validate the effectiveness of the proposed solution.

Abstract: For the class imbalance problem, the integration of sampling and ensemble methods has shown great success among various methods. Nevertheless, as the representatives of sampling methods, undersampling and oversampling cannot outperform each other. That is, undersampling fits some data sets while oversampling fits others. Besides, the sampling rate also significantly influences the performance of a classifier, while existing methods usually adopt a full sampling rate to produce a balanced training set. In this paper, we propose a new algorithm that utilizes a new hybrid scheme of undersampling and oversampling with sampling rate selection to preprocess the data in each ensemble iteration. Bagging is adopted as the ensemble framework because the sampling rate selection can benefit from the Out-Of-Bag estimate in bagging. The proposed method features both undersampling and oversampling, and a specifically selected sampling rate for each data set. The experiments are conducted on 26 data sets from the UCI data repository, in which the proposed method is compared with existing counterparts using three evaluation metrics. Experiments show that, combined with bagging, the proposed hybrid sampling method significantly outperforms the other state-of-the-art bagging-based methods for the class imbalance problem. Meanwhile, the superiority of sampling rate selection is also demonstrated.

Abstract: Multi-label classification targets the prediction of multiple interdependent and non-exclusive binary target variables. Transformation-based algorithms transform the data set such that regular single-label algorithms can be applied to the problem. A special type of transformation-based classifiers are label compression methods, which compress the labels and then mostly use single-label classifiers to predict the compressed labels. So far, there are no compression-based algorithms that follow a problem transformation approach and address non-linear dependencies in the labels. In this paper, we propose a new algorithm, called Maniac (Multi-lAbel classificatioN usIng AutoenCoders), which extracts the non-linear dependencies by compressing the labels using autoencoders. We adapt the training process of autoencoders to make them more suitable for parameter optimization in the context of this algorithm. The method is evaluated on eight standard multi-label data sets. Experiments show that despite not producing a good ranking, Maniac generates a particularly good bipartition of the labels into positives and negatives. This is caused by rather strong predictions with either really high or low probability. Additionally, the algorithm seems to perform better given more labels and a higher label cardinality in the data set.

Abstract: Recent years have witnessed the boom of heterogeneous information networks (HINs), which contain different types of nodes and relations. Many data mining tasks have been explored in this kind of network. Among them, link prediction is an important task that predicts potential links among nodes, which is required in many applications. Contemporary link prediction methods are usually based on simple HINs whose schemas are bipartite or star-schema. In these HINs, the meta paths are predefined or can be enumerated. However, for much real networked data, it is hard to describe the network structure with a simple schema. For example, a knowledge base in RDF format includes tens of thousands of types of objects and links. On this kind of schema-rich HIN, it is impossible to enumerate meta paths. In this paper, we study link prediction in schema-rich HINs and propose a novel Link Prediction with automatic meta Paths method (LiPaP). LiPaP uses an algorithm called Automatic Meta Path Generation (AMPG) to automatically extract meta paths from a schema-rich HIN, and a supervised method with a likelihood function to learn weights of the extracted meta paths. Experiments on a real knowledge base, Yago, validate that LiPaP is an effective, steady and efficient method.

Abstract: This paper discusses a complete and efficient algorithm for enumerating densely connected k-plexes. A k-plex is a kind of pseudo-clique which imposes a Disconnection Upper Bound (DUB) given by the parameter k on each constituent vertex. However, since the parameter is usually fixed and does not depend on the sizes of the targeted pseudo-cliques, we often obtain k-plexes that are not densely connected. In order to overcome this drawback, we introduce another constraint using a parameter j designating a Connection Lower Bound (CLB). Based on CLB, we can additionally exploit a monotonic j-core operation and design an efficient depth-first algorithm which can exclude hopeless vertex sets that cannot be extended to supersets satisfying both DUB and CLB. Our experimental results show that the algorithm works well as a useful tool for detecting densely connected pseudo-cliques in large networks, including one with over 800,000 vertices.
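The two constraints in the abstract above can be checked for a given vertex set with a short sketch. It uses the standard k-plex definition (every vertex of a set S is adjacent to at least |S| - k other vertices of S) for the DUB, and assumes that the CLB means every vertex has at least j neighbors inside S; the abstract does not spell out the latter. This is only a membership test, not the enumeration algorithm, and `is_dense_k_plex` is a hypothetical name.

```python
def is_dense_k_plex(adj, vertices, k, j=0):
    """Check the DUB (k-plex) and the assumed CLB for a vertex set.

    adj maps each vertex to the set of its neighbors (undirected graph).
    DUB: each vertex is adjacent to at least |S| - k vertices of S.
    CLB (assumed reading): each vertex has at least j neighbors in S.
    """
    s = set(vertices)
    for v in s:
        inside_degree = len(adj[v] & s)
        if inside_degree < len(s) - k:  # violates the disconnection upper bound
            return False
        if inside_degree < j:  # violates the connection lower bound
            return False
    return True


# A 4-cycle a-b-c-d-a: every vertex has exactly 2 neighbors in the set.
cycle = {
    "a": {"b", "d"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "a"},
}
print(is_dense_k_plex(cycle, "abcd", k=2))       # 2-plex
print(is_dense_k_plex(cycle, "abcd", k=1))       # not a clique
print(is_dense_k_plex(cycle, "abcd", k=2, j=3))  # fails the assumed CLB
```

With k fixed, the DUB alone tolerates the same number of missing neighbors per vertex regardless of the set's size, which is why the additional lower bound j is useful for filtering out sparsely connected sets.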

Abstract: Collaborative Filtering with Implicit Feedback (e.g., browsing or clicking records), named CF-IF, has been demonstrated to be an effective approach in recommender systems. Existing works on CF-IF can be mainly classified into two categories, i.e., point-wise regression based and pair-wise ranking based, where the latter relaxes assumptions and usually obtains better performance in empirical studies. In real applications, implicit feedback is often very sparse, causing CF-IF based methods to degrade significantly in recommendation performance. In this case, side information (e.g., item content) is usually introduced and utilized to address the data sparsity problem. Nevertheless, the latent feature representation learned from side information by a topic model may not be very effective when the data is too sparse. To address this problem, we propose collaborative deep ranking (CDR), a hybrid pair-wise approach with implicit feedback, which incorporates a deep feature representation of item content into the Bayesian framework of a pair-wise ranking model. The experimental analysis on a real-world dataset shows that CDR outperforms three state-of-the-art methods in terms of the recall metric under different sparsity levels.

AbstractFor secure and efficient operation of engineering systems, it is of great importance to watch the daily logs of the system, which mainly consist of multivariate time series obtained from many sensors. We focus on one of the most challenging issues in practical analyses of sensor data, namely temporal sparseness. To handle unevenly and sparsely spaced multivariate time series, this work presents a novel method which roughly models the remaining temporal dependency of the data. The proposed model is a mixture of latent factor models with a dynamic hierarchical structure that considers dependency between temporally close batches of observations, instead of every single observation. We conducted experiments with synthetic and real datasets, and confirmed the validity of the proposed model quantitatively and qualitatively.

AbstractRecent pattern mining algorithms such as LAMP allow us to compute the statistical significance of patterns with respect to an outcome variable. Their p-values are properly corrected to bound the family-wise error rate, which is the probability of at least one false discovery occurring. However, they are a poor fit for medical applications, due to their inability to handle potential confounding variables such as age or gender. We propose a novel pattern mining algorithm that evaluates statistical significance under confounding variables. Using a new testability bound based on the exact logistic regression model, the algorithm can rule out a large number of combinations without testing them, limiting the amount of correction required for multiple hypothesis testing. Using synthetic data, we showed that our method could remove the bias introduced by confounding variables while still detecting true patterns correlated with the class. In addition, we presented a real-world case study of data integration using a confounding variable.

AbstractGraph-based similarity ranking plays a key role in improving image retrieval performance. The current trend is to fuse the ranking results from multiple feature sets, including textual, visual and query log features, to improve retrieval effectiveness. The primary challenge is how to effectively exploit the complementary properties of different features. Another difficult issue arises from the highly noisy features contributed by users, such as textual tags and query logs, which make the exploration of such complementary properties difficult. This paper proposes a Multi-view Manifold Ranking (M2R) framework, in which multiple graphs built on different features are integrated to simultaneously encode the similarity ranking. To deal with the high noise inherent in the user-contributed features, a data cleaning solution based on visual-neighbor voting is embedded into M2R, which is thus called Robust M2R (RM2R). Experimental results show that the proposed method significantly outperforms the existing approaches, especially when the user-contributed features are highly noisy.

AbstractOnline spam comments often mislead users during online shopping. Existing online spam detection methods rely on semantic clues, behavioral footprints, and relational connections between users in review systems. Although these methods can successfully identify spam activities, evolving fraud strategies can escape the detection rules by purchasing positive comments from massive numbers of random users, i.e., the user Cloud. In this paper, we study a new problem, Collective Marketing Hyping detection, for detecting spam comments generated from the user Cloud. It is defined as detecting a group of marketing hyping products with untrustworthy marketing promotion behaviour. We propose a new learning model that uses heterogeneous product networks extracted from product review systems. Our model aims to mine a group of hyping activities, which differs from existing models that only detect a single product with hyping activities. We show the existence of Collective Marketing Hyping behavior in real-life networks. Experimental results demonstrate that the product information network can effectively detect intentionally fraudulent product promotions.

AbstractTreatments of cancer cause severe side effects called toxicities. Reduction of such effects is crucial in cancer care. To impact care, we need to predict toxicities at fortnightly intervals. This toxicity data differs from traditional time series data as toxicities can be caused by one treatment on a given day alone, and thus it is necessary to consider the effect of the singular data vector causing toxicity. We model the data before prediction points using multiple instance learning, where each bag is composed of multiple instances associated with daily treatments and patient-specific attributes, such as chemotherapy, radiotherapy, age and cancer type. We then formulate a Bayesian multi-task framework to enhance toxicity prediction at each prediction point. The use of the prior allows factors to be shared across task predictors. Our proposed method simultaneously captures the heterogeneity of daily treatments and performs toxicity prediction at different prediction points. Our method was evaluated on a real-world dataset of more than 2000 cancer patients and achieved better prediction accuracy in terms of AUC than the state-of-the-art baselines.

AbstractWeb data extraction can be classified into two categories based on the extraction targets: record-level tasks and page-level tasks. Although the web data extracted by the page-level approach is more complete than that of the record-level approach, very few studies focus on this task because of the difficulty and complexity of the problem. On the other hand, previous page-level systems focus on how to achieve unsupervised data extraction and pay less attention to schema verification, i.e., how to extract data by matching testing pages with an existing schema. In this paper, we emphasize the importance of schema verification for large-scale extraction tasks. Given a large number of web pages for data extraction, the system uses part of the input pages for training the schema without supervision, and then extracts data from the rest of the input pages through schema verification. To speed up the processing, we utilize leaf nodes (of the DOM trees) as the processing units and dynamically adjust the encoding for better alignment. The proposed system works better than other page-level extraction systems in terms of schema accuracy and extraction efficiency. Overall, extraction is dozens of times faster than with state-of-the-art unsupervised approaches that extract data page by page without schema verification.

AbstractStatistical machine translation models are known to benefit from the availability of a domain bilingual lexicon. Bilingual lexicons are traditionally composed of multiword expressions, either extracted from parallel corpora or manually curated. We claim that “patterns”, composed of words and higher-order categories, generalize better in capturing the syntax and semantics of the domain. In this work, we present an approach to extract such patterns from a domain corpus and curate a high-quality bilingual lexicon. We discuss several features of these patterns that define the “consensus” between their underlying multiwords. We incorporate the bilingual lexicon in a baseline SMT model, and detailed experiments show that the resulting translation model performs much better than the baseline and other similar systems.

AbstractAggregating local descriptors into super vectors achieves excellent performance in image classification and retrieval tasks. Vector of locally aggregated descriptors (VLAD), which indexes images to compact representations by aggregating the residuals of descriptors and visual words, is a popular super vector encoding method of this kind. This paper focuses on the biggest difficulty of VLAD, visual burstiness; we revisit the basic assumptions and solutions along this line, then make modifications to two key steps of the initial VLAD process. The main contributions are twofold. Firstly, we start from the local coordinate system (LCS) and propose an aggregated version (aggrLCS), which changes the objective and timing of coordinate rotation to better capture bursts. Secondly, an adaptive power-law normalization method is adopted to magnify the positive effect of power-law normalization by weighting each dimension respectively. Experiments on image retrieval tasks demonstrate that the proposed modifications show superior performance over the original and several variants of VLAD.
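
As background for the two steps this abstract modifies, the baseline VLAD pipeline can be sketched as follows. This is an illustrative sketch of plain VLAD with a fixed power-law exponent, not the aggrLCS or adaptive normalization proposed in the paper; the function name, the toy codebook and the exponent `alpha` are assumptions made for the example.

```python
import math
from typing import List, Sequence


def vlad_encode(descriptors: Sequence[Sequence[float]],
                codebook: Sequence[Sequence[float]],
                alpha: float = 0.5) -> List[float]:
    """Baseline VLAD: sum residuals per visual word, power-law, then L2 normalize."""
    dim = len(codebook[0])
    residual_sums = [[0.0] * dim for _ in codebook]
    for x in descriptors:
        # Assign the descriptor to its nearest visual word (squared Euclidean distance).
        nearest = min(
            range(len(codebook)),
            key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, codebook[c])),
        )
        # Accumulate the residual between the descriptor and that visual word.
        for d in range(dim):
            residual_sums[nearest][d] += x[d] - codebook[nearest][d]
    flat = [v for row in residual_sums for v in row]
    # Power-law normalization: sign(v) * |v|^alpha dampens large (bursty) components.
    powered = [math.copysign(abs(v) ** alpha, v) for v in flat]
    norm = math.sqrt(sum(v * v for v in powered))
    return [v / norm for v in powered] if norm > 0 else powered
```

For example, `vlad_encode([[1, 0], [3, 0], [10, 12]], [[0, 0], [10, 10]])` yields a 4-dimensional unit vector in which the two descriptors near the first visual word contribute a single dampened component.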

AbstractGroup Feature Selection (GFS) has proven to be useful in improving the interpretability and prediction performance of learned model parameters in many machine learning and data mining applications. Existing GFS models were mainly based on square loss and logistic loss for regression and classification, leaving the $\epsilon$-insensitive loss and the hinge loss popularized by Support Vector Learning (SVL) machines still unexplored. In this paper, we present a Bayesian GFS framework for SVL machines based on the pseudo likelihood and data augmentation idea. With Bayesian inference, our method can circumvent the cross-validation for regularization parameters. Specifically, we apply the mean field variational method in an augmented space to derive the posterior distribution of model parameters and hyper-parameters for Bayesian estimation. Both regression and classification experiments conducted on synthetic and real-world data sets demonstrate that our proposed approach outperforms a number of competitors.

AbstractEntity resolution is a common data cleaning and data integration problem that involves determining which records in one or more data sets refer to the same real-world entities. It has numerous applications for commercial, academic and government organisations. For most practical entity resolution applications, training data does not exist, which limits the type of classification models that can be applied. This also prevents complex techniques such as Markov logic networks from being used on real-world problems. In this paper we apply an active learning based technique to generate training data for a Markov logic network based entity resolution model, and we learn the weights for the formulae in the Markov logic network. We evaluate our technique on real-world data sets and show that we can generate balanced training data and learn the weights for the formulae in the Markov logic network to support entity resolution.

AbstractThe accurate estimation of students’ grades in future courses is important as it can inform the selection of next term’s courses and create personalized degree pathways to facilitate successful and timely graduation. This paper presents future-course grade prediction methods based on sparse linear models and low-rank matrix factorizations that are specific to each course or student-course tuple. These methods identify the predictive subsets of prior courses on a course-by-course basis and better address problems associated with the not-missing-at-random nature of the student-course historical grade data. The methods were evaluated on a dataset obtained from a major US public university. This evaluation showed that the course-specific models outperformed various competing schemes, with the best performing scheme achieving an average RMSE across the different courses of 0.456 vs 0.675 for the best competing method.

Regular Presentation

AbstractThe efficiency of Public Transportation (PT) Networks is a major goal of any urban area authority. Advances in both location and communication devices have drastically increased the availability of the data generated by their operations. Adequate Machine Learning methods can thus be applied to identify patterns useful for improving the Schedule Plan. In this paper, the authors propose a fully automated learning framework to determine the best Schedule Coverage to be assigned to a given PT network based on Automatic Vehicle Location (AVL) and Automatic Passenger Counting (APC) data. We formulate this problem as a clustering one, where the best number of clusters is selected through an ad-hoc metric. This metric takes into account multiple domain constraints, computed using Sequence Mining and Probabilistic Reasoning. A case study from a large operator in Sweden was selected to validate our methodology. Experimental results suggest necessary changes to the Schedule Coverage. Moreover, an impact study was conducted through a large-scale simulation over the affected time period. Its results uncovered potential improvements of the schedule reliability on a large scale.

AbstractSparse representation has been a powerful technique for modeling image data and thus image clustering. Sparse coding, as an unsupervised learning technique to extract sparse representations, can transform the original data into a low dimensional space, and simultaneously learn a dictionary that represents high-level semantics. Though existing sparse coding schemes consider local manifold information of the data with graph or hypergraph regularization, more information from the manifold structure should be exploited to utilize intrinsic manifold characteristics in the data. In this paper, we first propose a hypergraph incidence consistency regularization term (HIC), which minimizes the reconstruction error of the hypergraph incidence matrix with sparse codes, to further leverage the hypergraph model and regulate the learned sparse codes. Moreover, a multi-hypergraph learning framework to automatically select the optimal manifold structure is integrated into the objective of sparse coding learning, resulting in multi-hypergraph consistent sparse coding (MultiCSC). We show that the MultiCSC objective function can be optimized efficiently, and that several existing sparse coding methods are special cases of MultiCSC. Extensive experimental results on image clustering demonstrate the effectiveness of our proposed method.

AbstractWe study the contextual bandit problem with linear payoff function. In the traditional contextual bandit problem, the algorithm iteratively chooses an action based on the observed context, and receives a reward instantly for the chosen action. Motivated by a practical need in many applications, we study the design of algorithms under the piled-reward setting, where the rewards are received as a pile instead of instantly. We present how the Linear Upper Confidence Bound (LinUCB) algorithm for the traditional problem can be naively applied under the piled-reward setting, and prove its regret bound. Then, we extend LinUCB to a novel algorithm, called Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR), which digests the observed contexts to choose actions more strategically before the piled rewards are received. We prove that LinUCBPR can match LinUCB in its regret bound under the piled-reward setting. Experiments on the artificial and real-world datasets demonstrate the strong performance of LinUCBPR in practice.
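
The naive application of LinUCB under the piled-reward setting described in this abstract can be sketched as below. This is a minimal illustration under assumptions, not the authors' implementation: it uses one ridge-regularized linear model per action, keeps the inverse matrix up to date with the Sherman-Morrison formula, and leaves the model unchanged until a whole pile of rewards arrives. The class and method names are hypothetical, and the pseudo-reward mechanism of LinUCBPR is not shown because the abstract does not specify it.

```python
import math
from typing import List, Sequence, Tuple


class PiledLinUCB:
    """LinUCB whose model is updated only when a pile of rewards is received."""

    def __init__(self, n_actions: int, dim: int, alpha: float = 1.0) -> None:
        self.alpha = alpha
        # Per action: inverse of A = I + sum(x x^T), and b = sum(reward * x).
        self.a_inv = [[[float(i == j) for j in range(dim)] for i in range(dim)]
                      for _ in range(n_actions)]
        self.b = [[0.0] * dim for _ in range(n_actions)]

    @staticmethod
    def _matvec(m: List[List[float]], v: Sequence[float]) -> List[float]:
        return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

    def choose(self, x: Sequence[float]) -> int:
        """Pick the action with the highest upper confidence bound for context x."""
        best, best_score = 0, -math.inf
        for a, (a_inv, b) in enumerate(zip(self.a_inv, self.b)):
            theta = self._matvec(a_inv, b)          # ridge estimate of the weights
            a_inv_x = self._matvec(a_inv, x)
            mean = sum(t * xi for t, xi in zip(theta, x))
            width = math.sqrt(sum(xi * v for xi, v in zip(x, a_inv_x)))
            score = mean + self.alpha * width
            if score > best_score:
                best, best_score = a, score
        return best

    def observe_pile(self, pile: Sequence[Tuple[Sequence[float], int, float]]) -> None:
        """Update the models with a pile of (context, action, reward) triples."""
        for x, a, r in pile:
            a_inv = self.a_inv[a]
            u = self._matvec(a_inv, x)              # u = A^{-1} x
            denom = 1.0 + sum(xi * ui for xi, ui in zip(x, u))
            # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - u u^T / (1 + x^T u).
            for i in range(len(x)):
                for j in range(len(x)):
                    a_inv[i][j] -= u[i] * u[j] / denom
            for i in range(len(x)):
                self.b[a][i] += r * x[i]
```

In a round, `choose` would be called once per arriving context and the chosen (context, action) pairs stored; `observe_pile` is then called once with the rewards for the whole round, so every decision inside a round uses the same, possibly stale, model.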

AbstractTo address the difficulties of “data noise sensitivity” and “cluster center variance” in mainstream clustering algorithms, we propose a novel robust approach for identifying cluster centers unambiguously from data contaminated with noise; it incorporates the strengths of homophilic degrees and the graph kernel. We discover that in-degrees are self-organized into homophilic layers if ordered by the associated out-degrees, and that low-density noise is aggregated to form the weakest layer, thus making it easy to find the boundary between clusters and noise. Having separated clusters from noise, we apply the diffusion kernel to the graph formed by the clusters to obtain a graph kernel matrix, which is treated as the measurement of global similarities. Based on the belief that cluster centers have the highest intra-cluster densities and the lowest inter-cluster similarities, the proposed approach identifies clusters precisely according to local data densities and the global similarity matrix. Experiments on various synthetic and real-world databases verify the superiority of our algorithm in comparison with state-of-the-art algorithms.

AbstractThis paper presents a novel Indoor Positioning System (IPS) for objects of daily life equipped with passive RFID tags. The goal is to provide a simple to use, yet accurate, qualitative IPS that could be exploited inside housing enhanced with technology (sensors, effectors, etc.). With such a service, the housing, called a smart home, could enable a wide range of services by better understanding the context and the current progression of activities of daily living. The paper shows that classical data mining techniques can be applied to raw data from RFID readers and passive tags. In particular, it explains how we built several datasets using a tagged object in a real smart home infrastructure. With these datasets, the accuracy of many kinds of decision trees was evaluated, as well as that of a Bayesian network, a multilayer perceptron and the nearest neighbors technique. Our method proved very effective, as most algorithms achieved high accuracy for the majority of the smart home.

AbstractSuppose you are a teacher, and have to convey a set of object-property pairs (‘lions eat meat’; or ‘aspirin is a blood-thinner’). A good teacher will convey a lot of information, with little effort on the student side. Specifically, given a list of objects (like animals or medical drugs) and their associated properties, what is the best and most intuitive way to convey this information to the student, without the student being overwhelmed? A related, harder problem is: how can we assign a numerical score to each lesson plan (i.e. way of conveying information)? Here, we give a formal definition of this problem of forming learning units and we provide a metric for comparing different approaches based on information theory. We also design an all-engulfing algorithm, GROUPNTEACH, for this problem. Our proposed GROUPNTEACH is scalable (near-linear in the dataset size); it is effective, achieving excellent results on real data, both with respect to our proposed metric and with respect to encoding length; and it is intuitive, conforming to well-known educational principles, such as grouping related concepts, and “comparing” and “contrasting”. Experiments on real and synthetic datasets demonstrate the effectiveness of GROUPNTEACH.

AbstractPrivacy-preserving data mining aims to keep data safe, yet useful. But algorithms providing strong guarantees often end up with low utility. We propose a novel privacy-preserving framework that prevents an adversary from inferring an unknown data point by ensuring that the estimation error is almost invariant to the inclusion/exclusion of the data point. By focusing directly on the estimation error of the data point, our framework is able to significantly lower the perturbation required. We use this framework to propose a new privacy-aware K-means clustering algorithm. Using both synthetic and real datasets, we demonstrate that the utility of this algorithm is almost equal to that of the unperturbed K-means and, at strict privacy levels, almost twice as good as that of the differential privacy counterpart.

AbstractPersistent efforts are being made to propose more accurate decision forest building techniques. In this paper, we propose a new decision forest building technique called Forest by Continuously Excluding Root Node (Forest CERN). The key feature of the proposed technique is to exclude attributes that participated in the root nodes of previous trees by imposing penalties on them to prevent them from appearing in some subsequent trees. Penalties are gradually lifted in such a manner that those attributes can reappear after a while. Other than that, our technique uses bootstrap samples to generate a predefined number of trees. The target of the proposed algorithm is to maximize tree diversity without impeding individual tree accuracy. We present elaborate experimental results involving fifteen widely used data sets from the UCI Machine Learning Repository. The experimental results indicate the effectiveness of the proposed technique in most cases.

AbstractSeveral schemes have been proposed to support $k$-nearest neighbors ($k$-NN) queries on encrypted cloud data. However, existing approaches either assume that query users are fully trusted, or require the data owner to be online all the time. Under the fully-trusted assumption, query users can access the key used to encrypt/decrypt outsourced data; thus, the cloud server can completely break the data upon obtaining the key from any untrustworthy query user. The online requirement imposes a high cost on the data owner. This paper presents a new scheme to support $k$-NN queries on an encrypted cloud database while preserving the privacy of the database and the query points. Our proposed approach only discloses limited information about the key to query users, and does not require an online data owner. Theoretical analysis and extensive experiments confirm the security and efficiency of our scheme.

AbstractClassic data analysis techniques generally assume that variables have single values only. However, data complexity in the age of big data has gone beyond the classic framework, such that variable values may instead take the form of a set of stochastic measurements. We refer to this case as stochastic pattern-based symbolic data, where each measurement set is an instance of an underlying stochastic pattern. In such a case, no existing classic data analysis approach, such as the crystal item or fuzzy region ones, can yet be applied. For this reason, we put forward a novel incremental hierarchical clustering algorithm for stochastic pattern discovery and symbolic data analysis (IHCPD). IHCPD is robust to pattern overlapping and missing values and is well adapted for incremental learning. Extensive experiments on both synthetic data and real-life emitter parameter data have validated its efficiency and effectiveness.

AbstractTeam formation, which aims to form a team to complete a given task by covering its required skills, furnishes a natural way to help organizers complete projects effectively. In this work, we propose a new team hiring problem. Given a set of projects P with required skills, and a pool of experts X, each of whom has their own skill set, compensation demand and participation constraint (i.e., the maximum number of projects the expert can participate in simultaneously), we seek to hire a team of participation-constrained experts T ⊆ X to complete all the projects so that the overall compensation is minimized. We refer to this as the Participation Constrained Team Hire problem. To the best of our knowledge, this is the first work to investigate the problem. Since the problem is proven to be NP-hard, we design three novel efficient approximation algorithms as its solution, each of which focuses on a particular perspective of the problem. We perform extensive experimental studies, on both synthetic and real datasets, to evaluate the performance of our algorithms. Experimental results show that our algorithms work well in practice.

AbstractIn many application domains, organizations require information from multiple sources to be integrated. Due to privacy and confidentiality concerns, these organizations are often not willing or allowed to share their sensitive and personal data with other data owners. This has led to the emerging research discipline of privacy-preserving record linkage (PPRL), which aims to identify records in multiple databases that refer to the same entity without revealing any private information about these entities. Linkage of multiple databases becomes significantly more challenging as database sizes and the number of parties increase, because the number of potential comparisons grows exponentially. We propose a blocking approach for multi-party PPRL to efficiently and effectively prune the record sets that are unlikely to match. Our approach allows each party to perform the blocking independently, except for the initial agreement on parameter settings, followed by a central hashing-based clustering step. We provide a theoretical analysis of our technique in terms of complexity, quality and privacy, and we conduct an empirical study with large datasets. The results show that our approach is scalable with the size of the datasets and the number of parties, while providing better privacy and flexibility than previous multi-party private blocking approaches.

AbstractThe complexity of optimizations in semi-supervised dimensionality reduction methods has limited their usage. In this paper, an unsupervised and semi-supervised nonlinear dimensionality reduction method that aims at lower space complexity is proposed. First, a positive and negative competitive learning strategy is introduced to the single-layered Self-Organizing Incremental Neural Network (SOINN) to process partially labeled datasets. Then, we formulate the dimensionality reduction of SOINN weight vectors as a quadratic programming problem with graph similarities calculated in the previous step as constraints. Finally, an approximation of distances between newly arrived samples and the SOINN weight vectors is proposed to complete the dimensionality reduction task. Experiments are carried out on two artificial datasets and the NSL-KDD dataset in comparison with Isomap, Transductive Support Vector Machine, etc. The results show that the proposed method is effective in dimensionality reduction and is an efficient alternative transductive learner.

AbstractTransfer learning is a process to transfer knowledge learned in one or more source tasks to a related but more complex, unseen target task, in an effort to facilitate learning in the target task. Genetic programming (GP) is an evolutionary approach to generating computer programs for solving a given problem automatically. Transfer learning in GP has been investigated to solve complex Boolean and symbolic regression problems, but has not been used in image classification. In this paper, we propose a novel approach to use, for the first time, transfer learning in GP for image classification problems. Specifically, the proposed approach extends an existing state-of-the-art GP method by incorporating the ability to extract useful knowledge from simpler problems of a domain and reuse the extracted knowledge to solve complex problems of the domain. The proposed system has been compared with the baseline system (i.e. GP without using transfer learning) on multiclass texture classification problems from three widely-used texture datasets with different rotations and different levels of noise. The experimental results showed that the ability to reuse the extracted knowledge in the proposed GP method helps achieve better classification accuracy than the baseline GP method.

AbstractSocial media has become a valuable source of real-time information. The Transport Management Centre (TMC) of the state government of New South Wales, Australia, has been collaborating with us to develop TrafficWatch, a system that leverages Twitter as a channel for transport network monitoring and for incident and event management. This system utilises advanced web technologies and state-of-the-art machine learning algorithms. The crawled tweets are first filtered to show incidents in Australia, and then divided into different groups by online clustering and classification algorithms. Findings from the use of TrafficWatch at TMC demonstrated that it has the potential to report incidents earlier than other data sources, as well as to identify unreported incidents. TrafficWatch also shows its advantages in improving TMC’s network monitoring capabilities to assess the network impacts of incidents and events.

AbstractThe amount of data generated by systems is growing quickly because of the appearance of mobile devices, wearable devices, and the Internet of Things (IoT), to name a few. As a result, personalized recommendations by recommender systems are becoming more important for consumers inundated with a vast number of choices. Many different types of data are generated implicitly (for example, purchase history, browsing activity, and booking history), and less intrusive recommendation systems can be built upon implicit feedback. There are previous efforts to build a recommender system with implicit feedback by estimating the latent factors or learning the personalized ranking, but these approaches do not take full advantage of the various types of information that can be derived from implicit feedback, such as implicit profiles or the popularity of items. In this paper, we propose a hybrid recommender system which exploits implicit feedback, and we demonstrate its better performance, based on the expected percentile ranking and a precision-recall curve, against two state-of-the-art recommender systems, Bayesian Personalized Ranking (BPR) and Implicit Matrix Factorization methods, using hotel reservation data.

AbstractTraditional cross-view machine learning and information retrieval mainly rest on correlating two sets of features in different views. However, features in different views usually have different physical interpretations. It may be inappropriate to map multiple views of data onto a shared feature space and directly compare them. In this paper, we propose a simple yet effective Cross-View Feature Hashing (CVFH) algorithm via a “partition and match” approach. The feature space for each view is bi-partitioned multiple times using B hash functions, and the resulting binary codes for all the views can thus be represented in a compatible B-bit Hamming space. To ensure that the hashed feature space is effective for supporting generic machine learning and information retrieval functionalities, the hash functions are learned to satisfy two criteria: 1) the neighbors in the original feature spaces should also be close in the Hamming space; and 2) the binary codes for multiple views of the same sample should be similar in the shared Hamming space. We apply CVFH to cross-view image retrieval. The experimental results show that CVFH can outperform the Canonical Component Analysis (CCA) based cross-view method.
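
The “partition and match” idea can be illustrated with a toy example in which each view has its own B hash functions, yet both views land in the same B-bit Hamming space. This is only a sketch under assumptions: each hash function is modelled as a hyperplane that bi-partitions the view's feature space, and the hyperplanes below are hand-picked hypothetical values rather than learned as in the paper.

```python
from typing import List, Sequence, Tuple

Hyperplane = Tuple[Sequence[float], float]  # (weights, bias), hypothetical form


def hash_code(features: Sequence[float], hyperplanes: Sequence[Hyperplane]) -> List[int]:
    """One bit per hash function: the side of the hyperplane the point falls on."""
    return [
        1 if sum(w_i * f_i for w_i, f_i in zip(w, features)) + bias >= 0 else 0
        for w, bias in hyperplanes
    ]


def hamming(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Number of bit positions in which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# B = 3 hash functions per view; the views have different dimensionalities.
image_planes = [([1, 0, 0], -0.5), ([0, 1, 0], -0.5), ([0, 0, 1], -0.5)]
text_planes = [([1, 0], -0.5), ([0, 1], -0.5), ([1, -1], -0.5)]

image_code = hash_code([0.9, 0.1, 0.8], image_planes)   # image view of a sample
matching_text = hash_code([0.9, 0.2], text_planes)      # text view of the same sample
other_text = hash_code([0.1, 0.9], text_planes)         # text view of another sample
```

Here the 3-dimensional image features and the 2-dimensional text features are never compared directly; only their 3-bit codes are, so the matching pair is at Hamming distance 0 while the unrelated text is at distance 3.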

AbstractBecause deviations exist between GPS traces obtained by real-time positioning systems and actual paths, real-time map matching, which identifies the correct traveling road segment, is becoming increasingly important. In order to effectively improve map matching accuracy, most state-of-the-art real-time map matching algorithms use machine learning, which calls for time-consuming human labeling in advance. We propose an accurate real-time map matching method using online learning called OLMM. It takes into account a small piece of trajectory data and its matching result to support the subsequent matching process. We evaluate the effectiveness of the proposed approach using ground truth data. The results demonstrate that our approach can obtain more accurate matching results than existing methods without any human labeling beforehand.

AbstractWe propose a new method for building a classifier ensemble, based on subgroup discovery techniques in data mining. We apply subgroup discovery techniques to a labeled training dataset to discover interesting subsets, each characterized by a conjunctive logical expression (rule), where each such subset has an unusually high dominance of one class. Treating these rules as base classifiers, we propose several simple ensemble methods to construct a single classifier. Another novel aspect of the paper is that it applies these ensemble methods, along with standard anomaly detection and classification, to automatically identify high potential (HIPO) employees, an important problem in management. HIPO employees are critical for future-proofing the organization in the face of attrition, economic uncertainties and business challenges. Current HR processes for HIPO identification are manual and suffer from subjectivity, bias and disagreements. The proposed data-driven analytics algorithms address some of these issues. We show that the new ensemble methods perform better than other methods, including other ensemble methods, on a real-life case-study dataset from a large multinational IT services company.

AbstractAs obesity has become a worldwide problem, a number of health programs have been designed to encourage participants to maintain a healthier lifestyle. The program stakeholders often want to know how effective the programs are and how to target the right participants. Motivated by a real-life health program conducted by an Australian supermarket chain, we propose a novel method to track customer behavior changes induced by the program and investigate the program’s effect on different segments of customers, split according to demographic factors such as age and gender. The method: (1) derives customer preferences from the transaction data, (2) captures the customer behavior changes via a temporal model, (3) analyzes the program effectiveness on different customer segments, and (4) evaluates the program influence using a one-year data set obtained from a major Australian supermarket. Our results indicate that while the program overall had a positive effect in encouraging customers to buy healthy food, its impact varied across customer segments. These results can inform the design of personalized health programs that target specific customers in the future and benefit more people. Our method can also be applied to other programs that use transaction data and customer profiles.

AbstractBoth uncertain data and high-dimensional data pose huge challenges to traditional clustering algorithms. Clustering high-dimensional uncertain data is even more challenging, and there are few such algorithms. In this paper, based on the classical FINDIT subspace clustering algorithm for high-dimensional data, we propose UFINDIT, a constraint-based semi-supervised subspace clustering algorithm for high-dimensional uncertain data. We extend both the distance functions and the dimension voting rules of FINDIT to deal with high-dimensional uncertain data. Since the soundness criterion of FINDIT fails for uncertain data, we introduce constraints to solve the problem. We also use the constraints to improve FINDIT by eliminating the effect of parameters on the process of merging medoids. Furthermore, we propose some methods, such as sampling, to obtain a more efficient algorithm. Experimental results on synthetic and real data sets show that our proposed UFINDIT algorithm outperforms the existing subspace clustering algorithm for uncertain data.

AbstractConsumer credit scoring and credit risk management have been core research problems in the financial industry for decades. Recently, applying online social data to personal credit scoring has grown in popularity. In this paper, we aim to infer this particular user attribute, credit, i.e., to predict whether a user is of the good credit class or not, from her social data. However, the existing credit scoring methods, which rely mainly on financial data, face severe challenges in tackling heterogeneous social data. Moreover, social data contains only extremely weak signals about users’ credit, since users generally do not generate sensitive credit-related content on social media. To that end, we put forward a Latent User Behavior Dimension based Credit Model (LUBD-CM) to capture these small signals for personal credit profiling. LUBD-CM learns users’ hidden behavior habits and topic distributions simultaneously, and represents each user at a much finer granularity. Specifically, we take a real-world Sina Weibo dataset as the testbed for social-data-based personal credit profiling evaluation. Extensive experiments conducted on the dataset demonstrate the effectiveness of our approach: 1) the user credit label can be predicted using LUBD-CM with a considerable performance improvement over several state-of-the-art baselines; 2) the latent behavior dimensions have very good interpretability in personal credit profiling.

AbstractPredicting event occurrence at an early stage in longitudinal studies is an important problem with high practical value. As opposed to standard classification and regression problems, where a domain expert can provide the labels for the data in a reasonably short period of time, training data in such longitudinal studies can be obtained only by waiting for a sufficient number of events to occur. The main objective of this work is to predict for which subjects in the study an event will occur in the future, based on the few events observed at the initial stages of a longitudinal study. In this paper, we propose a novel Early Stage Prediction (ESP) framework for building event prediction models that are trained at early stages of longitudinal studies. More specifically, we extend the Naive Bayes and Tree-Augmented Naive Bayes (TAN) methods based on the proposed framework, and develop two algorithms, namely ESP-NB and ESP-TAN, to effectively predict event occurrence using the training data obtained at an early stage of the study. The proposed framework is evaluated using a wide range of synthetic and real-world benchmark datasets. Our extensive set of experiments shows that the proposed ESP framework is able to predict future event occurrences more accurately than the alternative methods while using only a limited amount of training data.

AbstractBayesian optimisation is an efficient technique to optimise functions that are expensive to compute. In this paper, we propose a novel framework to transfer knowledge from a completed source optimisation task to a new target task in order to overcome the cold start problem. We model source data as noisy observations of the target function. The level of noise is computed from the data in a Bayesian setting. This enables flexible knowledge transfer across tasks with differing relatedness, addressing a limitation of the existing methods. We evaluate on the task of tuning hyperparameters of two machine learning algorithms. Treating a fraction of the whole training data as the source task and the whole as the target task, we show that our method finds the best hyperparameters in the least amount of time compared to both the state-of-the-art and the no-transfer methods.

AbstractThe Adaptive Multiple-hyperplane Machine (AMM) was recently proposed to deal with large-scale datasets. However, it has no principle to tune the complexity and sparsity levels of the solution. Addressing the sparsity is important to improve learning generalization, prediction accuracy and computational speedup. In this paper, we employ the max-margin principle and sparse approach to propose a new Sparse AMM (SAMM). We solve the new optimization objective function with stochastic gradient descent (SGD). Besides inheriting the good features of SGD-based learning method and the original AMM, our proposed Sparse AMM provides machinery and flexibility to tune the complexity and sparsity of the solution, making it possible to avoid overfitting and underfitting. We validate our approach on several large benchmark datasets. We show that with the ability to control sparsity, the proposed Sparse AMM yields superior classification accuracy to the original AMM while simultaneously achieving computational speedup.

AbstractWith the exponential growth of web documents and the limited bandwidth of mobile devices, it becomes more and more difficult for users to find the information they are looking for in the vast amount of information available. Query-focused summarization has received more attention from both research and engineering in recent years. However, existing query-focused summarization methods do not consider the conceptual relations and the importance of the concepts that make up the sentences, where a concept is the title of a Wikipedia article and can express an entity or action. In this article, we propose a novel method called Query-focused Multi-document Summarization based on Concept Importance (QMSCI). We first map sentences to concepts and obtain ranked, weighted concepts by reinforcement between the concepts of the sentences and the concepts of the query in a bipartite graph. We then use the ranked, weighted concepts to help rank the sentences in a hyper-graph model: sentences that contain important concepts, are related to the query, and are central among sentences are ranked higher and make up the summary. We experiment on the DUC datasets, and the experimental results demonstrate the effectiveness of our proposed method compared to the state-of-the-art methods.

AbstractThe goal of ensemble regression is to combine a set of regressors in order to improve the predictive accuracy. The key to a successful ensemble regression is to generate complementary base models and to combine their outputs carefully. Traditionally, the weighted average of the outputs is treated as the final prediction. This means each base model plays a constant role over the whole data space. In fact, we know the predictive accuracy of each base model varies across different regions of the data space. In this paper, we develop a locality-based dynamic weighted ensemble method called Locally Weighted Ensemble. The weight of each base model varies with the sample, which is realized by introducing a soft-max function into the objective function. In addition, regularization is included to make the objective function well-posed. The proposed method is evaluated on several UCI datasets. Compared with single models and other ensemble models, our proposed method achieves better performance. From the experiments, we also find that Locally Weighted Ensemble converges quickly.
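The idea of sample-dependent soft-max weights can be illustrated with a minimal sketch. This is not the authors' objective function or training procedure: the linear gating scores, the names `gate_params` and `base_models`, and the toy regressors are all hypothetical, chosen only to show how a weight per base model can vary with the input.

```python
import math

def softmax(scores):
    """Numerically stable soft-max over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_weights(x, gate_params):
    """One weight per base model, computed from the sample x.

    gate_params[k] is a hypothetical (coefficients, bias) pair for model k.
    """
    scores = [sum(c * xi for c, xi in zip(coef, x)) + bias
              for coef, bias in gate_params]
    return softmax(scores)

def ensemble_predict(x, base_models, gate_params):
    """Combine base-model outputs using sample-dependent weights."""
    weights = local_weights(x, gate_params)
    return sum(w * f(x) for w, f in zip(weights, base_models))

# Two toy base regressors (hypothetical).
base_models = [lambda x: 2.0 * x[0], lambda x: x[0] + 1.0]
# The gate favours model 0 for large x[0] and model 1 for small x[0].
gate_params = [([1.0], 0.0), ([-1.0], 0.0)]

print(local_weights([0.0], gate_params))
print(ensemble_predict([10.0], base_models, gate_params))
```

In a traditional weighted average the weights would be constants; here they are recomputed for every sample, so each base model can dominate in the region where it is most accurate.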

AbstractThis paper presents a novel framework to detect shot boundaries based on the One-Class Support Vector Machines (OCSVM). Instead of comparing the difference between pair-wise consecutive frames at a specific time, we measure the distance between two OCSVM classifiers, which are learnt from two contextual sets, i.e., immediate past set and immediate future set. To speed up the processing procedure, the two OCSVM classifiers are updated in an online fashion by our proposed multi-instance incremental and decremental one-class support vector machine algorithm. Our approach, which inherits the advantages of OCSVM, is robust to noises such as abrupt illumination changes and large object or camera movements, and capable of detecting gradual transitions as well. Experimental results on some benchmark datasets compare favorably with the state-of-the-art methods.

AbstractIn this paper, we present R-OpenIE, a rule-based open information extraction method using a cascaded finite-state transducer. R-OpenIE defines declarative contextual constraint rules to generate relation extraction templates, which frees it from the influence of syntactic parser errors, and it uses a cascaded finite-state transducer model to match the relational tuples that satisfy the templates. Notably, R-OpenIE creates an inverted index for each matched state during the matching process of the cascaded finite-state transducer, which improves the efficiency of pattern matching. The experimental results show that R-OpenIE achieves good adaptability and efficiency for open information extraction.

AbstractReal-world online learning applications often face data coming from changing target functions or distributions. Such changes, called concept drift, degrade the performance of traditional online learning algorithms. Thus, many existing works focus on detecting concept drift based on statistical evidence. Other works use a sliding window or similar mechanisms to select the data that closely reflect the current concept. Nevertheless, few works study how the detection and selection techniques can be combined to improve the learning performance. We propose a novel framework on top of existing online learning algorithms to improve the learning performance under concept drift. The framework detects possible concept drift by checking whether forgetting some older data may be helpful, and then conducts forgetting through a step called unlearning. The framework effectively results in a dynamic sliding window that selects data flexibly for different kinds of concept drift. We design concrete approaches from the framework based on three popular online learning algorithms. Empirical results show that the framework consistently improves those algorithms on ten synthetic data sets and two real-world data sets.

AbstractThe overwhelming number of scientific articles published over the years calls for smart automatic tools to facilitate the process of literature review. Here, we propose for the first time a framework of faceted recommendation for scientific articles (abbreviated as FeRoSA) which, apart from ensuring quality retrieval of scientific articles for a query paper, also efficiently arranges the recommended papers into different facets (categories). Providing users with an interface that enables the filtering of recommendations across multiple facets can increase users’ control over how the recommendation system behaves. FeRoSA is built on a random-walk-based framework over an induced subnetwork consisting of nodes related to the query paper in terms of either citations or content similarity. Rigorous analysis based on experts’ judgment shows that FeRoSA outperforms two baseline systems in terms of faceted recommendations (overall precision of 0.65). Further, we show that the faceted results of FeRoSA can be appropriately combined to design a better flat recommendation system as well. An experimental version of FeRoSA is publicly available at www.ferosa.org (receiving as many as 170 hits within the first 15 days of launch).

AbstractThe selection of metafeatures for metalearning (MtL) is often an ad hoc process. The lack of a proper motivation for the choice of a metafeature rather than others is questionable and may originate a loss of valuable information for a given problem (e.g., use of class entropy and not attribute entropy). We present a framework to systematically generate metafeatures in the context of MtL. This framework decomposes a metafeature into three components: meta-function, object and post-processing. The automatic generation of metafeatures is triggered by the selection of a meta-function used to systematically generate metafeatures from all possible combinations of object and post-processing alternatives. We executed experiments by addressing the problem of algorithm selection in classification datasets. Results show that the sets of systematic metafeatures generated from our framework are more informative than the non-systematic ones and the set regarded as state-of-the-art.

AbstractDrug-target interactions map patterns, associations and relationships between drugs and target proteins. Identifying interactions between drugs and targets is critical in drug discovery, but biochemically validating these interactions is both laborious and expensive. In this paper, we propose a novel interaction-profile-based method to predict potential drug-target interactions by using matrix completion. Our method arranges the drug-target interactions in a matrix, whose entries include interaction pairs, non-interaction pairs and undetermined pairs, and finds an approximation matrix that contains the predicted values at the undetermined positions. Our method learns the approximation matrix by narrowing the distance between the drug-target interaction matrix and its approximation, subject to the constraint that the values at the observed positions equal the known interactions at the corresponding positions. As a consequence, our method can directly predict the interactions according to the high values at the undetermined positions. We evaluated our method by comparing it against four counterpart methods on “gold standard” datasets. Our method significantly outperforms the counterparts, and achieves high AUC and F1-score. On average, the AUC values of our method are 0.9708, 0.9778, 0.9123, 0.6640 and 0.9659, and the F1-scores are 0.8437, 0.8975, 0.7083, 0.5204 and 0.8281, on the Enzyme, Ion channel, GPCR, Nuclear receptor and integrated datasets, respectively.

AbstractIn the area of pattern discovery, there is much interest in discovering small sets of patterns that characterize the data well. In such scenarios, when data is represented by a small set of characterizing patterns, an interesting problem is the comparison of datasets, by comparing the respective representative sets of patterns. In this paper, we address this problem. We propose a novel kernel function for measuring similarities between two sets of patterns. We define the kernel for injective serial episodes and itemsets. We also present an efficient algorithm for computing this kernel. We demonstrate the effectiveness of our kernel on sequential datasets and transaction databases.

AbstractHashing is an effective method of approximate nearest neighbor search (ANN) for the massive web images. In this paper, we propose a method that combines CNN with hash learning, where the features learned by the former are beneficial to the latter. By introducing a new loss layer and a new hash layer, the proposed method can learn the hash functions that preserve the semantic information as well as satisfy the desirable independent properties of hashing. Experiments show that our method outperforms the state-of-the-art methods by a large margin on image retrieval. And the comparisons with baseline models show the effectiveness of our proposed layers.

AbstractMatrix factorization is one of the most popular techniques for prediction problems in the fields of intelligent systems and data mining. It has shown its effectiveness in many real-world applications such as recommender systems. As a collaborative filtering method, it gives users recommendations based on their previous preferences (or ratings). Due to the extreme sparseness of the ratings matrix, active learning is used to elicit ratings from a user in order to produce better recommendations. In this paper, we propose a new matrix factorization model called ESVD, which combines the classic matrix factorization method with a specific rating elicitation strategy. We evaluate the proposed ESVD method on the Movielens data set, and the experimental results suggest its effectiveness in terms of accuracy and efficiency when compared with traditional matrix factorization methods and active learning methods.

AbstractConformal classifiers output confidence prediction regions: multi-valued predictions that are guaranteed to contain the true output value of each test pattern with some predefined probability. In order to fully utilize the predictions provided by a conformal classifier, it is essential that those predictions are reliable, i.e., that a user is able to assess the quality of the predictions made. Although conformal classifiers are statistically valid by default, the error probability of the prediction regions they output depends on the size of those regions, in such a way that smaller (and thus potentially more interesting) predictions are more likely to be incorrect. This paper proposes and evaluates a method for producing refined error probability estimates of prediction regions that takes their size into account. The end result is a binary conformal confidence predictor that is able to provide accurate error probability estimates for those prediction regions containing only a single class label.
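A minimal sketch of how a prediction region can be formed may help here. It assumes the inductive (split) variant of conformal prediction and made-up nonconformity scores; the abstract does not say which variant or score the paper uses, and the refined error estimates it proposes are not reproduced. The names `p_value`, `prediction_region`, `calibration` and `candidate_scores` are hypothetical.

```python
def p_value(calibration_scores, test_score):
    """Share of nonconformity scores at least as large as the test score.

    The +1 terms count the test example itself, as is standard in
    inductive conformal prediction.
    """
    n_ge = sum(1 for s in calibration_scores if s >= test_score)
    return (n_ge + 1) / (len(calibration_scores) + 1)

def prediction_region(calibration_scores, test_scores_by_label, significance):
    """Return every label whose p-value exceeds the significance level."""
    return {label for label, score in test_scores_by_label.items()
            if p_value(calibration_scores, score) > significance}

# Hypothetical nonconformity scores from a held-out calibration set.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
# Hypothetical scores of one test pattern under each tentative label.
candidate_scores = {"pos": 0.25, "neg": 0.95}

print(prediction_region(calibration, candidate_scores, significance=0.2))
print(prediction_region(calibration, candidate_scores, significance=0.05))
```

Lowering the significance level enlarges the region: at 0.2 only one label survives, while at 0.05 both do. The single-label regions are the ones whose error probability the abstract argues needs a refined estimate.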

AbstractAlthough entity resolution (ER) is known to be an important problem with widespread applications in many areas, including e-commerce, health care, the social sciences, and crime and fraud detection, one aspect that has been largely neglected is monitoring the quality of entity resolution and repairing erroneous matching decisions over time. In this paper we develop an efficient method for incrementally repairing ER whenever possible. Our method is based on an efficient clustering algorithm that eliminates inconsistencies among matching decisions, and an efficient provenance indexing data structure that allows us to systematically trace the evidence of clustering to support ER repair. We have evaluated our method over real-world databases, and our experimental results show that the quality of entity resolution can be significantly improved over time.

AbstractDiscord is the most unusual subsequence of a time series. Sequential discovery of discord is time consuming. As the scale of datasets increases unceasingly, datasets have to be kept on hard disk, which degrades the utilization of computing resources. Furthermore, the results of discord discovery are non-combinable, which makes discord discovery hard to parallelize. In this paper, we propose Parallel Discord Discovery (PDD). PDD accelerates discord discovery with multiple computing nodes. PDD stores large-scale time series in the distributed memory of multiple computing nodes to improve the utilization of computing resources. Computing nodes communicate with each other efficiently to ensure the correctness of the results. Experiments show that, given 10 computing nodes, PDD is seven times faster than the classic discovery method HOTSAX. PDD handles larger datasets than HOTSAX does. PDD achieves over 90% utilization of computing resources, nearly twice that of the disk-aware method.
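To make the notion of a discord concrete, here is a brute-force sequential sketch, not PDD or HOTSAX. It assumes the usual definition from the discord literature, which the abstract does not spell out: the discord is the subsequence whose nearest non-overlapping neighbour is farthest away. For simplicity it uses raw Euclidean distance without normalization; `find_discord` and the toy series are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_discord(series, m):
    """Return (start_index, distance) of the length-m subsequence whose
    nearest non-overlapping neighbour is farthest away."""
    n_sub = len(series) - m + 1
    best_idx, best_dist = -1, -1.0
    for i in range(n_sub):
        nearest = math.inf
        for j in range(n_sub):
            if abs(i - j) >= m:  # skip overlapping (trivial) matches
                d = euclidean(series[i:i + m], series[j:j + m])
                nearest = min(nearest, d)
        if nearest != math.inf and nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx, best_dist

# A periodic toy series with one anomalous value.
series = [0, 1, 0, 1, 0, 1, 0, 9, 0, 1, 0, 1, 0, 1]
idx, dist = find_discord(series, m=2)
print(idx, dist)
```

The nested loop compares every pair of subsequences, which is why sequential discovery becomes expensive on long series and why pruning or parallel approaches are attractive.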

AbstractDBSCAN is a classic density-based clustering technique, well known for discovering clusters of arbitrary shapes and handling noise. However, its density calculation is very time-consuming on high-dimensional data, which makes it inefficient in many areas, such as multi-document summarization and product recommendation. Therefore, how to efficiently calculate the density on high-dimensional data becomes a key issue for DBSCAN-based clustering techniques. In this paper, we propose a fast algorithm for DBSCAN-based clustering on high-dimensional data, named Dboost. In our algorithm, an adaptation of a ranked retrieval technique, named WAND#, is newly applied to speed up the density calculations without accuracy loss, and we further improve this acceleration by reducing the number of WAND# invocations. Experiments were conducted on the Netflix dataset and microblog corpora. The results showed that an acceleration of 70 times was achieved on the Netflix dataset, and more than 100 times can be expected on microblog data.

AbstractConsidering the present-day trend, people are more interested in receiving a compact answer to their questions than in acquiring a bunch of relevant web documents. This has driven the huge popularity of community question answering (cQA) services such as Yahoo! Answers, Baidu Zhidao, Quora and StackOverflow, which form rich repositories of knowledge in the form of questions and user-generated answers. In cQA archives, retrieving similar questions that have already been answered is of significant help to users. The main challenge in retrieving similar questions is the ‘lexico-syntactic’ gap between the current question and the historic questions. In this paper, we propose a novel approach called Deep Structured Topic Model (DSTM) to bridge the lexico-syntactic gap through deep-semantics-based ranking of the cQA question pairs that have a similar alignment of latent topics. We initially retrieve similar question pairs that lie close together in the latent topic vector space. These pairs are then re-ranked using a deep layered semantic model, which yields the most similar questions for a given query. Experiments on a large-scale real-life cQA dataset show that our approach outperforms the state-of-the-art translation-based and topic-based baseline approaches.

AbstractIn this paper, we consider a generalized longest common subsequence problem, in which a constraining sequence of length $s$ must be included as a substring and the other constraining sequence of length $t$ must be included as a subsequence of two main sequences and the length of the result must be maximal. For the two input sequences $X$ and $Y$ of lengths $n$ and $m$, and the given two constraining sequences of length $s$ and $t$, we present an $O(nmst)$ time dynamic programming algorithm for solving the new generalized longest common subsequence problem. The time complexity can be reduced further to cubic time in a more detailed analysis. The correctness of the new algorithm is proved.

AbstractData stream processing is an important function in many emerging applications such as network traffic analysis, web applications, and financial data analysis. Computing summaries of a data stream is challenging since streaming data cannot be permanently stored or accessed more than once. In this paper, we propose two counter-based hierarchical (CHS) ε-approximation algorithms to create hierarchical summaries of one-dimensional data. CHS maintains a data structure in which each entry contains the incoming data item and an associated counter to store its frequency. Since not every item in the streaming data can be stored, CHS maintains only frequent items (known as hierarchical heavy hitters) at various levels of the generalization hierarchy by exploiting the natural hierarchy of the data. The algorithm guarantees the accuracy of each count within an ε bound. Furthermore, using aperiodic (CHS-A) and periodic (CHS-P) compression strategies, the proposed technique offers improved space complexities of O(η/ε) and O(η/ε log(εN)), respectively. We provide theoretical proofs for both the space and time requirements of the CHS algorithm. We have also experimentally compared the proposed algorithm with the existing benchmark techniques. Experimental results show that the proposed algorithm requires fewer updates per element of data, and uses a moderate amount of bounded memory. Moreover, precision-recall analysis demonstrates that the CHS algorithm provides high-quality output compared to existing benchmark techniques. For the experimental validation, we have used both synthetic data derived from an open source generator, and real benchmark data sets from an international Internet Service Provider.

AbstractWhile sequential pattern mining (SPM) is an important application in uncertain databases, it is challenging in terms of efficiency and scalability. In this paper, we extend the Apriori-like SPM framework to the distributed computing platform Spark and employ a dynamic programming (DP) approach to mine probabilistic frequent sequential patterns. Directly applying the DP method to Spark is usually impractical because its memory-consuming nature may cause heavy Java garbage collection overhead in Spark. Therefore, we develop a memory-efficient distributed DP algorithm and extend the prefix-tree to save intermediate results efficiently. Extensive experimental results at various scales show that our method is orders of magnitude faster than straightforward approaches.

AbstractAutomatic hashtag segmentation is used when analysing Twitter data, to associate hashtag terms with those used in common language. The most common form of hashtag segmentation uses a dictionary with a probability distribution over the dictionary terms, constructed from sample texts specific to the given hashtag domain. The language used on Twitter differs from the common language found in published literature, most likely due to the tweet character limit; therefore, dictionaries constructed to perform hashtag segmentation should be derived from a random sample of tweets. We ask the question “How large should our sample of tweets be to obtain a given level of segmentation accuracy?” We found that the Jaccard similarity between the correct segmentation and the segmentation predicted by a unigram model follows a Zero-One inflated Beta distribution with four parameters. We also found that each of these four parameters is a function of the sample size (tweet count) used for dictionary construction, implying that we can compute the Jaccard similarity distribution once the tweet count of the dictionary is known. Having this model allows us to compute the number of tweets required for a given level of hashtag segmentation accuracy, and also allows us to compare other segmentation models to this known distribution.
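The two ingredients named above, a unigram dictionary segmenter and a Jaccard score, can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the dynamic-programming segmenter, treating each segmentation as a set of words for the Jaccard computation, and the toy dictionary `probs` are all hypothetical choices.

```python
import math

def segment(hashtag, unigram_probs):
    """Most probable split of `hashtag` into dictionary words (unigram model).

    Returns None when no full segmentation into dictionary words exists.
    """
    n = len(hashtag)
    # best[i] holds (log-probability, words) for the prefix hashtag[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            word = hashtag[start:end]
            if best[start] is None or word not in unigram_probs:
                continue
            score = best[start][0] + math.log(unigram_probs[word])
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[n][1] if best[n] is not None else None

def jaccard(a, b):
    """Jaccard similarity between two collections of words, taken as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical dictionary with term probabilities estimated from tweets.
probs = {"data": 0.4, "mining": 0.3, "dat": 0.05,
         "am": 0.1, "in": 0.1, "ing": 0.05}

predicted = segment("datamining", probs)
print(predicted, jaccard(predicted, ["data", "mining"]))
```

A larger tweet sample changes the dictionary and its probabilities, which in turn shifts the distribution of the Jaccard scores; that dependence is what the abstract models.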

AbstractLogistic Regression (LR) has been the workhorse of the statistics community and is one of the state-of-the-art machine learning classifiers. It is based on a linear combination of input parameters and is trained by optimizing the Conditional Log-Likelihood (CLL) of the data. Recently, it has been shown that one can effectively precondition LR with the WANBIA-C trick, which speeds up LR many-fold. One can, however, train LR by optimizing the mean-square-error (MSE) instead of CLL. It is well known that such a setting leads to an Artificial Neural Network (ANN) with no hidden layer. In this work, we study the effect of the WANBIA-C trick on LR that optimizes the MSE. Optimizing MSE instead of CLL may lead to a lower-bias classifier and hence result in better performance on big datasets. We show that the WANBIA-C trick can significantly speed up the convergence of LR with MSE. We also show that optimizing LR with MSE leads to a lower-bias classifier than LR with CLL. Finally, we compare the performance to the state of the art in classification, such as Random Forest.
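The difference between the two training objectives can be seen from their gradients. The following is a standard textbook derivation for the binary case with a sigmoid output, given here as an assumption-laden illustration rather than the paper's notation; the WANBIA-C preconditioning itself is not shown.

```latex
\begin{align*}
p &= \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}, \\
\mathcal{L}_{\mathrm{CLL}} &= -\bigl[y \log p + (1 - y)\log(1 - p)\bigr],
  & \nabla_w \mathcal{L}_{\mathrm{CLL}} &= (p - y)\,x, \\
\mathcal{L}_{\mathrm{MSE}} &= \tfrac{1}{2}(y - p)^2,
  & \nabla_w \mathcal{L}_{\mathrm{MSE}} &= (p - y)\,p(1 - p)\,x.
\end{align*}
```

The MSE gradient carries the extra factor $p(1 - p)$, which shrinks towards zero when the output saturates near 0 or 1, so plain gradient descent on MSE can progress slowly in those regions.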

AbstractPersonalized predictive medicine necessitates modeling of patient illness and care processes, which inherently have long-term temporal dependencies. Healthcare observations, recorded in electronic medical records, are episodic and irregular in time. We introduce DeepCare, a deep dynamic neural network that reads medical records and predicts future medical outcomes. At the data level, DeepCare models patient health state trajectories with explicit memory of illness. Built on Long Short-Term Memory (LSTM), DeepCare introduces time parameterizations to handle irregular timing by moderating the forgetting and consolidation of illness memory. DeepCare also incorporates medical interventions that change the course of illness and shape future medical risk. Moving up to the health state level, historical and present health states are then aggregated through multiscale temporal pooling, before passing through a neural network that estimates future outcomes. We demonstrate the efficacy of DeepCare for disease progression modeling and readmission prediction in diabetes, a chronic disease with large economic burden. The results show improved modeling and risk prediction accuracy.

AbstractThis paper focuses on the problem of tagging quality evaluation in collaborative tagging systems. By investigating the dynamics of the tagging process, we find that high-frequency tags cover almost all the main aspects of a resource’s content and can be determined to be stable much earlier than a tag set. Motivated by this finding, we design the swapping index and the smart moving index for tagging quality. Then we study the inner relations between tags and propose a novel semantic method for tagging quality. We evaluate the proposed methods on real datasets. The proposed metrics are more efficient than previous methods, which makes them especially appropriate for a large number of web resources. Their effectiveness is confirmed by the results in tag-based applications. The results show that the smart metrics bring a small loss in performance, while the semantic evaluation is better than current methods.

AbstractThe k-medoids algorithm is a partitional, centroid-based clustering algorithm that uses pairwise distances between data points and tries to directly decompose a dataset of $n$ points into a set of $k$ disjoint clusters. However, k-medoids itself requires all distances between data points, which are not easy to obtain in many applications. In this paper, we introduce a new method that requires only a small proportion of the whole set of distances and estimates an upper bound for the unknown distances using the queried ones. This algorithm makes use of the triangle inequality to calculate an upper-bound estimate of the unknown distances. Our method is built upon a recursive approach to cluster objects, actively choosing some points from each bunch of data and acquiring the distances between these prominent points from an oracle. Experimental results show that the proposed method, using only a small subset of the distances, can find a proper clustering on many real-world and synthetic datasets.
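The triangle-inequality step can be illustrated with a minimal sketch. It shows only a one-hop bound, d(a, b) ≤ d(a, p) + d(p, b), minimized over intermediate points p with both distances known; the paper's recursive clustering and active point selection are not reproduced, and `upper_bound`, `known` and the toy distances are hypothetical.

```python
def upper_bound(known, a, b, points):
    """Tightest one-hop triangle-inequality upper bound on d(a, b).

    `known` maps frozenset({p, q}) to a distance already obtained from
    the oracle. Returns None when no intermediate point links a and b.
    """
    if a == b:
        return 0.0
    direct = known.get(frozenset((a, b)))
    if direct is not None:
        return direct  # the exact distance is already known
    bounds = [known[frozenset((a, p))] + known[frozenset((p, b))]
              for p in points
              if p not in (a, b)
              and frozenset((a, p)) in known
              and frozenset((p, b)) in known]
    return min(bounds) if bounds else None

# Hypothetical distances queried from the oracle.
points = ["a", "b", "c", "d", "e"]
known = {
    frozenset(("a", "c")): 2.0,
    frozenset(("c", "b")): 3.0,
    frozenset(("a", "d")): 1.0,
    frozenset(("d", "b")): 5.0,
}

print(upper_bound(known, "a", "b", points))
```

Each additional known distance can only tighten the bound, so a small, well-chosen set of oracle queries can constrain many unknown pairs at once.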

AbstractThis paper tackles the reciprocal recommendation task, which has various applications such as online dating, employee recruitment, and mentor-mentee matching. The major difference between traditional recommender systems and reciprocal recommender systems is that a reciprocal recommender has to satisfy the preferences in both directions. This paper proposes a simple yet novel regularization term, the Mutual-Attraction Indicator, to model the mutual preferences of both parties. Given this indicator, we design a transfer-learning based CF model for reciprocal recommendation. The experiments are based on two real-world tasks, online dating and human resource matching, and show significantly improved performance over the original factorization model and state-of-the-art reciprocal recommenders.

AbstractSocial link identification (SIL), that is, identifying accounts across different online social networks (OSNs) that belong to the same user, is an important task in social network applications. Most existing methods for this problem directly apply machine-learning classifiers to features extracted from users' rich information. In practice, however, only limited user information can be obtained because of privacy concerns. In addition, we observe that the existing methods cannot handle the huge number of potential account pairs from different OSNs. In this paper, we propose an effective SIL method that addresses these two challenges by expanding known anchor links (seed account pairs belonging to the same person). In particular, we leverage potentially useful information possessed by the existing anchor links, and then develop a local expansion model to identify new social links, which are taken as generated anchor links and used to iteratively identify additional new social links. We evaluate our method on two of the most popular Chinese social networks. Experimental results show that our proposed method achieves much better performance in terms of both the number of correct account pairs and efficiency.

AbstractThis paper studies a new machine learning strategy called joint classification with heterogeneous labels (JCHL). Unlike traditional supervised learning problems, JCHL uses a single feature space to jointly solve multiple classification tasks with heterogeneous labels. For instance, biologists usually have to label gene expression images with developmental stages and simultaneously annotate them with anatomical terms. We would like to classify the developmental stages and at the same time classify the anatomical terms by learning from the gene expression data. Recently, researchers have considered using preferential random walk (PRW) to build different relations that link heterogeneous labels, so that the heterogeneous label information can be propagated through the instances. On the other hand, it has been shown that learning performance can be significantly enhanced if dynamic propagation is exploited in PRW. In this paper, we propose a novel algorithm, called random walk with dynamic label propagation (RWDLP), for JCHL problems. In RWDLP, a joint transition probability graph is constructed to encode the relationships among instances and heterogeneous labels, and we utilize dynamic label propagation in the graph to generate the possible labels for the joint classification tasks with heterogeneous labels. Experimental results demonstrate the effectiveness of the proposed method.

AbstractWith the development of service-oriented computing, web mashups, which provide composite services, have been increasing rapidly in recent years, posing the challenge of finding appropriate mashups for a given query. To the best of our knowledge, most approaches to service discovery are based mainly on the semantic information of services, and the services are ranked by their QoS values. However, these methods cannot be applied to mashup discovery seamlessly, since they rely only on the description of mashups and neglect the information of service components. Moreover, these semantic-based techniques do not consider the compositional structure of mashups and their components. In this paper, we propose an efficient consistent regularization framework to enhance mashup discovery by leveraging the heterogeneous information network between mashups and their components. Our model also integrates mashup discovery and ranking properly. Comprehensive experiments have been conducted on a real-world dataset from ProgrammableWeb.com. Experimental results show that our model achieves better performance than the ProgrammableWeb.com search engine and a state-of-the-art semantic-based model.