Symbolic Data Analysis Workshop
SDA 2015, November, Orléans, France
University of Orléans

SPONSORS
CNRS
MAPMO Mathematics Laboratory
University of Orléans
Denis Poisson Federation
Centre-Val de Loire Region Council
Loiret District Council

STEERING COMMITTEE
Paula BRITO, University of Porto, Portugal
Monique NOIRHOMME, University of Namur, Belgium

ORGANISING COMMITTEE
Guillaume CLEUZIOU, Richard EMILION, Christel VRAIN
Secretary: Marie-France GRESPIER
University of Orléans, France

SCIENTIFIC COMMITTEE
Javier ARROYO, Spain
Lynne BILLARD, USA
Paula BRITO, Portugal
Chun-houh CHEN, Taiwan
Guillaume CLEUZIOU, France
Francisco DE CARVALHO, Brazil
Edwin DIDAY, France
Richard EMILION, France
Manabu ICHINO, Japan
Yves LECHEVALLIER, France
Monique NOIRHOMME, Belgium
Rosanna VERDE, Italy
Gilles VENTURINI, France
Christel VRAIN, France
Huiwen WANG, China

Tuesday, November 17
Orléans University Campus, IIIA Computer Science Building

TUTORIAL
1st floor, Room E19

09:00-09:50 Introduction to Symbolic Data Analysis
    Paula BRITO, FEP & LIAAD-INESC TEC, Univ. Porto, Portugal
09:50-10:15 Coffee Break
10:15-11:05 The Quantile Method for Symbolic Data Analysis
    Manabu ICHINO, SSE, Tokyo Denki University, Japan
11:05-11:55 The R SDA Package
    Oldemar RODRIGUEZ, University of Costa Rica
12:00-13:45 Welcome, Registration, Lunch
    L'Agora Restaurant, Orléans University Campus
13:55-14:00 Workshop Opening
    IIIA, Herbrand Amphitheatre
14:00-17:25 Workshop Talks
19:30 Workshop Dinner
    'Le Martroi' restaurant, 12 Place du Martroi, Orléans. Tram stop: 'De Gaulle' or 'République'

Introduction to Symbolic Data Analysis
Paula Brito
FEP & LIAAD-INESC TEC, Univ. Porto, Portugal

Symbolic Data Analysis, introduced by E. Diday, is concerned with analysing data presenting intrinsic variability, which is to be explicitly taken into account.
In classical Statistics and Multivariate Data Analysis, the elements under analysis are generally individual entities for which a single value is recorded for each variable - e.g., individuals described by their age, salary, education level, marital status, etc.; cars, each described by its weight, length, power, engine displacement, etc.; students, for each of whom the marks in different subjects were recorded. But when the elements of interest are classes or groups of some kind - the citizens living in given towns; teams, consisting of individual players; car models, rather than specific vehicles; classes and not individual students - then there is variability inherent to the data. Reducing this variability by taking central tendency measures - mean values, medians or modes - leads to too great a loss of information.

Symbolic Data Analysis provides a framework for representing data with variability, using new variable types. Methods have also been developed which suitably take data variability into account. Symbolic data may be represented using the usual matrix-form data arrays, where each entity is represented in a row and each column corresponds to a different variable - but now the elements of each cell are generally not single real values or categories, as in the classical case, but rather finite sets of values, intervals or, more generally, distributions.

In this talk we shall introduce and motivate the field of Symbolic Data Analysis and present in some detail the new variable types that have been introduced to represent variability, illustrating them with some examples. We shall furthermore discuss some issues that arise when analysing data that do not follow the usual classical model, and present data representation models for some variable types.
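As an illustration of the idea in the abstract above, the following minimal sketch aggregates individual records into one symbolic description per group. The data, the field names and the `to_symbolic` helper are hypothetical and chosen only for this example; they are not taken from the talk or from any SDA software. The two teams are built to have the same mean height, so the mean alone cannot distinguish them, while the interval-valued and distribution-valued cells can.

```python
from collections import Counter, defaultdict

# Hypothetical microdata: one row per individual player
# (team, height in cm, position).
players = [
    ("Team A", 178, "defender"),
    ("Team A", 191, "goalkeeper"),
    ("Team A", 183, "defender"),
    ("Team B", 170, "forward"),
    ("Team B", 198, "forward"),
    ("Team B", 184, "defender"),
]


def to_symbolic(rows):
    """Aggregate individual rows into one symbolic description per team."""
    groups = defaultdict(list)
    for team, height, position in rows:
        groups[team].append((height, position))

    table = {}
    for team, members in groups.items():
        heights = [h for h, _ in members]
        counts = Counter(p for _, p in members)
        n = len(members)
        table[team] = {
            # Interval-valued cell: range of the observed heights.
            "height_interval": (min(heights), max(heights)),
            # Distribution-valued cell: relative frequency of each position.
            "position_distribution": {p: c / n for p, c in sorted(counts.items())},
            # Classical summary, kept only for comparison.
            "height_mean": sum(heights) / n,
        }
    return table


symbolic_table = to_symbolic(players)
for team, description in symbolic_table.items():
    print(team, description)
```

Each row of the resulting table describes a team rather than a player, and each cell holds an interval or a frequency distribution instead of a single value.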

The Quantile Method for Symbolic Data Analysis
Manabu Ichino
School of Science and Engineering, Tokyo Denki University

Keywords: Quantiles, Monotonicity, Visualization, PCA, Clustering

Abstract
The quantile method transforms a given (N objects) × (d variables) symbolic data table into a standard {N × (m+1) sub-objects} × (d variables) numerical data table, where m is a preselected integer that controls the granularity with which symbolic objects are represented. Each symbolic object is therefore represented by a set of (m+1) d-dimensional numerical vectors, called the quantile vectors. Based on the monotonicity of the quantile vectors, we present the following three methods for symbolic data analysis.

Visualization: We visualize each symbolic object by m+1 parallel monotone line graphs [Ichino and Brito 2014]. Each line graph is composed of d-1 line segments accumulating the d zero-one normalized variable values.

PCA: When the given symbolic objects have a monotone structure in the representation space, the structure confines the corresponding quantile vectors to a similar geometrical shape. We apply PCA to the quantile vectors, based on rank-order correlation coefficients. We represent each symbolic object as a series of m arrow lines connecting the minimum quantile vector to the maximum quantile vector in the factor planes [Ichino 2011].

Clustering: We present a hierarchical conceptual clustering based on the quantile vectors. We define the concept sizes of d-dimensional hyper-rectangles spanned by quantile vectors. The concept size serves both as the similarity measure between sub-objects, i.e., quantile vectors, and as the measure of cluster quality [Ichino and Brito 2015].

References
H-H. Bock and E. Diday (2000). Analysis of Symbolic Data - Exploratory Methods for Extracting Statistical Information from Complex Data. Heidelberg: Springer.
L. Billard and E. Diday (2007). Symbolic Data Analysis - Conceptual Statistics and Data Mining.
Chichester: Wiley.
E. Diday and M. Noirhomme-Fraiture (2008). Symbolic Data Analysis and the SODAS Software. Chichester: Wiley.
M. Ichino and P. Brito (2014). The data accumulation graph (DAG) to visualize multi-dimensional symbolic data. Workshop in Symbolic Data Analysis. Taipei, Taiwan.
M. Ichino (2011). The quantile method for symbolic principal component analysis. Statistical Analysis and Data Mining, 4, 2, pp.
M. Ichino and P. Brito (2015). A hierarchical conceptual clustering based on the quantile method for mixed feature-type data. (Submitted to the IEEE Trans. SMC).

Latest developments of the RSDA: An R package for Symbolic Data Analysis
Oldemar Rodríguez
June 26, 2015

Abstract
This package aims to execute some models of Symbolic Data Analysis. Symbolic Data Analysis was proposed by Professor E. Diday in 1987 in his paper "Introduction à l'approche symbolique en Analyse des Données" (Premières Journées Symbolique-Numérique, Université Paris IX Dauphine, December). A very good reference on symbolic data analysis is "From the Statistics of Data to the Statistics of Knowledge: Symbolic Data Analysis" by L. Billard and E. Diday, published in the Journal of the American Statistical Association, June 2003, Vol. 98.

The main purpose of Symbolic Data Analysis is to replace a set of rows (cases) in a data table with a concept (second-order statistical unit). For example, all of the transactions performed by one person (or any object) can be replaced by a single transaction that summarizes all the original ones (a symbolic object), so that millions of transactions can be summarized in a single one that preserves the customary behavior of the person. This is possible because the fields of the new transaction can contain not only numbers (as in ordinary transactions) but also objects such as intervals, histograms, or rules.
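
The summarization step described above can be sketched in a few lines. This is only an illustrative sketch in Python, not the RSDA implementation or its API: the transactions, the bin edges and the `summarize` helper are all assumptions made for this example. Each customer's transactions are replaced by one record whose fields are a count, an interval of amounts and a histogram of amounts over fixed bins.

```python
from bisect import bisect_right

# Hypothetical transactions: (customer, amount).
transactions = [
    ("alice", 12.5), ("alice", 40.0), ("alice", 7.0), ("alice", 95.0),
    ("bob", 250.0), ("bob", 310.0),
]

# Bin edges chosen arbitrarily for illustration: [0,10), [10,50), [50,100), [100,500).
BIN_EDGES = [0, 10, 50, 100, 500]


def summarize(rows, edges):
    """Replace each customer's transactions with a single symbolic record."""
    per_customer = {}
    for customer, amount in rows:
        per_customer.setdefault(customer, []).append(amount)

    summary = {}
    for customer, amounts in per_customer.items():
        counts = [0] * (len(edges) - 1)
        for amount in amounts:
            # Index of the bin [edges[i], edges[i+1]) that contains the amount.
            i = bisect_right(edges, amount) - 1
            # Amounts outside the edges are clamped into the first or last bin.
            i = min(max(i, 0), len(counts) - 1)
            counts[i] += 1
        n = len(amounts)
        summary[customer] = {
            "n_transactions": n,
            "amount_interval": (min(amounts), max(amounts)),
            "amount_histogram": [c / n for c in counts],
        }
    return summary


summary = summarize(transactions, BIN_EDGES)
for customer, record in summary.items():
    print(customer, record)
```

The histogram stores relative frequencies, so records built from very different numbers of transactions remain comparable.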
This representation of an object as a conjunction of properties fits within a data analytic framework concerning symbolic data and symbolic objects, which has proven useful in dealing with big databases.

In RSDA version 1.2, methods such as centers interval principal components analysis, histogram principal components analysis, multi-valued correspondence analysis, interval multidimensional scaling (INTERSCAL), symbolic hierarchical clustering, and CM, CRM, Lasso, Ridge and Elastic Net linear regression models for interval variables have been implemented. This new version also includes new features for manipulating symbolic data through a new data structure that implements Symbolic Data Frames, and methods for converting SODAS and XML SODAS files to RSDA files.

Keywords: Symbolic data analysis, R package, RSDA, interval principal components analysis, Lasso, Ridge, Elastic Net, Linear regression.

University of Costa Rica, San José, Costa Rica

References
[1] Billard, L., Diday, E. (2003). From the statistics of data to the statistics of knowledge: symbolic data analysis. J. Amer. Statist. Assoc. 98 (462).
[2] Billard, L., Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons Ltd, United Kingdom.
[3] Bock, H.-H., Diday, E. (eds.) (2000). Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex Data. Berlin: Springer-Verlag.
[4] Diday, E. (1987). Introduction à l'approche symbolique en Analyse des Données. Premières Journées Symbolique-Numérique. Université Paris IX Dauphine. Paris, France.
[5] Lima-Neto, E.A., De Carvalho, F.A.T. (2008). Centre and range method to fitting a linear regression model on symbolic interval data. Computational Statistics and Data Analysis 52.
[6] Lima-Neto, E.A., De Carvalho, F.A.T. (2010). Constrained linear regression models for symbolic interval-valued variables. Computational Statistics and Data Analysis 54.
[7] Rodríguez, O. (2000).
Classification et Modèles Linéaires en Analyse des Données Symboliques. Ph.D. Thesis, Paris IX-Dauphine University.
[8] Rodríguez, O., with contributions from Olger Calderon and Roberto Zuniga (2014). RSDA - R to Symbolic Data Analysis. R package version 1.2. [http://cran.r-project.org/package=rsda]

Session Speakers
Tuesday, November 17, afternoon

12:00-13:45 Welcome, Registration, Lunch
    L'Agora Restaurant, Orléans University Campus
13:55-14:00 Workshop Opening
    IIIA Computer Science Building, Herbrand Amphitheatre

Session 1: VARIABLE DEPENDENCIES
Chair: Rosanna VERDE

14:00-14:25 Explanatory Power of a Symbolic Data Table
    Edwin DIDAY, University Paris-Dauphine, France
14:25-14:50 Methods for Analyzing Joint Distribution Valued Data and Actual Data Sets
    Masahiro MIZUTA, Hiroyuki MINAMI, IIC, Hokkaido University, Japan
14:50-15:15 Advances in regression models for interval variables: a copula based model
    Eufrasio LIMA NETO, Ulisses DOS ANJOS, Univ. Paraíba, João Pessoa, Brazil
15:15-15:40 Symbolic Bayesian Networks
    Edwin DIDAY, University Paris-Dauphine
    Richard EMILION, MAPMO, University of Orléans, France
15:40-16:10 Coffee Break

Session 2: STATISTICAL APPROACHES
Chair: Didier CHAUVEAU

16:10-16:35 Maximum Likelihood Estimations for Interval-Valued Variables
    Lynne BILLARD, University of Georgia, USA
16:35-17:00 Function-valued Image Segmentation using Functional Kernel Density Estimation
    Laurent DELSOL, Cecile LOUCHET, MAPMO, University of Orléans, France
17:00-17:25 Outlier Detection in Interval Data
    A. Pedro DUARTE SILVA, UCP Porto, Portugal
    Peter FILZMOSER, TU Vienna, Austria
    Paula BRITO, FEP & LIAAD-INESC TEC, Univ. Porto, Portugal

19:30 Workshop Dinner. 'Le Martroi' restaurant. 12, Place du Martroi.
Orléans. Tram stop: 'De Gaulle' or 'République'

Explanatory Power of a Symbolic Data Table
Edwin Diday (Paris-Dauphine University)

The main aim of this talk is to study the "explanatory" quality of a symbolic data table. We give criteria based on entropy and discrimination, which are shown to be complementary. We show that, under some conditions, the best descriptive variables of the concepts are also the best predictive ones.

Methods for Analyzing Joint Distribution Valued Data and Actual Data Sets
Masahiro Mizuta 1*, Hiroyuki Minami 1*
1. Advanced Data Science Laboratory, Information Initiative Center, Hokkaido University, JAPAN
*Contact author:

Keywords: Simultaneous Distribution Valued Data, Parkinsons Telemonitoring Data

Analysis of distribution valued data is one of the hottest topics in SDA, especially joint distribution (or simultaneous distribution) valued data. In this talk, we introduce methods for such data and present openly available real data sets.

We assume that we have n concepts (or objects) and that each concept is described by a distribution. Many methods have been proposed. The key ideas can be summarized as follows: (1) use of distances between concepts, (2) use of parameters of distributions, (3) use of quantile functions. When the concepts are described by joint distributions, approach (1) is natural. Igarashi (2015) proposed a method based on it. There is room to adopt approaches (2) and (3).

Good real data sets are helpful for studying methods of data analysis, but there are not many good data sets of joint distribution valued data. Mizuta (2014) presented one such data set: monitoring post data from around Fukushima Prefecture. Another good data set is the Parkinsons Telemonitoring data set, which is available on the Web (https://archive.ics.uci.edu/ml/datasets/parkinsons+telemonitoring). I will introduce them.

Acknowledgment: I wish to thank Mr. Igarashi. A part of this work is based on the results of Igarashi (2015).

References
M. Mizuta (2012).
Analysis of Distribution Valued Dissimilarity Data. In Challenges at the Interface of Data Analysis, Computer Science, and Optimization. Springer.
M. Mizuta (2014). Symbolic Data Analysis for Big Data. Proceedings of 2014 Workshop in Symbolic Data Analysis 59.
K. Igarashi, H. Minami, M. Mizuta (2015). Exploratory Methods for Joint Distribution Valued Data and Their Application. Communications for Statistical Applications and Methods, Vol. 22, No. 3.
A. Irpino, R. Verde (2015). Basic Statistics for Distributional Symbolic Variables: A New Metric-based Approach. Advances in Data Analysis and Classification, Vol. 9, No. 2.

Advances in regression models for interval variables: a copula based model
Eufrásio Lima Neto 1,*, Ulisses dos Anjos 1
1. Department of Statistics, Federal University of Paraíba, João Pessoa, PB, Brazil.
*Contact author:

Keywords: Inference, Copulas, Regression, Interval Variable.

Regression models are widely used to solve problems in many fields. However, inferential techniques play an important role in validating these models. Recently, some contributions were presented for fitting a regression model to interval-valued variables. We start this talk by discussing some of these techniques. We then present a regression model for interval-valued variables based on copula theory, which allows more flexibility for the model's random component. The main advance of the new approach is that it makes it possible to consider inferential procedures for the parameter estimates, as well as goodness-of-fit measures and residual analysis, based on a general probabilistic background. A Monte Carlo simulation study demonstrated asymptotic properties of the maximum likelihood estimates obtained from the copula regression model. Applications to real data sets are also considered.

References
Brito, P. and Duarte Silva, A.P. (2012). Modelling interval data with normal and skew-normal distributions.
Journal of Applied Statistics 39.
Blanco-Fernández, A., Corral, N. and González-Rodríguez, G. (2011). Estimation of a flexible simple linear model for interval data based on set arithmetic. Computational Statistics & Data Analysis 55.
Diday, E. and Vrac, M. (2005). Mixture decomposition of distributions by copulas in the symbolic data analysis framework. Discrete Applied Mathematics 147(1).
Lima Neto, E.A. and Anjos, U.U. (2015). Regression model for interval-valued variables based on copulas. Journal of Applied Statistics.
Lima Neto, E.A., Cordeiro, G.M. and De Carvalho, F.A.T. (2011). Bivariate symbolic regression models for interval-valued variables. Journal of Statistical Computation and Simulation 81.

Symbolic Bayesian Networks
Edwin DIDAY 1, Richard EMILION 2,*
1. CEREMADE, University Paris-Dauphine
2. MAPMO, University of Orléans, France
*Contact author:

Keywords: Bayesian network, Conditional distribution, Dirichlet distribution, Independence test.

Bayesian networks, see e.g. [1], are probabilistic directed acyclic graphs used for system behavior modelling through conditional distributions. They generally deal with correlated categorical or real-valued random variables. We consider Bayesian networks dealing with probability-distribution-valued random variables.

1. Statistical setting
Let $X = (X_1, \ldots, X_j, \ldots, X_p)$ be a random vector, $p \geq 1$ being an integer and each $X_j$ taking values in the space of probability measures defined on a measurable space $(V_j, \mathcal{V}_j)$, $j = 1, \ldots, p$. Let $(X_{k,1}, \ldots, X_{k,j}, \ldots, X_{k,p})_{k = 1, \ldots, K}$ be a sample of size $K$ of $X$. Consider $k$ as a row index and $j$ as a column index.

2. Motivation
In practice, the sample $(X_{k,1}, \ldots, X_{k,j}, \ldots, X_{k,p})_{k = 1, \ldots, K}$ is not observed but only estimated from observed data. In symbolic data analysis (SDA), each observed datum belongs to one class among $K$ disjoint classes, say $c_1, \ldots, c_K$. The observed data can be either vectors in $\prod_{j=1}^{p} V_j$ or elements of some $V_j$, as seen in the two examples below, which illustrate two different situations.
The empirical distribution of the data in $V_j$ which belong to class $c_k$ is an estimate of the probability distribution $X_{k,j}$. This distribution is considered as the $j$-th descriptor of class $c_k$.

2.1 Paired Samples
In the well-known Fisher's iris data set, $K = 3$, $c_1$ = setosa, $c_2$ = versicolor, $c_3$ = virginica, $p = 4$. The observations are 50 irises in each of these 3 classes. The observed samples are paired since each iris is described by a vector of 4 values. As an example, $X_{3,2}$ is the probability distribution of sepal width in the virginica class.

2.2 Unpaired Samples
Let $c_1, \ldots, c_K$ be $K$ students, and consider $p$ professors who grade several of the students' exams. Let $X_{k,j}$ be the distribution of the grades given to student $c_k$ by professor $j$. The samples here are unpaired since the exams and the number of exams can differ from one professor to another.

2.3 Dependencies
Clearly, in the case of paired samples, within each class, data of descriptor $j$ are correlated with data of descriptor $j'$, while this correlation is meaningless in the case of unpaired samples. However, considering the $K$ pairs of estimated distributions $(X_{k,j}, X_{k,j'})$, $k = 1, \ldots, K$, $j, j' = 1, \ldots, p$, $j \neq j'$, it is seen that the random distributions $X_j$ and $X_{j'}$ can be correlated. This motivates us to consider Bayesian networks dealing with probability distributions.

3. The case of finite sets
Assume $V_j$ is finite, so that $X_{k,j}$ is a probability vector of frequencies whose size can depend on $j$. Bayesian networks are then built by testing the independence (resp. the correlation) between the two random vectors $X_j$ and $X_{j'}$. We have used the indep.etest() function implemented in the energy package for R [3]. Distributions and conditional distributions are estimated using kernels in the nonparametric case, while Dirichlet distributions are used in the parametric case.

4. The case of densities
Assume that each $V_j$ is a measurable subset of some $\mathbb{R}^{d_j}$ and that $X_{k,j}$ has a density $f_{k,j}$ w.r.t. the Lebesgue measure.
Independence tests can be performed and conditional distributions can be estimated using functional data analysis methods, either by using a finite number of coordinates on some basis, which reduces the problem to the finite sets case, or by using kernel estimators w.r.t. a distance on a function space [2].

References
[1] Darwiche, A. (2009). Modeling and Reasoning with Bayesian Networks. Cambridge University Press.
[2] Ramsay, J.O., Silverman, B.W. (2005). Functional Data Analysis. Springer.
[3] Székely, G.J., Rizzo, M.L. (2013). The distance correlation t-test of independence in
