Introduction

According to the World Health Organization (http://www.who.int/cancer/en), cancer is a leading cause of death worldwide. Of a total of 58 million deaths in 2005, cancer accounted for 7.6 million (about 13%). The main contributors to overall cancer mortality are i) lung (1.3 million deaths/year), ii) stomach (almost 1 million deaths/year), iii) liver (662,000 deaths/year), iv) colon (655,000 deaths/year) and v) breast (502,000 deaths/year). Among men, the most frequent cancer types worldwide are (in order of number of global deaths) lung, stomach, liver, colorectal, oesophagus and prostate, while among women (in order of number of global deaths) they are breast, lung, stomach, colorectal and cervical.

Technological advancements in recent years are enabling the collection of large amounts of cancer-related data. In particular, in the field of Bioinformatics, high-throughput microarray gene experiments are now possible, leading to an information explosion. This requires the development of data mining procedures that speed up the process of scientific discovery and the in-depth understanding of the internal structure of the data. Such understanding is crucial for the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro & Smyth, 1996). Researchers need to understand their data rapidly and with greater ease. In general, objects under study are described in terms of collections of heterogeneous properties. It is typical for medical data to be composed of properties represented by nominal, ordinal or real-valued (scalar) variables, as well as by others of a more complex nature, such as images and time series. In addition, the information comes with different degrees of precision, uncertainty and completeness (missing data is quite common).

Classical data mining and analysis methods are sometimes difficult to use, the output of many procedures may be large and time-consuming to analyze, and their interpretation often requires special expertise. Moreover, some methods are based on assumptions about the data which limit their application, especially for purposes such as exploration, comparison and hypothesis formation, typical of the first stages of scientific investigation. This makes graphical representation directly appealing. Humans perceive most information through vision, in large quantities and at very high input rates. The human brain is extremely well qualified for the fast understanding of complex visual patterns, and still outperforms the computer. Several reasons make Virtual Reality (VR) a suitable paradigm: i) it is flexible (it allows the choice of different representation models to better suit human perception preferences), ii) it allows immersion (the user can navigate inside the data and interact with the objects in the world), iii) it creates a living experience (the user is not merely a passive observer, but an actor in the world) and iv) it is broad and deep (the user may see the VR world as a whole, and/or concentrate on specific details of the world). Of no less importance is the fact that only minimal skills are required to interact with a virtual world.

Visualization techniques may be very useful for medical decision support in the oncology area. In this paper, unsupervised neural networks are used for constructing VR spaces for visual data mining of gene expression cancer data. Three datasets are used, representative of three of the most important types of cancer in modern medicine: liver, stomach and lung. The datasets are composed of samples from normal and tumor tissues, described in terms of tens of thousands of variables, which are the corresponding gene expression intensities measured in microarray experiments. Despite the very high dimensionality of the studied patterns, high-quality visual representations in the form of structure-preserving VR spaces are obtained using SAMANN neural networks, and these representations enable the differentiation of cancerous and noncancerous tissues. The same networks could be used as nonlinear feature generators in a preprocessing step for other data mining procedures.

Key Terms in this Chapter

Backpropagation algorithm: Algorithm to compute the gradient of an error function with respect to the weights, used for the training of some types of artificial neural networks. It was first described by P. Werbos in 1974, and further developed by D.E. Rumelhart, G.E. Hinton and R.J. Williams in 1986.

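Sammon Error: Error function that is minimized in order to maximize structure preservation in projected data. It is defined as

$$E = \frac{1}{\sum_{i<j} \delta_{ij}} \sum_{i<j} \frac{\left(\delta_{ij} - \zeta_{ij}\right)^{2}}{\delta_{ij}},$$

where $\delta_{ij}$ and $\zeta_{ij}$ are dissimilarity measures between two objects i, j in the original and projected space, respectively.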
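To make the definition concrete, the following minimal sketch computes the Sammon error for a handful of points. It assumes Euclidean distance as the dissimilarity measure in both spaces, and the function name `sammon_error` and the toy data are hypothetical choices made for illustration only; they are not taken from the chapter.

```python
import math
from itertools import combinations


def sammon_error(original, projected):
    """Sammon error between two equally long lists of points.

    Assumption: Euclidean distance is the dissimilarity in both spaces.
    Pairs whose original distance is zero are skipped to avoid dividing by zero.
    """
    if len(original) != len(projected):
        raise ValueError("both spaces must contain the same number of objects")
    total_delta = 0.0
    weighted_sq_diff = 0.0
    for i, j in combinations(range(len(original)), 2):
        delta = math.dist(original[i], original[j])    # original space
        zeta = math.dist(projected[i], projected[j])   # projected space
        if delta == 0.0:
            continue
        total_delta += delta
        weighted_sq_diff += (delta - zeta) ** 2 / delta
    if total_delta == 0.0:
        raise ValueError("all original objects coincide")
    return weighted_sq_diff / total_delta


# Three objects in 3-D whose pairwise distances are 5, 12 and 13.
original = [(0, 0, 0), (3, 4, 0), (3, 4, 12)]
# A 2-D layout with the same pairwise distances (structure fully preserved).
isometric = [(0, 0), (5, 0), (5, 12)]
# Simply dropping the third coordinate collapses two objects onto each other.
flattened = [(0, 0), (3, 4), (3, 4)]

print(sammon_error(original, isometric))
print(sammon_error(original, flattened))
```

The error is zero only when every pairwise dissimilarity is reproduced exactly. Dividing each squared difference by the original dissimilarity gives relatively more weight to distortions between objects that were close together in the original space.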

SAMANN Neural Networks: Unsupervised feedforward neural networks for data projection. The classical way of training SAMANN networks was described by J. Mao and A.K. Jain in 1995. It consists of a gradient descent method where the derivatives of the Sammon error are computed in a similar way to the backpropagation algorithm.
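The idea of propagating Sammon-error derivatives back through a network can be sketched as follows. This is a simplified illustration and not a reproduction of the Mao and Jain procedure: it assumes Euclidean distances, a single tanh hidden layer with a linear output layer and no bias terms, and full-batch gradient descent. All function names, the learning rate, and the random toy data are hypothetical.

```python
import math
import random
from itertools import combinations


def sammon_output_gradient(deltas, Y):
    """Sammon error and its gradient with respect to each projected point."""
    n, m = len(Y), len(Y[0])
    c = sum(deltas.values())
    error = 0.0
    grad = [[0.0] * m for _ in range(n)]
    for (i, j), delta in deltas.items():
        zeta = max(math.dist(Y[i], Y[j]), 1e-12)   # guard against zeta = 0
        error += (delta - zeta) ** 2 / delta
        coeff = -2.0 * (delta - zeta) / (c * delta * zeta)
        for k in range(m):
            g = coeff * (Y[i][k] - Y[j][k])
            grad[i][k] += g
            grad[j][k] -= g
    return error / c, grad


def forward(x, W1, W2):
    """One tanh hidden layer followed by a linear output layer (no biases)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    return h, y


def train_samann(X, out_dim=2, hidden=4, lr=0.05, epochs=500, seed=0):
    """Full-batch gradient descent on the Sammon error of the network outputs."""
    rng = random.Random(seed)
    d_in = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(out_dim)]
    deltas = {}
    for i, j in combinations(range(len(X)), 2):
        d = math.dist(X[i], X[j])
        if d > 0.0:                       # skip coincident objects
            deltas[(i, j)] = d
    history = []
    for _ in range(epochs):
        passes = [forward(x, W1, W2) for x in X]
        error, grad_Y = sammon_output_gradient(deltas, [y for _, y in passes])
        history.append(error)
        gW1 = [[0.0] * d_in for _ in range(hidden)]
        gW2 = [[0.0] * hidden for _ in range(out_dim)]
        for x, (h, _), gy in zip(X, passes, grad_Y):
            for o in range(out_dim):
                for k in range(hidden):
                    gW2[o][k] += gy[o] * h[k]
            for k in range(hidden):
                # Chain rule through the linear output and the tanh unit.
                gh = sum(gy[o] * W2[o][k] for o in range(out_dim))
                gz = gh * (1.0 - h[k] ** 2)
                for a in range(d_in):
                    gW1[k][a] += gz * x[a]
        for k in range(hidden):
            for a in range(d_in):
                W1[k][a] -= lr * gW1[k][a]
        for o in range(out_dim):
            for k in range(hidden):
                W2[o][k] -= lr * gW2[o][k]
    return (lambda x: forward(x, W1, W2)[1]), history


# Toy data: 10 random objects described by 5 variables, projected to 2-D.
rng = random.Random(1)
samples = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(10)]
project, history = train_samann(samples)
print(f"Sammon error: {history[0]:.3f} -> {history[-1]:.3f}")
```

Because the trained network is an explicit function, the returned `project` callable can also be applied to objects that were not used for training, which is what distinguishes this approach from a direct optimization of the projected coordinates.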

Gene Expression: Process by which the inheritable information which comprises a gene, such as the DNA sequence, is made manifest as a physical and biologically functional gene product, such as protein or RNA.

Data Mining: Nontrivial extraction of implicit, previously unknown and potentially useful information from data. Typically, analytical methods and tools are applied to data with the aim of identifying patterns and relationships, or of obtaining databases for tasks such as classification, prediction, estimation or clustering.

Virtual Reality: Technology which allows the user to interact with a computer-simulated environment. Most current virtual reality environments are mainly visual experiences, displayed either on a computer screen or through special stereoscopic displays. Some advanced haptic systems include tactile information.

Artificial Neural Networks: Interconnected group of simple units (neurons) that, as a function of the connections between the units and the parameters, can compute complex behaviors and find nonlinear relationships in data. They are used in applications such as robotics, signal processing, or medical diagnosis.

Condor: Specialized workload management system for compute-intensive jobs in a distributed computing environment, developed at the University of Wisconsin-Madison (http://www.cs.wisc.edu/condor). It provides a job queuing mechanism, resource monitoring and management, scheduling policy, and priority scheme.