How Can Data Mining Help Bio-Data Analysis?

ABSTRACT

Recent progress in data mining research has led to the development of numerous efficient and scalable methods for mining interesting patterns in large databases. In the meantime, recent progress in biology, medical science, and DNA technology has led to the accumulation of tremendous amounts of bio-medical data that demand in-depth analysis. The question becomes how to bridge the two fields, data mining and bioinformatics, for successful mining of bio-medical data. In this abstract, we analyze how data mining may help bio-medical data analysis and outline some research problems that may motivate the further development of data mining tools for bio-data analysis.

Keywords

Bio-medical data analysis, data mining, bioinformatics, data mining applications, research challenges

1. INTRODUCTION

In the past two decades we have witnessed revolutionary changes in biomedical research and bio-technology and an explosive growth of bio-medical data, ranging from data collected in pharmaceutical studies and cancer therapy investigations to data identified in genomics and proteomics research by discovering sequencing patterns, gene functions, and protein-protein interactions. The rapid progress of bio-technology and bio-data analysis methods has led to the emergence and fast growth of a promising new field: bioinformatics.

On the other hand, recent progress in data mining research has led to the development of numerous efficient and scalable methods for mining interesting patterns and knowledge in large databases, ranging from efficient classification methods to clustering, outlier analysis, frequent, sequential, and structured pattern analysis methods, and visualization and spatial/temporal data analysis tools.

The question becomes how to bridge the two fields, data mining and bioinformatics, for successful data mining in bio-medical data. In particular, we should analyze how data mining may help efficient and effective bio-medical data analysis and outline some research problems that may motivate the further development of powerful data mining tools for bio-data analysis. This is the motivation of this talk.

2. HOW DATA MINING MAY HELP BIO-DATA ANALYSIS?

Here we list a few interesting themes on data mining that may help bio-data analysis.

1. Data cleaning, data preprocessing, and semantic integration of heterogeneous, distributed bio-medical databases.

Due to the highly distributed, uncontrolled generation and use of a wide variety of bio-medical data, data cleaning, data preprocessing, and the semantic integration of heterogeneous and widely distributed bio-medical databases, such as genome databases and proteome databases, have become important tasks for the systematic and coordinated analysis of bio-medical databases. This has promoted the research and development of integrated data warehouses and distributed federated databases to store and manage primary and derived bio-medical data, such as genetic data. Data cleaning and data integration methods developed in data mining, such as [9; 3], will help the integration of bio-medical data and the construction of data warehouses for bio-medical data analysis.

2. Exploration of existing data mining tools for bio-data analysis.

After years of research and development, many data mining, machine learning, and statistical analysis systems and tools are available for bio-data exploration and bio-data analysis. Comprehensive surveys and introductions to data mining methods have been compiled into many textbooks, such as [11; 6; 7]. There are also many textbooks on bioinformatics, such as [2; 8; 5; 4]. General data mining and data analysis systems have been constructed for such analysis, such as SAS Enterprise Miner, SPSS, S-Plus, IBM Intelligent Miner, Microsoft SQL Server 2000, SGI MineSet, and Inxight VizServer. There are also some bio-specific data analysis software systems, such as GeneSpring, Spotfire, VectorNTI, COMPASS, and SMA (Statistics for Microarray Analysis) in R. These tools are evolving as well. For bio-data analysis, it is important to train researchers to master and explore the power of these well-tested and widely used data mining tools and packages. A lot of routine data analysis work can be done using such tools.

For sophisticated bio-data analysis tasks, there is much room for research and development of advanced, effective, and scalable data mining methods. Some interesting topics in this direction are as follows.

3. Similarity search and comparison in bio-data.

One of the most important search problems in bio-data analysis is similarity search and comparison among bio-sequences and structures. For example, gene sequences isolated from diseased and healthy tissues can be compared to identify critical differences between the two classes of genes. This can be done by first retrieving the gene sequences from the two tissue classes, and then finding and comparing the frequently occurring patterns of each class. Usually, sequences occurring more frequently in the diseased samples than in the healthy samples might indicate the genetic factors of the disease; on the other hand, those occurring more frequently in the healthy samples might indicate mechanisms that protect the body from the disease. Similar analysis can be performed on microarray data and protein data to identify similar and dissimilar patterns. Moreover, since bio-data usually contains noise or imperfect matches, it is important to develop effective sequential or structural pattern mining algorithms for noisy environments, such as that recently reported in [12].

4. Association analysis: identification of co-occurring bio-sequences or other correlated patterns.

Currently, many studies have focused on the comparison of one gene to another. However, most diseases are not triggered by a single gene but by a combination of genes acting together. Association and correlation analysis methods can be used to help determine the kinds of genes or proteins that are likely to co-occur in target samples. Such analysis would facilitate the discovery of groups of genes or proteins and the study of interactions and relationships among them.

5. Frequent pattern-based cluster analysis.

Most cluster analysis algorithms are based on either Euclidean distance or density [6]. However, bio-data often consists of many features that form a high-dimensional space, and it is crucial to study differentials with scaling and shifting factors in multi-dimensional space, discover pair-wise frequent patterns, and cluster bio-data based on such frequent patterns. One interesting study, which takes microarray data as an example, is [10].

6. Path analysis: linking genes or proteins to different stages of disease development.

While a group of genes/proteins may contribute to a disease process, different genes/proteins may become active at different stages of the disease. If the sequence of genetic activities across the different stages of disease development can be identified, it may be possible to develop pharmaceutical interventions that target the different stages separately, thereby achieving more effective treatment of the disease. Such path analysis is expected to play an important role in genetic studies.

7. Data visualization and visual data mining.

Complex structures and sequencing patterns of genes and proteins are most effectively presented in graphs, trees, cubes, and chains by various kinds of visualization tools. Such visually appealing structures and patterns facilitate pattern understanding, knowledge discovery, and interactive data exploration. Visualization and visual data mining therefore play an important role in biomedical data mining.

8. Privacy preserving mining of bio-medical data.

Although information exchange is important, hospitals and research institutes may still be reluctant to give out precious bio-medical data due to confidentiality, liability, and other concerns. Thus it is important to develop privacy preserving data mining methods, such as [1], to maximally protect privacy while achieving effective data mining.

3. CONCLUSIONS

Both data mining and bioinformatics are fast expanding research frontiers. It is important to examine what the important research issues in bioinformatics are and to develop new data mining methods for scalable and effective bio-data analysis. We believe that active interaction and collaboration between these two fields have just started and that many exciting results will appear in the near future.

4. REFERENCES

[1] R. Agrawal and R. Srikant. Privacy-preserving data mining. In SIGMOD'00, pp. 439–450, Dallas, TX, May 2000.

[2] A. Baxevanis and B. F. F. Ouellette. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (2nd ed.). John Wiley & Sons, 2001.

[3] T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or how to build a data quality browser. In SIGMOD'02, pp. 240–251, Madison, WI, June 2002.

[4] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

[5] W. J. Ewens and G. R. Grant. Statistical Methods in Bioinformatics: An Introduction. Springer-Verlag, New York, 2001.

[6] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.

[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.

[8] A. M. Lesk. Introduction to Bioinformatics. Oxford University Press, 2002.

[9] V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In VLDB'01, pp. 381–390, Rome, Italy, Sept. 2001.

[10] H. Wang, J. Yang, W. Wang, and P. S. Yu. Clustering by pattern similarity in large data sets. In SIGMOD'02, pp. 418–427, Madison, WI, June 2002.

[11] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2001.

[12] J. Yang, P. S. Yu, W. Wang, and J. Han. Mining long sequential patterns in a noisy environment. In SIGMOD'02, pp. 406–417, Madison, WI, June 2002.
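
As an illustration of the two-class comparison described under theme 3 of Section 2 (retrieve the sequences of each tissue class, find the frequently occurring patterns of each class, then compare them), the following is a minimal sketch under simplifying assumptions. It is not the method of [12] and does not handle noise or imperfect matches. The function names `kmer_support` and `contrast_patterns`, the use of fixed-length substrings (k-mers) as "patterns", the `min_diff` threshold, and the toy sequences are all hypothetical choices made for this example.

```python
from collections import Counter


def kmer_support(sequences, k):
    """Fraction of sequences in which each length-k substring occurs at least once."""
    counts = Counter()
    for seq in sequences:
        # Using a set counts each k-mer at most once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    n = len(sequences)
    return {kmer: c / n for kmer, c in counts.items()}


def contrast_patterns(diseased, healthy, k=3, min_diff=0.5):
    """Return k-mers whose support differs between the classes by at least min_diff.

    A positive value means the pattern is more frequent in the diseased class;
    a negative value means it is more frequent in the healthy class.
    """
    sup_d = kmer_support(diseased, k)
    sup_h = kmer_support(healthy, k)
    result = {}
    for kmer in set(sup_d) | set(sup_h):
        diff = sup_d.get(kmer, 0.0) - sup_h.get(kmer, 0.0)
        if abs(diff) >= min_diff:
            result[kmer] = diff
    return result


# Toy, made-up sequences for illustration only (not real gene data).
diseased = ["ACGTGA", "TTGTGA", "GTGACC"]
healthy = ["ACGTCA", "TTCACA", "ACGTAA"]
patterns = contrast_patterns(diseased, healthy, k=3, min_diff=0.6)
```

On this toy input, `GTG` and `TGA` occur in every diseased sequence and in no healthy one, so both are reported with a difference of 1.0, while `TCA` is reported with a negative difference because it occurs only in healthy sequences. Real bio-sequence comparison would need variable-length patterns, tolerance to mismatches, and a statistical test in place of a fixed threshold.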