Data Science

“So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.” Professor Hal Varian, Chief Economist at Google, speaking to the New York Times in February 2009.

Togaware Resources

OnePageR provides a growing collection of material for teaching yourself R. Each session is structured around a series of one-page topics or tasks, designed to be worked through interactively. The idea is to work through all of the one-pagers in each session at your own pace.

Rattle is a free and open source data mining toolkit written in the statistical language R using the GNOME graphical interface. It runs under GNU/Linux, Macintosh OS X, and MS/Windows. Rattle is used in business, government, and research, and for teaching data mining, in Australia and internationally. Rattle can be purchased on DVD (or made available as a downloadable CD image) as a standalone installation. Please contact sales@togaware.com for details.

An extended version of the book (consisting of early drafts for the chapters published as above) is freely available as an open source book, The Data Mining Desktop Survival Guide (ISBN 0-9757109-2-3). The book explains the otherwise complex algorithms and concepts of data mining in simple terms, with examples that illustrate each algorithm using the statistical language R. The book is being written by Dr Graham Williams, based on his 20 years of research and consulting experience in machine learning and data mining.

A Data Mining Course was held at the Harbin Institute of Technology Shenzhen Graduate School, China, 6-13 December 2006. This course introduced the basic concepts and algorithms of data mining from an applications point of view and introduced the use of R and Rattle for data mining in practice.

A Data Mining Workshop was held over two days at the University of Canberra, 27-28 November 2006. This workshop introduced the basic concepts and algorithms for data mining and the use of R and Rattle.

Using R for Data Science

The open source statistical programming language R (based on S) is in daily use in academia, business, and government. We use R for data mining within the Australian Taxation Office. Rattle is used by those wishing to interact with R through a GUI.

R holds its data in memory, so on 32-bit CPUs you are limited to smaller datasets (perhaps 50,000 to 100,000 rows, depending on what you are doing). Deploying R on 64-bit multiple-CPU (AMD64) servers running GNU/Linux with 32GB of main memory provides a powerful platform for data mining.
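A rough back-of-envelope calculation shows why dataset size matters for a memory-based tool. The sketch below is in Python and is only illustrative: it assumes every value is stored as an 8-byte double (as R does for numeric data), and the column count of 100 and the factor of 5 working copies are hypothetical figures chosen for the example, not measurements.

```python
BYTES_PER_DOUBLE = 8  # assumption: all values are 8-byte doubles, as with R numerics


def estimate_bytes(rows, cols, working_copies=1):
    """Approximate bytes needed to hold a rows x cols all-numeric table.

    working_copies is a hypothetical multiplier for the temporary copies
    that modelling code may create while it runs.
    """
    return rows * cols * BYTES_PER_DOUBLE * working_copies


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 ** 2)


if __name__ == "__main__":
    # Hypothetical table width of 100 columns; row counts taken from the text.
    for rows in (50_000, 100_000):
        raw = estimate_bytes(rows, cols=100)
        busy = estimate_bytes(rows, cols=100, working_copies=5)
        print(f"{rows:>7} rows: {to_mib(raw):6.1f} MiB raw, "
              f"{to_mib(busy):6.1f} MiB with 5 working copies")
```

Under these assumptions the raw data is modest, but the multiplier for working copies is what pushes a 32-bit process, with its address space of at most 4GB, towards its limit as datasets grow.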

R is open source, which assures us that we will always have the opportunity to fix and tune things to suit our specific needs, rather than having to convince a vendor to fix or tune their product for us.

Also, because R is open source, we can be sure that the code will always be available, unlike some of the data mining products that have disappeared (e.g., IBM’s Intelligent Miner).

Open standards are important for users, but vendors resist them for obvious reasons and would prefer to lock you in to their products. A number of commercial tools claim to support, for example, the open standard PMML for interoperability (sharing models between applications). But the support is patchy and not worth the effort. We have started a PMML effort in R to address the desire for interoperability.

Some commercial statistical products excel at handling very large datasets, but they are limited in the analytic algorithms they provide. Commercial vendors, naturally, need to be convinced of the usefulness of implementing new algorithms. In R, on the other hand, a vast selection of algorithms has long been available for deployment.