Downloaded a bunch of data and wanted to get started right away. Then realised I faced the noob issue of not having the right infrastructure or tools set up. R and RStudio are on my local machine, but the infrastructure to handle the large dataset is not.

Downloaded the Cloudera virtual machine, but it kept crashing for some reason whenever I tried to add an additional CD/DVD drive. Finally decided to try Revolution Analytics’ Enterprise R (since I can use the academic edition), although there might be some duplication: I already have the latest R and RStudio installed, while Revolution ships a slightly older R in its installation for now and has its own GUI. There’s a whole lot of installation needed too. I would have preferred a cleaner virtualised solution, but there’s no harm trying really.

After the bigdata.sg meetup I searched a bit more for a definition of what a data scientist is, the growing buzzword of the moment. I like the one in the Yahoo article that interviewed EMC Greenplum’s Steven Hillion. His take on the definition (the rest of the article can be found here):

To Hillion, data scientists are “analytically-minded, statistically and mathematically sophisticated data engineers who can infer insights into business and other complex systems out of large quantities of data.”

The skill set of the data scientist goes beyond the capabilities of what many would call “traditional business intelligence (BI).” Traditional BI is interested in the “what and the where,” while data scientists are interested in the “how and why,” Hillion says. “They’re interested in inferring things that are not already present in the data.”

I like the part where he mentions that they are “equal parts engineer, statistician and investigative journalist / forensic reporter”. I can relate to those, but something is missing: the programming/hacker skills? And of course the need to understand the business. They need to listen to people, understand what questions they’re asking, but then sort of read between the lines. Skills in mathematics, statistics, modeling and data mining are of course essential.

Went for my first Big Data Meetup last night. Short but interesting talks. It was nice to see many passionate people turning up on a Friday night for something probably not exactly part of their job.

David Smith of Revolution Analytics was there, as were their GM and a consultant. He gave a talk on “Future of Big Data Analytics: Data Science holds the key to unlocking insight”, which gave a good description of the skills he thinks are necessary for someone to become a data scientist, and of why machine learning alone is not enough. Basically data science sits in the sweet spot where 3 circles of a Venn diagram overlap: Hacking (computer and programming skills), Statistics, and Substantive domain expertise (getting data sources, a good understanding of the data, relationships between variables, inherent assumptions… essentially putting data and analysis into context so that they are meaningful not just in mathematical/model terms). It encouraged me a lot that he came from a statistics background, so at least I have 1 area covered.

Did not take many notes or photos of his slides, but it appears that data science involves a fair amount of effort at the initial stage: finding data sources, massaging, mashups, linking/mapping and cleaning. That’s because data sources rarely exist in nicely structured formats, and you need to source them from relational databases, web scraping and available APIs.

The analysis stage requires the muscles of statistics and machine learning. For large datasets, there is a need to move the code to the data, i.e. leverage parallel computing and MapReduce to process data at that scale efficiently. In a nutshell, there are 3 layers as shown below:

[Figure: the 3 layers]
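As an aside on “moving the code to the data” (this illustrates that idea only, it is not the three-layer picture from the talk): a minimal sketch in plain Python, assuming the data is already split into partitions that would live on different machines. The partition contents and the function names `map_partition` and `merge` are made up for illustration. Each partition is summarised where it sits, and only the small summaries are combined.

```python
from functools import reduce

def map_partition(rows):
    """'Map' step: summarise one partition locally as {key: (sum, count)}."""
    partial = {}
    for key, value in rows:
        s, c = partial.get(key, (0.0, 0))
        partial[key] = (s + value, c + 1)
    return partial

def merge(a, b):
    """'Reduce' step: combine two partial summaries into one."""
    out = dict(a)
    for key, (s, c) in b.items():
        s0, c0 = out.get(key, (0.0, 0))
        out[key] = (s0 + s, c0 + c)
    return out

# Hypothetical partitions; in a real cluster each would sit on its own node.
partitions = [
    [("A", 10.0), ("B", 4.0)],
    [("A", 20.0), ("B", 6.0), ("C", 1.0)],
    [("A", 30.0)],
]

# Each call below could run in parallel, next to its own data.
partials = [map_partition(p) for p in partitions]

# Only the small (sum, count) summaries travel over the network.
totals = reduce(merge, partials, {})
means = {key: s / c for key, (s, c) in totals.items()}
print(means)  # {'A': 20.0, 'B': 5.0, 'C': 1.0}
```

The point of carrying (sum, count) pairs rather than means is that they can be merged in any order, which is what lets the work be spread out.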

Q2: How will it evolve? Will data scientists be embedded within companies, or will specialised data science consulting firms emerge?

Two more people talked about the Heritage Health Prize and Kaggle. Kaggle really gave one of them a good platform to learn, practise and be validated (think: get a job interview if you win a prize). My plans are in the right direction at least. What’s left is execution. Kaggle, here I come (after exams).

There was a presentation on UP Singapore by a group called Newton Circus. Somewhat related, because technical developers or data scientist-wannabes can contribute greatly to their quest of:

Leveraging rich data from the government partners, financial support from corporate partners, NGOs and community members will identify critical urban issues and solutions, and use designers, developers and hackers to prototype workable products

The last speaker was Dr Liu Xiaohui from HP Research lab, who talked about the Bamboo initiative, which is meant not only to simplify the cloud infrastructure used in solving big data problems but also to ease the administration of that infrastructure. I don’t think I am doing justice to the initiative with my description; hopefully more information will become public soon.

Apparently there’s going to be a call for collaboration soon. Event to look out for: Cloud Asia. 14-17 May.

Not that I’m into football, but the video showed me some parallels with general learning and machine learning. Training and practice build up competency. The more training and the more practice, the more basic competency you have. But match conditions (exams, tournaments, competitions), where there are ‘opponents’, validate what you have learnt or trained for. As you win and get positive affirmation during match conditions, you gain experience (which includes heightened competency in what you have learnt from training and practice, and in scenarios you have not encountered before, where you applied what you have learnt).

So applying what you have learnt is like prediction. Training and practice are like sample data, allowing us to fit models and learn from them. A positive result in match conditions is like a good model that handles new responses in a more generalised manner and is not limited to the sample data.
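To make the analogy concrete, here is a small sketch in plain Python. Everything in it is an assumption for illustration: the data is synthetic (a straight line plus noise), the “training” set is every other point, and the “match” set is the points in between. It compares a model that only memorises its training data with a simple fitted line.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def memoriser_predict(train_x, train_y, x):
    """Return the stored answer of the closest training point (pure recall)."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Synthetic data: y = 1 + 2x + noise (assumed, not from any real dataset).
rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]

train_x, train_y = xs[0::2], ys[0::2]   # "training and practice"
test_x, test_y = xs[1::2], ys[1::2]     # "match conditions"

intercept, slope = fit_line(train_x, train_y)
line_train = mse([intercept + slope * x for x in train_x], train_y)
line_test = mse([intercept + slope * x for x in test_x], test_y)

memoriser_train = mse([memoriser_predict(train_x, train_y, x) for x in train_x], train_y)
memoriser_test = mse([memoriser_predict(train_x, train_y, x) for x in test_x], test_y)

print("line:      train MSE %.2f, test MSE %.2f" % (line_train, line_test))
print("memoriser: train MSE %.2f, test MSE %.2f" % (memoriser_train, memoriser_test))
```

The memoriser is perfect on the data it has already seen, because it just repeats the stored answers. What matters is how each model does on the unseen points, which is the “match” in the football analogy.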

Plan for post-exam period is as follows:
– complete Stanford U’s machine learning course by Prof Andrew Ng
– catch up on Caltech’s Learning From Data e-course
– find interesting projects in Kaggle and participate
– watch webinar by Revolution Analytics
– do special term module in microeconomics if all goes well

So basically to work on R, Hadoop and Machine Learning. Let’s go for it!

It’s real fun to do projects with real datasets and R. My first time series project was enriching. We used
+ regression with trend, seasonal and cyclical components
+ ARIMA on the residuals of the above
+ Direct ARIMA and Seasonal ARIMA fitting
to model the daily energy consumption of a factory for 2010 & 2011. Used January 2012 data as a hold-out set for out-of-sample validation.
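The first two steps and the hold-out check can be sketched roughly as below. This is not the project’s code or data: the series is synthetic, a weekly pattern is assumed for the seasonal part, the cyclical component is left out, and a hand-fitted AR(1) on the residuals stands in for the ARIMA step (a real project would use a proper ARIMA routine).

```python
import random

PERIOD = 7  # assumed weekly seasonality for daily data

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def fit_trend_seasonal(series):
    """Linear trend, then the average detrended value per day of the week."""
    a, b = fit_line(list(range(len(series))), series)
    detrended = [y - (a + b * t) for t, y in enumerate(series)]
    seasonal = [sum(detrended[k::PERIOD]) / len(detrended[k::PERIOD])
                for k in range(PERIOD)]
    return a, b, seasonal

def trend_seasonal_at(model, t):
    a, b, seasonal = model
    return a + b * t + seasonal[t % PERIOD]

def fit_ar1(resid):
    """Least-squares phi for resid[t] = phi * resid[t-1] + error."""
    num = sum(prev * cur for prev, cur in zip(resid, resid[1:]))
    den = sum(prev * prev for prev in resid[:-1])
    return num / den

# Synthetic stand-in: 730 days (two years) to fit, 31 days held out.
rng = random.Random(0)
weekly = [5, 5, 5, 5, 5, -12, -13]  # made-up weekday/weekend pattern
noise, series = 0.0, []
for t in range(730 + 31):
    noise = 0.6 * noise + rng.gauss(0, 2)  # autocorrelated noise
    series.append(100 + 0.02 * t + weekly[t % PERIOD] + noise)
train, holdout = series[:730], series[730:]

model = fit_trend_seasonal(train)
resid = [y - trend_seasonal_at(model, t) for t, y in enumerate(train)]
phi = fit_ar1(resid)

# h-step-ahead forecast: trend + seasonal + decaying share of the last residual.
forecast = [trend_seasonal_at(model, len(train) + h - 1) + (phi ** h) * resid[-1]
            for h in range(1, len(holdout) + 1)]

mae = sum(abs(f - y) for f, y in zip(forecast, holdout)) / len(holdout)
train_mean = sum(train) / len(train)
mae_naive = sum(abs(train_mean - y) for y in holdout) / len(holdout)
print("hold-out MAE: model %.2f vs. training-mean baseline %.2f" % (mae, mae_naive))
```

Comparing against a naive baseline (here, the training mean) on the held-out month is a cheap way to check that the fitted structure is actually helping.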

Many teams used datasets on exchange rates, CPI, tourism data etc. Something new learnt every day, like the ADF (Augmented Dickey-Fuller) and Phillips-Perron unit root tests for checking the stationarity of data. Need to confirm their null hypotheses though.
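For both the ADF and the Phillips-Perron test, the null hypothesis is that the series has a unit root (i.e. is non-stationary), so a sufficiently negative test statistic is evidence for stationarity. The sketch below shows the idea with the simplest Dickey-Fuller regression (constant, no extra lags), which is a simplification of the full ADF test. The two series are simulated, and -2.86 is the usual large-sample 5% critical value for the constant-only case. For real work a library routine is the safer choice.

```python
import math
import random

def dickey_fuller_t(series):
    """t-statistic of gamma in: diff(y)[t] = alpha + gamma * y[t-1] + error."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series, series[1:])]
    n = len(diffs)
    mx, my = sum(lagged) / n, sum(diffs) / n
    sxx = sum((x - mx) ** 2 for x in lagged)
    sxy = sum((x - mx) * (d - my) for x, d in zip(lagged, diffs))
    gamma = sxy / sxx
    alpha = my - gamma * mx
    ssr = sum((d - alpha - gamma * x) ** 2 for x, d in zip(lagged, diffs))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    return gamma / std_err

# Two simulated series: a random walk (unit root) and a stationary AR(1).
rng = random.Random(0)
random_walk, stationary = [0.0], [0.0]
for _ in range(499):
    random_walk.append(random_walk[-1] + rng.gauss(0, 1))
    stationary.append(0.5 * stationary[-1] + rng.gauss(0, 1))

CRITICAL_5PCT = -2.86  # asymptotic 5% value, regression with a constant

for name, data in [("random walk", random_walk), ("stationary AR(1)", stationary)]:
    t_stat = dickey_fuller_t(data)
    verdict = "reject unit root" if t_stat < CRITICAL_5PCT else "cannot reject unit root"
    print("%-17s t = %6.2f -> %s" % (name, t_stat, verdict))
```

Note that the statistic is compared with Dickey-Fuller critical values rather than the usual t-table, because under the null the statistic does not follow a t-distribution.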

Recently I got to read up on machine learning, mostly introductory and elementary stuff. The more I read, the more similarities (and of course differences as well) I found with statistics. This post I found (Statistics vs. Machine Learning, fight!) shed more light on the two topics.