I've been watching some of the tweets coming out of Data Incite 2013. A few have caused my eyebrows to raise. A caveat on this: tweets are always a bit dangerous, because people have to get their entire thought over in 140 characters and you don't know what context they are saying these things in. Reading some of my past tweets I sometimes wonder what I was drinking at the time :-)

How do you automate curiosity, creativity and innovation, three of the qualities a Data Scientist needs? Tricky!

There's a great book called The Genesis Machine by James P. Hogan. In the story, one of the scientists crucial to a project leaves. The government paymasters tell the remaining scientists just to reproduce his work and carry on. The scientists try to explain that it's just not possible to reproduce genius to order: "You can't tell a Rembrandt to go paint a masterpiece today." It's a bit like that for Data Science: we can automate some of the processes that surround it, but we can't automate the core, the creativity and the curiosity.

Traditional Manual Analytics is Dead

Hmmmm, I don't think so, at least not for a very long time.

Firstly, manual analytics is one of the primary tools of a data scientist. You look at some data, you see something interesting, you do a bit of quick and dirty analysis on it. It's interesting but not quite right, so you tweak it, you run some more analysis, you add some more data to the mix, it's getting better, so you... repeat as necessary.

Secondly, it's too deeply embedded :-) There was a great keynote at the Strata EU Conference earlier this week from Felienne Hermans called "Spreadsheets: The Dark Matter of IT ...". She makes the point that while we have great BI tools now, people are not going to stop using Excel. It's too useful and too easy to use. And guess where people do a lot of their manual analytics - ummm, that'd be Excel.

I presented Raspberry Flavoured Hadoop as an Ignite presentation last night at Strata EU.

It's always fun doing an Ignite presentation. For those of you who don't know the format, it's 20 slides in 5 minutes, with the slides auto-advancing every 15 seconds. You're not allowed any notes, so you have to prep in advance and make sure you know your slides and what you're going to say. Saying that, of course, as any presenter knows, "No presentation ever survives contact with the audience", and that goes double for Ignite presentations.

Below is the annotated presentation with what I meant to say for each slide (the blue boxes). Those boxes weren't on the slides I presented, so there was a bit of divergence at times :-) but I think most people found it amusing.

UPDATE: They posted the videos of the Ignite presentations to YouTube (thanks to Doug Masten for finding it) so now you can watch it in all its glory. If you're really feeling bored you can look at the annotated slides above and see what I was going to say, compared to what I actually said.

Hadoop isn't that easy for a beginner to learn. It's a relatively new environment, and the instructions tend to assume the implementer is quite computer literate and has a fair number of Linux skills.

I'm a firm believer that the best way to learn things is by doing them, and especially by having to fix them when it all goes horribly wrong :-).

So when I started looking at Hadoop a while ago I decided that the best way to learn it was to build an Hadoop cluster. That presented a number of problems. The first was of course, what to build it on.

To build a meaningful cluster you're going to need at least five or six machines. There are various ways you can do this.

You can do it using virtual machines, and in fact this is probably the easiest way to do it. If you look around, any number of people will offer you pre-built Hadoop VMs to play with. But that breaks the first rule of learning: you're not doing the install, so you're not going to learn anything about how Hadoop is installed or its inner workings. You can certainly build your own VMs, but that divorces you from the hardware :-(

You can do it on a Cloud Service such as Amazon EC2 - but that can get expensive and it's still divorcing you from the hardware :-(

You can build it on a number of second-hand or scrounged PCs. This'll certainly work and you will definitely get your hands dirty with the hardware - probably very dirty as you clean out the several years' worth of grime that always infests older PCs. There are other disadvantages to this approach that may not be immediately obvious: the cost of running 5 or 6 PCs, the heat they generate, the amount of desk space they take up, and the objections from your better half about the jet-engine-like noise from the fans as you start them all up. A colleague of mine who followed this approach used to start his cluster up remotely for demo purposes, but had to stop when his wife threatened to disassemble it if he wasn't present when it started.

So what's the alternative?

Meet the Raspberry Pi, a credit-card sized computer that was launched about 18 months ago by the Raspberry Pi Foundation as an education tool. It's a complete computer with an ARM CPU, 512MB RAM, video, 10/100Mb ethernet, USB ports and SD card storage on a single board the size of a credit card. And the killer bit - it costs $35 (about £25).

You see where I'm going with this :-D

Make no mistake about it - there are challenges to using the Raspberry Pi - it's very resource limited. The CPU is a 700MHz ARM processor, the RAM is only 512MB and the network is only 100Mb. But overcoming challenges helps you learn - though you may lose some hair in the process ;-)

There's a great quote from Meet The Robinsons: "From failure you learn; from success, not so much". Implementing an Hadoop cluster on Raspberry Pis certainly provided me with some failures :-)

To get started I built a single node setup - the good news is the hardware only costs about £40. A Raspberry Pi model B, a 16GB SD Card, a PSU and a network cable.

TIP: Only buy quality SD cards and try to get a Class 10 card. I know a lot of people have problems with SD cards corrupting on Raspberry Pis. So far I haven't had this happen to me (touch wood).

And just to be really adventurous I decided that I'd install Hadoop 2.2, as it's the latest release version. Why is this adventurous? There are various blog entries around on how to install Hadoop 1.x, 2.0 & 2.1 (beta) on Raspberry Pis, but nothing on 2.2, and 2.2 introduced some changes in Hadoop that affect the installation.
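To give a flavour of why the older guides don't quite carry over: Hadoop 2.x runs MapReduce on top of YARN, and 2.2 renamed a well-known setting along the way (the shuffle aux-service value, which older guides give as "mapreduce.shuffle"). A minimal yarn-site.xml for a memory-starved node might look something like this. The property names are standard Hadoop 2.x configuration, but the memory value is just my guess at something sane for a 512MB Pi, not a tested recommendation:

```xml
<!-- yarn-site.xml: minimal YARN settings for a low-memory node -->
<configuration>
  <property>
    <!-- In 2.2 this value is "mapreduce_shuffle"; pre-2.2 guides
         say "mapreduce.shuffle", which will fail on 2.2 -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <!-- Total RAM YARN may hand out on this node; a guess for a
         512MB Pi that still leaves room for the OS and daemons -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>256</value>
  </property>
</configuration>
```

If you follow a 1.x or 2.0/2.1 guide verbatim and your NodeManagers refuse to start, this renamed value is one of the first things worth checking.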