The Prologue.

Recently I’ve been very curious; I know that alone makes people in tech really nervous. I was curious to find the first mentions of Big Data and Hadoop in this blog: April 2012. The previous year I’d been doing a lot of reading on cloud technologies and, more than anything, data. My thirty-year focus is data, and right now, in 2017, I’m halfway through.

The edge, as I saw it, would be to go macro on data and insight; that had been my thought ten years earlier. The whole play with customer data was clear in my mind back then. In 2002, though, we didn’t have the tooling, so we made it ourselves. Crude, yes. Worked, it did.

When I moved to Northern Ireland I kept talking about the data plays, mainly to deaf ears. Some got it. Most didn’t. “Hadoop, never heard of it”. Five years later everyone has heard of Hadoop… too late.

It’s usually about now we have a word cloud with lots of big data related words on it.

Small Data, Big Data Tools

Most of the stories I hear about Big Data adoption are just this: using Big Data tools to solve small data problems. On the face of it, the amount of data an organisation holds rarely justifies huge tooling like Hadoop or Spark. My guess, and I’ve seen it partially confirmed, is that the larger platforms like Cloudera, MapR and Hortonworks compete over a very narrow field of genuinely big customers.

Let’s be honest with ourselves: Netflix- and Amazon-sized data are deviations from the mean rather than the mean itself, and the probability of that kind of data landing in your lap is very small unless it’s made public.

I found out personally in 2012, when I put together Cloudatics, that big data tools are a very hard sell. Many companies just don’t care, not all understand the benefits, and those who did care still didn’t see how it would apply to them. Your pipeline is slim; at a guess a 100:1 ratio would apply, and that was optimistic then, let alone five years on.

Most of us aren’t near “Average Sized Data”, let alone Big Data.

When I first met Bruce Durling back in late 2013 (he probably regretted that coffee) we talked about all the tools: how there’s no need to write all this Java stuff when a few lines of Pig will do, and how solving a specific problem with existing big data tools was far better than trying to launch a platform (yup, know that, already tried).

What Bruce and I also know is that we work with average-sized data… it’s not big data but it’s not small data. Do we need Hadoop or Spark? Probably not. Can we code and scale it on our own? Yes we can. Do we have the skills to do huge data processing? You betcha.

I sat in a room a few weeks ago where mining 40,000 tweets was classed as a monumental achievement. I don’t want to burst anyone’s bubble, but it’s not. Even 80 million tweets is not a big data problem, nor even an average-sized data one. Doing sentiment analysis on that volume took under a minute on my laptop.

Now enter the life-saving AI!

And guess what, it looks like the same mistake is going to be repeated. This time with artificial intelligence. It’ll save lives! It’ll replace jobs! It’ll replace humans! It can’t tell the difference between a turtle and a gun! All that stuff is coming back.

If you firmly believe that a black box is going to revolutionise your business then please, be my guest. Just have your legal team and customer service department ready; AI is rarely 100% accurate.

Like big data, you’ll need tons of data to train your “I have no idea how it works, it’s all voodoo” black-box algorithm. The less you train it, the more error prone your predictions will be. Ultimately the only thing it will harm is the organisation that ran the AI in the first place. Take it as fact that customers will point the finger straight back at you, very publicly, if you get a prediction wildly wrong.

I’ve seen Google’s video and Amazon Alexa’s voice classification neural networks do amazing things; the usual startup on the street may have access to the tools but rarely the data to train them. And my key takeaway since doing all that Nectar card stuff: without quality data, and lots of it, your fight will be a hard one.

I think there are still a good few years at the R&D coalface figuring out where AI fits properly. Yes, jobs will be replaced by AI, and new jobs will be created. Humans will sit alongside robotic machines that take the heavy lifting away (that was going on for a long time before the marketers got hold of AI and started scaring the s**t out of people with it).

It’s not impossible to start something in the AI space and put it on the cloud, though the costs can add up if you take your eye off the ball. The real question is, “do you really have to do it that way? Is there an easier method?”. Most crunching could be done on a database (not a blockchain, may I add); hell, even an Excel spreadsheet is capable for some who lack the programming knowledge or the money to spend on services.

Popular learning methods are still based on the tried and true: decision trees, logistic regression and k-means clustering, not black boxes. The numbers can be worked out away from code as confirmation, though who actually does that is a different matter entirely. The most well known algorithms can be reverse engineered: decision trees, Bayes networks, support vector machines, logistic regression; the maths is laid bare, showing how they work. The rule of thumb is simple: if traditional machine learning methods are not showing good results then try a neural network (the backbone of AI), but only as a last resort, not the first go-to.

If you want my advice, try the traditional, well tested algorithms first with the small data you have. I even wrote a book to help you…

An interesting conversation came up during a tea break at a London meeting this week: how do you run R scripts from within Clojure? One way was simple, the other (mine) was far more complicated (see the “More Complicated Ways” section below).

So here’s me busking my way through the simple way.

Run it from the command line

The Clojure Code

Using the clojure.java.shell package gives you access to Java’s system command process tools. I’m only interested in running a script, so all I need is the sh command.

(ns rinclojure.example1
  (:use [clojure.java.shell :only [sh]]))

The sh function produces a map with three keys: an exit code (:exit), the output (:out) and an error (:err). I can evaluate the output map and ensure there’s no error code (anything that’s not zero), then dump the error, or if all is well send out the output.
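Here’s a minimal sketch of that function (the name run-r-script is my choice):

(defn run-r-script
  ;; runs an R script via Rscript, returning the output if the
  ;; exit code is zero and the error otherwise
  [filepath]
  (let [result (sh "Rscript" filepath)]
    (if (zero? (:exit result))
      (:out result)
      (:err result))))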

The R Code

I’ve kept the function simple: I’m only interested in running Rscript and checking the exit code. If all is well we show the output, otherwise we send out the error.

The now preferred way to run R scripts from the command line is the Rscript command, which is bundled with R when you download it. If I have R scripts saved, it’s a case of running them through Rscript and evaluating the output.

Here’s my R script.

myvec <- c(1,2,3,2,3,4,5,4,3,4,3,2,1)
mean(myvec)

Not complicated, I know: just a vector of numbers and a function to get the average.

Running in the REPL

Remember that :err captures errors from the running of the command. If you mess up the R code itself, Rscript reports that on stderr too, so expect it in the :err value along with a non-zero exit code; anything your script prints normally lands in :out.

The output is easy enough to parse by removing the \n and the [1] prefix that R generates. We’re not interacting with R, only dumping out its output; after that there’s a certain amount of string manipulation to do.
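In the REPL, and assuming the script is saved as meantest.R, that looks something like this (the value is just the mean R prints for our vector):

rinclojure.example1> (run-r-script "meantest.R")
;; => "[1] 2.846154\n"
rinclojure.example1> (clojure.string/trim
                      (clojure.string/replace (run-r-script "meantest.R") "[1] " ""))
;; => "2.846154"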

Expanding to Multiline Output From R

Let’s modify the meantest.R file to give us something multiline.

myvec <- c(1,2,3,2,3,4,5,4,3,4,3,2,1)
mean(myvec)
summary(myvec)

Nothing spectacular I know but it has implications. Let’s run it through our Clojure command function.

We have no reference to what each number means: is it the min, the max, the mean and so on? At this point there’s more string manipulation required, and you could convert the headings to keywords or just add your own.
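Here’s the sort of thing I mean, as a hedged sketch (function and keyword names are mine):

(require '[clojure.string :as str])

(defn r-output->lines
  ;; splits the raw Rscript output into non-blank lines
  [output]
  (->> (str/split-lines output)
       (remove str/blank?)
       (vec)))

(defn summary-values->map
  ;; keys the values line of R's summary() output with keywords
  ;; of our own choosing; the names aren't from R
  [line]
  (zipmap [:min :first-quartile :median :mean :third-quartile :max]
          (map #(Double/parseDouble %)
               (str/split (str/trim line) #"\s+"))))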

More Complicated Ways.

Within the R libraries exists the rJava package, which lets you run Java from R and R from Java. I wrote a chapter on R in my book back in 2014.

It’s not the easiest thing to set up but it’s worth the investment. There’s a Clojure project on GitHub that acts as a wrapper between R and Clojure, clj-jri. Once set up, you run R as an REngine and evaluate the output that way. There’s far more control, but it comes at the cost of complexity.

Keeping Things Simple

Personally I think it’s easier to keep things as simple as possible and use Rscript to run the R code, but it’s worth considering the following points.

Keep your R scripts as simple as possible, output to one line where possible.

Ensure that all your R packages are installed and working; it’s not ideal to install them during the Clojure runtime, as the output will become hard to parse. Also make sure the libraries are installed on the same instance your Clojure code runs on.

In the long run, have a set of solid string manipulation functions to hand for dealing with the R output. Remember, it’s one big string.

I promise this is the last part of The Darcey Coefficient. Having gone through linear regression, neural networks and refining the accuracy of the prediction, it was only fair I ran the linear regression against some live scores to see how it performed.

If you want to read the first four parts (yes four, I’m sorry) then they are all here.

Finally, I need the data: a vector of vectors with Craig, Len, Bruno and Darcey’s scores. I leave Darcey’s actual score in so I have something to test against: what the predicted score was versus what the actual score was.
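As a sketch, using the sample rows from the raw data:

(def judges-scores
  ;; each row is [craig len bruno darcey], keeping Darcey's
  ;; actual score so there's something to test against
  [[2 5 5 5]
   [5 6 4 5]
   [3 5 4 4]
   [4 6 6 7]
   [6 6 7 6]
   [7 7 7 7]])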

Predicting Against Craig’s Scores

The difference between Craig’s and Darcey’s scores can fluctuate depending on the judges’ comments. The dance with Ed and Katya is a good example: Craig scored 2 and Darcey scored 6, so I’m not expecting great things from this regression, but as it was our starting point let’s test it.

Refinement is an iterative process, sometimes quick and sometimes slow. If you’ve followed the last few blog posts on score prediction (if not, you can catch up here), I’ve run the data once and rolled with the prediction; basically, “that’s good enough for this”.

The kettle is on, tea = thinking time

This morning I was left wondering, as Strictly is on tonight, is there any way to improve the reliability of the linear regression from the spreadsheet? The neural network was fine, but for good machine learning you need an awful lot of data to get a good prediction fit. The neural net was level pegging with the small linear model, at about 72%.

I’ve got two choices: create more data to tighten up the neural net, or have a closer look at the original data and find a way of changing my thinking.

Change your thinking for better insights?

Let’s remind ourselves of the raw data again.

2,5,5,5
5,6,4,5
3,5,4,4
4,6,6,7
6,6,7,6
7,7,7,7

Four numbers, the scores from Craig, Len, Bruno and Darcey in that order. The original linear regression only looked at Craig’s score to see the impact on Darcey’s score.

That gave us the prediction:

y = 0.6769x + 3.031

And an R squared value of 0.792, not bad going. The neural network took into account all three scores from Craig, Len and Bruno to classify Darcey’s score; it was okay, but the lack of raw data let it down.
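As a one-liner in Clojure (the function name is mine), that fitted line is:

(defn predict-darcey
  ;; y = 0.6769x + 3.031, with x being Craig's score
  [craig-score]
  (+ (* 0.6769 craig-score) 3.031))

;; (predict-darcey 7) => ~7.77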

Refining the linear regression with new learning

If I go back to the spreadsheet, let’s tinker with it. What happens if I combine the three scores using the SUM() function to add them together?

Very interesting, the slope is steeper for a start. The regression now gives us:

y = 0.2855x - 1.2991

And the R squared has gone up from 0.792 to 0.8742, an improvement. As it stands, this model is now more accurate than the neural network I created.
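In Clojure terms (again, the function name is mine) the new fit becomes:

(defn predict-darcey-summed
  ;; y = 0.2855x - 1.2991, with x being the sum of the
  ;; other three judges' scores
  [craig len bruno]
  (- (* 0.2855 (+ craig len bruno)) 1.2991))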

Concluding

It’s a simple change, quite an obvious one, and we’ve taken the original hypothesis forward since the original post. How accurate is the linear regression? Well, I’ll find that out tonight, I’m sure.

Last night I attended the Royal Irish Academy lecture “Show me your data, and I’ll tell you who you are” at Ulster University’s Magee Campus. Dr Brian MacNamee’s talk sidestepped any technicalities and aimed for a general audience; it was very good, informative and entertaining.

One thing I did notice was the mix of the audience: students from school, members of the public and some lecturers from the university. No entrepreneurs in the room that I could see, though, which was a shame.

It was the number of school ties in the room that inspired and prompted me to ask a question during the Q&A: “What do you see as the challenges to machine learning over the next 5 to 10 years?”

Some Relics

Some of the machine learning algorithms we’re using are old. Take the decision tree, for example: the ID3 algorithm designed by Ross Quinlan goes back to 1986, so it’s on its thirtieth birthday. Threshold logic, which became the foundation for neural networks, dates back to the work of Warren McCulloch and Walter Pitts in 1943; that’s 73 years ago.

Most of our modern machine learning systems are based on old technologies and algorithms, so is there an opportunity to refine and redevelop them? Is there an opening for new algorithms? I believe there is, and the ones who will carry that torch are possibly the ones sat in the Magee lecture room last night, proudly wearing their uniforms.

Thanks to Digital Circle for listing the event, I wouldn’t have known about it otherwise.

The @nitechrank was a simple index to report the changes in programming jobs in Northern Ireland. It wasn’t anything scientific, and for all my hailing of the automation of all things possible, meaningless and trivial… well, it was me editing the tweets every morning, with a fresh cup of tea, half asleep and in my dressing gown 99% of the time*. Today I ran the last index as the picture became clear…

The Good News

In linear regression terms the jobs outlook is positive: it slopes upwards over time. That’s not to say there aren’t down days; the start of June was a bit of a surprise, but so was the peak, which happened the day after the referendum. With a low of 42 jobs and a high of 92 jobs it got a little varied, but nothing troubling.

Keep in mind, though, that the only data source, NiJobs.com, seems to list larger companies, so you can assume with a certain amount of confidence that startups won’t list there and will look for developers by word of mouth. That’s one of the reasons Java may feature so highly in the results as a whole.

For Developers

If you are interested in Clojure then the repository for the code is available on my GitHub account. Take from it what you will; it’s slung together to get the numbers out. Note there’s no support, but the README should be clear enough to get things working. I won’t be doing any updates on this repo, it’s just there for anyone who wants to look. I’ve also included the original shell script that did the ranking, but only for historical purposes.

For Data Folk

Every day the NiTechrank run updated a CSV file, so in the data directory is the CSV data of all the indexes that were run. Have fun with it; I’ve not done any analysis on it.

Validating the Muse.

Interestingly, a lot of the comments centred on decision trees, which at least proves they are still popular, but also proves that you can sit down with pen and paper and do some of the grunt work yourself to validate the model.

Now, most people will load a dataset into something like Weka and let the system do all the work. And you know what, that’s okay; there’s nothing wrong with that. At the same time, though, I could, with some effort, work out the information gain myself with a calculator to find the potential root node of a tree, and prove the model was good.
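For example, the entropy figure that information gain is built on is only a few lines of arithmetic. A minimal Clojure sketch (names mine):

(defn entropy
  ;; Shannon entropy of a sequence of class labels
  [labels]
  (let [n (double (count labels))]
    (->> (vals (frequencies labels))
         (map (fn [c]
                (let [p (/ c n)]
                  (- (* p (/ (Math/log p) (Math/log 2)))))))
         (reduce +))))

;; (entropy [:yes :yes :no :no]) => 1.0
;; Information gain for a candidate split is the parent's entropy
;; minus the weighted average entropy of the child nodes.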

The same could be said for the likes of the Apriori algorithm, Naive Bayes and Bayes networks, linear regression, k-means clustering and, at a push, linear support vector machines. If you have a pen, paper and a calculator you can start working things out.

When we get to neural networks that’s where things start to get hazy, really hazy.

The Media’s Love Affair with Neural Nets

Oh boy, the tech press likes a good ANN story, whether it’s the DeepMind team beating Lee Se Dol in three games of Go, IBM’s Watson doing the whole Jeopardy thing, or Google’s self-driving car. They are all, without doubt, sexy AI stories that generate discussion and debate. And the joy of debate is that it raises up polar opposites of public opinion: love it or loathe it, it’s going to help us or it’s going to destroy us.

The core concepts of neural networks, from perceptron weights to activation functions, are quite easy to grasp. The problems arise once the models have been created: the maths can become so black-box-like that the models are difficult to prove or write off. One thing’s for sure, over time and with enough iterations it will get better.
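To show how simple those building blocks are, here’s a toy single neuron in Clojure; a sketch, nothing more:

(defn sigmoid
  ;; classic activation function, squashing any input into (0, 1)
  [x]
  (/ 1.0 (+ 1.0 (Math/exp (- x)))))

(defn neuron-output
  ;; weighted sum of the inputs plus a bias, passed through
  ;; the activation function
  [weights bias inputs]
  (sigmoid (+ bias (reduce + (map * weights inputs)))))

;; (neuron-output [0.5 -0.3] 0.1 [1.0 2.0]) => ~0.5

The haziness isn’t in one neuron; it’s in what thousands of them, layered up with learned weights, end up representing.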

Like I said in the book, “One of the keys to understanding the artificial neural network is knowing that the application of the model implies you’re not exactly sure of the relationship of the input and output nodes. You might have a hunch, but you don’t know for sure. The simple fact of the matter is, if you did know this, then you’d be using another machine learning algorithm.”

Though there are good books on the subject, the models themselves are always difficult to prove; with enough training you’ll get results. Even the publications I hold in high esteem, such as “Data Mining” by Witten, Frank and Hall, merely skirt around the mechanics of neural nets, which, to be fair, made me feel a whole lot better when I was writing my book at the time.

Artificial Intelligence as Frameworks

What I believe we are seeing is the chasm point of artificial intelligence frameworks. Some have been around for a while, Weka and RapidMiner for instance, and others are new on the scene, such as TensorFlow. The common thread, though, is that they provide a starting point for machine learning and AI to the mass market developer.

It’s very much like the web frameworks of the Web 2.0 era. The main tipping point was Ruby on Rails, which obscured a lot of the hard work going on under the covers. This led to a plethora of web frameworks in a variety of languages where you really didn’t need to know what was going on technically; it was just a case of downloading, setting a few things up and then going through the motions of creating the objects you needed. There came a time when it was more important to know how to get the framework working than the underlying language doing all the work. I believe we’re at the same point with artificial intelligence.

Data Velocity + AI + Lack of Algorithmic Knowledge = Concern

While some machine learning algorithms have been around for forty-plus years, it’s only recently they’ve come into vogue, thanks to computing processing power, the vast amounts of data generated and the need for corporations to push the bottom line down to keep stakeholders happy over the long term.

As we push vast AI knowledge into the hands of developers who may have no prior knowledge of how the stuff really works, is this a good idea? I don’t believe so, and any corporation saying “we need to do data science” to a team that doesn’t know what it’s doing is committing commercial suicide in my eyes.

If you look at Google and Tesla, they’ve been analysing data over a long period of time. They’ve got the right people involved, whether that be developers, quants or hardcore maths folk, to measure, refine and deliver. Even then it goes wrong. The first crash caused by a self-driving car? Well, it was bound to happen at some point. You’re working on a probability, and regardless of the odds it can, and will, at some point go the way you weren’t expecting. The point with AI, though, is that you’re not delving into the algorithm to tweak it at that stage; if you did, what would be the knock-on effect of the change? You don’t really know. That far down the line all you can do is provide more data for the algorithm to learn from.

Basically you’re left with unanswered questions, but you just have to go with the system because the algorithm says so.

AI and Machine Learning Costs Money

To be done right, these technologies take time to develop and deliver. They also need repeat testing to ensure the application is behaving as you’d expect. That’s all fine with supervised learning, as you’ve already defined the outcomes in the training data. Unsupervised learning comes with its own set of issues that need to be looked at closely before it’s deployed to the real world.

While any developer can download these tools and use them, I still firmly believe it’s vitally important to have knowledge of how the algorithms work. That’s not the easiest thing in the world either. I’d rather have an explanation of what happened than shrug my shoulders with a blank look on my face.

“The plane crashed because the algorithm did it” is just not a reasonable excuse.

A long while ago the powers that be decided to cut air passenger duty (APD) on long haul flights. Yesterday the Scottish government started a consultation on cutting APD. So the question is: how many flights are actually affected by this APD cut in Northern Ireland?

Northern Ireland’s Air Passenger Duty

Any flight from Northern Ireland that flies direct to a destination over 2,000 miles away is classed as long haul and is exempt from APD. Changeover routes don’t count: for example, if you flew from Belfast International to London Heathrow and then legged it to Dubai, that doesn’t qualify.

So how do we find out which flights apply? With some open data and some Clojure.

Airports and Routes

OpenFlights has CSV files for airports, airlines, routes and, potentially, schedules. I’m only interested in the airports and routes.

Routes are a source and destination airport by IATA code and an operating airline.

LH,3320,BFS,465,EWR,3494,Y,0,752

A Quick Checklist

So, we’ve got a question and we’ve got some open data. Now I need a checklist of what needs doing to get to the answer.

Load the CSV files.

Get the airport info of the departure airport (BFS in our case).

Get the matching routes where BFS is the source airport.

Calculate the distance between the two lat/lon points of each airport.

Check the flight is classed as long haul.

Calculate the percentage.

Loading CSV Files

This is always worth knowing how to do in Clojure, especially how to open a CSV file and convert each row to a map with keys. First of all we need to know the header information of the CSV file; as the routes and airports files don’t include it, we have to find it out and create our own references.
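Here’s a hedged sketch using the clojure.data.csv library; the namespace, key names and file paths are all my own choices:

(ns apd.core
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

(def airport-keys
  [:id :name :city :country :iata :icao :lat :lon
   :altitude :timezone :dst :tz])

(def route-keys
  [:airline :airline-id :source-airport :source-airport-id
   :dest-airport :dest-airport-id :codeshare :stops :equipment])

(defn load-csv
  ;; reads a headerless CSV file and zipmaps each row
  ;; against the keys we've defined ourselves
  [filepath header-keys]
  (with-open [reader (io/reader filepath)]
    (->> (csv/read-csv reader)
         (map #(zipmap header-keys %))
         (doall))))

(def airports (load-csv "airports.dat" airport-keys))
(def routes (load-csv "routes.dat" route-keys))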

Finding Specific Airport Information

With an IATA code I can find out information about the airport from the data we’ve already loaded in. The get-airport function takes a string parameter with the IATA code and returns a map of the airport info.

The filter function rattles through each entry and returns a sequence of the maps that match the criteria, so it’s just a case of taking the first entry (there should only be one anyway).
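Something like this, assuming the airports var from the loading sketch above:

(defn get-airport
  ;; returns the airport map for the given IATA code
  [iata-code]
  (first (filter #(= iata-code (:iata %)) airports)))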

Get The Matching Routes For the Source Airport

Now that we’ve got some helper functions to do some of the work, we can get to the meat of what needs to happen. Finding the matching routes is a case of filtering all the routes and keeping the ones that match our source airport.

Assuming the routes csv file is loaded in, we can use the filter function like so:

(filter #(= (:iata dept-airport) (:source-airport %)) routes)

Once I have the routes I need to map through each of them, with the aim of creating a map holding the source, the destination, the lat/lon of each airport, a distance, the airline and whether the flight is long haul or not. While the function looks complex it’s actually fairly simple.
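Here’s a sketch; it wraps the filter above into a find-routes function and leans on the two helpers defined next, hence the declare:

(declare distance-between is-long-haul?)

(defn find-routes
  ;; all routes whose source airport matches the IATA code
  [iata-code]
  (filter #(= iata-code (:source-airport %)) routes))

(defn route->info
  ;; summarises one route: airports, airline, distance and
  ;; whether it counts as long haul
  [route]
  (let [src (get-airport (:source-airport route))
        dst (get-airport (:dest-airport route))
        distance (distance-between src dst)]
    {:source (:iata src)
     :destination (:iata dst)
     :airline (:airline route)
     :source-latlon [(:lat src) (:lon src)]
     :dest-latlon [(:lat dst) (:lon dst)]
     :distance distance
     :long-haul (is-long-haul? distance)}))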

There are two functions I need to create before we can test this: the distance calculation, and whether the flight is classed as long haul.

Calculating Distances

With two latitude and longitude points we can calculate the distance in miles. I actually covered this a long while ago, when I was first dabbling with Clojure (thrown in at the deep end may be a better description) and started working with Mastodon C.
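One standard way is the haversine formula for great-circle distance; here’s a sketch, with the lat/lon strings from the CSV parsed on the way in:

(defn haversine-miles
  ;; great-circle distance in miles between two lat/lon points
  [lat1 lon1 lat2 lon2]
  (let [r 3959.0 ; Earth's radius in miles
        dlat (Math/toRadians (- lat2 lat1))
        dlon (Math/toRadians (- lon2 lon1))
        a (+ (Math/pow (Math/sin (/ dlat 2)) 2)
             (* (Math/cos (Math/toRadians lat1))
                (Math/cos (Math/toRadians lat2))
                (Math/pow (Math/sin (/ dlon 2)) 2)))]
    (* r 2 (Math/asin (Math/sqrt a)))))

(defn distance-between
  ;; distance in miles between two airport maps
  [airport-a airport-b]
  (haversine-miles (Double/parseDouble (:lat airport-a))
                   (Double/parseDouble (:lon airport-a))
                   (Double/parseDouble (:lat airport-b))
                   (Double/parseDouble (:lon airport-b))))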

Is the flight long haul?

With a distance we can figure this out quite easily. My threshold is 2000 miles so I’m going to wrap that up as a piece of data in Clojure.

(def apd-threshold 2000)

Next it’s a case of finding out if the distance is greater than the threshold.

(defn is-long-haul? [distance]
  (> distance apd-threshold))

Calculating the Percentage

With the total number of routes, and the number of those classed as long haul, I can work out a percentage. Using a mixture of Clojure’s count and filter functions we can work it out with ease.

I’m passing in the source airport IATA code and using the find-routes function created earlier to get a list of all the matching routes from that airport. The first count is filtered on whether the :long-haul value is true; the second count is the total number of routes in the list.
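Pulled together, a sketch:

(defn percentage-long-haul
  ;; percentage of routes departing the given airport that
  ;; are classed as long haul
  [iata-code]
  (let [route-infos (map route->info (find-routes iata-code))]
    (* 100.0 (/ (count (filter :long-haul route-infos))
                (count route-infos)))))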

That’s the checklist complete. Now it’s time to see what percentage of the routes for an airport are actually classed as long haul.

Testing With the REPL

First of all I’m going to test the find-routes function; that will give you an idea of the data structure as it looks before the percentage is calculated.
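Something along these lines; the exact values depend on the live data, though the BFS to EWR route from earlier is comfortably over 2,000 miles, so its :long-haul flag should come out true:

apd.core> (route->info (first (find-routes "BFS")))
;; => {:source "BFS", :destination "...", :airline "...",
;;     :distance ..., :long-haul ...}
apd.core> (percentage-long-haul "BFS")
;; => the percentage of BFS routes classed as long haul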

The media has done a really good job of jumping on data, data science, big data, Hadoop, Spark and all the other words that associate themselves with the core word, “data”.

What I’m finding more and more is that we’re expected to accept what’s presented to us as verified and right. Guess what: that might not be the case. Your TED talk, your piece in the Economist and so on are all very well, but if you can’t publish your data, your model and your path to the conclusion, then why should I believe you?

For any article I’m looking for the evidence: a link that takes me back to the beginning of your thoughts and processes. What was your hypothesis? “We mined 55,000 variables…”, great! What are they, where can I see them? Can I give you some feedback?

As we peddle more data stories (let’s put infographics to one side, they tend to be rubbish most of the time anyway) and expect the masses to accept what’s presented as truth, I’ll be the outlier on the sidelines, shouting and annoyingly poking a stick at your side: “where’s your raw data and where’s the model you used?” Why? Because I’m interested and want to know, but I want to verify it for myself.

The last post was the preamble; now it’s time to get some actual code down. And it turns out I’d already kicked a hornet’s nest of opinion before I’d written a line of Clojure. It seems programmers shouldn’t be let near mathematical notation.

Mad Math, Fury Code

Let’s have a look at the scary math bit of The Birchbox Problem again.
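From the pieces picked apart below, my rough reconstruction of it (my reading, not necessarily the original notation) is:

\max_{B} \sum_{i \in N} \sum_{j \in M} H(i, j) \, B(i, j)

In words: choose the yes/no assignment B of products to customers that maximises the total happiness H across all customers in N and products in M.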

Breaking Down The Problem

It’s always easier to break a problem down into smaller chunks. Well, I believe so; I could come unstuck, but it’s worth a shot. On the surface this looks pretty imposing, but hopefully breaking it down will make it easier. First of all, there are three distinct pieces at play here. Looking back at the Birchbox post, I see that M is a set of products and N is a set of customers.

There are a few sightings of Sigma notation:

This basically means add them all together. I wrote about that a short while ago. There is a gotcha with this part, though, but more on that later.

Max, well, that means maximum, so there’s something to maximise: the max over B. Okay, so far so good. I’ll tackle the matrix-like things H() and B() later. First I need some products and some customers to get things started.

Creating Products

Let’s create a vector of maps; each map is a product with an id, a name and a stock quantity.
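A hedged starting point, with invented product names:

(def products
  [{:id 1 :name "Lipstick" :stock 100}
   {:id 2 :name "Shampoo" :stock 250}
   {:id 3 :name "Face Cream" :stock 75}])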

Doesn’t feel like much but it’s a starting point. The next job is to put the next parts of the algorithm in. The main thing is, we’ve got started….

….though it’s all totally subject to change.

Calculating Happiness

The H() function is a bit of a domain-knowledge quagmire for us, as how it’s actually calculated sits within the realms of Birchbox: the happiness of a customer with the product they’re presented with. The result would be gathered and calculated from many sources, such as customer reviews, star ratings, website visits to product pages and even social media interaction.

This is called an objective function, which is:

“the result of an attempt to express a business goal in mathematical terms for use in decision analysis, operations research or optimization studies”

Which leaves everything a bit open ended, but I can create a Clojure function to mock some form of happiness output. A random number will do for the time being while we get things defined. Then we can put a boundary in to say anything greater than 75 means the customer is happy with the product.

birchbox.maths.problem> (rand 101)
;; => 69.79507413967906

Becomes…

(defn calc-happiness []
  (rand 101))

Binary Assignments

The nice thing with binary is that it’s going to be 0 or 1. So the binary assignment (B) is basically saying: is the product going into the customer’s box, yes or no? That seems dependent on the result of the happiness rating.
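As a sketch, carrying over the 75 boundary from earlier:

(def happiness-threshold 75)

(defn binary-assignment
  ;; 1 means the product goes in the customer's box, 0 means it doesn't
  [happiness]
  (if (> happiness happiness-threshold) 1 0))

;; (binary-assignment (calc-happiness)) => 0 or 1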

R & D is Fun

I’m going to park this part here. I’ve managed to navigate some ideas of what’s going on and get some code down. Whether it’s right or wrong, well, it’ll take a few more iterations to get a proper handle on whether what’s been coded is going to be any use or not. Saying that, it’s been fun.