Category Archives: Java

An interesting conversation came up during a tea break at a London meeting this week: how do you run R scripts from within Clojure? Two answers emerged. One was simple, the other (mine) was far more complicated (see the “More Complicated Ways” section below).

So here’s me busking my way through the simple way.

Run it from the command line

The Clojure Code

Using the clojure.java.shell package gives you access to the Java system command process tools. I’m only interested in running a script, so all I need is the sh function.

(ns rinclojure.example1
  (:use [clojure.java.shell :only [sh]]))

(sh "Rscript" "meantest.R")

The sh function produces a map with three keys: an exit code (:exit), the output (:out) and any error output (:err). I can evaluate the returned map and check the exit code; anything that’s not zero means a failure, in which case I dump the error, and if all is well I send out the output.
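For comparison, here’s a minimal sketch of the same launch-and-check logic in plain Java using ProcessBuilder (roughly what sh wraps underneath). The class and method names here are mine, not from any library:

```java
public class ShellRunner {
    // A sketch of what clojure.java.shell/sh does underneath: launch the
    // command, capture stdout and stderr, then branch on the exit code.
    // Intended use in this post: ShellRunner.run("Rscript", "meantest.R")
    public static String run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).start();
        String out = new String(p.getInputStream().readAllBytes());
        String err = new String(p.getErrorStream().readAllBytes());
        int exit = p.waitFor();
        if (exit != 0) {
            // non-zero exit: surface the error stream instead of the output
            throw new RuntimeException("exit " + exit + ": " + err);
        }
        return out;
    }
}
```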

The R Code

I’ve kept this function simple: I’m only interested in running Rscript and checking the error code. If all is well then we show the output, otherwise we send out the error.

The now-preferred way to run R scripts from the command line is the Rscript command, which is bundled with R when you download it. If I have R scripts saved, it’s a case of running them through Rscript and evaluating the output.

Here’s my R script.

myvec <- c(1,2,3,2,3,4,5,4,3,4,3,2,1)
mean(myvec)

Not complicated, I know: just a vector of numbers and a function to get the average.

Running in the REPL

Remember, the error is from the running of the command and not from within your R code. If you mess up the R code itself, those errors will appear in the :out value.

It’s easy enough to parse by removing the \n and the [1] prefix that R has generated. We’re not interacting with R, only dumping out its output, so after that there’s an amount of string manipulation to do.
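For a single value, that clean-up can be sketched in Java like so (the helper name is mine; 2.846154 is the mean our script prints):

```java
public class ParseROutput {
    // R prints a vector with a "[1] " index prefix and a trailing newline;
    // for a single value we can strip both and parse the number that's left.
    public static double parseScalar(String rOut) {
        String cleaned = rOut.replace("\n", "").replace("[1]", "").trim();
        return Double.parseDouble(cleaned);
    }
}
```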

Expanding to Multiline Output From R

Let’s modify the meantest.R file to give us something multiline.

myvec <- c(1,2,3,2,3,4,5,4,3,4,3,2,1)
mean(myvec)
summary(myvec)

Nothing spectacular, I know, but it has implications. Let’s run it through our Clojure command function.

We have no reference to what each number means: whether it’s the min, the max, the average and so on. At this point there would be more string manipulation required, and you could convert the values to keywords or just add your own labels.
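One way to add those labels in Java, assuming the two-line layout that summary() prints for a numeric vector (the label names are my own shorthand):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParseRSummary {
    // summary() prints a row of labels over a row of values. Labels such as
    // "1st Qu." contain spaces, so rather than parse them we pair the values
    // with our own fixed set of keys.
    public static Map<String, Double> parse(String valueLine) {
        String[] labels = {"min", "q1", "median", "mean", "q3", "max"};
        String[] values = valueLine.trim().split("\\s+");
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < labels.length && i < values.length; i++) {
            result.put(labels[i], Double.parseDouble(values[i]));
        }
        return result;
    }
}
```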

More Complicated Ways.

Within the R libraries there’s the rJava package. This lets you run Java from R and R from Java. I wrote a chapter on R in my book back in 2014.

It’s not the easiest thing to set up but worth the investment. There is a Clojure project on GitHub that acts as a wrapper between R and Clojure, clj-jri. Once set up, you run R as an REngine and evaluate the output that way. There’s far more control, but it comes at the cost of complexity.

Keeping Things Simple

Personally I think it’s easier to keep things as simple as possible: use Rscript to run the R code. It’s worth considering the following points, though.

Keep your R scripts as simple as possible, output to one line where possible.

Ensure that all your R packages are installed and working; it’s not ideal to install them during the Clojure runtime, as the output will become hard to parse. Also make sure that all the libraries are on the same instance as your Clojure code.

In the long run, have a set of solid string manipulation functions to hand for dealing with the R output. Remember, it’s one big string.

The Story So Far…

This started off as a quick look at Linear Regression in spreadsheets and using the findings in Clojure code, that’s all in Part 1. Muggins here decided that wasn’t good enough and rigged up a Neural Network to keep the AI/ML kids happy, that’s all in Part 2.

Darcey, Len, Craig or Bruno haven’t contacted me with a cease and desist so I’ll carry on where I left off….. making this model better. In fact they seem rather supportive of the whole thing.

Weka Has Options.

When you create a classifier in Weka there are options available to you to tweak and refine the model. With the Multilayer Perceptron that was put together in the previous post, that all ran with the defaults. As Weka can automatically build the neural network I don’t have to worry about how many hidden layers to define, that will be handled for me.

I do however want to alter the number of iterations the model runs (epochs) and I want to have a little more control over the learning rate.

Concluding

There’s not much further I can take this as it stands. The data is actually robust enough that Linear Regression would give the kind of answers we were looking for. Another argument would be that you could use a basic decision tree to read Craig’s score and classify Darcey’s score.

While I personally find it limiting in terms of who can apply for it (you need the cash first and claim it back afterwards; fine when you have £2.5K, £10K or £40K sloshing around in the bank account, but for most mere mortals it’s a hard one to pull off), the grant is a good way of testing the concept of a product.

What I find disappointing is the outcome of the work that’s gone into these projects, especially on the software front. The PoC is treated by some as “free” money to prop up the web developers, app developers and consultants in the province.

A lot of projects never make it past the early adopters, but that shouldn’t stop others.

Code Ownership

We’re talking about public money, so my question to you is this: is the product of effort from a proof of concept grant public when it’s no longer required by the originator? Why do I ask? Well, there have been some good ideas that got the PoC money, made a cracking product and then died only because of sales and marketing. Sometimes the idea is dropped because the individual thinks it’s a dead duck. The question remains though: could someone else pick up the baton and do a better job?

Open Sourcing The Idea

As the money is essentially public, can’t the product of the effort be public too? If we look at projects such as LittleDeliApp and Receet, the effort was funded via PoC money. In isolation they were good products, but they were going to need insane growth to make any profit. I wrote a post last year about such ideas.

If a project is classed as dead then I firmly believe that the public element of the product should be placed out in the open for someone else (or an organisation) to give it a go. The hard part, and where a good chunk of money would be spent, is the marketing and selling. There are tons of folk who will code a product up for you…. but who conducts the rest of the orchestra?

While I’ve never taken PoC money (or any other public money to build product, for that matter), I’ve always tried to open source what I possibly could, using this blog as the vehicle to teach and hopefully inform. From Hadoop and Spark to recommendation engines, even sourcing bus stop locations via an iPhone app, I’ve put it on GitHub for all to see, learn from and use. We should be doing the same from a PoC project viewpoint. These projects could teach the next wave of coders, leaders and marketers.

Northern Ireland’s Collaborative Function

With an open sourcing of dead PoC projects the work isn’t wasted, the public money potentially isn’t wasted and the originator hasn’t lost anything apart from a touch of ego bruising perhaps.

With the projects out in the open, you open the gates of opportunity for others to make use of what’s been publicly financed: big data projects, excellently built apps. Entrepreneurs could start a business with a good head start, and development companies could pitch for maintenance work while the team gets built. More positive stories of entrepreneurs giving it a go in the global marketplace can only be good PR pieces for Invest Northern Ireland.

It’s Happening Elsewhere

While I’d had the thought of what to do about existing deadpooled PoC projects, I’d not written about it. It wasn’t until working with MastodonC that all of this was brought into sharp focus: there we open source everything we possibly can. The benefits are twofold. Firstly, people use our software and give us honest feedback on the product; secondly, developers will fork the project, add to it and improve what’s there. It’s a win-win.

Projects I’ve worked on try to be open from day one. The mantra is “make the repo public”, it certainly galvanises the attention on how you develop, what you publish and how you test. Exposure to developer ridicule makes you a better programmer.

And it’s not like MastodonC to open source just the little things, full platforms will get opened up wherever possible. The idea is to make them better over the long term.

Conclusion

For Northern Ireland to make better products and be more collaborative, it starts with the publicly funded projects, things on which the originator hasn’t really lost anything. These code pieces, blueprints and plans should be opened up to give someone else a chance. That person might see the link that the other couldn’t.

If you want more successful startups we’re going to have to open up and share a lot more.

I’m going to get supported and slated in equal measure I feel but I’ve seen this so many times now that it’s becoming the elephant in the room so I’m going to comment all the same.

Dear Founder…..

What we do know, especially in Northern Ireland, is that there’s a lack of developer talent willing to work on a startup from the initial stages, sweating out the product in the small hours. Founders have little option but to go to development houses to get their concepts built so that they can be proven in the market.

Buyer Beware

When you are shopping around make sure you ask this simple question:

“The work that you do, do I own ALL the code?”

Hint: if they say “no” or “we use some of our own custom software”, politely end the discussion immediately and walk away.

The emphasis in the question is on the word ALL. In order for your business to survive you have to be able to adapt your code at any time. Software houses are not there with your best interests at heart (regardless of what they might actually say to you; they’re a business, they need to survive too, and it’s all about recurring revenue). If you don’t own ALL the code then you can’t adapt quickly, or adapt at all.

If open source libraries are mentioned, check the licensing agreements on them; not all open source is free. And make sure your developer-in-waiting shows you what libraries they are using, and get the links so you can see them too.

I’ve seen many a company start well and within time end up like

Trust me, it will hurt your revenue far more than it will hurt the development house.

My Advice To You, Founder

With the big wedge of cash (yours, an investor’s or the government’s funding) you are the customer, and you can call the shots. So demand 100% source code ownership, in your hands, in a GitHub account. Then, in the event you need someone else to do some work as you grow, you can.

Even better, get friendly with a coder. Even if they have a full-time job, coders like to code, so if you offer them a rate they’ll support you too. Have a developer fallback plan; you owe it to your business and to your investor if he or she is putting the money in.

Review any SLAs you have with development houses and see exactly what you are getting for your money. Insist on a monthly statement of how many hours were actually spent on your business. Complain bitterly if you need to; the customer is king here, though every development house would have you think you are nothing without them.

Basically you need the following for a fairly run of the mill web/mobile tech startup:

Someone who knows your web side code (PHP, Ruby, Java, Python or what have you)

An iOS developer if you have an Apple supported mobile product.

A developer who knows good Android development.

A server guy or gal who’ll advise, stress test and update your server (a lot of development houses will steer clear of the hard stuff and just code)

One person can cover all of those roles, but such people are rare, so I’d look to spread the workload where possible. In Northern Ireland we are acutely aware of a complete lack of good CTO material for startups, but try to find a technical person who can articulate comments and ideas to the development house; there’s nothing a development house hates more than a person who knows what they do.

With so many new ideas coming out development houses are only too happy to greet you with open arms and discuss your dreams and visions. Corny as it might sound it’s a long term relationship so make sure you’ve done a bit of dating first to find a suitable match.

Ultimately though, make sure you’ve got your prenup in order for when you want to move, you need 100% of your code with you in order to continue your life once the developer separation happens.

Sometimes it’s too easy to rest on what you know well, anything for an easier life. Fewer challenges, more ease. Grab a cup of tea and carry on. Sometimes a shock to the system will wake you up and make you think a little harder.

Cue The Crippling XFactor Back Story

That’s been me with Java for a long, long time, considering I’ve been working with it since the start in 1995. While I know other languages like PHP, Ruby and Python, I’ve always gone back to Java for the simple reason that I get things done with it.

Clojure and Scala came along and I dabbled with both. Scala was easier from a Java perspective, so I took the easy way out. When you’re not exposed to something, though, it’s easy to fall back on doing what you know best. Clojure I couldn’t wrap my head around. I usually ended up with this sort of expression….

The Time Will Come….

There will be a day when you can’t avoid it. And over the last few months I’ve not been able to avoid Clojure. It’s staring me in the face and it’s what my client uses. The best way for me to learn is to have a goal in mind and code my way to a conclusion. Now fortunately for me I’m with some of the best Clojure programmers going, highly involved in the community, a lovely team to boot.

With some pair programming things are taking shape and that’s great. So my current face now seeing the power of Clojure is….

There are times when you need to do your first solo flight and see what happens. So I did my first solo flight.

Calculating GeoDistance

I want to take two latitude and longitude points and calculate the number of miles between them. In Java it’s going to look something like this:

So what I’m trying to do is write this program in Clojure and come up with the same output for the same test.

In the Java there are two private functions, one to convert from degrees to radians and one for the reverse (deg2rad and rad2deg), so I’m going to write these first. They’re basic functions taking in the degrees and, for degrees to radians, multiplying by the value of PI and then dividing by 180.
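As a reference point, here’s a sketch of the kind of Java in question, using the spherical law of cosines; the class name and the 60 * 1.1515 miles-per-degree constant are my assumptions, not the original listing:

```java
public class GeoDistance {
    // The two helper conversions described above: degrees <-> radians.
    private static double deg2rad(double deg) { return deg * Math.PI / 180.0; }
    private static double rad2deg(double rad) { return rad * 180.0 / Math.PI; }

    // Great-circle distance in miles between two lat/lon points.
    public static double distanceMiles(double lat1, double lon1,
                                       double lat2, double lon2) {
        double theta = lon1 - lon2;
        double dist = Math.sin(deg2rad(lat1)) * Math.sin(deg2rad(lat2))
                    + Math.cos(deg2rad(lat1)) * Math.cos(deg2rad(lat2))
                      * Math.cos(deg2rad(theta));
        dist = rad2deg(Math.acos(dist));
        // one degree of arc is ~60 nautical miles; 1.1515 converts to statute miles
        return dist * 60 * 1.1515;
    }
}
```

London to Paris, for example, comes out in the low 200s of miles with this formula.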

During 2014, while I was writing the book “Machine Learning – Hands-On For Developers and Technical Professionals”, it became very clear that I was going to have to tackle an issue that I’d done well to avoid for most of my 27-year career in computing…. In the UK the concept of mathematical notation was never really made clear at school; it was just put in the whole “algebra” camp and left at that. Things may have changed now (hopefully), but it left a bit of a gap when I actually needed it.

Scary Monsters

Writing the book I kept coming across mathematical notation used to prove concepts; it was everywhere. And to an ageing programmer well versed in experience but not in academic training, it got a bit scary. Especially this big, foreboding one….

There’s Something About Sigma

Perhaps it’s because it’s big, it looks serious and it looks like it means business. It means “sum”: add it all up. That’s it. Something so foreboding to show something so simple.

Let’s take a simple example. What’s being said here is “add up every value of i*3 from 1 to 100”, or (1*3) + (2*3) + (3*3) + …… + (99*3) + (100*3). From a programming perspective what I have here is a for loop: an iterator starting at one and finishing at 100. The iterator starts at the value below the sigma (1) and runs until the top value is reached. The action performed on each value of i is to the right of the sigma. In Java it would look like:
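A minimal version of that loop, using the i*3 example above (the class name is mine):

```java
public class SigmaExample {
    // "Sum i*3 for i = 1..100" as a plain for loop: the iterator starts at
    // the value below the sigma and runs to the value above it.
    public static int sum() {
        int total = 0;
        for (int i = 1; i <= 100; i++) {
            total += i * 3;   // the action to the right of the sigma
        }
        return total;
    }
}
```

Which comes out at 15150, the same as 3 * (100 * 101 / 2).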

It’s been staring me in the face for a while, through all the training, advising and talking about this Hadoop thing: it needs an AppStore, a seamless platform for one-click installing and running of Hadoop applications within the framework and cluster.

Up until now most MapReduce work has been hand-coded by in-house development teams working to a specific requirement. Hadoop 2 and its YARN component extend on that and make applications easier to distribute across the cluster, like I said in yesterday’s post about deploying point-of-sale software across kiosks.

The issue is that Hadoop takes an amount of skill to get things running: thinking about getting data in, running scripts and so on. We’re a way off from a one-touch solution for app and data deployment and processing.

It’s coming, I’m sure it is. The next two years will be very interesting.

MapR got the ball rolling with an application gallery, not an AppStore per se but a great start.

The commonality with most data projects is that they start with this rather ambiguous “well, we’ve got this data” statement. And without a care in the world it’s spin up the Amazon EMR instances, yes instances, and whack all the data up there. The mainstream tech media will focus on the commodity hardware and Yahoo having 50,000 nodes in its cluster, but the fact remains that for most customers we’re talking small numbers of nodes.

The peak cluster sizes really only apply to a small number of the companies using it (the Googles, Facebooks, Twitters and Yahoos of this world). For most people a single node may actually be fine when dealing with batch processing, especially when those runs are planned during quiet times when the data isn’t going to be changing fast.

Adding nodes adds latency, whether within the local network or across a wide area network. When you are combining a number of latency points, such as network access, disk I/O and the actual processing, the total time added can start to hurt.

Hadoop usually copies blocks of 64MB of data to a node for processing, done over TCP and RPC. Why 64MB? Well, it’s a Goldilocks number: not too big, not too small, just right.
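For the record, the block size is configurable in hdfs-site.xml; a sketch with the classic 64MB default (in Hadoop 2 the property is spelled dfs.blocksize):

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.block.size</name>
  <value>67108864</value> <!-- 64 * 1024 * 1024 bytes -->
</property>
```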

So the key to powering up any single-node system is to find ways of reducing any latency you can. Whether that’s making HDFS a complete in-memory implementation, making block reads and writes faster, or just adding a ton more machine RAM (well, as much as the JVM will let you use), it all adds to the performance; machines with 48GB will find other uses for the memory, such as disk caching.

Once you run out of options you are entering the realms of hardware acceleration, and there are a few companies working on this now; a notable NI company is Analytics Engines. At this point it doesn’t really remain a Hadoop issue, it’s just eking the juices out of the machine you are working on.

The Hadoop cluster count has become the ego point for Hadoop developers, without much thinking about the cost of running such a complex cluster of machines (devops won’t save you now). I’ve found that even the mid-tier racks will give you up to 6TB of storage and enough RAM to sink a small startup. In the grand scheme of things SMEs don’t have much to worry about; perhaps we don’t need everything on the cloud after all…..

The book has been on pre-order for a little while already, but I didn’t want to say anything until there was a cover. So here we are: the book, coming on nicely. I’ve tried to keep things more developer-focused than mathematical; a get-things-done book more than a theory book.

General release isn’t until November in the US and December elsewhere though I’m trying to get it launched for StrataConf Europe in November.

The final phase, time to visualise.

Visualisation is not always the be-all and end-all of Big Data, but it is important in terms of telling the story.

At the end of the last post we had mined the data, got counts of all the hashtags and also whittled it down to three brands we wanted to focus on.

Visualising in D3

D3 (Data-Driven Documents) is a JavaScript library that takes a lot of the work out of creating graphs, maps and all sorts of other things. From our perspective it’s handy as we can load in CSV/TSV files with ease.

Loading D3 is as simple as

<script src="http://d3js.org/d3.v3.min.js"></script>

I’m going to create a pie chart based on an example on the D3 site. You can have a look at the HTML in my GitHub repo for this blog post series.

Spring XD/Hadoop/D3 Considerations

In the four posts in this series we’ve covered data consumption, storage, processing and visualisation. With Spring XD it’s going to continue gathering data until we say stop. The Hadoop job we ran was a one-off, but there’s nothing to stop us putting these things in a cron job to update every hour, though we do have to keep an eye on the time it’s taking to run those jobs.
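That hourly refresh could be wired up with a crontab entry along these lines (the paths and jar name are hypothetical):

```
# Run the hashtag-count Hadoop job at the top of every hour
0 * * * * /usr/bin/hadoop jar /opt/jobs/hashtag-count.jar >> /var/log/hashtag-count.log 2>&1
```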

Also remember that D3 takes time to load and parse on the client side, so don’t overload the user with too much information; from an aesthetic point of view, too much information is confusing for the reader.

As with all these things it’s a case of getting your hands dirty and trying things out.