Category Archives: Spark

With two weeks in August I’ve learned some new things. The exchange rate will remain against us but it doesn’t change our resolve, we’ll jump on aeroplanes and go to Spain, we just don’t return home with the donkey and the sombrero now. And yes, Ryanair purposely do hard landings to save on tyre wear and shave turnaround times.

The UK could learn a thing or two on how to charge for public transport. Buses and trams are cheap and people use them. Alicante town and Altea are lovely.

Benidorm is what you make it, it’s not all the mad drinking that the UK media play out. During the high season mobility scooters are a lot less common, in October it’s mobility gridlock.

Finally, Belfast International's international arrivals could do with a lick of paint, and immigration could be kept on the same level (i.e. the ground floor). Just saying, it's depressingly grey to come back to. Heck knows what out-of-country visitors make of it.

Aside from all that it’s a good time for me to catch up on reading as I don’t get a huge amount of time. So here’s what was in the bookshelf, in the carry on bag and in my shoulder bag. Never be without a book…..

The Evolution of Everything: How Small Changes Transform Our World (Matt Ridley)

A surprise find in a small newspaper/bookshop in Benidorm. The book is broken up into different areas: science, philosophy, business, technology, economics and so on. And it's a great read, with plenty of new things to learn that I wasn't aware of. It's not a technology book but there are some very interesting points to learn from.

Around about 23 people came up with the idea of the lightbulb during the same period of time as Edison did. So how does a company or person claim patents on "inventing" something when the idea is usually shared?

A Truck Full of Money (Tracy Kidder)

The story of Paul English who was one of the founders of Kayak. It’s a read about English, not about Kayak though that does feature in and out of the book. It’s a good grounding in his thought process, which can be all over the shop (so not just me then). Sometimes the writing tends to go on a bit, I think it could have been shorter.

Women In Tech (Tarah Wheeler)

I bought this for my not-so-wee-one but it's taken up permanent residence on the living room table for everyone to read. While Tarah has written and curated a brilliant book on women in tech, the information is really a must-read for anyone wanting to be in tech. Like I said in a previous post, I wish I'd had this book thirty years ago.

High Performance Spark (Holden Karau and Rachel Warren)

I’m blessed, I get to do some interesting Spark work at Mastodon C but finding good reading material on the subject can be hard. The general rule of thumb is if Holden has been involved then I read it.

The book is about getting the most out of Spark: Spark SQL, ML and how to get the best performance out of RDDs. The code is in Scala as you'd expect, but that shouldn't be a worry if you use Python, Clojure or Java. You'll figure it out, that's what you're paid to do.

It’s been a while since I looked at any Spark code, I’ve just been working on other things. There’s been a few comments on the blog about running Spark jobs from the command line shell.

Test Data

First let's have some text data to work off; we'll do a basic word count on it. Nothing to hand apart from the output of my TensorFlow algorithmic book generation.

I Wordlessly Kate and I gaze at the elevator at the end. I have never understood what you’re going to do with my safety. I groan as my body is rigid, tension radi- ating out me in front of me. He looks so remorseful, and in the same color as the crowd arrives and in my apartment. The thought is crippling. But and I don’t want to go to me that I want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to you. I don’t want to be beholden to him — and I can tell him about 17 miles a deal. “Did you have to compro- mise. I giggle. “Wench. Food, now, please.” “Since you want to talk about you in my own way, and I am going to be very surprised, not to see you. Ax (Your fiancee) I ask softly. He looks so vulnerable — and I don’t know if it’s my heightened way of the ‘old,’ son. I have a hairdresser arriving at your mom?” “Yes.” He grins at me and winks, making me flush. He smirks at me. “What is it?” I ask. He gazes at me, his eyes dark and earnest. “Find out the elevators, of the first time in a half-bear — and I have to go to church . . . Date: June 10, 2011 16:05 To: Christian Grey Twiddling Christian and I don’t know if it’s not at the rules are a hostile Anthem, “Every Breath You Take.” I do you have to do with you?” he asks. 
“I don’t want to go to work for a living, and I’ll be very persuasive,” he murmurs, and his eyes are alight with humor. “He’s like a drink,” Jack mut- ters, locking the eggs. I crack through my body. But what I do to make you uncomfortable.” I shake my head to fetch him at the same COURTESY to a child. “I thought you were in the apartment or you^?

I’m putting this up as I got a nice email from a reader who was having trouble with running the Britney example. And as developers know, bad examples are enough to put people off…. actually they’re toxic.

See what I did there…

Classifier4J

The Classifier4J library is old, so it's not on any Maven repository I'm aware of. We have to go old school and download the jar file the old-fashioned way. You can find the Classifier4J library at http://classifier4j.sourceforge.net/

Compiling

Open a terminal window and go to the example code for the book. In chapter2 is the Britney code. Keep a note of where you’ve downloaded the Classifier4J jar file as you’ll need this in the Java compile command.

$ javac -cp /path/to/Classifier4J-0.6.jar BritneyDilemma.java

Executing

There should be a .class file in your directory now. Running the Java class is a simple matter; notice we're referencing the package and class we want to execute.
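Something like this, assuming the class sits in a package called chapter2 (check the package declaration at the top of BritneyDilemma.java and adjust accordingly):

$ java -cp .:/path/to/Classifier4J-0.6.jar chapter2.BritneyDilemma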

One of the highlights of my job as a Data Engineer (I’m not a data scientist) is that I get to do some very cool stuff with text mining and all that data schizz.

So to that end I’m using Apache Spark, Clojure and Sparkling a lot. With that in mind I do a lot of bag of words, word vectors and such things to get topics and classifications from word documents. And it’s at this point that SparkML fails like a complete worn out donkey because it’s one of those small overlooked elements that you come across once in a while.

In topic modelling though it's nice to know (actually pretty important) which document was labelled with which terms. Anything using SparkML's hashingTF function leaves no trace of which document the term frequency came from, which is rather pointless and, let's face it, pretty annoying.
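To make the complaint concrete, here's a rough plain-Python sketch (no Spark involved, and the names are mine, not SparkML's): a hashing TF turns terms into bucket indices, so the vector on its own says nothing about which document it came from. You have to carry the document id alongside it yourself.

```python
import zlib

def hashing_tf(terms, num_buckets=16):
    """Term-frequency vector via the hashing trick (crc32 for determinism)."""
    vec = [0] * num_buckets
    for term in terms:
        vec[zlib.crc32(term.encode()) % num_buckets] += 1
    return vec

docs = {"doc-1": ["spark", "topic", "model", "topic"],
        "doc-2": ["donkey", "sombrero"]}

# Pair each vector with its document id so the labelling survives.
tf_by_doc = {doc_id: hashing_tf(terms) for doc_id, terms in docs.items()}
```

The vectors themselves are anonymous; it's the `doc_id` key kept next to them that makes the topic labels usable afterwards.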

While I personally find it limits who can apply (i.e. you need the cash first and claim it back afterwards; fine when you have £2.5K, £10K or £40K sloshing around in the bank account, but for most mere mortals it's a hard one to pull off), the grant is a good way of testing the concept of a product.

What I find disappointing is the outcome of the work that's gone into these projects, especially on the software front. The PoC is treated by some as "free" money to prop up the web developers, app developers and consultants in the province.

A lot of projects never make it past the early adopters, but that shouldn't stop others.

Code Ownership

We're talking about public money, so my question to you is this: is the product of effort from a proof of concept grant public when it's no longer required by the originator? Why do I ask? Well, there have been some good ideas that got the PoC money, made a cracking product and then died only because of sales and marketing. Sometimes the idea is dropped because the individual thinks it's a dead duck. The question remains though: could someone else pick up the baton and do a better job?

Open Sourcing The Idea

As the money is essentially public, cannot the product of effort be public too? If we look at projects such as LittleDeliApp and Receet the effort was done via PoC money. In isolation they were good products but were going to take insane growth to make any profit. I wrote a post last year about such ideas.

If a project is classed as dead then I firmly believe the public element of the product should be placed out in the open for someone else (or an organisation) to give it a go. The hard part is the marketing and selling, and that's where a good chunk of money would be spent. There's tons of folk who will code a product up for you…. but who conducts the rest of the orchestra?

While I've never taken PoC money (or any other public money to build product, for that matter) I've always tried to open source what I possibly could, using this blog as the vehicle to teach and hopefully inform. From Hadoop and Spark to recommendation engines, even sourcing bus stop locations via an iPhone app, I've put it on GitHub for all to see, learn from and use. We should be doing the same from a PoC project viewpoint. These projects could teach the next wave of coders, leaders and marketers.

Northern Ireland’s Collaborative Function

With an open sourcing of dead PoC projects the work isn’t wasted, the public money potentially isn’t wasted and the originator hasn’t lost anything apart from a touch of ego bruising perhaps.

With the projects out in the open you open the gates of opportunity for others to make use of what's been publicly financed: big data projects, excellently built apps. Entrepreneurs could start a business with a good head start, and the development companies could pitch for maintenance work while the team gets built. More positive stories of entrepreneurs giving it a go in the global marketplace can only be good PR for Invest Northern Ireland.

It’s Happening Elsewhere

While I've had thoughts on what to do about existing deadpooled PoC projects, I'd not written about it. It wasn't until working with Mastodon C that all of this was brought into sharp focus: we open source everything we possibly can. The benefits are twofold. Firstly, people use our software and give us honest feedback on the product; secondly, developers will fork the project, add to it and improve what's there. It's a win-win.

Projects I’ve worked on try to be open from day one. The mantra is “make the repo public”, it certainly galvanises the attention on how you develop, what you publish and how you test. Exposure to developer ridicule makes you a better programmer.

And it's not like Mastodon C only open sources the little things; full platforms get opened up wherever possible. The idea is to make them better over the long term.

Conclusion

For Northern Ireland to make better products and be more collaborative, it starts with the publicly funded projects, things the originator hasn't really lost anything on. These code pieces, blueprints and plans should be opened up to give someone else a chance. That person might see the link the other couldn't see.

If you want more successful startups we’re going to have to open up and share a lot more.

The Story So Far

In previous posts I've covered loading data in Spark (with Sparkling in Clojure) and doing some half-funky stuff with it. That's all very well, and a good starting point, but it's a touch limiting. Ultimately it's very easy to get some numbers out, crack some percentages and plot a 2D graph, Google Map or infographic.

What I want to do is something far more interesting than that (in my eyes), use some machine learning to create new things based on what we have.

Markov Chains

With a sufficient amount of text we can do some interesting things. The nice thing about Markov chains is they are simple in terms of how they work.

With a corpus of text loaded we can create some fresh output text; more text, better results. A Markov chain will randomly walk an existing lookup table, built from the corpus text, and randomly select the next word to use. By looking at the previous words in the original corpus the chain can weight what the next random word should be.
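The idea is small enough to sketch. Here's a rough Python version (not the Clojure library I use later, just an illustration) where duplicate entries in each follower list act as the weighting:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus.
    Duplicates in the list are what weight the random choice."""
    words = text.split()
    chain = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        chain[prev].append(nxt)
    return chain

def generate(chain, start, length=10, seed=7):
    """Random-walk the lookup table from a starting word."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length and chain.get(out[-1]):
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)

corpus = ("I spent the summer on my back another attack "
          "stay in just to get along turn off the TV")
chain = build_chain(corpus)
```

Calling `generate(chain, "I")` walks the table from "I", picking each next word from the followers seen in the corpus.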

Examples I’ve seen have created Paul Graham startup stories and Garfield cartoons. I could create my own St Vincent song, in fact that’s what I’ll do.

How To Create New St Vincent Songs

“Jase, I think you might like this….”, said my dear friend, sound engineer and my soundscape recordist, Dez Rae. He was right. That was in 2010/2011 before rock royalty beckoned for Annie Clark (and rightly so)… I bought what I could on the spot, it was so unique.

The great thing is the variety of songs, no two come near each other and no two albums are the same.

The Corpus of Annie Clark

In a text editor I’ve copied/pasted the lyrics from the Strange Mercy album.

I spent the summer on my back
Another attack
Stay in just to get along
Turn off the TV, wade in bed
A blue and a red
A little something to get along
Best find a surgeon
Come cut me open
Dressing, undressing for the wall
If mother calls
She knows well we don't get along

An album full of lyrics (all copyright to Annie Clark I hasten to add), all the blank lines taken out, that’s our corpus.

Markov Chain Code In Clojure

Now I need some code to do the Markov chain. I'm not writing it this time; someone else has done the work in Clojure far better than I could have, so I'm using his.

Like I said, with a corpus of text loaded, the program will look at the next words and create a lookup of words and scores. When I generate new sentences the next word is governed by the lookup table and word scores. Simple.

markov.core> (-main "/Users/jasonbell/Documents/stvincentlyrics.txt")
("Oh little one I guess it makes my mulling days, through my lesson" "Chloe in just to get along" "Your hometown is" "I've told whole lies" "Let's not a party I owe you ever really care for me?" "But when you ever really stare at you could take us?" "Chloe in the tiger" "My own heels" "Did you say it was the piles\"" "While you" "Heal my clothes on" "But when you went off the tiger" "I've told whole lies" "Bodies, can't you can limp beside you ever really stare?" "Tried so they left more")

Which looks pretty neat….

Oh little one I guess it makes my mulling days, through my lesson Chloe in just to get along Your hometown is I’ve told whole lies Let’s not a party I owe you ever really care for me? But when you ever really stare at you could take us? Chloe in the tiger My own heels Did you say it was the piles While you Heal my clothes on But when you went off the tiger I’ve told whole lies Bodies, can’t you can limp beside you ever really stare? Tried so they left more

It’s still copyright to Annie Clark, they’re still her words just a little more random. If I was going for a title, “My Mulling Days” would be a front runner.

I could have put in the lyrics from all the albums and come up with a more refined lyric set, but as a test and a wee tribute to one of my favourite artists, it's a good start.

Do We Need An Executive?

So it looks like Stormont is getting a longer break than was originally planned, which means that NI open data is going to be thin on the ground for new MLA questions. So in the meantime let's turn the building into a data centre (we could ask Arlene if INI will fund it; she's still there, she's managed to hold on to things….)

So I’ve got my new data centre.

With no MLAs asking questions, though, we want to generate some to give the impression that something is happening up there. All those potential FDI clients will want to see the powerhouse working…. If we do a good enough job we could let the Markov chains do the work altogether, but let's not get ahead of ourselves just yet.

Repurposing NIAssembly Spark Code

I’m going to extract the question text from the MLA questions. I’m going to use the NI Assembly Spark code (you can read part 1 and part 2 if you want to know the inner workings) and extract just the text.

mlas.core> (def qtext (spark/map (fn [qs] (map (fn [question] (:questiontext question)) qs)) mqrdd))
#'mlas.core/qtext
mlas.core> (spark/first qtext)
("To ask the First Minister and deputy First Minister for an update on the delivery of their Programme for Government 11/15 commitments." "To ask the First Minister and deputy First Minister for an update on the delivery of their Programme for Government 11/15 commitments." "To ask the Minister of Enterprise, Trade and Investment whether any of his departmental responsibilities have been affected by the actions of any proscribed organisations since 2011.")
mlas.core>

That’s the first element of the RDD and it has three questions. There’s a lot more…. a whole lot more.

I want to save this out as a text file which requires a bit more mapping.
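In plain Python terms (a stand-in for the Spark flat-map and save-as-text-file calls; the question strings here are made up for the example) the shape of the mapping is: flatten the per-member question lists into one question per line, then write the lines out.

```python
# Each element is one member's list of question texts.
questions_by_member = [
    ["To ask the First Minister for an update.",
     "To ask the Minister of Finance for an update."],
    ["To ask the Minister for Regional Development for an update."],
]

# Flatten so every question becomes its own line of the output file.
lines = [q for member_questions in questions_by_member for q in member_questions]

with open("mlaquestions.txt", "w") as f:
    f.write("\n".join(lines))
```

The resulting one-question-per-line text file is exactly what the Markov chain code wants to consume.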

Random MLA Question Generation

With 94,000+ questions to train my Markov chain I'm expecting some interesting results. I only want to generate one question at a time, so I can remove the loop (where I was generating 15 lines for the St Vincent lyrics).

I’m going to run this from the REPL so I’m not reloading and reindexing all the text. Let’s create some MLA questions for next week.

markov.core> (def markov (transform (lazy-lines "/Users/jasonbell/Documents/mlaquestions.txt")))
#'markov.core/markov
markov.core> (generate-sentence markov)
"To ask the First Minister of Finance and deputy First Minister what steps are entitled to ensure greater weight is the reasons that no reports into the Housing Executive Gateway Reviews his Department has been allocated to outline the Minister for Social Services and to a CCEA test; and (vi) South Armagh city area."
markov.core> (generate-sentence markov)
"To ask the Ethnic Development what recruitment process used to detail, broken down by (i) who are assessed as possible help graduates in the Minister and Personnel for each spouse or not personally signed off a whole."
markov.core> (generate-sentence markov)
"To ask the cost, of Ulster in the Minister of order an organisation, broken down by Health and Learning for exemption."
markov.core> (generate-sentence markov)
"To ask the Minister of the last three years."
markov.core> (generate-sentence markov)
"To ask the First Minister what sentences would bring forward to July bonfires on the progression on planning application for rural area of Health, Social Services Directive; and location and what they are assisting these guidelines; and Leisure for Social Services and Rural Development what additional counselling, including those in 2008/09."
markov.core> (generate-sentence markov)
"To ask the First Minister and (ii) if so (ii) whether students with identities outside the number of the Employment and whether the Office of the Environment Minister."
markov.core>

To be honest that was far too much fun!

Taking It Further

If you have access to plenty of text then you can run Markov chains to produce new content with little difficulty. For a more refined method it's worth looking at artificial neural networks, which are being used by some publishers for content creation.

All in all, to save Northern Ireland from having no news whatsoever…. well I’ve done my bit 🙂

Channel Your Inner Nate Silver

So you’ve read the Smart Cities book, you’ve followed every Nate Silver post in 538….. now to put it all into practice. An opportunity to do some very serious future cities planning with the Greater London Authority and MastodonC.

Would you like to join an ambitious and forward looking unit of analysts, researchers and data experts, working for one of the world’s truly global cities?

The Greater London Authority (GLA) is working with the big data analytics specialist, Mastodon C to create a solution ‘Witan’ which allows subject experts and policy makers to integrate different types of hard and soft model, in order to explore scenarios for the futures of their cities.

You will play a key role in the project, working closely with GLA staff and Mastodon C’s team. You will have the opportunity to help build up a secure City Data counterpart to the GLA’s award winning open data London DataStore. As well as designing reproducible procedures to shape and clean the data, you will actively seek opportunities to link datasets together as part of creating an analytical data store.

You will also gather user stories from policy teams and analysts and devise/apply tests for Witan modules as they progress through Alpha and Beta releases.

This is a great opportunity to develop your skills and experience, but you will need to bring with you a strong technical background including practical application of data science in a work setting.

In addition to a good salary package, we offer an attractive range of benefits including 30 days annual leave, interest free season ticket loan, interest free bicycle loan, childcare voucher scheme and a career average pension scheme.

London’s diversity is its biggest asset and we strive to ensure our workforce reflects London’s diversity at all levels. Applications from Black, Asian and Minority ethnic candidates will be particularly welcomed as they are currently under-represented in this area of our organisation.

If you have a question about the role then please contact the Resourcing Team by email on glajobs@london.gov.uk quoting reference GLA2981.

Moving On From The NI Assembly

There was plenty of scope from the NI Assembly blog posts I did last time (you can read part 1 and part 2 for the background). While I received a lot of messages with "why don't you do this" and "can you find xxxxxx subject out", it's not something I wish to do. Kicking hornets' nests isn't really part of my job description.

Saying that, when there's open data for the taking it's worth looking at. Recently the Detail Data project opened up a number of datasets to be used. Buried within are the prescriptions that GPs, or nurses within the practice, have prescribed.

Acquiring the Data

The details of the prescription data are here: http://data.nicva.org/dataset/prescribing-statistics (though the data would suggest it's not really statistics, just raw CSV data). The files are large, but nothing I'm worrying about in the "BIG DATA" scheme of things; this is small in relative terms. I've downloaded October 2014 to March 2015, a good six months' worth of data.

Creating a Small Sample Test File

When developing these kinds of jobs, before jumping into any code it's worth having a look at the data itself. See how many lines of data there are; this time, as it's a CSV file, I know it'll be one record per line.
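A couple of quick shell commands do the job (the filename here is a placeholder for whichever month's CSV you downloaded):

$ wc -l prescriptions-2014-10.csv
$ head -n 100 prescriptions-2014-10.csv > sample.csv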

You might have noticed the final two keys, noname1 and noname2. The reason for this is simple, there are commas on the header row but no names so I’ve forced them to have a name to keep the importing simple.

Whereas the NI Assembly data was in JSON format with the keys already defined, this time I need to use the zipmap function to mix the header keys and the values together. This gives us a handy map to reference instead of relying on the element number of the CSV line. As you can see, I'm grouping all the prescriptions by their GP key.
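If zipmap is unfamiliar, the same trick in Python is dict(zip(header, row)). A rough sketch with invented column names and rows (the real header comes from the CSV's first line):

```python
from collections import defaultdict

# Hypothetical header and rows, standing in for the prescription CSV.
header = ["practice", "year", "month", "item", "quantity", "noname1", "noname2"]
rows = [
    ["610", "2014", "10", "Gluten Free Bread", "91", "", ""],
    ["610", "2014", "10", "Blood Glucose Testing Strips", "54", "", ""],
    ["612", "2014", "10", "Paracetamol 500mg", "12", "", ""],
]

# zipmap equivalent: pair each header key with the row's values.
records = [dict(zip(header, row)) for row in rows]

# Group all the prescription records by their GP practice key.
by_gp = defaultdict(list)
for rec in records:
    by_gp[rec["practice"]].append(rec)
```

Now `rec["item"]` reads far better than `row[3]`, and `by_gp` mirrors the grouped pair RDD.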

Counting The Prescription Frequencies

This function is very similar to the frequency function I used in the NI Assembly project: by mapping each prescription record and retaining the item prescribed, I can then use the frequencies function to get counts for each distinct type.

Getting The Top 10 Prescribed Items For Each GP

Suppose I want to find out the top ten prescribed items for each GP location. As the function I've got returns the frequencies, with a little tweaking it can give me what I need. First I'm using sort-by to sort on the frequency, which gives me a sort from smallest to largest; the reverse function then flips it on its head and gives me largest to smallest. With me only wanting ten items, I then use the take function to return the first ten items in the sequence.
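For anyone following along in another language, the same frequencies → sort-by → reverse → take pipeline looks like this in Python (the items and counts are invented for the example):

```python
from collections import Counter

# One GP's prescribed items, one entry per prescription record.
prescribed = (["Gluten Free Bread"] * 3
              + ["Blood Glucose Testing Strips"] * 2
              + ["Paracetamol 500mg"])

freqs = Counter(prescribed)  # Clojure's (frequencies ...)

# sort-by + reverse + take 10, in one line.
top10 = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:10]
```

The slice at the end behaves like take: if there are fewer than ten distinct items you simply get them all.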

Creating The Output File Process

So with two simple functions we have the workings of a complete Spark job. I’m going to create a function to do all the hard work for us and save us repeating lines in the REPL. This function will take in the Spark context, the file path of the raw data files (or file if I want) and an output directory path where the results will be written.

What's going on here then? First of all we load the raw data into a Spark pair RDD, and then using the thread-last macro we calculate the item frequencies and merge the results down to a single partition with the coalesce function. Finally we output everything to our output path. First of all I'll test it from the REPL with the sample data I created earlier.

There we are: GP 610 prescribed 91 loaves of gluten-free bread over the six-month period. The blood glucose testing strips are also high on the agenda, but that would come as no surprise to anyone who is diabetic.

So Which GPs Are Prescribing What?

The first number in the raw data is the GP id. In the DetailData notes for the prescription data I read:

“Practices are identified by a Practice Number. These can be cross-referenced with the GP Practices lists.“

As with the NI Assembly data I can load in the GP listing data and join the two by their key. Sadly on this occasion though I can’t, the data just isn’t there on the page. I’m not sure if it’s broken or removed on purpose. Shame but I’m not going to create a scene.

A picture speaks a thousand words so they say, so it makes sense to attempt to visualise a diagram of the data we’ve worked on.

What’s A Sankey Diagram?

A Sankey diagram is basically a collection of node labels with connections; these connections are weighted by value, and the higher the value the thicker the connection.

The data is based on a CSV file with a source node, target node and a value. Simple as that.
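For example, two rows of sankey.csv, using a couple of the department counts from our data, look like this:

source,target,value
Member 8,Department of Education,131
Member 8,Department for Regional Development,128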

Reusing the Spark Work

In the previous post we left Spark and the data in quite a nice position. A Pair RDD with the member id as a key and a vector of department names with the question frequencies.

The first job is to transform this into a CSV file. As we left it we had a pair RDD of [k, v] with the v being a map of department names to frequencies. So in reality we've got [k, {k v, k v, ……. k v}].

For example, let’s look at the first element in our RDD after doing a spark/collect on the Pair RDD.

mlas.core> (first dfvals)
["8" {"Department of Culture, Arts and Leisure" 32, "Department of the Environment" 96, "Department for Social Development" 76, "Department of Agriculture and Rural Development" 53, "Department for Employment and Learning" 40, "Department for Regional Development" 128, "Northern Ireland Assembly Commission" 18, "Department of Education" 131, "Department of Health, Social Services and Public Safety" 212, "Department of Justice " 38, "Department of Finance and Personnel" 105, "Office of the First Minister and deputy First Minister" 151, "Department of Enterprise, Trade and Investment" 66}]

The first element is the member id and second element is the department/frequency map. Remember this is for one MLA, there are still 103 in the RDD altogether.

A couple of things to note: first, the use of [k v] being passed into the map function; secondly, as I'm using the println function, the result of the map function is going to be nil. The last line is the result of the Clojure map function.

In part we've got two-thirds of the CSV output already done with the target node and the value. I need to redo the Spark function so that instead of the member id being the key, I have the name of the MLA in question.

With that RDD we have all the elements for the required CSV file: a source node (the MLA's name), a target node (the department) and a value (the frequency). Notice that I'm also removing the commas from the MLA name and the department name, otherwise I'll break the Sankey diagram when it's rendered on screen.

So far so good, checking on my desktop and there’s a CSV file ready for me to use. I just need to add the header (source,target,value) to the top line. In all honesty I should really insert that header row at the start of the vector.

Creating The Sankey Diagram

Where possible it's best to learn from example, and in all honesty I'm not a visualisation kind of guy. So when the going gets tough, the tough Google D3 examples.

So there's a handy Sankey diagram with CSV files example that I can use. A small amount of copy/paste creates the index.html and sankey.js files; all I have to do is copy in the sankey.csv that Spark just output for us. I've extended the length of the canvas the Sankey diagram is painted on to.

Appending a couple of CSV output files to sankey.csv will give us a starting point. If I reload the page (Dropbox doubles as a very handy web server for static pages if you put html files in the Public directory) you end up with something like the following.

Okay, it's not perfect but it's certainly a starting point. Just imagine how it would look with all the MLAs….. maybe later.

Conclusion

Once again I've rattled through some Spark and Clojure, but we're essentially reusing what we have. The D3 outputs take some experimentation and time to get right. Keep in mind, if you have a lot of nodes (notice how I'm only dealing with two MLAs at the moment), the rendering can take some time.