Quick post using Python3 and the Seaborn statistical visualization package to start trying to understand the UK gender pay gap data released this week. All UK companies with more than 250 employees are required to provide data on how their female and male employees are paid differently. I decided to drill down to look at how, according to the data self-reported by companies, pay varies by gender in the electricity sector.

I've provided my workings in a Jupyter notebook. If you want to run the examples and don't have Jupyter and Seaborn installed, I'd recommend installing them via Anaconda, which is quick and easy.

So things didn't turn out quite as anyone expected in the snap UK general election...

I wanted to create a visualisation of the results that contrasts the seats won with the % of the popular vote, and came up with this infographic.
The nice thing about the two semi-circular charts I generated is that they can be nested within each other.

Four years on from the London Olympics he's only gone and done it again - the double double 5000m/10000m.

Once again, I tracked the tweets using the Twitter streaming API (search terms #gomo, #motime, @mo_farah, #mofarah) before, during and after the race.

The interesting thing is, well, the distribution of tweets over time is pretty similar to last time. Even the absolute rates in tweets per second are similar, despite the fact the race started at 01.37am British Summer Time. You can compare them yourselves by looking at my original post from 2012.
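For anyone wanting to make a similar comparison, here is a minimal sketch of how a tweets-per-second rate could be computed from tweet timestamps by counting tweets in fixed-width time bins. It is not the code behind the charts: the function name `tweet_rates`, the 60-second default bin width and the use of Unix epoch seconds as input are all assumptions made for illustration.

```python
from collections import Counter


def tweet_rates(timestamps, bin_seconds=60):
    """Return a sorted list of (bin_start, tweets_per_second) pairs.

    `timestamps` are tweet times in Unix epoch seconds (an assumed input
    format). Bins that contain no tweets are simply absent from the result.
    """
    counts = Counter()
    for ts in timestamps:
        # Round each timestamp down to the start of its bin.
        bin_start = int(ts // bin_seconds) * bin_seconds
        counts[bin_start] += 1
    # Divide each count by the bin width to get an average rate per second.
    return [(start, counts[start] / bin_seconds) for start in sorted(counts)]


if __name__ == "__main__":
    # Made-up timestamps, purely to show the shape of the output.
    for start, rate in tweet_rates([0, 10, 59, 60, 125]):
        print(start, rate)
```

Wider bins give a smoother curve, narrower bins show the spike at the finish more sharply.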

Apache Spark is a relatively new data processing engine implemented in Scala and Java that can run on a cluster to process and analyze large amounts of data. Spark performance is particularly good if the cluster has sufficient main memory to hold the data being analyzed. Several sub-projects run on top of Spark and provide graph analysis (GraphX), a Hive-based SQL engine (Shark), machine learning algorithms (MLlib) and real-time streaming (Spark Streaming). Spark has also recently been promoted from incubator status to a new top-level project.

In this series of blog posts, we'll look at installing Spark on a cluster and explore using its Python API bindings, PySpark, for a number of practical data science tasks. This first post focuses on installation and getting started.

This snippet, twitstreamer, is a simple command line tool, written in Python 3, for retrieving tweets via the Twitter streaming API, v1.1. The tweets are written to standard output as CSV or JSON formatted lines.
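To give a flavour of the output side, here is a rough sketch of turning one tweet (already parsed into a dict) into a single CSV or JSON line. This is not twitstreamer's actual code: the `format_tweet` helper and the choice of the `id_str`, `created_at` and `text` fields are assumptions made for illustration.

```python
import csv
import io
import json

# Assumed subset of tweet fields to emit; the real tool may choose others.
FIELDS = ("id_str", "created_at", "text")


def format_tweet(tweet, fmt="json"):
    """Return one output line (without a trailing newline) for a tweet dict."""
    if fmt == "json":
        # One JSON object per line, keeping non-ASCII characters readable.
        return json.dumps({k: tweet.get(k) for k in FIELDS}, ensure_ascii=False)
    if fmt == "csv":
        # Let the csv module handle quoting of commas and quote characters.
        buf = io.StringIO()
        csv.writer(buf).writerow([tweet.get(k, "") for k in FIELDS])
        return buf.getvalue().rstrip("\r\n")
    raise ValueError("fmt must be 'json' or 'csv'")


if __name__ == "__main__":
    # Made-up tweet, purely for illustration.
    example = {"id_str": "1", "created_at": "x", "text": 'hi, "there"'}
    print(format_tweet(example, "csv"))
    print(format_tweet(example, "json"))
```

Writing one record per line keeps the output easy to pipe into other command line tools.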

I started on Coursera's Social Network Analysis course and was looking around for
some network data to start analyzing. I'd seen a talk by Matt Biddulph at a Big Data London meetup
(blog post)
on analyzing Wikipedia data and wondered whether something similar could easily be done with news data.

It was fairly easy to grab some newspaper articles using the Guardian Open Platform.
I then used the Python-based Natural Language Toolkit to extract named entities (in particular the names of people) from the articles.
A network could then be constructed using names as the nodes, and connecting nodes with a link if at least two articles included both names.
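The linking rule described above can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the code used for the post: the `build_edges` function and its `min_articles` parameter are hypothetical names, and the input is assumed to be one list of extracted names per article.

```python
from collections import Counter
from itertools import combinations


def build_edges(article_names, min_articles=2):
    """Return {(name_a, name_b): shared_article_count} for linked names.

    `article_names` is an iterable with one collection of person names per
    article. Two names are linked if at least `min_articles` articles
    mention both of them.
    """
    pair_counts = Counter()
    for names in article_names:
        # De-duplicate within an article and sort so each pair has one
        # canonical ordering, e.g. ("A", "B") and never ("B", "A").
        for pair in combinations(sorted(set(names)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_articles}


if __name__ == "__main__":
    # Placeholder names, purely for illustration.
    articles = [["A", "B", "C"], ["A", "B"], ["B", "C"], ["C", "D"]]
    print(build_edges(articles))
```

The resulting edge counts could also be used as link weights when the network is visualized.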

The resulting network could then be loaded into Gephi, an excellent tool for visualizing and analyzing networks.

Another sports-related post, this time inspired by Mo Farah's amazing double gold medals (in the 5000m
and 10000m) over the last couple of weeks at the London Olympics.

I used the gRaphael Charting Library and the
Twitter search API to show
how the rate of tweets containing the hashtag #gomo varied before, during and
just after the 5000m London Olympics final.
Hover over the chart to display the text for selected tweets.

The main features of the chart are a small peak just before the race starts followed by the huge peak after Mo wins.
And I thought it was a long way to jog to the bus stop when running late in the morning!

The visualization plots matches played (x-axis) against points accumulated (y-axis). Click the "Add club" button to compare the progress against that of the other clubs playing in the England and Wales FA Championship.