
Return on investment is often measured as the gain from an investment divided by the cost of the investment, sometimes expressed as a percentage. For example, if a marketing campaign cost £10K but brought in £20K of revenue on top of the usual sales, then the ROI is 200%.

(Note ROI is arguably more properly defined as (gain – cost)/cost but I’ve found that most of the people and industries that I’ve worked with slip naturally into the first definition: gain/cost. In any case both definitions capture the same idea. Thanks to Eduardo Salazar for pointing this out.)

Now if you are just given the ROI you’ll find you are missing any idea of scale. The same ROI could be achieved with a revenue gain of £200 and with one of £200 million. So it would be nice to see cost, revenue and ROI visualised all in one go. There are a few ways to do this but after playing around I came up with the following representation which personally I like the best. It’s a simple scatterplot of cost against revenue but since all points on straight lines radiating from the origin have the same ROI it’s easy to overlay that information. If r is the ROI then the angle of the corresponding spoke is arctan(r).
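The spoke geometry is easy to check numerically. Here’s a minimal sketch in python (the function names are mine, not from the original chart code), computing the ROI under the gain/cost convention and the angle of the corresponding spoke:

```python
from math import atan, degrees

def roi(gain, cost):
    """ROI under the gain/cost convention used above."""
    return gain / cost

def spoke_angle(r):
    """Angle (in degrees, measured from the cost axis) of the
    constant-ROI spoke for ROI r, since revenue = r * cost."""
    return degrees(atan(r))

print(roi(20000, 10000))   # the campaign example: 2.0, i.e. 200%
print(spoke_angle(1.0))    # the break-even spoke sits at 45 degrees
```

Higher ROIs give steeper spokes, flattening out as they approach the vertical.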

Note you can drag about the labels. That’s my preferred solution for messy scatterplot labelling.

Hopefully it’s obvious that the idea is to read off the ROI from the position of the point relative to the spokes. The further out toward the circumference the greater the scale of the success or the disaster, depending on the ROI.

To modify the graph for your own purposes just take the code from here and substitute in your data where var dataset is defined. You can change which ROIs are represented by altering the values in the roi array. If you save the code as an html file and open it in a browser you should see the graph. Because d3 is amazing, the graph should adapt to fit your data.

What do we do?

There’s a huge interest in data science at the moment. Businesses understandably want to be a part of it. Very often they assemble the ingredients (the software, the hardware, the team) but then find that progress is slow.

Coppelia is a catalyst in these situations. Rather than endless planning we get things going straight away with the build, learn and build again approach of agile design. Agile and analytics are a natural fit!

Projects might be anything from using machine learning to spot valuable patterns in purchase behaviour to building decision making tools loaded with artificial intelligence.

The point is that good solutions tend to be bespoke solutions.

While we build we make sure that in-house teams are heavily involved – trained on the job. We get them excited about the incredible tools that are out there and new ways of doing things. This solves the problem of finding people with the data science skill set. It’s easier to grow your technologists in-house.

The tools are also important. We give our clients a full view of what’s out there, focusing on open source and cloud based solutions. If a client wishes to move from SAS to R we train their analysts not just in R but in the fundamentals of software design so that they build solid, reliable models and tools.

We teach the shared conventions that link technologies together so that soon their team will be coding in python and building models on parallelised platforms. It’s an investment for the long term.

Finally we know how important it is for the rest of the business to understand and get involved with these projects. Visualisation is a powerful tool for this and we emphasise two aspects that are often forgotten: interactivity (even if it’s just the eye exploring detail) and aesthetics – a single beautiful chart telling a compelling story can be more influential than a hundred stakeholder meetings.

Why are we different?

One thing is that we prioritise skills over tools. There are a lot of people out there building tools but they tend to be about either preprocessing data or prediction and pattern detection for a handful of well defined cases. We love the tools but they don’t address the most difficult problem of how you turn the data into information that can be used in decision making. For that you need skilled analysts wielding tools. Creating the skills is a much harder problem.

Coppelia offers a wide range of courses, workshops and hackathons to kickstart your data science team. See our solutions section for a full description of what we offer.

Another difference is that we are statisticians who have been inspired by software design. We apply agile methods and modular design not just to the tools we build ourselves but also to traditional analytical tasks like building models.

Collaboration using tools like git and trello has revolutionised the way we work. Analysis is no longer a solitary task, it’s a group thing and that means we can take on bigger and more ambitious projects.

But what is most exciting for us is our zero overhead operating model and what it enables us to do. Ten years ago if we’d wanted to run big projects using the latest technology we’d have had to work for a large organisation. Now we can run entirely on open source.

For statistical analysis we have R; to source and wrangle data we have python; we can rent hardware by the hour from AWS and use it to parallelise jobs using hadoop.

Even non-technical tasks benefit in this way: marketing using social media, admin through google drive, training on MOOCs, design using inkscape and pixlr, accounting on quick file.

Without these extra costs hanging over us we are free to experiment, innovate, cross disciplines and work on topics that interest us, causes we like. Above all it gives us time to give back to the sources which have allowed us to work in this way: publishing our code, sharing insights through blogging, helping friends and running local projects.

What are we excited about?

Anything where disciplines are crossed. We like to look at how statistics and machine learning can be combined with AI, music, graphic design, economics, physics, philosophy.

We are currently looking at how the problem solving frameworks in AI might be applied to decision making in marketing.

Bayesian statistics and simulation for problem solving always seem to be rich sources of ideas. We’re also interested in how browser technology allows greater scope for communication. We blog in a potent mixture of text, html, markdown and javascript.

What technology are we into?

It’s a long list but most prominent are R, python, distributed machine learning (currently looking at Spark), and d3.js. Some current projects include a package to convert

In my previous post (Part I) I went over the basic metacharacters and special signs in regex. In this second part I will be showing you how to simplify regular expressions.

Let’s get started

Repetition metacharacters

So far we have looked at expressions that always match the pattern at a single position in the text. Repetition metacharacters make the expression much more flexible by expanding the pattern to cover a specified number of characters.

* (preceding item zero or more times)
+ (preceding item one or more times)
? (preceding item zero or one time)

The . (dot) had to be escaped to be interpreted literally and each set of digits is repeated a minimum of one and a maximum of three times. Shortly we will learn how to simplify it even further.
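As a quick illustration (using Python’s re module rather than regexpal; the patterns and sample strings here are my own, not from the post), the repetition metacharacters behave like this:

```python
import re

# * : preceding item zero or more times
assert re.fullmatch(r"ab*c", "ac")
assert re.fullmatch(r"ab*c", "abbbc")

# + : preceding item one or more times
assert re.fullmatch(r"ab+c", "abc")
assert not re.fullmatch(r"ab+c", "ac")

# ? : preceding item zero or one time
assert re.fullmatch(r"colou?r", "color")
assert re.fullmatch(r"colou?r", "colour")

# {min,max} bounds the repetition: an IP-address-style pattern with the
# dot escaped and each set of digits repeated one to three times.
assert re.fullmatch(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", "192.168.0.1")
```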

Tip

There are usually many ways to write a regex that matches your needs and no one perfect way! As long as it does the job and you are sure it matches exactly what you want, don’t worry too much about what it looks like!

Grouping metacharacters

Using () around a group of characters will enable repetition of that group.
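A small sketch of the idea in Python (my own example): the IP-address-style pattern from before can be shortened by grouping the dot and digits and repeating the whole group three times.

```python
import re

# Without grouping: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
# With grouping, the repetition {3} applies to the whole group (\.\d{1,3}):
ip = re.compile(r"\d{1,3}(\.\d{1,3}){3}")

assert ip.fullmatch("192.168.0.1")
assert not ip.fullmatch("192.168.0")   # only two repeats of the group
```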

Tip

Don’t group within character sets (inside []), as () will be interpreted literally there.

Tip

Lookbehind can’t be used with repetitions or optional expressions. It also doesn’t work in JavaScript (hence it can’t be tested in regexpal).
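For completeness, a quick illustration in Python, which does support fixed-width lookbehind (the example strings are mine):

```python
import re

# (?<=£) asserts that a £ precedes the match, without consuming it.
m = re.search(r"(?<=£)\d+", "the campaign cost £10000 in total")
assert m.group() == "10000"

# The lookbehind body must be fixed-width: putting a repetition
# inside it raises an error in Python.
try:
    re.compile(r"(?<=a+)b")
except re.error:
    pass  # "look-behind requires fixed-width pattern"
```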

Tip

Lookbehind also tends not to work very well in text editors.

Differences between programming languages

Here is a quick summary of the major differences between regex in different programming languages. It is not exhaustive, but it will give you an idea of the scope of the differences. Again, it is always good to let Google know what language you are working in while searching for regex solutions.

Here’s a chart I drew for myself to understand the relationships between chords in music theory. Doesn’t seem to have much to do with machine learning and statistics but in a way it does since I found it a lot easier to picture the chords existing in a sort of network space linked by similarity. Similarity here is defined as the removal or addition of a note, or the sliding of a note one semitone up or down. What’s wrong with me!

I was doing some simulation and I needed a distribution for the difference between two proportions. It’s not quite as straightforward as the difference between two normally distributed variables and since there wasn’t much online on the subject I thought it might be useful to share.

So we start with X ~ Binomial(n_X, p_X) and Y ~ Binomial(n_Y, p_Y), independent of each other.

We are looking for the probability mass function of Z = X − Y.

First note that the min and max of the support of Z must be −n_Y and n_X, since these cover the most extreme cases (X = 0 and Y = n_Y) and (X = n_X and Y = 0).

Then we need a modification of the binomial pmf so that it can cope with values outside of its support:

f(k; n, p) = C(n, k) p^k (1 − p)^(n − k) when 0 ≤ k ≤ n, and 0 otherwise.

Then we need to define two cases

1. z ≥ 0
2. z < 0

In the first case

P(Z = z) = Σ_{i=0…n_Y} f(z + i; n_X, p_X) f(i; n_Y, p_Y)

since this covers all the ways in which X − Y could equal z. For example when z = 1 this is reached when X = 1 and Y = 0, X = 2 and Y = 1, X = 3 and Y = 2 and so on. It also deals with cases that could not happen because of the values of n_X and n_Y. For example if n_X = 3 then we cannot get Z = 1 as a combination of X = 4 and Y = 3. In this case, thanks to our modified binomial pmf, the probability is zero.

For the second case we just reverse the roles, summing over the values of X instead: P(Z = z) = Σ_{i=0…n_X} f(i; n_X, p_X) f(i − z; n_Y, p_Y). For example if z = −1 then this is reached when X = 0 and Y = 1, X = 1 and Y = 2 etc.
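In code the two cases can be folded into a single sum, since the modified pmf returns zero for any impossible value of X. A sketch in Python (function names are mine):

```python
from math import comb

def mod_binom_pmf(k, n, p):
    """Binomial pmf, extended to return 0 outside the support 0 <= k <= n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def diff_pmf(z, n_x, p_x, n_y, p_y):
    """P(X - Y = z) for independent X ~ Bin(n_x, p_x), Y ~ Bin(n_y, p_y).
    Summing over every value of Y covers both the z >= 0 and z < 0 cases,
    because mod_binom_pmf zeroes out any impossible X = z + i."""
    return sum(mod_binom_pmf(z + i, n_x, p_x) * mod_binom_pmf(i, n_y, p_y)
               for i in range(n_y + 1))
```

A handy sanity check is that the probabilities over the support −n_Y, …, n_X sum to one, and the mean comes out as n_X p_X − n_Y p_Y.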

Regular readers of the blog will have noticed that I haven’t been a regular contributor to the blog over the last year. There are some good reasons/excuses for that, predominantly around buying and renovating our new home, getting married and starting a new job at Facebook. Having an Arsenal season ticket also doesn’t help.

I am hoping that now that most of the renovations are done, I’m five months into married life and 13 months into the new job, I should have time for some more writing, so watch this space.

In the meantime, there is an awesome role for a senior researcher available in my team, working to understand and quantify advertising effectiveness. The full description is available here and it’s based in either New York or California.

Needless to say that Facebook is an incredible place to work, with more data than you could ever hope for and lots of interesting, yet unanswered, questions to ask of the data.

Within the team, you’ll be working with other researchers who have a strong background in data science and research more widely, including Eurry Kim and Dan Chapsky who both recently joined the team.

If you’re interested in the role and think you’d be a good fit, then please submit your CV/Resume or alternatively, if you’ve got any questions, please reach out using the form below!

It is still quite common to hear the career progression of an analyst described as one upwards from the hard graft of coding and “getting your hands dirty” towards the enviable heights of people management and strategic thinking. Whenever I hear this it reminds me of the book Conspicuous Consumption by the American economist Thorstein Veblen. It examines, in a very hypothetical way, the roots of economic behaviour in some of our basic social needs: to impress others, to dominate and to demonstrate status. It’s not a happy book.

His principal concept is the distinction between exploit and drudgery.

The institution of a leisure class is the outgrowth of an early discrimination between employments, according to which some employments are worthy and others unworthy. Under this ancient distinction the worthy employments are those which may be classed as exploit; unworthy are those necessary employments into which no appreciable element of exploit enters.

He sees this division arising early in history as honour and status are assigned to those are successful in making others do what they want and “at the same time, employment in industry becomes correspondingly odious, and, in the common-sense apprehension, the handling of the tools and implements of industry falls beneath the dignity of able-bodied men. Labour becomes irksome.”

We’re no longer hairy barbarians using the vanquished for foot-stools but, as Veblen points out, the distinction persists no matter how illogical and unprofitable. Coding is still somehow viewed as necessarily inferior to the boardroom meeting even if it is the genius piece of code that makes or breaks a business.

Attitudes are changing, but still so slowly that there is a desperate shortage of people with both the experience and the hands-on skills to get things done.

So if you are a good analyst, and you love what you do, then this is my advice to you: when they come to lure you from your lovely pristine scripts, resist. Stay where you are. The 21st century is going to need you!

I saw an interesting “challenge” on StackOverflow last night to create an XKCD style chart in R. A couple of hours later, and going in a very similar direction to a couple of the answers on SO, I got to something that looked pretty good, using the sin and cos curves for simple and reproducible replication.

Using the newly redesigned Google Insights for Search, I looked at searches for clegg and pleb over the last 30 days. A quick manipulation into a csv and applying the XKCD theme and some geom_smoothing gives this:

Looks like Andrew Mitchell might be Nick Clegg’s new best friend in terms of deflecting some of the attention away from the sorry saga…

And here’s the code (note that you need to have installed the Humor Sans font using install_fonts() ):

I recently stumbled across a series of blog posts from the folks at IDV that visualised the archive of recorded pirate attacks which has been collected by the US National Geospatial-Intelligence Agency. It’s a dataset of 6000+ pirate attacks which have been recorded over the last 30 or so years.

This first map shows where the attacks have been recorded, with four clear areas standing out when the data is aggregated into hexagon bins:

Map showing areas where pirate attacks have been recorded

Zooming in on the area around Yemen, there’s a clear ramp up in the number of attacks since 2008, which saw a 570% increase compared to the previous year. As noted by the IDV analysis, most attacks take place on Wednesday and during the spring and autumn.

Number of attacks recorded in the Aden region by year

The reaction to the massive increase in attacks in 2008 seems to have been ships not travelling as close to the shore in 2009, leading to more attacks happening further out to sea. This can clearly be seen by looking only at the attacks in 2008 and 2009:

Number of Pirates attacks in the Aden area in 2008 and 2009 (Distance is in degrees)

Within the dataset, as well as information around the location of the attack, there are also descriptions of the attacks, which lends itself well to some text analysis to understand where there have been changes in the nature of the attacks above and beyond their distance.

Some analysis of the descriptions of the attacks reveals that the nature of the attacks also changed in 2010, with more featuring terms such as security and speedboats (full details of how these topic groups have been created are below). The analysis was used to identify five different types of attacks.

From the chart below, Topics 4 and 5 came to prominence in 2008, with Topic 5 maintaining its share in 2010 before Topic 2 increased in number in 2011 and 2012. This is just scratching the surface of what can be done with topic analysis, and given that all the documents are related to pirate attacks, there’s not the variation compared to what you would see in, say, news articles about many subjects. There’s a good walk through of using the TopicModels package here.

Attacks in the Yemen region classified into one of five topics based on description

And what are these topics? The table below shows the top 10 terms for each of the 5 topics – they’re not as clear cut as you’d hope (mainly because there are a fair few verbs and numbers in there at the moment), but they give an idea of some differences – skiffs versus speedboats, topic 2 featuring “security”, the numbers involved and months of the year all hint at different aspects of the attacks which have been picked out.

The next step is to turn the data into a Planar Point Pattern in order to allow calculation of the nearest coastal point for each attack. The same technique is then used to create a similar file for the coastline. The nncross function then finds the distance from each attack to the nearest point on the coast (and identifies which point that is).
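The analysis itself uses R’s spatstat package for this; as a rough, purely illustrative Python equivalent (brute force, with made-up coordinates and distances in degrees as in the charts above), the nncross step amounts to a nearest-point lookup:

```python
from math import hypot

# Illustrative coordinates only - not from the NGA dataset.
attacks = [(48.2, 13.1), (50.5, 12.0)]               # (lon, lat) of attacks
coast = [(47.9, 13.4), (48.6, 12.8), (50.0, 11.5)]   # sampled coastline points

def nearest_coast_point(point, coastline):
    """Return (index, distance) of the coastline point nearest to `point`,
    mirroring what spatstat's nncross reports."""
    dists = [hypot(point[0] - c[0], point[1] - c[1]) for c in coastline]
    i = min(range(len(coastline)), key=dists.__getitem__)
    return i, dists[i]

for a in attacks:
    print(a, "->", nearest_coast_point(a, coast))
```

For thousands of attacks and a dense coastline you’d want a spatial index (e.g. a k-d tree) rather than brute force, which is essentially what the specialised packages provide.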

Here’s a very short post to highlight one of the “highlights” of my week that I thought was worth sharing with the wider community.

One of the things I find great about R is the rapidly evolving ecosystem where new packages are being constantly created and others are being updated.

Up until now, I’ve found this to be a very good thing, but experienced the other side this week, where an upgrade to a package broke a pretty big script that I’d been working on.

The “quick” solution in my case was to use the CRAN archive to download the source for earlier versions of the package (and its dependencies) that was causing the issue, and then build and install them to overwrite the upgrade – knowledge of how to build a package from source in Windows came in very handy and you can read more about how to do it in a previous D&L post.

The longer term solution is a lot trickier, particularly where R is being used in a collaborative environment and reproducibility is important, either between machines or over time.

I suspect one solution is to have a common library that is centrally maintained and then to change the default location that R looks for installed packages (which is also a handy solution for not having to download all packages again once you’ve upgraded base R).

On top of this, I suspect there would then need to be some type of test suite which, once updates to packages were available, checked in a development environment that existing scripts and processes still worked. None of this is new to software and IT folk, but it’s a novel issue at the moment for the analyst community I suspect.

Machine Learning and Analytics based in London, UK

Interested in our products, or if you’d just like to chat about opportunities, please feel free to get in touch at [email protected]. We’d love to hear from you.
