Meta

PyConUK is in its 9th year and this year it'll host its first Science Track, aimed at scientists (not "data scientists" but real lab-coat-wearing scientists). I'm speaking in that track, yay ("Ship Data Science Products!")! The track is part of the main conference, which runs September 19-21. Here's a tiny reminder from the first 2007 event.

If you'd like to learn about Python's role in helping researchers with their work, enabling reproducible research and spreading digital literacy in the sciences, you should attend this track. The track can be attended on its own for just £99 (without attending the rest of the conference), which is a bit of a steal given you'll get 3 days of great networking and learning.

The Software Sustainability Institute is involved, and PyConUK is looking for sponsors – this is a great way to spread your message to a scientific community of over 300 attendees. For details contact PyConUK directly (pyconuk-sponsorship@python.org).

Other speakers include members of PyDataLondon (I’m a co-org) and the wider UK Python community.

We've just run our 2nd PyDataLondon conference: around 300 attendees, 3 keynotes and 3 tracks over 3 days. It was fab! We've grown 50% on last year, with 20% female speakers and 20% female attendees (both up on last year). I'm really happy with the results of all the hard work of our conference committee. Here's Helena giving our opening keynote:

Our monthly meetup is now at 1,650 members and our 13th meetup is scheduled for Tues July 7th at AHL (near Bank tube) – go RSVP now! If you have questions about Pythonic data science, you'll get them answered by the 200+ folk at our meetups (probably in the pub after – buy beer and talk to folk!).

I gave a talk entitled "Ship It!", breaking down 10 years of experience building, running and deploying successful data science projects. It reflects on my recent 1.5 years consulting on automated contract recruitment with ElevateDirect here in London. I reviewed 10 years of my consulting projects, removed those that failed (noting the reasons why) and categorised those that worked into the 4 groups that open the talk; the later lessons then build on the earlier groups.

Peadar Coyle (@springcoil) spoke on deployment recently at PyConItaly, his talk is worth a watch. You’ll probably want to catch up on his PyMC tutorial that we had over the weekend at PyDataLondon.

In my talk and during the closing notes I made a point to everyone: if there's one simple thing you can do today to support the open source projects you use (particularly if you don't contribute to them in other ways), please, please Cite the Project in Public. scikit-learn has a citations page; this helps them raise money from funding bodies, justifying the funding by showing how it helps companies do more business. All you have to do is write a paragraph's testimonial and send it to your favourite project. The scikits, scipy, numpy, the ML tools, matplotlib etc – they'd all love new testimonials. It'll take you 15 minutes, please go do it.

Since the conference was a huge success, a good chunk of money was raised for NumFOCUS, the non-profit that backs the PyData conferences. As a result the awards and scholarships they provide to the community – including the John Hunter scholarship, diversity and women-in-tech grants, and grants for development on tools like AstroPy, IPython, SymPy and Software Carpentry – will get a huge boost. Good job all!

"'If you want to support open source projects, publicly say you use them and write testimonials' – @ianozsvald at #pydataldn15 YES PLEASE." – @drmaciver of Hypothesis

I'll call out a new project that I mentioned – DSADD (Data Scientists Against Dirty Data, now known as Engarde), a set of decorators that assert constraints on the Pandas DataFrames your functions return. This helps when dealing with dirty data.
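To give a flavour, here's a minimal sketch of the idea (the decorator names are as I recall Engarde's API, so do check the project's docs):

import pandas as pd
import engarde.decorators as ed

@ed.none_missing()       # raise if any NaNs remain in the result
@ed.is_shape((None, 3))  # raise if we don't get exactly 3 columns
def load_clean(path):
    df = pd.read_csv(path)
    return df.dropna()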

Our team (my co-chair Emlyn, plus Cecilia, Graham, Florian, Slavi and Calvin) did a wonderful job, along with Leah and James (our International Team – they make all the background stuff happen, particularly Leah!) and Bloomberg's team including Amy, Kenny and Darren:

The party last night was in a local Bier Keller with a live Oompah Band (don't ask!). Much conversation was had.

It was encouraging to see more folk using Python 3.4 at the conference, though 2.7 was still in the majority. I wonder whether the news that the next Ubuntu (15.10 Wily Werewolf) is switching to Python 3.5 in October will help with people's transition.

If you're interested in hearing about PyDataLondon 2016, join this announce list. It'll be almost-zero-volume for the next 6 months; I'll do something with it once we're planning the next conference.

Finally – if you’re after a Data Science Job, I run a very-low-volume jobs list (mostly for London but for the UK in general), read about it here. My ModelInsight also runs data science Python training in London, we announce new training courses on this list. All the lists are MailChimp (so you can unsubscribe instantly at any time), I rarely post to the lists and I keep it all relevant.

I’ve just had a fab couple of days at PyConSE in Stockholm, I really enjoyed giving the opening keynote (thanks!) and attending two days of interesting talks. The Saturday was packed with data science talks (see below), it felt like a mini PyData or EuroSciPy, most cool!

Note – I’ll be updating this write-up a little over the next couple of days (it is the end of the conf and I’m rather shattered right now!).

The slides and video for my Data Science Deployed talk are below:

I’d like to acknowledge Ferenc Huszár (Balderton) and Thomas Stone (Prediction.io) for feedback on early ideas for my talk – cheers gents!

I also plugged PyDataBerlin, our upcoming PyDataLondon (June 19-21, CfP open for just 1 more week) and EuroSciPy on stage – hopefully we'll see a few more international visitors. I should have plugged PyConUK as well, as there's now a Science Track there too!

The following talks from yesterday will interest you; I hope the videos come online soon:

I have a vague idea to write up these topics more in the future; I'm calling this Building Data Science Products with Python. There's a mailing list – I'll email it with questions over the coming months to figure out if/how I should write this.

We've just had our 12th meetup – we're fully a year old, we've nearly 1,500 members and now we're planning our second conference (the Call for Proposals is open for just another 10 days!). Python data science has grown crazily popular in the last couple of years!

Here’s a photo from last week’s meetup, that’s over 220 people at our new host hedge-fund AHL (they’re hiring):

Our two speakers were:

Slavi Marinov talking on using gensim for topic classification for financial prediction

Lasse Bohling talking on using statistics for football prediction at footballradar.com

Slides are linked in the meetup comments. We’ll take a break for a month to run the conference (June 19-21), then we’ll pick up again in July.

This weekend the #talkpay tag has seen people outing their salaries to democratise some of this information. This provides some interesting data for visualisation. If you're curious about the discussion around salary data then @patio11's blog entry is a good starting point.

@echen grabbed some of the data; I took a copy of the online sheet and wrote the following code to visualise the salaries. This is a very simplistic analysis: it is mostly US data and there's no filtering for location (you'd expect San Francisco to pay significantly more than many other US cities).
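The code is a short pandas/matplotlib script along these lines (a reconstruction – the column names and cut-off are assumptions, the real sheet will differ):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("talkpay.csv")  # hypothetical export of the shared sheet
salaries = df["salary"].dropna()
trimmed = salaries[salaries < 300000]  # drop the few huge outliers

trimmed.hist(bins=50)
plt.xlabel("Reported salary (USD)")
plt.ylabel("Count")
plt.show()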

First, here's a histogram of the majority of the salaries listed (ignoring the top 9, which go up to $1.1 million and distort the plot):

Next we can filter by some text terms; here's a similar histogram for software developers. Note the interesting peaks at $80k and $120k, then smaller but obvious bumps at $150k, $200k and $250k:
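Continuing the snippet above, the filtering is just a keyword match (again, the column name is an assumption):

devs = df[df["title"].str.contains("software|developer", case=False, na=False)]
devs["salary"].hist(bins=50)
plt.show()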

There’s much less data for teachers but you can get an idea of the difference in likely salaries:

Finally we can plot a normed (summing to 1.0) cumulative histogram; treating the data as probabilities gives an idea of the proportion of people who earn less/more than a certain amount:
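Continuing the same snippet, matplotlib does the cumulative normalisation for us (this was normed=True in the matplotlib of the day; density=True in current releases):

plt.hist(trimmed, bins=100, cumulative=True, density=True, histtype="step")
plt.xlabel("Reported salary (USD)")
plt.ylabel("Proportion earning at most x")
plt.show()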

It is worth remembering that the data is thin – just 800 samples – and it is self-reported, so most of the reports will be from people who are comfortable being public. The true distribution of salaries is probably lower, as people who aren't confident are less likely to publish.

PyDataLondon 2015 will take place June 19-21 at Bloomberg's HQ in Central London; we'll have 300 people, multiple tracks and a very solid set of speakers and teachers. You should come. You should probably speak and share your knowledge. In fact, you should submit a talk to our Call for Proposals – it opens this weekend and closes May 18th, so you don't have long!

We have a set of Themes for the talks:

Medical and Bioinformatics

Tools (libraries, IDEs, hardware – whatever feels like a tool)

FinTech and Economics

Ecommerce and AdTech

Other goodies (including Art, Open Data, Data Journalism, NGOs, Gaming, IoTs and Robotics – but open to whatever you think is going to be interesting)

The first three topics are definitely of interest to companies in London, tooling is important to everyone, and the "Other goodies" theme is the catch-all for material that's interesting beyond the usual set of companies we know about. The CfP is open for less than 3 weeks so don't hang around! Get a title and short abstract down on paper first; you can fill in the rest online easily enough.

This conference builds upon the PyDataLondon 2014 conference, where we had 200 people at the top of Canary Wharf. This year we'll be 50% bigger and in the centre of London. You want to come along!

Please forward this around to people who will find it interesting! We're keen to reach an even wider community than our usual 1,400 PyDataLondon meetup members. We're friendly to non-Python talks (data science is our focus) and we'd love submissions covering R, SAS, Julia, Hadoop and the like. Our CfP review committee is 50% female and 50% male, more industrial than academic, and all deeply active in the field. We want speakers covering beginner, intermediate and expert data science topics – don't hold off if you've never spoken before, we'd love for you to get involved.

If you’re hiring then you’ll probably want to sponsor – we’ve already closed the first few sponsorship slots and the next set are under discussion so you should get in touch quickly. By sponsoring you’ll be visible to our 300 world-class actively-practising data scientists and you’ll get to meet the creative academic minds and active businesses in our London data science community. Seriously, you should sponsor and get involved, don’t hang around or you’ll be left with that little table at the end of the corridor and you don’t want that!

If you’re interested in the above then you might also be interested in PyConSweden (May 12-13) – I’m giving the Opening Keynote on Data Science Deployed (it’ll be written up here later) and there’s a set of very nice data science talks in the schedule. Very shortly after we’ll have PyDataBerlin on May 29-30 in the heart of Berlin, go grab your tickets before they sell out.

Even if you can’t make our conferences do please join our monthly PyDataLondon meetup and get involved in our very active community. You’ll find slides from past presenters in the Comments for each of the meetups.

Early last year Chris and I founded ModelInsight, a boutique Python-focused data science agency in London. We've grown well and I figure some reflection is in order. The data science scene has also grown very well in London; I'll put some notes on that down below too.

Through consulting, training, workshops and coaching we've had the pleasure of working with the likes of King.com, Intel, YouGov and ElevateDirect. Each project aimed to help our client identify and use their data more effectively to generate more business. Projects have included machine learning, natural language processing, prediction and data extraction, for both prototyping and deploying live services.

I've particularly enjoyed the training and coaching. We've run courses introducing data science with Python, covering stats and scikit-learn, and high performance Python (based on my book); if you want to be notified of future courses then please join our training announce list.

With the coaching I’ve had the pleasure of working with two data scientists who needed to deploy reliably-working classifiers faster, to automate several human-driven processes for scale. I’ve really enjoyed the challenges they’re posing. If your team could do with some coaching (on-site or off-site) then get in touch, we have room for one more coaching engagement.

I’ve also launched my first data-cleaning service at Annotate.io, it aims to save you time during the early data-cleaning part of a new project. I’d value your feedback and you can join an announce list if you’d like to follow the new services we have planned that’ll make data-cleaning easier.

All the above occurs because the Data Science scene here in London has grown tremendously in the last couple of years. I co-organise the PyDataLondon meetup (over 1,400 members in a year!), here’s a chart showing our month-on-month growth. At Christmas it turned up a notch and it just keeps growing:

Each month we have 150-200 people in the room for strong Data Science talks, in a couple of months we’ll have our second conference with 300 people at Bloomberg (CfP announce list). We’re actively seeking speakers – join that list if you’d like to know when the CfP opens.

I've been privileged to give the opening keynote on The Real Unsolved Problems in Data Science last year at PyConIreland; I've just spoken on data cleaning at PyDataParis and soon I'll keynote on Data Science Deployed at PyConSE. I'm deeply grateful to the community for letting me share my experience. My goal is to help more companies utilise their data to improve their business – if you've got ideas on how we could help then I'd love to hear from you!

I’m also thinking of writing a book on Building Python Data Science Products, see the link for some notes, it’ll cover 15 years of hard-won advice in building and shipping successful data science products using Python.

I'm at PyDataParis, the first PyData in France, and we have a 300-strong turn-out. In my talk I asked about the split of academic and industrial folk: at least 70% of the 70 people in the room were industrialists. The bulk of the attendees are in the Intro track, so maybe the split is different in there. All slides are up and videos are following; see them here.

Here’s a photo of Gael giving a really nice opening keynote on Scikit-Learn:

I spoke on data cleaning with text data; I packed quite a bit into my 40 minutes and got a nice set of questions. The slides are below; they cover:

Data extraction from text files, PDF, HTML/XML and images

Merging on columns of data

Correctly processing datetimes from files and the dangers of relying on the pandas defaults (a two-line demo follows this list)

Ideas on automating visualisation for new, messy datasets to get a “bird’s eye view”

Tips on getting started – make a Gold Standard!
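On the pandas-defaults point above, the danger fits in two lines – by default pd.to_datetime guesses month-first, which silently mangles European day-first data:

import pandas as pd

print(pd.to_datetime("1/2/2015"))                 # 2015-01-02: read as Jan 2nd (MM/DD)
print(pd.to_datetime("1/2/2015", dayfirst=True))  # 2015-02-01: read as 1st Feb (DD/MM)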

One question concerned the parsing of datetime strings from unusual sources. I'd mentioned dateutil's parser in the talk; a second parser is delorean. In addition I've also seen arrow (an extension of the standard datetime), which has a set of parsers including one for ISO8601. The parsedatetime module has an NLP module to convert statements like "tomorrow" into a datetime.
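For a quick taste of each (all real libraries; the calls are as I recall their APIs, so check each project's docs):

from dateutil import parser
import arrow
import parsedatetime

print(parser.parse("4th April 2015 14:02"))          # flexible, guessing parser
print(arrow.get("2015-04-04T14:02:00+00:00"))        # strict ISO8601 parsing
cal = parsedatetime.Calendar()
time_struct, status = cal.parse("tomorrow at 6pm")   # NLP-style conversion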

I don't know of other, better parsers – do you? In particular I want one that'll take a list of datetimes and return one consistent converter that isn't confused by individual instances (e.g. "1/1" is ambiguously MM/DD or DD/MM).

I’m also asking for feedback on the subject of automated feature extraction and automated column-join tools for messy data. If you’ve got ideas on these subjects I’d love to hear from you.

In addition I was reminded of DiffBot, which uses computer vision and NLP to extract meaning from web pages. I've never tried it – can any of you comment on its effectiveness? Olivier Grisel mentioned pyquery to me, an lxml-based parser which lets you make jquery-like queries on HTML.
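pyquery usage is pleasantly compact (a minimal sketch):

from pyquery import PyQuery as pq

doc = pq("<div><h1 class='title'>Data</h1><p>Some text</p></div>")
print(doc("h1.title").text())  # -> 'Data'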

Update: I should have mentioned chardet, which detects encodings (UTF8, CP1252 etc) from raw text – very useful if you're trying to figure out the encoding for a collection of bytes off of a random data source! libextract (write-up) looks like a young but nice tool for extracting text blocks from HTML/XML sources, as does goose. boltons is a nice collection of bolt-on tools for the standard library (e.g. timeutils, strutils, tableutils). Possibly mETL is a useful tool for thinking about the extract, transform and load process.
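chardet in brief (the filename is hypothetical):

import chardet

with open("mystery_file.txt", "rb") as f:
    raw = f.read()
print(chardet.detect(raw))  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}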

Camilla Montonen has just spoken on Rush Hour Dynamics, visualising London Underground behaviour. She noted graph-tool, a nice graphing/viz library I'd not seen before. Fabian has just shown me his new project: it collects NLP IPython Notebooks and lists them, trying to extract titles or summaries (which is a gnarly sub-problem!). The AXA Data Innovation Lab have a nice talk on explaining machine-learned models.

This year our consulting is branching out – we've already helped a new medical start-up define their data offering, I'm mentoring another data scientist (to avoid 10 years of my mistakes!) and we're deploying new text mining IP for existing clients. We've got new private training this April for Machine Learning (scikit-learn) and High Performance Python (announce list), and Spark is on my radar.

Apache Spark maxing out 8 cores on my laptop

Python’s role in Data Science has grown massively (I think we have 5 euro-area Python-Data-Science conferences this year) and I’m keen to continue building the London and European scenes.

I’m particularly interested in dirty data and ways we can efficiently clean it up (hence my Annotate.io lightning talk a week back). If you have problems with dirty data I’d love to chat and maybe I can share some solutions.

For PyDataLondon-the-conference we're getting closer to fixing our date (late May/early June); join this announce list to hear when we have our key dates. In a few weeks we have our 10th monthly PyDataLondon meetup – you should join the group, as I write up each event for those who can't attend, so you'll always know what's going on. To keep the meetup from degenerating into a shiny-suit-fest I've set up a separate data science jobs list, which I curate, only sending relevant contract/permie job announces.

The latest PySpark (1.2) feels genuinely useful. Late last year I had a crack at running Apache Spark 1.0 and PySpark and it felt a bit underwhelming (too much fanfare, too many bugs). The media around Spark continues to grow – today's hackernews thread on the new DataFrame API, for example, has a lot of positive discussion – and the lazily evaluated pandas-like dataframes built from a wide variety of data sources feel very powerful. Continuum have also just announced PySpark+GlusterFS.

One surprising fact is that Spark is Python 2.7 only at present; feature request 4897 covers Python 3 support (go vote!), which requires some cloud pickling to be fixed. Using the end-of-the-line Python 2.x release feels a bit daft. I'm using Linux Mint 17.1, which is based on Ubuntu 14.04 64-bit, with the pre-built spark-1.2.0-bin-hadoop2.4.tgz from their downloads page, and 'it just works'. Using my global Python 2.7.6 and an additional IPython install (via apt-get):
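# (the original snippet is missing here – reconstructed from the IPYTHON=1
# note below and the install path used later in this post)
$ IPYTHON=1 ~/data/libraries/spark-1.2.0-bin-hadoop2.4/bin/pyspark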

Note the IPYTHON=1: without it you get a vanilla shell; with it, pyspark uses IPython if it is on the search path. IPython lets you interactively explore the "sc" Spark context using tab completion, which really helps at the start. To run one of the included demos (e.g. wordcount) you can use the spark-submit script:
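# (reconstructed example – wordcount.py ships in the Spark 1.2 tarball
# and takes a text file to count)
$ bin/spark-submit examples/src/main/python/wordcount.py README.md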

For my use case we were initially after sparse matrix support; sadly that's only available for Scala/Java at present. By stepping back from my sklearn/scipy sparse solution for a minute and thinking a little more map/reduce, I could just as easily split the problem into a set of counts, and that parallelises very well in Spark (though I'd love to see sparse matrices in PySpark!).
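To sketch the counting idea (a toy example typed into the pyspark shell where sc already exists – not the client code, and the keys are made up): emit (key, 1) pairs and let reduceByKey do the summing:

pairs = sc.parallelize([("skill_a", 1), ("skill_b", 1), ("skill_a", 1)])
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())  # [('skill_a', 2), ('skill_b', 1)]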

I'm doing this with my contract-recruitment client via my ModelInsight as we automate recruitment; there's a press release out today outlining a bit of what we do. One of the goals is to move to a more unified research+deployment approach: rather than building lots of tooling in R&D which we then streamline for production, we hope to share similar tooling between R&D and production so that deployment and different scales of data become 'easier'.

I tried the latest PyPy 2.5 (running Python 2.7) and it ran PySpark just fine. With PyPy 2.5 a prime-search example takes 6s vs 39s with vanilla Python 2.7, so in-memory processing using RDDs rather than numpy objects might be quick and convenient (has anyone trialled this?). To run using PyPy set PYSPARK_PYTHON:

$ PYSPARK_PYTHON=~/pypy-2.5.0-linux64/bin/pypy ./pyspark
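For reference, the prime-search was along these lines (an illustrative sketch typed into the pyspark shell, not the exact benchmark – pure-Python lambdas like this are exactly where PyPy's JIT beats CPython):

nums = sc.parallelize(range(2, 1000000))
primes = nums.filter(lambda n: all(n % d != 0 for d in range(2, int(n ** 0.5) + 1)))
print(primes.count())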

I'm used to working with Anaconda environments, and for Spark I've set up a Python 2.7.8 environment ("conda create -n spark27 anaconda python=2.7") with IPython 2.2.0. Whichever Python is on the search path, or is specified at the command line, is used by the pyspark script.

The next challenge to solve was integration with ElasticSearch for storing outputs. The official docs are a little tough to read as a non-Java/non-Hadoop programmer and they don't mention PySpark integration; thankfully there's a lovely 4-part blog sequence which "just works":
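The core of that pattern looks something like the following (a sketch of the elasticsearch-hadoop route – the Java class names are as I recall them, so verify against your elasticsearch-hadoop jar):

docs = [{'doc_id': i, 'text': 'example %d' % i} for i in range(4)]
rdd = sc.parallelize(docs).map(lambda d: (d['doc_id'], d))
rdd.saveAsNewAPIHadoopFile(
    path='-',  # ignored by the ES output format
    outputFormatClass='org.elasticsearch.hadoop.mr.EsOutputFormat',
    keyClass='org.apache.hadoop.io.NullWritable',
    valueClass='org.elasticsearch.hadoop.mr.LinkedMapWritable',
    conf={'es.resource': 'myindex/mytype'})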

The above creates a list of 4 dictionaries and then sends them to a local ES store using “myindex” and “mytype” for each new document. Before I found the above I used this older solution which also worked just fine.

Running the local interactive session using a mock cluster was pretty easy. The docs for spark-standalone are a good start:

sbin $ ./start-master.sh

# the log (full path is reported by the script so you could `tail -f `) shows
# 15/02/17 14:11:46 INFO Master:
# Starting Spark master at spark://ian-Latitude-E6420:7077
# which gives the link to the browser view of the master machine which is
# probably on :8080 (as shown here http://www.mccarroll.net/blog/pyspark/).

# Next, start a single worker:

sbin $ ./start-slave.sh 0 spark://ian-Latitude-E6420:7077
# and the logs will show a link to another web page for each worker
# (probably starting at :4040).

# Next you can start a PySpark IPython shell for local experimentation:

$ IPYTHON=1 ~/data/libraries/spark-1.2.0-bin-hadoop2.4/bin/pyspark \
    --master spark://ian-Latitude-E6420:7077
# (and similarly you could run a spark-shell to do the same with Scala)

# Or we can run their demo code using the master node you've configured:
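# (reconstructed – the original snippet is missing; --master points
# spark-submit at the standalone master started above, and pi.py
# ships with the Spark 1.2 examples)
$ bin/spark-submit --master spark://ian-Latitude-E6420:7077 \
    examples/src/main/python/pi.py 10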