
This has been a lovely and sunny weekend in London, but I didn’t see any of it because I spent it all crunching dataframes and calculating numbers at my first Data Dive.

Data Dives are events organized by an international organization called DataKind, in which a group of data scientists volunteer their time to solve data analysis problems for non-profit organizations. For example, I have been analysing data for My Help at Home, a company that helps elderly people find local carers, trying to understand which factors influence the demand for and cost of private carers.

DataKind UK has a strict no-sharing policy regarding the results of the Data Dive, in order to protect the data made available by the charities. However, in the case of My Help at Home we used only publicly available data, so I guess I can show some of the results, based on the number of Homes, Agencies and Hospitals in the UK:

Here are a few thoughts about the experience:

I’ve decided that I will start introducing myself as a data scientist rather than a bioinformatician. Most people outside academia do not really understand what a bioinformatician is, and it is easier to explain to them that you are a data analyst or scientist working on genetic and biological data. In the end the definition is correct: bioinformaticians really are a specialized type of data scientist.

This has been an opportunity to get in contact with the “real world” of data science outside academia. Most of the people I met work in the private sector, in fields like finance, consulting, gambling, and journalism. I only met a couple of people from academia, and they were both complaining about the lack of organization and planning at their universities.

Thanks to dplyr and related libraries, R has become a really powerful tool for merging and assembling datasets. It helped me a lot during the data cleaning and assembly phase, and I think that for these tasks it is much better than Python or bash. I would recommend that anyone starting to learn R skip all the basic syntax and start directly with dplyr (e.g. see the tutorial I wrote for the PEB workshop).

The majority of people used Python, in particular IPython notebooks, for most of the tasks. Currently I am an R and dplyr person, but for machine learning tasks I am starting to think that Python and scikit-learn can actually be more powerful.

People working in consulting, who need to be able to easily create nice, interactive graphs for their work, used visual tools such as Tableau rather than fiddling with R or other programming tools. For example, the interactive graph above was created in a couple of minutes with noveau.