An introduction to machine learning on small scale datasets (PyData)

An introduction to machine learning on small scale datasets – identifying Irish farmers who plant forests on their farms.

The purpose of this talk is to illustrate the differences between explanatory modelling (classical statistics) and predictive modelling (machine learning), two approaches that are often conflated. The scikit-learn machine learning library was used to classify Irish farmers who planted forests on their land. The dataset was relatively small, covering 799 Irish farmers and approximately 135 variables. Before the farmers were classified, irrelevant and redundant variables were removed from the dataset using a wrapper feature selection technique, a step that improves the predictive power of the models. This illustrates the power of machine learning for inductive analysis: it can uncover previously unknown relationships between variables (features). As the IPython notebooks were computationally demanding, the final code was run with runipy on gaia, a high-performance computer within UCD. Earlier versions of the IPython notebooks were run on Amazon EC2 using StarCluster, which makes high-performance computing available to the general public at reasonable cost.
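As a rough sketch of what wrapper-based feature selection before classification can look like in scikit-learn, the example below runs recursive feature elimination with cross-validation (RFECV) on synthetic data. The abstract does not say which wrapper method or classifier was used, so RFECV, logistic regression, the synthetic dataset (sized to match the 799 farmers and 135 variables mentioned above) and every parameter value are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the farmer dataset (assumption: the real data is not shown).
# 799 rows and 135 columns mirror the sizes quoted in the abstract.
X, y = make_classification(
    n_samples=799,
    n_features=135,
    n_informative=10,   # hypothetical number of genuinely useful variables
    n_redundant=20,     # hypothetical number of redundant variables
    random_state=0,
)

# Wrapper method: repeatedly fit the classifier, drop the weakest features,
# and keep the subset size with the best cross-validated accuracy.
estimator = LogisticRegression(max_iter=1000)
selector = RFECV(
    estimator,
    step=5,                      # remove 5 features per iteration
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
    min_features_to_select=5,
)
selector.fit(X, y)

# Reduced feature matrix that a final classifier would be trained on.
X_selected = selector.transform(X)

print("Features kept:", selector.n_features_)
print("Reduced shape:", X_selected.shape)
```

To get an honest estimate of predictive performance, the selection step would normally sit inside the same cross-validation loop as the classifier (for example in a scikit-learn Pipeline), so that the held-out data never influences which variables are kept.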

About the speaker

I started my professional career as a civil engineer. I completed an MSc in applied geographical information systems (GIS) at Kingston University London and worked with Mallon Technology Ltd. I am just finishing a PhD at UCD that combines behavioural economics and machine learning techniques to identify Irish farmers who plant forests. My main interest is using Python code and high-performance computers to investigate complex patterns in human behaviour that are driven by a multitude of factors.