Discussion papers

Abstract:

With the growing availability of ‘big’ data, increasing computer power, and improved data storage capacities, machine learning techniques are now frequently employed to make sense of data. Yet the social sciences have been slow to adopt these techniques, and there is little evidence of their use in some academic fields. This thesis examines the methods most commonly used in social science research, namely linear regression and null hypothesis significance testing, to identify how machine learning methods might complement these more established approaches.

A case study of the Troubled Families (TF) programme provides a practical example of how machine learning techniques can be applied to complex, interlinked social data to gain deeper insight. Eleven different types of families were identified using cluster analysis, and further analysis examined how the families’ lives changed after joining the TF programme compared with before. The analysis provided insight into the various types of families that existed and the problems that they had. It also highlighted that, had the data been analysed only at the overall level, the results would have been prone to an averaging effect in which many of the changes that occurred were not apparent; analysis at the cluster level identified cluster-specific patterns and gave a greater understanding of the data.
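The averaging effect described above can be illustrated with a minimal sketch. The records, cluster labels, and outcome values below are entirely made up for illustration and are not taken from the Troubled Families data; the cluster labels are assumed to be already known rather than derived by cluster analysis.

```python
from statistics import mean

# Hypothetical records: (cluster label, change in some outcome score
# after joining a programme). Values are invented for illustration only.
records = [
    ("A", -3), ("A", -2), ("A", -4),  # cluster A: outcome falls
    ("B", 3), ("B", 2), ("B", 4),     # cluster B: outcome rises
]

# Overall analysis: opposite changes cancel out in the mean.
overall_change = mean(change for _, change in records)

# Cluster-level analysis: group the changes by label, then average each group.
by_cluster = {}
for label, change in records:
    by_cluster.setdefault(label, []).append(change)
cluster_change = {label: mean(changes) for label, changes in by_cluster.items()}

print("overall:", overall_change)
print("by cluster:", cluster_change)
```

In this toy case the overall mean change is zero even though every record changed, while the per-cluster means show a clear fall in one group and a clear rise in the other.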

This thesis demonstrates that machine learning techniques, such as cluster analysis and decision tree learning, can be applied effectively to complex ‘real-life’ social science datasets. These methods can identify hidden groups, relationships, and important predictors in a dataset; provide a better understanding of the structure of the data; and aid in generating research questions and hypotheses.