Machine Learning in Geoscience with Scikit-learn. Part 2: inferential statistics and domain knowledge to select features for oil prediction

In the first post of this series I showed how to use Pandas, Seaborn, and Matplotlib to:

load a dataset

test, clean up, and summarize the data

start looking for relationships between variables using scatterplots and correlation coefficients

In this second post, I will expand on the last point by introducing some tests and visualizations that help establish criteria for keeping some variables and dropping others. All in Python.

The target to be predicted is oil production from a marine barrier sand. We have measured production (in tens of barrels per day) and 7 (initially unknown) predictors at 21 wells.

Hang on tight, and read along, because it will be a wild ride!

I will show how to:

1) automatically flag linearly correlated predictors, so we can decide which might be dropped. In the example below (a matrix of pair-wise correlation coefficients between variables) we see that X2 and X7, the second and third best individual predictors of production (shown in the bottom row), are also highly correlated with X1, the best overall predictor. This is flagged in the first column from the left in the image below.

2) automatically flag predictors that fail a critical r test, as shown in the next image

3) create a table to assess the probability that a certain correlation is spurious, in other words the probability of getting a correlation coefficient at least as high as the one we got with our sample purely by chance.
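To make the three steps above more concrete, here is a minimal sketch of how they could be coded with Pandas and SciPy. It is not the code developed in this series: the data are synthetic stand-ins for the 21 wells, the function names are hypothetical, the 0.7 threshold for flagging correlated predictors is an arbitrary assumption, and the choice of a two-tailed test at the 5% level is also an assumption.

```python
import numpy as np
import pandas as pd
from scipy import stats


def flag_correlated_pairs(df, predictors, threshold=0.7):
    """Return (a, b, r) for predictor pairs with |Pearson r| >= threshold."""
    corr = df[predictors].corr()
    flagged = []
    for i, a in enumerate(predictors):
        for b in predictors[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                flagged.append((a, b, float(r)))
    return flagged


def critical_r(n, alpha=0.05):
    """Two-tailed critical Pearson r for n samples at significance level alpha."""
    dof = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, dof)
    return float(t_crit / np.sqrt(t_crit**2 + dof))


def p_value_for_r(r, n):
    """Two-tailed probability of an |r| this large or larger if the true correlation is zero."""
    dof = n - 2
    t_stat = abs(r) * np.sqrt(dof / (1 - r**2))
    return float(2 * stats.t.sf(t_stat, dof))


# Synthetic stand-in data (NOT the real well data): 21 wells, 7 predictors.
rng = np.random.default_rng(0)
n_wells = 21
predictors = [f"X{i}" for i in range(1, 8)]
data = pd.DataFrame(rng.normal(size=(n_wells, 7)), columns=predictors)
# Make X2 nearly a copy of X1 so there is something to flag.
data["X2"] = data["X1"] + 0.1 * rng.normal(size=n_wells)
# Made-up target that depends mostly on X1.
data["Y"] = 2.0 * data["X1"] + rng.normal(size=n_wells)

# Step 1: predictor pairs that are strongly correlated with each other.
flagged = flag_correlated_pairs(data, predictors)

# Steps 2 and 3: compare each predictor's r with the target against the
# critical r, and tabulate the probability of such an r arising by chance.
r_crit = critical_r(n_wells)
r_with_target = data[predictors].corrwith(data["Y"])
summary = pd.DataFrame({
    "r": r_with_target,
    "passes_critical_r": r_with_target.abs() >= r_crit,
    "p_value": [p_value_for_r(r, n_wells) for r in r_with_target],
})

print(flagged)
print(round(r_crit, 3))
print(summary)
```

With 21 wells there are 19 degrees of freedom, so under these assumptions the critical r comes out at about 0.433; a predictor whose correlation with production is smaller than that in absolute value would be flagged as failing the test.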

I do not recommend running these tests and applying the criteria blindly. Rather, I will suggest how to use them to learn more about the data and, in conjunction with domain knowledge about the problem at hand (in this case oil production), to make more informed choices about which variables should be used and which should not.

Go ahead if you want to use my code, modify it, improve it, for non-commercial AND for commercial use. You are also welcome to download and reuse my media files - unless otherwise stated. With both code and images, please give full and clear credit to Matteo Niccoli as the author and mycarta.wordpress.com as the source.
WordPress bloggers are welcome to reblog my posts. For republishing outside of WordPress or any other request, please e-mail me at: matteo@mycarta.ca