Machine learning in geoscience with scikit-learn. Part 1: checking, tidying, and analyzing the dataset

The idea behind this series of articles is to show how to use Machine Learning to predict P-wave velocity, as measured by a geophysical well log (the sonic), from a suite of other logs (density, gamma ray, and neutron) and from depth.

I will explore different Machine Learning methods from the scikit-learn Python library and compare their performances.

To whet your appetite, here’s an example of P-wave velocity, Vp, predicted using a cross-validated linear model, which will be the benchmark for the performance of other models, such as SVM and Random Forest:
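A cross-validated linear benchmark of this kind can be sketched in a few lines of scikit-learn. What follows is only a minimal sketch, not the code from the notebook: the logs are synthetic, and the coefficients, noise level, and 5-fold split are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-ins for the real logs (assumption: illustrative values only).
rng = np.random.default_rng(0)
n = 200
depth = np.linspace(1000.0, 2000.0, n)                        # depth, m
rhob = 2.2 + 0.0002 * (depth - 1000.0) + rng.normal(0.0, 0.03, n)  # density
gr = rng.uniform(20.0, 120.0, n)                              # gamma ray
nphi = 0.35 - 0.0001 * (depth - 1000.0) + rng.normal(0.0, 0.02, n)  # neutron

# Hypothetical linear relationship plus noise, used here as the "measured" Vp.
vp = (2000.0 + 0.5 * (depth - 1000.0) + 1500.0 * (rhob - 2.2)
      - 3000.0 * (nphi - 0.3) - 2.0 * gr + rng.normal(0.0, 50.0, n))

# Feature matrix: one column per predictor log, plus depth.
X = np.column_stack([depth, rhob, gr, nphi])

# Each sample is predicted by a model fitted on the other folds only.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
vp_pred = cross_val_predict(LinearRegression(), X, vp, cv=cv)

score = r2_score(vp, vp_pred)
print(f"Cross-validated R^2 of the linear benchmark: {score:.3f}")
```

Because every prediction comes from a model that never saw that sample, the resulting score is a fair baseline against which to compare SVM or Random Forest later.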

In the first notebook, which is already available on GitHub here, I show how to use the Pandas and Seaborn Python libraries to import the data, check it, clean it up, and visualize it to explore relationships between the variables. For example, shown below is a heatmap with the pairwise Spearman correlation coefficient between the variables (logs):
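As a rough idea of that workflow, here is a minimal sketch with Pandas and Seaborn. It is not the notebook's code: the column names (DEPTH, RHOB, GR, NPHI, VP), the synthetic values, and the use of -999.25 as a null marker are assumptions for illustration, and plot_spearman_heatmap is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Synthetic logs standing in for the imported data (assumption).
rng = np.random.default_rng(42)
n = 300
depth = np.linspace(1000.0, 2500.0, n)
logs = pd.DataFrame({
    "DEPTH": depth,
    "RHOB": 2.2 + 0.0002 * (depth - 1000.0) + rng.normal(0.0, 0.03, n),
    "GR": rng.uniform(20.0, 120.0, n),
    "NPHI": 0.35 - 0.0001 * (depth - 1000.0) + rng.normal(0.0, 0.02, n),
})
logs["VP"] = (2000.0 + 0.5 * (logs["DEPTH"] - 1000.0)
              + 1500.0 * (logs["RHOB"] - 2.2) + rng.normal(0.0, 50.0, n))

# Inject a few null markers (assumed here to be -999.25) to have something to clean.
logs.loc[[5, 50, 120], "GR"] = -999.25

# Check: summary statistics make out-of-range values easy to spot.
print(logs.describe())

# Clean: turn null markers into NaN, then drop incomplete rows.
clean = logs.replace(-999.25, np.nan).dropna()

# Analyze: pairwise Spearman (rank) correlation between all logs.
corr = clean.corr(method="spearman")
print(corr.round(2))


def plot_spearman_heatmap(corr_matrix):
    """Draw an annotated heatmap of a correlation matrix (hypothetical helper)."""
    # Imported here so the correlation step runs without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr_matrix, annot=True, fmt=".2f",
                     cmap="coolwarm", vmin=-1.0, vmax=1.0, square=True)
    ax.set_title("Pairwise Spearman correlation")
    plt.show()
```

Calling plot_spearman_heatmap(corr) draws the figure. Spearman works on ranks, so it captures monotonic relationships between logs even when they are not linear.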

Stay tuned for the next post / notebook!

PS: I am very excited by the kick-off of the Geophysical Tutorial (The Leading Edge) Machine Learning Contest 2016. Check it out here!

Go ahead if you want to use my code, modify it, improve it, for non-commercial AND for commercial use. You are also welcome to download and reuse my media files - unless otherwise stated. With both code and images, please give full and clear credit to Matteo Niccoli as the author and mycarta.wordpress.com as the source.
WordPress bloggers are welcome to reblog my posts. For republishing outside of WordPress or any other request, please e-mail me at: matteo@mycarta.ca