Scikit-learn, the machine learning library built for Python over 10 years ago, is an excellent resource for modelling data and can be integrated into geospatial workflows. Helpfully, scikit-learn supplies an interactive diagram for choosing the best estimator for the job.

One of the big challenges in machine learning is the need to acquire large, high-quality labelled data sets. The diagram above shows that some methods need at least 100k samples. Collecting that much data manually takes significant effort.

I wondered whether I could use scikit-learn to predict terrain classification based on a group of pixels.

Data collection

In total I gathered 517 point samples. Each was labelled 1-7, and I saved a copy to a CSV file to be used later.

1 = Water

2 = Scrubland

3 = Building

4 = Road

5 = Trees

6 = Shadows

7 = Grass

This is not too dissimilar to how I would gather data for a supervised classification.

The same location is shown in both images: on the left I’ve used only Planet data, and on the right I’ve used Sentinel-2 data pan-sharpened with Planet data. The image on the right was the better one to classify. I created a square buffer around these points (using this script in the QGIS Python console); this took a fair amount of trial and error, but eventually I settled on an 8 m square buffer, giving me an 8 m square patch of pixels to classify.
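To make the buffering step concrete, here is a minimal sketch of building a square around a point in plain Python. It is not the QGIS script mentioned above: the function name `square_buffer_wkt` is hypothetical, and it assumes the points are in a projected CRS with metre units and that "8 m" means the side length of the square.

```python
def square_buffer_wkt(x, y, side=8.0):
    """Return a WKT polygon for a square centred on (x, y).

    Assumption: x and y are projected coordinates in metres
    (for example a UTM zone), and `side` is the full side length.
    """
    half = side / 2.0
    xmin, xmax = x - half, x + half
    ymin, ymax = y - half, y + half
    # A WKT ring is closed by repeating the first vertex at the end.
    corners = [
        (xmin, ymin),
        (xmax, ymin),
        (xmax, ymax),
        (xmin, ymax),
        (xmin, ymin),
    ]
    ring = ", ".join(f"{cx} {cy}" for cx, cy in corners)
    return f"POLYGON(({ring}))"


# Hypothetical sample point at easting 100, northing 200.
print(square_buffer_wkt(100.0, 200.0, side=8.0))
```

The WKT string can be loaded into most GIS tools, which makes it easy to check a few buffers by eye before generating all of them.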

These square buffers are used to clip the underlying satellite image to get the sample pixels. To do this I need 517 (my sample size) separate 8 m square polygons. In QGIS this can be done using Vector > Data Management Tools > Split Vector Layer (I’ve got 517 shapefiles; that’s a lot of files!)

I then loop over each shapefile, clip the image to it using gdalwarp, and convert the result to .jpg with gdal_translate.
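The loop could be sketched in Python roughly as follows. This is an illustration rather than the exact commands I ran: the helper names `clip_commands` and `clip_all` and all paths are hypothetical. It relies on gdalwarp's `-cutline` and `-crop_to_cutline` options and gdal_translate's `-of JPEG` output format, and it assumes the GDAL command line tools are on the PATH.

```python
import subprocess
from pathlib import Path


def clip_commands(shapefile, source_raster, out_dir):
    """Build the gdalwarp and gdal_translate commands for one buffer."""
    stem = Path(shapefile).stem
    tif = str(Path(out_dir) / f"{stem}.tif")
    jpg = str(Path(out_dir) / f"{stem}.jpg")
    # Clip the source raster to the polygon and crop to its extent.
    warp = [
        "gdalwarp", "-overwrite",
        "-cutline", str(shapefile),
        "-crop_to_cutline",
        str(source_raster), tif,
    ]
    # Convert the clipped GeoTIFF to JPEG.
    translate = ["gdal_translate", "-of", "JPEG", tif, jpg]
    return warp, translate


def clip_all(shapefile_dir, source_raster, out_dir):
    """Run both commands for every .shp file in shapefile_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for shp in sorted(Path(shapefile_dir).glob("*.shp")):
        for cmd in clip_commands(shp, source_raster, out_dir):
            subprocess.run(cmd, check=True)


# Example call with hypothetical paths:
# clip_all("buffers", "pansharpened.tif", "clips")
```

One caveat: the JPEG driver expects 8-bit data, so if the source image is 16-bit, gdal_translate would likely also need `-ot Byte` and a `-scale` option.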

Running the classifier

In tests on my data I am getting about 80% prediction success, which looks like this:

but not like this, where the label is road but the model predicts building.
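For readers who want to try something similar, here is a minimal sketch of training and scoring a classifier on image patches. Several things in it are assumptions: it uses `RandomForestClassifier` as a stand-in ensemble model, the patches are random synthetic arrays rather than my clipped .jpg files, and the 8 x 8 pixel patch size is hypothetical. The accuracy it produces therefore says nothing about real imagery.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, patch = 517, 8  # patch size in pixels is an assumption

# Synthetic stand-ins: labels 1..7 and RGB patches whose mean
# brightness depends on the label, plus noise.
labels = rng.integers(1, 8, size=n_samples)
patches = rng.normal(
    loc=labels[:, None, None, None] * 30.0,
    scale=20.0,
    size=(n_samples, patch, patch, 3),
)

# Flatten each patch into one feature vector per sample.
X = patches.reshape(n_samples, -1)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
# Rows are true labels, columns are predicted labels (1..7).
cm = confusion_matrix(y_test, y_pred, labels=list(range(1, 8)))

print(f"accuracy: {acc:.2f}")
# cm[3, 2] counts samples labelled Road (4) but predicted Building (3).
print("road predicted as building:", cm[3, 2])
```

The confusion matrix is the useful part here: it shows which pairs of classes get mixed up, such as road and building, rather than only the overall success rate.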

Next steps?

As scikit-learn has several other ensemble methods, it is pretty straightforward to compare each method to find the best fit to the data. This is certainly worthy of further investigation.
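A comparison like that might look like the sketch below. The choice of models, their settings, and the synthetic data from `make_classification` are all assumptions for illustration; with real patches, `X` would be the flattened pixel values and `y` the 1-7 labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 517 labelled samples with 7 classes.
X, y = make_classification(
    n_samples=517,
    n_features=48,
    n_informative=12,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each ensemble method.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

Cross-validation is used instead of a single train/test split because, with only a few hundred samples, one split can give a noisy estimate.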

For a machine learning dataset this is still pretty small. Increasing the size of the labelled data should improve my classifier. The 517 samples would also benefit from a QC pass: the raster patches were clipped from buffers around manually collected points, so there will almost certainly be pixels that are mis-assigned, which could explain the road label above.

Ideally I’d like to click on an image and use the classifier to predict what I am clicking on and report back. Now that would be smart!