I remember experimenting with doing regressions in Python using R-style formulae a long time ago, and I remember it being a bit complicated. Luckily it’s become really easy now – and I’ll show you just how easy.

Before running this you will need to install the pandas, statsmodels and patsy packages. If you’re using conda you should be able to do this by running the following from the terminal:

conda install statsmodels patsy

(and then say yes when it asks you to confirm it)

importpandasaspdfromstatsmodels.formula.apiimportols

Before we can do any regression, we need some data – so lets read some data on cars:

df=pd.read_csv("http://web.pdx.edu/~gerbing/data/cars.csv")

You may have noticed from the code above that you can just give a URL to the read_csv function and it will download it and open it – handy!

Anyway, here is the data:

df.head()

Model

MPG

Cylinders

Engine Disp

Horsepower

Weight

Accelerate

Year

Origin

0

amc ambassador dpl

15.0

8

390.0

190

3850

8.5

70

American

1

amc gremlin

21.0

6

199.0

90

2648

15.0

70

American

2

amc hornet

18.0

6

199.0

97

2774

15.5

70

American

3

amc rebel sst

16.0

8

304.0

150

3433

12.0

70

American

4

buick estate wagon (sw)

14.0

8

455.0

225

3086

10.0

70

American

Before we do our regression it might be a good idea to look at simple correlations between columns. We can get the correlations between each pair of columns using the corr() method:

df.corr()

MPG

Cylinders

Engine Disp

Horsepower

Weight

Accelerate

Year

MPG

1.000000

-0.777618

-0.805127

-0.778427

-0.832244

0.423329

0.580541

Cylinders

-0.777618

1.000000

0.950823

0.842983

0.897527

-0.504683

-0.345647

Engine Disp

-0.805127

0.950823

1.000000

0.897257

0.932994

-0.543800

-0.369855

Horsepower

-0.778427

0.842983

0.897257

1.000000

0.864538

-0.689196

-0.416361

Weight

-0.832244

0.897527

0.932994

0.864538

1.000000

-0.416839

-0.309120

Accelerate

0.423329

-0.504683

-0.543800

-0.689196

-0.416839

1.000000

0.290316

Year

0.580541

-0.345647

-0.369855

-0.416361

-0.309120

0.290316

1.000000

Now we can do some regression using R-style formulae. In this case we’re trying to predict MPG based on the year that the car was released:

model=ols("MPG ~ Year",data=df)results=model.fit()

The ‘formula’ that we used above is the same as R uses: on the left is the dependent variable, on the right is the independent variable. The ols method is nice and easy, we just give it the formula, and then the DataFrame to use to get the data from (in this case, it’s called df). We then call fit() to actually do the regression.

We can easily get a summary of the results here – including all sorts of crazy statistical measures!

results.summary()

OLS Regression Results

Dep. Variable:

MPG

R-squared:

0.337

Model:

OLS

Adj. R-squared:

0.335

Method:

Least Squares

F-statistic:

198.3

Date:

Sat, 20 Aug 2016

Prob (F-statistic):

1.08e-36

Time:

10:42:17

Log-Likelihood:

-1280.6

No. Observations:

392

AIC:

2565.

Df Residuals:

390

BIC:

2573.

Df Model:

1

Covariance Type:

nonrobust

coef

std err

t

P>|t|

[95.0% Conf. Int.]

Intercept

-70.0117

6.645

-10.536

0.000

-83.076 -56.947

Year

1.2300

0.087

14.080

0.000

1.058 1.402

Omnibus:

21.407

Durbin-Watson:

1.121

Prob(Omnibus):

0.000

Jarque-Bera (JB):

15.843

Skew:

0.387

Prob(JB):

0.000363

Kurtosis:

2.391

Cond. No.

1.57e+03

We can do a more complex model easily too. First lets list the columns of the data to remind us what variables we have:

We can see that bringing in some extra variables has increased the $R^2$ value from ~0.3 to ~0.8 – although we can see that the P value for the Horsepower is very high. If we remove Horsepower from the regression then it barely changes the results:

We can also see if introducing categorical variables helps with the regression. In this case, we only have one categorical variable, called Origin. Patsy automatically treats strings as categorical variables, so we don’t have to do anything special – but if needed we could wrap the variable name in C() to force it to be a categorical variable.

You can see here that Patsy has automatically created extra variables for Origin: in this case, European and Japanese, with the ‘default’ being American. You can configure how this is done very easily – see here.

Just for reference, you can easily get any of the statistical outputs as attributes on the results object:

This is just a very brief reminder about something you might run into when you’re trying to get your code to work on multiple platforms – in this case, OS X, Linux and Windows.

Basically: file names/paths are case-sensitive on Linux, but not on OS X or Windows.

Therefore, you could have some Python code like this:

f = open(os.path.join(base_path, 'LE72020252003106EDC00_B1.tif'))

which you might use to open part of a Landsat 7 image – and it would work absolutely fine on OS X and Windows, but fail on Linux. I initially assumed that the failure on Linux was due to some of the crazy path manipulation stuff that I had done to get base_path – but it wasn’t.

It was purely down to the fact that the file was actually called LE72020252003106EDC00_B1.TIF, and Linux treats LE72020252003106EDC00_B1.tif and LE72020252003106EDC00_B1.TIF as different files.

I’d always known that paths on Windows are not case-sensitive, and that they are case-sensitive on Linux – but I’d naively assumed that OS X paths were case-sensitive too, as OS X is based on a *nix backend, but I was wrong.

If you really have problems with this then you could fairly easily write a function that checked to see if a filename exists, and if it found that it didn’t then tried searching for files using something like a case-insensitive regular expression – but it’s probably just easiest to get the case of the filename right in the first place!

This is a quick post to brief describe a problem I ran into the other day when trying to debug someone’s code – the answer may be entirely obvious to you, but it took me a while to work out, so I thought I’d document it here.

The problem that I was called over to help with was a line of code like this:

t.do_something(a=5, b=10)

where t was an instance of a class. Now, this wasn’t the way that I usually write code – I tend to only use keyword arguments after I’ve already used positional arguments – but it reminded me that in Python the following calls to the function f(a, b) are equivalent:

f(1, 2)
f(1, b=2)
f(a=1, b=2)

Anyway, going back to the original code: it gave the following error:

do_something() got multiple values for argument 'a'

which I thought was very strange, as there was definitely only one value of a given in the call to that method.

If you consider yourself to be a reasonably advanced Python programmer than you might want to stop here and see if you can work out what the problem is. Any ideas?

When you’ve had a bit of a think, continue below…

I had a look at the definition of t.do_something(), and it looked like this:

You may have noticed the problem now – although at first glance I couldn’t see anything wrong… There was definitely only one parameter called a, and it definitely wasn’t being passed twice…so what was going on?!

As you’ve probably noticed by now…this method was missing the self parameter – and should have been defined as do_something(self, a, b). Changing it to that made it work fine, but it’s worth thinking about exactly why we were getting that specific error.

Firstly, let’s have a look at a more ‘standard’ error that you might get when you forget to add self as the first argument for an instance method. We can see this by just calling the method without using keyword arguments (that is, t.do_something(1, 2)), which gives:

TypeError: test_noself() takes 2 positional arguments but 3 were given

Now, once you’ve been programming Python for a while you’ll be fairly familiar with this error from when you’ve forgotten to put self as the first parameter for an instance method. The reason this specific error is produced is that Python will always pass instance methods the value of selfas well as the arguments you’ve given the method. So, when you run the code:

t.do_something(1, 2)

Python will change this ‘behind the scenes’, and actually run:

t.do_something(t, 1, 2)

and as do_something is only defined to take two arguments, you’ll get an error. Of course, if your function had been able to take three arguments (for example, if there was an optional third argument), then you would find that t (which is the value of self in this case) was being passed as the first argument (a), 1 as the value of the second argument (b) and 2 as the value of the third argument (which could have been called c). This is a good point to remind you that the first argument of methods is only called self by convention – and that Python itself doesn’t care what you call it (although you should always call it self!)

From this, you should be able to work out why you’re getting an error about getting multiple values for the argument a… What’s happening is that Python is passing self to the method, as the first argument (which we have called a), and is then passing the two other arguments that we specified as keyword arguments. So, the ‘behind the scenes’ code calling the function is:

t.do_something(t, a=1, b=2)

But, the first argument is called a, so this is basically equivalent to writing:

t.do_something(a=t, a=1, b=2)

which is obviously ambiguous – and so Python throws an error.

Interestingly, it is quite difficult to get into a situation in which Python throws this particular error – if you try to run the code above you get a different error:

SyntaxError: keyword argument repeated

as Python has realised that there is a problem from the syntax, before it even tries to run it. You can manage it by using dictionary unpacking:

def f(a, b):
pass
d = {'a':1, 'b':2}
f(1, **d)

Here we are defining a function that takes two arguments, and then calling it with a single positional argument for a, and then using the ** method of dictionary unpacking to take the dictionary d and convert each key-value pair to a keyword argument and value combination.

So, congratulations if you’d have solved this problem far quicker than me – but I hope it has made you think a bit more about how Python handles positional and keyword arguments. A few points to remember:

Always remember to use self as the first argument of your methods! (This would have stopped this problem ever happening!)

But remember that the name self is just a convention, and Python will pass the instance of your class to your first argument regardless what it is called, which can cause weird problems.

All positional arguments can be passed as keyword arguments, and vice-versa – they are entirely interchangeable – which, again, can cause problems if this isn’t what you intended.

A key – but challenging – part of learning to program is moving from writing technically-correct code “that works” to writing high-quality code that is sensibly decomposed into functions, generically-applicable and generally “good”. Indeed, you could say that this is exactly what Software Carpentry is about – taking you from someone bodging together a few bits of wood in the shed, to a skilled carpenter. As well as being challenging to learn, this is also challenging to teach: how should you show the progression from “working” to “good” code in a teaching context?

I’ve been struggling with this recently as part of some small-group programming teaching I’ve been doing. Simply showing the “before” and “after” ends up bombarding the students with too many changes at once: they can’t see how you get from one to the other, so I want some way to show the development of code over time as things are gradually done to it (for example, moving this code into a separate function, adding an extra argument to that function to make it more generic, renaming these variables and so on). Obviously when teaching face-to-face I can go through this interactively with the students – but some changes to real-world code are too large to do live – and students often seem to find these sorts of discussions a bit overwhelming, and want to refer back to the changes and reasoning later (or they may want to look at other examples I’ve given them). Therefore, I want some way to annotate these changes to give the explanation (to show why we’re moving that bit of code into a separate function, but not some other bit of code), but to still show them in context.

Exactly what code should be used for these examples is another discussion: I’ve used real-world code from other projects, code I’ve written specifically for demonstration, code I’ve written myself in the past and sometimes code that the students themselves have written.

So far, I’ve tried the following approaches for showing these changes with annotation:

Making all of the changes to the code and providing a separate document with an ordered list of what I’ve changed and why.Simple and low-tech, but often difficult for the students to visualise each change

The same as above but committing between each entry in the list.Allows them to step through git commits if they want, and to get back to how the code was after each individual change – but many of the students struggle to do this effectively in git, and it adds a huge technological barrier…particularly with Git’s ‘interesting’ user-interface.

The same as above, but using Github’s line comments feature to put comments at specific locations in the code.Allows annotations at specific locations in the code, but rather clunky to step through the full diff view of commits in order using Github’s UI.

I suspect any solution will involve some sort of version control system used in some way (although I’m not sure that standard diffs are quite the best way to represent changes for this particular use-case), but possibly with a different interface on it.

Is this a problem anyone else has faced in their teaching? Can you suggest any tools or approaches that might make this easier – for both the teacher and students?

I use data from the AERONET network of sun photometers a lot in my work, and do a lot of processing of the data in Python. As part of this I usually want to load the data into pandas – but because of the format of the data, it’s not quite as simple as it could be.

So, for those of you who are impatient, here is some code that reads an AERONET data file into a pandas DataFrame which you can just download and use:

404: Not Found

For those who want more details, read on…

Once you’ve downloaded an AERONET data file and unzipped it, you’ll find you have a file called something like 050101_161231_Chilbolton.lev20, and if you look at the start of the file it’ll look a bit like this:

You can see here that we have a few lines of metadata at the top of the file, including the ‘level’ of the data (AERONET data is provided at three levels, 1.0, 1.5 and 2.0, referring to the quality assurance of the data), and some information about the AERONET site.

In this function we’re just going to ignore this metadata, and start reading at the 5th line, which contains the column headers. Now, you’ll see that the data looks like a fairly standard CSV file, so we should be able to read it fairly easily with pd.read_csv. This is true, and you can read it using:

df = pd.read_csv(filename, skiprows=4)

However, you’ll find a few issues with the DataFrame you get back from that simple line of code: firstly dates and times are just left as strings (rather than being parsed into proper datetime columns) and missing data is still shown as the string ‘N/A’. We can solve both of these:

No data: read_csv allows us to specify how ‘no data’ values are represented in the data, so all we need to do is set this: pd.read_csv(filename, skiprows=4, na_values=['N/A']) Note: we need to give na_values a list of values to treat as no data, hence we create a single-element list containing the string N/A.

Dates & times: These are a little harder, mainly because of the strange format in which they are provided in the file. Although the column header for the first column says Date(dd-mm-yy), the date is actually colon-separated (dd:mm:yy). This is a very unusual format for a date, so pandas won’t automatically convert it – we have to help it along a bit. So, first we define a function to parse a date from that strange format into a standard Python datetime:

dateparse = lambda x: pd.datetime.strptime(x, "%d:%m:%Y %H:%M:%S")

I could have written this as a normal function (def dateparse(x)), but I used a lambda expression as it seemed easier for such a short function. Once we’ve defined this function we tell pandas to use it to parse dates (date_parser=dateparse) and also tell it that the first two columns together represent the time of each observation, and they should be parsed as dates (parse_dates={'times':[0,1]}).

That’s all we need to do to read in the data and convert the right columns, the rest of the function just does some cleaning up:

We set the times as the index of the DataFrame, as it is the unique identifier for each observation – and makes it easy to join with other data later.

We remove the JulianDay column, as it’s rather useless now that we have a properly parsed timestamp

We drop any columns that are entirely NaN and any rows that are entirely NaN (that’s what dropna(axis=1, how='all') does).

We rename a column, and then make sure the data is sorted

aeronet = aeronet.set_index('times')
del aeronet['Julian_Day']
# Drop any rows that are all NaN and any cols that are all NaN
# & then sort by the index
an = (aeronet.dropna(axis=1, how='all')
.dropna(axis=0, how='all')
.rename(columns={'Last_Processing_Date(dd/mm/yyyy)': 'Last_Processing_Date'})
.sort_index())

You’ll notice that the last few bits of this ‘post-processing’ were done using ‘method-chaining’, where we just ‘chain’ pandas methods one after another. This is often a very convenient way to work in Python – see this blog post for more information.

So, that’s how this function works – now go off and process some AERONET data!

I now use Anaconda as my primary Python distribution – and my company have also adopted it for use on all of their developer machines as well as their servers – so I like to think I’m a relatively knowledgeable user. However, the other day I came across a wonderful feature that I’d never known about before…revisions!

The best way to explain is by a quick example. If you run conda list --revisions, you’ll get an output like this:

In this output you can see a number of specific versions (or revisions) of this environment (in this case the default conda environment), along with the date/time they were created, and the differences (installed packages shown as +, uninstalled shown as - and upgrades shown as ->). If you want to revert to a previous revision you can simply run conda install --revision N (where N is the revision number). This will ask you to confirm the relevant package uninstallation/installation – and get you back to exactly where you were before!

So, I think that’s pretty awesome – and really handy if you screw things up and want to go back to a previously working environment. I’ve got a few other hints for you though…

Firstly, if you ‘revert’ to a previous revision then you will find that an ‘inverse’ revision is created, simply doing the opposite of what the previous revision did. For example, if your revision list looks like this:

You can see that the changes for revision 3 are just the inverse of revision 2.

One more thing is that I’ve found out that all of this data is stored in the history file in the conda-meta directory of your environment (CONDA_ROOT/conda-meta for your default environment and CONDA_ROOT/envs/ENV_NAME/conda-meta for any other environment). You don’t want to know why I went searching for this file (it’s a long story involving some stupidity on my part), but it’s got some really useful contents:

Specifically, it doesn’t just give you the list of what was installed, uninstalled or upgraded – but it also gives you the commands you ran! If you want, you can extract these commands with a bit of command-line magic:

(For reference, the command-line magic gets the content of the history file, searches for all lines starting with # cmd, and then splits the line by spaces and extracts everything from the 3rd group onwards)

I find environment.yml files to be a bit of a pain sometimes (they’re not always cross-platform compatible – see this issue), so this is quite useful as it actually gives me the commands that I ran to create the environment.

There was a question recently on the Py6S mailing list about what data sources are best to use to provide atmospheric parameters (such as AOT, water vapour and ozone) for use with 6S, other atmospheric Radiative Transfer Models (such as MODTRAN) or other atmospheric correction algorithms (such as ATCOR). In the spirit of ‘reply to public’ I decided I’d post the response here, as it’d certainly be easier for other people to find in future!

The first thing to say is if you’re doing atmospheric correction, then by far the best solution is to derive as many of the parameters as possible from the satellite data that you are trying to correct. Sometimes this is relatively easy (for example, when using MODIS data, as there are standard MODIS products for AOT, water vapour and ozone) and sometimes harder (there are no standard atmospheric products for SPOT, and algorithms for estimating atmospheric parameters from Landsat are relatively limited). The main reasons for this are:

The atmospheric data will be from the exact time of the satellite image acquisition

Depending on the satellite, the atmospheric data may be provided for every pixel in the image, and if this isn’t the case then there might be more than one value across the image (for example, the LEDAPS algorithm estimates AOT over all pixels containing Dense Dark Vegetation). This is key as it allows the atmospheric correction algorithm to take into account changes in atmospheric conditions across the image. I have published research (Wilson et al., 2014) showing that failing to take into account this variability when doing atmospheric correction can lead to significant errors.

However, if you’ve found this post then you’ve probably either already tried using data from the satellite you’re working with, and found it isn’t possible – or you’re doing radiative transfer modelling for some other purpose than atmospheric correction, so the explanation above had no relevance to you. So, on with the other data sources – and in Part 1 we’re going to cover aerosol-related data.

Aerosol Optical Thickness

(Also known as Aerosol Optical Depth, just to confuse you!)

This is a measure of the haziness of the atmosphere, and is set in Py6S as: s.aot550 = x Please note that this means AOT at 550nm (yes, AOT is wavelength dependent!), so you may have to interpolate the AOT data from different wavelengths. I’ll post a separate post soon about how best to do this.

When choosing an AOT data source, you usually have a trade-off between accuracy, spatial resolution and temporal resolution. If you’re lucky, you can manage to get two of those, but not three! So, going in order of accuracy:

AERONET

The Aerosol Robotic Network (AERONET) is the gold-standard in terms of accurate measurement of AOT. The measurements are acquired from ground-based sun photometers, which have the major benefit of being able to produce very accurate measurements at a high frequency – as they don’t depend on satellite orbits. However, there are two major disadvantages: they have a very low spatial resolution, and they only produce measurements when the sun is shining.

Specifically, although there are over 1,000 measurement sites listed on the AERONET website, many of these are for short-term measurement campaigns and so the number of sites regularly recording measurements over a long period of time is much lower. For example, excluding short-term sites, there are only three measurement sites in the UK (Hampshire, London and Edinburgh), and there are many parts of the world with very few sites (such as central Africa).

Also, as mentioned above, measurements are only acquired when there is a direct view of the sun from the instrument. This obviously means that measurements are not acquired during the night (which can stretch for many months near the poles), but also that cloud cover has a significant impact on the frequency of measurements. The AERONET cloud-screening algorithms have to be very strict, as even a tiny amount of cloud will cause a large error in the AOT measurement.

Other sun photometer data

AERONET run a large network of sun photometers, but there are many other sun photometers run by other people/groups, and these can be good sources of data. Indeed, you may even have acquired your own measurements using your own sun photometer (such as the Microtops sun photometer pictured below). The advantages and disadvantages above still apply, although if you’ve taken measurements yourself then I assume they are in the right place and at the right time!

MODIS AOT

Data from the MODIS sensor on the Aqua and Terra satellites is operationally processed to produce AOT data at 10km and 3km resolution, the MOD04 and MOD04_3K products. AOT estimates are produced over all land surfaces, but measurements over bright surfaces (such as deserts and snow) tend to have a higher error.

The Aqua and Terra satellites are in a sun-synchronous orbit, so passes over each location at approximately 10:30 and 13:30 local solar time, giving a total of two measurements a day. Again, cloud cover is an issue, and the MODIS algorithm masks out any pixel containing cloud.

The MODIS data is provided as HDF files with loads of Scientific Data Sets (SDS, basically bands) – the one you want to use is called Optical_Depth_Land_And_Ocean, which is high quality AOT data (where the QAC, the measure of retrieval quality, is 3) from both land and ocean, using all algorithms.

The MOD08 dataset provides the same data at a lower resolution (1 degree), if spatial resolution is not important for you.

Data available from: Data is not currently available online, but I think it is due to be released as an official MODIS product within the next six months or so.

Aerosol Type

There are lots of ways of parameterising aerosol types in radiative transfer models (see here for a list of the ways you can do this in Py6S) – and it is often difficult to work out what the best option to use is. I’ve listed some options below, but this is unlikely to be a comprehensive list – so please leave a comment if you know of any other data sources.

AERONET

As well as providing AOT data, AERONET sun photometers also acquire other measurements which can be used to infer the aerosol type. This is almost certainly the most accurate way of setting the aerosol type – and there is a built-in function in Py6S that will help you do it. See here for full details on how to use it: basically you just download the AERONET data and give Py6S the filename and date/time and it does the rest.

Data available from: NASA AERONET site (find AERONET Inversions on the left-hand side)

MODIS

The MOD04 product mentioned above also provides an estimation of aerosol type, in five categories: dust, sulfate, smoke, heavy absorbing smoke and mixed. It’s not entirely clear how best to match these to the predefined 6S aerosol types (which are biomass burning, continental, desert, maritime, stratospheric and urban) – although there is an option in 6S to set a user-defined aerosol profile, which allows you to specify the relative fraction of each of dust, water, oceanic and soot aerosols (see AeroProfile.User)

Aerosol type climatologies

A number of aerosol type climatologies have been produced, giving an indication of the typical aerosol type observed at each location, over time. These climatologies vary in the level of detail – both in terms of spatial resolution (which is often coarse) and temporal resolution (often monthly or seasonally). The best way to find these is to search for papers about aerosol climatologies, and then either look for a URL in the paper or contact the authors to ask for the data. I’ve only included a couple of examples here as new climatologies are being published frequently:

If someone wants to see how many times my work has been referenced, they’d probably go and look at my citation statistics, for example on my Google Scholar profile. At the time of writing, that shows that I have 16 citations overall, and a h-index of 2. However, I don’t think this tells the whole story.

Specifically, it only counts references to academic papers, books or conference proceedings (‘traditionally-citable items’), and it doesn’t take into account the far wider use of some of these items beyond traditional citations in other papers, books or conference proceedings. This misses out a lot of uses of my work – many of which are uses that I think are actually important (possibly more important than citations in other academic work).

(Please note that I’m not trying to point to other non-traditional references because I feel that my citation count is “too low”, and I’m also not trying to say that a mention of my website URL is equivalent of a citation of my paper in another journal article – but I think these things should be taken into account, and it is hard to do so at the moment).

Anyway, I’ve been doing some investigation into some of the other uses of my work – mostly using various Google and Google Scholar searches – and I’ve found a lot of non-traditional references, some of which I didn’t know about before. For example:

Finally, my FreeGISData website has been referenced a lot, in at least six journal articles, four books or book chapters (including one non-academic book), a couple of subject magazines, two PhD theses and at least one MOOC course.

A full list of non-traditional references to my work is available here, but I hope you’ll agree that – although they are not traditional references in traditional academic works – these uses and applications of my work are important, and show a real impact – often beyond the academic community.

I’m supervising an MSc student for her thesis this summer, and the work she’s doing with me is going to involve a fair amount of programming, in the context of remote sensing & GIS processing. She’s got experience programming in IDL from a programming course during the taught part of her Masters, but has no experience of Python.

I’ve just sent her an email with links to some useful resources, but in the spirit of Matt Might’s Blog tips for busy academics, I thought it would be worth doing a ‘reply to public’, and putting the list of resources here. So, here goes…

Software Carpentry Python course:- this is designed for people new to programming, so some of it will be very easy for you. However, it takes you through a good example of scientific programming with Python, including plotting graphs and dealing with arrays. I suggest that you use the Jupyter Notebook to run through this – there are instructions on how to do that in the course, and there is a brief intro to the notebook in this YouTube video (start at approx 3 minutes – the stuff before that is irrelevant for you)

Lewis’s Scientific Programming for RS course (from UCL): – The most useful bits will probably be Python 101, Plotting and Numerical Python and Geospatial Data. Click the ‘Course Notes’ under each section to see the detailed notes.

Python Scripting with Spatial Data is also good – it gives a good intro to Python, and then covers spatial analysis using GDAL and RIOS (we won’t be using RIOS, but the GDAL stuff is good).

Geoprocessing with Python using Open Source GIS: This is a very good set of slides and tutorials – along with assignments, homework tasks and solutions – from a course run at Utah State University. Unfortunately it is now very old – it was written in 2008/9 – and so it refers to a number of out of date things (such as the ‘numeric’ library for Python, which has been replaced by numpy). A lot of the GDAL content is still useful though – for example, this set of slides on reading raster data with GDAL is pretty good.

There are various good tutorials from conferences such as SciPy (Scientific Python) and FOSS4G (Free and Open Source Software for Geospatial). For example, you can watch a video of a three-hour tutorial from SciPy 2015 called Geospatial Data with Open Source tools in Python, and you can find the slides and other resources here. This goes into quite a lot of depth, and new Python programmers may find it all quite daunting – but it demonstrates the nice modern ways of doing things (using libraries like Fiona and rasterio) rather than the less-nice and lower-level GDAL library.

Another couple of resources (suggested by Chris Holden in the comments – thanks!) are the WUR Geoscripting resources, which start with R and then moves on to Python, and his own Open Source Geoprocessing tutorial, which covers GDAL and OGR, and demonstrates how to calculate NDVI and even goes as far as land cover classification.

There are also a few useful resources for switching from IDL to Python. Specifically:

A Numpy reference for IDL users: (Numpy is the Python library that provides functions to manipulate arrays – unlike IDL, this isn’t included by default in Python – but it does come with the Anaconda distribution I mentioned above).

I wrote a blog post comparing a set of ‘Ten Little Programs’ in IDL with equivalents in Python, which should give you an idea of the similarities and differences, and how you can translate some of your code.

This is a very short list of a few resources – I’m sure there are some better ones out there, and so please let me know if you’ve got any recommendations!

As part of my fellowship with the Software Sustainability Institute, I’ve written an article on Software Sustainability in Remote Sensing. This article was originally written a couple of years ago and it never quite got around to being published. However, I have recently updated it, and it’s now been posted on the SSI’s blog. I’ve also ‘reblogged it’ below: it is a long read, but hopefully you’ll agree that it is worth it.

If anyone reading this is also interested in software sustainability in remote sensing, and identifies with some of the issues raised below then please get in touch! I’d love to discuss it with you and see what we can do to improve things.

1. What is remote sensing?

Remote sensing broadly refers to the acquisition of information about an object remotely (that is, with no physical contact). The academic field of remote sensing, however, is generally focused on acquiring information about the Earth (or other planetary bodies) using measurements of electromagnetic radiation taken from airborne or satellite sensors. These measurements are usually acquired in the form of large images, often containing measurements in a number of different parts of the electromagnetic spectrum (for example, in the blue, green, red and near-infrared), known as wavebands. These images can be processed to generate a huge range of useful information including land cover, elevation, crop health, air quality, CO2 levels, rock type and more, which can be mapped easily over large areas. These measurements are now widely used operationally for everything from climate change assessments (IPCC, 2007) to monitoring the destruction of villages in Darfur (Marx and Loboda, 2013) and the deforestation of the Amazon rainforest (Kerr and Ostrovsky, 2003).

Figure 1: Landsat 8 image of Southampton, UK shown as a false-colour composite using the Near-Infrared, Red and Green bands. Vegetation is bright red, heathland and bare ground is beige/brown, urban areas are light blue, and water is dark blue.

The field, which crosses the traditional disciplines of physics, computing, environmental science and geography, developed out of air photo interpretation work during World War II, and expanded rapidly during the early parts of the space age. The launch of the first Landsat satellite in 1972 provided high-resolution images of the majority of the Earth for the first time, producing repeat images of the same location every 16 days at a resolution of 60m. In this case, the resolution refers to the level of detail, with a 60m resolution image containing measurements for each 60x60m area on the ground. Since then, many hundreds of further Earth Observation satellites have been launched which take measurements in wavelengths from ultra-violet to microwave at resolutions ranging from 50km to 40cm. One of the most recent launches is that of Landsat 8, which will contribute more data to an unbroken record of images from similar sensors acquired since 1972, now at a resolution of 30m and with far higher quality sensors. The entire Landsat image archive is now freely available to everyone – as are many other images from organisations such as NASA and ESA – a far cry from the days when a single Landsat image cost several thousands of dollars.

2. You have to use software

Given that remotely sensed data are nearly always provided as digital images, it is essential to use software to process them. This was often difficult with the limited storage and processing capabilities of computers in the 1960s and 1970s – indeed, the Corona series of spy satellites operated by the United States from 1959 to 1972 used photographic film which was then sent back to Earth in a specialised ‘re-entry vehicle’ and then developed and processed like standard holiday photos!

However, all civilian remote-sensing programs transmitted their data to ground stations on Earth in the form of digital images, which then required processing by both the satellite operators (to perform some rudimentary geometric and radiometric correction of the data) and the end-users (to perform whatever analysis they wanted from the images).

In the early days of remote sensing, computers didn’t even have the ability to display images onscreen, and the available software was very limited (see Figure 2a). Nowadays remote sensing researchers use a wide range of proprietary and open-source software, which can be split into four main categories:

Figure 2: Remote sensing image processing in the early 1970s (a) and 2013 (b). The 1970s image shows a set of punched cards and an ‘ASCII-art’ printout of a Landsat image of Southern Italy, produced by E. J. Milton as part of his PhD. The 2013 image shows the ENVI software viewing a Landsat 8 image of Southampton on a Windows PC

I surveyed approximately forty people at the Remote Sensing and Photogrammetry Society’s Annual Student Meeting 2013 to find out how they used software in their research, and this produced a list of 19 pieces of specialist software which were regularly used by the attendees, with some individual respondents regularly using over ten specialist tools.

Decisions about which software to use are often based on experience gained through university-level courses. For example, the University of Southampton uses ENVI and ArcGIS as the main software in most of its undergraduate and MSc remote sensing and GIS courses, and many of its students will continue to use these tools significantly for the rest of their career. Due to the limited amount of teaching time, many courses only cover one or two pieces of specialist software, thus leaving students under-exposed to the range of tools which are available in the field. There is always a danger that students will learn that tool rather than the general concepts which will allow them to apply their knowledge to any software they might need to use in the future

I was involved in teaching a course called Practical Skills in Remote Sensing as part of the MSc in Applied Remote Sensing and GIS at the University of Southampton which tried to challenge this, by introducing students to a wide range of proprietary and open-source software used in remote sensing, particularly software which performs more specialist tasks than ENVI and ArcGIS. Student feedback showed that they found this very useful, and many went on to use a range of software in their dissertation projects.

2.1 Open Source Software

In the last decade there has been huge growth in open-source GIS software, with rapid developments in tools such as GIS (a GIS display and editing environment with a similar interface to ArcGIS) and GRASS (a geoprocessing system which provides many functions for processing geographic data and can also be accessed through GIS). Ramsey (2009) argues that the start of this growth was caused by the slowness of commercial vendors to react to the growth of online mapping and internet delivery of geospatial data, and this created a niche for open-source software to fill. Once it had a foothold in this niche, open-source GIS software was able to spread more broadly, particularly as many smaller (and therefore more flexible) companies started to embrace GIS technologies for the first time.

Unfortunately, the rapid developments in open-source GIS software have not been mirrored in remote sensing image processing software. A number of open-source GIS tools have specialist remote sensing functionality (for example, the i.* commands in GRASS), but as the entire tool is not focused on remotely sensed image processing they can be harder to use than tools such as ENVI. Open-source remote sensing image processing tools do exist (for example, Opticks, OrfeoToolbox, OSSIM, ILWIS and InterImage), but they tend to suffer from common issues for small open-source software projects, specifically: poor documentation; complex and unintuitive interfaces; and significant unfixed bugs.

In contrast, there are a number of good open-source non-image-based remote sensing tools, particularly those used for physical modelling (for example, 6S (Vermote et al., 1997), PROSPECT and SAIL (Jacquemoud, 1993)), processing of spectra (SAMS and SPECCHIO (Bojinski et al., 2003; Hueni et al., 2009)), and as libraries for programmers (GDAL, Proj4J).

Open source tools are gradually becoming more widely used within the field, but their comparative lack of use amongst some students and researchers compared to closed-source software may be attributed to their limited use in teaching. However, organisations such as the Open Source Geospatial Laboratories (OSGL:www.osgeo.org) are helping to change this, through collating teaching materials based on open-source GIS at various levels, and there has been an increase recently in the use of tools such as Quantum GIS for introductory GIS teaching.

3. Programming for remote sensing

There is a broad community of scientists who write their own specialist processing programs within the remote sensing discipline, with 80% of the PhD students surveyed having programmed as part of their research. In many disciplines researchers are arguing for increased use and teaching of software, but as there is no real choice as to whether to use software in remote sensing, it is the use and teaching of programming that has become the discussion ground.

The reason for the significantly higher prevalence of programming in remote sensing compared to many other disciplines is that much remote sensing research involves developing new methods. Some of these new methods could be implemented by simply combining functions available in existing software, but most non-trivial methods require more than this. Even if the research is using existing methods, the large volumes of data used in many studies make processing using a GUI-based tool unattractive. For example, many studies in remote sensing use time series of images to assess environmental change (for example, deforestation or increases in air pollution) and, with daily imagery now being provided by various sensors, these studies can regularly use many hundreds of images. Processing all of these images manually would be incredibly time-consuming, and thus code is usually written to automate this process. Some processing tools provide ‘macro’ functionality which allow common tasks to be automated, but this is not available in all tools (for example, ENVI) and is often very limited. I recently wrote code to automate the downloading, reprojecting, resampling and subsetting of over 7000 MODIS Aerosol Optical Thickness images, to create a dataset of daily air pollution levels in India over the last ten years: creating this dataset would have been impossible by hand!

The above description suggests two main uses of programming in remote sensing research:

Writing code for new methods, as part of the development and testing of these methods

Writing scripts to automate the processing of large volumes of data using existing methods

3.1 Programming languages: the curious case of IDL and the growth of Python

The RSPSoc questionnaire respondents used languages as diverse as Python, IDL, C, FORTRAN, R, Mathematica, Matlab, PHP and Visual Basic, but the most common of these were Matlab, IDL and Python. Matlab is a commercial programming language which is commonly used across many areas of science, but IDL is generally less well-known, and it is interesting to consider why this language has such a large following within remote sensing.

IDL’s popularity stems from the fact that the ENVI remote sensing software is implemented in IDL, and exposes a wide range of its functionality through an IDL Application Programming Interface (API). This led to significant usage of IDL for writing scripts to automate processing of remotely-sensed data, and the similarity of IDL to Fortran (particularly in terms of its array-focused nature) encouraged many in the field to embrace it for developing new methods too. Although first developed in 1977, IDL is still actively developed by its current owner (Exelis Visual Information Solutions) and is still used by many remote sensing researchers, and taught as part of MSc programmes at a number of UK universities, including the University of Southampton.

However, I feel that IDL’s time as one of the most-used languages in remote sensing is coming to an end. When compared to modern languages such as Python (see below), IDL is difficult to learn, time-consuming to write and, of course, very expensive to purchase (although an open-source version called GDL is available, it is not compatible with the ENVI API, and thus negates one of the main reasons for remote sensing scientists to use IDL).

Python is a modern interpreted programming language which has an easy-to-learn syntax – often referred to as ‘executable pseudocode’. There has been a huge rise in the popularity of Python in a wide range of scientific research domains over the last decade, made possible by the development of the numpy (‘numerical python’ (Walt et al., 2011)), scipy (‘scientific python’ (Jones et al., 2001)) and matplotlib (‘matlab-like plotting’ (Hunter, 2007)) libraries. These provide efficient access to and processing of arrays in Python, along with the ability to plot results using a syntax very similar to Matlab. In addition to these fundamental libraries, over two thousand other scientific libraries are available at the Python Package Index (http://pypi.python.org), a number of which are specifically focused on remote-sensing.

Remote sensing data processing in Python has been helped by the development of mature libraries of code for performing common tasks. These include the Geospatial Data Abstraction Library (GDAL) which provides functions to load and save almost any remote-sensing image format or vector-based GIS format from a wide variety of programming languages including, of course, Python. In fact, a number of the core GDAL utilities are now implemented using the GDAL Python interface, rather than directly in C++. Use of a library such as GDAL gives a huge benefit to those writing remote-sensing processing code, as it allows them to ignore the details of the individual file formats they are using (whether they are GeoTIFF files, ENVI files or the very complex Erdas IMAGINE format files) and treat all files in exactly the same way, using a very simple API which allows easy loading and saving of chunks of images to and from numpy arrays.

Another set of important Python libraries used by remote-sensing researchers are those originally focused on the GIS community, including libraries for manipulating and processing vector data (such as Shapely) and libraries for dealing with the complex mathematics of projections and co-ordinate systems and the conversion of map references between these (such as the Proj4 library, which has APIs for a huge range of languages). Just as ENVI exposes much of its functionality through an IDL API; ArcGIS, Quantum GIS, GRASS and many other tools expose their functionality through a Python API. Thus, Python can be used very effectively as a ‘glue language’ to join functionality from a wide range of tools together into one coherent processing hierarchy.

A number of remote-sensing researchers have also released very specialist libraries for specific sub-domains within remote-sensing. Examples of this include the Earth Observation Land Data Assimilation System (EOLDAS) developed by Paul Lewis (Lewis et al., 2012), and the Py6S (Wilson, 2012) andPyProSAIL (Wilson, 2013) libraries which I have developed, which provide a modern programmatic interface to two well-known models within the field: the 6S atmospheric radiative transfer model and the ProSAIL vegetation spectral reflectance model.

Releasing these libraries – along with other open-source remote sensing code – has benefited my career, as it has brought me into touch with a wide range of researchers who want to use my code, helped me to develop my technical writing skills and also led to the publication of a journal paper. More importantly than any of these though, developing the Py6S library has given me exactly the tool that I need to do my research. I don’t have to work around the features (and bugs!) of another tool – I can implement the functions I need, focused on the way I want to use them, and then use the tool to help me do science. Of course, Py6S and PyProSAIL themselves rely heavily on a number of other Python scientific libraries – some generic, some remote-sensing focused – which have been realised by other researchers.

The Py6S and PyProSAIL packages demonstrate another attractive use for Python: wrapping legacy code in a modern interface. This may not seem important to some researchers – but many people struggle to use and understand models written in old languages which cannot cope with many modern data input and output formats. Python has been used very successfully in a number of situations to wrap these legacy codes and provide a ‘new lease of life’ for tried-and-tested codes from the 1980s and 1990s.

3.2Teaching programming in remote sensing

I came into the field of remote sensing having already worked as a programmer, so it was natural for me to use programming to solve many of the problems within the field. However, many students do not have this prior experience, so rely on courses within their undergraduate or MSc programmes to introduce them both to the benefits of programming within the field, and the technical knowledge they need to actually do it. In my experience teaching on these courses, the motivation is just as difficult as the technical teaching – students need to be shown why it is worth their while to learn a complex and scary new skill. Furthermore, students need to be taught how to program in an effective, reproducible manner, rather than just the technical details of syntax.

I believe that my ability to program has significantly benefited me as a remote-sensing researcher, and I am convinced that more students should be taught this essential skill. It is my view that programming must be taught far more broadly at university: just as students in disciplines from Anthropology to Zoology get taught statistics and mathematics in their undergraduate and masters-level courses, they should be taught programming too. This is particularly important in a subject such as remote sensing where programming can provide researchers with a huge boost in their effectiveness and efficiency. Unfortunately, where teaching is done, it is often similar to department-led statistics and mathematics courses: that is, not very good. Outsourcing these courses to the Computer Science department is not a good solution either – remote sensing students do not need a CS1-style computer science course, they need a course specifically focused on programming within remote sensing.

In terms of remote sensing teaching, I think it is essential that a programming course (preferably taught using a modern language like Python) is compulsory at MSc level, and available at an undergraduate level. Programming training and support should also be available to researchers at PhD, Post-Doc and Staff levels, ideally through some sort of drop-in ‘geocomputational expert’ service.

4. Reproducible research in remote sensing

Researchers working in ‘wet labs’ are taught to keep track of exactly what they have done at each step of their research, usually in the form of a lab notebook, thus allowing the research to be reproduced by others in the future. Unfortunately, this seems to be significantly less common when dealing with computational research – which includes most research in remote sensing. This has raised significant questions about the reproducibility of research carried out using ‘computational laboratories’, which leads to serious questions about the robustness of the science carried out by researchers in the field – as reproducibility is a key distinguishing factor of scientific research from quackery (Chalmers, 1999).

Reproducibility is where the automating of processing through programming really shows its importance: it is very hard to document exactly what you did using a GUI tool, but an automated script for doing the same processing can be self-documenting. Similarly, a page of equations describing a new method can leave a lot of important questions unanswered (what happens at the boundaries of the image? how exactly are the statistics calculated?) which will be answered by the code which implements the method.

A number of papers in remote sensing have shown issues with reproducibility. For example, Saleska et al. (2007) published a Science paper stating that the Amazon forest increased in photosynthetic activity during a widespread drought in 2005, and thus suggested that concerns about the response of the Amazon to climate change were overstated. However, after struggling to reproduce this result for a number of years, Samanta et al. (2010) eventually published a paper showing that the exact opposite was true, and that the spurious results were caused by incorrect cloud and aerosol screening in the satellite data used in the original study. If the original paper had provided fully reproducible details on their processing chain – or even better, code to run the entire process – then this error would likely have been caught much earlier, hopefully during the review process.

I have personally encountered problems reproducing other research published in the remote sensing literature – including the SYNTAM method for retrieving Aerosol Optical Depth from MODIS images (Tang et al., 2005), due to the limited details given in the paper, and the lack of available code, which has led to significant wasted time. I should point out, however, that some papers in the field deal with reproducibility very well. Irish et al. (2000) provide all of the details required to fully implement their revised version of the Landsat Automatic Cloud-cover Assessment Algorithm (ACCA), mainly through a series of detailed flowcharts in their paper. The release of their code would have saved me re-implementing it, but at least all of the details were given.

5. Issues and possible solutions

The description of the use of software in remote sensing above has raised a number of issues, which are summarised here:

There is a lack of open-source remote-sensing software comparable to packages such as ENVI or Erdas Imagine. Although there are some tools which fulfil part of this need, they need an acceleration in development in a similar manner to Quantum GIS to bring them to a level where researchers will truly engage. There is also a serious need for an open-source tool for Object-based Image Analysis, as the most-used commercial tool for this (eCognition) is so expensive that it is completely unaffordable for many institutions.

There is a lack of high-quality education in programming skills for remote-sensing students at undergraduate, masters and PhD levels.

Many remote sensing problems are conceptually easy to parallelise (if the operation on each pixel is independent then the entire process can be parallelised very easily), but there are few tools available to allow serial code to be easily parallelised by researchers who are not experienced in high performance computing.

Much of the research in the remote sensing literature is not reproducible. This is a particular problem when the research is developing new methods or algorithms which others will need to build upon. The development of reproducible research practices in other disciplines has not been mirrored in remote sensing, as yet.

Problems 2 & 4 can be solved by improving the education and training of researchers – particularly students – and problems 1 & 3 can be solved by the development of new open-source tools, preferably with the involvement of active remote sensing researchers. All of these solutions, however, rely on programming being taken seriously as a scientific activity within the field, as stated by the Science Code Manifesto (http://sciencecodemanifesto.org).

6. Advice

The first piece of advice I would give any budding remote sensing researcher is learn to program! It isn’t that hard – honestly! – and the ability to express your methods and processes in code will significantly boost your productivity, and therefore your career.

Once you’ve learnt to program, I would have a few more pieces of advice for you:

Script and code things as much as possible, rather than using GUI tools. Graphical User Interfaces are wonderful for exploring data – but by the time you click all of the buttons to do some manual processing for the 20th time (after you’ve got it wrong, lost the data, or just need to run it on another image) you’ll wish you’d coded it.

Don’t re-invent the wheel. If you want to do an unsupervised classification then use a standard algorithm (such as ISODATA or K-means) implemented through a robust and well-tested library (like the ENVI API, or scikit-learn) – there is no point wasting your time implementing an algorithm that other people have already written for you!

Try and get into the habit of documenting your code well – even if you think you’ll be the only person who looks at it, you’ll be surprised what you can forget in six months!

Don’t be afraid to release your code – people won’t laugh at your coding style, they’ll just be thankful that you released anything at all! Try and get into the habit of making research that you publish fully reproducible, and then sharing the code (and the data if you can) that you used to produce the outputs in the paper.