…the makings of a small casserole

Tag Archive: Python

In the spirit of improving my software engineering practices I have been trying to make more use of the Python logging module. In common with many programmers my first instinct when debugging a programming problem is to use print statements (or their local equivalent) to provide an insight into what my program is up to. Obviously, I should be making use of any debugger provided but there is something reassuring about the immediacy and simplicity of print.

A useful evolution of the print statement in Python is the logging module which can be used as a simple print function but it can do so much more: you can configure loggers for different packages and modules whose behaviour can be controlled centrally; you can vary the verbosity of your logging messages. If you decide to switch to logging to a file rather than the terminal this can be achieved too, and you can even post your log messages to a website using HTTPhandler. Obviously logging is about much more than debugging.

I am writing this blog post because, as most of us have discovered, using logging is not quite as straightforward as we were led to believe. In particular you might find yourself in the situation where you feel you have set up your logging yet when you run your code nothing appears in your terminal window. Print doesn’t do this to you!

Loggers are arranged in a hierarchy. Loggers have handlers which are the things that cause a log to generate output to a device. If no log is specified then a default log called the root log is used. A logger has a name and the hierarchy is defined by the dots in the name, all the way “up” to the root logger. Any logger can have a handler attached to it, if no handler is attached then any log message is passed to the parent logger.

A log record has a message (the thing you would have printed) and a “level” which indicates the severity of the message these are specified by integers for which the logging module provides convenient labels. The levels in order of severity are logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL. A log handler will output a message if the level of the message is equal to or more than the level it has been set to. So a handler set to WARNING will show messages at the WARNING, ERROR and CRITICAL levels but not the INFO and DEBUG levels.

The simplest way to use the logging module is to import the library:

import logging

Then carry out some minimal configuration,

logging.basicConfig(level=logging.INFO)

and then put logging.info statements in our code, just as we would have done with print statements:

logging.info("This is a log message that takes a parameter = {}".format(a_parameter_value))

logging.debug, logging.warning, logging.error and logging.critical are used to publish log messages with different levels of severity. These are all convenience methods which remove the need to explicitly give the level as found in the logging.log function:

logging.log(logging.INFO, "This is a log message")

If we are writing a module, or other code that we anticipate others importing and running then we should create a logger using logging.getLogger(__name__) but leave configuring it to the caller. In this instance we use the name of the logger we have created instead of the module level “logging”. So to publish a message we would do:

logger = logging.getLogger(__name__)
logger.info("Hello")

In the module importing this library you would do something like:

import some_library
logging.basicConfig(level=logging.INFO)
# if you wanted to tweak the levels of another logger
logger = logging.getLogger("some other logger")
logger.setLevel(logging.DEBUG)

basicConfig() configures the root logger which is where all messages end up in the absence of any other handler. The behaviour of logging.basicConfig() is downright obstructive at times. The core of the problem is that it can only be invoked once in a session, any future invocations are ignored. Worse than this it can be invoked implicitly. So if for example you do:

import logging
logging.warning("Hello")

You’ll see a message because secretly logging has effectively run logging.basicConfig(level=logging.WARNING) for you (or something similar). This means that if you were to then naively go ahead and run basicConfig yourself:

logging.basicConfig(level=logging.INFO)

You would see no message when you subsequently ran logging.info(“Hello”) because the “second” invocation of logging.basicConfig is ignored.

We can explicitly set the properties of the root logger by doing:

root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)

You can debug issues like this by checking the handlers to a logger. If you do:

import logging
lgr = logging.getLogger()
lgr.handlers

You get the empty list []. Issue a logging.warning() message and you see that a handler has been added to the root logger, lgr.handlers() returns something like [<logging.StreamHandler at 0x44327f0>].

If you want to see a list of all the loggers in the hierarchy then do:

logging.Logger.manager.loggerDict

So there you go, the logging module is great – you should use it instead of print. But beware of the odd behaviour of logging.basicConfig() which I’ve spent most of this post griping about. This is mainly so that I have all my knowledge of logging in one place rather than trying to remember which piece of code I pulled off a particular trick.

I used the logging documentation here, blog posts by Fang (here) and Praveen Gollakota (here) and tab completion in the ipython REPL in the preparation of this post.

Essential SQLAlchemy by Jason Myers and Rick Copeland is a short book about the Python library SQLAlchemy which provides a programming interface to databases using the SQL query language. As with any software library there is ample online material on SQLAlchemy but I’m old-fashioned and like to buy a book.

SQL was one of those (many) things I was never taught as a scientific programmer, so I went off and read a book and blogged about it (rather more extensively than usual). It’s been a useful skill as I’ve moved away from the physical sciences to more data-oriented software development. As it stands I have a good theoretical knowledge of SQL and databases, and can write fairly sophisticated single table queries but my methodology for multi-table operations is stochastic.

I’m looking at using SQLAlchemy because it’s something I feel I should use, and people with far more experience than me recommend using it, or a similar system. Django, the web application framework, has its own ORM.

Essential SQLAlchemy is divided into three sections, on SQLAlchemy Core, SQL Alchemy ORM and Alembic. The first two represent the two main ways SQLAlchemy interacts with databases. The Core model is very much a way of writing SQL queries but with Pythonic syntax. I can see this having pros and cons. On the plus side I’ve seen SQLAlchemy used to write, succinctly, rather complex join queries. In addition, SQLAlchemy Core allows you to build queries conditionally which is possible by using string manipulation on standard queries but requires some careful thought which SQLAlchemy has done for you. SQLAlchemy allows you to abstract away the underlying database so that, in principle, you can switch from SQLite to PostgresQL seamlessly. In practice this is likely to be a bit fraught since different databases support different functionality. This becomes a problem when it becomes a problem. SQLAlchemy gives your Python programme a context for its queries which I can see being invaluable in checking queries for correctness and documenting the database the programme accesses. On the con side: I usually know what SQL query I want to write, so I don’t see great benefit in adding a layer of Python to write that query.

SQLAlchemy Object Relational Mapper (ORM) is a different way of doing things. Rather than explicitly writing SQL-like statements we are invited to create classes which map to the database via SQLAlchemy. This leaves us to think about what we want our classes to do rather than worry too much about the database. This sounds fine in principle but I suspect the experienced SQL-user will know exactly what database schema they want to create.

Both the Core and ORM models allow the use of “reflection” to build the Pythonic structures from a pre-existing datatabase.

The third part of the book is on Alembic, a migrations manager for SQLAlchemy, which is installed separately. This automates the process of upgrading your database to a new schema (or downgrading it). You’d want to do this to preserve customer data in a transactional database storing orders or something like that. Here I learnt that SQLite does not have full ALTER TABLE functionality.

A useful pattern in both this book and in Test-driven Development is to wrap database calls in their own helper functions. This helps in testing but it also means that if you need to switch database backend or the library you are using for access then the impact is relatively low. I’ve gone some way to doing this in my own coding.

The sections on Core and ORM are almost word for word repeats with only small variations to account for the different syntax between the two methods. Although it may have a didactic value this is frustrating in a book already so short.

Reading this book has made me realise that the use I put SQL to is a little unusual. I typically use a database to provide convenient access to a dataset I’m interested in, so I do a one off upload of the data, apply indexes and then query. Once loaded the data doesn’t change. The datasets tend to be single tables with limited numbers of lookups which I typically store outside of the database. Updates or transactions are rare, and if I want a new schema then I typically restart from scratch. SQLite is very good for this application. SQLAlchemy, I think, comes into its own in more transactional, multi-table databases where Alembic is used to manage migrations.

Ultimately, I suspect SQLAlchemy does not make for a whole book by itself, hence the briefness of this one despite much repeated material. Perhaps, “SQL for Python Programmers” would work better, covering SQL in general and SQLAlchemy as a special case.

Test-Driven Development with Python by Harry J.W. Percival is a tutorial rather than a text book and it taught me as much about Django as testing. I should point out that I wilfully fail to do the “follow along with me” thing in tutorial style books.

Test-driven development (TDD) is a methodology that mandates writing tests first before actual code that does stuff. The first tests are for the desired behaviour that will be presented to the user.

I was introduced to software testing very early in my tenure ScraperWiki, now The Sensible Code Company. I was aware of its existence prior to this but didn’t really get the required impetuous to get me started, it didn’t help that I was mostly coding in Matlab which didn’t have a great deal of support for testing at the time. The required impetus at ScraperWiki was pair programming.

Python is different to Matlab, it has an entirely acceptable testing framework built-in. Test-driven Development made me look at this functionality again. So far I’ve been using the nose testing library but there is a note on its home page now saying it will not be developed further. It turns out Python’s unittest has been developing in Python 3 which reduces the need for 3rd party libraries to help in the testing process. Python now includes the Mock library which provides functions to act as “test doubles” prior to the implementation of the real thing. As an aside I learnt there is a whole set of such tests doubles including mocks, but also stubs, fakes and spies.

Test-driven Development is structured as a tutorial to build a simple list management web application which stores the lists of multiple users, and allows them to add new items to the lists. The workflow follows the TDD scheme: to write failing tests first which development then allows to pass. The first tests are functional tests of the whole application made using the Selenium webdriver, which automates a web browser, and allows testing of dynamic, JavaScript pages as well as simple static pages. Beneath these functional tests lie unit tests which test isolated pieces of logic and integrated tests which test logic against data sources and other external systems. Integration tests test against 3rd party services.

The solution is worked through using the Django web framework for Python. I’ve not used it before – I use the rather simpler Flask library. I can see that Django contains more functionality but it is at the cost of more complexity. In places it wasn’t clear whether the book was talking about general testing functionality or some feature of the Django testing functionality. Django includes a range of fancy features alien to the seasoned Flask user. These include its own ORM, user administration systems, and classes to represent web forms.

Test-driven Development has good coverage in terms of the end goal of producing a web application. So not only do we learn about testing elements of the Python web application but also something of testing in JavaScript. (This seems to involve a preferred testing framework for every library). It goes on to talk about some elements of devops, configuring servers using the Fabric library, and also continuous integration testing using Jenkins. These are all described in sufficient detail that I feel I could setup the minimal system to try them out.

Devops still seems to be something of a dark art with a provision of libraries and systems (Chef, Puppet, Ansible, Juju, Salt, etc etc) with no clear, stable frontrunner.

An appendix introduces “behaviour-driven development” which sees sees a test framework which allows the tests to be presented in terms of a domain specific language with (manual) links to the functional tests beneath.

In terms of what I will do differently having read this book. I’m keen to try out some JavaScript testing since my normal development activities involve data analysis and processing using Python but increasingly blingy web interfaces for visualisation and presentation. At the moment these frontends are slightly scary systems which I fear to adjust since they are without tests.

With the proviso above, that I don’t actually follow along, I like the tutorial style. Documentation has its place but ultimately it is good to be guided in what you should do rather than all the things you could possibly do. Test-driven Development introduces the tools and vocabulary you need to work in a test-driven style with the thread of the list management web application tying everything together. Whether it instils in me the strict discipline of always writing tests first remains to be seen.

This next review, of “Effective Computation in Physics” by Anthony Scopatz & Kathryn D. Huff, arose after a brief discussion on twitter with Mike Croucher after my review of “High Performance Python” by Ian Micha Gorelick and Ian Ozsvald. This in the context of introducing students, primarily in the sciences, to programming and software development.

I use the term “software development” deliberately. Scientists have been taught programming (badly, in my view*) for many years. Typically they are given a short course in the first year of their undergraduate training, where they are taught the crude mechanics of a programming language (typically FORTRAN, C, Matlab or Python). They are then left to it, perhaps taking up projects requiring significant coding as final year projects or in PhDs. The thing they have lacked is the wider skillset around programming – what you might call “software development”. The value of this is two-fold – firstly, it is a good training for a scientist to have for careers in science. Secondly, the wider software industry is full of scientists, providing students with a good grounding in this field is no bad thing for their future employability.

The book covers in at least outline all the things a scientist or engineer needs to know about software development. It is inspired by the Software Carpentry and The Hacker Within programmes.

The restriction to physics in the title seems needless to me. The material presented is mostly applicable to any science, and those working in the digital humanities, undertaking programming work. The examples have a physics basis but not to any great depth, and the decorative historical anecdotes are all physics based. Perhaps the only exception to this is the chapter on HDF5 which is a specialised data storage system, some coverage of SQL databases would make a reasonable substitute for a more general course. The chapter on parallel computing could likewise be dropped for a wider audience.

The book is divided into four broad sections. Including in these are chapters on:

Command line operations;

Programming in Python;

Build systems, version control, debugging and testing;

Documentation, publication, collaboration and licensing;

Command line operations are covered in two chunks, firstly in the basic navigation of the file system and files followed by a second chapter on “Regular Expressions” which covers find, grep, sed and awk – at a very basic level.

The introduction to Python is similarly staged with initial chapters covering the fundamentals of the core language, with sufficient detail and explanation that I learnt some new things**. Further chapters introduce core Python libraries for data analysis including NumPy, Pandas and matplotlib.

Beyond these core chapters on Python those on version control, debugging and testing are a welcome addition. Our dearest wish at ScraperWiki, a small software company where I worked until recently, was that new recruits and interns would come with at least some knowledge and habit for using source control (preferably Git). It is also nice to see some wider discussion of GitHub and the culture of Pull Requests and issue tracking. Systematic testing is also a useful skill to have, in fact my experience has been that formal testing is most useful for those most physics-like functions.

The final section covers documentation, publication and licensing. I found the short chapter on licensing rather useful, I’ve been working on some code to analyse LIDAR data and have made it public on GitHub, which helpfully asks which license I would like to use. As it turns out I chose the MIT license and this seems to be the correct one for the application. On publication the authors are Latex evangelists but students can chose to ignore their monomania on this point. Latex has a cult-like following in physics which I’ve never understood. I have written papers in Latex but much prefer Microsoft Word for creating documents, although Google Docs is nice for collaborative work. The view that a source control repository issue tracker might work for collaboration beyond coding is optimistic unless academics have changed radically in the last few years.

I’d say the only thing lacking was any mention of pair programming, although to be fair that is more a teaching method than course material. I found I learnt most when I had a goal of my own to work towards, and I had the opportunity to pair with people with more knowledge than I had. Actually, pairing with someone equally clueless in a particular technology can work pretty well.

There is a degree to which the book, particularly in this section strays into a fantasy of how the authors wish computational physics was undertaken, rather than describing how it is actually undertaken.

To me this is the ideal “Software development for scientists” undergraduate text, it is opinionated in places and I occasionally I found the style grating but nevertheless it covers the right bases.

*I’m happy to say this since I taught programming badly to physics undergraduates some years ago!

**People who know my Python skills will realise this is not an earthshattering claim.

I’m currently between jobs for a couple of weeks, so I have time to play with data.

The Environment Agency (EA) has recently released it’s LIDAR data for England amounting to several terabytes of the stuff. LIDAR is a laser ranging technology which gives you the height profile of the surface under inspection. You can get a feel for the data from this excerpt of central Chester:

The brightness of a pixel shows the height of a feature, so the race course (lower left) appears dark since it is a low flat region close to the River Dee. The CWAC HQ building is tall and appears bright. To the north of the city are a set of three high rise flats, which appear bright. The distinctive cross-shape of the cathedral, with it’s high, bright central tower is also visible. It’s immediately obvious that LIDAR is an excellent tool for picking out the footprint of buildings.

We can use the image above to make a 3D projection view where the brightness of a pixel is mapped to height:

The orientation for this image is the same as that in the first image, the three tower blocks are visible top right, and the CWAC HQ visible lower left.

The images above used the lowest spatial resolution data, each pixel is 2mx2m. The data have released have spatial resolutions 2m down to 25cm for selected areas. Looking at the areas with the high resolution data available it becomes very obvious what the primary uses of the data are: flood and coastal defences.

You can find the LIDAR data here. It’s divided up into several datasets. Surface data gives height information including all objects on the land such as buildings, trees, vehicles and so forth whilst Terrain data is processed to remove these artefacts and show the pristine land surface.

Composite data are data compiled to give maximum coverage by combining data from surveys conducted in different years and at different resolutions whilst Tile data are the underlying raw data collected in different years and different resolutions. The coverage sliders show the coverage of each dataset. The data are for England only.

The images of Chester shown above are an excerpt from a 10kmx10km tile, shown below:

Chester is on the left of this image, above the dark bend of River Dee flood plain. To the right hand side we can see the valley of the River Gowy, and its tributaries – features which are not obvious on the ground or in Google Maps. The large black area is where there is no data, smaller irregular black seem to correlate with water, you might just be able to pick out the line of the Shropshire Union canal cutting through the middle of the image.

I used Chester as an illustration because that’s where I live. I started looking at this data because I was curious, and I’ve spent a happy few days downloading data for lots of different places and playing with it.

It’s great to see data like this being released under permissive conditions. The Environment Agency has been collecting this data for its own purposes, and it’s been available from them commercially for a while – no doubt as a result of a central government edict to maximise revenue from it.

Opening the data like this means the curious can have a rummage, and perhaps others will find a commercial value in it.

I’ve included a few more images below. After them you can see the technical details of how to process these data and make the visualisations for yourself, the code is all in this GitHub repository:

Liverpool Metropolitan Cathedral at 1m resolution

St Paul’s Cathedral

Technical Details

The GitHub repository contains a readme file which describes the code, and provides links to the original data, other useful commentary and the numerous bits of code I borrowed from the internet.

The data start as sets of zipped text file archives, each archive contains the data for a 10kmx10km OS National Grid square – Chester is in the SJ46 cell. An archive contains a maximum of 100 text files, each one containing data for a single 1kmx1km square, the size of this file depends on the resolution of the data. I wrote a Python program to read the data for a 10kmx10km cell and convert it into a PNG format image. This program also calculates the bounding box in latitude and longitude for the cell. The processing program works fine for 2m and 1m resolution data. It works just about for 50cm data but is slow and throws memory errors. For 25cm resolution data it doesn’t yet work.

I made a visualisation using the leaflet.js library which allows you to overlay the PNG images generated above onto OpenStreetMap maps. The opacity of the image can be varied with a slider so that you can match LIDAR features to map features. The registration between the two data sources is pretty good but there are systematic problems which I believe might be due to different mapping projections being used by the Ordnance Survey and OpenStreetMap.

A second visualisation tool uses the three.js library to make an interactive 3D view. The input data are manual crops of approximately 512×512 from the raw PNGs, I did this using Paint .NET but other image editors would work fine. Larger images work but they are smoothed to 512×512 in the rendering. A gotcha here is that the revision number of the three.js library is important – the code for this visualisation leant heavily on previous work by others, and whilst integrating new functionality it was important to use three.js source files from the same revision. This visualisation allows you to manipulate the view with the mouse, it takes while to load up but once loaded it is pretty fast. Trying to upload a subsequent image doesn’t work.

I’m still working on the code, I’d like to be able to process the 25cm data and it would be good to select an area from the map and convert it to 3D view automatically.