snim2.org

How I work(ed)

Like most people who spend part of their working life writing code, I’m quite obsessive about the tools I use. Programming requires a complex mix of activities, some of which require detailed thought and creativity, and some of which are dull and repetitive. Every efficiency saving in the repetitive work leaves more time and (more importantly) attention with which to make creative decisions.

Small efficiency savings in the repetitive work add up to something that really does make a surprising difference to your quality of work and stress levels. To a non-programmer, a text editor which automatically inserts a ) every time you write a ( might not seem important, but never having to type that extra character or worry about whether your brackets match has a cumulative effect.

At the end of every year, when I look back, I notice that a lot has changed in how I worked in the previous twelve months, and usually for the better. Back when I started working full time there was no such thing as cloud storage. If you wanted to store files to access both at work and at home, you had to run a server yourself and either copy files over manually or use something like rsync. Saving a list of links to pages you were part way through reading was a particular pain. If you didn’t want to use something like rsync to store a text file with a list of URLs, you could always email them to yourself, but there wasn’t an easy way to share a list of browser bookmarks. Then in 2002 Delicious was released. It was a revelation! Suddenly you could store bookmarks on a public website, where you could see them from any machine, and send them to other people. Then in 2005 Flock came out. Flock was a new style of web browser, for the social web, and it had tools to manage RSS feeds, write blog posts, and so on. Flock was built on the Firefox code base, and used Delicious to manage favourites. Every time you starred a web page to bookmark it, Flock automatically saved your bookmark on Delicious. This way, all your bookmarks were available to you, whether you were using Flock at home or work. In 2008 Google brought out the Chrome browser, and it soon became normal for web browsers to sync bookmarks automatically.

Each of these small changes made life online that little bit simpler and easier, and as the years went on I began to wish that I had written an annual retrospective on how my working life had changed thanks to technology. Now I look back, I’m aware of just how much has changed that I’ve absorbed and forgotten. In that spirit, this post is a summary of the technology I used regularly in the job I have just left, and how I used it.

To make sense of all this, you need some context. My last job was busy, not just because I had a lot of work, but because I had a wide variety of work. This included around eight taught modules a year (two as module leader, a few as second lecturer, the rest teaching in labs), course leaderships for around two hundred and fifty students, between twenty and fifty personal tutees a year, between ten and twenty BSc and MSc project students, union rep duties, membership of various committees, open days, personal research, and so on. In this context, almost any efficiency saving is useful, but particularly in managing email. Below the fold, this post covers:

Research notes and rough ideas

By the end of my time at Wolverhampton I had started writing notes in LaTeX, in the hope that it would be easier to copy notes over to draft scientific papers for editing. I have already written a blog post about this, so I won’t go into more detail here. Suffice to say, I’m still using this system.

In the past, I have only ever seriously tried two other methods of note taking: org-mode (when I still used emacs) and hard-backed notebooks. I still sometimes write notes in Markdown for short-term use, for example if I’m making notes in a presentation or meeting. Writing long-hand notes has an added advantage: the notes are easier to remember. However, I found that I could only ever be quite lazy when writing with a pen; I abbreviated so much that typing up my notes later was very time-consuming.

Managing references

Every academic needs to maintain collections of references to academic literature. This is always fiddly. Each reference you store needs to have enough accurate information with it that another scientist can find the document you are referring to. Every conference and journal has its own style for formatting references. You probably have notes you want to store with each reference, and you probably want to link that reference to a PDF or similar file on a hard drive or similar.

There are a number of tools and file formats for doing this. Because I don’t generally use Microsoft Word (or LibreOffice) to write papers, some common tools are out of reach for me. However, there are a large number of tools which are either available for Linux platforms, or work in a browser. By the time I left Wolverhampton I had started using Paperpile, just as an experiment, and I had ported over several hundred references to my account there (I now have 831 references there). I’m very happy with Paperpile, but if I’m honest I find it difficult to explain exactly why I find it easier than all the other tools I tried, which were:

Of these, Zenodo and Mendeley are the most similar to Paperpile: they all have a browser button that allows you to click and import a paper from a website to the reference manager. Paperpile, like several other tools, attempts to discover bibliographic information from a document or the web, so most of the time you do not have to tell the tool the title, author, etc. of a paper. Some of these tools allow you to store a PDF along with the reference, but Paperpile stores that PDF in your Google Drive, so it is available everywhere. Somehow, these small changes just fit with my workflow, and Paperpile has been the first tool I’ve ever managed to stick with.

Of the rest, Qiqqa is the one tool that stands out as having something unique to offer. Qiqqa is able to show you semantic connections between papers. This allows you to explore topics in the literature, or track the development of a particular author (see pages fourteen onwards in the manual). Many of the other tools commit the cardinal sin of bibliography tools: they ask the user to supply basic information (author, title, date, etc.) instead of trying to discover it from the document or the web. Worst of all, some expect the user to tab through multiple fields in a complicated form, perhaps on the assumption that all GUIs are easier to use than just writing text. For this reason, I bounced around different bibliography tools for years, and always fell back to maintaining BibTeX files, which ended up being quicker to use than a GUI.

Medium term and permanent information

When dealing with a large and varied workload, to-do lists are probably the tool that most people think of as being essential. I used to think that too, but I eventually realised that I needed to capture information that was not immediately useful, but not really long term thinking either. Otherwise I spent too much time trying to retrieve this information when I needed it. Having used Tomboy notes for a while (quite happily) I eventually settled on using Google Keep. By the end of my time at Wolverhampton, I had the following notes in Keep:

My appraisal targets for the academic year, ticked when completed

A full list of current BSc and MSc project students, ticked when the student has passed

A list of personal tutors for courses I led

My full work address (useful for copy-pasting into emails requesting a reference)

A list of the assessments that I was moderating and the colleagues who were moderating my assessments

Notes from each of my modules on what I wanted to improve the next time the module runs

A list of the software packages I would need installed for my students in the following year

In addition, I had some project-specific notes, but these were largely subsumed into my research notes.

To-do lists and Kanban

For some years I kept all to-do lists on paper. I had two systems that I used for keeping to-do lists. Firstly, if I needed to write a long list of items (for example, everything I had to do this week) I wrote the list on a single page, with tick boxes. Secondly, if I needed to write a single item within a larger set of notes (for example, as part of some minutes of a meeting) I wrote ACTION across the page next to the to-do. This worked pretty well, but over the years the number of items I needed to capture increased, and reviewing a paper notebook every day became too time-consuming.

The first tool I stuck with was Remember the Milk. RtM had a lot of features. You could email the service, it had great keyboard short-cuts, labels, priorities, per-project lists: everything you can think of. Eventually, after many happy years as an RtM customer, I migrated away. Why? Partly because having a hundred or so to-do items is miserable. I was beginning to find that although I was capturing notes well, I wasn’t executing them efficiently. Secondly, I reached a point where relatively few of my tasks were coming to me via meetings and paper, and most were related to emails. At that point, I dropped to-do lists altogether, and kept almost everything in my email system. The few tasks that didn’t fit, I tracked in Google Keep.

I’m sure a lot of people would use Trello for to-do lists, and the Kanban system. I’ve got a lot of sympathy with this way of doing things, and I’ve tried Trello for a number of projects. I found it worked best when I was collaborating with students on highly structured projects. When managing to-do lists for myself, the Kanban method seemed like overkill. Moreover, when working on projects that involved some programming, Trello boards don’t necessarily match with issue lists or other bug reporting tools, and linking to the relevant issue seems like an unnecessary extra step. There are Kanban tools which sit on top of issue trackers, such as HuBoard for GitHub, but I’ve never been in a project with enough contributors to make this work for me.

Managing email

These days, managing email seems to be a lifetime’s work. It seems to swallow everything it touches, and grows exponentially. Years ago, I used to keep all emails in my inbox, but eventually I switched to something like an Inbox Zero policy, because the volume of email I had to manage meant that organising it was a necessity. So, every email coming in to me gets either deleted, sent to SPAM or labelled and archived.

I label and filter a lot of email automatically, but I have several labels that can only be applied by hand. For these, I take inspiration from Active Inbox. Even if you don’t actually use Active Inbox, the discipline the tool imposes is a really interesting guide. The labels I used (and still use) work like this:

Each project I am involved with (e.g. a module I am teaching on, an event I help to organise, a committee I sit on, etc.) gets a project label. These start with P-.

Any email labelled !ACTION requires some action from me. When I have done whatever is required, I remove the !ACTION label.

Any email labelled !WAITINGON needs some action from me that I cannot progress until someone else has replied.

The !MAYBE label means that an email might be worth acting on, but this isn’t something I have a commitment to do.

Using this system, I know exactly how many tasks I have to complete, just by counting the emails with the relevant label (although I don’t necessarily know when the tasks need to be finished). By the end of my last job, I had over one hundred !ACTION emails at any given time during teaching semesters. At the end of each semester, once marking and boards had been completed, I could usually get that down to between thirty and fifty. Using the system above, I almost never lost track of a task, even if I couldn’t complete tasks on time.

The same desktop everywhere

One minor annoyance when moving between different machines is having different settings and preferences on each one. Whilst working at Wolverhampton I spent about two hours a day commuting on public transport. I would regularly use a desktop at work, a laptop on the train, and a desktop at home in the same day. Apart from moving files around, it was helpful to me to ensure that every tool I used was set up in the same way. To do this, I stored all my configuration files in a folder on Dropbox. To set up a new machine, I just deleted the default configuration files for each tool I used, then created symbolic links in $HOME pointing to the files in the Dropbox folder.
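That set-up step could be automated along these lines. This is only a minimal sketch in Python: the function name link_dotfiles and the idea of passing in both folders are assumptions for illustration, not the exact procedure described above.

```python
from pathlib import Path


def link_dotfiles(source_dir, home):
    """Replace each config file in `home` with a symlink into `source_dir`."""
    created = []
    for source in sorted(Path(source_dir).iterdir()):
        target = Path(home) / source.name
        if target.is_symlink() or target.is_file():
            # Remove the machine's default file (or a stale link) first.
            target.unlink()
        elif target.exists():
            # Leave real directories alone rather than deleting them.
            continue
        target.symlink_to(source)
        created.append(target)
    return created
```

On a new machine this might be called as link_dotfiles(Path.home() / "Dropbox" / "dotfiles", Path.home()), where the "dotfiles" folder name is hypothetical.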

Coding tools

All this talk of productivity, and nothing about programming. Seems strange, doesn’t it? Well, in my last job I rarely had a meaningful choice about which languages and tools I could use. A large proportion of the code I wrote was for teaching purposes, and I had a strict policy that I would only ever use the same tools that my students used. That meant Eclipse for Java, Gedit for Scala and C (Linux), Visual Studio for C (Windows), Notepad++ for PHP (Windows), and Jupyter notebooks for Python. proce55ing and App Inventor came with their own environments. We used SVN for version control, and the plan was to move to GitLab as soon as possible. Whenever possible we used the standard unit testing and build tools for whichever language we were using. The exception to this was C, which doesn’t quite seem to have a standard unit testing tool, but I settled arbitrarily on CUnit.

Other tools and languages came and went. For personal work, I used GitHub (after trying Google Code and BitBucket) and emacs or Sublime, and whichever language was appropriate for the job. I wish I could get more excited about these tools. They all have pros and cons, but the truth is that modern programming tools are converging on standard patterns.

Call for papers: 2nd Workshop on Programming Language Evolution

Programming languages tend to evolve in response to user needs, hardware advances, and research developments. Language evolution artefacts may include new compilers and interpreters or new language standards. Evolving programming languages is however challenging at various levels. Firstly, the impact on developers can be negative. For example, if two language versions are incompatible (e.g., Python 2 and 3) developers must choose to either co-evolve their codebase (which may be costly) or reject the new language version (which may have support implications). Secondly, evaluating a proposed language change is difficult; language designers often lack the infrastructure to assess the change. This may lead to older features remaining in future language versions to maintain backward compatibility, increasing the language’s complexity (e.g., FORTRAN 77 to Fortran 90). Thirdly, new language features may interact badly with existing features, leading to unforeseen bugs and ambiguities (e.g., the addition of Java generics). This workshop brings together researchers and developers interested in programming language evolution, to share new ideas and insights, to discuss challenges and solutions, and to advance programming language design.

Topics include (but are not limited to):

Programming language and software co-evolution

Empirical studies and evidence-driven evolution

Language-version integration and interoperation

Historical retrospectives and experience reports

Tools and IDE support for source-code mining and refactoring/rejuvenation

Gradual feature introductions (e.g., optional type systems)

We are accepting two kinds of submission:

Full papers (maximum 8 pages, ACM SIGPLAN 2 column, 9pt)

Talk abstracts (may include an ‘extended abstract’, up to 1 page of ACM SIGPLAN format).

Important dates:

Submission: Thu April 2nd 2015

Notification: Fri May 1 2015

Workshop: Tue 2nd July 2015

Please submit your abstracts/papers via EasyChair. Papers will be subject to full peer review, and talk abstracts will be subject to light peer-review/selection. Accepted submissions will be published in the ACM DL and must adhere to ACM SIGPLAN’s republication policy.

If you have any questions relating to the suitability of a submission please contact the program chairs at ple15@easychair.org.

You might think that this is a fixable problem, and it certainly would be if UEFA asked a Computer Scientist. Apparently there are 76 teams in the UEFA Champions League (I cheated and asked Wikipedia). Each team wears a “home” kit and an “away” kit (I guessed that one). So, how many ways are there of choosing 2 teams from 76, counting “home” and “away” matches as distinct choices? A bit of A-level maths tells me that this is P(76, 2) or 5700.
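For anyone who wants to check the arithmetic, P(76, 2) is just 76 × 75. A quick sketch in Python (math.perm needs Python 3.8 or later):

```python
import math

teams = 76
# Ordered pairs of distinct teams: pick the home side first, then the away side.
fixtures = math.perm(teams, 2)
assert fixtures == teams * (teams - 1)
print(fixtures)  # 5700
```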

5700 is a lot of unique games that might be chosen. Of course, I have simplified this: I haven’t looked into things like group matches, or different rounds in the tournament, so the possible number of different games in the season will be a little different. However, the point is that if you had to design kits for all those teams so that no matter which team played which a colour-blind fan could tell them apart, you’d be designing for a very long time indeed.

Is all that design really necessary though? Or is there maybe some other way to think about the problem that might make it a little easier? In Computer Science this problem is very similar to something we call hashing. When we hash some data we want to store it so that similar data is kept together and cannot be easily confused with other data. A simple example would be voting slips. If we have three political parties, Left Wing, Right Wing, and Raving, we want to put all the ballot papers into three buckets, one for each party (we can ignore the spoiled papers). We don’t care to differentiate between ballot papers, since every vote is equal; we just need to put each one into the right bucket so we can count them. The only important criterion for organising our data is that the Left Wing votes shouldn’t get muddled up with the Right Wing or the Raving votes.

Placing the votes into buckets is simple and makes intuitive sense. Is there a neat way to organise the football shirts like this? Can we find a simple hashing function that will work for team kits? Here is a really simple suggestion: every team has a home kit with some configuration of block colours and emblems. Every team also has an away kit which is striped. It doesn’t matter what the colours are or what writing or graphics is on each shirt. It doesn’t matter whether the stripes are vertical or horizontal or what thickness or colour they are – each team can be easily differentiated by fans, whether colour blind or not.
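Both ideas are small enough to sketch in Python. The function names bucket_ballots and kit_pattern, and the sample ballots, are made up for illustration; this is a toy model of the bucketing idea rather than a real hash table.

```python
from collections import defaultdict


def bucket_ballots(ballots):
    """Group ballot papers by party so each bucket can simply be counted."""
    buckets = defaultdict(list)
    for ballot in ballots:
        buckets[ballot].append(ballot)
    return {party: len(papers) for party, papers in buckets.items()}


def kit_pattern(venue):
    """The suggested kit rule: home kits are block colours, away kits striped."""
    return "block" if venue == "home" else "striped"


ballots = ["Left Wing", "Raving", "Right Wing", "Left Wing"]
print(bucket_ballots(ballots))

# Whichever two teams play, the home and away sides land in different buckets.
assert kit_pattern("home") != kit_pattern("away")
```

The kit "hash" only has two buckets, which is the whole point: it depends on who is at home, not on which of the 76 teams is playing.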

Hmm! Neat, but not terribly useful. What else can we do? A quick browse through the diff man-page shows that the -a command-line switch tells diff to treat a binary file as if it were text. This sounds like a step forward.

As you’d expect with PDF, there is some metadata inside the files that we would expect to differ between PDF files, even if the files have the same content. What we need to do next is to tell diff to ignore this metadata, and we can do that with the -I switch. We might also want to ignore whitespace, which we can do with -w:

$ diff -w -a -I '.*Date.*' -I '/ID.*' report.pdf expected.pdf
$

Just what we wanted! As with all UNIX tools, the command was successful (the files were ‘identical’) so we didn’t get any output. To put that in a unit testing context, we can write that up as a pytest unit test:
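A minimal sketch of such a test, reusing the switches from the diff command above. The helper names diff_command and pdfs_match are assumptions, the test uses small stand-in text files rather than real PDFs, and it assumes a diff that supports -I (GNU diff does).

```python
import subprocess

# Same switches as the command above: ignore whitespace, treat the PDFs as
# text, and skip changes on lines matching the date and ID metadata patterns.
DIFF_OPTIONS = ["-w", "-a", "-I", ".*Date.*", "-I", "/ID.*"]


def diff_command(actual, expected):
    """Build the argument list for diff (no shell, so no quoting is needed)."""
    return ["diff", *DIFF_OPTIONS, actual, expected]


def pdfs_match(actual, expected):
    """Return True when diff exits with status 0, i.e. reports no differences."""
    result = subprocess.run(diff_command(actual, expected), capture_output=True)
    return result.returncode == 0


def test_metadata_differences_are_ignored(tmp_path):
    # Stand-in files, not real PDFs: they differ only on a date line.
    report = tmp_path / "report.pdf"
    expected = tmp_path / "expected.pdf"
    report.write_text("/CreationDate (D:20140101)\nHello\n")
    expected.write_text("/CreationDate (D:20140202)\nHello\n")
    assert pdfs_match(str(report), str(expected))
```

In a real project the two paths would point at the generated report and a known-good copy kept alongside the tests.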

New features I would love to see in writeLaTeX

writeLaTeX is my new favourite thing. If you haven’t heard of it, writeLaTeX is an online service for writing collaborative LaTeX documents. Think of it like Google Docs for scientists and people who like to typeset very beautiful documents.

Why does this matter? Well, it solves a whole bunch of simple problems for me. I can move between different machines at home and work and keep the same environment. This is more difficult than just auto-syncing my own documents via Dropbox or similar. It also means I need the same LaTeX environment, whether I am working on a locked-down Windows machine at work or a completely open laptop running FOSS software at home. Already that’s something that removes many of my document writing headaches.

More than providing just a synchronisation service, I can collaborate with colleagues in real-time, so I never need to worry about using the “latest” version of any document — even if my colleagues don’t use versioning software like Git or Mercurial. Beyond that, writeLaTeX automatically compiles my projects in the background, so I can always see a nearly up to date version of the resulting PDF. My favourite way of editing on writeLaTeX is to have a full “editor” window in my main monitor (straight ahead of me) and a full “pdf render” window on another monitor (off to the side). It’s super-convenient and allows me to concentrate without feeling interrupted by compiler warnings or errors when I’m half way through a complicated edit.

I could go on and on, but you get the point – writeLaTeX is a very, very neat way to typeset beautiful documents.

Of course, writeLaTeX is not the only start-up in this space. Authorea and ShareLaTeX make similar offerings, and both have different and interesting strengths. It happens that when I needed a service like this, writeLaTeX was the app that had all the built-in style and class files I needed, and the right combination of features, for me. In fact, the pre-installed TeX packages are exactly what you get from installing all of TeXLive on Ubuntu — so writeLaTeX essentially mirrors my own Linux set-up, minus my dodgy Makefiles. That said, I’m very excited that a few competitors are working on these problems. This tells me that online, collaborative LaTeX services have a serious long-term future, and that should benefit users of all these different services.

Once you start using a new shiny toy, there’s always the sense that this is *so* awesome, I wish it could do X… So this is my current wish list for writeLaTeX. This is no criticism of the awesome service, but if you happen to have a few million dollars lying around, please pay the company to implement the below…

Auto-sync with GitHub, BitBucket and similar

It’s really convenient to have all my writeLaTeX projects together on a writeLaTeX project page, but it also breaks the structure of my projects and documents and imposes a second, different structure.

This is what I call the expression problem of scientific projects (Computer Scientists will get the joke) – you can either organise your documents and code around each project you take part in (Option 1), or you can organise your documents around their type (Option 2). Either choice is good, and just a matter of personal taste, but it makes a big difference to your personal workflow and how quickly you can find information and track the progress of your projects. Like many things, consistency is the key principle here.

But what happens when you start to use services like writeLaTeX? Your whole workflow gets a lot more complex. You might have all of your projects synced to a service like GitHub, or not, but now your papers and talks are on writeLaTeX and can’t be “checked out”. Your software might well be on GitHub or similar. You might well be sending your figures and data off to FigShare. It is suddenly more difficult to keep everything together and it isn’t immediately clear how much progress you have made with each part of the project.

In my view the answer to this problem has to come in two parts. Firstly a way to expose writeLaTeX projects as git repositories so that they can be incorporated as git submodules inside an existing GitHub project (other SCMs and hosting companies are available). This means that it doesn’t matter whether you choose Option 1 or Option 2 above to structure your project files. writeLaTeX could then issue pull requests on GitHub when you update your documents to “send” your updates to GitHub. Secondly, existing CI services such as Travis can be configured to send documents off to FigShare once a tagged release of a paper has been created. This costs a little time to set up, but it is an automated workflow that can be reused over different projects, so that small set-up cost is nicely amortized.

Linting

A linter is a tool that checks code for errors before it is compiled. There are a number of these for LaTeX (the one I currently use is chkTeX), and it would be useful to have them run automatically during the background build-compile-render cycle that writeLaTeX already runs.

If you are not using writeLaTeX, one option here is to use a continuous integration tool to run the lint for you, together with your normal build cycle. For example, this:
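One way a CI job could do this is with a small Python script that runs chktex over every .tex file in the repository. This is a sketch only: the function names are made up, and it assumes that chktex is on the PATH and that a non-zero exit status means it found something to complain about, which may depend on the chktex version.

```python
import subprocess
from pathlib import Path


def find_tex_files(root):
    """Return every .tex file under `root`, sorted for stable output."""
    return sorted(Path(root).rglob("*.tex"))


def lint(root="."):
    """Run chktex on each file; return how many files it complained about."""
    failures = 0
    for tex_file in find_tex_files(root):
        # Assumption: a non-zero exit status means warnings or errors.
        result = subprocess.run(["chktex", str(tex_file)])
        if result.returncode != 0:
            failures += 1
    return failures
```

The CI step would then fail the build whenever lint() returns anything other than zero, for example by passing that result to sys.exit.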

A way to copy and share files between different projects

There are a few jobs that need to be done for any paper, but are time-consuming busy work that ideally would be minimised. One of these is producing and curating long lists of references to prior art, usually in BibTeX. Another is pulling in tables and figures (usually to do with prior art or experimental apparatus) that can be used in different papers. An obvious example is a BibTeX file containing the author’s own papers. You might have a file called something like mypapers.bib which you certainly need in your own CV, but then you also need in pretty much all your papers and several talks. What happens when you update this file for your CV project? mypapers.bib isn’t shared between different projects, so you also need to update it in all your other projects. That might not be so bad when you are just adding a newly published paper to your list, but if you find a typo in your old papers it’s a real pain. The same is true for curated lists of papers in the area you work in and all sorts of other files.

It would be nice to find some clever way to resolve this, but what if you also have all your files nicely structured and curated using either Option 1 or Option 2 above? Maybe a neat thing to do would be to have some “dummy” projects which only contain common files, such as BibTeX files (and don’t get compiled with pdflatex or similar), then use something like Git submodules to “import” the dummy projects into “real” ones that do compile documents.

More help with BibTeX

If there’s one huge and pointless sink of valuable time it’s curating long lists of BibTeX references. In recent years a number of services have started to make this easier — Bibsonomy and Google Scholar being two very handy services — but there is still much that has to be done manually. A neat way to search for a citation and pull it into a BibTeX file from within writeLaTeX would be really, really cool.

Some crazy form of document review

Open document review has started to become common, at least for books. A great example of this is Real World OCaml where you can log in with a GitHub account and comment on any paragraph of the book. Comments then become issue tickets in a GitHub repository and the authors can resolve each comment (I notice Real World OCaml has logged an impressive 2457 closed tickets). This is a really neat solution to document review and would be a huge bonus for anyone writing in LaTeX.

Europython 2014 talk on message passing concurrency and Python

Research diaries and lab notes

The idea of keeping a diary fills me with dread. It conjures up distant memories of receiving leather-bound paper diaries from well-meaning relatives at Christmas and the crushing obligation to write something, anything every single day, when actually nothing very interesting was going on. The obligation to do something every day is a sure-fire killer of motivation for me. So, as you can imagine, I have never been keen on keeping a regular diary of research notes and results. Not that I haven’t tried. I have a paper notebook that I use to keep track of discussions and obligations from meetings and at various times I’ve tried to use that as a discipline for writing down ideas and notes from my research work. Somehow though, it never stuck.

That is, it never stuck until I read this blog post by Mikhail Klassen on the writeLaTeX blog. Mikhail points out that having a digital diary has some compelling advantages. It allows you to keep track of intermediate results and ideas, links to software repositories and BibTeX citations. This means that next time you need to quickly put together a presentation or poster, or you are starting to write a paper, you can pull figures, citations and text directly from your diary. This is especially useful if a lot of your writing has equations and citations that are time-consuming to keep track of. So, keeping a diary means that a lot of the time-consuming tasks involved at the start of writing a paper or presentation just disappear – those costs are amortized with the costs of keeping the diary. This has enormous appeal to me. The time I get for research is not large, and anything I can do to make my work more efficient makes the process a lot less stressful.

So, having looked carefully at Mikhail’s template I was really impressed, but I wanted to tweak a few things. In particular I changed the layout of the whole diary and based my version on the excellent tufte-latex class which is inspired by the work of Edward Tufte. I also added a couple of new sections at the top of the diary – Projects and Collaborations and Someday / Maybe. Projects and Collaborations is there to help keep track of ongoing commitments, and as a reminder that those projects need to be regularly progressed or abandoned. Someday / Maybe is there to keep track of vague ideas that sound good but you aren’t yet committed to acting on. I find it useful to have a list of these, as they can easily get forgotten, and many good ideas which aren’t quite ready for action can be used as student projects or re-purposed. Other ideas can sit around for a long time, but suddenly become useful when a new collaboration comes about, or you find some scientific result or new technology which makes a previously very difficult idea tractable.

Lastly, like Mikhail, my template and my own notes are on writeLaTeX, which is a cloud platform for writing LaTeX documents. writeLaTeX (and its cousins ShareLaTeX and Authorea) have some great features, like collaborative real-time document editing, auto-compilation so that you can see a current version of the PDF of your document as you type, a wealth of templates and a friendly near-WYSIWYG editor. writeLaTeX also has a limited sync-with-Dropbox feature for offline work. All of this makes diary entries really simple to write. I just have a writeLaTeX window open in my browser all day and I can write updates and upload new documents as I go along.

Oh, and because I have a pathological aversion to keeping a diary, I call mine “Lab Notes”. Much friendlier!

Mining Python Software

Automate, automate, automate

I’ve recently been working on a new Python project, which started off as a bit of an experiment at the recent PyPy London Sprint. Working on a brand new repository is always nice, a blank slate and a chance to write some really elegant code, without all the crud of a legacy project.

In this case, the infrastructure for the project is pretty involved. I was using the pytest unit testing framework and using the rpython toolkit from pypy, both for the first time.

That led to an interesting situation. When I run the unit tests, I want to use the CPython interpreter. This means I can use all the standard library modules that I know well, and can test the basic algorithms I’m writing. When I want to “translate” my code into a binary executable, I use pypy and some of its rlib replacements for the Python standard library modules. When I get a runtime error in the translation, I need to know whether that is related to my use of the rlib libraries or whether my code is just plain wrong, and using CPython helps me to do that.

The problem is that I have to keep switching between different standard libraries and interpreters. Somewhere in my code there is a switch for this:

DEBUG = True

In testing that switch should be True and in production it should be False, but changing that line manually is a real pain, so I need some scripts to catch when I’ve set the DEBUG flag to the wrong mode.
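The core of such a script can be quite small. Here is a rough sketch in Python which works on the text of a source file; the function names are hypothetical, and it assumes the flag is written as a plain top-level assignment like the one above.

```python
import re

# Matches a top-level assignment such as "DEBUG = True".
DEBUG_LINE = re.compile(r"^DEBUG[ \t]*=[ \t]*(True|False)[ \t]*$", re.MULTILINE)


def read_debug_flag(source):
    """Return the value assigned to DEBUG in `source`, or None if absent."""
    match = DEBUG_LINE.search(source)
    return None if match is None else match.group(1) == "True"


def set_debug_flag(source, value):
    """Return `source` with the first DEBUG assignment rewritten to `value`."""
    return DEBUG_LINE.sub(f"DEBUG = {bool(value)}", source, count=1)
```

A test runner would read the module, check read_debug_flag against the mode it is about to run in, and write back the result of set_debug_flag if the two disagree.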

Test automation #3

So, now the flag is tested, set correctly if needs be and the tests are run. But I still have to run the test script! What a waste of typing. So, the next step is simply to call this script from a git pre-commit hook.
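A git pre-commit hook is just an executable file at .git/hooks/pre-commit, and git abandons the commit if it exits with a non-zero status. A minimal sketch of one in Python, where run_tests.sh is a hypothetical name standing in for the test script described above:

```python
#!/usr/bin/env python3
import subprocess
import sys

# Hypothetical name for the test script; substitute the real one.
TEST_COMMAND = ["./run_tests.sh"]


def main(command=TEST_COMMAND):
    """Run the tests and hand their exit status back to git."""
    return subprocess.run(command).returncode

# In the hook file itself, finish with: sys.exit(main())
```

Keeping the command as a parameter makes the hook easy to try out with a harmless command before wiring it up to the real tests.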