Tag Archives: debugging

After struggling with large FULLTEXT indexes in MySQL, I turned to Solr, which came to the rescue: 16 million records ingested in 20 minutes – wow!

One small gotcha was the security classes, which have obviously moved since the documentation was written (see the fix at the end of the post).

For web apps I live off MySQL, albeit now-a-days often wrapped with my own NoSQLite libraries to do Mongo-style databases over the LAMP stack. I’d also recently had a successful experience using MySQL FULLTEXT indices with a smaller database (tens of thousands of records) for the HCI Book search. So when I wanted to index 16 million book titles with their author names from OpenLibrary, I thought I might as well have a go.

For some MySQL table types, the normal recommendation used to be to insert records without an index and add the index later. However, in the past I have had a very bad experience with this approach as there doesn’t appear to be a way to tell MySQL to go easy with this process – I recall the disk being absolutely thrashed and Fiona having to restart the web server 🙁

Happily, Ernie Souhrada reports that for MyISAM tables incremental inserts with an index are no worse than a bulk insert followed by adding the index. So I went ahead and set off a script adding batches of 10,000 records at a time, with small gaps ‘just in case’. The ‘just in case’ was definitely the case: 16 hours later I’d barely managed a million records and MySQL was getting slower and slower.

I cut my losses, tried an upload without the FULLTEXT index and 20 minutes later, that was fine … but there was no way I dared run that ‘CREATE FULLTEXT’!
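For anyone retracing the steps, the two MySQL approaches look roughly like this (the table, column and file names here are invented for illustration):

CREATE TABLE books (title TEXT, author TEXT) ENGINE=MyISAM;

-- approach 1: create the index first, then insert in batches (what I tried first)
ALTER TABLE books ADD FULLTEXT INDEX ft_title_author (title, author);
LOAD DATA LOCAL INFILE 'chunk_0001.tsv' INTO TABLE books
  FIELDS TERMINATED BY '\t';

-- approach 2: load everything without the index, then build it in one go at the end
-- (this final ALTER is the 'CREATE FULLTEXT' step I did not dare run on 16 million rows)
ALTER TABLE books ADD FULLTEXT INDEX ft_title_author (title, author);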

In my heart I knew that Lucene/Solr was the right way to go. These are designed for search-engine performance, but I dreaded the pain of trying to install and come up to speed with yet another system that might not end up any better.

However, I bit the bullet, and my dread was utterly unfounded. Fiona got the right version of Java running and then within half an hour of downloading Solr I had it up and running with one of the examples. I then tried experimental ingests with small chunks of the data: 1000 records, 10,000 records, 100,000 records, a million records … Solr lapped it up, utterly painless. The only fix I needed was because my tab-separated records had quote characters that needed mangling.

So, a quick split into million record chunks (I couldn’t bring myself to do a single multi-gigabyte POST …but maybe that would have been OK!), set the ingest going and 20 minutes later – hey presto 16 million full text indexed records 🙂 I then realised I’d forgotten to give fieldnames, so the ingest had taken the first record values as a header line. No problems, just clear the database and re-ingest … at 20 minutes for the whole thing, who cares!
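Ingesting a chunk is then a single HTTP POST to Solr’s CSV update handler, something along these lines (the core name, field names and file name here are mine, not from the original; separator=%09 marks the file as tab-separated, and supplying fieldnames with header=false is what stops the first record being swallowed as a header line):

curl 'http://localhost:8983/solr/openlib/update?commit=true&separator=%09&fieldnames=title,author&header=false' \
     -H 'Content-type: application/csv' --data-binary @titles_chunk_01.tsv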

As noted there was one slight gotcha. In the Securing Solr section of the Solr Reference Guide, it explains how to set up the security.json file. This kept failing until I realised Solr could not find the classes solr.BasicAuthPlugin and solr.RuleBasedAuthorizationPlugin (solr.log is your friend!). After a bit of listing of the contents of jars, I found that these are now in org.apache.solr.security. I also found that the JSON parser struggled a little with indents … I think maybe tab characters, but after explicitly selecting the offending whitespace and re-typing it as spaces – yay! – I have a fully secured Solr instance with 16 million book titles – wow 🙂

This is my final security.json file (actual credentials obscured, of course!).
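The general shape is the Basic Auth example from the reference guide, but with the fully-qualified class names; the user name, role and credential hashes below are placeholders rather than the real values:

{
  "authentication": {
    "blockUnknown": true,
    "class": "org.apache.solr.security.BasicAuthPlugin",
    "credentials": {
      "admin_user": "<base64 SHA-256 hash> <base64 salt>"
    }
  },
  "authorization": {
    "class": "org.apache.solr.security.RuleBasedAuthorizationPlugin",
    "permissions": [
      { "name": "security-edit", "role": "admin" }
    ],
    "user-role": {
      "admin_user": "admin"
    }
  }
}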

I followed a link to an article on Forbes’ web site1. After a few moments the computer fan started to spin like a merry-go-round and the page, and the browser in general, became virtually unresponsive.

I copied the url, closed the browser tab (Firefox) and pasted the link into Chrome, as Chrome is often praised for its stability and resilience to badly behaving web pages. After a few moments the same thing happened: roaring fan and, when I peeked at the Activity Monitor, Chrome eating more than a core’s worth of the machine’s CPU.

I dug a little deeper and peeked at the web inspector. Network activity was haywire: hundreds and hundreds of downloads, most of them small, some just a few hundred bytes, others a few KB, but loads of them. I watched, mesmerised. Eventually it began to level off after about 10 minutes, when the total number of downloads was nearing 1700, for about 8MB downloaded in all.

It is clear that the majority of these are ‘beacons’, ‘web bugs’, ‘trackers’: tiny single-pixel images used by various advertising, trend-analysis and web-analytics companies. The early beacons were simple gifs, so would download once and simply tell the company what page you were on, which could then be used to tune future advertising, etc.

However, rather than simply images that download once, clearly many of the current beacons are small scripts that then go on to download larger scripts. The scripts they download then periodically poll back to the server. Not only can they tell their originating server that you visited the page, but also how long you stayed there. The last url on the screenshot above is one of these report backs rather than the initial download; notice it telling the server what the url of the current page is.

Some years ago I recall seeing a graphic showing how many of these beacons common ‘quality’ sites contained – note this is Forbes. I recall several had between one and two hundred on a single page. I’m not sure of the actual count here, as each beacon seems to create very many hits, but certainly enough to create 1700 downloads in 10 minutes. The chief culprits, in terms of volume, seemed to be two companies I’d not heard of before, SimpleReach2 and Realtime3, but I also saw Google, Doubleclick and others.

While I was not surprised that these existed, the sheer volume of activity did shock me, consuming more bandwidth than the original web page – no wonder your data allowance disappears so fast on a mobile!

In addition, the size of the JavaScript downloads suggests that they are doing more than merely reporting “page active”; I’m guessing they track scroll location, mouse movement, hover time … enough to eat a whole core of CPU.

I left the browser window and when I returned, around an hour later, the activity had slowed down, and only a couple of the sites were still actively polling. The total download had climbed another 700KB, so around 10KB/minute – again, think about a mobile data allowance: this is a web page that is just sitting there.

When I peeked at the Activity Monitor, Chrome had three highly active processes, between them consuming two cores’ worth of CPU! Again, all on a web page that is just sitting there. Not only are these web beacons spying on your every move, but they are badly written to boot, consuming vast amounts of CPU when there is nothing happening.

I tried to scroll the page and then, surprise, surprise:

So, I will avoid links to Forbes in future – not because I value my privacy; I already know I am tracked and tracked again (who needed Snowden to tell you that?). I will avoid it because the beacons make the site unusable.

I’m guessing this is partly because the network here on Tiree is slow. It does not take 10 minutes to download 8MB, but the vast number of small requests interacts badly with the network characteristics. However, this is merely exposing what would otherwise be hidden: the vast ratio between the useful web page and the tracking software, and just how badly written the latter is.

Come on Forbes, if you are going to allow spies to pay to use your web site, at least ask them to employ some competent coders.

The page I was after was this one, but I’d guess any news page would be the same. http://www.forbes.com/sites/richardbehar/2014/08/21/the-media-intifada-bad-math-ugly-truths-about-new-york-times-in-israel-hamas-war/[back]

HTML5’s application cache is, in principle, a lovely way of creating web sites that behave pretty much like native apps on mobile devices. However, things, as you can guess, do not always go as smoothly as the press releases and blogs suggest!

Some time I must write at length on various useful lessons, but, for now, just one – the potential for an endless cycle of caches, rather like Jörmungandr, the Norse world serpent, that wraps around the world swallowing its own tail.

My problem started when I had a file (which I will call ‘shared.prob’ below, but was actually ‘place_data.js’), which I had updated on the web server, but which kept showing an old version in Chrome no matter how many times I hit refresh – even after I went to the history settings and asked Chrome to empty its cache.

I eventually got to the bottom of this and it turned out to be this Jörmungandr, cache-eats-cache, problem (browser bug!), but I should start at the beginning …

To make a web site work off-line in HTML5 you simply include a link to an application cache manifest file in the main file’s <html> tag. The browser then pre-loads all of the files mentioned in the manifest to create the application cache (appCache for short). The site is then viewable off-line. If this is combined with off-line storage using the built-in SQLite database, you can have highly functional applications, which can sync to central services using AJAX when connected.

Of course sometimes you have updated files in the site and you would like browsers to pick up the new version. To do this you simply update the files, but then also update the manifest file in some way (often updating a version number or date in a comment). The browser periodically checks the manifest file when it is next connected (or at least some browsers check themselves, for some you need to add Javascript code to do it), and then when it notices the manifest has changed it invalidates the appCache and rechecks all the files mentioned in the manifest, downloading the new versions.
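For concreteness, a minimal sketch of the pieces involved (the file names and the version comment here are illustrative). In the main page:

<html manifest="app.appcache">

In app.appcache itself:

CACHE MANIFEST
# version 2014-03-07-2 – bump this comment to invalidate the appCache
index.html
place_data.js
icons/logo.png

And, for the browsers that don’t check by themselves, a little JavaScript to force the check:

window.applicationCache.update();
window.applicationCache.addEventListener('updateready', function () {
    window.applicationCache.swapCache();   // a new cache has been downloaded – use it
    location.reload();                     // reload so the page actually runs the fresh files
});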

Of course as you work on your site you are likely to end up with different versions of it. Each version has its own main html file and manifest giving a different appCache for each. This is fine, you can update the versions separately, and then invalidate just the one you updated – particularly useful if you want a frozen release version and a development version.

Of course there may be some files, for example icons and images, that are relatively static between versions, so you end up having both manifest files mentioning the same file. This is fine so long as the file never changes, but, if you ever do update that shared file, things get very odd indeed!

I will describe Chrome’s behaviour as it seems particularly ‘aggressive’ at caching, maybe because Google are trying to make their own web apps more efficient.

First you update the shared file (let’s call it shared.prob), then invalidate the two manifest files by updating them.

Next time you visit the site for appCache_1 Chrome notices that manifest_1 has been invalidated, so decides to check whether the files in the manifest need updating. When it gets to shared.prob it is about to go to the web to check it, then notices it is in appCache_2 – so uses that (old version).

Now it has the old version in appCache_1, but thinks it is up-to-date.

Next you visit the site associated with appCache_2; it notices manifest_2 is invalidated, checks the files … and, you guessed it, when it gets to shared.prob, it takes the same old version from appCache_1 🙁 🙁

They seem to keep playing catch like that for ever!

The only way out is to navigate to the pseudo-url ‘chrome://appcache-internals/’, which lets you remove caches entirely … wonderful.

I don’t know whether there is an equivalent to this on the Android browser; it certainly seems to have odd caching behaviour too, but does seem to ‘sort itself out’ after a time! Other browsers seem to have problems like this temporarily, but a few forced refreshes usually sort them out!

For future versions I plan to use some Apache ‘Rewrite’ rules to make it look to the browser as if the shared files are in fact completely different files:

RewriteRule ^version_3/shared/(.*)$ /shared_place/$1 [L]
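# [L] means 'last': once this rule matches, stop processing further rewrite rules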

To be fair, the cache cycle is more of a problem during development than during deployment, but still … so confusing.

Useful sites:

These are some sites I found useful for the application cache, but none sorted everything … and none mentioned Chrome’s infinite cache cycle!

http://appcachefacts.info
Really useful quick reference, but: “FACT: Any changes made to the manifest file will cause the browser to update the application cache.” – don’t you believe it! For some browsers (Chrome, Android) you have to add your own checks in the code (see the “Updating the cache” section in “A Beginner’s Guide …”).

http://manifest-validator.com/
Wonderful on-line manifest file validator checks both syntax and also whether all the referenced files download OK. Of course it cannot tell whether you have included all the files you need to.

Nearly twenty years ago, back when I was in York, one of my student project suggestions was to try to make compiler parsers operate a little more like a human: scanning first for high-level structures like brackets and blocks and only moving on to finer-level features later. If I recall, there were several reasons for this, including connections with ‘dynamic pointers’1, but the most important was to help error reporting, especially in cases of mismatched brackets or a missing ‘;’ at line ends … still a big problem.

Looking back I can see that one MEng student considered it, but in the end didn’t do it, so it lay amongst that great pile of “things to do one day” and discuss occasionally over tea or beer. I recall too looking at grammar-to-grammar parsers … I guess now-a-days I might imagine using XSLT!

Today, 18 years on, while scanning David Unger’s publications I discovered that he actually did this in the Java parser at Sun2. I don’t know if this is actually used in the current Java implementations. Their reasons for looking at the issue were to do with making the parser easier to maintain, so it may be that this is being done under the hood, but the benefits for the Java programmer are not being realised.

While I was originally thinking about programming languages, I have more recently found myself using the general methods in anger when doing data cleaning as often one approaches this in a pipeline fashion, creating elements of structure along the way that are picked up by future parsing/cleaning steps.

To my knowledge there are no general purpose tools for doing this. So, if anyone is looking for a little project, here is my own original project suggestion from 1993 …

Background
When compilers parse a computer program, they usually proceed in a sequential, left-to-right fashion. The computational requirement of limited lookahead means that the syntax of programming languages must usually be close to LL(1) or LR(1). Human readers use a very different strategy. They scan the text for significant features, building up an understanding of the text in a more top down fashion. The human reader thus looks at the syntax at multiple levels and we can think of this as a hierarchical grammar.

Objective
The purpose of this project is to build a parser based more closely on this human parsing strategy. The target language could be Pascal or C (ADA is probably a little complex!). The parser will operate in two or more passes. The first pass would identify the block structure, for example, in C this would be based on matching various brackets and delimiters `{};,()’. This would yield a partially sequential, partially tree-like structure. Mismatched brackets could be detected at this stage, avoiding the normally confusing error messages generated by this common error. Subsequent passes would `parse’ this tree eventually obtaining a standard syntax tree.

Options
Depending on progress, the project can develop in various ways. One option is to use the more human-like parsing to improve error reporting, for example, the first pass could identify the likely sites for where brackets have been missed by analysing the indentation structure of the program. Another option would be to build a YACC-like tool to assist in the production of multi-level parsers.
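A first pass of the kind described in the objective might look something like the following sketch (written in Python rather than as part of a compiler, with the bracket set and tree representation invented for illustration):

OPEN = {'(': ')', '{': '}', '[': ']'}
CLOSE = {c: o for o, c in OPEN.items()}

def bracket_pass(text):
    """Scan only for bracket structure, building a partial tree and
    reporting mismatches with positions before any finer parsing."""
    root = {'bracket': None, 'start': 0, 'children': []}
    stack = [root]
    for pos, ch in enumerate(text):
        if ch in OPEN:
            node = {'bracket': ch, 'start': pos, 'children': []}
            stack[-1]['children'].append(node)
            stack.append(node)
        elif ch in CLOSE:
            node = stack[-1]
            if node['bracket'] != CLOSE[ch]:
                raise SyntaxError("mismatched '%s' at offset %d" % (ch, pos))
            node['end'] = pos
            stack.pop()
        else:
            stack[-1]['children'].append(ch)   # ordinary text stays sequential
    if len(stack) > 1:
        top = stack[-1]
        raise SyntaxError("unclosed '%s' from offset %d" % (top['bracket'], top['start']))
    return root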

Incidentally, “Program comprehension beyond the line” is a fantastic paper both for its results and also methodologically. In the days when eye-tracking was still pretty complex (maybe still now!), they wanted to study program comprehension, so instead of following eye gaze, they forced experimental subjects to physically scroll through code using a single-line browser. [back]

What bit of software do you really need to be reliable? If anything else goes really wrong you have the backup — but if the backup fails you really are lost.

And Mac OS X Time Machine, while it does have a very pretty interface, is inclined to get stuck sometimes.

This is my own story of how it goes wrong … and how to put it right.

… and throughout I’ve dropped in a few lessons for anyone implementing critical system software — maybe the odd Apple engineer is reading

how to tell when things are wrong

Occasionally Time Machine seems to be stuck, but isn’t really. When you first do a backup, or when you haven’t backed up to a particular disk for ages (perhaps because you have been away on a trip), it can spend several hours ‘preparing’. You can tell it is ‘preparing’ because when you open the Time Machine preferences there is the little barber’s pole saying ‘preparing’ 😉

This is when it is running over the disk working out what it needs to back up, and it always seems to be the lengthiest operation – actually copying the files to the backup disk is often quite fast. And yet, for some reason, there is no indication of how far through the ‘preparing’ process it has got.

Lesson 1: make sure you include progress indicators for anything that can take a while, not just the obvious ‘slow’ things.

So, when you see ‘preparing’, just be patient!

However, at least half-a-dozen times over the last year, my Time Machine has got completely stuck. I have seen this happen in three ways:

(i) it is still saying ‘preparing’ after leaving it overnight!

(ii) it starts to transfer to disk, but then gets stuck part way:

(iii) if you look in the Time Machine preferences it says the backup has failed

This last time in fact the first sign was (iii), but it doesn’t actually tell you (if you don’t look) until it has failed for ten days, by which time I was travelling. In the days before Time Machine I always did a manual backup before travelling as I knew that was when things were most likely to go wrong, but now-a-days I have got used to relying on it and forget to check it is working OK … so if you are paranoid about your data, do peek occasionally at Time Machine to check it is still working!

When I got home I told Time Machine to back up to the Time Capsule here rather than my office disk (why can’t it remember that I have two backup disks??). Then (after being very, very patient while it was ‘preparing’ for four hours), I saw it get stuck at step (ii), at 1.4 GB of 4.2 GB. Of course, progress indicators are never very good for very slow operations – when transferring several GB of data there may be several minutes before the bar even moves a pixel … but I was very, very patient and it definitely did not move!

Lesson 2: for very long processes, supplement the progress indicator with some other indication that things are still working – in this case, perhaps the amount transferred in the last minute.

At this point I did the normal things, turn Time Machine On/Off, restart machine a couple of times, etc., but when it persists then you know something is deeply wrong.

so why does it go wrong?

In fact Fiona@lovefibre has found Time Machine flawless for her desktop machine, backing up to exactly the same Time Capsule. I am guessing the problem I have is because I use a laptop, so possible reasons are:

it may go to sleep occasionally, breaking connection to the Time Capsule

maybe the WiFi aerial on a laptop is not as good as the desktop

However, if every laptop failed this often, surely Apple would have fixed it by now. So I am guessing there is an additional factor:

my disk has 196 GB of data, much of it in smaller document files (word docs, code files, etc.), not just a few giant movies.

The software will be designed to withstand a certain amount of external failure, especially when connecting to disks over WiFi as the Time Capsule is designed to do. However, I imagine that there are places in the code where there are race conditions, or critical sections where external failure really makes a difference. If the external connections are reliable and the backup is quite fast, the likelihood of hitting one of the nasty spots in the code is low. However, if you have a lot of data to check and transfer, and the external failures are more frequent, then the likelihood of hitting one increases and things start to go wrong.

I see similar problems with other software, Dreamweaver in particular, which has got better, but still can crash if the Internet connection is poor (see also “Why software need never hang“). What happens is that during testing, the test machines often have minimal data, little software (maybe just the operating system and what is being tested), and operate in perfect situations. In such circumstances these hidden flaws never become apparent.

Lesson 3: make sure your test machine is fully loaded with data and applications, and operates in an unreliable environment, so that testing is realistic

However, this is not like Word crashing and losing your most recent edits to one document. When Time Machine fails it seems to occasionally leave something corrupt in the backup disk so that subsequent attempts to backup also fail. There is no excuse for this, the techniques for dealing with potential disk-writing failures are well established in both databases and low-level disk management. For example, one can save a timestamp file at the end of successful operations so that, when returning to the data, if the timestamp file is not there the software knows something went wrong last time.
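The basic pattern is tiny; a toy sketch (the marker file name and the do_backup callback are invented for illustration):

import os, time

MARKER = 'last_backup_ok'              # hypothetical marker file name

def guarded_backup(do_backup):
    # if the marker is missing, the previous run never finished cleanly,
    # so the backup routine should fall back to a known-good state
    previous_run_failed = not os.path.exists(MARKER)
    if not previous_run_failed:
        os.remove(MARKER)              # clear it before starting new work
    do_backup(recover=previous_run_failed)
    with open(MARKER, 'w') as f:       # re-create the marker only after success
        f.write(str(time.time()))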

Maybe Time Machine is trying to be too clever, picking up where it left off when, for example, connection to the disk is broken. If so it clearly needs some additional mechanism to notice “I’ve tried this several times and it keeps going wrong, maybe I need to back off to the last successful state”. Perhaps not something to worry about in less critical software, but not difficult to get right when it is really needed … as in backups!

Lesson 4: build critical software defensively in layers so that errors in one part do not affect the whole; and if saving to disk ensure there is some sort of atomic transaction

The aim during testing should be what I call “fail-fast programming”: trying to make sure that failures happen during testing, not in real use!

One thing I found particularly disturbing about my most recent Time Machine hang is that when I looked at the system console it had regular spates of “unknown SIGSEGV” several times a minute … in the kernel! If you don’t know UNIX internals, the ‘kernel’ is the heart of the Mac’s operating system, where all the lowest-level work is done and where, if something goes wrong, everything fails. SIGSEGV means that some bit of software is trying to access a memory location that doesn’t exist. In fact, while the error is caught it is not so bad; the greater worry is that if the code is trying to access non-existent memory, it may also corrupt other memory … and the kernel has access to everything – not good.

Please, please Apple if you cannot get Time Machine to work properly, do not let it affect the kernel!

how to put it right

One might hope that, even if Time Machine cannot itself notice that something is wrong, at least there would be an option to say “restart yourself”. One might hope, but there is not. However, you can do it yourself by digging a little into the backup disk.

First problem is to stop the Time Machine backup if it has hung.

In the Time Machine control panel, you can simply slide the OFF-ON button to OFF. The status should change to ‘stopping’ and after a while stop. Then you can restart the machine and try to fix things.

This is the ideal thing to do, but I find that when Time Machine is really hung this rarely works. I do turn it to OFF, but either it never changes to ‘stopping’ and stays ‘preparing’, or it changes to ‘stopping’ but never actually stops. If this happens, a restart from the Apple menu typically never completes either, as Time Machine won’t stop running. Then, always with much trepidation, I reach for the on/off button on the Mac itself :-/

After doing a hard on/off like this, I usually do another restart from the Apple menu … not sure if this is necessary, but just to be on the safe side!

Occasionally I skip to the next step before the hard restart.

Then you can start to fix the problem properly.

Find the backup disk. If it is not obvious in the Finder use the ‘Go’ menu and select “Computer”; it shows all the locally connected disks (or it may simply appear in the left hand favourites pane in each Finder window).

If you skipped the restart stage (or if you just peek now to see what it is like when it hasn’t gone wrong), you will see something like “Backup of Alan Dix’s MacBook Pro” (obviously for you it will not be “Alan Dix’s MacBook Pro”!). This is the Time Machine backup. However, if you have restarted the machine with Time Machine off, you will have to find the actual disk that you chose as your backup disk and on it look for a file called something like “Alan Dix’s MacBook Pro_0039fc56f8a2.sparsebundle”. This is some form of compressed disk image. In older versions of Time Machine there was simply a folder with all the backups in it – I felt much more secure. Now this is a single opaque file and I worry about what happens if one day it gets corrupted :-/

Having found the ‘sparsebundle’, double click it and it will display a little pop-up window that says ‘checking volumes’. I keep meaning to see if this ever stops, but I am not patient enough and press the button that skips this stage; then (after a while) it mounts the disk image and the disk “Backup of Alan Dix’s MacBook Pro” appears.

Double click “Backup of Alan Dix’s MacBook Pro”, look inside, and then inside the folder “Backups.backupdb” you will find loads of dated folders, which are the actual backups of your system that you can browse if you prefer instead of using the Time Machine interface. In addition there may be one file ending “.inProgress”, which is some sort of internal file created while it is in the middle of doing a backup.

Delete the “.inProgress” file.

In addition, I usually delete the last of the dated folders (sort by “Date Modified” to get the last one). However, if you don’t want to lose the last backup you can try just deleting the “inProgress” file and only delete the last dated backup if Time Machine still gets stuck.

Important: only delete the latest of the dated backup folders (e.g. “2010-06-09-225547” in the screen shot above), NOT the entire “Alan Dix’s Macbook Pro” folder. If you do that you lose all your backups!

I recall doing this all with extreme trepidation the first time, but I had got to the point where I couldn’t do backups or access them anyway, so had nothing to lose. Actually it seems pretty OK getting in here and doing this sort of thing; the nice thing about Time Machine is that it uses ordinary folder structures that you can peek around in and see that everything is there, all secure. I am much happier with this than the kind of backup where you only know whether it is working the day you try to restore something! At least half the times I have used such backups over the years I’ve found the backup is in some way corrupt or incomplete. So actually, one up for Time Machine 🙂

Now reboot again (for luck). Turn Time Machine back on in the control panel and wait … a long time … it will start ‘preparing’ as if for the first backup … and several hours later hopefully all will be well.

But do remember to set the power save options not to go to sleep in the middle!

In fact the above has always worked for me except for this last time when, for some reason (maybe I missed something on the way?), it hung again and I had to go through the whole process again. This time I waited until yesterday evening before turning Time Machine back on so that I could leave it to do the long 4 hour ‘preparing’ stage without me doing anything else.

I have just finished reading Schön’s “The Reflective Practitioner“. It is one of those books that you feel you ought to have read years ago, resonating so much with many of my own thoughts and writing about creativity and innovation. However, I found myself slightly at odds with its adversarial dualism between science and practice, though I realise this is partly because it is a book of its time. I will return to this later.

I’m starting to use shortcodes heavily in WordPress1, as we are using it internally on the DEPtH project to coordinate our new TouchIT book. There was a minor bug which meant that HTML tags came out unbalanced (e.g. “<p></div></p>”).

I’ve just been fixing it and posting a patch2. Interestingly, the bug was partly due to the fact that back-references in regular expressions count from the beginning of the regular expression, making it impossible to use them if the expression may be ‘glued’ into a larger one … a lack of referential transparency!
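To see the essence of the issue outside WordPress (the code in question is PHP, but Python’s re module behaves the same way; the patterns here are invented for illustration):

import re

inner = r'(["\']).*?\1'                       # \1 means "whatever the first group matched"
print(bool(re.search(inner, "say 'hi'")))     # True: the quotes pair up

outer = r'(\[\w+\])\s*' + inner               # glue the fragment into a larger pattern ...
print(bool(re.search(outer, "[tag] 'hi'")))   # False: \1 now refers to (\[\w+\]), not the quote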

For anyone having similar problems, full details and patch below (all WP and PHP techie stuff).

Last week I was giving a keynote at the annual workshop PPIG2008 of the Psychology of Programming Interest Group. Before I went I was politely pronouncing this pee-pee-eye-gee … however, when I got there I found the accepted pronunciation was pee-pig … hence the logo!

My own keynote at PPIG2008 was “as we may code: the art (and craft) of computer programming in the 21st century” and was an exploration of the changes in coding since 1968, when Knuth published the first of his books on “the art of computer programming“. On the web site for the talk I’ve made a relatively unstructured list of some of the distinctions I’ve noticed between 20th and 21st century coding (C20 vs. C21); and in my slides I have started to add some more structure. In general we have a move from a more mathematical, analytic, problem-solving approach to something more akin to a search task: finding the right bits to fit together, with a greater need for information management and social skills. Both this characterisation and the list are, of course, gross simplifications, but they seem to capture some of the change of spirit. These changes suggest different cognitive issues to be explored and maybe different personality types involved – as one of the attendees, David Greathead, pointed out, rather like the judging vs. perceiving personality distinction in Myers-Briggs1.

One interesting comment on this was from Marian Petre, who has studied many professional programmers. Her impression, and echoed by others, was that the heavy-hitters were the more experienced programmers who had adapted to newer styles of programming, whereas the younger programmers found it harder to adapt the other way when they hit difficult problems. Another attendee suggested that perhaps I was focused more on application coding and that system coding and system programmers were still operating in the C20 mode.

The social nature of modern coding came out in several papers about agile methods and pair programming. As well as being an important phenomenon in its own right, pair programming gives a level of think-aloud ‘for free’, so maybe this will also cast light on individual coding.

Margaret-Anne Storey gave a fascinating keynote about the use of comments and annotations in code, and again this picks up the social nature of code, as she was studying open-source coding where comments are often for other people in the community, maybe explaining actions or suggesting improvements. She reviewed a lot of material in the area, and I was especially interested in one result showing that novice programmers working with small pieces of code found method comments more useful than class comments. Given my own frequent complaint that code is inadequately documented at the class or higher level, this appeared to disagree with my own impressions. However, in discussion it seemed that this was probably accounted for by differences in context: novice vs. expert programmers, small vs. large code, internal comments vs. external documentation. One of the big problems I find is that the way different classes work together to produce effects is particularly poorly documented. Margaret-Anne described one system her group had worked on2 that allowed you to write a tour of your code, opening windows, highlighting sections, etc.

I sadly missed some of the presentations as I had to go to other meetings (the danger of a conference at your home site!), but I did get to some and was particularly fascinated by the more theoretical/philosophical session including one paper addressing the psychological origins of the notions of objects and another focused on (the dangers of) abstraction.

The latter, presented by Luke Church, critiqued Jeanette Wing‘s 2006 CACM paper on Computational Thinking. This is evidently a ‘big thing’ with loads of funding and hype … but one that I had entirely missed :-/ Basically the idea is to translate the ways that one thinks about computation to problems other than computers – nerds rule OK. The tenets of computational thinking seem to overlap a lot with management thinking, and also reminded me of the way my own HCI community, and parts of the Design (with a capital D) community, in different ways are each trying to say that we/they are the universal discipline … well, if we don’t say it about our own discipline, who will … the physicists have been getting away with it for years 😉

Luke’s (and his co-authors’) argument is that abstraction can be dangerous (although of course it is also powerful). It would be interesting, perhaps, rather than Wing’s paper, to look at this argument alongside Jeff Kramer’s 2007 CACM article “Is abstraction the key to computing?“, which I recall liking because it says computer scientists ought to know more mathematics 🙂 🙂

I also sadly missed some of Adrian Mackenzie‘s closing keynote … although this time not due to competing meetings, but because I had been up since 4:30am reading a PhD thesis and after lunch on a Friday had begun to flag! However, this was no reflection on Adrian’s talk, and the bits I heard were fascinating, looking at the way bio-tech is using the language of software engineering. This sparked a debate relating back to the overuse of abstraction, especially in the case of the genome, where interactions between parts are strong and so the software-component analogy is weak. It also reminded me of yet another relatively recent paper3 on the way computation can be seen in many phenomena and should not be construed solely as a science of computers.

As well as the academic content, it was great to be with the PPIG crowd; they are a small but very welcoming and accepting community – I don’t recall anything but constructive and friendly debate … and next year they have PPIG09 in Limerick – PPIG and Guinness, what could be better!

David has done some really interesting work on the relationship between personality types and different kinds of programming tasks. I’ve seen him present before about debugging and unfortunately had to miss his talk at PPIG on comprehension. Given his work has shown clearly that there are strong correlations between certain personality attributes and coding, it would be good to see more qualitative work investigating the nature of the differences. I’d like to know whether strategies change between personality types: for example, between systematic debugging and more insight-based ‘scan and see it’ bug finding. [back]

Over 20 years ago I wrote “The Myth of the Infinitely Fast Machine“, about the way software developers effectively assume that everything on the machine side of human interaction happens instantly. Often interaction is programmed in a turn-taking style:

1. wait for user action

2. process the event

3. display changes

4. back to step 1

This assumption of instant (or at least infinitely fast) response at step 2 often ignores network delays, disk IO or heavy computation. This tends to work fine on a high-spec development or test machine, with a fast network and clean install of all system software … but when the software hits a real machine, a few years old, untidy system, slow network … things fall to pieces.

So 20 years later (as I described in my post last week) I am sitting watching the spinning rainbow ball as Word struggles to save a document (over an hour now; I think I will need to kill it). To be fair, I think the root ’cause’ of the problem … or at least one problem … may be the printer, as the Canon printer driver has never worked properly on an Intel Mac (maybe there will be a new driver when I upgrade to Leopard?) and perhaps some change in the rest of the system (maybe the Office install) has tipped it over into not working at all.

As far as I can tell Word then decides to ask the printer things in order to set the margins properly when saving the document, and then gets stuck. I found a post on a Microsoft forum about a different print related problem and the ‘helpful’ tech support from MS simply said “not our fault, re-install everything”.

So to recap:

the user asks Word to save – probably the most critical operation in the system – or the system auto-saves, again to ensure safety against crashes; either way, really critical

Word decides it needs information from the printer (although it has been displaying the page to the user using some existing information on page properties).

Word asks for info from the printer driver of the currently selected printer

if the printer doesn’t respond Word hangs and blocks all user interaction

However, the printer driver may be third party, may be connecting to a shared printer hanging off a different network, or – in the case of a laptop – may be for a printer on a network from which the computer is currently disconnected … and any resulting delay is not the fault of the developers of Word??!

The annoying thing is that such ‘hanging’ delays need never happen.

Basically there are four main causes for delays:

1. ordinary computation takes a long time due to it being too complex for the available hardware

2. unbounded internal computation – for example, iterative algorithms

3. waiting for external resources (disk, network, etc.)

4. bugs that lead to the system going crazy (effectively case 2 by accident!)

Type 1 will surface during testing and may require re-design of the interaction, but is simply ‘slow’ rather than ‘hanging’. Typically it leads to things gradually getting slower as the document or data gets larger or more complicated. This requires standard profiling and optimisation.

Type 4 is hard to deal with – bugs do happen. However, the majority of the problems I’m experiencing in Word at the moment are not a failure of this kind as Word does, most of the time, eventually complete without crashing.

Types 2 and 3, especially the latter, should be detected and then dealt with in the design of the user interface.

Some real-time programming languages have ways of automatically working out how long code will take to run in order to be able to assert “this will respond within a 10 ms interrupt cycle”. However, this is hard, even for relatively simple embedded systems; so not practical for complex operating systems or user interfaces.

However, a simpler version of the above is possible. Certain system functions invoke external resources such as the disk, or the network. If any function or method in your own application invokes one of these system functions, then it could potentially hang – and should be documented to say so or return some sort of ‘promise’: “I’ve started to do X, please check back later to see if it is ready”. Of course the methods that call these themselves need to be documented as potentially hanging … and so forth.

If the response to any form of user interaction ends up calling a potentially hanging function, then it is in danger of having a delay of type 3 above. However, so long as this is known, it can be dealt with at the user interface level by spawning a thread to do the work so that some form of progress indicator or at least “Cancel” button can be active – it should never ‘hang’.

This marking of functions as potentially ‘hanging’ could be done by programmers themselves, but equally can be automated as a form of static analysis, simply starting with a known set of hanging system functions and recursively ‘colouring’ functions that call them. This kind of automated checking should be standard practice in any large software project.
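A toy sketch of that ‘colouring’ pass (the call graph and the seed set of hanging primitives are invented for illustration; a real tool would extract the call graph from the source code):

CALL_GRAPH = {
    'save_document':  ['query_printer', 'write_file'],
    'autosave':       ['save_document'],
    'query_printer':  ['network_request'],
    'write_file':     ['disk_write'],
    'redraw_screen':  [],
}
HANGING_PRIMITIVES = {'network_request', 'disk_write'}

def may_hang(call_graph, primitives):
    hanging = set(primitives)
    changed = True
    while changed:                     # propagate until nothing new gets coloured
        changed = False
        for fn, callees in call_graph.items():
            if fn not in hanging and any(c in hanging for c in callees):
                hanging.add(fn)
                changed = True
    return hanging

print(sorted(may_hang(CALL_GRAPH, HANGING_PRIMITIVES) - HANGING_PRIMITIVES))
# -> ['autosave', 'query_printer', 'save_document', 'write_file'] : all need a progress UI or Cancel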

The type 2 hanging is a little more complicated. The ADA programming language has a ‘safe’ subset that only allows loops where the bounds are fixed at compile time. This is probably too restrictive for complex software, but certainly any loop with unknown limits could be flagged. If as part of a code walk through or similar practice it is decided that the loop is ‘safe’ it can be annotated as such, otherwise, just like the case of system calls, the system can propagate the fact that certain functions may have unbounded computation and then the UI adjusted accordingly.

For small bespoke software development I can be forgiving, but for large vendors like Microsoft, Apple or Adobe, there is no excuse for this form of culpable failure.

… but I have a bad feeling that in 20 years time I may be writing again …

Last week’s HCI2007 conference and the Physicality workshop are now all finished (except sorting out the final finances for HCI … but I’ll forget that for now!)

Being part of the organisation of things, you always see so many things that are not as planned (i.e. going wrong), but to the delegates it all seems a well-oiled machine. In this, as in many other domains, the mark of a robust system is not whether or not it fails, but how it copes with failure. This is at the heart of my principles of appropriate intelligence when designing ‘intelligent’ user interfaces, and also of ‘fail fast programming’1 when designing and debugging critical computer systems.

Great to see so many old friends … and meet new people … and afterwards to be able to show Nad2 the glories of the Lake District.
