As every active user of morph.io knows, we’ve had our fair share of stability issues for some time now. We know what a big impact this can have on your use of morph.io. After all, we’re one of the biggest users ourselves - for our most popular project, PlanningAlerts.

Not only that, but all the fire-fighting puts a big strain on our tiny team (we’re a charity with just 2 full-time staff), especially when we’re busy working on our other major projects. So while we would love to devote heaps of time to the project and get things humming along, we often have to make the hard choice to patch things up and move on for a while.

Recent stability work

The work that was done several months back significantly changed the way the backend scraper run queue works. This has made several aspects of how the server runs much more stable. Disk space and memory blow-outs have been rare or non-existent, which means the site has experienced very few outages. Importantly, it’s also meant we haven’t had to tend to these issues regularly like we had been for the previous 12 months or so.

Unfortunately several new issues have been introduced that have been affecting the running of your scrapers.

Outstanding major issues

You run your scraper. It scrapes stuff. It completes successfully. But there are no changes to the database - WTF?

This has been an issue since some changes made in April. We’ve spent quite a while trying to work out what the problem is, but it’s intermittent and we can’t reproduce it locally. For a while it seemed to happen every few days, then last month it didn’t happen at all. Now it seems to be back.

We know that restarting the queue makes things work again but we still don’t know what the problem is or how to fix it. For the moment this means other issues take priority and we’ll just restart things when we see a problem or you report it.

Scrapers taking ages to run

We’ve had backlogs before but over the last month there have been some really huge ones. Like taking a day or more for your scraper to run. Or it seems queued forever.

Every few days over the last month we’ve had to spend an hour or two manually trying to get things working again when this issue has cropped up. This fire-fighting, and the issue itself, is really affecting our work. We’re going to spend a couple of days working on this from today.

What we have planned

The first thing I want to do is get all our packages and gems up to date. It’s very possible that a regression in a package or gem has contributed to the issues we’re having now.

Next is changing the way used slots are calculated. A big part of the delay in running scrapers, from the user’s perspective, is potentially avoidable: morph.io thinks all the available slots are used because a bunch of runs are having problems (but they’re not actually running!). I anticipate that this will be one of those changes that is not theoretically correct, or the right thing to do long term, but practically it could help massively and at least get us out of some pickle, preserve or other jam-related substance.
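To make the idea concrete, here is a minimal sketch of the difference between the two ways of counting. It is not the actual morph.io code: the `Run` struct, its `container_running` field and the `MAX_SLOTS` value are all made up for illustration.

```ruby
# Hypothetical model of a scraper run; not morph.io's real schema.
Run = Struct.new(:id, :state, :container_running)

# Assumed slot limit, for illustration only.
MAX_SLOTS = 4

# Naive count: every run marked as running takes a slot,
# even if its container is broken and nothing is really running.
def naive_used_slots(runs)
  runs.count { |run| run.state == :running }
end

# Pragmatic count: only runs whose container is actually alive take a slot.
def effective_used_slots(runs)
  runs.count { |run| run.state == :running && run.container_running }
end

# Slots available for new runs under the pragmatic count (never negative).
def free_slots(runs)
  [MAX_SLOTS - effective_used_slots(runs), 0].max
end

example_runs = [
  Run.new(1, :running, true),
  Run.new(2, :running, false), # stuck: marked running, container broken
  Run.new(3, :running, false),
  Run.new(4, :running, false),
  Run.new(5, :queued, false)
]

puts "naive used slots:     #{naive_used_slots(example_runs)}"
puts "effective used slots: #{effective_used_slots(example_runs)}"
puts "free slots:           #{free_slots(example_runs)}"
```

With the naive count the three stuck runs fill every slot and the queued run waits forever. With the pragmatic count it can start straight away, at the cost of ignoring whatever is wrong with the stuck runs.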

And finally, if I have time, I might try and debug the root cause a bit more. I’ve had a small look and I have a very bad feeling that a huge part of the problem is an unresolved Docker issue.

What can we do to help you?

Obviously fixing the root cause of these stability problems would be the best thing to help us all. However it’s very unlikely we have the people available to do that for at least another couple of months.

So, in the meantime, what would help you? The idea of this post is to be as clear and transparent as possible so you know what’s going on and can plan accordingly. If there are other small things we can do then we’re keen to hear any ideas you have.

What can you do to help?

As you know, morph.io is an open source project run to benefit the global civic tech community. Anyone can contribute and we’re always delighted to get contributions. Despite being a tiny team we try hard to give contributions the time they deserve and help get them deployed.

morph.io is an important bit of shared civic infrastructure. We’d love this to extend to the project’s maintenance and support one day too. So if you rely on morph.io and want to contribute in a bigger way then let’s talk.

We’ve had backlogs before but over the last month there have been some really huge ones. Like taking a day or more for your scraper to run. Or it seems queued forever.

This issue is actually a combination of how the queue backend was changed and a bug that’s been affecting run containers.

When I started work on this the queue had over 1,300 runs in it. I wanted to clear that out so we could see a bit better what was going on (and so your scrapers actually run!).

To do that I wrote a little rake task to help clear a backlogged queue given how it now works. I then manually fixed up problematic jobs and after running that for the last day or so the queue is now completely clear.

henare:

And finally, if I have time, I might try and debug the root cause a bit more. I’ve had a small look and I have a very bad feeling that a huge part of the problem is an unresolved Docker issue.

Thanks to the above work manually fixing things up and clearly seeing jobs work through the queue I was able to see a bit more detail about the problem. I’ve made a change that I thought would do one thing but it didn’t. However we haven’t seen the root cause problem recur since I made that change so I’m letting it settle and I’ll keep checking on things. Fingers and toes crossed.

Happily the change I made does seem to have helped avoid the problem of scraper run slots getting forever tied up by scrapers that can never finish because their container is broken.

However yesterday and today I had to manually intervene to fix up small backlogs on the queue (the one today actually had over 300 retries so not that small).

The problem this time has been that Sidekiq processes seem to be getting OOM killed, and when they do all the runs on that process are orphaned, i.e. the container and run are still there but there’s no corresponding Sidekiq job to finish the run and tidy up the container. What I have to do in this case is create a job on the queue for each of those orphaned runs and off they go again.
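The manual fix-up boils down to something like the sketch below. It is plain Ruby with made-up names (`RunRecord`, `QueuedJob`, `orphaned_runs`, `requeue_orphans`) and an in-memory array standing in for the queue, so treat it as an illustration of the idea rather than what actually runs on the server.

```ruby
require "set"

# Hypothetical records; the names are illustrative, not morph.io's schema.
RunRecord = Struct.new(:id, :container_id)
QueuedJob = Struct.new(:run_id)

# A run is orphaned when it still has a container but no job on the
# queue refers to it (e.g. because its worker process was OOM killed).
def orphaned_runs(runs, jobs)
  job_run_ids = jobs.map(&:run_id).to_set
  runs.select { |run| run.container_id && !job_run_ids.include?(run.id) }
end

# Return the queue with one new job added per orphaned run, so a worker
# can pick each run up again, finish it and tidy up its container.
def requeue_orphans(runs, jobs)
  jobs + orphaned_runs(runs, jobs).map { |run| QueuedJob.new(run.id) }
end

runs = [
  RunRecord.new(1, "container-a"), # still has a job
  RunRecord.new(2, "container-b"), # orphaned
  RunRecord.new(3, "container-c"), # orphaned
  RunRecord.new(4, nil)            # no container, nothing to recover
]
jobs = [QueuedJob.new(1)]

puts "orphaned run ids: #{orphaned_runs(runs, jobs).map(&:id).inspect}"
puts "queue after fix:  #{requeue_orphans(runs, jobs).map(&:run_id).inspect}"
```

In a real Sidekiq setup the re-enqueue step would presumably go through the worker class’s `perform_async` rather than appending to an array, and the list of existing jobs would have to come from Sidekiq itself.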