Main navigation

Watching a Mac losing its grip: background services are vital

Routine, background activities shouldn’t be run strictly by the clock. Traditional scheduling systems like Unix cron, and the Mac’s LaunchAgents, are fine for brief tasks which don’t put much load on the cores or hog hardware resources. But when the clock says it’s 23 minutes past the hour and that is time to do some heavy lifting, it’s only going to upset the user.

macOS now strives to do better. Most system activities which can be run in the background are now managed by a sophisticated scheduling service which avoids running a Time Machine backup when your laptop is running on battery power, or flooding cores with heavy tasks when you’re trying to get on with your work.

The snag is that, throughout Sierra, there have been problems in that scheduling service, which eventually lead to its collapse. Leave your Mac running long enough, for about a week or more, and those background tasks no longer get dispatched properly. Routine backups become irregular, or stop altogether for long periods, and many of the seventy or eighty other background activities are also forgotten.

A couple of days ago I was able to see this happening in front of my eyes – or, rather, in front of my tools, as I watched what was going wrong using T2M2 and DispatchView. What I saw suggested that Apple may have tried to address this long-standing bug, but it isn’t been fixed yet.

During the late morning, my hourly Time Machine backups seemed to skip a beat. I checked them using T2M2 just before 1300, when five backups had been run over the last six hours. There was clearly no cause for alarm, the traffic light was still green, but I decided to keep an eye on things.

When I ran T2M2 next, at nearly 1430, the light was red, and its report read:Analysis for 2 hours from 2017-05-28 11:23:08 +0000 to 2017-05-28 13:23:08 +0000:
Backing up to /dev/disk3s2: /Volumes/PROMISE PEGASUS/Backups.backupdb
on which there are 2.9 TB available.
Started 1 auto backups, and 0 manual backups; completed 1 backups successfully,
last backup completed successfully 111.0 minutes ago,
backed up 642 files in the last completed backup,
total data for each backup was 175.1 MB.
Times taken for each auto backup were 7.0 minutes,
Created 1 new backups, and deleted 1 old backups,
No error messages found.
Running check on scheduling system (CTS and DAS).
DAS appears to have stopped rescoring activities over the last 90 minutes.
Running CTS check next.
CTS has not been told to run a backup in the last 90 minutes. Scheduling may have stopped.
Consider restarting your Mac.

Before restarting, I explored a little further using DispatchView, which looks at log entries for Duet Activity Scheduler (DAS) and Centralized Task Scheduling (CTS), two key parts of the macOS scheduling and dispatch system.

Normally, DAS maintains a scored list of all its scheduled activities. Every few minutes, or more often, it reassess their scores, and decides whether any need to be run. Those that reach a high enough score, approaching 1.0, it then tells DAS to run. By 1117, that orderly system had ceased working.

Instead of rescoring the activities in its queue in the normal way, DAS reported that it wasEvaluating 77 activities based on triggers
which is its fallback when the normal scoring system has broken down. This works far less evenly, and DAS was deciding to run a lot of activities at a time, then going silent for long periods. At times its messages indicated its distress, referring to actions performed "because conditions are deteriorating". Some activities which it had started were also going awry, and it was reporting that those were ‘CANCELED’.

As a result, CTS was being inundated with DAS instructions to run tasks, then left with nothing to do for a long time. Suddenly at 1127, for example, CTS ran the following tasks in a whole bunch:com.apple.revisiond.requeue_chunking
com.apple.SoftwareUpdate.Activity
com.apple.coreduetd.people
com.apple.backupd-auto
com.apple.parsecd.flusher.background
com.apple.suggestd.persist-stats
com.apple.geod.XPCRemoteLogAdaptor.aid-types_12000
com.apple.calendar.AlarmScan
com.apple.ical.sync.x-coredata://CB4D3C6D-97B3-4840-926F-2A430381E148/CalDAVPrincipal/p6
com.apple.cloudkit.pcs.memorycache.evict
com.apple.photoanalysisd.backgroundanalysis
com.apple.cloudkit.packageGarbageCollection
com.apple.siri.xpc_activity.metrics-sender
com.apple.bird.stage.gc
com.apple.Safari.SafeBrowsing.Update
com.apple.ical.sync.x-coredata://CB4D3C6D-97B3-4840-926F-2A430381E148/CalDAVPrincipal/p4
com.apple.AddressBook.ScheduledSync
com.apple.bird.sync
com.apple.Maps.mapspushd.launchForAnnouncements

Those give you a good idea of the now pervasive use of these scheduling services by a wide range of background services.

So as of 1127 or so, DAS and CTS were still running routine Time Machine backups, but rather erratically.

An hour later, DAS and CTS were on their knees, and failing to cope. From 1232, DAS went silent altogether for fifteen minutes. When it did start writing to the log again, it was erratic at best. Finally at 1252, it again noted that "conditions are deteriorating", then tried to re-assess the activities in its queue:Evaluating 0 activities based on triggers

Backups were still limping along, though. One completed at 1232, and (as usual) CTS then calculated the time window for the next backup as being between 1307 and 1337, and passed that back to DAS to dispatch. But by the time that period came round, DAS and CTS had got preoccupied with a succession of background tasks to support Safari, and Time Machine’s overdue backup remained forgotten until after 1426. By then, scheduling and dispatch had ceased altogether: DAS and CTS had been completely silent for over forty minutes.

When this happens, what I’d really like to be able to do is to restart DAS and CTS. That should enable their normal service to resume, allowing servers to continue running without interruption, for example. But these scheduling and dispatch systems are undocumented, and there are no tools available to work with or control them.

For Time Machine backups, if you can’t restart your Mac, you can revert to a LaunchAgent runningtmutil startbackup -a -r
every hour.

What worries me are all the other background activities for which we rely on DAS and CTS. There are seventy or eighty other features – including software updates (and silent security patches?), iCloud, address book and other synchronisation, metadata updates for Spotlight search, security checks, and more – all of which become erratic, infrequent, and may stop altogether.

Time Machine backups are only the tip of the iceberg. When DAS and CTS go wrong, you really don’t want to leave your Mac running for much longer before restarting, even if it is a server. In fact, because it is a server.

6Comments

This is all very worrying so well done for finding it. I presume that you have reported this as one or more faults to Apple, have you had any response?

I used to think that I was old-fashioned in closing down my machine every night (and hence restarting every morning): now I have a new rationale. For those that can’t do this, I guess it’s the old ‘restart in the early hours and hope no-one notices’ strategy ☹️

I filed a bug report in February. As you know, responses are subject to non-disclosure, so I cannot give any further details. But Apple has known about the problem for over three months, and I provided copious information and log contents.
What happens seems to have changed in 10.12.5: in earlier versions, when DAS and CTS collapsed, they would do almost nothing for hours. It looks as if 10.12.5 now has a ‘get you home service’, which limps along flickering into life and running a batch of activities every now and again.
For anyone running a server, it’s still nothing short of disastrous.
Howard.

Reflecting on this, I have had several incidences of late of failure to synchronise my calendar from my Mac to my other (iOS) devices. Is there reason to think that this could be related? I had written them off as part of Apple’s general malaise when it comes to network services.

I’ve been following your investigation with some interest, as I was tasked with setting up a backup of valuable company data and my first approach was using TM but I quickly ran into the problem of scheduled backups just being skipped after a few days of running at best (and despite changes to the data of course). I have since set up a different method of backing up which works (and only saves the data in question, as opposed to the whole enchilada), but of course a time machine would still be nice to have as it would make recovery much quicker in the event of catastrophic hardware failure. Do you have any insight already if High Sierra possibly brings salvation in regards to this issue?

No, I’m afraid that I don’t.
Apple has had the bug reports since February. If it does not fix this issue shortly, it can only indicate that it is not interested in doing so.
Although TM is efficient in backing up files, if you are backing up a single, large database file then it does not offer any efficiencies. High Sierra’s snapshot feature might prove to be an efficient alternative, but at least in the initial release of High Sierra, it looks like TM will continue to be the preferred automatic backup system for regular (as opposed to Mobile) backups.
Mobile backups are likely to switch to using APFS snapshots in the initial release, but are not useful for you, I suspect.
Howard.