2016-09-30

s/macOS Sierra/r/macOS vista/

I've been using macOS Sierra for ten or eleven days now, and I've rebooted my laptop at least six times because the system was broken.

Two recurrent problems: failure to wake in the morning, and a gradual lockup of Finder followed by transitive failure of other apps.

Failure to wake: I go up to the laptop, hit the keyboard and mouse, nothing happens. Only way to fix: hold down the power and wait for a hard restart.

Back in 1999 I worked on a project with HP's laptop group, where we instrumented a set of colleagues' laptops with a simple data collection app, then collected a few months of data. At the time this was considered "a lot of data". The result was the paper The Secret Life of Notebooks. It showed that people tended to have a limited set of contexts, where context was defined as system setup (power, display, IP address) and application (mail, PPT). People were so predictable in their use models that using an HMM to predict contexts wouldn't have been hard.

I ended up writing a little app which essentially did that: based on IP address and the full-screen app (PPT, Acroread), it could choose power policy, network proxy options and sound settings (mute in meetings, etc). It was fairly popular amongst colleagues, because it would turn proxy stuff on and off for you, know to turn off display timeouts when giving presentations, and crank up the savings when on the move. When I look at Windows 8+ adaptation to network settings, or OS X's equivalent of that and the "When on battery ...", I see the same ideas. You don't get any HMM on the laptops though; for that you have to pick up an Android phone and look at Google Now, something which really is trying to understand you and your habits. And, because it can read your emails, it can correlate those habits with emailed plans. If it really wanted to be innovative/scary it would work out who you were associated with (family, friends, colleagues, fellow students...) and use their actions to predict yours. Maybe someday.
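For illustration only, here is a minimal sketch of that kind of rule-based context switching. It is not the original app: the `Policy` fields, the `OFFICE_NET` range, the app names and the specific rules are all assumptions made up for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Policy:
    power: str             # hypothetical policy names
    use_proxy: bool        # turn the corporate proxy on or off
    mute: bool             # mute sound (e.g. in meetings)
    display_timeout: bool  # allow the display to time out


# Assumed values: a made-up office address range and presentation apps.
OFFICE_NET = ip_network("10.0.0.0/8")
PRESENTATION_APPS = {"ppt", "acroread"}


def choose_policy(ip: Optional[str],
                  fullscreen_app: Optional[str],
                  on_battery: bool) -> Policy:
    """Pick settings from the current context (network, full-screen app, power)."""
    in_office = ip is not None and ip_address(ip) in OFFICE_NET
    if fullscreen_app in PRESENTATION_APPS:
        # Presenting: keep the display on and stay quiet.
        return Policy("presentation", in_office, True, False)
    if on_battery:
        # On the move: crank up the savings.
        return Policy("max-savings", in_office, False, True)
    return Policy("balanced", in_office, False, True)
```

A call such as `choose_policy("10.1.2.3", "ppt", False)` picks the presentation settings with the proxy on; with no network and on battery you get the power-saving settings with the proxy off.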

User-wise, another interesting feature was how people viewed mail so differently when online vs offline. Offline, you'd see this workflow of outlook-> word-> outlook-> ppt-> outlook-> acroread-> outlook, ... etc, very fast cycles. It seemed uncontrolled window tabbing at first, until you realise it's people going through attachments. Online, people's workflow pulled in IE (it was a long time ago), and you'd get a cycle where IE was the most popular app cycled to from outlook. Email was already so full of links that the notion of reading email "offline" was dying. You could do it, but your workflow would be crippled. And that was 15+ years ago. Nowadays things would only be worse.

There was a second paper, which was internal but also circulated/presented to Microsoft. There I looked at system uptime, and especially this common sequence in the log:

1998-08-23 18:15 event: hibernate
1998-08-24 09:00 event: boot

or

1998-09-01 11:20 event: suspend
1998-09-01 11:30 event: boot

That is: a lot of the time when the laptop thought it was going to sleep, it was really crashing.
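Spotting that pattern mechanically is straightforward. The sketch below assumes one event per line in the "timestamp event: name" form shown above, and that a successful wake is logged as some event other than boot (the name "resume" in the usage note is my assumption, not something taken from the original logs).

```python
from datetime import datetime

SLEEP_EVENTS = {"suspend", "hibernate"}


def parse(line):
    """Split '1998-08-23 18:15 event: hibernate' into (datetime, event name)."""
    stamp, event = line.split(" event: ")
    return datetime.strptime(stamp.strip(), "%Y-%m-%d %H:%M"), event.strip()


def failed_wakes(lines):
    """Return (sleep time, sleep event, boot time) for each sleep followed directly by a boot."""
    events = [parse(line) for line in lines if line.strip()]
    return [
        (t1, e1, t2)
        for (t1, e1), (t2, e2) in zip(events, events[1:])
        if e1 in SLEEP_EVENTS and e2 == "boot"
    ]
```

Feed it the lines of a log and every suspend or hibernate that is immediately followed by a boot comes back as a suspected crash; a suspend followed by a "resume" is left alone.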

My theory was that alongside the official ACPI sleep states S1-S5 there was actually a secret state S6, "the sleep you never awake from". Some more research showed that it was generally on startup that the process failed, and it was strongly correlated with external state changes: networks, power, monitors. It wasn't that the laptop made a mess of suspending, it was that when it came back up it couldn't cope with the changed state.

I don't know if macOS Sierra has that issue; I do know that it has that problem if left attached to an external display overnight. Looking in the system logs, you can see powernap wakeups regularly (that's with all displays off), but come the first user interaction event, where the displays are meant to kick off, they don't come up. This is resulting in system logs not far off those from the '99 experiment.

That's the nightly problem. What's happened 3+ times is the lockup of Finder, with a gradual degradation of other applications as they go near its services.

First Finder goes, and restarting it does nothing.
Then the other apps fail, usually when you go near the filesystem dialogs, or the photo collection.
As with Finder, restarting them does nothing.

If it was my own code, I'd assume a lock is being acquired in the kernel on some filesystem resource and never being released. This is why locks should always have leases. Root cause of that lock/release problem? Who knows. I can't help wondering, though, if it's related to the new iCloud sync feature, as that's the biggest filesystem change. I've also noticed that I usually have a USB stick plugged in; I'm going to go without that to see if it helps.
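To show what I mean by a lease, here is a minimal in-process sketch; it says nothing about how the macOS kernel actually does its locking. The class name, method names and injectable clock are all made up for the example. The point is simply that ownership expires, so a holder which hangs cannot block everyone else forever.

```python
import threading
import time


class LeasedLock:
    """A lock whose ownership expires after lease_seconds unless renewed."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self._lease = lease_seconds
        self._clock = clock            # injectable so tests can fake time
        self._guard = threading.Lock() # protects the fields below
        self._owner = None
        self._expires = 0.0

    def try_acquire(self, owner):
        """Take the lock if it is free or the previous lease has run out."""
        with self._guard:
            now = self._clock()
            if self._owner is None or now >= self._expires:
                self._owner = owner
                self._expires = now + self._lease
                return True
            return False

    def renew(self, owner):
        """Extend the lease; fails if the caller no longer holds a live lease."""
        with self._guard:
            now = self._clock()
            if self._owner == owner and now < self._expires:
                self._expires = now + self._lease
                return True
            return False

    def release(self, owner):
        """Give the lock up; a no-op if somebody else has taken it over."""
        with self._guard:
            if self._owner == owner:
                self._owner = None
                return True
            return False
```

The cost is that a slow-but-alive holder has to renew, and has to check that it still owns the lease before touching the resource. That is the trade-off you make to stop one dead process taking everything else down with it.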

When I get this slow failure, I don't rush to reboot. It takes about 10 minutes to get my dev environment back up and running again: the IDEs, the terminal windows, 2FA signing in to webapps, etc. I really don't want to have to do it. Instead I end up with bits of the UI keeling over, while I stick to the IDE, Chrome and terminals. I had a bit of a problem on Thursday evening when Calendar locked up to the extent that I couldn't get the URLs for some conf calls; I had to use the phone to get the links and type them in.

Anyway, come the evening, after the conf calls and some S3a Scale tests, I kick off a shutdown.

And here a flaw of the OS X UI comes in: it assumes that whatever reason you are trying to do a shutdown for, it is not because Finder has crashed. And it gives any application the right to veto the shutdown. You can't just select "shut down..." on the menu; you have to wait for any apps to block it, stop them and then continue. And even after doing all of that, I come in this morning and find the laptop, fans spinning away, me logged out but some dialog box about keychain access required. This is not shutting down, this is a half-hearted attempt at maybe shutting down sometimes if your OS hasn't got into a mess.

It's notable that Windows has some hard-coded assumptions that a shutdown is caused by the failure of something. It also has, from the world of Windows Server, the concept that the user may not be sitting at the console waiting to click OK on dialogs popped up by apps. Thus it has a harsher workflow.

A WM_QUERYENDSESSION message comes out saying "we'd like to shut down, is that OK?" Apps get the opportunity to veto the session end, but not if it's tagged as a critical shutdown. And if you don't service that event, you are considered dead and don't get a veto.

There is a registry entry, WaitToKillAppTimeout, which you can use to control how long the OS waits for applications to terminate, WaitToKillServiceTimeout for services, and even HungAppTimeout to control how long an app has to respond to an exit request (WM_CLOSE?) before being considered dead and so killed.

See? Microsoft know that things hang, that even services can hang, and that if you want to shut down then you want to shut down, not find out 12 hours later that it had stopped with a dialog box.

In contrast macOS Sierra has implicit assumptions that apps rarely hang, the OS services never deadlock, and that shutting down is a rare activity where you are happy to wait for all the applications to interact with you —even the ones that have stopped responding.

This may have held for OS X, but for macOS all those assumptions are invalid. And that makes shutdown far more painful and unreliable than it need be.

Now if you go low level and do a "man shutdown", you can see that a similar escalation process is built in there:

Upon shutdown, all running processes are sent a SIGTERM followed by a SIGKILL. The SIGKILL will follow the SIGTERM by an intentionally indeterminate period of time. Programs are expected to take only enough time to flush all dirty data and exit. Developers are encouraged to file a bug with the OS vendor, should they encounter an issue with this functionality.
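The same ask-nicely-then-force pattern is easy to apply to your own child processes. This is a minimal sketch for POSIX systems using Python's standard subprocess module; the `stop` helper and its five-second default grace period are my own choices, not anything taken from the man page.

```python
import subprocess


def stop(proc, grace_seconds=5.0):
    """Ask a child process to exit, then force it if it does not.

    proc is a subprocess.Popen. Returns the child's exit status,
    which on POSIX is negative if it was killed by a signal.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()   # sends SIGKILL, which cannot be caught or ignored
        return proc.wait()
```

A process that handles SIGTERM properly gets its chance to flush and exit; one that ignores it or has hung is gone once the grace period is up, with no dialog box involved.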

I think from now on, it'll be a shutdown command from the console.

Anyway, because of all these problems, I do currently regret installing macOS Sierra. It shipped to meet a deadline, rather than because it was ready.

macOS Sierra is not ready for use unless you are prepared to reboot every day, and are aware that the only way to reboot reliably is from the console.