Actually, this doesn't seem to be the issue. There's a race present but it's pretty narrow since we were already having the AsyncReserver cancel the callback in Trimming::exit(). Trying to reproduce with the trim sleep...

I'm not sure these are actually the same bug. Sage's involves snaptrimming, but the others have no mention of it in the OSD log at all. And I've got another log here with the ref count crash where the PG in question is actually in snaptrim (not snaptrim_wait or WaitTrimTimer).

I'm starting to think the leak is in peering now. All of them (except Casey's top one, which I didn't check and is about the peering_queue) involve PGs with a min_last_complete_on_disk of 0'0, i.e., one of the OSDs was being backfilled. Except they're in active+clean, so I'm not sure how to resolve that?

So, we identified today that it looks like the op Backoff code may have an issue with holding PGRefs past when we expect them to die, although it's not a certain thing. Anyway, the patch is very small.

More generally though, we shut down several different timers/AsyncReservers/etc in OSDService::shutdown. This happens after the pg refcount check, and lots of these are given PGRefs via input Context objects. Many of them are cleaned up, but a few of them seem broken to me at first glance (and, in general, it's really hard to validate this, or to realize when writing new code that they need to get cleaned up separately). These include reserver_finisher, objecter_finisher, recovery_request_timer, and snap_sleep_timer (though this one at least should get cleaned up via its state exit() function). It's not clear to me if we can maybe just skip doing the PGRef counting assert until after we do OSDService shutdown? I don't think we can shut down these utility objects earlier since they can be invoked by various other things.
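To make the ordering concern concrete, here is a minimal sketch of it. It is not Ceph code: FakePG, FakePGRef, FakeTimer and the helper functions are all hypothetical stand-ins, and std::shared_ptr is used in place of the real PGRef type. It only shows that a queued callback keeps a reference alive until the timer that owns it is shut down, so a refcount check done before that shutdown sees an extra ref.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PG; std::shared_ptr stands in for PGRef here.
struct FakePG {};
using FakePGRef = std::shared_ptr<FakePG>;

// Hypothetical stand-in for a timer/finisher that owns queued callbacks.
class FakeTimer {
  std::vector<std::function<void()>> pending_;
public:
  void add_event(std::function<void()> cb) { pending_.push_back(std::move(cb)); }
  // Dropping the queued callbacks releases any references they captured.
  void shutdown() { pending_.clear(); }
};

// References outstanding beyond the caller's own handle.
long extra_refs(const FakePGRef& pg) { return pg.use_count() - 1; }

// Refcount check first, utility shutdown second: the callback still pins the PG.
long check_before_shutdown() {
  FakePGRef pg = std::make_shared<FakePG>();
  FakeTimer timer;
  timer.add_event([pg] { /* would touch the PG when the timer fires */ });
  long extra = extra_refs(pg);
  timer.shutdown();
  return extra;
}

// Utility shutdown first, refcount check second: nothing is left holding the PG.
long check_after_shutdown() {
  FakePGRef pg = std::make_shared<FakePG>();
  FakeTimer timer;
  timer.add_event([pg] { /* would touch the PG when the timer fires */ });
  timer.shutdown();
  return extra_refs(pg);
}
```

In this toy model, check_before_shutdown() reports one extra reference and check_after_shutdown() reports none. Whether the real assert can simply move after OSDService shutdown is still the open question above.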

I'm also a little confused about the snap_trimmer state machine when it shuts down. I've done testing where PGs which are in the WaitTrimTimer state (with associated Context holding a PGRef in snap_sleep_timer) are successfully exiting, which as far as I can tell must mean the Reset() event getting emitted by PrimaryLogPG::on_change() is triggering WaitTrimTimer::exit(). But it's not printing out the log message about exiting that state? So I don't know what's going on. (This was one of the ways I initially assumed it was broken, especially as we've started seeing the same pgref shutdown bug on a Jewel user with the backports.)

Actually, there's a PrimaryLogPG::on_change and a PGBackend::on_change; they are distinct, and the PrimaryLogPG one is only invoked on start_peering_interval() in RecoveryState::Reset::react(AdvMap). So now I'm again just confused that the WaitTrimTimer's OnTimer callback isn't causing the PGRef assert failure.

Okay, my local testing was using init-ceph, and the way it repeatedly invokes kill signals means the OSD wasn't getting shut down cleanly anyway, and I was missing the asserts. But they were present.

So the proximate cause is that the snap_trimmer machine wasn't getting Reset() on shutdown; the PR has patches doing that (and closing some other holes) now. Will test against the rgw-multisite and see if that triggers any other issues, but I'm feeling pretty good.
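For anyone following along, a rough sketch of the failure mode, under heavy assumptions: SketchPG, SketchTimer and SketchTrimmer are hypothetical, the real snap_trimmer is a state machine with more states than this, and std::shared_ptr stands in for PGRef. The point is only that the waiting state's timer callback holds a ref until something resets the machine, which cancels the event.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-in for a PG; std::shared_ptr stands in for PGRef here.
struct SketchPG {};
using SketchPGRef = std::shared_ptr<SketchPG>;

// Hypothetical timer whose events can be cancelled by id.
class SketchTimer {
  std::map<int, std::function<void()>> events_;
  int next_id_ = 0;
public:
  int add_event(std::function<void()> cb) {
    events_.emplace(next_id_, std::move(cb));
    return next_id_++;
  }
  // Erasing the event destroys the callback and the reference it captured.
  void cancel_event(int id) { events_.erase(id); }
};

// Hypothetical trimmer: either idle or waiting on a timer event.
class SketchTrimmer {
  SketchTimer& timer_;
  int event_id_ = -1;  // -1 means no timer event is pending
public:
  explicit SketchTrimmer(SketchTimer& t) : timer_(t) {}
  // Entering the waiting state queues a callback that captures a PG ref.
  void enter_wait(const SketchPGRef& pg) {
    event_id_ = timer_.add_event([pg] { /* would resume trimming */ });
  }
  // Reset leaves the waiting state, which cancels the pending callback.
  void reset() {
    if (event_id_ >= 0) {
      timer_.cancel_event(event_id_);
      event_id_ = -1;
    }
  }
};

// Extra references still held on the PG when shutdown reaches the refcount check.
long refs_at_shutdown(bool reset_on_shutdown) {
  SketchPGRef pg = std::make_shared<SketchPG>();
  SketchTimer timer;
  SketchTrimmer trimmer(timer);
  trimmer.enter_wait(pg);
  if (reset_on_shutdown) trimmer.reset();
  return pg.use_count() - 1;
}
```

In this model, refs_at_shutdown(false) leaves one extra reference behind, while refs_at_shutdown(true) leaves none.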