The bug 700357 has added the rt->gcNumFreeArenas counter that is used for GC scheduling. However, currently the modifications to the counter can race. In particular, the counter is modified in Chunk::init.
That function is called outside the GC lock by the background thread. So the increase rt->gcNumFreeArenas += ArenasPerChunk in the init method can race with a counter decrease on the allocation thread. That could lead to the counter wrapping around 0 and becoming a positive number around UINT_MAX. That in turn permanently trigger periodic GC each 20 seconds.
That could be the reason for the bug 707162.

AFAICS there is another bug with rt->gcNumFreeArenas that has more chances to lead to the bug 707162. When the chunk becomes empty and is moved to the GC free list and the later destroyed, rt->gcNumFreeArenas is not decreased by ArenasPerChunk amount. So whenever we release the first empty chunk rt->gcNumFreeArenas becomes permanently positive.

Created attachment 582415[details][diff][review]
v1: Fix desync of gcNumFreeArenas
Good catch. I think it is fine to have the race as long as we reliably reset the value to something valid occasionally. I am not sure why I did not implement it like this in the first place: as is, it is clearly buggy.

Created attachment 582428[details][diff][review]
full fix
For the bug 702251 and related bugs it is a must to avoid the race as with bugs the reset point may not be reached.
The fix here solves that and yet another problem (but it is mostly theoretical) with counter desynchronization when the counter is not reset after PickChunk failed to add the chunk to the set and calls Chunk::release(rt, chunk).
With the fix the counter reflects the number of free committed arenas in non-empty chunks. I also renamed it to match ChunkInfo::numArenasFreeCommitted.

(In reply to Justin Lebar [:jlebar] from comment #3)
> Do we have telemetry set up so that, when this change is checked in, we'll
> be able to see whether it solves bug 707162? If not, we should set up that
> telemetry ASAP.
I suppose GC_REASON histogram should show a drop in frequency for MAYBEGC (1).

(In reply to Igor Bukanov from comment #5)
> (In reply to Justin Lebar [:jlebar] from comment #3)
> > Do we have telemetry set up so that, when this change is checked in, we'll
> > be able to see whether it solves bug 707162? If not, we should set up that
> > telemetry ASAP.
>
> I suppose GC_REASON histogram should show a drop in frequency for MAYBEGC
> (1).
Where can I see this histogram?

(In reply to Igor Bukanov from comment #8)
> That does not show GC_REASON for some reasons...
Ehsan recently pointed out that telemetry histograms are not allowed to be defined with a range that starts at 0. (Why this is I have no idea; it seems insane.) Because of this, the GC_REASON histogram was getting ignored.

(In reply to Bill McCloskey (:billm) from comment #9)
> Ehsan recently pointed out that telemetry histograms are not allowed to be
> defined with a range that starts at 0. (Why this is I have no idea; it seems
> insane.) Because of this, the GC_REASON histogram was getting ignored.
Is it ignored during submission so we have absolutely no data?

(In reply to Bill McCloskey (:billm) from comment #11)
> The bug is when the histogram data structure is created on the client side.
>
> http://hg.mozilla.org/mozilla-central/file/cbb0233c7ba8/toolkit/components/
> telemetry/Telemetry.cpp#l135
> Unfortunately, Telemetry::Accumulate drops errors on the floor.
>
> Ehsan fixed it so that the range starts at 1. However, we're stilling adding
> 0 values to the histogram, so I'm not actually sure what will happen.
Do we have a bug to either add 1 to histogram values or remap 0 into some other unused value to preserve submitted data if any?

Comment on attachment 582428[details][diff][review]
full fix
My patch has a bad bug - when the chunk is returned to the empty pool the global counter should decrease not by ArenasPerChunk, but rather by the number of committed arenas in the chunk that is put to the pool.

Created attachment 582667[details][diff][review]
v2
The new version hopefully fixes all the issues with the counter update. As before the patch the runtime counter accounts for decommitted arenas both in non-empty chunks and in empty chunks. If we have a lot of empty chunks, but only few arenas in them are committed, it should not represent the memory pressure.

Comment on attachment 582667[details][diff][review]
v2
Review of attachment 582667[details][diff][review]:
-----------------------------------------------------------------
I like the name changes and I especially like that this removes the rt from Chunk::init. However, this does introduce one potential problem: we include FreeCommitted from the ChunkPool in the count, but don't decommit them in DecommitFreePages, leaving us with a persistent positive number. With the current constants it is impossible to get into a loop where we GC_SHRINK at every MaybeGC, but this is a subtle interaction for anyone tweaking those constants to keep in mind. I'll give this an r+ given the addition of sufficient static asserts to ensure that the maximum ChunkPool size and the FreeCommittedArenaThreshhold have a sane relation.

(In reply to Terrence Cole [:terrence] from comment #16)
> However, this does introduce one potential problem: we include
> FreeCommitted from the ChunkPool in the count, but don't decommit them in
> DecommitFreePages, leaving us with a persistent positive number.
I am not sure what do you mean here. Surely we skip empty chunks when doing the decommit, but when we do decommit, we remove later on the background thread all the empty chunks. As the new GC always waits until the backgrounw sweep f
With the
> current constants it is impossible to get into a loop where we GC_SHRINK at
> every MaybeGC, but this is a subtle interaction for anyone tweaking those
> constants to keep in mind. I'll give this an r+ given the addition of
> sufficient static asserts to ensure that the maximum ChunkPool size and the
> FreeCommittedArenaThreshhold have a sane relation.

Sorry for incomplete comment, I meant:
(In reply to Terrence Cole [:terrence] from comment #16)
> However, this does introduce one potential problem: we include
> FreeCommitted from the ChunkPool in the count, but don't decommit them in
> DecommitFreePages, leaving us with a persistent positive number.
I am not sure what do you mean here. Surely we skip empty chunks when doing the decommit, but when we do decommit, we remove later on the background thread all the empty chunks. As the new GC always waits until the background sweep finishes, we cannot have several GC with decommits without releasing all empty chunks.
Also note that in the bug 702251 I wull implement decommit together with the chunk release on the background thread, so there would be no arbitrary pause between decoomit and release,

(In reply to Igor Bukanov from comment #18)
> Sorry for incomplete comment, I meant:
>
> (In reply to Terrence Cole [:terrence] from comment #16)
> > However, this does introduce one potential problem: we include
> > FreeCommitted from the ChunkPool in the count, but don't decommit them in
> > DecommitFreePages, leaving us with a persistent positive number.
>
> I am not sure what do you mean here. Surely we skip empty chunks when doing
> the decommit, but when we do decommit, we remove later on the background
> thread all the empty chunks. As the new GC always waits until the background
> sweep finishes, we cannot have several GC with decommits without releasing
> all empty chunks.
Ah, okay, I did not know that we freed the ChunkPool on GC_SHRINK, but it makes perfect sense. In that case, this is fine. It would be nice if we could also add an assert at the end of GC_SHRINK that the counter is always 0, but that would require us to decide that the Decommit is infallible, which I'm not sure we want to guarantee yet.
> Also note that in the bug 702251 I wull implement decommit together with
> the chunk release on the background thread, so there would be no arbitrary
> pause between decoomit and release,
I'm looking forward to seeing that -- particularly how you will avoid taking a lock at arena allocation if we may be decommitting at the same time.

(In reply to Terrence Cole [:terrence] from comment #19)
> I'm looking forward to seeing that -- particularly how you will avoid taking
> a lock at arena allocation if we may be decommitting at the same time.
That was not the most difficult part. The bigger complexity comes from avoiding racing with the allocation thread and decommitting from the tail of the available chunk lists, not from the start - I need to fix few issues with the patch before asking for a review...