Depending on the size, it would definitely happen anyway.
Correct me if I'm wrong, but once you use popFront, you
effectively modify the source to contain a slice of its original
data. If you want to extend that data, the slice can't be
expanded safely, since it doesn't own the whole memory block.
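A minimal sketch of the slicing behaviour described above: a D dynamic array is a (pointer, length) pair, and popFront on it just advances the pointer by one element, so the original allocation stays referenced. (The array contents and helper name here are illustrative, not from the original program.)

```d
import std.range : popFront;

/// Shows that popFront on a slice advances the front pointer
/// within the same allocation rather than freeing anything.
bool frontAdvances()
{
    ulong[] nums = [10, 20, 30, 40];
    auto origPtr = nums.ptr;

    nums.popFront();  // nums is now [20, 30, 40]

    // Same block, front advanced by one element.
    return nums.ptr == origPtr + 1 && nums == [20UL, 30, 40];
}

void main()
{
    assert(frontAdvances());
}
```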
Try this (tested; it uses 80 MB less memory and runs 400 ms faster):

replace:
    ulong[] nums = new ulong[k];
with:
    ulong[] nums;
    nums.reserve(k + n + 1);
    nums.length = k;
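A self-contained sketch of the fix above: reserving the final capacity once means later appends never trigger a reallocation (and thus never copy the live data). The sizes k and n here are illustrative, standing in for the program's real ones.

```d
/// Reserve the whole capacity up front, then verify that appending
/// within that capacity never moves the array.
bool noReallocation()
{
    enum size_t k = 4, n = 3;

    ulong[] nums;
    nums.reserve(k + n + 1); // one allocation for the whole run
    nums.length = k;         // working window of k zeroed elements

    auto before = nums.ptr;
    foreach (ulong i; 0 .. n + 1)
        nums ~= i;           // fits in the reserved capacity

    return nums.ptr is before; // no reallocation happened
}

void main()
{
    assert(noReallocation());
}
```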

That fixed the memory behaviour and is faster. Not using popFront
at all uses the same memory and is faster still. Given that
popFront advances a range, does this mean the underlying array is
not being deleted? If so, how would one delete the information
that is no longer needed?

This is becoming a "fixed size circular queue". But maybe a
modulus is faster than a branch here. (It's best when k is always
a power of two: then you don't need a modulus at all. And even
better if your size is a multiple of the page size.)

nums[iter_next] = total % 10^^8;

In such cases I suggest adding parentheses, so the code reader
doesn't have to remember the precedence of uncommon operators.
Bye,
bearophile
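As a side note on the power-of-two trick bearophile mentions: when k is a power of two, `x % k` can be replaced with `x & (k - 1)`, which needs neither a division nor a branch. A small sketch (the helper name is illustrative):

```d
/// Next index in a fixed-size circular buffer whose size k is a
/// power of two: the AND mask replaces both the modulus and the branch.
uint nextIndex(uint iter, uint k)
{
    assert(k != 0 && (k & (k - 1)) == 0); // k must be a power of two
    return (iter + 1) & (k - 1);
}

void main()
{
    enum uint k = 8;
    uint iter = 0;
    foreach (step; 0 .. 20)
    {
        // The mask computes exactly what the modulus would.
        assert(nextIndex(iter, k) == (iter + 1) % k);
        iter = nextIndex(iter, k);
    }
}
```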

I tested that; modulus is slower. The compiler is surely
converting it to something branchless like:

uint iter_next = (iter + 1) * !(iter + 1 > k);

I take your point, but I think most people know that the
comparison operators have low precedence.
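A quick equivalence check of the branched form from the thread against the multiply-based branchless form shown above: both wrap to 0 once iter reaches k. (Helper names are illustrative; this only verifies the two forms agree, not their relative speed.)

```d
/// The ternary form: a conditional branch the predictor can learn.
uint nextBranched(uint iter, uint k)
{
    return iter + 1 > k ? 0 : iter + 1;
}

/// The multiply-by-boolean form: no branch, the comparison result
/// (0 or 1) zeroes the successor when it would exceed k.
uint nextBranchless(uint iter, uint k)
{
    return (iter + 1) * !(iter + 1 > k);
}

void main()
{
    enum uint k = 10;
    foreach (uint iter; 0 .. k + 1)
        assert(nextBranched(iter, k) == nextBranchless(iter, k));
}
```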

In any case, with large values of k the branch prediction will be
right almost all of the time, which explains why this form is
faster than modulo: modulo is fairly slow, while this is a
correctly predicted branch doing an addition, if the compiler
doesn't make it branchless. The branchless version gives the same
timing as the branched one; is there a way to force that line not
to be optimized, so the predicted version can be compared?

In any case, with large values of k the branch prediction will
be right almost all of the time, which explains why this form is
faster than modulo: modulo is fairly slow, while this is a
correctly predicted branch doing an addition, if the compiler
doesn't make it branchless.

That seems the explanation.

The branchless version gives the same timing as the branched one;
is there a way to force that line not to be optimized, so the
predicted version can be compared?

I don't fully understand the question. Do you mean annotations
like the __builtin_expect of GCC?
Bye,
bearophile

If

uint iter_next = iter + 1 > k ? 0 : iter + 1;

is getting optimized to

uint iter_next = (iter + 1) * !(iter + 1 > k);

or something like it by the compiler, then it would be nice to be
able to test the branched code without the rest of the program
losing speed optimizations, because, as I said, for large k the
branch will almost always be correctly predicted, which makes me
think it would be faster than the branchless version.