I'm writing a range to select a random sampling of a forward range. The
number of elements in the forward range is already known.
The strategy is: at any moment, I know I need to select k random samples
out of n. To implement popFront, I generate a random number in [0, n -
k]. That's the probability that the next item in the input will be part
of the selection. If the random number is 0, then the element got
selected. Otherwise, I must ignore that guy so I popFront the input
range, decrease n by one, and repeat.
This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented than
the ones towards the end. Where am I wrong?
Andrei

I'm writing a range to select a random sampling of a forward range. The
number of elements in the forward range is already known.
The strategy is: at any moment, I know I need to select k random samples
out of n. To implement popFront, I generate a random number in [0, n -
k]. That's the probability that the next item in the input will be part
of the selection. If the random number is 0, then the element got
selected. Otherwise, I must ignore that guy so I popFront the input
range, decrease n by one, and repeat.
This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented than
the ones towards the end. Where am I wrong?
Andrei

I'm confused. As far as I can tell, RandomCover is correct in 2.030. I both
read
the code and did a monte carlo test for some short sequences to see if the
distribution was uniform on the space of possible permutations. Isn't this
just a
case of using RandomCover, but stopping before iterating through the entire
range?
In other words, wouldn't the following work:
auto selectK(R)(R someRange, uint K, Random gen) {
assert(K <= someRange.length);
return take(K, randomCover(someRange, gen));
}

I'm writing a range to select a random sampling of a forward range. The
number of elements in the forward range is already known.
The strategy is: at any moment, I know I need to select k random samples
out of n. To implement popFront, I generate a random number in [0, n -
k]. That's the probability that the next item in the input will be part
of the selection. If the random number is 0, then the element got
selected. Otherwise, I must ignore that guy so I popFront the input
range, decrease n by one, and repeat.
This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented than
the ones towards the end. Where am I wrong?
Andrei

I'm confused. As far as I can tell, RandomCover is correct in 2.030. I both
read
the code and did a monte carlo test for some short sequences to see if the
distribution was uniform on the space of possible permutations. Isn't this
just a
case of using RandomCover, but stopping before iterating through the entire
range?
In other words, wouldn't the following work:
auto selectK(R)(R someRange, uint K, Random gen) {
assert(K <= someRange.length);
return take(K, randomCover(someRange, gen));
}

You are right. I'm just less than happy with randomCover because it
needs additional memory, so I eliminated it as a possibility from the
get-go.
Andrei

I'm writing a range to select a random sampling of a forward range.
The number of elements in the forward range is already known.
The strategy is: at any moment, I know I need to select k random
samples out of n. To implement popFront, I generate a random number in
[0, n - k]. That's the probability that the next item in the input
will be part of the selection. If the random number is 0, then the
element got selected. Otherwise, I must ignore that guy so I popFront
the input range, decrease n by one, and repeat.
This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented than
the ones towards the end. Where am I wrong?
Andrei

Check my reasoning on this:
It will be skewed because as soon as the system gets behind (and statistically
it will half the time) the elements start being selected with a higher
probability.
Take the related case, select each element with equal probability until you
have enough selected or have just enough left to get the correct amount.
The last element will get selected about half the time no matter what
percentage
you want.

Check my reasoning on this:
It will be skewed because as soon as the system gets behind (and
statistically it will half the time) the elements start being selected
with a higher probability.
Take the related case, select each element with equal probability
until you have enough selected or have just enough left to get the
correct amount. The last element will get selected about half the time
no matter what percentage you want.

after some ad-hoc tests ("fun time with Excel!") it seems that the above
is off in the weeds.

I'm writing a range to select a random sampling of a forward range. The
number of elements in the forward range is already known.
The strategy is: at any moment, I know I need to select k random samples
out of n. To implement popFront, I generate a random number in [0, n -
k]. That's the probability that the next item in the input will be part
of the selection. If the random number is 0, then the element got
selected. Otherwise, I must ignore that guy so I popFront the input
range, decrease n by one, and repeat.
This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented than
the ones towards the end. Where am I wrong?
Andrei

You have your probability wrong. The selection probability is k/n. Pick a
number in [0,n) and pick the element if the result is in [0,k). If you do
select an item, decrement k. Always decrement n when popping, regardless of if
the element is selected.

I'm writing a range to select a random sampling of a forward range.
The number of elements in the forward range is already known.
The strategy is: at any moment, I know I need to select k random
samples out of n. To implement popFront, I generate a random number
in [0, n - k]. That's the probability that the next item in the
input will be part of the selection. If the random number is 0,
then the element got selected. Otherwise, I must ignore that guy so
I popFront the input range, decrease n by one, and repeat.
This seems reasonable but somehow the selection comes skewed: the
elements towards the beginning of the range are less represented
than the ones towards the end. Where am I wrong?
Andrei

You have your probability wrong. The selection probability is k/n.
Pick a number in [0,n) and pick the element if the result is in
[0,k). If you do select an item, decrement k. Always decrement n when
popping, regardless of if the element is selected.

Great, thanks, that works perfect. If I want to choose _toSelect stuff
out of _available items, this works to position to the next element:
for (;;)
{
auto r = uniform(0, _available);
if (r < _toSelect)
{
// chosen!
return;
}
// not chosen, retry
assert(!_input.empty);
_input.popFront;
--_available;
assert(_available > 0);
}
As an aside, it looks like the right-open range for random numbers is a
good default.
Andrei

Andrei,
I noticed in random.d, uniform template, that popFront is called in
different locations for integral compared to floating point types: for
integral types you .front first and .popFront afterwards, but for floating
point types you start with .popFront and then check .front.
This has a peculiar effect: for example, if you do uniform(0.0,100.0)
followed by uniform(0,100) there's a big chance that the integral part of
the first random number is equal to the second random number.
import std.stdio, std.random;
void main()
{
writeln(uniform(0.0,100.0));
writeln(uniform(0,100));
}
I don't think this warrants a bug report, but I do think the location of
.popFront should be standardized, either before or after any .front.
Just sayin'.
L.

Andrei,
I noticed in random.d, uniform template, that popFront is called in
different locations for integral compared to floating point types: for
integral types you .front first and .popFront afterwards, but for
floating point types you start with .popFront and then check .front.
This has a peculiar effect: for example, if you do uniform(0.0,100.0)
followed by uniform(0,100) there's a big chance that the integral part
of the first random number is equal to the second random number.
import std.stdio, std.random;
void main()
{
writeln(uniform(0.0,100.0));
writeln(uniform(0,100));
}
I don't think this warrants a bug report, but I do think the location of
.popFront should be standardized, either before or after any .front.
Just sayin'.
L.

Definitely warrants a bug report. I'm busy today but I might find time
for it tomorrow.
Andrei