I could sample a set of $m$ elements from the uniform distribution over a universe $U$ of $n \gg m$ elements. Alternatively, I could select a random probability distribution $\mathcal{D}$ over $U$, and sample $m$ elements from $\mathcal{D}$.

Do these two methods lead to the same distribution over my sample? If not, how do they differ? If some event (my sample lies in some set of samples of size $m$) occurs with probability $p$ using the second method, what can I say about its probability using the first method?

How do you interpret a point in the $\ell_1$ ball as a distribution? It might have some coordinates negative, or its coordinates might (in fact, almost surely will) sum to less than one. Do you mean to pick a point on the standard $n$-simplex, i.e. the set of points $(x_1, \ldots, x_n)$ with all coordinates nonnegative and $x_1 + \cdots + x_n = 1$?
– Michael Lugo, Jan 8 '10 at 15:54

Yes, thanks for the correction. I mean sample a point in the positive region of the $\ell_1$-ball rescaled so that all coordinates sum to 1, which should be equivalent to picking a point from the standard $n$-simplex.
– Wilson, Jan 8 '10 at 16:02

Do you mean for the $m$ elements to be chosen independently in both cases? In that case $m$ plays no role in the question and you might as well assume $m=1$.
– Mark Meckes, Jan 8 '10 at 16:09

Hi Mark -- yes, in both cases, the m elements are chosen independently. I'm not sure that I may as well assume $m=1$, however. If $m=1$, by symmetry, the two distributions must be identical. But for $m > 1$, selecting a distribution and then selecting an independent sample from it may introduce correlations that do not exist when selecting independently from a uniform distribution.
– Wilson, Jan 8 '10 at 16:12

The two distributions are actually not the same; rescaling has a nontrivial effect. If we pick a random point $(x_1, x_2)$ from the unit square $[0,1] \times [0,1]$, the probability that $x_1/x_2$ is greater than 2 is the area of the triangle with vertices $(0,0), (1,0), (1, 1/2)$, which is $1/4$. If we pick a random point $(x_1, x_2)$ from the line segment from $(0,1)$ to $(1,0)$, the probability that $x_1/x_2$ is greater than 2 is $1/3$.
– Michael Lugo, Jan 8 '10 at 16:12

2 Answers

Assume $U=\{1,\ldots,n\}$ for concreteness. If $Y_1,\ldots,Y_m$ are chosen independently and uniformly from $U$, then for any $k_1,\ldots,k_m\in U$, we of course have
$$
\Pr[Y_1=k_1,\ldots,Y_m=k_m] = \frac{1}{n^m}.
$$

On the other hand, if $x=(x_1,\ldots,x_n)$ is chosen uniformly from the standard $n$-simplex and $Y_1,\ldots,Y_m$ are then chosen independently according to $x$, then
$$
\Pr[Y_1=k_1,\ldots,Y_m=k_m] = \mathbb{E}\Pr[Y_1=k_1,\ldots,Y_m=k_m|x]
= \mathbb{E}\prod_{i=1}^m x_{k_i} = \frac{(n-1)!}{(n+r-1)!}\prod_{j=1}^n r_j!,
$$
where $r_j = \#\{1\le i \le m : k_i=j\}$ and $r=r_1 + \cdots + r_n = m$. For $m=1$ this reduces to $1/n$, in agreement with the uniform case. This last expectation can be proved most easily from Lemma 1 in this paper.
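As a sanity check, here is a minimal Python sketch that evaluates the closed form $\frac{(n-1)!}{(n+m-1)!}\prod_j r_j!$ exactly (this is the moment of the uniform distribution on the simplex with $n$ coordinates) and confirms that the probabilities sum to 1 over all $n^m$ tuples. The function name `prob_random_distribution` and the choice $n=4$, $m=3$ are illustrative assumptions, not part of the answer.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import factorial


def prob_random_distribution(ks, n):
    """Exact Pr[Y_1=k_1,...,Y_m=k_m] when x is uniform on the simplex
    with n coordinates and the Y_i are i.i.d. from x (hypothetical helper)."""
    m = len(ks)
    numerator = factorial(n - 1)
    # r_j = number of times element j appears among k_1, ..., k_m
    for r_j in Counter(ks).values():
        numerator *= factorial(r_j)
    return Fraction(numerator, factorial(n + m - 1))


n, m = 4, 3  # small illustrative sizes
total = sum(
    prob_random_distribution(ks, n)
    for ks in product(range(1, n + 1), repeat=m)
)
print("sum over all tuples:", total)
print("all equal   :", prob_random_distribution((1, 1, 1), n))
print("all distinct:", prob_random_distribution((1, 2, 3), n))
print("uniform     :", Fraction(1, n**m))
```

Tuples with repeated elements come out more likely than $1/n^m$ and tuples with distinct elements less likely, which is the qualitative difference between the two methods.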

While the exact answer by Mark Meckes is nice, it's worth pointing out that if you condition on not repeating elements, the conditional distributions are equal by symmetry, and your condition $n \gg m$ is close to what you need to say that repetitions are rare.

Repetition is more common (about twice as likely for any given pair) if you choose a random weighting and then sample from that instead of sampling uniformly. The condition that $n$ is much greater than $m^2$ means that repetition is rare in samples from the uniform distribution, as the expected number of repeated pairs $Y_i = Y_j$ with $i < j$ is $\binom{m}{2}/n$.

If we choose a random distribution, the weight on a particular element follows a beta distribution $\mathrm{Beta}(1,n-1)$. The probability that both $Y_i$ and $Y_j$ equal that element is the second moment of the weight, variance + mean$^2$, or
$(n-1)/(n^2 (n+1)) + 1/n^2 = 2/(n(n+1))$. Summing over the $n$ elements, the probability of a collision under the second method is $P_2(Y_i=Y_j)=2/(n+1)$ instead of $1/n$, and the expected number of repeated pairs is $2\binom{m}{2}/(n+1)$.
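The collision probability $2/(n+1)$ can be checked by simulation. The following is a rough Monte Carlo sketch using only the Python standard library; it assumes the standard fact that normalizing i.i.d. exponential variables gives a uniform point on the simplex. The helper names, the seed, $n=10$, and the number of trials are all illustrative choices.

```python
import random


def sample_simplex(n, rng):
    """Uniform point on the simplex with n coordinates,
    obtained by normalizing n i.i.d. Exp(1) draws."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def estimate_pair_collision(n, trials, rng):
    """Estimate Pr[Y_i = Y_j] when a fresh random distribution x is
    drawn each trial and two elements are sampled independently from x."""
    hits = 0
    for _ in range(trials):
        x = sample_simplex(n, rng)
        y_i, y_j = rng.choices(range(n), weights=x, k=2)
        hits += (y_i == y_j)
    return hits / trials


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 10
estimate = estimate_pair_collision(n, 50_000, rng)
print("estimate          :", estimate)
print("2/(n+1), predicted:", 2 / (n + 1))
print("1/n, uniform case :", 1 / n)
```

With these settings the estimate should land close to $2/11 \approx 0.18$ and well away from the uniform value $1/10$.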

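If $n \gg m^2$, the total variation distance between the two distributions is small. Write $P_1$ and $P_2$ for the distributions of the sample under the first and second methods, and let $\Delta$ be the diagonal set where there is at least one repetition. Because the two methods agree conditionally on no repetition, and $P_2(\Delta)$ is at most the expected number of repeated pairs, for any set $S$ of samples we get
$$
|P_2(S)-P_1(S)| \le 2P_2(\Delta) \le 4\binom{m}{2}/(n+1) < 2m^2/n.
$$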
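For small parameters the total variation distance can be computed exactly by enumeration and compared with the bound $4\binom{m}{2}/(n+1)$. This is only a sketch: it assumes the closed form $\frac{(n-1)!}{(n+m-1)!}\prod_j r_j!$ for the second method, and the names `p2` and `total_variation` and the parameter choices are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb, factorial


def p2(ks, n):
    """Probability of the tuple ks under the second method
    (random distribution uniform on the simplex with n coordinates)."""
    numerator = factorial(n - 1)
    for r_j in Counter(ks).values():
        numerator *= factorial(r_j)
    return Fraction(numerator, factorial(n + len(ks) - 1))


def total_variation(n, m):
    """max over S of |P_2(S) - P_1(S)|, computed as half the L1 distance
    between the two probability mass functions on all n**m tuples."""
    p1 = Fraction(1, n**m)  # uniform sampling gives every tuple 1/n^m
    l1 = sum(abs(p2(ks, n) - p1) for ks in product(range(n), repeat=m))
    return l1 / 2


for n, m in [(10, 2), (20, 3), (30, 3)]:
    tv = total_variation(n, m)
    bound = Fraction(4 * comb(m, 2), n + 1)
    print(f"n={n} m={m} TV={float(tv):.4f} bound={float(bound):.4f}")
```

The enumeration costs $n^m$ steps, so it is only practical for tiny cases, but it shows the exact distance sitting below the bound.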