Say you have 2 iid random variables $x,y\sim U[0,1]^k$, i.e. the uniform distribution over the k-dimensional unit cube.
What's the expected value of the Euclidean distance between them when they have been normalized by the maximum distance possible, i.e. $\sqrt k$?

For $k=1$, I worked out that this is 1/3. For $k=100$, monte carlo simulations tell me it's a little over 0.4.

I tried to work out the math but
$$\frac{1}{\sqrt{k}}\int_{S_x} \int_{S_y} \sqrt{(x-y)^T (x-y)}\;dx\;dy$$
where $S_x,S_y=[0,1]^k$ for general $k$ is beyond me.

2 Answers
2

To see this, consider i.i.d. random variables $X_i$ and $Y_i$ uniform on $[0,1]$ and write the quantity to be computed as
$I_k=\mathrm E\left(\sqrt{Z_k}\right)$ with
$$
Z_k=\frac1k\sum_{i=1}^k(X_i-Y_i)^2.
$$
By the strong law of large numbers for i.i.d. bounded random variables, when $k\to\infty$, $Z_k\to z$ almost surely and in every $L^p$, with $z=\mathrm E(Z_1)$. In particular, $I_k\to \sqrt{z}$. Numerically,
$$
z=\iint_{[0,1]^2}(x-y)^2\mathrm{d}x\mathrm{d}y=2\int_0^1x^2\mathrm{d}x-2\left(\int_0^1x\mathrm{d}x\right)^2=2\frac13-2\left(\frac12\right)^2=\frac16.
$$

Edit (This answers a different question asked by the OP in the comments.)

Consider the maximum of $n\gg1$ independent copies of $kZ_k$ with $k\gg1$ and call $M_{n,k}$ its square root. A heuristics to estimate the typical behaviour of $M_{n,k}$ is as follows.

By the central limit theorem (and in a loose sense), $Z_k\approx z+N\sqrt{v/k}$ where $v$ is the variance of $Z_1$ and $N$ is a standard gaussian random variable. In particular, for every given nonnegative $s$,
$$
\mathrm P\left(Z_k\ge z+s\right)\approx\mathrm P\left(N^2\ge ks^2/v\right).
$$
Furthermore, the typical size of $M_{n,k}^2$ is $z+s$ where $s$ solves $\mathrm P(Z_k\ge z+s)\approx1/n$. Choose $q(n)$ such that $\mathrm P(N\ge q(n))=1/n$, that is, $q(n)$ is a so-called quantile of the standard gaussian distribution. Then, the typical size of $M_{n,k}^2$ is $k(z+s)$ where $s$ solves $ks^2/v=q(n)^2$. Finally,
$$
M_{n,k}\approx \sqrt{kz+q(n)\sqrt{kv}}.
$$
Numerically, $z=1/6$, $v=7/180$, and you are interested in $k=1'000$. For $n=10'000$, $q(n)=3.719$ yields a typical size $M_{n,k}\approx13.78$ and for $n=100'000$, $q(n)=4.265$ hence $M_{n,k}\approx13.90$ (these should be compared to the values you observed).

To make rigorous such estimates and to understand why, in a way, $M_{n,k}$ concentrates around the typical value we computed above, see here.

Thanks! This makes perfect sense. One question about monte carlo experiments: there's a fairly big difference in the expected value when I fix the max distance possible to the theoretical $\sqrt(k)$ and when I take the actual max that's been observed from the samples. The latter actually gives the sample expectation to be about 0.8ish. I know that observing $\sqrt{k}$ from samples is near impossible. But will the monte carlo estimate converge almost surely to $1/\sqrt{6}$ for large enough $n$?
–
JasonMondSep 13 '11 at 15:35

Wait, you simulate (pairs of) points in the unit cube and not their distances, do you? So I am not sure of the reason why you have to fix a max distance.
–
DidSep 13 '11 at 16:39

Let me elaborate: if $X$ and $Y$ are independent and uniform on $[0,1]$, neither $|X-Y|$ nor $(X-Y)^2$ are uniform on $[0,1]$. So if you simulate either one of these directly as uniform (instead of simulating $X$ uniform then $Y$ uniform then combining them), the result has little to do with $I_k$.
–
DidSep 13 '11 at 17:07

I'm simulating as you describe it (in R: for(i in 1:10000) {v <- nrm(runif(1000),runif(1000)); if (v > mx) mx <- v;} where nrm is euclidian distance and mx stores the maximum). But with 10000 samples, the maximum distance I get with $k=1000$ is 13.8. With 100000 samples, it ekes out a bit more with 14.0. Theoretically, the maximum distance possible is $\sqrt{1000}\approx 31.6$ and with an infinite number of samples, I guess I will get very close to this value. And the expected value of pairwise distance would be $1/\sqrt{6}$. But not with the sample maximum.
–
JasonMondSep 14 '11 at 2:01

1

@JasonMond: What is your concern with the maximum? And, but the way, a much more efficient way to generate your sample in $R$ is to do: v<-replicate(10000,sqrt(sum((runif(1000)-runif(1000))^2)).
–
cardinalSep 14 '11 at 9:56

This probably isn't anything nice for general $k$. You're asking for
$$ E \sqrt{ \sum_{i=1}^k (X_i-Y_i)^2/k } $$
where $X_i$ and $Y_i$ are independent uniform(0,1). However, let $Z_i = (X_i-Y_i)^2$. Now you're looking for
$$ E \sqrt{ (\sum_{i=1}^k Z_k)/k }$$
and let's think about what happens when $k$ gets large. When $k$ is large, by the law of large numbers $(\sum_{i=1}^k Z_k)/k$ will be concentrated around $E(Z_k)$. Therefore as $k$ gets large I expect that the answer approaches $\sqrt{E(Z_k)}$, which is $\sqrt{1/6}$.