Given a set of data samples, we classify them into two classes, A and B.
We can calculate the total variance $\sigma^2_\text{total}$,
the inter-class variance $\sigma^2_\text{inter}$, and intra-class variance $\sigma^2_\text{intra}$.
Can anyone prove that $\sigma^2_\text{total}= \sigma^2_\text{inter}+\sigma^2_\text{intra}$?
Thanks.

1 Answer

It's the sums of squares of deviations, not the averages of squares of deviations, for which this identity holds.

Let $y_{ij}$ be the $j$th observation in the $i$th class. The overall average is
$$
\bar{y}_{\bullet\bullet} = \frac{\sum_i\sum_j y_{ij}}{\sum_i\sum_j 1}.
$$
The average within class $i$ is
$$
\bar{y}_{i\bullet} = \frac{\sum_j y_{ij}}{\sum_j 1}
$$
(where the number of values of $j$ may depend on $i$, i.e. not all classes must have the same size).

Then the total corrected sum of squares is
$$
\sum_i\sum_j (y_{ij}-\bar{y}_{\bullet\bullet})^2.
$$
The intra-class sum of squares is
$$
\sum_i\sum_j (y_{ij}- \bar{y}_{i\bullet})^2.
$$
The inter-class sum of squares is
$$
\sum_i \sum_j (\bar{y}_{i\bullet}-\bar{y}_{\bullet\bullet})^2 = \sum_i \left( n_i (\bar{y}_{i\bullet}-\bar{y}_{\bullet\bullet})^2\right)
$$
where $n_i$ is the number of observations in the $i$th class.
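As a quick numerical sanity check of the three sums of squares just defined, here is a minimal Python sketch. The data values and the names `classes`, `ss_total`, `ss_intra`, and `ss_inter` are made up for illustration; the two classes are deliberately of unequal size.

```python
# Hypothetical data: two classes with different numbers of observations.
classes = {"A": [1.0, 2.0, 3.0], "B": [6.0, 8.0]}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


all_values = [y for ys in classes.values() for y in ys]
grand_mean = mean(all_values)                               # overall average
class_means = {k: mean(ys) for k, ys in classes.items()}    # per-class averages

# Total corrected sum of squares: deviations from the overall average.
ss_total = sum((y - grand_mean) ** 2 for y in all_values)

# Intra-class sum of squares: deviations from each observation's own class average.
ss_intra = sum(
    (y - class_means[k]) ** 2 for k, ys in classes.items() for y in ys
)

# Inter-class sum of squares: n_i times the squared deviation of each class average.
ss_inter = sum(
    len(ys) * (class_means[k] - grand_mean) ** 2 for k, ys in classes.items()
)

print(ss_total, ss_intra, ss_inter)  # 34.0 4.0 30.0
```

For this toy data the total is $34$, which equals $4 + 30$, the intra-class plus the inter-class sum of squares.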

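Splitting each deviation from the overall average into a within-class part and a between-class part, and then expanding the square, gives
$$
(y_{ij}-\bar{y}_{\bullet\bullet})^2 = \big((y_{ij}-\bar{y}_{i\bullet})+(\bar{y}_{i\bullet}-\bar{y}_{\bullet\bullet})\big)^2 = (y_{ij}-\bar{y}_{i\bullet})^2 + \underbrace{2(y_{ij}-\bar{y}_{i\bullet})(\bar{y}_{i\bullet}-\bar{y}_{\bullet\bullet})}_{\text{middle term}} + (\bar{y}_{i\bullet}-\bar{y}_{\bullet\bullet})^2.
$$
Summing over $i$ and $j$, the first and last terms give the intra-class and inter-class sums of squares. What we need to show is that the sum of all values of the middle term (the part over the underbrace) is $0$. Observe that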
$$
2\sum_i\sum_j (y_{ij}-\bar{y}_{i\bullet})(\bar{y}_{i\bullet}-\bar{y}_{\bullet\bullet}) = 2\sum_i\left((\bar{y}_{i\bullet}-\bar{y}_{\bullet\bullet})\sum_j (y_{ij}-\bar{y}_{i\bullet})\right).
$$
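This is because the factor that is pulled out from the inside sum does not depend on $j$. The remaining inside sum is $0$, since $\sum_j (y_{ij}-\bar{y}_{i\bullet}) = \sum_j y_{ij} - n_i\bar{y}_{i\bullet}$ and $\bar{y}_{i\bullet} = \frac{1}{n_i}\sum_j y_{ij}$ by definition.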

The fact that the sum in the middle vanishes actually says two vectors are at right angles to each other, and one phrases the argument in terms of orthogonal projections when one is trying to show that the two sums that remain have independent chi-square distributions.
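To make the right-angle remark concrete, here is a small Python sketch with made-up data. It builds the vector of within-class deviations and the vector of between-class deviations, one entry per observation, and checks that their dot product is zero, so that the squared lengths add as in the Pythagorean theorem. The names `groups`, `within`, `between`, and `sq_norm` are illustrative only.

```python
# Hypothetical data: two classes of unequal size.
groups = [[1.0, 2.0, 3.0], [6.0, 8.0]]

n = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n

within = []   # entries y_ij - ybar_i. (deviation from own class average)
between = []  # entries ybar_i. - ybar.. (class average minus overall average)
for g in groups:
    class_mean = sum(g) / len(g)
    for y in g:
        within.append(y - class_mean)
        between.append(class_mean - grand_mean)

# The two pieces add up to the deviation from the overall average.
total = [w + b for w, b in zip(within, between)]


def sq_norm(v):
    """Squared Euclidean length of a vector given as a list."""
    return sum(x * x for x in v)


# Dot product of the two deviation vectors; this is half the summed middle term.
dot = sum(w * b for w, b in zip(within, between))

print(dot)                                               # 0.0
print(sq_norm(total), sq_norm(within), sq_norm(between))  # 34.0 4.0 30.0
```

Since the dot product vanishes, the squared length of `total` is the sum of the squared lengths of `within` and `between`, which is exactly the sum-of-squares identity.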