where for each pair \(z=(x,y)\), \(h(z,z')=k(x,x')+k(y,y')-k(x,y')- k(x',y)\).
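The definition of \(h\) can be made concrete with a minimal sketch. The Gaussian kernel on scalars, the function names, and the choice to average \(h\) over all distinct pairs with equal sample sizes are assumptions made for illustration; they are not taken from the library.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel on scalars (the kernel choice is an assumption)."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

def h(z, z_prime, k=gaussian_kernel):
    """h(z, z') = k(x, x') + k(y, y') - k(x, y') - k(x', y)."""
    x, y = z
    x_p, y_p = z_prime
    return k(x, x_p) + k(y, y_p) - k(x, y_p) - k(x_p, y)

def mean_h_over_distinct_pairs(xs, ys, k=gaussian_kernel):
    """Average h(z_i, z_j) over all i != j; assumes len(xs) == len(ys) >= 2."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal size, at least 2 each")
    n = len(xs)
    zs = list(zip(xs, ys))
    total = sum(h(zs[i], zs[j], k)
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))
```

When both samples are identical, every \(h(z,z')\) cancels to zero, which is a quick sanity check for an implementation.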

The type (biased/unbiased/incomplete) can be selected via set_statistic_type(). Note that there are presently two setups for computing the statistic. With BIASED, UNBIASED or INCOMPLETE, the estimate returned by compute_statistic() is \(\frac{n_xn_y}{n_x+n_y}\hat{\eta}_k\). With the DEPRECATED types, it returns \((n_x+n_y)\hat{\eta}_k\) in general and \((\frac{n}{2}) \hat{\eta}_k\) when \(n_x=n_y=\frac{n}{2}\). The same scaling applies to the null distribution samples.
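The two scaling conventions can be compared side by side with a small sketch. The function name and the boolean flag are hypothetical and exist only for this illustration; they are not part of the library interface.

```python
def scale_statistic(eta_hat, n_x, n_y, deprecated=False):
    """Scale a raw estimate eta_hat following the two conventions described above.

    deprecated=False: n_x * n_y / (n_x + n_y) * eta_hat
    deprecated=True : (n_x + n_y) * eta_hat in general, and (n / 2) * eta_hat
                      with n = n_x + n_y when the sample sizes are equal.
    """
    if n_x <= 0 or n_y <= 0:
        raise ValueError("sample sizes must be positive")
    if not deprecated:
        return n_x * n_y / (n_x + n_y) * eta_hat
    if n_x == n_y:
        return n_x * eta_hat  # n_x equals n / 2 here
    return (n_x + n_y) * eta_hat
```

For equal sample sizes the two conventions differ by a factor of two: the current types scale by \(\frac{n}{4}\), the deprecated ones by \(\frac{n}{2}\).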

The variance of the asymptotic distribution of the statistic under the null and alternative hypotheses can be estimated with the compute_variance() method. Internally, this is done alongside computing the statistic to avoid recomputing the kernel.

Variance under null is computed as \(\sigma_{k,0}^2=2\hat{\kappa}_2=2(\kappa_2-2\kappa_1+\kappa_0)\) where \(\kappa_0=\left(\mathbb{E}_{X,X'}k(X,X')\right )^2\), \(\kappa_1=\mathbb{E}_X\left[(\mathbb{E}_{X'}k(X,X'))^2\right]\), and \(\kappa_2=\mathbb{E}_{X,X'}k^2(X,X')\) and variance under alternative is computed as

Note that statistic and variance estimation can be done for multiple kernels at once as well.
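The null-variance expression \(\sigma_{k,0}^2=2(\kappa_2-2\kappa_1+\kappa_0)\) can be illustrated with naive plug-in estimates taken from a single kernel matrix. This is only a sketch: replacing each expectation by an average over off-diagonal entries is an assumption made here, and the library's actual estimator may differ.

```python
def null_variance_plugin(K):
    """Plug-in estimate of 2 * (kappa_2 - 2 * kappa_1 + kappa_0).

    K is a square kernel matrix (list of lists). Expectations over (X, X')
    are replaced by averages over off-diagonal entries, which is an
    illustrative choice and not necessarily what the library does.
    """
    n = len(K)
    if n < 2:
        raise ValueError("need at least a 2 x 2 kernel matrix")
    # Row-wise estimate of E_{X'} k(X, X') for each X, excluding the diagonal.
    row_means = [sum(K[i][j] for j in range(n) if j != i) / (n - 1)
                 for i in range(n)]
    kappa_0 = (sum(row_means) / n) ** 2
    kappa_1 = sum(r * r for r in row_means) / n
    kappa_2 = sum(K[i][j] ** 2
                  for i in range(n) for j in range(n) if j != i) / (n * (n - 1))
    return 2.0 * (kappa_2 - 2.0 * kappa_1 + kappa_0)
```

A constant kernel matrix gives \(\kappa_0=\kappa_1=\kappa_2\), so the estimate is zero, as expected for a kernel that carries no information.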

Along with the statistic comes a method to compute a p-value, for which several approaches are available, including a permutation test. If unsure which one to use, permutation sampling with 250 iterations is always a valid choice (but slow).

MMD2_SPECTRUM_DEPRECATED: For a fast, consistent test based on the spectrum of the kernel matrix, as described in [2]. Only supported if Eigen3 is installed.

MMD2_SPECTRUM: Similar to the deprecated version except it estimates the statistic under null as \(\frac{n_xn_y}{n_x+n_y}\hat{\eta}_{k,U}\rightarrow \sum_r\lambda_r(Z_r^2-1)\) instead (see method description for more details).

MMD2_GAMMA: For a very fast, but not consistent test based on moment matching of a Gamma distribution, as described in [2].

PERMUTATION: For permuting the available samples to sample the null distribution.
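The limit \(\sum_r\lambda_r(Z_r^2-1)\) quoted for MMD2_SPECTRUM can be sampled directly once eigenvalue estimates are available. The sketch below assumes the \(\lambda_r\) have already been computed elsewhere and that the \(Z_r\) are independent standard normal variables; the function name and signature are hypothetical.

```python
import random

def sample_null_spectrum(eigenvalues, num_samples, num_eigenvalues=-1, seed=None):
    """Draw samples of sum_r lambda_r * (Z_r^2 - 1) with Z_r ~ N(0, 1).

    num_eigenvalues=-1 uses all eigenvalues; otherwise only the largest
    num_eigenvalues are kept. The eigenvalues are assumed to be given.
    """
    rng = random.Random(seed)
    lams = sorted(eigenvalues, reverse=True)
    if num_eigenvalues != -1:
        lams = lams[:num_eigenvalues]
    samples = []
    for _ in range(num_samples):
        draw = sum(lam * (rng.gauss(0.0, 1.0) ** 2 - 1.0) for lam in lams)
        samples.append(draw)
    return samples
```

With nonnegative eigenvalues every sample is bounded below by \(-\sum_r\lambda_r\), since \(Z_r^2\geq 0\).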

If you do not have access to the data itself, but want to compute the MMD from a precomputed kernel matrix, just use the custom kernel constructor. Everything else will work as usual.

Creates a clone of the current object. This is done by recursively traversing all parameters, which corresponds to a deep copy. Calling equals on the cloned object always returns true, even though the two objects share no memory.

Returns

an identical copy of the given object, which is disjoint in memory, or NULL if the clone fails. Note that the returned object is SG_REF'ed.

Approximates the null distribution by a two-parameter gamma distribution. It runs in O(m^2), where m is the number of samples from each distribution. It is very fast, but may be inaccurate. However, there are cases where it performs very well. Returns the parameters of the fitted gamma distribution.

Note that when used for constructing a test, the provided statistic HAS to be the biased version (see paper for details). To use it, set BIASED_DEPRECATED as the statistic type. Note that m*Null-distribution is fitted, which is fine since the statistic is also m*MMD.
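Moment matching itself is straightforward once a mean and a variance for the scaled null distribution are available. The sketch below shows only that generic step for a gamma distribution in shape/scale form; how the mean and variance are estimated from the kernel matrix is described in the paper and is not reproduced here. The function name is hypothetical.

```python
def fit_gamma_by_moments(mean, variance):
    """Return (shape, scale) of the gamma distribution with the given moments.

    A gamma distribution with shape a and scale b has mean a * b and
    variance a * b ** 2, so a = mean ** 2 / variance and b = variance / mean.
    """
    if mean <= 0.0 or variance <= 0.0:
        raise ValueError("mean and variance must be positive")
    shape = mean * mean / variance
    scale = variance / mean
    return shape, scale
```

For example, a mean of 2 and a variance of 4 give shape 1 and scale 2, which is an exponential distribution with mean 2.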

Performs the complete two-sample test on current data and returns a p-value.

This is a wrapper that calls compute_statistic first and then calls compute_p_value using the obtained statistic. In some statistic classes, it might be possible to compute the statistic and the p-value in a single run, which is more efficient. Therefore, this method might be overridden in subclasses.

Merges both sets of samples and computes the test statistic m_num_null_samples times. This version checks whether a precomputed custom kernel is used and, if so, just permutes it instead of recomputing it in every iteration.
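Permuting a precomputed kernel matrix amounts to shuffling the sample indices and reading the same entries in a new order, so no kernel evaluation is repeated. The sketch below illustrates this with a biased MMD^2 estimate; the choice of statistic, the function names, and the p-value rule (fraction of null samples at least as large as the observed statistic) are assumptions for illustration.

```python
import random

def biased_mmd2_from_kernel(K, idx, m):
    """Biased MMD^2 from a joint kernel matrix K.

    idx is an ordering of all sample indices; the first m entries are
    treated as the sample from p, the remaining ones as the sample from q.
    """
    xs, ys = idx[:m], idx[m:]
    n = len(ys)
    k_xx = sum(K[i][j] for i in xs for j in xs) / (m * m)
    k_yy = sum(K[i][j] for i in ys for j in ys) / (n * n)
    k_xy = sum(K[i][j] for i in xs for j in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy

def permutation_null_samples(K, m, num_null_samples, seed=None):
    """Sample the null distribution by shuffling indices into K."""
    rng = random.Random(seed)
    idx = list(range(len(K)))
    samples = []
    for _ in range(num_null_samples):
        rng.shuffle(idx)  # only the index order changes; K is never recomputed
        samples.append(biased_mmd2_from_kernel(K, idx, m))
    return samples

def p_value(statistic, null_samples):
    """Fraction of null samples that are at least as large as the statistic."""
    return sum(1 for s in null_samples if s >= statistic) / len(null_samples)
```

A short usage example with a linear kernel on four scalar points:

```python
points = [0.0, 0.0, 1.0, 1.0]
K = [[a * b for b in points] for a in points]
observed = biased_mmd2_from_kernel(K, [0, 1, 2, 3], 2)
null = permutation_null_samples(K, 2, 50, seed=0)
p = p_value(observed, null)
```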

where \(t=m+n\), \(\lim_{m,n\rightarrow\infty}m/t\rightarrow \rho_x\) and \(\rho_y\) likewise (equation 10 from [1]) and \(\lambda_l\) are estimated as \(\frac{\nu_l}{(m+n)}\), where \(\nu_l\) are the eigenvalues of centered kernel matrix HKH.

Note that (m+n)*Null-distribution is returned, which is fine since the statistic is also (m+n)*MMD, except when m and n are equal, in which case m*MMD^2 is returned.

Works well if the kernel matrix is NOT diagonally dominant. See Gretton, A., Fukumizu, K., & Harchaoui, Z. (2011). A fast, consistent kernel two-sample test.
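The centered kernel matrix HKH mentioned above, with \(H=I-\frac{1}{t}\mathbf{1}\mathbf{1}^\top\), can be formed without any matrix products by subtracting row and column means and adding back the grand mean. This is a minimal sketch on plain nested lists; the function name is hypothetical, and the eigenvalue computation itself is left to a linear algebra library.

```python
def center_kernel_matrix(K):
    """Return HKH for a square matrix K, where H = I - (1/t) * ones * ones^T.

    Entry (i, j) of HKH equals
    K[i][j] - row_mean[i] - col_mean[j] + grand_mean.
    """
    t = len(K)
    row_means = [sum(row) / t for row in K]
    col_means = [sum(K[i][j] for i in range(t)) / t for j in range(t)]
    grand_mean = sum(row_means) / t
    return [[K[i][j] - row_means[i] - col_means[j] + grand_mean
             for j in range(t)]
            for i in range(t)]
```

Every row and every column of the result sums to zero, so the constant vector is an eigenvector with eigenvalue zero, consistent with at most m+n-1 eigenvalues being useful.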

Parameters

num_samples

number of samples to draw

num_eigenvalues

number of eigenvalues to use to draw samples. The maximum is m+n-1, where m and n are the sizes of the samples from p and q respectively. It is usually safe to use a smaller number since the eigenvalues decay very fast; however, a conservative approach is to use all of them (-1 does this). See paper for details.