K-Means falls in the general category of clustering algorithms. Clustering is a form of unsupervised learning that tries to find structures in the data without using any labels or target values. Clustering partitions a set of observations into separate groupings such that an observation in a given group is more similar to another observation in the same group than to another observation in a different group.

fold_assignment: (Applicable only if a value for nfolds is specified and fold_column is not specified) Specify the cross-validation fold assignment scheme. The available options are AUTO (which is Random), Random, Modulo, or Stratified (which will stratify the folds based on the response variable for classification problems).

fold_column: Specify the column that contains the cross-validation fold index assignment per observation.

ignored_columns: (Optional, Python and Flow only) Specify the column or columns to be excluded from the model. In Flow, click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the All button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the None button. To search for a specific column, type the column name in the Search field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the Only show columns with more than 0% missing values field. To change the selections for the hidden columns, use the Select Visible or Deselect Visible buttons.

ignore_const_cols: (Optional) Specify whether to ignore constant training columns, since no information can be gained from them. This option is enabled by default.

score_each_iteration: (Optional) Specify whether to score during each iteration of the model training.

k: Specify the number of clusters (groups of data) in a dataset that are similar to one another.

estimate_k: Specify whether to estimate the number of clusters (<=k) iteratively (independent of the seed) and deterministically (beginning with k=1,2,3...). If enabled, the estimate for each candidate value of k runs for up to max_iterations iterations. This option is disabled by default.

user_points: Specify a dataframe, where each row represents an initial cluster center.

max_iterations: Specify the maximum number of training iterations. The range is 0 to 1e6.

standardize: Enable this option to standardize the numeric columns to have a mean of zero and unit variance. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is enabled by default.

Note: If standardization is enabled, each column of numeric data is centered and scaled so that its mean is zero and its standard deviation is one before the algorithm is used. At the end of the process, the cluster centers are reported on both the standardized scale (centers_std) and the de-standardized scale (centers). To de-standardize the centers, the algorithm multiplies by the original standard deviation of the corresponding column and adds the original mean. Enabling standardization is mathematically equivalent to using h2o.scale in R with center = TRUE and scale = TRUE on the numeric columns. Therefore, there will be no discernible difference if standardization is enabled or not for K-Means, since H2O calculates unstandardized centroids.
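The round trip described in the note can be illustrated with a minimal sketch. It is not H2O's implementation: the function names and the sample data are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
from statistics import mean, stdev

def standardize(column):
    """Center a numeric column to mean 0 and scale it to standard deviation 1.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    mu, sigma = mean(column), stdev(column)
    return [(v - mu) / sigma for v in column], mu, sigma

def destandardize(value, mu, sigma):
    """Map a standardized coordinate back to the original scale:
    multiply by the original standard deviation, then add the original mean."""
    return value * sigma + mu

# Hypothetical numeric column
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled, mu, sigma = standardize(heights_cm)

# A hypothetical cluster center on the standardized scale:
# the mean of the last two standardized points.
center_std = (scaled[3] + scaled[4]) / 2

# The same center expressed on the original scale.
center = destandardize(center_std, mu, sigma)
```

In this sketch, center_std plays the role of an entry in centers_std and center the role of the matching entry in centers.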

seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.

init: Specify the initialization mode. The options are Random, Furthest, PlusPlus, or User.

Random initialization randomly samples k rows of the training data as cluster centers.

PlusPlus initialization chooses one initial center at random and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen.

Furthest initialization chooses one initial center at random and then chooses the next center to be the point furthest away in terms of Euclidean distance.

User initialization requires the corresponding user_points parameter. Note that the user-specified points dataset must have the same number of columns as the training dataset.

Note: If PlusPlus is specified, the initial Y matrix is chosen by the final cluster centers from the K-Means PlusPlus algorithm.

max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 to disable.

categorical_encoding: Specify one of the following encoding schemes for handling categorical features:

auto or AUTO: Allow the algorithm to decide (default). In K-Means, the algorithm will automatically perform enum encoding.

enum or Enum: 1 column per categorical feature

one_hot_explicit: N+1 new columns for categorical features with N levels

Model Summary (number of clusters, number of categorical columns, number of iterations, total within sum of squares, total sum of squares, total between sum of squares). Note that Flow also returns the number of rows.

Scoring history (duration, number of iterations, number of reassigned observations, within cluster sum of squares)
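The three sum-of-squares figures in the model summary are related: the total sum of squares equals the within-cluster part plus the between-cluster part. The following is a minimal sketch of that decomposition using the standard definitions, with hypothetical helper names and toy data. It is not taken from H2O's code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Column-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sum_of_squares(points, labels):
    """Return (total, within, between) sums of squares for a clustering."""
    grand = centroid(points)
    total_ss = sum(sq_dist(p, grand) for p in points)

    # Group the points by their cluster label
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    within_ss = 0.0
    between_ss = 0.0
    for members in clusters.values():
        c = centroid(members)
        # Spread of the members around their own cluster center
        within_ss += sum(sq_dist(p, c) for p in members)
        # Spread of the cluster center around the grand mean, weighted by size
        between_ss += len(members) * sq_dist(c, grand)
    return total_ss, within_ss, between_ss

# Toy data: two well-separated clusters
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
total_ss, within_ss, between_ss = sum_of_squares(points, labels)
```

A small within value relative to the total suggests compact, well-separated clusters.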

K-Means randomly chooses starting points and converges to a local minimum of centroids. The number of clusters is arbitrary and should be thought of as a tuning parameter. The output is a matrix of the cluster assignments and the coordinates of the cluster centers in terms of the originally chosen attributes. Your cluster centers may differ slightly from run to run: finding the globally optimal clustering is Non-deterministic Polynomial-time (NP)-hard, so the algorithm settles for a local minimum that depends on the starting points.

How does the algorithm handle missing values during training?

Missing values are automatically imputed by the column mean. K-Means
also handles missing values by assuming that missing feature distance
contributions are equal to the average of all other distance term
contributions.
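One way to read the second sentence is sketched below: sum the squared differences over the observed features only, then scale the sum up as if each missing feature contributed the average observed term. This is an interpretation under stated assumptions, not H2O's code. The function name, the use of None for a missing value, and the all-missing fallback are hypothetical.

```python
def sq_dist_with_missing(row, center):
    """Squared Euclidean distance where missing entries (None) are assumed
    to contribute the average of the observed distance terms."""
    observed = [(x - c) ** 2 for x, c in zip(row, center) if x is not None]
    if not observed:
        # Assumption: a row with no observed features carries no distance information
        return 0.0
    # sum(observed) / len(observed) is the average term;
    # multiplying by len(row) fills in the missing terms with that average.
    return sum(observed) * len(row) / len(observed)

center = [0.0, 5.0, 1.0]
partial = sq_dist_with_missing([1.0, None, 3.0], center)   # one feature missing
complete = sq_dist_with_missing([1.0, 5.0, 3.0], center)   # nothing missing
```

With no missing entries the function reduces to the ordinary squared Euclidean distance.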

How does the algorithm handle missing values during testing?

Missing values are automatically imputed by the column mean of the
training data.

What happens when you try to predict on a categorical level not
seen during training?

An unseen categorical level in a row does not contribute to that row’s
prediction. This is because the unseen categorical level does not
contribute to the distance comparison between clusters, and therefore
does not factor in predicting the cluster to which that row belongs.

Does it matter if the data is sorted?

No.

Should data be shuffled before training?

No.

What if there are a large number of columns?

K-Means suffers from the curse of dimensionality: all points are roughly
at the same distance from each other in high dimensions, making the
algorithm less and less useful.

What if there are a large number of categorical factor levels?

This can be problematic, as categoricals are one-hot encoded on the fly,
which can lead to the same problem as datasets with a large number of
columns.

The number of clusters \(K\) is user-defined and is determined a priori.

Choose \(K\) initial cluster centers \(m_{k}\) according to one of the
following:

Random: Choose \(K\) cluster centers from the set of \(N\) observations at random so that each observation has an equal chance of being chosen.

Furthest (Default):

Choose one center \(m_{1}\) at random.

Calculate the difference between \(m_{1}\) and each of the remaining \(N-1\) observations \(x_{i}\). \(d(x_{i}, m_{1}) = \|(x_{i}-m_{1})\|^2\)

Choose \(m_{2}\) to be the \(x_{i}\) that maximizes \(d(x_{i}, m_{1})\).

Repeat until \(K\) centers have been chosen.
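The Furthest steps above can be sketched as follows. The text does not spell out how the third and later centers are scored, so this sketch assumes each candidate is scored by its distance to the nearest center chosen so far. The function names and sample points are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def furthest_init(points, k, seed=None):
    """Pick k initial centers: the first at random, each later one as far as
    possible from the centers already chosen (assumed tie-in for 'Repeat')."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        # Score each point by its distance to the nearest chosen center,
        # then take the point with the largest score.
        next_center = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(next_center)
    return centers

points = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
centers = furthest_init(points, k=2, seed=7)
```

Because the choice after the first center is deterministic, outliers are very likely to be picked as centers, which is worth keeping in mind on noisy data.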

PlusPlus:

Choose one center \(m_{1}\) at random.

Calculate the difference between \(m_{1}\) and each of the remaining \(N-1\) observations \(x_{i}\). \(d(x_{i}, m_{1}) = \|(x_{i}-m_{1})\|^2\)

Let \(P(i)\) be the probability of choosing \(x_{i}\) as \(m_{2}\). Weight \(P(i)\) by \(d(x_{i}, m_{1})\) so that those \(x_{i}\) furthest from \(m_{1}\) have a higher probability of being selected than those \(x_{i}\) close to \(m_{1}\).

Choose the next center \(m_{2}\) by drawing at random according to the weighted probability distribution.

Repeat until \(K\) centers have been chosen.
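The PlusPlus steps above can be sketched in the same way. For the third and later centers, this sketch assumes the weight of each point is its squared distance to the nearest center chosen so far, as in the usual k-means++ formulation. The function names and sample points are hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def plusplus_init(points, k, seed=None):
    """Pick k initial centers: the first uniformly at random, each later one
    drawn with probability proportional to its squared distance from the
    nearest center already chosen."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        # Points that are already centers get weight 0 and cannot be redrawn.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [20.0, 0.0]]
centers = plusplus_init(points, k=3, seed=42)
```

Unlike Furthest, distant points are only more likely to be chosen, not guaranteed, so a single outlier does not always become a center.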

User initialization allows you to specify a file (using the user_points parameter) that includes a vector of initial cluster centers.

Once \(K\) initial centers have been chosen, calculate the difference
between each observation \(x_{i}\) and each of the centers
\(m_{1},...,m_{K}\), where difference is the squared Euclidean
distance taken over \(p\) parameters.
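As a minimal sketch of this distance calculation, the code below builds the table of squared Euclidean distances between every observation and every center. Looking up the nearest center for each observation is included as the usual next step of K-Means and is an assumption here, since the text above stops at the distance calculation. All names and sample values are hypothetical.

```python
def sq_dist(x, m):
    """Squared Euclidean distance taken over the p coordinates of x and m."""
    return sum((xj - mj) ** 2 for xj, mj in zip(x, m))

def distance_table(observations, centers):
    """Row i, column k holds d(x_i, m_k)."""
    return [[sq_dist(x, m) for m in centers] for x in observations]

def nearest_center(distances_row):
    """Index of the smallest distance in one row of the table."""
    return min(range(len(distances_row)), key=distances_row.__getitem__)

observations = [[1.0, 1.0], [9.0, 8.0], [2.0, 0.0]]
centers = [[0.0, 0.0], [10.0, 10.0]]

table = distance_table(observations, centers)
# Assumed next step: assign each observation to its closest center
assignments = [nearest_center(row) for row in table]
```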