It is a memory-efficient, online-learning algorithm provided as an
alternative to MiniBatchKMeans. It constructs a tree
data structure and reads the cluster centroids off the leaves.
These can either be used as the final cluster centroids or be provided as
input to another clustering algorithm such as AgglomerativeClustering.
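
A minimal sketch of that two-stage use, assuming scikit-learn and NumPy are installed. The synthetic data and the variable names are illustrative only and are not part of the original documentation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch

# Synthetic data: three well-separated 2-D groups (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# n_clusters=None skips the final clustering step, so the centroids read
# off the leaves are exposed unchanged in subcluster_centers_.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=None)
birch.fit(X)
centroids = birch.subcluster_centers_

# Hand the (much smaller) set of centroids to another clustering algorithm.
agglo = AgglomerativeClustering(n_clusters=3)
centroid_labels = agglo.fit_predict(centroids)
```

Here the tree acts as a data-reduction step: the second algorithm only sees one row per leaf subcluster rather than one row per sample.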

threshold:float, default 0.5

The radius of the subcluster obtained by merging a new sample and the
closest subcluster should be less than the threshold. Otherwise a new
subcluster is started. Setting this value to be very low promotes
splitting and vice versa.
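
The effect of the threshold can be checked empirically. The sketch below, which assumes scikit-learn and NumPy and uses made-up random data, counts the leaf subclusters produced by a low and a high threshold; `count_subclusters` is a helper introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # illustrative data only


def count_subclusters(threshold):
    # n_clusters=None keeps the raw leaf subclusters so they can be counted.
    model = Birch(threshold=threshold, n_clusters=None).fit(X)
    return len(model.subcluster_centers_)


n_tight = count_subclusters(0.1)  # low threshold: few merges are accepted
n_loose = count_subclusters(2.0)  # high threshold: most merges are accepted
print(n_tight, n_loose)
```

On data like this the low threshold should leave far more subclusters than the high one, since fewer merges pass the radius test.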

branching_factor:int, default 50

Maximum number of CF subclusters in each node. If a new sample enters
such that the number of subclusters exceeds the branching_factor, then
that node is split into two nodes with the subclusters redistributed
between them. The parent subcluster of that node is removed and two new
subclusters are added as parents of the two split nodes.

n_clusters:int, instance of sklearn.cluster model, default 3

Number of clusters after the final clustering step, which treats the
subclusters from the leaves as new samples.

None : the final clustering step is not performed and the
subclusters are returned as they are.

sklearn.cluster Estimator : If a model is provided, the model is
fit treating the subclusters as new samples and the initial data is
mapped to the label of the closest subcluster.
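
The accepted forms of n_clusters can be compared side by side. This is a sketch assuming scikit-learn and NumPy; the data is synthetic, and KMeans is just one possible choice of sklearn.cluster estimator.

```python
import numpy as np
from sklearn.cluster import Birch, KMeans

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# None: no final clustering step; every leaf subcluster keeps its own label.
raw = Birch(n_clusters=None).fit(X)

# int: the leaf subclusters are grouped into that many final clusters.
three = Birch(n_clusters=3).fit(X)

# Estimator: the supplied model is fit on the subcluster centroids, and each
# sample takes the label of its closest subcluster.
km = Birch(n_clusters=KMeans(n_clusters=3, n_init=10, random_state=0)).fit(X)

print(len(raw.subcluster_centers_), np.unique(three.labels_), np.unique(km.labels_))
```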

copy:bool, default True

Whether or not to make a copy of the given data. If set to False,
the initial data will be overwritten.

Attributes:

root_:_CFNode

Root of the CFTree.

dummy_leaf_:_CFNode

Start pointer to all the leaves.

subcluster_centers_:ndarray

Centroids of all subclusters read directly from the leaves.

subcluster_labels_:ndarray

Labels assigned to the centroids of the subclusters after
they are clustered globally.

labels_:ndarray, shape (n_samples,)

Array of labels assigned to the input data.
If partial_fit is used instead of fit, the labels are those assigned
to the last batch of data.
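
A short sketch of the partial_fit behaviour described for labels_, assuming scikit-learn and NumPy; the two batches are made-up data. To label everything seen so far, one option is to call predict on the full data afterwards.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
batch_a = rng.normal(loc=0.0, size=(200, 2))  # illustrative first batch
batch_b = rng.normal(loc=8.0, size=(50, 2))   # illustrative second batch

model = Birch(n_clusters=None)
model.partial_fit(batch_a)
model.partial_fit(batch_b)

# labels_ covers only the most recent batch (50 rows here).
last_labels = model.labels_

# Labels for all samples can be recovered with predict.
all_labels = model.predict(np.vstack([batch_a, batch_b]))
```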

Notes

The tree data structure consists of nodes with each node consisting of
a number of subclusters. The maximum number of subclusters in a node
is determined by the branching factor. Each subcluster maintains a
linear sum, squared sum and the number of samples in that subcluster.
In addition, each subcluster can also have a node as its child, if the
subcluster is not a member of a leaf node.

A new point entering the root is merged with the subcluster closest
to it, and the linear sum, squared sum and number of samples of that
subcluster are updated. This is done recursively until the properties of
the leaf node are updated.
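
The bookkeeping described above can be sketched in plain Python. The class `CFSubcluster` and the function `absorb` are hypothetical names introduced for illustration and are not the library's internal implementation; the sketch covers only a single subcluster and the radius test, not the recursive descent through the tree. It assumes the radius is the root-mean-square distance of the samples from the centroid, which can be computed from the three stored statistics alone.

```python
import math
from dataclasses import dataclass, field


@dataclass
class CFSubcluster:
    """Hypothetical clustering-feature record: count, linear sum, squared sum."""

    n_samples: int = 0
    linear_sum: list = field(default_factory=list)
    squared_sum: float = 0.0

    @classmethod
    def from_point(cls, point):
        return cls(1, list(point), sum(x * x for x in point))

    def merged_with(self, other):
        # The three statistics are additive, so merging is plain addition.
        return CFSubcluster(
            self.n_samples + other.n_samples,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.squared_sum + other.squared_sum,
        )

    @property
    def centroid(self):
        return [s / self.n_samples for s in self.linear_sum]

    @property
    def radius(self):
        # sqrt(mean squared norm - squared norm of the centroid)
        sq_norm = sum(c * c for c in self.centroid)
        return math.sqrt(max(self.squared_sum / self.n_samples - sq_norm, 0.0))


def absorb(subcluster, point, threshold):
    """Return the merged subcluster, or None if the radius test fails."""
    candidate = subcluster.merged_with(CFSubcluster.from_point(point))
    return candidate if candidate.radius < threshold else None


sc = CFSubcluster.from_point([0.0, 0.0])
sc = absorb(sc, [2.0, 0.0], threshold=1.5)         # merged radius is 1.0
rejected = absorb(sc, [10.0, 0.0], threshold=1.5)  # merged radius exceeds 1.5
```

Because only sums are stored, a merge never needs to revisit the original samples, which is what keeps the structure memory-efficient.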

The method works on simple estimators as well as on nested objects
(such as pipelines). The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each
component of a nested object.
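
For instance, the nested form might be used as follows. This sketch assumes scikit-learn is installed; the step names "scale" and "birch" are arbitrary labels chosen for the example.

```python
from sklearn.cluster import Birch
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simple estimator: parameters are addressed by name.
plain = Birch().set_params(branching_factor=20)

# Nested object: <component>__<parameter> reaches into the named step.
pipe = Pipeline([("scale", StandardScaler()), ("birch", Birch())])
pipe.set_params(birch__threshold=0.3, birch__n_clusters=None)

print(plain.branching_factor, pipe.named_steps["birch"].threshold)
```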