DIANA: Bisecting Kmeans¶

toyml.clustering.bisect_kmeans.BisectingKmeans `dataclass` ¶

BisectingKmeans(k: int, cluster_tree_: ClusterTree = ClusterTree(), labels_: list[int] = list())

Bisecting K-means algorithm.

Belong to Divisive hierarchical clustering (DIANA) algorithm.(top-down).

Examples:

>>> from toyml.clustering import BisectingKmeans
>>> dataset = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [10.0, 1.0], [10.0, 2.0], [11.0, 1.0]]
>>> bisect_kmeans = BisectingKmeans(k=3)
>>> labels = bisect_kmeans.fit_predict(dataset)
>>> clusters = bisect_kmeans.cluster_tree_.get_clusters()

References

Harrington
Tan

k `instance-attribute` ¶

k: int

The number of clusters, specified by user.

cluster_tree_ `class-attribute` `instance-attribute` ¶

cluster_tree_: ClusterTree = field(default_factory=ClusterTree)

The cluster tree

labels_ `class-attribute` `instance-attribute` ¶

labels_: list[int] = field(default_factory=list)

The cluster labels of the dataset.

fit ¶

fit(dataset: list[list[float]]) -> BisectingKmeans

Fit the dataset with Bisecting K-means algorithm.

PARAMETER	DESCRIPTION
`dataset`	The set of data points for clustering. TYPE: `list[list[float]]`

RETURNS	DESCRIPTION
`BisectingKmeans`	self.

Source code in toyml/clustering/bisect_kmeans.py

def fit(self, dataset: list[list[float]]) -> BisectingKmeans:
    """Fit the dataset with Bisecting K-means algorithm.

    Args:
        dataset: The set of data points for clustering.

    Returns:
        self.

    """
    n = len(dataset)
    # check dataset
    if self.k > n:
        msg = f"Number of clusters(k) cannot be greater than the number of samples(n), not get {self.k=} > {n=}"
        raise ValueError(
            msg,
        )
    # start with only one cluster which contains all the data points in dataset
    cluster = list(range(n))
    self.cluster_tree_.cluster = cluster
    self.cluster_tree_.centroid = self._get_cluster_centroids(dataset, cluster)
    self.labels_ = self._predict_dataset_labels(dataset)
    total_error = sum_square_error([dataset[i] for i in cluster])
    # iterate until got k clusters
    while len(self.cluster_tree_.get_clusters()) < self.k:
        # init values for later iteration
        to_split_cluster_node = None
        split_cluster_into: tuple[list[int], list[int]] | None = None
        for _, cluster_node in enumerate(self.cluster_tree_.leaf_cluster_nodes()):
            # perform K-means with k=2
            cluster_data = [dataset[i] for i in cluster_node.cluster]
            # If the cluster cannot be split further, skip it
            if len(cluster_data) < 2:
                continue
            # Bisect by kmeans with k=2
            cluster_unsplit_error, cluster_split_error, (cluster1, cluster2) = self._bisect_by_kmeans(
                cluster_data,
                cluster_node,
                dataset,
            )
            new_total_error = total_error - cluster_unsplit_error + cluster_split_error
            if new_total_error < total_error:
                total_error = new_total_error
                split_cluster_into = (cluster1, cluster2)
                to_split_cluster_node = cluster_node

        if to_split_cluster_node is not None and split_cluster_into is not None:
            self._commit_split(to_split_cluster_node, split_cluster_into, dataset)
            self.labels_ = self._predict_dataset_labels(dataset)
        else:
            break

    return self

fit_predict ¶

fit_predict(dataset: list[list[float]]) -> list[int]

Fit and predict the cluster label of the dataset.

PARAMETER	DESCRIPTION
`dataset`	The set of data points for clustering. TYPE: `list[list[float]]`

RETURNS	DESCRIPTION
`list[int]`	Cluster labels of the dataset samples.

Source code in toyml/clustering/bisect_kmeans.py

def fit_predict(self, dataset: list[list[float]]) -> list[int]:
    """Fit and predict the cluster label of the dataset.

    Args:
        dataset: The set of data points for clustering.

    Returns:
        Cluster labels of the dataset samples.
    """
    return self.fit(dataset).labels_

predict ¶

predict(points: list[list[float]]) -> list[int]

Predict the cluster label of the given points.

PARAMETER	DESCRIPTION
`points`	A list of data points to predict. TYPE: `list[list[float]]`

RETURNS	DESCRIPTION
`list[int]`	A list of predicted cluster labels for the input points.

RAISES	DESCRIPTION
`ValueError`	If the model has not been fitted yet.

Source code in toyml/clustering/bisect_kmeans.py

def predict(self, points: list[list[float]]) -> list[int]:
    """Predict the cluster label of the given points.

    Args:
        points: A list of data points to predict.

    Returns:
        A list of predicted cluster labels for the input points.

    Raises:
        ValueError: If the model has not been fitted yet.
    """
    if self.cluster_tree_.centroid is None:
        msg = "The model has not been fitted yet."
        raise ValueError(msg)

    clusters = self.cluster_tree_.get_clusters()
    predictions = []
    for point in points:
        node = self.cluster_tree_
        while not node.is_leaf():
            if node.left is None or node.right is None:
                msg = "Invalid cluster tree structure."
                raise ValueError(msg)

            dist_left = euclidean_distance(point, node.left.centroid)  # type: ignore[arg-type]
            dist_right = euclidean_distance(point, node.right.centroid)  # type: ignore[arg-type]

            node = node.left if dist_left < dist_right else node.right
        cluster_index = clusters.index(node.cluster)
        predictions.append(cluster_index)

    return predictions

DIANA: Bisecting Kmeans¶

toyml.clustering.bisect_kmeans.BisectingKmeans dataclass ¶

k instance-attribute ¶

cluster_tree_ class-attribute instance-attribute ¶

labels_ class-attribute instance-attribute ¶

fit ¶

fit_predict ¶

predict ¶

toyml.clustering.bisect_kmeans.BisectingKmeans `dataclass` ¶

k `instance-attribute` ¶

cluster_tree_ `class-attribute` `instance-attribute` ¶

labels_ `class-attribute` `instance-attribute` ¶