Kmeans

toyml.clustering.kmeans.Kmeans dataclass

Kmeans(
    k: int,
    max_iter: int = 500,
    tol: float = 1e-05,
    centroids_init_method: Literal[
        "random", "kmeans++"
    ] = "random",
    random_seed: Optional[int] = None,
    distance_metric: Literal["euclidean"] = "euclidean",
    iter_: int = 0,
    clusters: dict[int, list[int]] = dict(),
    centroids: dict[int, list[float]] = dict(),
    labels: list[int] = list(),
)

K-means algorithm, with optional k-means++ initialization.

Examples:

>>> from toyml.clustering import Kmeans
>>> dataset = [[1.0, 2.0], [1.0, 4.0], [1.0, 0.0], [10.0, 2.0], [10.0, 4.0], [11.0, 0.0]]
>>> kmeans = Kmeans(k=2, random_seed=42).fit(dataset)
>>> kmeans.clusters
{0: [3, 4, 5], 1: [0, 1, 2]}
>>> kmeans.centroids
{0: [10.333333333333334, 2.0], 1: [1.0, 2.0]}
>>> kmeans.labels
[1, 1, 1, 0, 0, 0]
>>> kmeans.predict([0, 1])
1
>>> kmeans.iter_
2

The fit_predict method fits the model and returns the cluster labels of the dataset in a single call.

Examples:

>>> from toyml.clustering import Kmeans
>>> dataset = [[1, 0], [1, 1], [1, 2], [10, 0], [10, 1], [10, 2]]
>>> Kmeans(k=2, random_seed=42).fit_predict(dataset)
[1, 1, 1, 0, 0, 0]

References

  1. Zhou Zhihua
  2. Murphy

Note

Only the naive K-means algorithm is implemented here.
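
For orientation, the naive algorithm can be sketched in plain Python as follows. This is a minimal illustration and not toyml's code: the function name naive_kmeans, its explicit init_centroids argument, and the "largest centroid shift is at most tol" stopping rule are all assumptions made for the example.

```python
import math


def naive_kmeans(dataset, init_centroids, max_iter=500, tol=1e-5):
    """Hypothetical helper: plain K-means starting from given centroids."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = [0] * len(dataset)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(point, centroids[j]))
            for point in dataset
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(dataset, labels) if lab == j]
            if not members:
                # Keep the old centroid when a cluster ends up empty.
                new_centroids.append(centroids[j])
                continue
            dim = len(members[0])
            new_centroids.append(
                [sum(p[d] for p in members) / len(members) for d in range(dim)]
            )
        # Assumed stopping rule: no centroid moved farther than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


dataset = [[1.0, 2.0], [1.0, 4.0], [1.0, 0.0], [10.0, 2.0], [10.0, 4.0], [11.0, 0.0]]
centroids, labels = naive_kmeans(dataset, init_centroids=[[1.0, 2.0], [10.0, 2.0]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the starting centroids: K-means only finds a local optimum, so a poor start (for example, two initial centroids taken from the same group) can produce a different partition.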

k instance-attribute

k: int

The number of clusters, specified by the user.

max_iter class-attribute instance-attribute

max_iter: int = 500

The maximum number of iterations to run if the algorithm does not converge earlier.

tol class-attribute instance-attribute

tol: float = 1e-05

The tolerance for convergence.
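
The exact convergence test is not shown on this page. One common choice, given here purely as an assumption, is to stop when every centroid has moved by at most tol between two consecutive iterations. The function name is_converged and its two-argument form are hypothetical.

```python
import math


def is_converged(prev_centroids, centroids, tol=1e-5):
    """Hypothetical check: every centroid moved by at most tol."""
    return all(
        math.dist(prev_centroids[label], centroids[label]) <= tol
        for label in centroids
    )


before = {0: [10.0, 2.0], 1: [1.0, 2.0]}
after = {0: [10.333333333333334, 2.0], 1: [1.0, 2.0]}
print(is_converged(before, after))  # False
print(is_converged(after, after))  # True
```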

centroids_init_method class-attribute instance-attribute

centroids_init_method: Literal["random", "kmeans++"] = (
    "random"
)

The method to initialize the centroids.
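
As a rough illustration of what the "kmeans++" option refers to, the sketch below follows the usual k-means++ seeding rule: the first centroid is drawn uniformly, and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The helper name kmeans_pp_init is hypothetical and this is not taken from toyml's source.

```python
import math
import random


def kmeans_pp_init(dataset, k, seed=None):
    """Hypothetical helper: choose k starting centroids the k-means++ way."""
    rng = random.Random(seed)
    # The first centroid is drawn uniformly from the data.
    centroids = [list(rng.choice(dataset))]
    while len(centroids) < k:
        # Squared distance from each point to its closest chosen centroid.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in dataset]
        # Points far from all current centroids are more likely to be picked.
        centroids.append(list(rng.choices(dataset, weights=weights, k=1)[0]))
    return centroids


dataset = [[1.0, 2.0], [1.0, 4.0], [1.0, 0.0], [10.0, 2.0], [10.0, 4.0], [11.0, 0.0]]
init = kmeans_pp_init(dataset, k=2, seed=42)
```

A point that is already a centroid has weight zero and cannot be drawn again. The sketch assumes the dataset contains at least k distinct points; otherwise all weights become zero and random.choices raises a ValueError.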

random_seed class-attribute instance-attribute

random_seed: Optional[int] = None

The random seed used to initialize the centroids.

distance_metric class-attribute instance-attribute

distance_metric: Literal['euclidean'] = 'euclidean'

The distance metric to use. (Only euclidean is supported for now.)

clusters class-attribute instance-attribute

clusters: dict[int, list[int]] = field(default_factory=dict)

The clusters of the dataset: a mapping from each cluster label to the indices of the samples assigned to it.

centroids class-attribute instance-attribute

centroids: dict[int, list[float]] = field(
    default_factory=dict
)

The centroids of the clusters: a mapping from each cluster label to the coordinates of its centroid.

labels class-attribute instance-attribute

labels: list[int] = field(default_factory=list)

The cluster labels of the dataset, one per sample, in dataset order.
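
Judging from the example at the top of this page, clusters and labels carry the same information in two shapes. The hypothetical helper below (not part of toyml) shows how one can be derived from the other.

```python
from collections import defaultdict


def labels_to_clusters(labels):
    """Hypothetical helper: group sample indices by cluster label."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    # Sort by label so the result prints in a stable order.
    return dict(sorted(clusters.items()))


print(labels_to_clusters([1, 1, 1, 0, 0, 0]))  # {0: [3, 4, 5], 1: [0, 1, 2]}
```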

fit

fit(dataset: list[list[float]]) -> 'Kmeans'

Fit the dataset with the K-means algorithm.

PARAMETER DESCRIPTION
dataset

the set of data points for clustering

TYPE: list[list[float]]

RETURNS DESCRIPTION
'Kmeans'

self.

Source code in toyml/clustering/kmeans.py
def fit(self, dataset: list[list[float]]) -> "Kmeans":
    """
    Fit the dataset with K-means algorithm.

    Args:
        dataset: the set of data points for clustering

    Returns:
        self.
    """
    self.centroids = self._get_initial_centroids(dataset)
    for _ in range(self.max_iter):
        self.iter_ += 1
        prev_centroids = self.centroids
        self._iter_step(dataset)
        if self._is_converged(prev_centroids):
            break
    return self

fit_predict

fit_predict(dataset: list[list[float]]) -> list[int]

Fit the model and predict the cluster labels of the dataset.

PARAMETER DESCRIPTION
dataset

the set of data points for clustering

TYPE: list[list[float]]

RETURNS DESCRIPTION
list[int]

Cluster labels of the dataset samples.

Source code in toyml/clustering/kmeans.py
def fit_predict(self, dataset: list[list[float]]) -> list[int]:
    """
    Fit and predict the cluster label of the dataset.

    Args:
        dataset: the set of data points for clustering

    Returns:
        Cluster labels of the dataset samples.
    """
    return self.fit(dataset).labels

predict

predict(point: list[float]) -> int

Predict the label of the point.

PARAMETER DESCRIPTION
point

The data point to predict.

TYPE: list[float]

RETURNS DESCRIPTION
int

The label of the point.

Source code in toyml/clustering/kmeans.py
def predict(self, point: list[float]) -> int:
    """
    Predict the label of the point.

    Args:
        point: The data point to predict.

    Returns:
        The label of the point.

    """
    if len(self.centroids) == 0:
        raise ValueError("The model is not fitted yet")
    return self._get_centroid_label(point, self.centroids)
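
The private helper _get_centroid_label is not shown on this page. Assuming it returns the label of the centroid closest to the point under the euclidean metric, a stand-alone equivalent might look like the sketch below; the function name nearest_centroid_label is hypothetical.

```python
import math


def nearest_centroid_label(point, centroids):
    """Hypothetical stand-in for the private _get_centroid_label helper."""
    if not centroids:
        raise ValueError("The model is not fitted yet")
    # Pick the label whose centroid has the smallest euclidean distance.
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


centroids = {0: [10.333333333333334, 2.0], 1: [1.0, 2.0]}
print(nearest_centroid_label([0, 1], centroids))  # 1
```

With the centroids from the example at the top of the page, the point [0, 1] lies closest to centroid 1, which agrees with the predict output shown there.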