Kmeans
toyml.clustering.kmeans.Kmeans
dataclass
Kmeans(k: int, max_iter: int = 500, tol: float = 1e-05, centroids_init_method: Literal['random', 'kmeans++'] = 'random', random_seed: int | None = None, distance_metric: Literal['euclidean'] = 'euclidean', iter_: int = 0, clusters_: dict[int, list[int]] = dict(), centroids_: dict[int, list[float]] = dict(), labels_: list[int] = list())
K-means algorithm (with k-means++ initialization as an option).
Examples:
>>> from toyml.clustering import Kmeans
>>> dataset = [[1.0, 2.0], [1.0, 4.0], [1.0, 0.0], [10.0, 2.0], [10.0, 4.0], [11.0, 0.0]]
>>> kmeans = Kmeans(k=2, random_seed=42).fit(dataset)
>>> kmeans.clusters_
{0: [3, 4, 5], 1: [0, 1, 2]}
>>> kmeans.centroids_
{0: [10.333333333333334, 2.0], 1: [1.0, 2.0]}
>>> kmeans.labels_
[1, 1, 1, 0, 0, 0]
>>> kmeans.predict([0, 1])
1
>>> kmeans.iter_
2
The fit_predict method fits the model and returns the cluster labels of the dataset in a single call.
Examples:
>>> from toyml.clustering import Kmeans
>>> dataset = [[1, 0], [1, 1], [1, 2], [10, 0], [10, 1], [10, 2]]
>>> Kmeans(k=2, random_seed=42).fit_predict(dataset)
[1, 1, 1, 0, 0, 0]
References
- Zhou Zhihua
- Murphy
Note
This class implements only the naive K-means algorithm.
See Also
- Bisecting K-means algorithm: toyml.clustering.bisect_kmeans
max_iter
class-attribute
instance-attribute
max_iter: int = 500
The maximum number of iterations to run if the algorithm does not converge earlier.
centroids_init_method
class-attribute
instance-attribute
centroids_init_method: Literal['random', 'kmeans++'] = 'random'
The method used to initialize the centroids: 'random' or 'kmeans++'.
random_seed
class-attribute
instance-attribute
random_seed: int | None = None
The random seed used to initialize the centroids.
distance_metric
class-attribute
instance-attribute
distance_metric: Literal['euclidean'] = 'euclidean'
The distance metric to use. (Only euclidean is currently supported.)
clusters_
class-attribute
instance-attribute
The clusters of the dataset, mapping each cluster label to the indices of the samples assigned to it.
centroids_
class-attribute
instance-attribute
The centroids of the clusters, mapping each cluster label to its centroid coordinates.
labels_
class-attribute
instance-attribute
The cluster label of each sample in the dataset, in dataset order.
fit
Fit the dataset with the K-means algorithm.
PARAMETER | DESCRIPTION |
---|---|
dataset | The set of data points for clustering. |

RETURNS | DESCRIPTION |
---|---|
Kmeans | self. |
Source code in toyml/clustering/kmeans.py
_iter_step
Can be used to control the fitting process step by step.
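As a rough illustration of what a single step of naive K-means involves, the sketch below performs one assignment pass followed by one centroid update. The function name `kmeans_step` and its signature are hypothetical and not part of toyml; leaving an empty cluster's centroid unchanged is an assumption of this sketch, not documented library behavior.

```python
import math


def kmeans_step(dataset, centroids):
    """One assignment + update pass of naive K-means (hypothetical helper)."""
    dim = len(dataset[0])

    # Assignment: put each sample index into the cluster of its nearest centroid.
    clusters = {label: [] for label in centroids}
    for i, point in enumerate(dataset):
        nearest = min(centroids, key=lambda label: math.dist(point, centroids[label]))
        clusters[nearest].append(i)

    # Update: move each centroid to the mean of its assigned points.
    new_centroids = {}
    for label, indices in clusters.items():
        if not indices:
            # Assumption of this sketch: an empty cluster keeps its old centroid.
            new_centroids[label] = list(centroids[label])
            continue
        new_centroids[label] = [
            sum(dataset[i][d] for i in indices) / len(indices) for d in range(dim)
        ]
    return clusters, new_centroids


dataset = [[1.0, 2.0], [1.0, 4.0], [1.0, 0.0], [10.0, 2.0], [10.0, 4.0], [11.0, 0.0]]
clusters, centroids = kmeans_step(dataset, {0: [10.0, 2.0], 1: [1.0, 2.0]})
print(clusters)   # {0: [3, 4, 5], 1: [0, 1, 2]}
print(centroids)
```

Calling such a step repeatedly, and stopping once the centroids stop moving or max_iter is reached, gives the usual fitting loop.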
Source code in toyml/clustering/kmeans.py
fit_predict
Fit the model and predict the cluster labels of the dataset.
PARAMETER | DESCRIPTION |
---|---|
dataset | The set of data points for clustering. |

RETURNS | DESCRIPTION |
---|---|
list[int] | Cluster labels of the dataset samples. |
Source code in toyml/clustering/kmeans.py
predict
Predict the label of the point.
PARAMETER | DESCRIPTION |
---|---|
point | The data point to predict. |

RETURNS | DESCRIPTION |
---|---|
int | The label of the point. |
Source code in toyml/clustering/kmeans.py
_get_initial_centroids
Get initial centroids using the method selected by centroids_init_method.
Source code in toyml/clustering/kmeans.py
_is_converged
Check if the centroids converged.
PARAMETER | DESCRIPTION |
---|---|
prev_centroids | The previous centroids. |

RETURNS | DESCRIPTION |
---|---|
bool | Whether the centroids converged. |
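One common way to express such a check is to compare how far each centroid moved between two iterations against a tolerance. The sketch below assumes that criterion (Euclidean shift of every centroid at most `tol`); the exact rule toyml applies is not stated here and may differ, and the name `is_converged` is hypothetical.

```python
import math


def is_converged(prev_centroids, centroids, tol=1e-5):
    """Return True when no centroid moved farther than `tol` (hypothetical helper)."""
    return all(
        math.dist(prev_centroids[label], centroids[label]) <= tol
        for label in centroids
    )


prev = {0: [10.0, 2.0], 1: [1.0, 2.0]}
print(is_converged(prev, {0: [10.0, 2.0], 1: [1.0, 2.0]}))  # True: nothing moved
print(is_converged(prev, {0: [10.5, 2.0], 1: [1.0, 2.0]}))  # False: centroid 0 moved 0.5
```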
Source code in toyml/clustering/kmeans.py
_get_initial_centroids_random
Get initial centroids by a simple random selection.
PARAMETER | DESCRIPTION |
---|---|
dataset | The dataset for clustering. |

RETURNS | DESCRIPTION |
---|---|
dict[int, list[float]] | The initial centroids. |
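A minimal sketch of random initialization, assuming it means drawing k distinct data points as the starting centroids. The helper name `init_centroids_random` and the use of a seeded `random.Random` instance are illustrative choices, not toyml's implementation.

```python
import random


def init_centroids_random(dataset, k, seed=None):
    """Pick k distinct data points as initial centroids (hypothetical helper)."""
    rng = random.Random(seed)
    chosen = rng.sample(dataset, k)  # sampling without replacement
    return {label: list(point) for label, point in enumerate(chosen)}


dataset = [[1.0, 2.0], [1.0, 4.0], [1.0, 0.0], [10.0, 2.0], [10.0, 4.0], [11.0, 0.0]]
initial = init_centroids_random(dataset, k=2, seed=42)
print(initial)
```

With a fixed seed the same points are chosen on every run, which is what makes seeded examples reproducible.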
Source code in toyml/clustering/kmeans.py
_get_initial_centroids_kmeans_plus
Get initial centroids by the k-means++ algorithm.
PARAMETER | DESCRIPTION |
---|---|
dataset | The dataset for clustering. |

RETURNS | DESCRIPTION |
---|---|
dict[int, list[float]] | The initial centroids. |
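The k-means++ seeding idea can be sketched as follows: the first centroid is drawn uniformly at random, and each later one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The helper name `init_centroids_kmeans_pp` is hypothetical, and the sketch assumes the dataset contains at least k distinct points (otherwise all weights can become zero and `random.choices` raises an error).

```python
import math
import random


def init_centroids_kmeans_pp(dataset, k, seed=None):
    """k-means++ style seeding (hypothetical helper, not toyml's code)."""
    rng = random.Random(seed)
    # First centroid: a uniformly random data point.
    centroids = [list(rng.choice(dataset))]
    while len(centroids) < k:
        # Squared distance from each point to its nearest already-chosen centroid.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in dataset]
        # Draw the next centroid with probability proportional to that distance.
        next_point = rng.choices(dataset, weights=weights, k=1)[0]
        centroids.append(list(next_point))
    return dict(enumerate(centroids))


dataset = [[1.0, 2.0], [1.0, 4.0], [1.0, 0.0], [10.0, 2.0], [10.0, 4.0], [11.0, 0.0]]
seeds = init_centroids_kmeans_pp(dataset, k=2, seed=42)
print(seeds)
```

Points that are already centroids get weight zero, so far-away points are favored and the initial centroids tend to be spread out.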
Source code in toyml/clustering/kmeans.py
_get_min_square_distance
Get the minimum squared distance from the point to the current centroids.
PARAMETER | DESCRIPTION |
---|---|
point | The point to calculate the distance for. |

RETURNS | DESCRIPTION |
---|---|
float | The minimum squared distance. |
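For a concrete feel of this quantity, the sketch below computes the squared Euclidean distance from a point to each centroid and keeps the smallest. The function name `min_square_distance` and passing the centroids explicitly are assumptions made for a self-contained example.

```python
def min_square_distance(point, centroids):
    """Smallest squared Euclidean distance from `point` to any centroid (hypothetical helper)."""
    return min(
        sum((x - c) ** 2 for x, c in zip(point, centroid))
        for centroid in centroids.values()
    )


centroids = {0: [10.0, 2.0], 1: [1.0, 2.0]}
# To centroid 0: (1 - 10)^2 + (4 - 2)^2 = 85; to centroid 1: 0^2 + 2^2 = 4.
result = min_square_distance([1.0, 4.0], centroids)
print(result)  # 4.0
```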
Source code in toyml/clustering/kmeans.py
_get_point_centroid_label
Get the label of the centroid closest to the point.
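The nearest-centroid lookup can be sketched in a few lines. The helper name `nearest_centroid_label` is hypothetical; the centroids and query point are taken from the class-level example, where predicting `[0, 1]` yields label 1.

```python
import math


def nearest_centroid_label(point, centroids):
    """Return the label whose centroid is closest to `point` (hypothetical helper)."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


centroids = {0: [10.333333333333334, 2.0], 1: [1.0, 2.0]}
label = nearest_centroid_label([0, 1], centroids)
print(label)  # 1
```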
Source code in toyml/clustering/kmeans.py