KNN

toyml.classification.knn.KNN dataclass

KNN(k: int, std_transform: bool = True, dataset_: list[list[float]] | None = None, labels_: list[Any] | None = None, standardizationer_: Standardizationer | None = None)

K-Nearest Neighbors classification algorithm implementation.

This class implements the K-Nearest Neighbors algorithm for classification tasks. It supports optional standardization of the input data.

Examples:

>>> dataset = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]
>>> labels = ["A", "A", "B", "B"]
>>> knn = KNN(k=3, std_transform=True).fit(dataset, labels)
>>> knn.predict([2.5, 3.5])
'A'
ATTRIBUTE DESCRIPTION
k

The number of nearest neighbors to consider for classification.

TYPE: int

std_transform

Whether to standardize the input data (default: True).

TYPE: bool

dataset_

The fitted dataset (standardized if std_transform is True).

TYPE: list[list[float]] | None

labels_

The labels corresponding to the fitted dataset.

TYPE: list[Any] | None

standardizationer_

The fitted Standardizationer instance when std_transform is True, otherwise None.

TYPE: Standardizationer | None

fit

fit(dataset: list[list[float]], labels: list[Any]) -> KNN

Fit the KNN model to the given dataset and labels.

PARAMETER DESCRIPTION
dataset

The input dataset to fit the model to.

TYPE: list[list[float]]

labels

The labels corresponding to the input dataset.

TYPE: list[Any]

RETURNS DESCRIPTION
KNN

The fitted KNN instance.

Source code in toyml/classification/knn.py
def fit(self, dataset: list[list[float]], labels: list[Any]) -> KNN:
    """Fit the KNN model to the given dataset and labels.

    Args:
        dataset: The input dataset to fit the model to.
        labels: The labels corresponding to the input dataset.

    Returns:
        The fitted KNN instance.
    """
    self.dataset_ = dataset
    self.labels_ = labels
    if self.std_transform:
        self.standardizationer_ = Standardizationer()
        self.dataset_ = self.standardizationer_.fit_transform(self.dataset_)
    return self

predict

predict(x: list[float]) -> Any

Predict the label of a single input point by majority vote among its k nearest neighbors in the fitted dataset.

PARAMETER DESCRIPTION
x

The input data to predict.

TYPE: list[float]

RETURNS DESCRIPTION
Any

The predicted label.

RAISES DESCRIPTION
ValueError

If the model is not fitted yet, or if std_transform is True but no fitted Standardizationer is available.

Source code in toyml/classification/knn.py
def predict(self, x: list[float]) -> Any:  # noqa: ANN401
    """Predict the label of the input data.

    Args:
        x: The input data to predict.

    Returns:
        The predicted label.

    Raises:
        ValueError: If the model is not fitted yet.
    """
    if self.dataset_ is None or self.labels_ is None:
        msg = "The model is not fitted yet!"
        raise ValueError(msg)

    if self.std_transform:
        if self.standardizationer_ is None:
            msg = "Cannot find the standardization!"
            raise ValueError(msg)
        x = self.standardizationer_.transform([x])[0]
    distances = [self._calculate_distance(x, point) for point in self.dataset_]
    # get the labels of the k nearest neighbors (sorted by distance only)
    k_nearest_labels = [
        label for _, label in sorted(zip(distances, self.labels_, strict=False), key=lambda pair: pair[0])
    ][: self.k]
    label = Counter(k_nearest_labels).most_common(1)[0][0]
    return label
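
The neighbor-selection step of predict can be illustrated in isolation. The following is a minimal sketch rather than the library's code: `knn_vote` is a hypothetical helper that assumes the distances have already been computed, ranks the labels by distance, keeps the first k, and returns the most common label.

```python
from collections import Counter
from typing import Any


def knn_vote(distances: list[float], labels: list[Any], k: int) -> Any:
    """Hypothetical helper: majority label among the k smallest distances."""
    # Sort (distance, label) pairs by distance only, so labels never need to be comparable.
    ranked = sorted(zip(distances, labels), key=lambda pair: pair[0])
    # ranked[:k] keeps the k closest pairs; ranked[::k] would instead take every k-th pair.
    k_nearest_labels = [label for _, label in ranked[:k]]
    # On a tie, Counter.most_common returns the label that was counted first,
    # which here is the tied label with the closest neighbor.
    return Counter(k_nearest_labels).most_common(1)[0][0]


distances = [0.5, 2.0, 0.1, 1.5, 0.3]
labels = ["A", "B", "B", "A", "A"]
vote_k3 = knn_vote(distances, labels, k=3)  # neighbors: B (0.1), A (0.3), A (0.5)
vote_k1 = knn_vote(distances, labels, k=1)  # neighbor: B (0.1)
```

With k=3 the vote is 2 to 1 for "A"; with k=1 only the closest point counts, so the result is "B".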

_calculate_distance staticmethod

_calculate_distance(x: list[float], y: list[float]) -> float

Calculate the Euclidean distance between two points using a numerically stable method.

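This implementation avoids overflow with a two-pass approach: it first finds the largest absolute coordinate difference, then divides every difference by that maximum before squaring and multiplies the square root of the sum back by the maximum.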

Source code in toyml/classification/knn.py
@staticmethod
def _calculate_distance(x: list[float], y: list[float]) -> float:
    """Calculate the Euclidean distance between two points using a numerically stable method.

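    This implementation avoids overflow with a two-pass approach: every coordinate
    difference is divided by the largest absolute difference before squaring, and
    the result is multiplied back by that maximum.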
    """
    assert len(x) == len(y), f"{x} and {y} have different length!"

    # First pass: find the maximum absolute difference
    max_diff = max(abs(xi - yi) for xi, yi in zip(x, y, strict=False))

    if math.isclose(max_diff, 0, abs_tol=1e-9):
        return 0.0  # All elements are identical

    # Second pass: calculate the normalized sum of squares
    sum_squares = sum(((xi - yi) / max_diff) ** 2 for xi, yi in zip(x, y, strict=False))

    return max_diff * math.sqrt(sum_squares)
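
To see why the scaling matters, the sketch below compares a naive Euclidean distance with a scaled one on a very large coordinate. Both function names are hypothetical and exist only for this illustration; the scaled version follows the same idea as the method above but uses an exact zero check for brevity.

```python
import math


def naive_distance(x: list[float], y: list[float]) -> float:
    """Textbook formula: squares the raw differences, which can overflow."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def scaled_distance(x: list[float], y: list[float]) -> float:
    """Divide by the largest absolute difference before squaring."""
    max_diff = max(abs(xi - yi) for xi, yi in zip(x, y))
    if max_diff == 0.0:
        return 0.0  # the two points coincide
    # Every ratio lies in [-1, 1], so its square cannot overflow.
    sum_squares = sum(((xi - yi) / max_diff) ** 2 for xi, yi in zip(x, y))
    return max_diff * math.sqrt(sum_squares)


x, y = [1e200, 0.0], [0.0, 0.0]

try:
    naive_distance(x, y)
    naive_overflowed = False
except OverflowError:
    # 1e200 ** 2 exceeds the largest representable float
    naive_overflowed = True

large_result = scaled_distance(x, y)                     # stays finite
small_result = scaled_distance([0.0, 0.0], [3.0, 4.0])   # the 3-4-5 triangle
```

For everyday use outside this class, the standard library's `math.dist` computes the Euclidean distance between two points directly.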

toyml.classification.knn.Standardizationer dataclass

Standardizationer(_means: list[float] = list(), _stds: list[float] = list(), _dimension: int | None = None)

A class for standardizing numerical datasets.

Provides methods to fit a standardization model to a dataset, transform datasets using the fitted model, and perform both operations in a single step.

fit

fit(dataset: list[list[float]]) -> Standardizationer

Fit the standardization model to the given dataset.

PARAMETER DESCRIPTION
dataset

The input dataset to fit the model to.

TYPE: list[list[float]]

RETURNS DESCRIPTION
Standardizationer

The fitted Standardizationer instance.

RAISES DESCRIPTION
ValueError

If the dataset has inconsistent dimensions.

Source code in toyml/classification/knn.py
def fit(self, dataset: list[list[float]]) -> Standardizationer:
    """Fit the standardization model to the given dataset.

    Args:
        dataset: The input dataset to fit the model to.

    Returns:
        The fitted Standardizationer instance.

    Raises:
        ValueError: If the dataset has inconsistent dimensions.
    """
    self._dimension = self._get_dataset_dimension(dataset)
    self._means = self._dataset_column_means(dataset)
    self._stds = self._dataset_column_stds(dataset)
    return self

transform

transform(dataset: list[list[float]]) -> list[list[float]]

Transform the given dataset using the fitted standardization model.

PARAMETER DESCRIPTION
dataset

The input dataset to transform.

TYPE: list[list[float]]

RETURNS DESCRIPTION
list[list[float]]

The standardized dataset.

RAISES DESCRIPTION
ValueError

If the model has not been fitted yet.

Source code in toyml/classification/knn.py
def transform(self, dataset: list[list[float]]) -> list[list[float]]:
    """Transform the given dataset using the fitted standardization model.

    Args:
        dataset: The input dataset to transform.

    Returns:
        The standardized dataset.

    Raises:
        ValueError: If the model has not been fitted yet.
    """
    if self._dimension is None:
        msg = "The model is not fitted yet!"
        raise ValueError(msg)
    return self.standardization(dataset)

fit_transform

fit_transform(dataset: list[list[float]]) -> list[list[float]]

Fit the standardization model to the dataset and transform it in one step.

PARAMETER DESCRIPTION
dataset

The input dataset to fit and transform.

TYPE: list[list[float]]

RETURNS DESCRIPTION
list[list[float]]

The standardized dataset.

Source code in toyml/classification/knn.py
def fit_transform(self, dataset: list[list[float]]) -> list[list[float]]:
    """Fit the standardization model to the dataset and transform it in one step.

    Args:
        dataset: The input dataset to fit and transform.

    Returns:
        The standardized dataset.
    """
    self.fit(dataset)
    return self.transform(dataset)

standardization

standardization(dataset: list[list[float]]) -> list[list[float]]

Standardize the given numerical dataset.

Each feature is standardized by subtracting its mean and dividing by its standard deviation. A standard deviation of 0 means every value in the column is the same; in that case std is set to 1, which avoids division by zero and maps every value in the column to 0. The input dataset is modified in place and the same list is returned.

PARAMETER DESCRIPTION
dataset

The input dataset to standardize.

TYPE: list[list[float]]

RETURNS DESCRIPTION
list[list[float]]

The standardized dataset.

RAISES DESCRIPTION
ValueError

If the model has not been fitted yet.

Source code in toyml/classification/knn.py
def standardization(self, dataset: list[list[float]]) -> list[list[float]]:
    """Standardize the given numerical dataset.

    Each feature is standardized by subtracting its mean and dividing
    by its standard deviation.
    A standard deviation of 0 means every value in the column is the same;
    in that case std is set to 1, which avoids division by zero and maps
    every value in the column to 0.
    The input dataset is modified in place and the same list is returned.

    Args:
        dataset: The input dataset to standardize.

    Returns:
        The standardized dataset.

    Raises:
        ValueError: If the model has not been fitted yet.
    """
    if self._dimension is None:
        msg = "The model is not fitted yet!"
        raise ValueError(msg)
    for j, column in enumerate(zip(*dataset, strict=False)):
        mean, std = self._means[j], self._stds[j]
        # ref: https://github.com/scikit-learn/scikit-learn/blob/7389dbac82d362f296dc2746f10e43ffa1615660/sklearn/preprocessing/data.py#L70
        if math.isclose(std, 0, abs_tol=1e-9):
            std = 1
        for i, value in enumerate(column):
            dataset[i][j] = (value - mean) / std
    return dataset
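
Because the method above writes the standardized values back into the list it receives, callers who need to keep the raw data may prefer to work on a copy. The following is a minimal sketch of a non-mutating variant; `standardized_copy` is a hypothetical helper, not part of toyml, and it assumes the means and standard deviations were computed from the training data beforehand.

```python
import math
import statistics


def standardized_copy(
    dataset: list[list[float]], means: list[float], stds: list[float]
) -> list[list[float]]:
    """Hypothetical helper: return standardized rows without touching the input."""
    rows = []
    for row in dataset:
        new_row = []
        for value, mean, std in zip(row, means, stds):
            # A (near-)zero std means a constant column; use 1 to avoid dividing by zero.
            if math.isclose(std, 0, abs_tol=1e-9):
                std = 1.0
            new_row.append((value - mean) / std)
        rows.append(new_row)
    return rows


train = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]  # the second column is constant
means = [statistics.mean(col) for col in zip(*train)]
stds = [statistics.stdev(col) for col in zip(*train)]

scaled = standardized_copy(train, means, stds)
# A new point is scaled with the training statistics, not its own.
query = standardized_copy([[4.0, 5.0]], means, stds)[0]
```

Here the first column has mean 2.0 and sample standard deviation 1.0, so the query value 4.0 maps to 2.0, while the constant second column maps to 0.0 everywhere and `train` keeps its original values.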

_dataset_column_means staticmethod

_dataset_column_means(dataset: list[list[float]]) -> list[float]

Calculate the mean of each column (feature) of the dataset.

Source code in toyml/classification/knn.py
@staticmethod
def _dataset_column_means(dataset: list[list[float]]) -> list[float]:
    """Calculate vectors mean."""
    return [statistics.mean(column) for column in zip(*dataset, strict=False)]

_dataset_column_stds staticmethod

_dataset_column_stds(dataset: list[list[float]]) -> list[float]

Calculate the sample standard deviation of each column (feature) of the dataset.

Source code in toyml/classification/knn.py
@staticmethod
def _dataset_column_stds(dataset: list[list[float]]) -> list[float]:
    """Calculate vectors(every column) standard variance."""
    return [statistics.stdev(column) for column in zip(*dataset, strict=False)]
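
`statistics.stdev` computes the sample standard deviation, which divides by n - 1, and it raises `statistics.StatisticsError` when given fewer than two values. The short sketch below contrasts it with the population version, `statistics.pstdev`, on an illustrative column chosen here for easy arithmetic; it is not data from the library.

```python
import math
import statistics

# Mean is 5.0 and the squared deviations sum to 32.0.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_std = statistics.stdev(column)        # sqrt(32 / 7), divides by n - 1
population_std = statistics.pstdev(column)   # sqrt(32 / 8), divides by n
```

The sample value is always the larger of the two, so features standardized with it come out slightly smaller in magnitude than with a population-based scaler such as scikit-learn's `StandardScaler`.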