KNN
Normal KNN
KNN (k-Nearest Neighbors) is an extremely simple classification algorithm.
For the normal KNN algorithm, the model has no explicit training process. For any test sample, the algorithm computes the distance to every training sample (the \(L_2\), i.e. Euclidean, distance), then selects the \(k\) nearest samples according to that distance. The most frequent class among those \(k\) samples becomes the label of the test sample.
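For concreteness, here is a minimal sketch of this brute-force procedure, assuming NumPy arrays; the function name `knn_predict` and its array layout are illustrative assumptions, not code from the repository linked below.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_x, k):
    """Classify one test sample by majority vote among its k nearest neighbors."""
    # Euclidean (L2) distance from the test sample to every training sample
    dists = np.linalg.norm(train_X - test_x, axis=1)
    # Indices of the k training samples closest to the test sample
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of those k samples
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
```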
With a smaller \(k\), the model becomes more complex and overfits easily; increasing the value of \(k\) makes the model simpler. In the extreme, \(k = N\) (the size of the training set) always predicts the majority class regardless of the input.
kd Tree
A simple approach to selecting the \(k\) nearest samples is a linear scan over the training samples. However, this becomes extremely expensive to compute when the feature dimension and the amount of data are large. Therefore, to reduce the computation, we can build a kd-tree to store the training data.
A kd-tree is a kind of binary tree. When splitting the samples along a coordinate axis, the algorithm uses the median value as the splitting point. Here is the code for constructing the kd-tree:
```python
def get_split_node(self, samples, index, cur_node):
    ...
```
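A fuller, self-contained sketch of the median-split construction might look like the following; the `Node` class, the `build_kdtree` name, and the field names are assumptions for illustration, not the original implementation.

```python
class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point = point  # the training sample stored at this node
        self.axis = axis    # coordinate axis used to split at this depth
        self.left = left
        self.right = right

def build_kdtree(samples, depth=0):
    """Recursively build a kd-tree, splitting on the median along each axis."""
    if not samples:
        return None
    dim = len(samples[0])
    axis = depth % dim  # cycle through the axes as the tree deepens
    samples = sorted(samples, key=lambda p: p[axis])
    median = len(samples) // 2  # the median sample becomes the split node
    return Node(
        point=samples[median],
        axis=axis,
        left=build_kdtree(samples[:median], depth + 1),
        right=build_kdtree(samples[median + 1:], depth + 1),
    )
```

Sorting at every level makes each split easy to read, at the cost of \(O(n \log^2 n)\) construction; a median-of-medians selection would bring this down to \(O(n \log n)\).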
Searching the kd-tree is a bit more complex:
```python
def search_nodes(self, root, test_data):
    ...
```
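As a sketch of the idea behind such a search, the standalone function below (built on the `Node` class from the sketch above) returns the single nearest neighbor; the recursion and pruning rule are illustrative assumptions, and extending it to the \(k\) nearest neighbors would keep a bounded max-heap of candidates instead of a single best point.

```python
import math

def nearest_neighbor(node, target, best=None):
    """Find the training point nearest to `target`, as a (point, distance) pair."""
    if node is None:
        return best
    # Update the best candidate if this node is closer
    dist = math.dist(node.point, target)
    if best is None or dist < best[1]:
        best = (node.point, dist)
    # Descend first into the subtree on the target's side of the split
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
    best = nearest_neighbor(near, target, best)
    # Only search the far subtree if the splitting plane is closer
    # than the best distance found so far (otherwise it can be pruned)
    if abs(diff) < best[1]:
        best = nearest_neighbor(far, target, best)
    return best
```

The pruning step is what makes the kd-tree pay off: whole subtrees are skipped whenever the splitting plane lies farther away than the current best candidate.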
The complete code is available on GitHub (kd-tree codes).