Implementation of KNN on scikit-learn

Aydin Ayanzadeh
4 min read · May 16, 2021

Dataset:

Optdigits data by Alpaydin and Kaynak, from UCI Machine Learning Repository:

ftp://ftp.ics.uci.edu/pub/ml-repos/machine-learning-databases/optdigits/

For each of the following four methods — KNN, decision tree, linear discriminant, and multilayer perceptron:

We partition the optdigits.tra data randomly into a 90% training set and a 10% validation set, using scikit-learn's train_test_split function.

We train on the training data and determine the best hyperparameters for KNN based on the validation accuracy. Using those parameters, we then retrain on the whole training data and measure the final performance of KNN.

# import the required libraries
import os
# itertools is used to iterate over the confusion-matrix cells
import itertools
import numpy as np
# time is used to measure elapsed time
import time
import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

from sklearn.neural_network import MLPClassifier
# for the linear discriminant
from sklearn import linear_model
import pandas as pd
# suppress the warnings emitted during training
import warnings
warnings.filterwarnings("ignore")
class_name = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
class_frequency = [376, 389, 380, 389, 387, 376, 377, 387, 380, 382]

fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(class_name, class_frequency, color='green')
plt.title("Frequency of each class in the training set")
plt.show()

class_frequency_test = [178, 182, 177, 183, 181, 182, 181, 179, 174, 180]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(class_name, class_frequency_test, color='blue')
plt.title("Frequency of each class in the test set")
plt.show()
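Rather than hardcoding the class counts, they can also be computed directly from the label column of the data file. A minimal sketch, assuming optdigits.tra is in the working directory and the class label is in the last column:

import numpy as np
import matplotlib.pyplot as plt

# load the data; the last column holds the class label
trainset = np.loadtxt('optdigits.tra', delimiter=",")
labels = trainset[:, -1].astype(int)

# count how often each digit occurs
classes, counts = np.unique(labels, return_counts=True)

plt.bar(classes, counts, color='green')
plt.title("Frequency of each class in the training set")
plt.xticks(classes)
plt.show()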

KNN (k-nearest neighbors): I implement this method with the scikit-learn package. For a better visualization of the confusion matrix, I use an additional plotting function whose source is given in the references section.

np.random.seed(seed=300)

# load the training set and test set from file
trainset = np.loadtxt(open('optdigits.tra'), delimiter=",")
testset = np.loadtxt(open('optdigits.tes'), delimiter=",")

# slice the features (the first 64 columns) and the labels (the last column)
df_x_train = trainset[:, 0:64]
df_x_test = testset[:, 0:64]
df_y_train = trainset[:, 64]
df_y_test = testset[:, 64]

# scikit-learn's train_test_split: 10% validation, 90% training
from sklearn.model_selection import train_test_split
x_train, x_validation, y_train, y_validation = train_test_split(df_x_train, df_y_train, test_size=0.1)

k_test = np.arange(1, 20, 1)
class_label = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9']

val_scores_knn = []

best_knn = None
h_knn = -1
for i in k_test:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    # accuracy on the validation set
    val_acc_knn = knn.score(x_validation, y_validation)
    val_scores_knn.append(val_acc_knn)
    if val_acc_knn > h_knn:
        best_knn = knn
        h_knn = val_acc_knn

# the best k is the one with the highest validation score
best_k = k_test[val_scores_knn.index(max(val_scores_knn))]



print("accuracy of validation: {:.5f}".format(best_knn.score(x_validation, y_validation)))


print (" high accuracy in KNN: {}\n".format(best_k))
# compute knn train elapse time
start_train = time.time()



print('\n ##########################elapse time of process for trainset##################################')

knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(df_x_train, df_y_train)
#function for elapse time
finish_train=time.time()
print("n_neighbors: {},elaps time of process in training set {:.5f}s.\n".format(best_k, finish_train - start_train))

y_pred_train_knn = knn.predict(df_x_train)
Confusion_Matrix(confusion_matrix(df_y_train, y_pred_train_knn), classes=class_name, title="confusion matrix of trainset for KNN ")
plt.figure(1)



#calcualting the elapse time for test set
start_test = time.time()
print('\n ##########################elapse time of process for testset##################################')
y_pred_test_knn = knn.predict(df_x_test)
finish_test = time.time()
print("Test accuracy {:.5f}\n".format(knn.score(df_x_test, df_y_test)))
print("KNN with :elapse time for testset{:.5f} s.\n".format( finish_test - start_test))


# plot test set confusion matrix

Confusion_Matrix(confusion_matrix(df_y_test, y_pred_test_knn), classes=class_name, title="confusion matrix of testset for KNN")
#plt.figure(2)
print("The process finish.")
plt.show(2)
# adapted from https://stackoverflow.com/questions/40246277/how-to-change-the-ticks-in-a-confusion-matrix
def Confusion_Matrix(cm, classes, normalize=False,
                     cmap=plt.cm.binary, title='Confusion matrix'):
    # optionally normalize each row before plotting
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print(title)
    print(cm)

    plt.figure()
    plt.imshow(cm, cmap=cmap, interpolation='nearest')
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # annotate every cell with its value
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.xlabel('Predicted label')
    plt.ylabel('True label')

The best validation accuracy for KNN is obtained with k = 1 and the worst with k = 19, as Fig. 1 shows. The elapsed time of this method is 0.9896 s on the training set and 0.6230 s on the test set. We choose the best hyperparameter based on the validation accuracy.
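A plot like Fig. 1 can be reproduced from the validation scores collected in the loop above. A minimal sketch, assuming the k_test array and the val_scores_knn list from the earlier code:

# plot validation accuracy against the number of neighbors k
plt.plot(k_test, val_scores_knn, marker='o')
plt.xlabel('number of neighbors k')
plt.ylabel('validation accuracy')
plt.title('Validation accuracy of KNN for k = 1..19')
plt.xticks(k_test)
plt.grid(True)
plt.show()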

For the linear discriminant, the best hyperparameter value is 0.5, and the elapsed time is 0.03202 s on the training set and 0.0010 s on the test set. We train on the training set, predict the test set, and show the performance with a confusion matrix for both the training set and the test set. We use a regularization penalty for the linear discriminant method.
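The linear-discriminant code is not shown in the post. A minimal sketch of one plausible setup, reusing the data arrays and the Confusion_Matrix helper from above; the choice of logistic regression and the reading of 0.5 as the L2 regularization strength C are my assumptions, not taken from the original:

from sklearn import linear_model

# hypothetical linear discriminant: logistic regression with an L2 penalty;
# C=0.5 is an assumed interpretation of the reported best value
lin = linear_model.LogisticRegression(penalty='l2', C=0.5, max_iter=1000)

start = time.time()
lin.fit(df_x_train, df_y_train)
print("elapsed training time: {:.5f} s".format(time.time() - start))

start = time.time()
y_pred_test_lin = lin.predict(df_x_test)
print("elapsed test time: {:.5f} s".format(time.time() - start))

print("Test accuracy: {:.5f}".format(lin.score(df_x_test, df_y_test)))
Confusion_Matrix(confusion_matrix(df_y_test, y_pred_test_lin),
                 classes=class_name,
                 title="Confusion matrix of the test set for the linear discriminant")
plt.show()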

Analysis

The training accuracy of KNN and the MLP is 100 percent; the story for the linear discriminant is different. The MLP reaches 100, 99.48, and 96.27 percent accuracy on the training, validation, and test sets respectively. The validation accuracy of KNN and the linear discriminant is 98.956 and 92.77 percent respectively, and on the test set they reach 97.997 and 92.93 percent (a somewhat lower performance for the linear discriminant). In terms of training time, the fastest classifier is KNN. For the MLP, the number of hidden layers and the other related parameters can make the model more complex and cause overfitting on the training data; cross-validation reduces this overfitting effect. Furthermore, KNN and the linear discriminant have much lower elapsed times than the multilayer perceptron.
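The MLP results quoted above also come from code that is not shown in the post. A minimal sketch of a comparable setup, reusing the split from above; the single hidden layer of 100 units, the iteration limit, and the random seed are illustrative assumptions:

from sklearn.neural_network import MLPClassifier

# hypothetical MLP: one hidden layer of 100 units (an assumption)
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=300)
mlp.fit(x_train, y_train)

print("Training accuracy:   {:.5f}".format(mlp.score(x_train, y_train)))
print("Validation accuracy: {:.5f}".format(mlp.score(x_validation, y_validation)))
print("Test accuracy:       {:.5f}".format(mlp.score(df_x_test, df_y_test)))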
