KNN in Python – Simple Practical Implementation

K Nearest Neighbor

Hello, readers! In this article, we will focus on understanding and implementing the KNN algorithm in Python.

So, let us get started!!


What is KNN Algorithm?

KNN stands for K-Nearest Neighbors. It is a supervised machine learning algorithm that can be used for both classification and regression.

KNN makes no assumptions about the underlying data distribution, i.e. it is a non-parametric algorithm.


Steps followed by KNN algorithm

  • It simply stores the training data; no explicit model is built during the training phase.
  • When a new record has to be predicted, KNN looks for the K training records that are most similar to it.
  • The similarity is measured with a distance metric such as Euclidean or Manhattan distance: the algorithm computes the distance between the test point and every training point and keeps the K nearest neighbors.
  • Finally, the test point is assigned to the class that holds the majority among those K nearest neighbors (see the short sketch after this list).
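
To make these steps concrete, here is a minimal from-scratch sketch of the same logic using Euclidean distance and a majority vote. The points and labels are made up purely for illustration; in practice you would normally use the ready-made scikit-learn classes shown later in this article.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Compute the Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Pick the indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Assign the class that occurs most often among those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy illustration with made-up points and labels
X_train = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([7, 6]), k=3))  # -> 'B'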

Real-Life Example of K-NN

Problem statement – Consider a bag of beads (training data) containing two colors, Green and Blue.

So there are two classes, Green and Blue, and our task is to find which class a new bead ‘Z’ belongs to.

Solution – First, we choose a value for K; let us assume K=4. KNN then calculates the distance of Z from every training data value (every bead in the bag).

Next, we pick the 4 (K) beads nearest to Z and check which class the majority of those neighbors belong to.

Finally, Z is assigned the class of the majority of its neighbors.
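
The same reasoning can be reproduced with scikit-learn's KNeighborsClassifier; the bead coordinates and labels below are invented just to mirror the example.

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Made-up 2D positions for green and blue beads (training data)
beads = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [6, 5], [5, 6], [6, 6]])
colors = np.array(['Green', 'Green', 'Green', 'Blue', 'Blue', 'Blue', 'Blue'])

# K = 4, as in the example above
knn = KNeighborsClassifier(n_neighbors=4).fit(beads, colors)

# New bead 'Z': the majority of its 4 nearest neighbors decides its class
Z = np.array([[5, 5]])
print(knn.predict(Z))  # -> ['Blue']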


Implementation of KNN in Python

Now, let us implement KNN to solve the regression problem below.

We have been provided with a dataset containing historic data about the count of people who rent a bike under various environmental conditions.

You can find the dataset here.

So, let us begin!


1. Load the dataset

We use the Pandas module to load the dataset into the environment with the pandas.read_csv() function.

import pandas 
BIKE = pandas.read_csv("Bike.csv")
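
As an optional sanity check, you can take a quick look at the data that was just loaded:

# Quick look at the loaded data
print(BIKE.shape)    # number of rows and columns
print(BIKE.head())   # first few records
print(BIKE.dtypes)   # data type of every column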

2. Select the right features

We use correlation analysis to select the important variables from the dataset.

numeric_col = ['temp', 'atemp', 'hum', 'windspeed']  # numeric columns used in the correlation matrix below
corr_matrix = BIKE.loc[:,numeric_col].corr()
print(corr_matrix)

Correlation Matrix

               temp     atemp       hum  windspeed
temp       1.000000  0.991738  0.114191  -0.140169
atemp      0.991738  1.000000  0.126587  -0.166038
hum        0.114191  0.126587  1.000000  -0.204496
windspeed -0.140169 -0.166038 -0.204496   1.000000

As ‘temp’ and ‘atemp’ are highly correlated, we drop ‘atemp’ from the dataset.

BIKE = BIKE.drop(['atemp'],axis=1)
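
If you prefer not to read the matrix by eye, a small helper like the hypothetical one below can flag pairs whose absolute correlation exceeds a threshold; the 0.9 cut-off is only an illustrative choice.

# List feature pairs whose absolute correlation exceeds a chosen threshold
def highly_correlated_pairs(corr, threshold=0.9):
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pairs

print(highly_correlated_pairs(corr_matrix))  # flags the ('temp', 'atemp') pair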

3. Split the dataset

We use the train_test_split() function to split the dataset into 80% training data and 20% testing data.

#Separating the dependent and independent data variables into two data frames.
from sklearn.model_selection import train_test_split 

X = BIKE.drop(['cnt'],axis=1) 
Y = BIKE['cnt']

# Splitting the dataset into 80% training data and 20% testing data.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.20, random_state=0)

4. Define Error Metrics

As this is a regression problem, we define MAPE (Mean Absolute Percentage Error) as the error metric, as shown below –

import numpy as np
def MAPE(Y_actual,Y_Predicted):
    mape = np.mean(np.abs((Y_actual - Y_Predicted)/Y_actual))*100
    return mape
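
As a quick check of the formula with made-up numbers: actual values of 100 and 200 with predictions of 110 and 180 each have a 10% absolute error, so the MAPE should come out to 10.

actual = np.array([100, 200])
predicted = np.array([110, 180])
print(MAPE(actual, predicted))  # -> 10.0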

5. Build the model

The sklearn.neighbors module provides the KNeighborsRegressor() class to implement KNN for regression, as shown below –

#Building the KNN Model on our dataset
from sklearn.neighbors import KNeighborsRegressor
KNN_model = KNeighborsRegressor(n_neighbors=3).fit(X_train,Y_train)

Further, we generate predictions on the testing data using the predict() function.

KNN_predict = KNN_model.predict(X_test) #Predictions on Testing data

6. Accuracy Check!

We call the above-defined MAPE function to measure the prediction error and judge the accuracy of the model's predictions.

# Using MAPE error metrics to check for the error rate and accuracy level
KNN_MAPE = MAPE(Y_test,KNN_predict)
Accuracy_KNN = 100 - KNN_MAPE
print("MAPE: ",KNN_MAPE)
print('Accuracy of KNN model: {:0.2f}%.'.format(Accuracy_KNN))

Accuracy Evaluation of KNN –

MAPE:  17.443668778014253
Accuracy of KNN model: 82.56%.
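
Note that the choice of n_neighbors=3 above was arbitrary. A small sweep like the sketch below, reusing the MAPE function defined earlier, can be used to see whether another value of K lowers the error.

# Compare a few candidate values of K using the MAPE defined earlier
for k in [3, 5, 7, 9, 11]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, Y_train)
    error = MAPE(Y_test, model.predict(X_test))
    print("K = {:2d}  MAPE = {:.2f}%".format(k, error))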

Conclusion

With this, we have come to the end of this topic. Feel free to comment below in case you come across any questions.

For more such posts related to Python, Stay tuned and till then, Happy Learning!! 🙂