Hello, readers! In this article, we will focus on understanding and implementing KNN in Python.
So, let us get started!
What is the KNN Algorithm?
KNN stands for K-Nearest Neighbors. It is a supervised machine learning algorithm that can be used for both classification and regression.
KNN is a non-parametric algorithm, i.e. it makes no assumptions about the underlying distribution of the data.
Steps followed by the KNN algorithm
- It first stores the training data in memory.
- When a new record arrives for prediction, KNN selects the k training records that are most similar to it.
- Similarity is measured with a distance metric such as Euclidean or Manhattan distance: KNN calculates the distance between the test point and every training point and then selects the K nearest neighbors.
- Finally, the test point is assigned to the class that holds the majority among its K nearest neighbors.
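The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how scikit-learn implements KNN; the function names (euclidean, manhattan, knn_classify) and the toy points are made up for this example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_classify(train_points, train_labels, query, k, distance=euclidean):
    """Return the majority label among the k training points nearest to query."""
    # Steps 1-2: the stored training data is compared with the query point.
    distances = [
        (distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest training points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: majority vote among the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups, "A" and "B".
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]

pred_euclid = knn_classify(points, labels, (2.0, 2.0), k=3)
pred_manhattan = knn_classify(points, labels, (6.0, 7.0), k=3, distance=manhattan)
print(pred_euclid, pred_manhattan)
```

Passing the distance function as a parameter makes it easy to compare how Euclidean and Manhattan distance rank the same neighbors.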
Real-Life Example of K-NN
Problem statement: Consider a bag of beads (training data) in two colors, Green and Blue.
So, there are two classes here: Green and Blue. Our task is to find the class to which a new bead ‘Z’ belongs.
Solution: First, we choose a value for K. Let us assume K=4. KNN then calculates the distance between Z and every training data value (the beads in the bag).
Next, we select the 4 (K) values nearest to Z and check which class the majority of these 4 neighbors belong to.
Finally, Z is assigned the class of the majority of its neighbors.
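The bead example can be tried out with a short sketch. The coordinates below are hypothetical feature values invented for illustration (the article does not say how the beads are measured). Note that with two classes and an even K such as 4, a 2-2 tie is possible, so the sketch checks for it explicitly; an odd K avoids this situation.

```python
import math
from collections import Counter

# Hypothetical 2-D feature values for the beads in the bag (training data).
beads = [
    ((1.0, 1.0), "Green"),
    ((1.2, 0.8), "Green"),
    ((0.8, 1.3), "Green"),
    ((1.1, 1.4), "Green"),
    ((2.0, 1.8), "Blue"),
    ((3.0, 3.2), "Blue"),
    ((3.3, 2.9), "Blue"),
]
z = (1.5, 1.5)  # the new bead Z (hypothetical position)
K = 4

# Sort the beads by their Euclidean distance to Z and keep the K nearest.
by_distance = sorted(beads, key=lambda bead: math.dist(bead[0], z))
neighbours = by_distance[:K]

# Count the colours among the K nearest beads.
votes = Counter(colour for _, colour in neighbours)
ranked = votes.most_common()

# With an even K, the two classes can receive the same number of votes.
is_tie = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
z_class = None if is_tie else ranked[0][0]
print(dict(votes), z_class)
```

With these made-up positions, three of the four nearest beads are Green and one is Blue, so Z is assigned to Green.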
Implementation of KNN in Python
Now, let us implement KNN to solve the regression problem below.
We have a dataset containing historical data on the number of people who chose to rent a bike under various environmental conditions.
You can find the dataset here.
So, let us begin!
1. Load the dataset
We use the Pandas module to load the dataset into the environment with the pandas.read_csv() function.

import pandas
BIKE = pandas.read_csv("Bike.csv")
2. Select the right features
We use correlation analysis to select the important variables from the dataset.

# Numeric columns to compare (the ones that appear in the correlation matrix)
numeric_col = ['temp', 'atemp', 'hum', 'windspeed']
corr_matrix = BIKE.loc[:, numeric_col].corr()
print(corr_matrix)
Correlation Matrix
                temp      atemp        hum  windspeed
temp        1.000000   0.991738   0.114191  -0.140169
atemp       0.991738   1.000000   0.126587  -0.166038
hum         0.114191   0.126587   1.000000  -0.204496
windspeed  -0.140169  -0.166038  -0.204496   1.000000
As ‘temp’ and ‘atemp’ are highly correlated, we drop ‘atemp’ from the dataset.
BIKE = BIKE.drop(['atemp'],axis=1)
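To see what the correlation values mean, here is a small sketch that computes the Pearson correlation coefficient by hand. The readings and the 0.9 cut-off are assumptions made for illustration; they are not taken from the bike dataset, and the article does not state which threshold it used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical readings, invented for this example.
temp = [0.20, 0.35, 0.50, 0.65, 0.80]
atemp = [0.22, 0.36, 0.49, 0.63, 0.77]
hum = [0.60, 0.40, 0.75, 0.55, 0.65]

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"

r_temp_atemp = pearson(temp, atemp)
r_temp_hum = pearson(temp, hum)

# A pair above the threshold carries nearly the same information,
# so one of the two columns can be dropped.
print("temp/atemp redundant:", abs(r_temp_atemp) > THRESHOLD)
print("temp/hum redundant:", abs(r_temp_hum) > THRESHOLD)
```

Two features that move almost in lockstep add little new information, which is why one of them can be dropped.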
3. Split the dataset
We use the train_test_split() function to split the dataset into 80% training and 20% testing data.

from sklearn.model_selection import train_test_split

# Separate the independent variables (X) from the dependent variable (Y).
X = BIKE.drop(['cnt'], axis=1)
Y = BIKE['cnt']

# Split the dataset into 80% training data and 20% testing data.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.20, random_state=0)
4. Define Error Metrics
As this is a regression problem, we define MAPE (mean absolute percentage error) as the error metric, as shown below:

import numpy as np

def MAPE(Y_actual, Y_Predicted):
    # Mean of the absolute percentage errors, expressed in percent
    mape = np.mean(np.abs((Y_actual - Y_Predicted) / Y_actual)) * 100
    return mape
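As a quick sanity check of the metric, the arithmetic can be worked through on three made-up values in plain Python. The numbers are hypothetical, and the lowercase mape helper is a stand-alone illustration rather than the article's function. Keep in mind that MAPE divides by the actual value, so it is undefined whenever an actual value is 0.

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent (assumes no actual value is 0)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios) * 100

# Hypothetical values: errors of 10%, 10% and 0%.
actual = [100, 200, 400]
predicted = [110, 180, 400]

error = mape(actual, predicted)
print(round(error, 2))  # 6.67
```

The three relative errors are 0.10, 0.10 and 0.00, so their mean is about 0.0667, i.e. a MAPE of roughly 6.67%.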
5. Build the model
The sklearn.neighbors module provides the KNeighborsRegressor() class to implement KNN, as shown below:

# Build the KNN model on our dataset
from sklearn.neighbors import KNeighborsRegressor

KNN_model = KNeighborsRegressor(n_neighbors=3).fit(X_train, Y_train)
Next, we predict on the testing data using the predict() function.
KNN_predict = KNN_model.predict(X_test) #Predictions on Testing data
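For regression, KNN does not take a majority vote. With its default uniform weights, KNeighborsRegressor predicts the average of the target values of the k nearest neighbors. The sketch below illustrates that idea in plain Python; the knn_regress helper and the (temp, hum) rows with their rental counts are hypothetical and not taken from the bike dataset.

```python
import math

def knn_regress(train_X, train_y, query, k):
    """Predict by averaging the targets of the k training rows nearest to query."""
    ranked = sorted(zip(train_X, train_y), key=lambda row: math.dist(row[0], query))
    nearest_targets = [target for _, target in ranked[:k]]
    return sum(nearest_targets) / k

# Hypothetical (temp, hum) rows with made-up rental counts.
train_X = [(0.20, 0.60), (0.25, 0.55), (0.30, 0.65), (0.70, 0.40), (0.75, 0.45)]
train_y = [1000, 1200, 1100, 4000, 4300]

# The three nearest rows are the first three, so the prediction is their mean.
prediction = knn_regress(train_X, train_y, (0.24, 0.58), k=3)
print(prediction)
```

Here the three closest rows have counts 1000, 1200 and 1100, so the predicted count is their mean.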
6. Accuracy Check!
We call the MAPE function defined above to measure the prediction error and judge the accuracy of the model's predictions.

# Use the MAPE error metric to check the error rate and accuracy level
KNN_MAPE = MAPE(Y_test, KNN_predict)
Accuracy_KNN = 100 - KNN_MAPE
print("MAPE: ", KNN_MAPE)
print('Accuracy of KNN model: {:0.2f}%.'.format(Accuracy_KNN))
Accuracy Evaluation of KNN:
MAPE: 17.443668778014253
Accuracy of KNN model: 82.56%.
Conclusion
With this, we have come to the end of this topic. Feel free to comment below if you have any questions.
For more posts related to Python, stay tuned, and till then, Happy Learning!! 🙂