Wine Classification using Python – Easily Explained

Hello everybody! In this tutorial, we are going to learn how to classify wines on the basis of various features in the Python programming language.

Introduction to Wine Classification

There are numerous wines available in the world, including dessert wines, sparkling wines, appetizer wines, pop wines, table wines, and vintage wines.

You may wonder how one knows which wine is good and which is not. One answer to this question is machine learning!

There are numerous wine classification methods available. A few of them are listed here:

  1. CART
  2. Logistic Regression
  3. Random forest
  4. Naïve Bayes
  5. Perceptron
  6. SVM
  7. KNN

Implementing Wine Classification in Python

Let’s now get into a very basic implementation of a wine classifier in Python. This will give you a starting point in learning how classifiers work and how you can implement them in Python for various real-world scenarios.

1. Importing Modules

The first step is importing all the necessary modules/libraries into the program. The modules needed for the classification are some basic modules such as:

  1. Numpy
  2. Pandas
  3. Matplotlib

The next step is to import the models we need from the sklearn library. We will also import some other helper functions from sklearn.

The models loaded are listed below:

  1. SVM
  2. Logistic Regression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.preprocessing import normalize

2. Dataset Preparation

Next, we need to prepare our dataset. Let me begin by introducing the dataset and then importing the same in our application.

2.1 Introduction to Dataset

In the dataset, we have 6497 observations and 12 features in total. There are no NaN values in any variable. You can download the data easily here.

The name and description of the 12 features are as follows:

  • Fixed acidity: Amount of acidity in the wine
  • Volatile acidity: Amount of acetic acid present in the wine
  • Citric acid: Amount of citric acid present in the wine
  • Residual sugar: Amount of sugar after fermentation
  • Chlorides: Amount of salts present in the wine
  • Free sulfur dioxide: Amount of free form of SO2
  • Total sulfur dioxide: Amount of free and bound forms of SO2
  • Density: Density of the wine (mass/volume)
  • pH: pH of the wine, on a scale from 0 to 14
  • Sulphates: Amount of sulphates, an additive that can contribute to sulfur dioxide (SO2) levels in the wine
  • Alcohol: Amount of alcohol present in the wine
  • Quality: Final quality of the wine mentioned

2.2 Loading the Dataset

The dataset is loaded into the program with the help of the read_csv function, and the first five rows of the dataset are displayed using the head function.
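The loading code itself is not shown here, so the following is only a minimal sketch of the read_csv and head calls. The in-memory CSV, its column subset, and its three rows are made up so the snippet runs on its own; in practice you would pass the path of the downloaded file to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded CSV file (values are made up).
# The empty first header cell mimics a saved index column.
csv_text = """,fixed acidity,volatile acidity,alcohol,quality
0,7.4,0.70,9.4,5
1,7.8,0.88,9.8,5
2,11.2,0.28,9.8,6
"""

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default (only three exist here).
print(data.head())
print(data.shape)
```

Note that pandas names a header-less first column 'Unnamed: 0', which is the column dropped in the cleaning step.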

Figure: First five rows of the wine dataset

2.3 Cleaning of Data

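Cleaning the dataset involves dropping the unnecessary index column and any rows with NaN values, with the help of the code below:

data = data.drop('Unnamed: 0', axis=1)  # drop the leftover index column
data = data.dropna()  # drop rows with missing values, if any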
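To confirm that a cleaning step did what was intended, it can help to count the missing values beforehand and check the shape afterwards. Here is a small sketch on a made-up three-row frame (the column names and values are assumptions for illustration, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame with a leftover index column and one missing value (made-up data).
data = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "alcohol": [9.4, np.nan, 9.8],
    "quality": [5, 5, 6],
})

# Count missing values per column before cleaning.
missing_before = data.isna().sum()
print(missing_before)

data = data.drop("Unnamed: 0", axis=1)  # remove the leftover index column
data = data.dropna()                    # remove rows that contain NaN
print(data.shape)
```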

2.4 Data Visualization

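An important step is to visualize the data before processing it any further. The visualization is done in two forms, namely:

  1. Histograms
  2. Seaborn heatmap

Plotting Histograms

plt.style.use('dark_background')
for i in range(1, 13):
    # The loop body is missing from the source; one histogram per column is an assumed reconstruction.
    data.hist(column=data.columns[i - 1])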

We will be plotting histograms for each feature separately. The output is displayed below.

Figure: Histogram plot for each feature
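As an alternative to one figure per feature, all histograms can be drawn on a single grid. The sketch below is one possible way to do that, not the code used for the figure above. It uses synthetic data with twelve made-up columns and a non-interactive backend so that it runs anywhere.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the wine data: 12 made-up numeric columns.
data = pd.DataFrame(rng.normal(size=(200, 12)),
                    columns=[f"feature_{i}" for i in range(1, 13)])

# One subplot per column, arranged on a 3 x 4 grid.
fig, axes = plt.subplots(3, 4, figsize=(16, 9))
for ax, column in zip(axes.ravel(), data.columns):
    ax.hist(data[column], bins=20)
    ax.set_title(column)
fig.tight_layout()

# Render to an in-memory PNG; pass a file name to savefig to write it to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```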
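Plotting the Seaborn Heatmap

import seaborn as sns
correlations = data.corr(method='pearson')  # pairwise Pearson correlation of all columns
sns.heatmap(correlations, annot=True)  # annot=True writes each coefficient in its cell
plt.show()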

The Seaborn heatmap shows the Pearson correlation between each pair of features in the dataset.

Figure: Seaborn correlation heatmap
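Each cell of such a heatmap is a Pearson coefficient: the covariance of two columns divided by the product of their standard deviations. The sketch below computes one coefficient by hand and compares it with the value from pandas. The two columns are synthetic, and the negative relationship between them is built in on purpose.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
alcohol = rng.normal(10.5, 1.0, size=500)
# Made-up columns: density is constructed to fall as alcohol rises.
data = pd.DataFrame({
    "alcohol": alcohol,
    "density": 1.0 - 0.002 * alcohol + rng.normal(0, 0.001, size=500),
})

def pearson(x, y):
    """Pearson r: covariance divided by the product of standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

manual = pearson(data["alcohol"].to_numpy(), data["density"].to_numpy())
from_pandas = data.corr(method="pearson").loc["alcohol", "density"]
print(manual, from_pandas)
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.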

2.5 Train-Test Split and Data Normalization

There is no single optimal percentage for splitting the data into training and testing sets.

One fair rule of thumb is the 80/20 rule, where 80% of the data goes to training and the remaining 20% goes to testing.

This step also involves normalizing the dataset.
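The code that creates the variables printed below is not shown, so here is a minimal sketch of what such a step might look like. It assumes a positional 80/20 split and 'quality' as the target column, and it runs on synthetic data with made-up columns. The variable names match those used in the print statements and in the model code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Synthetic stand-in for the cleaned wine data (made-up values).
data = pd.DataFrame(rng.uniform(0, 10, size=(100, 3)),
                    columns=["alcohol", "pH", "sulphates"])
data["quality"] = rng.integers(3, 9, size=100)

# Assumption: the first 80% of rows are used for training, the rest for testing.
split = int(0.8 * data.shape[0])
train_data, test_data = data.iloc[:split], data.iloc[split:]

# Assumption: 'quality' is the label; every other column is a feature.
x_train = train_data.drop("quality", axis=1)
y_train = train_data["quality"]
x_test = test_data.drop("quality", axis=1)
y_test = test_data["quality"]

# normalize() rescales each row (sample) to unit L2 norm by default.
nor_train = normalize(x_train)
nor_test = normalize(x_test)
```

Two design points are worth keeping in mind. A positional split does not shuffle the rows, so sklearn's train_test_split may be a safer choice if the file is ordered in some way. Also, normalize works row by row; for column-wise scaling, StandardScaler is the usual alternative.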

print("Split of data is at: ",split)
print("\n-------AFTER SPLITTING-------")
print('Shape of train data:',train_data.shape)
print('Shape of test data:',test_data.shape)
print('Shape of x train data:',x_train.shape)
print('Shape of y train data:',y_train.shape)
print('Shape of x test data:',x_test.shape)
print('Shape of y test data:',y_test.shape)


3. Wine Classification Model

In this program, we have used two algorithms, namely SVM and Logistic Regression.

3.1 Support Vector Machine (SVM) Algorithm

clf = svm.SVC(kernel='linear')
clf.fit(nor_train, y_train)  # nor_train: normalized training features (name assumed from nor_test)
y_pred_svm = clf.predict(nor_test)
print("Accuracy (SVM) :",metrics.accuracy_score(y_test, y_pred_svm)*100)

The accuracy of the model turned out to be around 50%.
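For readers who want to try the fit, predict, and score steps without the wine file, here is a self-contained sketch. The data is synthetic three-class data from make_classification, not the wine dataset, so the accuracy it prints says nothing about the result above.

```python
from sklearn import metrics, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize

# Synthetic three-class data standing in for the wine features.
X, y = make_classification(n_samples=300, n_features=11, n_informative=5,
                           n_classes=3, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Row-wise normalization, applied to both splits in the same way.
nor_train, nor_test = normalize(x_train), normalize(x_test)

clf = svm.SVC(kernel="linear")
clf.fit(nor_train, y_train)          # learn from the training split only
y_pred_svm = clf.predict(nor_test)   # predict labels for unseen rows

accuracy = metrics.accuracy_score(y_test, y_pred_svm) * 100
print("Accuracy (SVM):", accuracy)
```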

3.2 Logistic Regression Algorithm

logmodel = LogisticRegression()
logmodel.fit(nor_train, y_train)  # nor_train: normalized training features (name assumed from nor_test)
y_pred_LR= logmodel.predict(nor_test)
print('Mean Absolute Error(Logistic Regression):', metrics.mean_absolute_error(y_test, y_pred_LR)*100)

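The printed value, in this case, turns out to be around 50 as well. Note that this code reports the mean absolute error scaled by 100 rather than the accuracy; use metrics.accuracy_score(y_test, y_pred_LR) for a figure directly comparable to the SVM result. The main reason for the modest results is likely the simplicity of the models we have used.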
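Accuracy and mean absolute error measure different things, so it is worth seeing them side by side. The sketch below uses a handful of made-up quality labels and predictions (not output from the models above), and also calls confusion_matrix and classification_report, which were imported at the start.

```python
import numpy as np
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical quality labels and predictions, made up for illustration.
y_true = np.array([5, 5, 6, 6, 7, 5, 6, 7])
y_pred = np.array([5, 6, 6, 5, 6, 5, 6, 5])

accuracy = metrics.accuracy_score(y_true, y_pred)  # fraction of exact matches
mae = metrics.mean_absolute_error(y_true, y_pred)  # average size of the miss

print("Accuracy:", accuracy)
print("Mean absolute error:", mae)

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
# zero_division=0 avoids a warning for classes that are never predicted.
print(classification_report(y_true, y_pred, zero_division=0))
```

Here four of the eight predictions are exact, so the accuracy is 0.5, while the mean absolute error is 0.625 quality points. The two numbers are not interchangeable.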


To get higher accuracy, you can check out TensorFlow models as well!

Happy Learning! 😇

Stay tuned for more such tutorials! Thank you for reading!