In this article, we’ll learn more about fitting a logistic regression model in Python. In Machine Learning, we frequently have to tackle problems that have only two possible outcomes – determining if a tumor is malignant or benign in the medical domain, or determining whether a student is admitted to a given university or not in the educational domain.
Binary classification problems are one type of challenge, and logistic regression is a prominent approach for solving these problems. In this article, we’ll look at how to fit a logistic regression model in Python.
Skip to building and fitting a logistic regression model if you know the basics.
What is Logistic Regression?
Logistic Regression is a Machine Learning technique that makes predictions based on independent variables to classify problems like tumor status (malignant or benign), email categorization (spam or not spam), or admittance to a university (admitted or not admitted).
For instance, when categorizing an email, the algorithm will utilize the words in the email as characteristics and generate a prediction about whether or not the email is spam.
Logistic Regression is a supervised Machine Learning technique, which means that the data used for training has already been labeled, i.e., the answers are already in the training set. The algorithm gains knowledge from the instances.
Importance of Logistic Regression
This technique can be used in medicine to estimate the risk of disease or illness in a given population, allowing for the provision of preventative therapy.
By monitoring buyer behavior, businesses can identify trends that lead to improved employee retention or produce more profitable products. This form of analysis is used in the corporate world by data scientists, whose purpose is to evaluate and comprehend complicated digital data.
Predictive models developed with this approach can have a positive impact on any company or organization. One can improve decision-making by using these models to analyze linkages and forecast consequences.
For instance, a manufacturer’s analytics team can utilize logistic regression analysis, which is part of a statistics software package, to find a correlation between machine part failures and the duration those parts are held in inventory. The team can opt to change delivery schedules or installation times based on the knowledge it receives from this research to avoid repeat failures.
Types of Logistic Regression
Based on the type of classification it performs, logistic regression can be classified into different types. With this in mind, there are three different types of Logistic Regression.
1. Binary Logistic Regression
The most common type is binary logistic regression. It’s the kind we talked about earlier when we defined Logistic Regression. This type assigns two separate values for the dependent/target variable: 0 or 1, malignant or benign, passed or failed, admitted or not admitted.
2. Multinomial Logistic Regression
When the target or independent variable has three or more values, Multinomial Logistic Regression is used. For example, a company can conduct a survey in which participants are asked to choose their favorite product from a list of various options. One may construct profiles of those who are most likely to be interested in your product and use that information to tailor your advertising campaign.
3. Ordinal Logistic Regression
When the target variable is ordinal in nature, Ordinal Logistic Regression is utilized. In this case, the categories are organized in a meaningful way, and each one has a numerical value. Furthermore, there are more than two categories in the target variable.
Fitting a Logistic Regression Model
Let’s start by building the prediction model. Now we are going to use the logistic regression classifier to predict diabetes. In the first step, we will load the Pima Indian Diabetes dataset and read it using Pandas’ read CSV function.
Link to download data: https://www.kaggle.com/uciml/pima-indians-diabetes-database
1. Loading and Reading the Data
Lets import the required packages and the dataset that we’ll work on classifying with logistic regression.
#import necessary packages import pandas as pd col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label'] # load dataset pima = pd.read_csv("C:\\Users\Intel\Documents\diabetes.csv", header=None, names=col_names) pima.head()
2. Feature Selection
In the feature selection step, we will divide all the columns into two categories of variables: dependent or target variables and independent variables, also known as feature variables.
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree'] #features X = pima[feature_cols] #target variable y = pima.label
3. Data Splitting
Splitting the dataset into a training set and a test set helps understand the model’s performance better. We will use the function train_test_split() to divide the dataset.
Following that, we will use random_state to select records randomly. The dataset will be divided into two parts in a ratio of 75:25, which means 75% of the data will be used for training the model and 25% will be used for testing the model.
from sklearn.cross_validation import train_test_split X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
4. Model Building and Prediction
In this step, we will first import the Logistic Regression Module then using the Logistic Regression() function, we will create a Logistic Regression Classifier Object.
You can fit your model using the function fit() and carry out prediction on the test set using predict() function.
from sklearn.linear_model import LogisticRegression logreg = LogisticRegression() # fit the model with data logreg.fit(X_train,y_train) #predict the model y_pred=logreg.predict(X_test)
5. Evaluation of the Model with Confusion Matrix
Let’s start by defining a Confusion Matrix.
A confusion matrix is a table that is used to assess a classification model’s performance. An algorithm’s performance can also be seen. The number of right and wrong predictions that are summed up class-wise is the foundation of a confusion matrix.
from sklearn import metrics cnf_matrix = metrics.confusion_matrix(y_test, y_pred) cnf_matrix
array([[119, 11], [ 26, 36]])
In the above result, you can notice that the confusion matrix is in the form of an array object. As this model is an example of binary classification, the dimension of the matrix is 2 by 2.
The values present diagonally indicate actual predictions and the values present non-diagonal values are incorrect predictions. Thus, 119 and 36 are actual predictions and 26 and 11 are incorrect predictions.
- It doesn’t take a lot of computing power, is simple to implement, and understand, and is extensively utilized by data analysts and scientists because of its efficiency and simplicity.
- It also does not necessitate feature scaling. For each observation, logistic regression generates a probability score.
- A huge number of categorical features/variables is too much for logistic regression to manage. It’s prone to be overfitted.
- Logistic regression cannot handle the nonlinear problem, which is why nonlinear futures must be transformed. Independent variables that are not associated with the target variable but are very similar or correlated to each other will not perform well in logistic regression.
We covered a lot of information about Fitting a Logistic Regression in this session. You’ve learned what logistic regression is, how to fit regression models, how to evaluate its performance, and some theoretical information. You should now be able to use the Logistic Regression technique for your own datasets.