Factor analysis using Python

Today we are going to discuss factor analysis in Python. It may be new to many readers, but it is an interesting technique to learn. Let's get started.

Introduction to Factor Analysis

Factor analysis is a dimensionality reduction technique commonly used in statistics, and it is an unsupervised machine-learning technique. It aims to describe the correlations among the measured features in terms of a smaller number of unobserved variables called factors, grouping together variables that share common variation. In this tutorial, we use the bioChemists dataset from the pydataset module and perform a factor analysis that creates two components.
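
To make the idea concrete, here is a minimal sketch of the model that factor analysis assumes: each observed feature is a weighted sum of a few hidden factors plus feature-specific noise. The loading matrix, noise levels, and variable names below are made up for illustration and are not taken from the bioChemists data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_factors = 5000, 4, 2

# Hypothetical loadings: how strongly each observed feature depends
# on each hidden factor (values chosen only for illustration).
loadings = np.array([
    [0.9, 0.0],
    [0.8, 0.1],
    [0.1, 0.7],
    [0.0, 0.6],
])
noise_std = np.array([0.3, 0.4, 0.5, 0.6])

# Factor model: x = loadings @ f + noise, with f drawn from N(0, I)
factors = rng.standard_normal((n_samples, n_factors))
noise = rng.standard_normal((n_samples, n_features)) * noise_std
X_sim = factors @ loadings.T + noise

# Under this model the covariance of x is
# loadings @ loadings.T + diag(noise_std ** 2)
implied_cov = loadings @ loadings.T + np.diag(noise_std ** 2)
sample_cov = np.cov(X_sim, rowvar=False)
max_abs_diff = np.max(np.abs(sample_cov - implied_cov))

print(np.round(sample_cov, 2))
print(np.round(implied_cov, 2))
```

The two printed matrices should be close to each other, which is the sense in which a few factors can account for the correlations between many features.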

There are two types of factor analysis:

Exploratory Factor Analysis
Confirmatory Factor Analysis

Exploratory Factor Analysis

It is used to find the underlying structure among a set of attributes. The number of factors/components is not specified in advance by the researchers. Both the number of factors and their values have to be derived from the data.
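
One common heuristic for deriving the number of factors is the Kaiser criterion: keep as many factors as there are eigenvalues of the correlation matrix greater than 1. This heuristic is not part of the tutorial's own workflow. The sketch below applies it to synthetic data, and all names and values in it are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 6 observed variables driven by 2 hidden factors
latent = rng.standard_normal((2000, 2))
weights = np.array([
    [0.9, 0.8, 0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.8, 0.7],
])
X_demo = latent @ weights + 0.5 * rng.standard_normal((2000, 6))

# Eigenvalues of the correlation matrix, largest first
corr = np.corrcoef(X_demo, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Kaiser criterion: count the eigenvalues above 1
n_factors_kaiser = int(np.sum(eigenvalues > 1.0))

print(np.round(eigenvalues, 2))
print("suggested number of factors:", n_factors_kaiser)
```

Treat the result as a starting point only. Other heuristics can suggest a different number of factors for the same data.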

Confirmatory Factor Analysis

It is used to test hypotheses and is based on existing theories or concepts. Here, the researchers already have an expected (hypothesized) structure of the data, so the purpose of CFA is to determine how well the observed data fits that expected structure.

Application of Factor Analysis

To reduce the number of variables used to analyze data.
To detect the structure of the relationship between the variables.

Implementing Factor Analysis in Python

Let us have a quick look at the modules we are going to use.

import pandas as pd
from pydataset import data
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt
import numpy as np

Sometimes the import fails with the error “No module named 'pydataset'”. To solve this, install the package with pip from your command prompt as follows.

pip install pydataset

The installation output looks like this:

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pydataset
  Downloading pydataset-0.2.0.tar.gz (15.9 MB)
     |████████████████████████████████| 15.9 MB 9.5 MB/s 
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from pydataset) (1.3.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2022.5)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (1.21.6)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->pydataset) (1.15.0)
Building wheels for collected packages: pydataset
  Building wheel for pydataset (setup.py) ... done
  Created wheel for pydataset: filename=pydataset-0.2.0-py3-none-any.whl size=15939432 sha256=c1e17d06778dfdf2cc48266bf5d59c8172dcc2eb57b97a928eeaa85e0fe65573
  Stored in directory: /root/.cache/pip/wheels/32/26/30/d71562a19eed948eaada9a61b4d722fa358657a3bfb5d151e2
Successfully built pydataset
Installing collected packages: pydataset
Successfully installed pydataset-0.2.0
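
To confirm that the modules can be imported before running the rest of the tutorial, a small check along these lines may help. The helper name is_installed is hypothetical and introduced only for this sketch.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of each module used in this tutorial
for name in ("pandas", "pydataset", "sklearn", "matplotlib", "numpy"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Any module reported as missing can be installed with pip in the same way as shown above.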

Data Preparation

First, we create a data frame called df from the bioChemists dataset. We then reduce the data frame to 14 rows, since df.iloc[1:15] selects positions 1 through 14.

df = data('bioChemists')
df = df.iloc[1:15]
X = df[['art', 'kid5', 'phd', 'ment']]
df

The above code snippet gives the following output (the art column is not shown in this listing).

index	fem	mar	kid5	phd	ment
2	Women	Single	0	2.04999995231628	6
3	Women	Single	0	3.75	6
4	Men	Married	1	1.17999994754791	3
5	Women	Single	0	3.75	26
6	Women	Married	2	3.58999991416931	2
7	Women	Single	0	3.19000005722046	3
8	Men	Married	2	2.96000003814697	4
9	Men	Single	0	4.61999988555908	6
10	Women	Married	0	1.25	0
11	Men	Single	0	2.96000003814697	14
12	Women	Single	0	0.754999995231628	13
13	Women	Married	1	3.69000005722046	3
14	Women	Married	0	3.40000009536743	4
15	Women	Married	0	1.78999996185303	0

The third line, the assignment to X, pulls the variables we want to use for our analysis. We can inspect the result by printing X (the art column is again not shown in this listing).

index	kid5	phd	ment
2	0	2.04999995231628	6
3	0	3.75	6
4	1	1.17999994754791	3
5	0	3.75	26
6	2	3.58999991416931	2
7	0	3.19000005722046	3
8	2	2.96000003814697	4
9	0	4.61999988555908	6
10	0	1.25	0
11	0	2.96000003814697	14
12	0	0.754999995231628	13
13	1	3.69000005722046	3
14	0	3.40000009536743	4
15	0	1.78999996185303	0
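
The row selection can be surprising, because iloc works on zero-based positions and excludes the stop position. Here is a small sketch on a made-up data frame whose index starts at 1, like the listings above. The toy frame and its column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a dataset whose index labels start at 1
toy = pd.DataFrame({"value": range(100, 120)}, index=range(1, 21))

# Positions 1..14: the first row (position 0) is skipped and
# the stop position 15 is excluded
subset = toy.iloc[1:15]
print(len(subset))                        # 14
print(subset.index[0], subset.index[-1])  # 2 15

# To keep the first 15 rows instead, start the slice at position 0
first_15 = toy.iloc[:15]
print(len(first_15))                      # 15
```

This matches the listings above, where the index runs from 2 to 15.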

Model development

fact_2c = FactorAnalysis(n_components = 2)
X_factor = fact_2c.fit_transform(X)
X_factor

The first line tells Python how many factors we want. The second line takes this information along with the reduced dataset X to compute the factors. The output of the above code snippet is:

array([[-0.06116534,  0.45436164],
       [-0.05368177, -0.21586197],
       [-0.51588955,  0.41579685],
       [ 2.87683951, -0.2463228 ],
       [-0.66312275, -0.91895129],
       [-0.49572513,  0.00948667],
       [-0.37284394, -0.67362045],
       [-0.04985194, -0.5588587 ],
       [-0.9438434 ,  0.7788992 ],
       [ 1.11504909,  0.08341052],
       [ 0.95881639,  0.954253  ],
       [-0.50484028, -0.57376861],
       [-0.34827463, -0.07482872],
       [-0.94146627,  0.56600467]])
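
Besides the factor scores, the fitted model also exposes the loadings (components_) and the per-feature noise variance (noise_variance_). The sketch below shows how to inspect them on synthetic data. The data, the names, and the standardization step with StandardScaler are assumptions added for illustration and are not part of the tutorial's own code.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for X: 200 rows, 4 columns driven by 2 hidden factors
hidden = rng.standard_normal((200, 2))
mixing = np.array([[1.0, 0.8, 0.0, 0.1],
                   [0.0, 0.2, 0.9, 0.7]])
X_toy = hidden @ mixing + 0.3 * rng.standard_normal((200, 4))

# Optional step: put all columns on the same scale before fitting
X_scaled = StandardScaler().fit_transform(X_toy)

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X_scaled)

print(scores.shape)              # (200, 2): one row per sample
print(fa.components_.shape)      # (2, 4): loadings per factor and feature
print(fa.noise_variance_.shape)  # (4,): noise variance per feature
print(np.round(fa.components_, 2))
```

Large absolute values in a row of components_ indicate which features that factor mainly summarizes.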

Visualization

Visualization requires several steps. We want to see how well the two components separate students who are married from students who are not. First, we need a dictionary that converts the Single or Married status to a number.

thisdict = {"Single": 0, "Married": 1}
thisdict

z = np.array(df.mar.map(thisdict), dtype = int)
colors = np.array(["blue", "purple"])
z

Output of the above code:

array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1])

Now we plot the two factors.

plt.scatter(X_factor[:,0], X_factor[:,1], c = colors[z])

Mapping the dictionary onto the mar column converts every Single and Married entry in df into 0 and 1 respectively. These numbers are used as indices into the colors array, so the c parameter receives one color per point, which is why the dictionary was created.
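
A plot is easier to read with axis labels and a legend. The following sketch draws one scatter call per group so that each group gets its own legend entry. The small arrays stand in for X_factor and df.mar and are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_factor and df.mar
scores_demo = np.array([[-0.1, 0.5], [2.9, -0.2], [-0.7, -0.9], [1.1, 0.1]])
status = pd.Series(["Single", "Single", "Married", "Married"])

# Convert the status labels to integer codes, as in the tutorial
codes = status.map({"Single": 0, "Married": 1}).to_numpy()
palette = np.array(["blue", "purple"])

fig, ax = plt.subplots()
for code, label in enumerate(["Single", "Married"]):
    mask = codes == code  # rows that belong to this group
    ax.scatter(scores_demo[mask, 0], scores_demo[mask, 1],
               color=palette[code], label=label)

ax.set_xlabel("Factor 1")
ax.set_ylabel("Factor 2")
ax.legend()
```

In a notebook the figure is displayed automatically. In a script, call plt.show() or fig.savefig(...) to see it.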

Summary

Today we covered a brief introduction to factor analysis using Python. We hope you found it useful, and we will be back with more exciting topics.