Today we are going to discuss factor analysis in Python. It may be new to many students, but I assure you it is going to be exciting as well. Let's get into it without further delay.
Introduction to Factor Analysis
Factor analysis is a dimensionality reduction technique commonly used in statistics, and it is an unsupervised machine-learning technique. It aims to describe the correlations among the measured features in terms of a smaller number of underlying factors, grouping variables that vary together. In this tutorial we use the bioChemists dataset from the pydataset module and fit a factor analysis (FA) with two components.
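For reference, the model that is usually written down for factor analysis looks like the following. The notation is the common textbook one and is an assumption here, not something taken from this tutorial: x is the vector of p observed variables, f holds the k hidden factors (k smaller than p), \Lambda is the p-by-k loading matrix, and \Psi is a diagonal matrix of per-variable noise variances.

```latex
x = \mu + \Lambda f + \varepsilon,
\qquad f \sim \mathcal{N}(0, I_k),
\qquad \varepsilon \sim \mathcal{N}(0, \Psi)

\operatorname{Cov}(x) = \Lambda \Lambda^{\top} + \Psi
```

The second line is the useful one for intuition: all correlation between observed variables comes from the shared loadings, while \Psi only adds variance to each variable on its own.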
There are two types of factor analysis
- Exploratory Factor Analysis
- Confirmatory Factor Analysis
Exploratory Factor Analysis
It is used to find structure among a set of attributes. The number of factors/components is not specified in advance by the researchers or the scientists; both the number of factors and their values need to be derived from the data.
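One possible way to let the data suggest the number of factors is to compare held-out log-likelihoods for several candidate values. The sketch below is only an illustration under stated assumptions: the data is synthetic (generated from 2 hidden factors), and names such as X_demo, candidates, and best_k are made up for this example. It relies on FactorAnalysis.score, which returns the average log-likelihood of the samples.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration): 6 observed variables
# generated from 2 hidden factors plus independent noise.
rng = np.random.default_rng(0)
n_samples, n_features, true_factors = 500, 6, 2
true_loadings = rng.normal(size=(true_factors, n_features))
latent = rng.normal(size=(n_samples, true_factors))
X_demo = latent @ true_loadings + 0.5 * rng.normal(size=(n_samples, n_features))

# Score each candidate number of factors by 5-fold cross-validation.
# With no scoring argument, cross_val_score uses FactorAnalysis.score,
# i.e. the average log-likelihood on the held-out fold.
candidates = [1, 2, 3, 4]
cv_scores = {}
for k in candidates:
    model = FactorAnalysis(n_components=k, random_state=0)
    cv_scores[k] = cross_val_score(model, X_demo, cv=5).mean()

# Pick the candidate with the highest held-out log-likelihood.
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("Best number of factors by held-out log-likelihood:", best_k)
```

On real data the scores for neighbouring values can be close, so it is worth looking at the whole dictionary rather than only at best_k.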
Confirmatory Factor Analysis
It is used to test hypotheses that are based on existing theories or concepts. Here, the researchers already have an expected (hypothesized) structure of the data, so the purpose of CFA is to determine how well the observed data fit that expected structure.
Application of Factor Analysis
- To reduce the number of variables used to analyze data
- To detect the structure of the relationship between the variables.
Implementing Factor Analysis in Python
Let us have a quick look at the modules we are going to use.
import pandas as pd
from pydataset import data
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt
import numpy as np
Sometimes the import fails with the error “No module named ‘pydataset’”. To solve this, install the module with pip from your command prompt as follows.
pip install pydataset
The installation output looks like this:
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pydataset
Downloading pydataset-0.2.0.tar.gz (15.9 MB)
|████████████████████████████████| 15.9 MB 9.5 MB/s
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from pydataset) (1.3.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2022.5)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (1.21.6)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->pydataset) (1.15.0)
Building wheels for collected packages: pydataset
Building wheel for pydataset (setup.py) ... done
Created wheel for pydataset: filename=pydataset-0.2.0-py3-none-any.whl size=15939432 sha256=c1e17d06778dfdf2cc48266bf5d59c8172dcc2eb57b97a928eeaa85e0fe65573
Stored in directory: /root/.cache/pip/wheels/32/26/30/d71562a19eed948eaada9a61b4d722fa358657a3bfb5d151e2
Successfully built pydataset
Installing collected packages: pydataset
Successfully installed pydataset-0.2.0
Data Preparation
First of all, we create a data frame called df from the bioChemists dataset. We then reduce it to 14 rows: iloc[1:15] skips the first row and keeps the rows labelled 2 through 15.
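# Load the bioChemists dataset as a pandas DataFrame
df = data('bioChemists')
# Keep positions 1 to 14, i.e. the rows labelled 2 through 15
df = df.iloc[1:15]
# Select the numeric columns used for the factor analysis
X = df[['art', 'kid5', 'phd', 'ment']]
df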
The above code snippet will give the output as follows.
index | art | fem | mar | kid5 | phd | ment |
---|---|---|---|---|---|---|
2 | 0 | Women | Single | 0 | 2.04999995231628 | 6 |
3 | 0 | Women | Single | 0 | 3.75 | 6 |
4 | 0 | Men | Married | 1 | 1.17999994754791 | 3 |
5 | 0 | Women | Single | 0 | 3.75 | 26 |
6 | 0 | Women | Married | 2 | 3.58999991416931 | 2 |
7 | 0 | Women | Single | 0 | 3.19000005722046 | 3 |
8 | 0 | Men | Married | 2 | 2.96000003814697 | 4 |
9 | 0 | Men | Single | 0 | 4.61999988555908 | 6 |
10 | 0 | Women | Married | 0 | 1.25 | 0 |
11 | 0 | Men | Single | 0 | 2.96000003814697 | 14 |
12 | 0 | Women | Single | 0 | 0.754999995231628 | 13 |
13 | 0 | Women | Married | 1 | 3.69000005722046 | 3 |
14 | 0 | Women | Married | 0 | 3.40000009536743 | 4 |
15 | 0 | Women | Married | 0 | 1.78999996185303 | 0 |
The line that assigns X pulls out the variables we want to use for our analysis. Printing X shows the following.
index | art | kid5 | phd | ment |
---|---|---|---|---|
2 | 0 | 0 | 2.04999995231628 | 6 |
3 | 0 | 0 | 3.75 | 6 |
4 | 0 | 1 | 1.17999994754791 | 3 |
5 | 0 | 0 | 3.75 | 26 |
6 | 0 | 2 | 3.58999991416931 | 2 |
7 | 0 | 0 | 3.19000005722046 | 3 |
8 | 0 | 2 | 2.96000003814697 | 4 |
9 | 0 | 0 | 4.61999988555908 | 6 |
10 | 0 | 0 | 1.25 | 0 |
11 | 0 | 0 | 2.96000003814697 | 14 |
12 | 0 | 0 | 0.754999995231628 | 13 |
13 | 0 | 1 | 3.69000005722046 | 3 |
14 | 0 | 0 | 3.40000009536743 | 4 |
15 | 0 | 0 | 1.78999996185303 | 0 |
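In the 14 rows shown above, art is 0 everywhere, so in this subset it has no variance and cannot contribute to any factor. A quick check like the sketch below makes that visible before fitting. The data is retyped by hand from the table (with phd rounded) so the snippet runs without pydataset, and the names X_demo, variances, and constant_cols are illustrative.

```python
import pandas as pd

# The 14 rows printed above, typed in by hand (phd values rounded).
X_demo = pd.DataFrame(
    {
        "art": [0] * 14,
        "kid5": [0, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 1, 0, 0],
        "phd": [2.05, 3.75, 1.18, 3.75, 3.59, 3.19, 2.96,
                4.62, 1.25, 2.96, 0.755, 3.69, 3.40, 1.79],
        "ment": [6, 6, 3, 26, 2, 3, 4, 6, 0, 14, 13, 3, 4, 0],
    },
    index=range(2, 16),
)

# Per-column sample variance; a value of 0 means the column is constant.
variances = X_demo.var()
constant_cols = variances[variances == 0].index.tolist()

print(variances)
print("Columns with zero variance:", constant_cols)
```

Whether to drop such a column or to use a larger slice of the dataset is a judgment call; the tutorial keeps all four columns.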
Model development
fact_2c = FactorAnalysis(n_components = 2)
X_factor = fact_2c.fit_transform(X)
X_factor
The first line tells Python how many factors we want. The second line fits the model to the reduced dataset X and returns the factor scores, one row per student and one column per factor. The output of the above code snippet is:
array([[-0.06116534, 0.45436164],
[-0.05368177, -0.21586197],
[-0.51588955, 0.41579685],
[ 2.87683951, -0.2463228 ],
[-0.66312275, -0.91895129],
[-0.49572513, 0.00948667],
[-0.37284394, -0.67362045],
[-0.04985194, -0.5588587 ],
[-0.9438434 , 0.7788992 ],
[ 1.11504909, 0.08341052],
[ 0.95881639, 0.954253 ],
[-0.50484028, -0.57376861],
[-0.34827463, -0.07482872],
[-0.94146627, 0.56600467]])
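The array above holds the factor scores for each row. To see how each original variable relates to the two factors, one can also look at the fitted loadings and noise variances, which scikit-learn exposes as components_ and noise_variance_. The following is a minimal sketch, assuming the same 14 rows retyped by hand (phd rounded); the labels factor_1 and factor_2 are made up for readability, and the sign and order of the factors are not guaranteed to match the output above.

```python
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Same 14 rows as in the tutorial, typed in by hand (phd values rounded).
X_demo = pd.DataFrame(
    {
        "art": [0] * 14,
        "kid5": [0, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 1, 0, 0],
        "phd": [2.05, 3.75, 1.18, 3.75, 3.59, 3.19, 2.96,
                4.62, 1.25, 2.96, 0.755, 3.69, 3.40, 1.79],
        "ment": [6, 6, 3, 26, 2, 3, 4, 6, 0, 14, 13, 3, 4, 0],
    },
    index=range(2, 16),
)

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X_demo)  # shape: (14 rows, 2 factors)

# components_ has shape (n_components, n_features); transpose it so that
# each row is an original variable and each column is a factor.
loadings = pd.DataFrame(
    fa.components_.T,
    index=X_demo.columns,
    columns=["factor_1", "factor_2"],
)
# noise_variance_ is the estimated per-variable noise variance.
noise = pd.Series(fa.noise_variance_, index=X_demo.columns, name="noise_variance")

print(loadings.round(3))
print(noise.round(3))
```

Variables with large absolute loadings on the same factor tend to move together, which is the kind of structure factor analysis is meant to detect.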
Visualization
Visualization requires a few steps as well. We want to see how well the two components separate students who are married from students who are single. First, we need a dictionary that converts the Single or Married status to a number.
thisdict = {"Single": 0, "Married": 1}
thisdict
z = np.array(df.mar.map(thisdict), dtype = int)
colors = np.array(["blue", "purple"])
z
Output of the above code:
array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1])
Now we plot the factor scores.
plt.scatter(X_factor[:,0], X_factor[:,1], c = colors[z])

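Mapping the dictionary over the mar column produces the array z, which holds 0 for every Single entry and 1 for every Married entry; df itself is not changed. These integers are used as positions in the colors array, so colors[z] gives blue for single students and purple for married students. That is why the dictionary was created: the text labels cannot be used to index the colors array, but the numbers can.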
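The scatter plot is easier to read with axis labels and a legend. The sketch below is one way to add them, not the only one. It assumes the factor scores and marital codes shown in the outputs above (copied by hand and rounded), and the file name factor_scatter.png is made up for the example.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt

# Factor scores and marital codes copied from the outputs above (rounded).
X_factor = np.array([
    [-0.061, 0.454], [-0.054, -0.216], [-0.516, 0.416], [2.877, -0.246],
    [-0.663, -0.919], [-0.496, 0.009], [-0.373, -0.674], [-0.050, -0.559],
    [-0.944, 0.779], [1.115, 0.083], [0.959, 0.954], [-0.505, -0.574],
    [-0.348, -0.075], [-0.941, 0.566],
])
z = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1])

labels = {0: "Single", 1: "Married"}
colors = {0: "blue", 1: "purple"}

fig, ax = plt.subplots()
# Draw one scatter call per group so each group gets its own legend entry.
for code in (0, 1):
    mask = z == code
    ax.scatter(X_factor[mask, 0], X_factor[mask, 1],
               color=colors[code], label=labels[code])

ax.set_xlabel("Factor 1")
ax.set_ylabel("Factor 2")
ax.legend(title="Marital status")
fig.savefig("factor_scatter.png")
```

Plotting each group separately is a small design choice: it costs one loop but lets matplotlib build the legend from the label arguments.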
Summary
Today we covered the basics of factor analysis using Python. I hope you found it useful, and we will be back soon with more exciting topics.