Factor analysis using Python


Today we are going to discuss factor analysis in Python. It may be new to many students, but I assure you it is going to be exciting. Let's get into it without delay.

Introduction to Factor Analysis

Factor analysis is a dimensionality reduction technique commonly used in statistics, and it is an unsupervised machine-learning technique. It aims to describe the correlations among the measured features in terms of a smaller number of unobserved variables called factors, grouping the variables or items that vary together. In this tutorial, we use the bioChemists dataset from the pydataset module and fit a factor analysis (FA) model with two components.
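To make the idea concrete, here is a minimal sketch on synthetic data (not the dataset used later in this tutorial). The names X_demo, latent and loadings are made up for illustration; the sketch assumes six observed features that are driven by two hidden factors plus noise, and lets sklearn's FactorAnalysis compress them back into two columns.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Assumption for illustration: 6 observed features generated by 2 hidden factors.
n_samples, n_features, n_factors = 500, 6, 2
latent = rng.normal(size=(n_samples, n_factors))        # hidden factor values
loadings = rng.normal(size=(n_factors, n_features))     # how factors drive features
noise = 0.3 * rng.normal(size=(n_samples, n_features))  # feature-specific noise
X_demo = latent @ loadings + noise

# Fit the model and reduce the 6 features to 2 factor scores per row.
fa_demo = FactorAnalysis(n_components=n_factors, random_state=0)
scores = fa_demo.fit_transform(X_demo)

print(scores.shape)               # (500, 2)
print(fa_demo.components_.shape)  # (2, 6)
```

The six correlated columns are summarized by two factor scores per row, which is the dimensionality reduction described above.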

There are two types of factor analysis:

  • Exploratory Factor Analysis
  • Confirmatory Factor Analysis


Exploratory Factor Analysis

It is used to find structure among a set of attributes. The number of factors/components is not specified in advance by the researchers or scientists; it has to be derived from the data, along with the factor values themselves.
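One common way to derive the number of factors from the data is to compare candidate values by held-out log-likelihood. The sketch below is only an illustration under assumptions: it uses synthetic data built from three hidden factors (X_demo is a made-up name), and it relies on the fact that sklearn's FactorAnalysis has a score method, so cross_val_score can be used without labels. Other criteria, such as scree plots, are also used in practice.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Assumption for illustration: 8 features generated by 3 hidden factors plus noise.
latent = rng.normal(size=(300, 3))
loadings = rng.normal(size=(3, 8))
X_demo = latent @ loadings + 0.5 * rng.normal(size=(300, 8))

candidates = [1, 2, 3, 4, 5]
mean_scores = []
for k in candidates:
    fa = FactorAnalysis(n_components=k, random_state=0)
    # FactorAnalysis.score returns the average log-likelihood of held-out rows.
    mean_scores.append(cross_val_score(fa, X_demo, cv=5).mean())

best_k = candidates[int(np.argmax(mean_scores))]
print(dict(zip(candidates, np.round(mean_scores, 3))))
print("Number of factors with the best held-out log-likelihood:", best_k)
```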

Confirmatory Factor Analysis

It is used to test hypotheses and is based on existing theories or concepts. Here, the researchers already have an expected (hypothesized) structure of the data, so the purpose of CFA is to determine how well the observed data fit that expected structure.

Application of Factor Analysis

  1. To reduce the number of variables used to analyze data.
  2. To detect the structure of the relationship between the variables.

Implementing Factor Analysis in Python

Let us have a quick look at the modules we are going to use.

import pandas as pd
from pydataset import data
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt
import numpy as np

If this throws the error "No module named 'pydataset'", install the module with pip from your command prompt as follows.

pip install pydataset

The installation output looks like this:

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pydataset
  Downloading pydataset-0.2.0.tar.gz (15.9 MB)
     |████████████████████████████████| 15.9 MB 9.5 MB/s 
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from pydataset) (1.3.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2022.5)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (1.21.6)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->pydataset) (1.15.0)
Building wheels for collected packages: pydataset
  Building wheel for pydataset (setup.py) ... done
  Created wheel for pydataset: filename=pydataset-0.2.0-py3-none-any.whl size=15939432 sha256=c1e17d06778dfdf2cc48266bf5d59c8172dcc2eb57b97a928eeaa85e0fe65573
  Stored in directory: /root/.cache/pip/wheels/32/26/30/d71562a19eed948eaada9a61b4d722fa358657a3bfb5d151e2
Successfully built pydataset
Installing collected packages: pydataset
Successfully installed pydataset-0.2.0
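If you want to confirm that the installation worked before running the rest of the code, one option is a small check with the standard library. The helper name is_installed is made up for this sketch.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once pydataset has been installed in the current environment.
print(is_installed("pydataset"))
```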

Data Preparation

First of all, we create a data frame called df from the bioChemists dataset, and then reduce it to 14 rows (rows 2 through 15).

df = data('bioChemists')
df = df.iloc[1:15]
X = df[['art', 'kid5', 'phd', 'ment']]
df

The above code snippet will give the output as follows.

index  art  fem    mar      kid5  phd                ment
2      0    Women  Single   0     2.04999995231628   6
3      0    Women  Single   0     3.75               6
4      0    Men    Married  1     1.17999994754791   3
5      0    Women  Single   0     3.75               26
6      0    Women  Married  2     3.58999991416931   2
7      0    Women  Single   0     3.19000005722046   3
8      0    Men    Married  2     2.96000003814697   4
9      0    Men    Single   0     4.61999988555908   6
10     0    Women  Married  0     1.25               0
11     0    Men    Single   0     2.96000003814697   14
12     0    Women  Single   0     0.754999995231628  13
13     0    Women  Married  1     3.69000005722046   3
14     0    Women  Married  0     3.40000009536743   4
15     0    Women  Married  0     1.78999996185303   0

The line X = df[['art', 'kid5', 'phd', 'ment']] pulls the variables we want to use for our analysis. Printing X shows the following.

index  art  kid5  phd                ment
2      0    0     2.04999995231628   6
3      0    0     3.75               6
4      0    1     1.17999994754791   3
5      0    0     3.75               26
6      0    2     3.58999991416931   2
7      0    0     3.19000005722046   3
8      0    2     2.96000003814697   4
9      0    0     4.61999988555908   6
10     0    0     1.25               0
11     0    0     2.96000003814697   14
12     0    0     0.754999995231628  13
13     0    1     3.69000005722046   3
14     0    0     3.40000009536743   4
15     0    0     1.78999996185303   0
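The slicing step relies on positional indexing, which is easy to misread. The sketch below uses a small made-up frame (demo, with a 1-based index like the one shown in the tables) to show that iloc[1:15] skips the first row, stops before position 15, and therefore keeps 14 rows labeled 2 through 15.

```python
import pandas as pd

# Hypothetical stand-in for the dataset: 20 rows with a 1-based index.
demo = pd.DataFrame(
    {"art": range(20), "ment": range(20)},
    index=range(1, 21),
)

# iloc is positional: position 1 is the second row, and the end (15) is excluded.
subset = demo.iloc[1:15]

print(len(subset))                        # 14
print(subset.index[0], subset.index[-1])  # 2 15

# Selecting columns with a list of names returns a new, smaller data frame.
X_demo = subset[["art", "ment"]]
print(X_demo.shape)                       # (14, 2)
```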

Model development

fact_2c = FactorAnalysis(n_components = 2)
X_factor = fact_2c.fit_transform(X)
X_factor

The first line tells Python how many factors we want. The second line takes this setting along with the reduced dataset X and computes the factor scores for each row. The output of the above code snippet is:

array([[-0.06116534,  0.45436164],
       [-0.05368177, -0.21586197],
       [-0.51588955,  0.41579685],
       [ 2.87683951, -0.2463228 ],
       [-0.66312275, -0.91895129],
       [-0.49572513,  0.00948667],
       [-0.37284394, -0.67362045],
       [-0.04985194, -0.5588587 ],
       [-0.9438434 ,  0.7788992 ],
       [ 1.11504909,  0.08341052],
       [ 0.95881639,  0.954253  ],
       [-0.50484028, -0.57376861],
       [-0.34827463, -0.07482872],
       [-0.94146627,  0.56600467]])
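Besides the factor scores, a fitted FactorAnalysis model exposes the loadings (components_) and the per-feature noise variance (noise_variance_), which help with interpreting the factors. The sketch below is a hedged illustration: it rebuilds X by hand from the table printed earlier, with phd rounded, so it runs without pydataset. The column names factor_1 and factor_2 are made up, and the signs and order of the factors may differ from the output above.

```python
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# X rebuilt by hand from the table shown earlier (phd values rounded).
X = pd.DataFrame(
    {
        "art": [0] * 14,
        "kid5": [0, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 1, 0, 0],
        "phd": [2.05, 3.75, 1.18, 3.75, 3.59, 3.19, 2.96,
                4.62, 1.25, 2.96, 0.755, 3.69, 3.40, 1.79],
        "ment": [6, 6, 3, 26, 2, 3, 4, 6, 0, 14, 13, 3, 4, 0],
    },
    index=range(2, 16),
)

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)

# components_ has shape (n_components, n_features); transpose for readability.
loadings = pd.DataFrame(
    fa.components_.T, index=X.columns, columns=["factor_1", "factor_2"]
)
print(loadings)

# Variance of each feature that the two factors do not explain.
print(pd.Series(fa.noise_variance_, index=X.columns))
```

Note that art is 0 for every row in this 14-row subset, so it carries no information here; with so few rows, the loadings should be read as a demonstration rather than a result.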

Visualization

Visualization requires several steps as well. We want to see how well the two components separate students who are married from students who are single. First, we make a dictionary that converts the Single or Married status to a number.

thisdict = {"Single": 0, "Married": 1}
thisdict

z = np.array(df.mar.map(thisdict), dtype = int)
colors = np.array(["blue", "purple"])
z

Output of the above code:

array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1])

Now we plot the factor scores.

plt.scatter(X_factor[:,0], X_factor[:,1], c = colors[z])

Mapping the dictionary onto the mar column changes every Single and Married entry in df into 0 and 1 respectively. These integers are then used to index the colors array, so colors[z] supplies one color per point to the c parameter, which is why the dictionary was created.
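A plot like this is easier to read with axis labels and a legend. The following is one possible sketch, not the only way to do it: it copies the factor scores and the 0/1 array from the outputs shown above so that it runs on its own, and draws each group with a separate scatter call so that matplotlib can build the legend. The axis labels and legend title are choices made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Factor scores and marital-status codes copied from the outputs above.
X_factor = np.array([
    [-0.06116534,  0.45436164], [-0.05368177, -0.21586197],
    [-0.51588955,  0.41579685], [ 2.87683951, -0.2463228 ],
    [-0.66312275, -0.91895129], [-0.49572513,  0.00948667],
    [-0.37284394, -0.67362045], [-0.04985194, -0.5588587 ],
    [-0.9438434 ,  0.7788992 ], [ 1.11504909,  0.08341052],
    [ 0.95881639,  0.954253  ], [-0.50484028, -0.57376861],
    [-0.34827463, -0.07482872], [-0.94146627,  0.56600467],
])
z = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1])

fig, ax = plt.subplots()
# One scatter call per group so that each group gets its own legend entry.
for value, label, color in [(0, "Single", "blue"), (1, "Married", "purple")]:
    mask = z == value
    ax.scatter(X_factor[mask, 0], X_factor[mask, 1], c=color, label=label)

ax.set_xlabel("Factor 1")
ax.set_ylabel("Factor 2")
ax.legend(title="Marital status")
```

In a script, call plt.show() afterwards to display the figure; in a notebook it is rendered automatically.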

Summary

Today we covered the basics of factor analysis using Python. I hope you found it useful, and we will be back with more exciting topics.