# Factor analysis using Python

Today we are going to discuss factor analysis in Python. It may be new to many students, but I assure you it is going to be exciting. Let's get into it without further delay.

## Introduction to Factor Analysis

Factor analysis is a dimensionality reduction technique commonly used in statistics, and it is an unsupervised machine-learning technique. It aims to describe the correlations among the measured features in terms of a smaller number of unobserved variables called factors, and it identifies groups of variables or items that share a common factor. In this tutorial, we use the bioChemists dataset from the pydataset module and perform a factor analysis (FA) that creates two components.
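To make the idea concrete, here is a minimal sketch on synthetic data (not the bioChemists dataset). The loading matrix, sample size, and noise level are assumptions chosen purely for illustration: four observed features are generated from two hidden factors, so features that share a factor end up strongly correlated while features driven by different factors do not.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 2000
# Two hidden (latent) factors per sample.
factors = rng.normal(size=(n_samples, 2))

# Hypothetical loading matrix: features 0 and 1 depend on factor 0,
# features 2 and 3 depend on factor 1.
loadings = np.array([
    [0.9, 0.0],
    [0.8, 0.0],
    [0.0, 0.9],
    [0.0, 0.7],
])

# Each observed feature = weighted sum of the factors + independent noise.
noise = 0.3 * rng.normal(size=(n_samples, 4))
observed = factors @ loadings.T + noise

# Correlation matrix of the observed features (columns are features).
corr = np.corrcoef(observed, rowvar=False)
print(np.round(corr, 2))
```

The printed matrix should show two blocks of high correlation, one per factor. Factor analysis works in the opposite direction: it starts from correlations like these and tries to recover the factors behind them.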

There are two types of factor analysis:

• Exploratory Factor Analysis
• Confirmatory Factor Analysis


### Exploratory Factor Analysis

It is used to find structure among a set of attributes. The number of factors/components is not specified in advance by the researchers; it, like the other values of the model, has to be derived from the data.
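One common heuristic for deriving the number of factors is to count the eigenvalues of the correlation matrix that are greater than 1 (often called the Kaiser criterion). The sketch below applies it to synthetic data; the data, the loading values, and the threshold are assumptions for illustration, and other criteria (such as a scree plot) may suggest a different number.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an assumption for illustration):
# six features driven by two latent factors plus noise.
n_samples = 1000
latent = rng.normal(size=(n_samples, 2))
loadings = np.array([
    [0.9, 0.0],
    [0.8, 0.0],
    [0.7, 0.0],
    [0.0, 0.9],
    [0.0, 0.8],
    [0.0, 0.7],
])
X_sim = latent @ loadings.T + 0.4 * rng.normal(size=(n_samples, 6))

# Eigenvalues of the correlation matrix, sorted from largest to smallest.
corr = np.corrcoef(X_sim, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Heuristic: keep as many factors as there are eigenvalues above 1.
n_factors = int(np.sum(eigenvalues > 1.0))
print(np.round(eigenvalues, 2))
print("Suggested number of factors:", n_factors)
```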

### Confirmatory Factor Analysis

It is used to test hypotheses that are based on existing theories or concepts. Here, the researchers already have an expected (hypothesized) structure of the data, so the purpose of CFA is to determine how well the observed data fits that expected structure.

### Applications of Factor Analysis

1. To reduce the number of variables used to analyze data.
2. To detect the structure of the relationships between the variables.

## Implementing Factor Analysis in Python

Let us first import the modules we are going to use.

```python
import pandas as pd
from pydataset import data
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt
import numpy as np
```

Sometimes this throws the error "No module named 'pydataset'". To solve it, install the module with pip from your command prompt as follows.

```
pip install pydataset
```

The installation output looks like this:

```
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pydataset
|████████████████████████████████| 15.9 MB 9.5 MB/s
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from pydataset) (1.3.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (2022.5)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pydataset) (1.21.6)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->pydataset) (1.15.0)
Building wheels for collected packages: pydataset
Building wheel for pydataset (setup.py) ... done
Created wheel for pydataset: filename=pydataset-0.2.0-py3-none-any.whl size=15939432 sha256=c1e17d06778dfdf2cc48266bf5d59c8172dcc2eb57b97a928eeaa85e0fe65573
Successfully built pydataset
Installing collected packages: pydataset
Successfully installed pydataset-0.2.0
```

### Data Preparation

First, we create a data frame called `df` from the `bioChemists` dataset and reduce it to 14 rows. Note that `iloc[1:15]` selects the rows at positions 1 through 14, so the first row is skipped.

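```python
# Load the bioChemists dataset from pydataset
df = data('bioChemists')
# Keep the rows at positions 1 through 14 (14 rows; position 0 is skipped)
df = df.iloc[1:15]
# Select the numeric variables used in the factor analysis
X = df[['art', 'kid5', 'phd', 'ment']]
df
```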

Running the snippet displays the reduced data frame. The line that defines `X` selects the variables we want to use for the analysis; you can print `X` to inspect them.

### Model development

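```python
# Request a model with two factors
fact_2c = FactorAnalysis(n_components=2)
# Fit the model to X and return the factor scores for each row
X_factor = fact_2c.fit_transform(X)
X_factor
```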

The first line tells Python how many factors we want. The second line takes this information along with the reduced dataset `X` and creates the actual factors. The output of the above code snippet is:
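Besides the factor scores, a fitted `FactorAnalysis` object also exposes the estimated loadings (`components_`) and the per-feature noise variances (`noise_variance_`). The sketch below shows how to read them, using synthetic stand-in data rather than the bioChemists data frame; the names `X_demo` and `true_loadings` and all the numbers in them are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)

# Hypothetical stand-in for X: four features driven by two latent factors.
latent = rng.normal(size=(500, 2))
true_loadings = np.array([
    [0.9, 0.0],
    [0.8, 0.0],
    [0.0, 0.9],
    [0.0, 0.7],
])
X_demo = latent @ true_loadings.T + 0.3 * rng.normal(size=(500, 4))

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X_demo)

# One row of scores per sample, one column per factor.
print(scores.shape)
# Loadings: one row per factor, one column per original feature.
print(np.round(fa.components_, 2))
# Estimated noise variance for each original feature.
print(np.round(fa.noise_variance_, 2))
```

Large absolute values in a row of `components_` indicate which original features that factor is mostly associated with. The sign and ordering of the factors are not unique, so the estimated loadings need not match `true_loadings` entry by entry.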

```
array([[-0.06116534,  0.45436164],
       [-0.05368177, -0.21586197],
       [-0.51588955,  0.41579685],
       [ 2.87683951, -0.2463228 ],
       [-0.66312275, -0.91895129],
       [-0.49572513,  0.00948667],
       [-0.37284394, -0.67362045],
       [-0.04985194, -0.5588587 ],
       [-0.9438434 ,  0.7788992 ],
       [ 1.11504909,  0.08341052],
       [ 0.95881639,  0.954253  ],
       [-0.50484028, -0.57376861],
       [-0.34827463, -0.07482872],
       [-0.94146627,  0.56600467]])
```

### Visualization

Visualization requires a few steps. We want to see how well the two components separate students who are married from students who are not. First, we need a dictionary that converts the single or married status to a number.

```python
# Map marital status to an integer code
thisdict = {"Single": 0, "Married": 1}
thisdict

z = np.array(df.mar.map(thisdict), dtype=int)
colors = np.array(["blue", "purple"])  # index 0 = Single, index 1 = Married
z
```

Output of the above code:

```
array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1])
```

Now we plot the two factors against each other, coloring each point by marital status.

```python
plt.scatter(X_factor[:, 0], X_factor[:, 1], c=colors[z])
plt.show()
```

Mapping the dictionary over the `mar` variable converts every Single and Married entry in `df` into 0 and 1 respectively. These integers are then used to index the `colors` array, so `colors[z]` supplies one color per point to the `c` parameter, which is why the dictionary was created.
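The label-to-color step can be checked on its own, without the dataset or a plot. In this small sketch, the `mar` series is a hypothetical stand-in for `df.mar`, with values chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.mar
mar = pd.Series(["Single", "Married", "Married", "Single"])

# Map each label to an integer code.
codes = mar.map({"Single": 0, "Married": 1}).to_numpy(dtype=int)

# Index the color array with the codes: 0 -> "blue", 1 -> "purple".
colors = np.array(["blue", "purple"])
point_colors = colors[codes]

print(codes)
print(point_colors)
```

Passing `point_colors` as the `c` argument of `plt.scatter` would then color single students blue and married students purple.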

## Summary

Today we covered the basics of factor analysis using Python. I hope you found it useful, and we will be back with more exciting topics.