Dimensionality Reduction In Python3

Real-world datasets often contain a huge number of dimensions and features, many of which contribute little to a model's predictions. To get precise outcomes, it is often necessary to shrink such datasets down to their most informative features, which is why dimensionality reduction is applied so widely across machine learning models. Today's exploration covers dimensionality reduction in Python.

What is Dimensionality Reduction?

Dimensionality reduction transforms a high-dimensional dataset into one with fewer dimensions while preserving the vital features and the information a machine learning model needs to operate. By keeping only the significant parts of the data, dimensionality reduction addresses problems such as overfitting, computational complexity, and difficult data visualization.

Why Do We Need Dimensionality Reduction for High-Dimensional Data?

High-dimensional datasets contain many features that are not necessary for prediction. The dimensionality reduction process helps us select only the vital features, and applying it to high-dimensional data avoids a number of problems. Let's discuss these problems one by one.

Overfitting

High-dimensional data often contains excessive, redundant features that can distort a model's predictions. A model trained on too many features may pick up patterns that exist only in the training data, which leads to wrong predictions on new data. This scenario is called overfitting. Overfitting is reduced when we use only the few features that are genuinely necessary for prediction, which is why dimensionality reduction techniques play such an important role.

Reducing Feature Space

The set of features in a dataset forms its feature space, and high-dimensional data often includes features that are not necessary for a model's predictions. Since machine learning models make predictions directly from these features, the feature space has a direct impact on the result, and a smaller, cleaner feature space tends to yield more accurate predictions. This feature space can be reduced with the help of dimensionality reduction techniques.

Reducing Computational Complexity

High-dimensional data, with its many features and dimensions, takes longer to process than a low-dimensional dataset. Dimensionality reduction techniques reduce the number of dimensions and features, which makes the data easier and faster to process and compute.

Data Visualization

Visualizing high-dimensional data is tricky compared to low-dimensional datasets: humans can easily interpret visualizations of up to three dimensions, but data with many features cannot be plotted directly. Since data visualization is a very important part of understanding any machine learning model, we apply dimensionality reduction techniques to project high-dimensional datasets down to something we can see.

Reducing Storage Space

High-dimensional data with many features and dimensions occupies a large amount of storage space and memory. Dimensionality reduction acts as a form of data compression on such data, so storage and memory requirements can be reduced considerably.

Dimensionality Reduction Techniques

Dimensionality reduction techniques are mainly divided into two categories: feature selection and feature extraction. Both serve the same purpose, i.e., reducing dimensionality to create low-dimensional data, but they go about it differently. Let's look at each technique in detail.

Feature Selection

Feature selection is a type of dimensionality reduction technique in which a subset of features is selected from the entire dataset, keeping the important features that are sufficient for the prediction model. Feature selection methods include the wrapper method, the filter method, and the embedded method, each of which evaluates features based on different criteria.

Wrapper Method

In this method, the model is trained and evaluated on different subsets of the features of the main dataset, and the subset that gives the best result is kept. Forward and backward feature selection are classic implementations of the wrapper method.
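
As an illustration, here is a minimal sketch of forward selection using scikit-learn's SequentialFeatureSelector; the k-nearest-neighbors estimator, the iris dataset, and the choice of 2 features are illustrative assumptions, not requirements of the method.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward selection: start with no features and greedily add the one
# that most improves cross-validated accuracy, until 2 are selected.
selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
)
selector.fit(X, y)
print("Selected feature mask:", selector.get_support())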

Filter Method

This method evaluates features using statistical properties, such as their statistical correlation with the target, rather than by training a model on feature subsets. The chi-square test is a common example of a filter method.
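
For instance, here is a minimal sketch of a chi-square filter with scikit-learn's SelectKBest; the iris dataset and k=2 are illustrative choices, and note that the chi2 score requires non-negative feature values.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)   # iris features are non-negative, as chi2 requires

# Keep the 2 features with the highest chi-square score against the target.
selector = SelectKBest(score_func=chi2, k=2)
reduced = selector.fit_transform(X, y)
print("Chi-square scores:", selector.scores_)
print("Reduced shape:", reduced.shape)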

Embedded Method

In this method, feature selection happens during the training process of the model itself: the model learns which features are relevant as it trains, and only those features are kept. Decision-tree-based models, which compute feature importances while being built, are a typical example.
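
As a sketch, a random forest (an ensemble of decision trees) produces feature importances as a by-product of training, and scikit-learn's SelectFromModel can keep only the features above the mean importance; the dataset and the mean-importance threshold are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Importances are learned as a by-product of training the trees.
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print("Feature importances:", model.feature_importances_)

# Keep only features whose importance exceeds the mean importance.
selector = SelectFromModel(model, prefit=True)
print("Reduced shape:", selector.transform(X).shape)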

Feature Extraction

Feature extraction converts high-dimensional data into a low-dimensional dataset by creating a new, smaller set of features from the original dataset; each new feature is derived as a combination of the original features. Feature extraction reduces dimensionality through techniques such as PCA, LDA, ICA, and NMF.

Principal Component Analysis (PCA)

PCA collects the related variation in the dataset and represents it in the form of principal components. These principal components are uncorrelated with one another and are ordered so that the first components capture as much of the data's variance as possible.

Linear Discriminant Analysis (LDA)

This method creates a linear projection in which the scatter inside each class is minimized and the distance between the classes is maximized. Because it uses class labels, Linear Discriminant Analysis is a supervised technique.
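
Here is a minimal sketch with scikit-learn's LinearDiscriminantAnalysis; the iris dataset is an illustrative choice. Since LDA is supervised, it needs the labels y, and it can produce at most (number of classes - 1) components.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)    # 3 classes, so at most 2 LDA components

lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)  # the labels y guide the projection
print("Reduced shape:", X_reduced.shape)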

Independent Component Analysis (ICA)

This method finds a representation of the data in which the statistical independence between the resulting components is maximized. This technique is called Independent Component Analysis (ICA).
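
As a minimal sketch with scikit-learn's FastICA, two independent source signals are linearly mixed into three observed signals, and ICA estimates the sources back from the mixture; the signals and mixing matrix are illustrative assumptions.

from sklearn.decomposition import FastICA
import numpy as np

# Two independent source signals, linearly mixed into three observations.
t = np.linspace(0, 8, 1000)
sources = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]
mixing = np.array([[1.0, 0.5], [0.5, 2.0], [1.5, 1.0]])
observed = sources @ mixing.T                  # shape (1000, 3)

ica = FastICA(n_components=2, random_state=42)
recovered = ica.fit_transform(observed)        # estimates the independent sources
print("Recovered shape:", recovered.shape)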

Non-negative Matrix Factorization (NMF)

This method factorizes a non-negative matrix into two smaller non-negative matrices. Image processing models in particular make heavy use of this method.
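
A minimal sketch with scikit-learn's NMF follows; the input must be non-negative, so random values in [0, 1) work here, and the init method, seed, and iteration cap are illustrative choices.

from sklearn.decomposition import NMF
import numpy as np

np.random.seed(42)
data = np.random.rand(80, 4)                   # non-negative 80x4 matrix

# Factor data (80x4) into W (80x2) and H (2x4), both non-negative.
nmf = NMF(n_components=2, init="random", random_state=42, max_iter=500)
W = nmf.fit_transform(data)
H = nmf.components_
print("W shape:", W.shape, "H shape:", H.shape)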

Dimensionality Reduction Using Scikit-learn Library

There are different ways to implement dimensionality reduction in Python, and the scikit-learn library provides implementations of most of them. Let's walk through a few methods with examples to understand these techniques in practice.

Principal Component Analysis (PCA)

This is a linear dimensionality reduction technique in which the data is represented in the form of its principal components. Let's give it a try using scikit-learn; the main goal is to transform the data into a low-dimensional form.

from sklearn.decomposition import PCA
import numpy as np

np.random.seed(42)
data = np.random.rand(80, 4)          # 80 samples with 4 features each

pca = PCA(n_components=2)             # keep the 2 strongest principal components
low_dimensional_data = pca.fit_transform(data)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print("low-dimensional Data:")
print(low_dimensional_data)

Here, 80 random samples with 4 features each are reduced to 2 principal components. The PCA class from scikit-learn does the work: calling fit_transform fits the principal components and returns the low-dimensional dataset. The printed output lets us verify the implementation.

PCA Dimensionality Reduction

The output shows the explained variance ratio of the 2 retained components, followed by the dataset after conversion to its 2-feature, low-dimensional form.
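
As a quick follow-up sketch, reusing the pca object from the example above, summing the ratios shows how much of the total variance the 2 components retain; inspecting this total is a common way to choose n_components.

# Fraction of the total variance captured by the 2 retained components.
print("Total variance retained:", pca.explained_variance_ratio_.sum())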

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a non-linear dimensionality reduction method that greatly enhances the visualization of data. We can implement it in Python using scikit-learn.

from sklearn.manifold import TSNE
import numpy as np

np.random.seed(42)
data = np.random.rand(80, 4)          # 80 samples with 4 features each

tsne = TSNE(n_components=3)           # embed the samples into 3 dimensions
low_dimensional_data = tsne.fit_transform(data)

print("low_dimensional_data:")
print(low_dimensional_data)

Here, the t-SNE implementation from scikit-learn reduces the 80 samples from 4 features down to 3. The fit_transform method fits the embedding and returns the data in the low-dimensional space, which is then displayed.

t-SNE Dimensionality Reduction

Now, we can see in the results that the number of features is reduced to 3. In this way, we can turn a dataset into a low-dimensional form suitable for visualization.

Isomap Technique

The Isomap technique is a great way to reduce dimensionality by finding a lower-dimensional embedding that still maintains the geodesic distances between data points. This makes it perfect for nonlinear dimensionality reduction. If you’re interested in using Isomap, the scikit-learn library has an implementation available.

from sklearn.manifold import Isomap
import numpy as np

np.random.seed(42)
data = np.random.rand(100, 5)         # 100 samples with 5 features each

isomap = Isomap(n_components=3)       # embed into 3 dimensions, preserving geodesic distances
low_dimensional_data = isomap.fit_transform(data)

print("low_dimensional_data:")
print(low_dimensional_data)

A random sample of 100 data points with 5 different features is used, and those 5 feature components are reduced to 3. The fit_transform method fits the Isomap embedding and returns the transformed data. Let's see the results.

Isomap Dimensionality Reduction Technique

The method runs successfully, and we can see the features are reduced to 3; the low-dimensional dataset is printed at the end.

Summary

In this article, a brief introduction to dimensionality reduction was given. Dimensionality reduction is very important for minimizing the dimensions and features of a dataset, and it is carried out through two families of techniques: feature selection and feature extraction. The scikit-learn library in Python provides implementations of the major techniques, several of which were demonstrated above. Hope you enjoyed this article.

References

Do read the official scikit-learn documentation: https://scikit-learn.org/stable/