10 Miscellaneous PyJanitor Functions for Enhancing Data Cleaning

In the previous post, we reviewed some of the basic data-cleaning functions available in PyJanitor. This post looks at some of the miscellaneous functions the library offers.

For starters, PyJanitor is a data cleaning and processing API inspired by the R package janitor and built on top of the Pandas library, which makes data cleaning easy and enjoyable. It also has several miscellaneous functions that can be used in different domains such as finance, engineering, biology, and time series analysis.

PyJanitor offers a wide range of miscellaneous functions for domains such as finance, engineering, biology, and time series analysis. These include general utilities like counting cumulative unique values, dropping constant or duplicate columns, finding and replacing elements, and introducing noise with jitter. It also provides math functions for computing the empirical cumulative distribution, exponentiation, sigmoid, softmax, and z-score standardization, making it a versatile tool for data cleaning and manipulation tasks.

You can read the previous post here!

Introduction to PyJanitor’s Miscellaneous Functions

Let us discuss the miscellaneous functions offered by PyJanitor.

General Functions in PyJanitor

In this section, we will talk about a few important miscellaneous functions listed under the functions menu of the PyJanitor documentation.

1. Count Cumulative Unique

Remember computing cumulative frequency in school mathematics? The same concept applies here. The count_cumulative_unique function adds a column containing the running count of unique values seen so far in the specified column.

count_cumulative_unique(df, column_name, dest_column_name, case_sensitive=True)

The parameters passed to this function are the dataframe, the name of the column whose unique values are to be counted, and the destination column name where the cumulative counts are stored. If the case_sensitive parameter is set to True, the function treats lower and upper case characters as different (a != A), which changes the resulting counts.

Learn how to create a data frame dynamically here!

Let us see an example.

import pandas as pd
import janitor
df = pd.DataFrame({
    "letters": list("abABcdef"),
    "numbers": range(4, 12),
})
df

In this code snippet, we create a data frame with two columns: letters and numbers. The letters column consists of the values a, b, A, B, c, d, e, and f. The numbers column consists of the integers 4 to 11, since range(4, 12) excludes the end value 12. The last line displays the dataframe.

Now, we attempt to count the unique values in the letters column. First, let us see what will happen if the case-sensitive parameter is set to True.

df.count_cumulative_unique(
    column_name="letters",
    dest_column_name="letters_count",
    case_sensitive = True,
)

The column name in which we want to count the unique values is the letters column. The column in which the results are displayed is the letters_count column, and the case-sensitive parameter is set to True.

Looking at the output, the letter a appears in the first row, so the unique count becomes 1. Next comes b, which is also unique, so the count increases to 2. Then comes the upper case A. Since the case-sensitive parameter is set to True, a is not equal to A, so A is treated as a new unique character and the count increases from 2 to 3. Similarly, B is treated as unique and the count increases to 4.

df.count_cumulative_unique(
    column_name="letters",
    dest_column_name="letters_count",
    case_sensitive = False,
)

Here, when the case-sensitive parameter is set to False, upper and lower case letters are treated as the same (a == A), so A and B do not increase the count.
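To make the counting rule concrete, here is a minimal plain-Python sketch of the same idea. It is not PyJanitor's implementation, and the helper name cumulative_unique_counts is made up for illustration.

```python
def cumulative_unique_counts(values, case_sensitive=True):
    """Return the running count of distinct values seen so far (illustrative helper)."""
    seen = set()
    counts = []
    for value in values:
        # When case is ignored, compare everything in lower case.
        key = value if case_sensitive else value.lower()
        seen.add(key)
        counts.append(len(seen))
    return counts


letters = list("abABcdef")
sensitive = cumulative_unique_counts(letters, case_sensitive=True)
insensitive = cumulative_unique_counts(letters, case_sensitive=False)
print(sensitive)    # [1, 2, 3, 4, 5, 6, 7, 8]
print(insensitive)  # [1, 2, 2, 2, 3, 4, 5, 6]
```

With case sensitivity switched off, the count stays at 2 for A and B because a and b were already seen.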


2. Drop Constant Columns

This function drops all columns that hold the same (constant) value in every row.

drop_constant_columns(df)

import pandas as pd
import janitor 
data = {'A':[3,3,3,],
        'B':[3,2,1],
        'C':[3,1,2],
        'D':["Noodles","China","Japan"],
        'E':["Pao","China","Kimchi"],
        'F':["Japan","China","Korea"]}
df = pd.DataFrame(data)
df

We have created a dictionary with a bunch of numbers, countries, and food items. This dictionary called data is converted into a data frame called df.

Now we use the drop_constant_columns function.

df.drop_constant_columns()
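In the data above, only column A holds a single repeated value, so it is the one we would expect to be dropped. The check can be sketched in plain Python as follows; this is an illustration of the rule, not the library's code.

```python
data = {
    "A": [3, 3, 3],
    "B": [3, 2, 1],
    "C": [3, 1, 2],
    "D": ["Noodles", "China", "Japan"],
    "E": ["Pao", "China", "Kimchi"],
    "F": ["Japan", "China", "Korea"],
}

# A column is constant when it contains exactly one distinct value.
kept = {name: column for name, column in data.items() if len(set(column)) > 1}
print(list(kept))  # ['B', 'C', 'D', 'E', 'F']
```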

3. Drop Duplicate Columns

This method is useful when there are multiple columns with the same name. In such cases, we can specify the column name and the index of the column such that the repetitive column at that index will be dropped.

drop_duplicate_columns(df, column_name, nth_index=0)

The example is given below.

import pandas as pd
import janitor
df = pd.DataFrame({
    "a": range(2, 5),
    "b": range(3, 6),
    "A": range(4, 7),
    "b*": range(6, 9),
}).clean_names(remove_special=True)
df

In this data frame called df, we start with four columns: a, b, A, and b*. The clean_names function is applied to remove the special character (*) at the end of the last column, b*. It also lower-cases the names, so A becomes a. The dataframe now has duplicate column names: a, b, a, b.

df.drop_duplicate_columns(column_name="b", nth_index=0)

Since the index specified is 0, the first occurrence of the column b will be dropped.
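The role of nth_index can be sketched on a plain list of column names. The helper drop_nth_duplicate below is hypothetical and only mimics the selection logic described above.

```python
def drop_nth_duplicate(columns, column_name, nth_index=0):
    """Drop the nth occurrence (0-based) of column_name from a list of names."""
    positions = [i for i, name in enumerate(columns) if name == column_name]
    drop_at = positions[nth_index]
    return [name for i, name in enumerate(columns) if i != drop_at]


columns = ["a", "b", "a", "b"]  # the names after cleaning
first_b_dropped = drop_nth_duplicate(columns, "b", nth_index=0)
second_b_dropped = drop_nth_duplicate(columns, "b", nth_index=1)
print(first_b_dropped)   # ['a', 'a', 'b']
print(second_b_dropped)  # ['a', 'b', 'a']
```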

4. Find_Replace

The find_replace function, just as its name suggests, is used to find an element in the dataframe and replace it with some other element.

find_replace(df, match='exact', **mappings)

By default, the match is exact, which means a value is replaced only when it is identical to the value given in the function. The matching method can be either exact (full-value matching) or regular-expression-based fuzzy matching, which replaces a value even when only part of it matches the pattern.

Let us see an example. In the following example, the data frame has four popular songs by popular singers.

df = pd.DataFrame({
    "song": ["We don't talk anymore","Euphoria","Dangerously","As It Was"],
    "singer": ["C.Puth","JK","C.Puth","Harry Styles"]
})
df

Now, let us try to replace the names C.Puth with Charlie and JK with Jungkook.

# Call find_replace as a dataframe method (registered by "import janitor")
df = df.find_replace(
    match="exact",
    singer={"C.Puth": "Charlie", "JK": "Jungkook"},
)
df

5. Jitter

Jitter is a function of PyJanitor that can be used to introduce noise to the values of the data frame. If the data frame has NaN values, they are ignored and the jitter value corresponding to this element will also be NaN.

jitter(df, column_name, dest_column_name, scale, clip=None, random_state=None)

We are required to pass the column name we need the jitter for, the destination column name in which the jitter values must be stored, and the scale at which we need the noise.

import numpy as np
import pandas as pd
import janitor
df1 = pd.DataFrame({"a": [3, 4, 5, np.nan],
                    "b":[1,2,3,4]})
df1

We are creating a dataframe called df1 that has two columns a and b. Column a has one missing value and we will introduce jitter for column a.

df1.jitter("a", dest_column_name="jit", scale=2, random_state=0)
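As a rough picture of what jitter does, the sketch below adds random noise to each value and leaves missing values as NaN. It assumes the noise is Gaussian with mean 0 and a standard deviation equal to scale; the helper jitter_values and its seed parameter are made up for illustration, and the numbers it produces will not match PyJanitor's output.

```python
import math
import random


def jitter_values(values, scale, seed=None):
    """Add Gaussian noise (assumed: mean 0, std = scale); NaN stays NaN."""
    rng = random.Random(seed)
    return [v if math.isnan(v) else v + rng.gauss(0, scale) for v in values]


column_a = [3, 4, 5, float("nan")]
jittered = jitter_values(column_a, scale=2, seed=0)
print(jittered)  # three noisy numbers followed by nan
```

Passing the same seed again reproduces the same noise, which mirrors the purpose of random_state in the call above.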

Math Functions in PyJanitor

Let us discuss some of the math functions available under the math menu of the documentation.

1. Ecdf

The ecdf function is used to obtain the empirical cumulative distribution of the values in a series. Given a series as input, it returns the sorted values of the series and, for each of them, the cumulative fraction of data points that are less than or equal to that value.

ecdf(s)

import pandas as pd
import janitor
s = pd.Series([5,1,3,4,2])
x, y = janitor.ecdf(s)
print("The sorted array of values:", x)
print("The cumulative fraction of values less than or equal to x:", y)

In this code, we have defined a series object called s. The series is passed to the ecdf function, which sorts its values and stores them in an array called x. The second array, y, holds the cumulative fraction of values that are less than or equal to the corresponding entry in x.
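For a series without repeated values, the two arrays can be reproduced by hand. The following plain-Python sketch, with a made-up helper name ecdf_sketch, shows the idea; it is not the library's implementation.

```python
def ecdf_sketch(values):
    """Return sorted values and the fraction of points <= each sorted value."""
    x = sorted(values)
    n = len(x)
    y = [(i + 1) / n for i in range(n)]
    return x, y


x, y = ecdf_sketch([5, 1, 3, 4, 2])
print(x)  # [1, 2, 3, 4, 5]
print(y)  # [0.2, 0.4, 0.6, 0.8, 1.0]
```

For example, three of the five values are less than or equal to 3, which gives 3/5 = 0.6.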

2. Exponent

The exp function takes a series as input and returns the exponential of each value in the series.

exp(s)

import pandas as pd
import janitor
s = pd.Series([1,2,7,6,5])
exp_values = s.exp()
print(exp_values)

We have defined a series called s that contains the values 1, 2, 7, 6, and 5. The exp_values variable stores the result of applying the exp function to the series s.

3. Sigmoid

The sigmoid function of pyjanitor is used to compute the sigmoid values for each element in the series.

The sigmoid function is given below:

sigmoid(x) = 1 / (1 + exp(-x))
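The formula can be checked by hand with a few values. This is a plain-Python sketch of the formula itself, not PyJanitor's code.

```python
import math


def sigmoid(x):
    """sigmoid(x) = 1 / (1 + exp(-x))"""
    return 1.0 / (1.0 + math.exp(-x))


inputs = [-2, 0, 2]
outputs = [sigmoid(x) for x in inputs]
print(outputs)  # values strictly between 0 and 1; sigmoid(0) is 0.5
```

The function squashes any real number into the interval (0, 1), and sigmoid(-x) equals 1 - sigmoid(x).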

4. Softmax

The softmax function, just as the name suggests, is used to compute the softmax values for the elements in a series or a one-dimensional numpy array.

The softmax function can be defined as follows.

softmax(x) = exp(x)/sum(exp(x))

import pandas as pd
import janitor  # registers softmax as a series method
s = pd.Series([1, -2, 5])
s.softmax()
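To see what the formula produces for these three values, here is a plain-Python sketch. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result; whether PyJanitor does the same internally is not something this post verifies.

```python
import math


def softmax(values):
    """softmax(x) = exp(x) / sum(exp(x)), computed with a max shift for stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1, -2, 5])
print(probs)
print(sum(probs))  # the outputs add up to 1 (up to floating-point rounding)
```

The largest input, 5, receives almost all of the weight, while -2 receives almost none.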

5. Z-Score

The z-score is an important quantity in statistics and in machine learning. Also called the standard score, it describes how far a value lies from the mean of a group of values, measured in standard deviations.

The z-score function in pyjanitor is used to compute the standard score of each element in a series.

The z-score formula is given below.

z = (s - s.mean()) / s.std()

Let us see an example.

import pandas as pd
import janitor
s = pd.Series([0, 1, 3,9,-2])
s.z_score()
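The formula above can be reproduced with Python's statistics module. The sketch uses the sample standard deviation (dividing by n - 1), which is what pandas' s.std() computes by default; the helper name z_scores is made up for illustration.

```python
import statistics


def z_scores(values):
    """z = (value - mean) / sample standard deviation"""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1), like pandas' default
    return [(v - mean) / std for v in values]


scores = z_scores([0, 1, 3, 9, -2])
print(scores)
```

After standardization the values have a mean of 0 and a standard deviation of 1, and 9 stands out with the largest score.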

All of the functions discussed above are useful in areas such as statistics, engineering, machine learning, and data visualization, which is what makes them miscellaneous and handy during data cleaning.

Summing It Up

To recapitulate, we have discussed a few functions from two areas of the PyJanitor documentation, general functions and math, along with their syntax and examples. These functions are just a drop in the ocean; the library offers many more functions in the domains of finance, engineering, biology, and chemistry.

Can you find all of the PyJanitor functions from each domain that are still in use and not deprecated?

References

Pyjanitor functions