Non-Parametric Statistics in Python: Exploring Distributions and Hypothesis Testing

Non-Parametric Statistics

Non-parametric statistics make no strong assumptions about the underlying distribution, in contrast with parametric statistics. Instead, they rely on ranks and signs and require only minimal assumptions.

Python offers various methods for exploring data distributions, such as histograms, kernel density estimation (KDE), and Q-Q plots. In addition, non-parametric hypothesis tests like the Wilcoxon rank-sum test, the Kruskal-Wallis test, and the chi-square test allow for inferential analysis without relying on parametric assumptions.

In this article, we divide non-parametric statistics into two parts: methods for exploring the underlying distribution, and hypothesis testing and inference.


Exploring Data Distributions

Exploring a distribution helps us visualize the data and relate it to a theoretical distribution. It also helps us summarize its key statistics.

In this section, we will learn about histograms, kernel density estimation, and Q-Q plots, and implement each of them in Python.

Visualizing Data with Histograms

Histograms visualize the distribution of numerical data by dividing its range into bins and showing the frequency of values falling in each bin. They resemble bar charts, but the bars represent contiguous numeric intervals rather than categories. Let us understand this further with Python code.

import matplotlib.pyplot as plt
import numpy as np

# Sample data (replace with your actual data)
data = [2, 5, 7, 8, 2, 1, 9, 4, 5, 3, 7, 8, 2, 6, 1]

# Create the histogram
plt.hist(data, bins=10, edgecolor='black')  # Adjust 'bins' for different bin counts

# Customize the plot (optional)
plt.xlabel('Data Values')
plt.ylabel('Frequency')
plt.title('Histogram of Sample Data')
plt.grid(True)

# Display the plot
plt.show()

# Output: This will display the generated histogram.

Let us look at the output for the above code.

Histogram Output
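If you only need the bin counts and edges rather than a plot, NumPy can compute them directly. Below is a minimal sketch using the same sample data; the choice of 5 bins is arbitrary, for illustration only.

import numpy as np

# Same sample data as above
data = [2, 5, 7, 8, 2, 1, 9, 4, 5, 3, 7, 8, 2, 6, 1]

# Compute the frequencies and bin edges without drawing anything
counts, edges = np.histogram(data, bins=5)  # bins=5 is an arbitrary choice

# Print each bin's range and its frequency
# (every bin is half-open except the last, which includes its right edge)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"[{left:.1f}, {right:.1f}): {count}")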

Estimating Probability Density with Kernel Density Estimation

Kernel Density Estimation (KDE) approximates a random variable’s probability density function (PDF). It provides a continuous and much smoother visualization of the distribution than a histogram. Let us look at the Python code for the same.

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Sample data (replace with your actual data)
data = [2, 5, 7, 8, 2, 1, 9, 4, 5, 3, 7, 8, 2, 6, 1]

# Create the KDE plot
sns.kdeplot(data)

# Customize the plot (optional)
plt.xlabel('Data Values')
plt.ylabel('Probability Density')
plt.title('KDE Plot of Sample Data')
plt.grid(True)

# Display the plot
plt.show()

# Output: This will display the generated KDE plot.

Let us look at the output of the above code.

Kernel Density Estimation Plot
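Seaborn’s kdeplot uses a Gaussian kernel by default. If you want direct control over the bandwidth, you can build the estimate yourself with scipy.stats.gaussian_kde. Here is a minimal sketch; the two bw_method values are arbitrary choices meant only to show how bandwidth affects smoothness.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Same sample data as above
data = [2, 5, 7, 8, 2, 1, 9, 4, 5, 3, 7, 8, 2, 6, 1]
xs = np.linspace(0, 10, 200)  # grid of points at which to evaluate the density

# Fit KDEs with two different bandwidths and overlay them
for bw in (0.3, 0.8):  # arbitrary bandwidths chosen for comparison
    kde = gaussian_kde(data, bw_method=bw)
    plt.plot(xs, kde(xs), label=f'bw_method={bw}')

plt.xlabel('Data Values')
plt.ylabel('Probability Density')
plt.title('KDE with Different Bandwidths')
plt.legend()
plt.show()

A smaller bandwidth follows the data more closely but looks bumpier; a larger one smooths more aggressively.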

Comparing Distributions with Q-Q Plots

Q-Q plots, or quantile-quantile plots, are used to compare two probability distributions. They help us visualize whether two datasets came from the same population or share the same distribution. Let us look at the Python code for the same.

import matplotlib.pyplot as plt
import numpy as np

# Sample data sets (replace with your actual data)
data1 = [2, 5, 7, 8, 2, 1, 9, 4, 5, 3, 7, 8, 2, 6, 1]
data2 = [3, 6, 8, 9, 3, 2, 10, 5, 6, 4, 8, 9, 3, 7, 2]

# Calculate quantiles
q1 = np.quantile(data1, np.linspace(0, 1, 100))
q2 = np.quantile(data2, np.linspace(0, 1, 100))

# Create the Q-Q plot
plt.plot(q1, q2, 'o', markersize=5)

# Reference line for perfect match (optional)
plt.plot(q1, q1, color='red', linestyle='--')

# Customize the plot (optional)
plt.xlabel('Quantiles of Data Set 1')
plt.ylabel('Quantiles of Data Set 2')
plt.title('Q-Q Plot of Sample Data Sets')
plt.grid(True)

# Display the plot
plt.show()

Let us look at the output of the plot.

Q-Q Plot Output
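A Q-Q plot can also compare a single sample against a theoretical distribution. scipy.stats.probplot does this in one call; the sketch below checks our first dataset against a normal distribution.

import matplotlib.pyplot as plt
import scipy.stats as stats

# Same sample data as above
data1 = [2, 5, 7, 8, 2, 1, 9, 4, 5, 3, 7, 8, 2, 6, 1]

# Plot sample quantiles against the quantiles of a fitted normal distribution
stats.probplot(data1, dist='norm', plot=plt)
plt.title('Q-Q Plot Against a Normal Distribution')
plt.show()

Points falling close to the reference line suggest the sample is roughly normal.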

Now let us move on to the methods for hypothesis testing and inference.

Non-Parametric Hypothesis Testing and Inference

In non-parametric hypothesis testing and inference, minimal assumptions are made about the underlying distribution, and the focus is largely on rank-based statistics.
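Since these tests operate on ranks rather than raw values, it is worth seeing what ranking a sample actually does. Here is a minimal sketch using scipy.stats.rankdata; note how tied values receive the average of the ranks they span.

import scipy.stats as stats

data = [2, 5, 7, 8, 2, 1]

# Replace each value with its rank; ties get the average rank by default
ranks = stats.rankdata(data)
print(ranks)  # [2.5 4.  5.  6.  2.5 1. ]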

Under this subheading, we will learn about the Wilcoxon rank-sum, Kruskal-Wallis, and chi-square tests, along with their Python implementations.

Comparing Two Groups with the Wilcoxon Rank-Sum Test

The Wilcoxon rank-sum test, also known as the Mann-Whitney U test, is a non-parametric test used to check whether two independent samples come from the same distribution; it is sensitive to differences in location and is often described as comparing the medians of the two groups. In the code below, we have two datasets, and we want to determine whether their distributions differ.

import scipy.stats as stats

# Sample data (replace with your actual data)
data1 = [2, 5, 7, 10, 12]
data2 = [3, 6, 8, 9, 11, 13]

# Perform Wilcoxon Rank Sum Test
statistic, pvalue = stats.ranksums(data1, data2)

# Print test results
print("Test Statistic:", statistic)
print("p-value:", pvalue)

# Decide on rejecting the null hypothesis based on significance level (e.g., 0.05)
if pvalue < 0.05:
    print("Reject null hypothesis: There is a significant difference between the distributions.")
else:
    print("Fail to reject null hypothesis: Insufficient evidence to conclude a difference.")

Let us look at its output.

Wilcoxon Rank-Sum Test Output

Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no significant difference between the two distributions.
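The same comparison can also be run through scipy.stats.mannwhitneyu, which in recent SciPy versions computes an exact p-value for small samples without ties, rather than the normal approximation used by ranksums. A minimal sketch with the same data:

import scipy.stats as stats

data1 = [2, 5, 7, 10, 12]
data2 = [3, 6, 8, 9, 11, 13]

# Mann-Whitney U test; the two-sided alternative mirrors ranksums above
statistic, pvalue = stats.mannwhitneyu(data1, data2, alternative='two-sided')
print("U statistic:", statistic)
print("p-value:", pvalue)

The two functions may report slightly different p-values because of this difference in computation.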

One-Way ANOVA on Ranks (Kruskal-Wallis Test)

One-way ANOVA on ranks, better known as the Kruskal-Wallis test, is a non-parametric test for comparing three or more independent groups. It does not assume normally distributed data; like the rank-sum test, it checks whether the groups come from the same distribution. Let us look at the code below, where we have assumed three datasets.

import scipy.stats as stats

# Sample data (replace with your actual data)
data1 = [2, 5, 7, 10, 12]
data2 = [3, 6, 8, 9, 11, 13]
data3 = [1, 4, 6, 9, 10]

# Perform Kruskal-Wallis test
statistic, pvalue = stats.kruskal(data1, data2, data3)

# Print test results
print("Test Statistic:", statistic)
print("p-value:", pvalue)

# Decide on rejecting the null hypothesis based on significance level (e.g., 0.05)
if pvalue < 0.05:
    print("Reject null hypothesis: There is a significant difference between distributions.")
else:
    print("Fail to reject null hypothesis: Insufficient evidence to conclude a difference.")

Let us look at the output of the code below.

Kruskal-Wallis Test Output

We fail to reject the null hypothesis since the p-value is greater than 0.05.

Testing Categorical Variables with Chi-Square Test

The chi-square test measures the difference between observed and expected frequencies of categorical variables. It is used both as a goodness-of-fit test and as a test of independence between two categorical variables. Let us look at the code below, which tests independence on a 2x2 contingency table.

import scipy.stats as stats

# Sample contingency table (replace with your actual data)
observed_data = [[10, 20],
                 [15, 25]]

# Perform Chi-square test
chi2_statistic, pvalue, dof, expected_counts = stats.chi2_contingency(observed_data)

# Print test results
print("Chi-square statistic:", chi2_statistic)
print("p-value:", pvalue)

# Decide on rejecting the null hypothesis based on significance level (e.g., 0.05)
if pvalue < 0.05:
    print("Reject null hypothesis: There is a significant association between the variables.")
else:
    print("Fail to reject null hypothesis: Insufficient evidence to conclude an association.")

Let us look at the output of the code below.

Chi-Square Test Output

Since the p-value is greater than 0.05, we fail to reject the null hypothesis and cannot conclude that the two variables are associated.
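For the goodness-of-fit use case mentioned above, scipy.stats.chisquare compares one set of observed counts against expected counts. Below is a minimal sketch with made-up numbers testing whether a die is fair; the counts are purely illustrative.

import scipy.stats as stats

# Hypothetical observed counts for 60 rolls of a six-sided die
observed = [8, 9, 12, 11, 10, 10]
expected = [10, 10, 10, 10, 10, 10]  # a fair die would give equal counts

# Goodness-of-fit test: do the observed counts match the expected ones?
statistic, pvalue = stats.chisquare(f_obs=observed, f_exp=expected)
print("Chi-square statistic:", statistic)
print("p-value:", pvalue)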

Conclusion

Here we go! Now you know what non-parametric statistics are. In this article, we learned how to explore data distributions and test hypotheses without making strong assumptions about their parameters. We also looked at different tests for comparing datasets.

Hope you enjoyed reading it!!
