Understanding Joint Probability Distribution with Python

In this tutorial, we will explore the concepts of joint probability and joint probability distribution in mathematics and demonstrate how to implement them in Python. Joint distributions are essential for studying relationships between variables and have numerous applications in data science.

Understanding Joint Probability

Let’s understand this with an example. (Don’t worry, we won’t roll a die here :))

Suppose we have a bag that contains 10 red balls and 5 green balls. We randomly select two balls from the bag without replacing them, and we want to determine the joint probability of selecting a green ball on the first draw and a red ball on the second draw.

To find the joint probability, our thought process should go like this:

The probability of getting a green ball on the first draw is 5/15 since there are 5 green balls out of 15 total balls.
The probability of selecting a red ball on the second draw, given that a green ball was chosen on the first draw, is 10/14, since there are now 14 balls remaining in the bag.

To find the joint probability, we multiply the probabilities of each event:

The joint probability is given by P(A and B) = P(A ∩ B) = P(A) x P(B|A). This holds for any two events A and B. When A and B are independent, P(B|A) = P(B) and the formula reduces to P(A) x P(B). Here the balls are drawn without replacement, so the two draws are not independent.
Note here that P(B|A) represents the conditional probability of B given that A has occurred.
P(Green on first draw and Red on second draw) = P(Green on first draw) x P(Red on second draw | Green on first draw)
P(Green on first draw and Red on second draw) = (5/15) x (10/14) = 50/210
P(Green on first draw and Red on second draw) ≈ 0.2381

So the joint probability of selecting a green ball on the first draw and a red ball on the second draw is about 0.2381, or approximately 23.81%. In other words, of the four possible ordered outcomes (Red-Red, Red-Green, Green-Red, and Green-Green), the Green-Red outcome occurs with a probability of about 23.81%.
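The calculation above can be checked with a few lines of Python. This is a minimal sketch rather than part of the original walkthrough: it uses exact fractions for the chain rule and then, as an independent check, counts every ordered pair of distinct balls. The variable names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Bag contents from the example: 10 red balls and 5 green balls.
balls = ["R"] * 10 + ["G"] * 5

# Chain rule: P(green first and red second)
#           = P(green first) * P(red second | green first)
p_green_first = Fraction(5, 15)
p_red_given_green = Fraction(10, 14)
p_joint = p_green_first * p_red_given_green

# Brute-force check: enumerate every ordered pair of two different balls
# and count the pairs that are green first, red second.
pairs = list(permutations(range(len(balls)), 2))
favourable = sum(1 for i, j in pairs if balls[i] == "G" and balls[j] == "R")
p_enumerated = Fraction(favourable, len(pairs))

print(p_joint, round(float(p_joint), 4))
print(p_enumerated == p_joint)
```

Both approaches give 5/21, which is about 0.2381.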

Now that we understand joint probability, let’s dive deeper into joint probability distribution.

Exploring Joint Probability Distribution

In short, a joint probability distribution gives the probability of every combination of values that two or more random variables can take together, which lets us study how the variables relate to each other. Let’s use a very basic example to understand it.

Suppose we survey a group of 50 young boys and girls on whether they prefer anime or horror. The results are shown in the table below.

	Boys	Girls	Total
Anime	19	11	30
Horror	8	12	20
Total	27	23	50

Survey data

There are some straightforward questions that can be answered from the data:

What is the total number of girls that participated in the survey? (Answer: 23)
What is the total number of boys that like horror? (Answer: 8)
How many people in the sample are girls and like anime? (Answer: 11)
How many people like horror? (Answer: 20)
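The same questions can be answered programmatically. The sketch below assumes the survey counts are typed into a pandas DataFrame by hand; the name counts and the row and column labels are illustrative choices, not something defined earlier in the tutorial.

```python
import pandas as pd

# Survey counts from the table above: rows are genres, columns are genders.
counts = pd.DataFrame(
    {"Boys": [19, 8], "Girls": [11, 12]},
    index=["Anime", "Horror"],
)

total_girls = counts["Girls"].sum()          # all girls in the survey
boys_horror = counts.loc["Horror", "Boys"]   # boys who like horror
girls_anime = counts.loc["Anime", "Girls"]   # girls who like anime
total_horror = counts.loc["Horror"].sum()    # everyone who likes horror

print(total_girls, boys_horror, girls_anime, total_horror)
```

Column sums answer questions about gender, row sums answer questions about genre, and a single cell answers questions that combine the two.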

So far, things are pretty simple, right? Now, with a small change, we will convert this data into a probability distribution table.

	Boys	Girls	Total
Anime	0.38	0.22	0.6
Horror	0.16	0.24	0.4
Total	0.54	0.46	1

Probability distribution of the survey

We have simply expressed every count as a fraction of the whole by dividing it by 50 (the total number of people), which turns the data into a probability distribution. Now, if we pick a person from the sample at random, what is the probability that the person is a boy and likes horror? The answer is 0.16, which is the joint probability of the events “the person is a boy” and “the person likes horror”. In other words, P(Boy and Horror) = P(Boy ∩ Horror) = 0.16. The four inner cells of the table form the joint distribution, while the Total row and column give the marginal probabilities of gender and genre.
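As a rough sketch of this conversion in code, the counts can be divided by their grand total to obtain the joint probabilities, and row or column sums then recover the totals. The conditional probability and the independence comparison at the end go slightly beyond the text above and are included only to show how the table can be used to examine the relationship between the two variables. All names are illustrative.

```python
import pandas as pd

# Survey counts: rows are genres, columns are genders.
counts = pd.DataFrame(
    {"Boys": [19, 8], "Girls": [11, 12]},
    index=["Anime", "Horror"],
)

# Joint distribution: each count divided by the total number of people (50).
joint = counts / counts.to_numpy().sum()

# Marginal distributions: sum the joint table over the other variable.
marginal_gender = joint.sum(axis=0)  # P(Boy), P(Girl)
marginal_genre = joint.sum(axis=1)   # P(Anime), P(Horror)

# Joint probability of being a boy and liking horror.
p_boy_and_horror = joint.loc["Horror", "Boys"]

# Conditional probability P(Horror | Boy) = P(Boy and Horror) / P(Boy).
p_horror_given_boy = p_boy_and_horror / marginal_gender["Boys"]

# If gender and genre were independent, the joint probability would equal
# the product of the two marginals.
p_if_independent = marginal_gender["Boys"] * marginal_genre["Horror"]

print(joint)
print(round(p_boy_and_horror, 3), round(p_if_independent, 3))
```

The product of the marginals (0.54 x 0.40 = 0.216) differs from the observed joint probability of 0.16, which suggests that gender and genre preference are not independent in this sample.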

A joint probability distribution can be analyzed and visualized in Python using libraries such as NumPy, pandas, and seaborn.

Let’s apply the concepts we’ve learned and implement a joint probability distribution in Python.

Implementing Joint Probability Distribution in Python

We will work with two random variables, A and B, each of which is normally distributed, and visualize their joint distribution.

We will start by importing the required modules:

import numpy as np
import seaborn as sns
import pandas as pd

Now we will define our normal variables A and B using the normal() function from NumPy’s random module. np.random.normal(size=100) generates an array of 100 random numbers drawn from the standard normal distribution (mean 0 and standard deviation 1, the defaults of the loc and scale parameters). The size parameter specifies the shape of the array to generate.

A = np.random.normal(size=100)
B = np.random.normal(size=100)
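Because these samples are random, the numbers and the resulting plot change on every run. If reproducible output is wanted, one option is NumPy’s seeded Generator interface. The sketch below is an alternative to the two lines above, not a replacement the tutorial requires; the seed value 42 and the name rng are arbitrary choices made for illustration.

```python
import numpy as np

# A seeded generator returns the same samples on every run.
rng = np.random.default_rng(42)  # 42 is an arbitrary seed chosen for this sketch

A = rng.normal(size=100)  # 100 draws from the standard normal distribution
B = rng.normal(size=100)

# Sanity check: the sample mean should be near 0 and the sample
# standard deviation near 1, though not exactly, since n is only 100.
print(A.shape, B.shape)
print(round(A.mean(), 3), round(A.std(), 3))
```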

The generated samples are used to build a pandas DataFrame with two columns, “A” and “B”. The resulting DataFrame has 100 rows and two columns.

df = pd.DataFrame({'A': A, 'B': B})

In the next line of code, the jointplot() function from seaborn produces a joint plot: by default, a scatter plot of the two variables with a histogram of each variable along the margins. The x and y parameters specify the columns in the DataFrame to use for the x and y axes, respectively. In this case, ‘A’ is used for the x-axis and ‘B’ for the y-axis. The data parameter specifies the DataFrame to use as the data source for the plot.

sns.jointplot(x='A', y='B', data=df)

Output:

The joint probability distribution for normal random variables A and B
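The plot shows the joint distribution visually. To connect it back to the probability table from the survey example, the samples can also be binned into a grid and normalized so that the cells sum to 1. This is a minimal sketch under assumptions that are not in the tutorial itself: a seeded generator so the output is repeatable, and an arbitrary 4 x 4 grid of bins.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed 0 is an arbitrary choice for this sketch
A = rng.normal(size=100)
B = rng.normal(size=100)

# Count how many (A, B) samples fall into each cell of a 4 x 4 grid.
# Rows correspond to bins of A, columns to bins of B.
counts, a_edges, b_edges = np.histogram2d(A, B, bins=4)

# Divide by the total count so the cells form an empirical joint distribution.
joint = counts / counts.sum()

# Marginals: sum over the other variable, as with the Total row and column.
marginal_A = joint.sum(axis=1)
marginal_B = joint.sum(axis=0)

print(joint.round(2))
print(marginal_A.round(2), marginal_B.round(2))
```

Each cell of joint estimates the probability that A and B fall into the corresponding pair of intervals at the same time, which is the continuous-variable counterpart of a cell such as P(Boy ∩ Horror) in the survey table.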

Summary

In this tutorial, we explored joint probability and joint probability distribution in mathematics and demonstrated their implementation in Python using libraries like NumPy, Pandas, and Seaborn. Understanding these concepts is crucial for anyone interested in pursuing a career in machine learning and AI. As you explore further into probability and data science, consider how joint probability distributions can be used to reveal deeper insights into relationships between variables in various datasets.