Survival Analysis in Python: A Comprehensive Guide with Examples

Survival Anlysis

Survival analysis is a statistical method for investigating the time until an event of interest occurs, making it invaluable in fields such as medical sciences, engineering, and beyond. With its extensive libraries like NumPy and Matplotlib, Python provides an ideal platform for implementing survival analysis.

From generating random survival data to calculating survival probabilities using the Kaplan-Meier method and visualizing survival curves, Python empowers us to unravel the mysteries of survival analysis.

In this article, we will discuss the concept of survival analysis and observe a simple case.

Survival analysis is a statistical method used to calculate the time until an event of interest occurs. It is applied in various fields, such as engineering and medical sciences, to evaluate the viability of different approaches or treatments. Python provides tools like NumPy and Matplotlib to generate random survival data, calculate survival probabilities using the Kaplan-Meier method, and visualize survival curves for comparison.

Recommended: Non-Parametric Statistics in Python: Exploring Distributions and Hypothesis Testing

Recommended: Understanding Joint Probability Distribution with Python

Introduction to Survival Analysis: Concept and Implementation in Python

Survival analysis calculates the time for an event that is of interest to us. It is used in multiple fields, such as engineering and medical sciences. Let us implement this in Python.

import numpy as np
import matplotlib.pyplot as plt

# --- Generate Random Survival Data with NumPy ---

# Set random seed (optional, for reproducibility)
np.random.seed(42)

# Simulate random durations (time) between 0 and 10
durations = np.random.uniform(low=0, high=10, size=100)

# Simulate random events (0 for no event, 1 for event)
# Adjust probability (p) to control the number of events
events = np.random.binomial(n=1, p=0.3, size=100)  # Assuming 30% event probability

# Sort together by duration (ascending)
data = np.array([durations, events]).T
data = data[data[:, 0].argsort()]

# Separate sorted data
sorted_durations = data[:, 0]
sorted_events = data[:, 1]

# Initialize variables for Kaplan-Meier calculation
n_total = len(sorted_durations)
n_alive = n_total
t = []
s = []  # Survival probability

for i in range(n_total):
    if sorted_events[i] == 1:
        t.append(sorted_durations[i])
        s.append(n_alive / (n_alive + 1))
    n_alive -= 1

# --- Plot Kaplan-Meier Curve (DIY) ---

plt.figure(figsize=(8, 6))
plt.step(t, s, where='post')  # Step function for Kaplan-Meier curve
plt.xlabel("Time")
plt.ylabel("Probability of Survival")
plt.grid(True)
plt.title("Kaplan-Meier Survival Curve (DIY with NumPy)")
plt.show()

# --- Note on Cox Proportional Hazards Model ---

print("Cox Proportional Hazards Model cannot be directly implemented with NumPy alone.")
print("Consider using libraries like lifelines or statsmodels for this analysis.")

Let us look at the output of the code above.

Survival Analysis Output
Plotting Kaplan-Meier Survival Curve using NumPy

Real-World Example: Comparing Survival Probabilities of New Drug vs. Standard Treatment

Let us look at a simple example of where survival analysis is used in real life. So essentially, we have cures for disease, one is a standard drug and the other is a new drug. We run the survival analysis algorithm to determine which is better in this scenario. We have created random data to support our research activity.

import numpy as np
import matplotlib.pyplot as plt

# Set random seed (optional, for reproducibility)
np.random.seed(42)

# Sample sizes (adjust for desired imbalance)
n_drug = 50
n_standard = 70

# Simulate durations (potentially affected by treatment)
# Drug group might have slightly longer survival times on average
drug_durations = np.random.normal(loc=5, scale=1.5, size=n_drug)
standard_durations = np.random.normal(loc=4, scale=1, size=n_standard)

# Simulate events (considering potential treatment effect)
drug_events = np.random.binomial(n=1, p=0.4, size=n_drug)  # Assuming lower event probability for drug
standard_events = np.random.binomial(n=1, p=0.6, size=n_standard)  # Assuming higher event probability for standard

# Simulate treatment assignment (0: Standard, 1: Drug)
treatment = np.concatenate((np.zeros(n_standard), np.ones(n_drug)))

# Combine data into arrays
durations = np.concatenate((standard_durations, drug_durations))
events = np.concatenate((standard_events, drug_events))
sorted_data = np.array([durations, events, treatment]).T  # Combine and sort by duration
sorted_data = sorted_data[sorted_data[:, 0].argsort()]

# Separate sorted data
sorted_durations = sorted_data[:, 0]
sorted_events = sorted_data[:, 1]
treatments = sorted_data[:, 2]  # Separate treatment assignments

# Initialize variables for Kaplan-Meier calculation (per treatment group)
n_total_drug = np.sum(treatments == 1)
n_total_standard = np.sum(treatments == 0)
n_alive_drug = n_total_drug
n_alive_standard = n_total_standard
t_drug = []
s_drug = []  # Survival probability (Drug)
t_standard = []
s_standard = []  # Survival probability (Standard)

current_time = sorted_durations[0]
event_index = 0

while event_index < len(sorted_events):
    if sorted_durations[event_index] > current_time:
        # Update time point
        current_time = sorted_durations[event_index]
        # Update survival probabilities for each group if applicable
        if n_alive_drug > 0:
            s_drug.append(n_alive_drug / (n_total_drug))
        if n_alive_standard > 0:
            s_standard.append(n_alive_standard / (n_total_standard))
        t_drug.append(current_time)
        t_standard.append(current_time)
    else:
        # Handle event (decrement alive for the corresponding treatment group)
        if sorted_events[event_index] == 1 and treatments[event_index] == 1:
            n_alive_drug -= 1
        elif sorted_events[event_index] == 1 and treatments[event_index] == 0:
            n_alive_standard -= 1
    event_index += 1

# Plot Kaplan-Meier curves (DIY) ---

plt.figure(figsize=(8, 6))
plt.step(t_drug, s_drug, where='post', label='New Drug')
plt.step(t_standard, s_standard, where='post', label='Standard Treatment')
plt.xlabel("Time")
plt.ylabel("Probability of Survival")
plt.grid(True)
plt.title("Kaplan-Meier Survival Curves (DIY with NumPy)")
plt.legend()
plt.show()

# --- Note on Statistical Comparison ---

print("This example demonstrates Kaplan-Meier curves. Consider tests")
print("like the Log-Rank test to statistically compare survival between groups.")

Let us look at the output for the same.

Survival Analysis Case Study
Survival Analysis Case Study

Thus, we can see that according to our tests, the new drug has a probability of survival of 1 as compared to standard treatment.

Conclusion

You now have a solid understanding of survival analysis and its implementation in Python. We explored the concept, generated random survival data using NumPy, plotted Kaplan-Meier survival curves, and even delved into a case study comparing the effectiveness of a new drug against a standard treatment.

Survival analysis is a powerful tool for evaluating the viability of different approaches in various domains. With this knowledge, you can apply survival analysis to your projects and make data-driven decisions. So, are you ready to put your survival analysis skills to the test and uncover valuable insights?

Recommended: Probability Distributions with Python (Implemented Examples)

Recommended: 10 PyJanitor’s Miscellaneous Functions for Enhancing Data Cleaning