Precision and Recall in Python

When you build a classification model, accuracy is the number most people reach for first. It tells you what fraction of predictions were correct. That number feels good when it is high and bad when it is low. But accuracy hides a lot of sins.

Imagine a system that flags fraudulent transactions. If fraud shows up in 0.1 percent of transactions, a model that predicts “not fraud” every single time hits 99.9 percent accuracy. That model is useless. It has learned nothing about what fraud actually looks like.
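A quick sketch of that arithmetic, assuming a made-up batch of 100,000 transactions (the counts are illustrative, not real data):

```python
# Hypothetical batch: 100,000 transactions, 0.1 percent of them fraudulent.
n_total = 100_000
n_fraud = 100

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # a "model" that always predicts "not fraud"

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Fraction of the real fraud cases that the model flagged.
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / n_fraud

print(f"Accuracy: {accuracy:.3f}")          # 0.999
print(f"Fraud caught: {fraud_caught:.3f}")  # 0.000
```

The accuracy looks excellent while the share of fraud actually caught is zero, which is the gap the rest of this article addresses.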

Precision and recall exist because some mistakes are worse than others. This article shows you what they measure, how to compute them in Python, and when to care about one over the other.

  • Precision measures how many positive predictions were actually correct
  • Recall measures how many actual positives your model caught
  • Push one higher and the other tends to drop—this trade-off is fundamental
  • F1 score balances both into a single metric using harmonic mean
  • Threshold tuning lets you pick your operating point on the precision-recall curve

What Is Precision

Precision answers a specific question: of all the positive predictions your model made, how many were actually correct?

The formula is straightforward.

Precision = True Positives / (True Positives + False Positives)

A false positive means the model said yes when the answer was no. In the fraud example, a false positive flags a legitimate transaction as fraudulent. That is annoying but recoverable. The customer complains, you unblock the transaction, life goes on.

True positives are the predictions that hit the mark. The model said fraud, and fraud is what happened.

Here is a simple implementation, first in plain Python and then with scikit-learn.

def precision(tp, fp):
    if tp + fp == 0:
        return 0.0
    return tp / (tp + fp)

print(precision(tp=120, fp=15))  # 0.8888...

from sklearn.metrics import precision_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

score = precision_score(y_true, y_pred)
print(score)  # 0.6 (3 true positives out of 5 positive predictions)

Sklearn accepts the same two arrays you would expect. One holds the ground truth labels and the other holds what your model predicted.

High precision means fewer false alarms. When you optimise for precision, your model becomes conservative about calling something positive. It prefers to stay silent rather than speak incorrectly.

What Is Recall

Recall answers a different question: of all the actual positives in your data, how many did your model catch?

Recall = True Positives / (True Positives + False Negatives)

A false negative is a miss. The model said no when the answer was yes. In fraud detection, a false negative means a real fraudulent transaction went through unnoticed. That is direct financial loss. In medical screening, a false negative might mean a patient leaves with an undiagnosed condition.

True positives again are the hits. Recall measures how complete your detection is.

def recall(tp, fn):
    if tp + fn == 0:
        return 0.0
    return tp / (tp + fn)

print(recall(tp=120, fn=30))  # 0.8

from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

score = recall_score(y_true, y_pred)
print(score)  # 0.6 (3 true positives out of 5 actual positives)

High recall means catching more of the real positives. When you optimise for recall, your model casts a wide net. It prefers to flag borderline cases rather than let real positives slip through.

Why You Need Both

Precision and recall sit in tension. Push one higher and the other tends to drop. This is not a bug; it is a fundamental trade-off.

Consider a spam filter. High precision means the filter rarely marks good email as spam. Legitimate mail reliably reaches your inbox. But some spam sneaks through because the model is conservative about flagging anything. Low precision in spam filtering means good emails get blocked, which creates angry customers.

High recall in a spam filter catches most of the spam. Very few spam emails escape. But the filter also flags legitimate email more aggressively, and people miss important messages.

Neither extreme is correct by default. Your use case determines which error costs more.

Medical diagnosis illustrates this clearly. A false negative, missing a disease, can cost someone their health or their life. A false positive leads to follow-up tests and anxiety, but it does not usually kill anyone. In cancer screening, recall matters enormously. You would rather do extra tests on healthy patients than miss a single case.

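Spam filtering sits on the other side. Misflagging an important email as spam causes immediate problems, so precision matters more there. Fraud detection usually leans toward recall, since a missed fraud is a direct loss, but blocking a legitimate transaction also causes immediate problems, so precision cannot be ignored.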

Confusion Matrix: The Full Picture

A confusion matrix arranges all four outcomes in one table. It gives you the full picture of what your classifier is doing.

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP: {tp}, TN: {tn}, FP: {fp}, FN: {fn}")

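The matrix is conventionally drawn like this:

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

Note that scikit-learn orders rows and columns by label value, so for 0/1 labels its array is laid out as [[TN, FP], [FN, TP]]. That is why ravel() returns the values in the order tn, fp, fn, tp.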

From these four numbers, you can derive precision, recall, accuracy, specificity, and every other classification metric you might need.

from sklearn.metrics import confusion_matrix

def confusion_stats(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    stats = {
        "precision": tp / (tp + fp) if (tp + fp) > 0 else 0,
        "recall": tp / (tp + fn) if (tp + fn) > 0 else 0,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (tn + fp) if (tn + fp) > 0 else 0,
    }
    return stats

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]

for k, v in confusion_stats(y_true, y_pred).items():
    print(f"{k}: {v:.4f}")

Sklearn also gives you a classification report that prints all the key metrics at once.

from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred))

The output shows precision, recall, and f1-score for each class, plus overall accuracy.

The F1 Score: Balancing Both

When you need a single number that captures the trade-off between precision and recall, the F1 score is the most common choice. It is the harmonic mean of the two.

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Harmonic mean penalises extreme imbalance. If precision is 1.0 and recall is 0.0, the arithmetic mean is 0.5. The harmonic mean is 0.0. The F1 score refuses to let you hide a zero in one column by posting a perfect score in the other.
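To make the comparison concrete, here is a small sketch that computes both means for a few precision and recall pairs. The helper names are illustrative and not part of any library.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2


def harmonic_mean(p, r):
    # Defined as 0 when either input is 0, which also avoids dividing by zero.
    if p == 0 or r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Illustrative (precision, recall) pairs, from extremely lopsided to balanced.
pairs = [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5)]

for p, r in pairs:
    print(f"P={p}, R={r} -> arithmetic {arithmetic_mean(p, r):.2f}, "
          f"harmonic {harmonic_mean(p, r):.2f}")
```

All three pairs share an arithmetic mean of 0.5, but the harmonic mean only reaches 0.5 when the two values are equal.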

from sklearn.metrics import f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

score = f1_score(y_true, y_pred)
print(score)  # 0.666...

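F1 weights precision and recall equally, but it rewards balance. A precision of 0.8 and a recall of 0.8 give an F1 of exactly 0.8, while 0.9 and 0.7 give about 0.79. The further apart the two values drift, the more the harmonic mean pulls the score below their arithmetic mean.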

Sometimes you want a different balance. F-beta lets you weight recall more heavily than precision, or the reverse.

from sklearn.metrics import fbeta_score

# beta=2 emphasises recall (FN costs more than FP)
f2 = fbeta_score(y_true, y_pred, beta=2)

# beta=0.5 emphasises precision (FP costs more than FN)
f05 = fbeta_score(y_true, y_pred, beta=0.5)

print(f"F2: {f2:.4f}, F0.5: {f05:.4f}")

A beta of 2 makes recall twice as important as precision in the final score. A beta of 0.5 makes precision twice as important. Pick the beta that matches your cost function.
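For reference, the standard F-beta formula is (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below applies it by hand to one illustrative precision and recall pair. The function name and the input values are assumptions chosen for the example.

```python
def fbeta(precision, recall, beta):
    """F-beta from precision and recall; returns 0.0 when undefined."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Illustrative values: a model that is precise but misses many positives.
p, r = 0.9, 0.6

for beta in (0.5, 1, 2):
    print(f"beta={beta}: {fbeta(p, r, beta):.4f}")
```

With precision above recall, the score falls as beta grows, because larger beta values put more weight on the weaker recall figure.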

Threshold Tuning: Moving the Line

Until now, the assumption has been that your model outputs a hard class label. But most classifiers output probabilities. A spam filter does not say yes or no. It outputs a score between 0 and 1 indicating how confident it is that a given email is spam.

That score is a dial you can move. By default, the threshold sits at 0.5. Anything above 0.5 gets predicted as positive. Move the threshold lower and you flag more things as positive. Recall goes up. Move it higher and you only flag high-confidence cases. Precision usually goes up, though not always smoothly.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

y_proba = model.predict_proba(X_test)[:, 1]

precisions, recalls, thresholds = [], [], []

for thresh in np.arange(0.1, 1.0, 0.05):
    y_pred_thresh = (y_proba >= thresh).astype(int)
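    # zero_division=0 avoids a warning if a high threshold yields no positive predictions
    precisions.append(precision_score(y_test, y_pred_thresh, zero_division=0))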
    recalls.append(recall_score(y_test, y_pred_thresh))
    thresholds.append(thresh)

for i, t in enumerate(thresholds):
    print(f"Threshold {t:.2f} -> Precision: {precisions[i]:.3f}, Recall: {recalls[i]:.3f}")

This loop prints how precision and recall move as you shift the threshold. You can plot the precision-recall curve to visualise the trade-off.

plt.figure(figsize=(8, 6))
plt.plot(recalls, precisions, marker='o')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Trade-off')
plt.grid(True)
plt.show()

The curve tells you exactly how much precision you sacrifice to gain a unit of recall at every operating point. If your product manager tells you the system must catch 95 percent of positives, the curve shows you what precision to expect at that threshold.
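One way to look up such an operating point is with scikit-learn's precision_recall_curve, which returns precision and recall for every distinct score threshold. The sketch below is a minimal example on hand-made scores. The helper threshold_for_recall and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_recall(y_true, y_score, min_recall):
    """Return (threshold, precision, recall) for the highest threshold
    whose recall is still at least min_recall, or None if there is none."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall have one more entry than thresholds; the final
    # entry (precision=1, recall=0) has no threshold, so it is dropped here.
    ok = np.where(recall[:-1] >= min_recall)[0]
    if ok.size == 0:
        return None
    i = ok[-1]  # thresholds are sorted ascending, so the last match is the highest
    return float(thresholds[i]), float(precision[i]), float(recall[i])


# Toy data: 6 positives and 4 negatives with made-up scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_score = np.array([0.05, 0.2, 0.3, 0.35, 0.5, 0.6, 0.65, 0.8, 0.9, 0.95])

result = threshold_for_recall(y_true, y_score, min_recall=0.8)
print(result)
```

Picking the highest qualifying threshold keeps precision as high as the recall constraint allows.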

Precision-Recall vs ROC AUC

The ROC curve plots true positive rate (which is recall) against false positive rate. ROC AUC summarises the curve as a single number. It answers the question: if you pick a random positive and a random negative, what is the probability that the positive ranks higher than the negative?
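That ranking interpretation can be checked directly by comparing every positive score against every negative score. The sketch below does this in plain Python, counting ties as half a win. The function name and the scores are made up for illustration.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.
    Ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative model scores for 3 positives and 4 negatives.
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2, 0.1]

auc = pairwise_auc(pos_scores, neg_scores)
print(f"Pairwise AUC: {auc:.4f}")  # 0.9167 (11 of 12 pairs ranked correctly)
```

The nested loop is quadratic, so it is only meant to show the definition. For real data, roc_auc_score is the practical choice.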

ROC AUC works well when your classes are roughly balanced. When they are not, ROC AUC can mislead you.

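from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split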

# Imbalanced dataset: 5% positives, 95% negatives
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

y_proba = model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, y_proba)
ap_score = average_precision_score(y_test, y_proba)

print(f"ROC AUC: {roc_auc:.4f}")
print(f"Average Precision (AP): {ap_score:.4f}")

Average precision approximates the area under the precision-recall curve. scikit-learn computes it as the weighted mean of the precision at each threshold, where each weight is the increase in recall from the previous threshold. For imbalanced data, average precision is often a more honest metric than ROC AUC because true negatives never enter precision or recall, so the model gets no credit for correctly rejecting the majority class it mostly encounters anyway.
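As a rough illustration, average precision can be computed by hand as the sum of precision at each rank, weighted by the increase in recall at that rank. The sketch below assumes binary 0/1 labels and no tied scores. The function name and the data are made up for the example.

```python
def average_precision(y_true, y_score):
    """Sum over ranks of precision-at-rank times the recall gained at that rank.
    Assumes 0/1 labels and no tied scores."""
    ranked = sorted(zip(y_score, y_true), reverse=True)  # highest score first
    n_pos = sum(y_true)
    hits = 0
    ap = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Recall rises by 1/n_pos only when a positive is found,
            # so only those ranks contribute to the sum.
            ap += (hits / k) * (1 / n_pos)
    return ap


# Illustrative ranking: positives at ranks 1, 3 and 4.
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5]

ap = average_precision(y_true, y_score)
print(f"Average precision: {ap:.4f}")  # 0.8056
```

Here the three positives contribute precisions of 1/1, 2/3 and 3/4, and their mean is the result.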

Multiclass Considerations

Everything discussed so far assumes binary classification. Precision and recall extend naturally to multiple classes, but the calculation requires care.

from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [0, 2, 2, 1, 1, 0, 2, 0, 1, 2]

# Average over all classes equally
precision_macro = precision_score(y_true, y_pred, average='macro')
recall_macro = recall_score(y_true, y_pred, average='macro')

# Weighted by support (number of true instances per class)
precision_weighted = precision_score(y_true, y_pred, average='weighted')
recall_weighted = recall_score(y_true, y_pred, average='weighted')

print(f"Precision (macro): {precision_macro:.4f}")
print(f"Recall (macro): {recall_macro:.4f}")

Macro averaging computes precision for each class independently, then takes the unweighted mean. Weighted averaging accounts for class imbalance by weighting each class by its support. Micro averaging aggregates all true positives, false positives, and false negatives across classes before computing the metrics.

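Each averaging strategy tells you something different. Macro treats every class equally, so weak performance on a rare class drags the score down visibly. Weighted shows you performance as experienced by the average instance, which means large classes dominate. Micro pools every prediction, and for single-label multiclass problems micro precision and micro recall both equal accuracy.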
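To see how the three averages differ, the sketch below computes per-class precision by hand for the same labels used above, then combines it in each of the three ways. It assumes every predicted label also appears in y_true. The variable names are illustrative.

```python
y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [0, 2, 2, 1, 1, 0, 2, 0, 1, 2]
classes = sorted(set(y_true))

tp = {c: 0 for c in classes}       # correct predictions of class c
fp = {c: 0 for c in classes}       # wrong predictions of class c
support = {c: 0 for c in classes}  # true instances of class c

for t, p in zip(y_true, y_pred):
    support[t] += 1
    if t == p:
        tp[t] += 1
    else:
        fp[p] += 1  # class p was predicted, but the true class was t

per_class = {
    c: tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) > 0 else 0.0
    for c in classes
}

# Macro: unweighted mean of the per-class precisions.
macro = sum(per_class.values()) / len(classes)
# Weighted: per-class precisions weighted by support.
weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
# Micro: pool the counts across classes, then compute once.
micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

print(f"Per class: {per_class}")
print(f"Macro: {macro:.4f}, Weighted: {weighted:.4f}, Micro: {micro:.4f}")
```

On this data, macro comes out lower than the other two because class 1, with a precision of one third, counts as much as the better-performing classes.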

Using These Metrics in Practice

Start every classification project by asking two questions. What does a false positive cost me? What does a false negative cost me? The answer pins down whether you need high precision, high recall, or a carefully balanced F1.

If false positives are expensive, tune for precision. If false negatives are expensive, tune for recall. If both costs are similar, aim for a high F1 score.
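If you can put rough numbers on those two costs, one simple approach is to score each candidate threshold by its total cost and pick the cheapest. The sketch below is a minimal example. The cost values, the candidate thresholds, and the toy scores are all hypothetical.

```python
def total_cost(y_true, y_score, threshold, cost_fp, cost_fn):
    """Total cost of the errors made at a given threshold."""
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s < threshold)
    return fp * cost_fp + fn * cost_fn


# Toy data: 6 positives and 4 negatives with made-up scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
y_score = [0.05, 0.2, 0.3, 0.35, 0.5, 0.6, 0.65, 0.8, 0.9, 0.95]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

# Case 1: a false negative costs ten times as much as a false positive.
best_when_fn_expensive = min(
    candidates,
    key=lambda t: total_cost(y_true, y_score, t, cost_fp=1, cost_fn=10),
)

# Case 2: a false positive costs ten times as much as a false negative.
best_when_fp_expensive = min(
    candidates,
    key=lambda t: total_cost(y_true, y_score, t, cost_fp=10, cost_fn=1),
)

print(best_when_fn_expensive)  # 0.3
print(best_when_fp_expensive)  # 0.7
```

Expensive false negatives push the chosen threshold down toward higher recall, and expensive false positives push it up toward higher precision.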

Once you have a baseline model, plot your precision-recall curve. Identify the operating point where your chosen metric crosses your minimum acceptable threshold. If recall must stay above 90 percent for regulatory reasons, find the threshold that delivers 90 percent recall and note the precision you will have to accept.

This process does not end at deployment. Monitor your confusion matrix in production. Class distributions shift over time. A model that delivered precision above 95 percent six months ago might be sitting at 85 percent today without you noticing. Drift detection on your input features and label distributions can catch these regressions before they become incidents.

FAQ

What is the difference between precision and recall?

Precision measures how many of your positive predictions were correct. Recall measures how many of the actual positives you caught. Precision is about quality of predictions. Recall is about completeness of detection.

When should I prioritise precision over recall?

Whenever a false positive is more costly than a false negative. Examples include spam filters where blocking legitimate email causes immediate user harm, content moderation where wrongly removing posts frustrates users, and hiring screens where advancing an unsuitable candidate wastes an interview slot.

When should I prioritise recall over precision?

Whenever a false negative is more costly than a false positive. Medical diagnosis, fraud detection, and safety-critical systems all fit this pattern. Missing a case costs more than an extra investigation.

What does an F1 score of 1.0 mean?

A perfect F1 score means both precision and recall are 1.0. Your model catches every positive and never produces a false alarm. This is rarely achievable outside trivially simple problems. In practice, expect to trade off between the two.

Can precision or recall be negative?

Both metrics are bounded between 0 and 1. A value of 0 means the model produced no true positives at all. Because both formulas divide non-negative counts, negative values are mathematically impossible.

Why does F1 use harmonic mean instead of arithmetic mean?

The harmonic mean penalises extreme imbalance more aggressively than the arithmetic mean. If one metric is 0, the harmonic mean is 0, while the arithmetic mean could still be as high as 0.5. This property forces models to perform well on both axes, not just one.

What is the best metric for imbalanced datasets?

Average precision, a summary of the area under the precision-recall curve, is usually more informative than accuracy or ROC AUC when classes are imbalanced. It summarises model quality across all thresholds without being inflated by correct predictions on the majority class.

How do I choose a classification threshold?

List your constraints. If you need a minimum recall for compliance, find the highest threshold that still delivers that recall, since that gives you the best precision available under the constraint. If precision has a floor, find the lowest threshold that keeps precision above that floor, which preserves as much recall as possible. The threshold is a business decision as much as a technical one.

Abhishek Wasnik