Picking a good threshold value in binary classification problems is often challenging, and the cut-off we choose depends on the business problem we are solving. If we care more about high specificity, that is a low false positive rate, we pick a higher threshold that keeps the false positive rate very low, even if that means giving up some sensitivity. If, on the other hand, we care more about high sensitivity, that is a high true positive rate, we pick a lower threshold that keeps the true positive rate very high, even if that means accepting more false positives.
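As a quick illustration of this tradeoff, the sketch below uses made-up labels and predicted probabilities (the arrays and numbers are for illustration only, not from the models built later in this post) and computes the true positive rate and false positive rate at two different thresholds.
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # toy labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])  # toy predicted probabilities

def rates(threshold):
    y_pred = (y_score >= threshold).astype(int)
    tpr = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)  # sensitivity
    fpr = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)  # 1 - specificity
    return tpr, fpr

print('threshold 0.7:', rates(0.7))  # higher threshold: lower false positive rate, lower sensitivity
print('threshold 0.3:', rates(0.3))  # lower threshold: higher sensitivity, more false positives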
To evaluate the performance of a model or to compare models, rather than relying on metrics such as accuracy, sensitivity, specificity, precision or the F1 score, it is better to use measures that do not depend on a single cut-off value. The Receiver Operating Characteristic (ROC) curve and the Precision-Recall curve are the two such tools we are going to discuss in this blog post.
An ROC curve is the most commonly used tool for comparing models or evaluating a model's performance, and it does not depend on a single cut-off value. To create an ROC curve, the sensitivity, or true positive rate, of the model is plotted on the y-axis and the false positive rate, or one minus specificity, on the x-axis. The ROC curve always starts at the point (0, 0), which corresponds to a threshold value of 1: with a threshold of 1 you will not catch any positive cases (a sensitivity of 0), but you will correctly label all the negative cases (a false positive rate of 0). The ROC curve always ends at the point (1, 1), which corresponds to a threshold value of 0: with a threshold of 0 you will catch all of the positive cases (a sensitivity of 1), but you will also label all of the negative cases as positive (a false positive rate of 1). The threshold decreases as you move from (0, 0) to (1, 1), so the curve captures all thresholds simultaneously. The higher the threshold, or the closer to (0, 0), the higher the specificity and the lower the sensitivity; the lower the threshold, or the closer to (1, 1), the higher the sensitivity and the lower the specificity.
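To make this concrete, the minimal sketch below (reusing the same toy arrays as above, so the numbers are again only illustrative) prints the (false positive rate, true positive rate, threshold) triples returned by scikit-learn's roc_curve. The first point is (0, 0) at the highest threshold, the last is (1, 1) at the lowest, and the threshold decreases along the curve.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # same toy labels as above
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])  # same toy probabilities as above

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print('fpr = %.2f, tpr = %.2f, threshold = %s' % (f, t, thr))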
The Precision-Recall curve is another tool that does not depend on a single threshold value. In this case, the precision is shown on the y-axis while the sensitivity, also called recall, is shown on the x-axis, and the curve starts at (0, 1). As will be shown below, when the data is imbalanced the ROC curve can be misleading, while the Precision-Recall curve is more informative.
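One useful fact about the Precision-Recall curve is that its no-skill baseline is not the diagonal, as it is for the ROC curve, but a horizontal line at the proportion of positive cases; this is why it reacts so strongly to class imbalance. A minimal sketch with the same toy arrays as above (illustration only):
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # toy labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])  # toy predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print('baseline precision:', y_true.mean())  # precision of a no-skill classifier = positive prevalence
print('precision:', np.round(precision, 2))
print('recall:', np.round(recall, 2))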
Let's generate datasets and build lasso logistic regression models, using grid search with cross-validation for hyper-parameter tuning. First, we will generate balanced data, where the two classes have roughly equal counts, plot the ROC and Precision-Recall curves, and calculate the areas under the curves. Next, we will generate imbalanced data, where about 98% of the labels come from one class. Imbalanced data is very common in classification problems, yet we usually see ROC curves being used to evaluate such models. However, as you will see below, ROC curves are not good tools for imbalanced data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
import collections
import matplotlib.pyplot as plt
%matplotlib inline
Let's generate 100000 samples with 30 features.
X, y = make_classification(n_samples = 100000, n_features = 30, n_classes = 2, weights = [0.5,0.5], random_state = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
collections.Counter(y)
So, as shown above, the labels are more or less balanced.
steps = [('scaler', StandardScaler()),
('logreg', LogisticRegression(penalty = 'l1', solver = 'saga', tol = 1e-6,
max_iter = int(1e6), warm_start = True, n_jobs = -1))]
pipeline = Pipeline(steps)
param_grid = {'logreg__C': np.arange(0.1, 1.0, 0.1)}  # C must be strictly positive, so the grid starts at 0.1
logreg_cv = GridSearchCV(pipeline, param_grid, cv = 5, n_jobs = -1)
logreg_cv.fit(X_train, y_train)
print ('best score:', logreg_cv.best_score_)
print ('best parameter:',logreg_cv.best_params_)
# Refit a standalone lasso logistic regression on the scaled training data,
# using the value of C selected by the grid search, so that we can inspect the coefficients.
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
logreg = LogisticRegression(penalty = 'l1', solver = 'saga', tol = 1e-6, max_iter = int(1e6),
                            warm_start = True, C = logreg_cv.best_params_['logreg__C'])
logreg.fit(X_train_scaled, y_train)
lasso_coef = logreg.coef_.reshape(-1,1)
plt.figure(figsize = (20,10))
plt.plot([0, 29], [0, 0])  # horizontal reference line at zero
plt.plot(range(30), lasso_coef, linestyle = '--', marker = 'o', color = 'r')
plt.xticks(range(30), range(30), rotation = 60, size = 18)
plt.yticks(size = 18)
plt.xlabel('Features', fontsize = 16)
plt.ylabel('Coefficients', fontsize = 16)
plt.title('Feature Coefficients from Lasso Logistic Regression', fontsize = 28)
plt.show();
X_test_scaled = scaler.transform(X_test)
y_pred_prob = logreg.predict_proba(X_test_scaled)[:,1] # return probabilities for the positive outcome only
The dashed blue line is the baseline.
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.figure(figsize = (20,10))
plt.plot([0, 1], [0, 1], linestyle = '--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate', fontsize = 16)
plt.ylabel('True Positive Rate', fontsize = 16)
plt.xticks(size = 18)
plt.yticks(size = 18)
plt.title('Lasso Logistic Regression ROC Curve', fontsize = 28)
plt.show();
round(roc_auc_score(y_test, y_pred_prob), 2)
The dashed orange line is the baseline, drawn at the positive class proportion (about 0.5 here).
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
plt.figure(figsize = (20,10))
plt.plot(recall, precision)
plt.plot([0, 1], [0.5, 0.5], linestyle = '--')  # baseline precision = positive class proportion (about 0.5)
plt.xlabel('Recall', fontsize = 16)
plt.ylabel('Precision', fontsize = 16)
plt.xticks(size = 18)
plt.yticks(size = 18)
plt.title('Lasso Logistic Regression Precision-Recall Curve', fontsize = 28)
plt.show();
round(average_precision_score(y_test, y_pred_prob), 2)
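For reference, a no-skill classifier would have an average precision roughly equal to the proportion of positive cases in the test set, so it is worth printing that number next to the score above:
round(y_test.mean(), 2)  # positive class proportion in the test set, i.e. the no-skill baseline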
Let's generate 100000 samples with 30 features.
X, y = make_classification(n_samples = 100000, n_features = 30, n_classes = 2, weights = [0.99,0.01], random_state = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
collections.Counter(y)
So, as shown above, the labels are imbalanced (about 98% of them are zeros).
steps = [('scaler', StandardScaler()),
('logreg', LogisticRegression(penalty = 'l1', solver = 'saga', tol = 1e-6,
max_iter = int(1e6), warm_start = True, n_jobs = -1))]
pipeline = Pipeline(steps)
param_grid = {'logreg__C': np.arange(0.1, 1.0, 0.1)}  # C must be strictly positive, so the grid starts at 0.1
logreg_cv = GridSearchCV(pipeline, param_grid, cv = 5, n_jobs = -1)
logreg_cv.fit(X_train, y_train)
print ('best score:', logreg_cv.best_score_)
print ('best parameter:',logreg_cv.best_params_)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
logreg = LogisticRegression(penalty = 'l1', solver = 'saga', tol = 1e-6, max_iter = int(1e6),
warm_start = True, C = logreg_cv.best_params_['logreg__C'])
logreg.fit(X_train_scaled, y_train)
lasso_coef = logreg.coef_.reshape(-1,1)
plt.figure(figsize = (20,10))
plt.plot([0, 29], [0, 0])  # horizontal reference line at zero
plt.plot(range(30), lasso_coef, linestyle = '--', marker = 'o', color = 'r')
plt.xticks(range(30), range(30), rotation = 60, size = 18)
plt.yticks(size = 18)
plt.xlabel('Features', fontsize = 16)
plt.ylabel('Coefficients', fontsize = 16)
plt.title('Feature Coefficients from Lasso Logistic Regression', fontsize = 28)
plt.show();
X_test_scaled = scaler.transform(X_test)
y_pred_prob = logreg.predict_proba(X_test_scaled)[:,1] # return probabilities for the positive outcome only
The dashed blue line is the baseline.
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.figure(figsize = (20,10))
plt.plot([0, 1], [0, 1], linestyle = '--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate', fontsize = 16)
plt.ylabel('True Positive Rate', fontsize = 16)
plt.xticks(size = 18)
plt.yticks(size = 18)
plt.title('Lasso Logistic Regression ROC Curve', fontsize = 28)
plt.show();
round(roc_auc_score(y_test, y_pred_prob), 2)
The dashed blue line is the baseline, drawn at the positive class proportion (about 0.02 here).
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
plt.figure(figsize = (20,10))
plt.plot([0, 1], [y_test.mean(), y_test.mean()], linestyle = '--')  # baseline precision = positive class proportion (about 0.02)
plt.plot(recall, precision)
plt.xlabel('Recall', fontsize = 16)
plt.ylabel('Precision', fontsize = 16)
plt.xticks(size = 18)
plt.yticks(size = 18)
plt.title('Lasso Logistic Regression Precision-Recall Curve', fontsize = 28)
plt.show();
round(average_precision_score(y_test, y_pred_prob), 2)
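Again, for comparison, the no-skill average precision is simply the proportion of positive cases in the (now heavily imbalanced) test set:
round(y_test.mean(), 2)  # positive class proportion, about 0.02 given the roughly 98% / 2% split above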
From the results above, when the data is imbalanced, the area under the ROC curve and the area under the Precision-Recall curve are very different, and the Precision-Recall curve is more informative than the ROC curve.
Even though the ROC curve and the area under it are commonly used to evaluate model performance on both balanced and imbalanced datasets, as shown in this blog post, if your data is imbalanced the Precision-Recall curve and the area under it are more informative than the ROC curve and the area under the ROC curve. In fact, the ROC curve can be misleading for binary classification problems with imbalanced data.