Precision
Precision in machine learning is a metric used to evaluate the quality of a classification model. It measures the proportion of true positive predictions (i.e., the number of correct positive predictions) out of all the positive predictions made by the model. In other words, precision is the ratio of true positives to the sum of true positives and false positives. It is calculated as:
Precision = True Positives / (True Positives + False Positives)
A high precision value indicates that when the model predicts the positive class, it is usually correct; in other words, the model makes very few false positive predictions. A low precision value, on the other hand, indicates that many of the model's positive predictions are in fact negative instances, so its positive predictions are less trustworthy.
Precision is often used in combination with other metrics such as recall and F1 score to evaluate the overall performance of a classification model. While precision is an important metric, it may not be the only one that is relevant for a particular problem, and other metrics may need to be considered as well.
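As a quick sanity check of the formula, precision can be computed by hand and compared against scikit-learn's precision_score; the toy y_true and y_pred labels below are made up purely for illustration.
from sklearn.metrics import precision_score
# toy labels: the model makes 4 positive predictions, 3 correct (TP=3) and 1 wrong (FP=1)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0]
# precision = TP / (TP + FP) = 3 / (3 + 1) = 0.75
print(precision_score(y_true, y_pred))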
Recall
Recall, in machine learning, is a metric used to evaluate the quality of a classification model. It measures the proportion of true positive predictions (i.e., the number of correct positive predictions) out of all the actual positive instances in the dataset. In other words, recall is the ratio of true positives to the sum of true positives and false negatives. It is calculated as:
Recall = True Positives / (True Positives + False Negatives)
A high recall value indicates that the model correctly identifies most of the actual positive instances, and therefore makes few false negative predictions. A low recall value, on the other hand, indicates that the model misses a large number of actual positive instances.
Recall is often used in combination with other metrics such as precision and F1 score to evaluate the overall performance of a classification model. While recall is an important metric, it may not be the only one that is relevant for a particular problem, and other metrics may need to be considered as well.
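Again as a sanity check, recall can be verified with scikit-learn's recall_score on made-up toy labels. Note that this hypothetical model has perfect precision (every positive prediction is correct) yet only 50% recall, which is exactly why the two metrics are usually reported together.
from sklearn.metrics import recall_score
# toy labels: 4 actual positives, but the model finds only 2 of them (TP=2, FN=2)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
# recall = TP / (TP + FN) = 2 / (2 + 2) = 0.5
print(recall_score(y_true, y_pred))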
How to Calculate Precision, Recall, F1, and More for Deep Learning Models?
Once you fit a deep learning neural network model, you must evaluate its performance on a test dataset.
This tutorial is divided into three parts; they are:
- Binary Classification Problem
- Multilayer Perceptron Model
- How to Calculate Model Metrics
1. Binary Classification Problem
In machine learning, a binary classification problem is a type of supervised learning problem where the goal is to classify input data into one of two classes. The two classes are often represented as 0 and 1, or as negative and positive.
For example, a binary classification problem can be used to predict whether an email is spam or not, whether a credit card transaction is fraudulent or not, or whether a patient has a disease or not.
The input data for a binary classification problem consists of features that describe each instance. The features could be numerical, categorical, or a combination of both. The output of the model is a binary label that indicates the predicted class of the input data.
To train a binary classification model, a labelled dataset is used, where each instance has a known label that indicates the correct class. The model is then trained to learn patterns in the input features that are associated with the class labels.
There are many algorithms that can be used to solve binary classification problems, including logistic regression, support vector machines, decision trees, and neural networks. The performance of a binary classification model is typically evaluated using metrics such as accuracy, precision, recall, F1 score, and area under the ROC curve.
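As a concrete illustration, a binary classification dataset is simply a feature matrix paired with 0/1 labels. Here is a minimal sketch using make_circles, the same generator used in the worked example later in this tutorial:
from sklearn.datasets import make_circles
from numpy import unique
# 1,000 instances, each with two numerical features and a binary class label
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
print(X.shape)   # (1000, 2): two input features per instance
print(unique(y)) # [0 1]: the two class labels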
2. Multilayer Perceptron Model
A multilayer perceptron (MLP) is a type of feedforward neural network that is commonly used for supervised learning tasks such as classification and regression. It consists of an input layer, one or more hidden layers, and an output layer.
Each layer in the MLP consists of one or more artificial neurons, also known as nodes or units. Each neuron receives inputs from the previous layer, applies a transformation to the input, and produces an output that is passed on to the next layer.
The transformation applied by each neuron is typically a nonlinear activation function, such as the sigmoid function, the hyperbolic tangent function, or the rectified linear unit (ReLU) function.
During training, the weights and biases of the MLP are adjusted using an optimization algorithm, such as stochastic gradient descent, to minimize the difference between the predicted outputs and the true outputs of the training examples.
MLPs are powerful models that can learn complex patterns in high-dimensional data, but they are also prone to overfitting if the number of neurons or the number of hidden layers is too large. Regularization techniques such as dropout and weight decay can be used to prevent overfitting.
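For example, dropout can be added between the layers of a Keras MLP. The sketch below mirrors the model defined later in this tutorial with a dropout layer inserted; the 0.5 rate is an arbitrary illustrative value, not a recommendation.
from keras.models import Sequential
from keras.layers import Dense, Dropout
# the same small MLP used later, with dropout added to reduce overfitting
model = Sequential()
model.add(Dense(100, input_shape=(2,), activation='relu'))
model.add(Dropout(0.5))  # randomly zero out 50% of the activations during training
model.add(Dense(1, activation='sigmoid'))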
MLPs have been successfully applied to a wide range of problems, including image classification, natural language processing, and speech recognition.
3. How to Calculate Model Metrics
Perhaps you need to evaluate your deep learning neural network model using additional metrics that are not supported by the Keras metrics API.
The Keras metrics API is limited and you may want to calculate metrics such as precision, recall, F1, and more.
One approach to calculating new metrics is to implement them yourself in the Keras API and have Keras calculate them for you during model training and during model evaluation.
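As a rough sketch of that first approach, a custom metric can be written with Keras backend functions and passed to compile(). The f1_metric function below is only an illustration, and because Keras evaluates metrics batch by batch, the value it reports is a batch-wise approximation rather than the exact F1 over the whole dataset.
from keras import backend as K

def f1_metric(y_true, y_pred):
    # round predicted probabilities to crisp 0/1 labels
    y_pred = K.round(y_pred)
    tp = K.sum(y_true * y_pred)
    precision = tp / (K.sum(y_pred) + K.epsilon())
    recall = tp / (K.sum(y_true) + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())

# e.g. model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', f1_metric])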
A much simpler alternative is to use your final model to make a prediction for the test dataset, then calculate any metric you wish using the scikit-learn metrics API.
Three metrics, in addition to classification accuracy, that are commonly required for a neural network model on a binary classification problem are:
- Precision
- Recall
- F1 Score
In this section, we will calculate these three metrics, as well as classification accuracy using the scikit-learn metrics API, and we will also calculate three additional metrics that are less common but may be useful. They are:
- Cohen’s Kappa
- ROC AUC
- Confusion Matrix
This is not a complete list of metrics for classification models supported by scikit-learn; nevertheless, calculating these metrics will show you how to calculate any metrics you may require using the scikit-learn API.
The example in this section will calculate metrics for an MLP model, but the same code for calculating metrics can be used for other models, such as RNNs and CNNs.
We need code for preparing the dataset, as well as for defining and fitting the model. To keep the example simple, we will put the code for these steps into simple functions.
First, we can define a function called get_data() that will generate the dataset and split it into train and test sets.
# generate and prepare the dataset
def get_data():
    # generate dataset
    X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
    # split into train and test
    n_test = 500
    trainX, testX = X[:n_test, :], X[n_test:, :]
    trainy, testy = y[:n_test], y[n_test:]
    return trainX, trainy, testX, testy
Next, we will define a function called get_model() that will define the MLP model and fit it on the training dataset.
# define and fit the model
def get_model(trainX, trainy):
    # define model
    model = Sequential()
    model.add(Dense(100, input_shape=(2,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    model.fit(trainX, trainy, epochs=300, verbose=0)
    return model
We can then call the get_data() function to prepare the dataset and the get_model() function to fit and return the model.
# generate data
trainX, trainy, testX, testy = get_data()
# fit model
model = get_model(trainX, trainy)
Now that we have a model fit on the training dataset, we can evaluate it using metrics from the scikit-learn metrics API.
First, we must use the model to make predictions. Most of the metric functions require a comparison between the true class values (e.g. testy) and the predicted class values (yhat_classes). Some metrics, like the ROC AUC, instead require the predicted class probabilities (yhat_probs).
The class probabilities can be retrieved by calling the predict() function on the model, and crisp class labels can then be obtained by thresholding those probabilities at 0.5. (Older versions of Keras provided a predict_classes() convenience function on Sequential models, but it has been removed from recent releases.)
We can make the probability and class predictions with the model.
# predict probabilities for test set
yhat_probs = model.predict(testX, verbose=0)
# predict crisp classes for test set by rounding the probabilities
yhat_classes = (yhat_probs > 0.5).astype('int32')
The predictions are returned in a two-dimensional array, with one row for each example in the test dataset and one column for the prediction.
The scikit-learn metrics API expects a 1D array of actual and predicted values for comparison; therefore, we must reduce the 2D prediction arrays to 1D arrays.
# reduce to 1d array
yhat_probs = yhat_probs[:, 0]
yhat_classes = yhat_classes[:, 0]
We are now ready to calculate metrics for our deep learning neural network model. We can start by calculating the classification accuracy, precision, recall, and F1 scores.
# accuracy: (tp + tn) / (p + n)
accuracy = accuracy_score(testy, yhat_classes)
print('Accuracy: %f' % accuracy)
# precision: tp / (tp + fp)
precision = precision_score(testy, yhat_classes)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(testy, yhat_classes)
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(testy, yhat_classes)
print('F1 score: %f' % f1)
Notice that calculating a metric is as simple as choosing the metric that interests us and calling the function passing in the true class values (testy) and the predicted class values (yhat_classes).
We can also calculate some additional metrics, such as the Cohen’s kappa, ROC AUC, and confusion matrix.
Notice that the ROC AUC requires the predicted class probabilities (yhat_probs) as an argument instead of the predicted classes (yhat_classes).
# kappa
kappa = cohen_kappa_score(testy, yhat_classes)
print('Cohens kappa: %f' % kappa)
# ROC AUC
auc = roc_auc_score(testy, yhat_probs)
print('ROC AUC: %f' % auc)
# confusion matrix
matrix = confusion_matrix(testy, yhat_classes)
print(matrix)
Now that we know how to calculate metrics for a deep learning neural network using the scikit-learn API, we can tie all of these elements together into a complete example, listed below.
# demonstration of calculating metrics for a neural network model using sklearn
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from keras.models import Sequential
from keras.layers import Dense
# generate and prepare the dataset
def get_data():
    # generate dataset
    X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
    # split into train and test
    n_test = 500
    trainX, testX = X[:n_test, :], X[n_test:, :]
    trainy, testy = y[:n_test], y[n_test:]
    return trainX, trainy, testX, testy
# define and fit the model
def get_model(trainX, trainy):
    # define model
    model = Sequential()
    model.add(Dense(100, input_shape=(2,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    model.fit(trainX, trainy, epochs=300, verbose=0)
    return model
# generate data
trainX, trainy, testX, testy = get_data()
# fit model
model = get_model(trainX, trainy)
# predict probabilities for test set
yhat_probs = model.predict(testX, verbose=0)
# predict crisp classes for test set by rounding the probabilities
yhat_classes = (yhat_probs > 0.5).astype('int32')
# reduce to 1d array
yhat_probs = yhat_probs[:, 0]
yhat_classes = yhat_classes[:, 0]
# accuracy: (tp + tn) / (p + n)
accuracy = accuracy_score(testy, yhat_classes)
print('Accuracy: %f' % accuracy)
# precision: tp / (tp + fp)
precision = precision_score(testy, yhat_classes)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(testy, yhat_classes)
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(testy, yhat_classes)
print('F1 score: %f' % f1)
# kappa
kappa = cohen_kappa_score(testy, yhat_classes)
print('Cohens kappa: %f' % kappa)
# ROC AUC
auc = roc_auc_score(testy, yhat_probs)
print('ROC AUC: %f' % auc)
# confusion matrix
matrix = confusion_matrix(testy, yhat_classes)
print(matrix)
Running the example prepares the dataset, fits the model, then calculates and reports the metrics for the model evaluated on the test dataset.
If you need help interpreting a given metric, perhaps start with the classification metrics guide in the scikit-learn API documentation.
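For instance, if the raw confusion matrix is hard to read, keep in mind that scikit-learn lays it out with actual classes as rows and predicted classes as columns, so for a binary problem the counts can be unpacked by name. A small sketch, reusing the confusion_matrix import and the testy and yhat_classes arrays from the listing above:
# split the 2x2 confusion matrix into named counts
tn, fp, fn, tp = confusion_matrix(testy, yhat_classes).ravel()
print('TN=%d, FP=%d, FN=%d, TP=%d' % (tn, fp, fn, tp))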