Model evaluation metrics in Machine Learning

Lana Dominković
7 min read · Jan 11, 2021


In this article we are going to focus on two questions:

  1. How well is our model doing? (We are going to learn metrics that will tell us whether the model is good or not.)
  2. How do we improve the model based on these metrics?

Let’s look at a real-world scenario and compare it to a problem in Machine Learning. Say our car is broken (the problem), and in order to fix it we need some tool (the algorithms). But how do we know which tool is best for fixing the car? For that we have a set of measurement tools, and we use them to evaluate which tool (algorithm) will fix the car in the best possible way (i.e. which algorithm will perform best). Once we have evaluated the algorithms and parameters and decided which are best for our problem, we go ahead and solve it.

In this article we are going to focus on those measurement tools: we will go through some of the metrics and techniques that tell us when an algorithm is working well with the data, and how to tweak it to work best for our problem.

If we are fitting our model to predict categorical data (spam / not spam), there are different measures for understanding how well our model is performing than if we are predicting numeric values (e.g. the price of a home).

Before we start with the metrics used in classification and regression problems, let’s see what a confusion matrix is.

Confusion matrix

A confusion matrix describes the performance of a classification model. It is represented as a table that stores the following values (a short code sketch of how to obtain them follows the list):

  • True Positive: an outcome where the model correctly predicts the positive class as positive.
  • False Negative: an outcome where the model incorrectly predicts the positive class as negative.
  • False Positive: an outcome where the model incorrectly predicts the negative class as positive.
  • True Negative: an outcome where the model correctly predicts the negative class as negative.
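As a quick illustration, here is a minimal sketch of how these four values can be obtained with scikit-learn; the label vectors below are made up purely for demonstration.

```python
from sklearn.metrics import confusion_matrix

# Ground-truth labels and model predictions (toy, made-up data; 1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=3, FN=1, FP=1, TN=3
```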

Classification Metrics

- Accuracy

Accuracy is often used to compare models, as it tells us the proportion of observations we correctly labeled.

Accuracy is not always the best metric to use, and often it is not the only metric we should be optimising on. This is especially the case when you have class imbalance in your data: optimising on accuracy alone can be misleading about how well your model is truly performing.
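For example, here is a rough sketch (with made-up, heavily imbalanced labels) of how a model that only ever predicts the majority class can still score a high accuracy:

```python
from sklearn.metrics import accuracy_score

# 95 negatives and 5 positives (toy, imbalanced data)
y_true = [0] * 95 + [1] * 5
# A useless "model" that always predicts the negative class
y_pred = [0] * 100

# Accuracy looks great even though not a single positive case was found
print(accuracy_score(y_true, y_pred))  # 0.95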

- Precision

Precision focuses on the predicted “positive” values in our dataset. It answers the following question: out of all the points predicted to be positive, how many of them are actually positive? By optimising on precision, we are determining whether we are doing a good job of predicting the positive values, as opposed to mislabelling negative values as positive.
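In terms of the confusion matrix, precision is TP / (TP + FP). A minimal sketch, again with made-up labels:

```python
from sklearn.metrics import precision_score

# Toy labels (made up): 1 = positive class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = TP / (TP + FP) = 3 / (3 + 1)
print(precision_score(y_true, y_pred))  # 0.75
```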

- Recall (Sensitivity)

Recall focuses on the actual “positive” values in our dataset. It answers the following question: out of the points that are actually positive, how many of them are correctly predicted as positive? By optimising on recall, we are determining whether we are doing a good job of predicting the positive values, without regard to how we are doing on the actual negative values. If we want to measure something similar to recall on the actual “negative” values, that metric is called specificity.
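Recall is TP / (TP + FN), and specificity is the analogous quantity for the negatives, TN / (TN + FP). A minimal sketch with the same made-up labels (scikit-learn has no dedicated specificity function, so it is computed from the confusion matrix here):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (made up): 1 = positive class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Recall (sensitivity) = TP / (TP + FN) = 3 / (3 + 1)
print(recall_score(y_true, y_pred))  # 0.75

# Specificity = TN / (TN + FP), computed by hand from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))  # 0.75
```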

In order to look at a combination of these metrics at the same time, there are some common techniques like the F-Beta Score (of which the F1 score is the most frequently used special case), as well as the ROC curve.

- F-Beta Score

The F-Beta Score is a weighted score of Precision and Recall. It is defined by the following equation:

F_β = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)

As we can see in the formula, this score takes both false positives and false negatives into account. The β parameter controls the relative weight given to Recall versus Precision (larger β weighs Recall more heavily), which allows the two to be considered at the same time. The most common value for β is 1, which gives the harmonic mean of Precision and Recall. The harmonic mean is closer to the smaller of the two values, so if one of them is particularly low the F1 Score “raises the flag”, whereas the ordinary average would still look “okay” when one of them is good and the other is bad.

Intuitively, the F-Beta Score is not as easy to understand as accuracy, but it is usually more useful, especially if you have an uneven class distribution. Accuracy works best when false positives and false negatives have similar costs; if those costs are very different, it is better to look at both Precision and Recall.
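A minimal sketch of how the F1 score and a recall-leaning F2 score could be computed with scikit-learn; the labels are made up so that precision (≈ 0.67) and recall (0.5) differ:

```python
from sklearn.metrics import f1_score, fbeta_score

# Toy labels (made up): 1 = positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# F1: harmonic mean of precision and recall (beta = 1)
print(f1_score(y_true, y_pred))             # ≈ 0.571
# F2: weighs recall more heavily than precision (beta = 2)
print(fbeta_score(y_true, y_pred, beta=2))  # ≈ 0.526, lower because recall is the weaker of the two
```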

- ROC Curve (Receiver Operating Characteristics Curve)

ROC Curve is one of the most important evaluation metrics for checking any classification model’s performance. I am not going to go through it as it is explained very well in the following article: Explanation of the ROC Curve.
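For completeness, here is a minimal sketch of computing the ROC curve points and the area under it with scikit-learn; the predicted probabilities below are made up:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground truth and predicted probabilities for the positive class (made up)
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

# False positive rate and true positive rate at each decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve (1.0 = perfect ranking, 0.5 = random guessing)
print(roc_auc_score(y_true, y_score))  # 0.875
```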

Regression Metrics

If we want to measure how well our algorithms perform at predicting numeric values, there are three main metrics that are frequently used: Mean Absolute Error, Mean Squared Error, and the R2 Score.

- Mean Absolute Error (MAE)

Mean Absolute Error is the average of the absolute errors. It is a useful metric to optimise on when the value you are trying to predict follows a skewed distribution, because outliers will not influence a model optimised on absolute error as much as one optimised on mean squared error.

But MAE has a problem: the absolute value function is not differentiable at zero, which makes this metric awkward to use in algorithms that rely on Gradient Descent methods. To address this, we use the more common metric, Mean Squared Error.
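A minimal sketch of computing MAE with scikit-learn (the target values are made up):

```python
from sklearn.metrics import mean_absolute_error

# Toy true values and model predictions (made up)
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

# MAE = mean(|y_true - y_pred|) = (0.5 + 0.5 + 0.0 + 1.0) / 4
print(mean_absolute_error(y_true, y_pred))  # 0.5
```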

- Mean Squared Error (MSE)

Mean Squared Error measures the average of the squared errors and is by far the most commonly used metric for optimisation in regression problems. As with MAE, we want a model that minimises this value. MSE can be greatly impacted by skewed distributions and outliers, which is worth keeping in mind when a model looks optimal under MAE but not under MSE. In many cases it is also easier to optimise on MSE, because the quadratic term is differentiable.
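A minimal sketch (with made-up values) illustrating both the metric and its sensitivity to a single outlier compared to MAE:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy true values and predictions; the last prediction is badly off (an "outlier" error)
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 17.0]

# MSE squares each error, so the single large miss dominates the score
print(mean_squared_error(y_true, y_pred))   # (0.25 + 0.25 + 0 + 100) / 4 = 25.125
# MAE grows only linearly with the same miss
print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0.5 + 0 + 10) / 4 = 2.75
```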

- R2 Score

Finally, the R2 Score is another common metric when looking at regression values. Optimising a model to have the lowest MSE will also optimise it to have the highest R2 score, which is a convenient feature of this metric. The R2 score is frequently interpreted as the “amount of variability” captured by a model. Therefore, you can think of MSE as the average amount you miss by across all the points, and the R2 value as the amount of the variability in the points that you capture with the model.

The R2 Score is based on comparing our model to the simplest possible model. Let’s assume we built our model with linear regression. The simplest possible model that fits a bunch of data is to take the average of all the values and draw a horizontal line through them. If we calculate the MSE for this simpler model, we would hope that it is larger than the MSE of the linear regression model. The question is: how much larger?

We divide the error of the linear regression model by the error of the simpler model and subtract the result from 1: R2 = 1 − MSE(model) / MSE(simple model). If our model is bad, the errors of both models will be similar, the ratio will be close to 1, and the R2 value will be close to 0. If our model is good, the error of the linear regression model will be much smaller than the error of the simpler model, the ratio will be close to 0, and the R2 value will be close to 1.

To conclude, if the R2 score is close to 1 then our model is good and if it is close to 0, our model is bad.
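A minimal sketch (with made-up values) confirming that the R2 score matches 1 minus the ratio of the model’s MSE to the MSE of the “predict the mean” baseline:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy true values and model predictions (made up)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# The simplest possible model: always predict the mean of the true values
baseline = np.full_like(y_true, y_true.mean())

mse_model    = mean_squared_error(y_true, y_pred)
mse_baseline = mean_squared_error(y_true, baseline)

# R2 = 1 - MSE(model) / MSE(baseline)
print(1 - mse_model / mse_baseline)  # ≈ 0.9486
print(r2_score(y_true, y_pred))      # same value
```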

As an additional note, optimising on MAE may lead to a different “best” model than optimising on MSE. However, optimising on MSE will always lead to the same “best” model as optimising on the R2 score.

This article was based on the lectures from Udacity’s course which helped me understand these metrics better. I hope it will be helpful to you as it was to me.
