Log-loss, F-measure, Precision-recall

Classification accuracy is the number of correct predictions divided by the total number of predictions made on a given dataset. Accuracy is poorly suited to imbalanced classification: the overwhelming number of examples from the majority class swamps the minority class, so even an unskillful model can score 90 or 99 percent accuracy, depending on how severe the class imbalance is. Precision and recall are the usual alternatives to classification accuracy.

F-Measure for Imbalanced Classification

When the class distribution is skewed, classification accuracy gives a misleading picture of performance. In this case, the F-measure is used to account for the class imbalance.

F-measure: the harmonic mean of precision and recall, computed per class and, if needed, averaged over all classes. The F1 score weights precision and recall equally, while the more general F-beta score weights recall more heavily when beta is set above 1 and precision more heavily when beta is set below 1. The formula for the F1 measure is as follows:

\( F_1 = \frac{2\,TP}{2\,TP + FP + FN} \)

where TP stands for "true positive," FP stands for "false positive," and FN stands for "false negative." A classifier loses a significant amount of information when false negatives pile up; in that situation, a high recall is preferred, even at the cost of a lower area under the receiver operating characteristic (ROC) curve. On the other hand, precision becomes more critical when there are more false positives than true positives: optimizing for precision ensures that the model is not too liberal, while still including as many examples as possible among its correct predictions.
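As a minimal sketch (assuming scikit-learn and the toy label arrays below, which are not from any real dataset), these quantities can be computed directly from the counts and cross-checked against sklearn.metrics:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground-truth and predicted labels for a binary problem
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0])

# Counts of true positives, false positives, and false negatives
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)  # equivalent to 2 * P * R / (P + R)

print(precision, recall, f1)
print(precision_score(y_true, y_pred),  # should match the manual values
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```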

Confusion matrix for unbalanced classification

Suppose the positive class (c = 1) is the majority class, so there are more positive examples than negative ones. To classify all positives correctly while still catching the few negatives, we have to choose the decision threshold carefully: a threshold that pushes recall up tends to pull precision down, and vice versa. The F-measure improves when the threshold is placed where precision and recall are both reasonably high, rather than where either one is driven toward its extreme.
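A quick, hedged illustration of what such a confusion matrix looks like, assuming scikit-learn and a toy label vector in which the positive class dominates:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical imbalanced labels: the positive class (1) is the majority
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
```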

Precision, recall, and F-measures for imbalanced classification

When one of the classes has few examples, the per-class F1 score tends toward extreme values, and precision and recall likewise drift closer to 0 or 1 as the imbalance grows. That is because they are normalized on a per-class basis, like other metrics based on the binomial distribution (for example, recall equals \( \frac{TP}{P} \), where P is the number of positive examples). The F-measure is the metric that combines both of these per-class quantities, and with them the effect of the class balance, into a single score.

When we know that one of the classes has few examples, an imbalance-aware metric such as the micro-averaged F1 score helps us understand the model's performance better. The F1 score weights precision and recall equally, whether the dataset is balanced or not, so neither a high false-positive rate nor a high false-negative rate can be hidden by the other.
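As a small illustration (assuming scikit-learn and toy labels with a rare class 2), the `average` argument of `f1_score` controls how the per-class scores are combined: micro averaging pools the TP/FP/FN counts globally, while macro averaging gives every class equal weight regardless of its frequency.

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced multi-class labels (class 2 is rare)
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 2]

print(f1_score(y_true, y_pred, average='micro'))  # pooled TP/FP/FN counts
print(f1_score(y_true, y_pred, average='macro'))  # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average=None))     # per-class F1 scores
```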

Precision vs. Recall for balanced and unbalanced classification

The following plots show how precision and recall change with the class distribution within the dataset. The area under the ROC curve also changes: the curves dip lower than in the balanced case because of higher rates of false positives and/or missed detections. Computing the F1 score gives a good overview, but at best its value should be seen as complementary to other univariate or multivariate statistical approaches, which provide different but equally valuable insights.
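The plots themselves are not reproduced here; the sketch below, which assumes scikit-learn, matplotlib, and synthetic data from `make_classification`, shows one way such precision-recall curves could be drawn for a balanced and an imbalanced dataset:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical synthetic datasets: one balanced, one highly imbalanced
for label, weights in [("balanced", [0.5, 0.5]), ("imbalanced", [0.95, 0.05])]:
    X, y = make_classification(n_samples=5000, weights=weights, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    probs = clf.predict_proba(X_te)[:, 1]
    precision, recall, _ = precision_recall_curve(y_te, probs)
    plt.plot(recall, precision, label=label)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```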

We can use all this information to assign appropriate weights to each metric when computing an average across multiple experiments with different datasets, validation strategies, and so on, thereby taking into account the data properties that are most relevant to the system we are studying (for example, in a multi-class problem with strong class imbalance, assign more weight to the F-measure than to the other metrics).

In the following section, we will see how assessing these properties can help us make better decisions about which model and hyperparameters are best suited for our use case. We focus on imbalanced datasets here, but this approach can be extended to any classification task with specific properties that might not be fully captured by accuracy, precision, or recall.

It is challenging to create an objectively balanced dataset when trying different machine learning models, because most datasets are intrinsically unbalanced. A good way of approaching this problem is to define an automated way to assess the degree of imbalance in a dataset. Several approaches can be used, ranging from comparisons against gold-standard balanced and unbalanced datasets, to resampling strategies, to abundance-estimation techniques for count data (e.g., contingency tables in R).
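One very simple automated check, offered here only as a sketch and not as the approach used by any particular tool, is to compute the ratio between the most and least frequent class counts:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one.

    A value near 1 suggests a roughly balanced dataset; large values
    indicate strong imbalance. This is only one of many possible measures.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical label vector: 900 negatives and 100 positives
print(imbalance_ratio([0] * 900 + [1] * 100))  # -> 9.0
```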

Even though datasets with an imbalanced class distribution are widespread, there are still benefits to using accuracy as a performance measure for classification, because it summarizes the quality of the model's output in a single number even when the classes have very different frequencies within the dataset. However, accuracy can be misleading when the ratio of positive to negative instances changes frequently or by a large amount, as often happens when we try to predict complex systems from high-dimensional data.

To see this, let's first look at the accuracy of some classifiers on a breast cancer dataset with about 11,000 instances belonging to one of two classes (roughly 4,000 benign and 7,500 malignant), which is therefore noticeably skewed in its class distribution. We can also use the F1 score for this binary classification task, since it gives results similar to those reported in the literature. However, because it saturates near 0 or 1, it can be less intuitive when dealing with highly imbalanced datasets.
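The dataset described above is not reproduced here; as a hedged stand-in, the sketch below uses a synthetic imbalanced dataset (assuming scikit-learn) to show how accuracy can look flattering while the F1 score exposes a model that ignores the minority class:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced binary dataset (95% / 5%)
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# A baseline that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
y_pred = dummy.predict(X_te)

print("accuracy:", accuracy_score(y_te, y_pred))       # ~0.95, looks good
print("F1 (minority class):", f1_score(y_te, y_pred))  # 0.0, reveals the problem
```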

Recall or Sensitivity vs. Precision or Specificity for balanced and unbalanced classification

In the following plots, we can see that in balanced datasets accuracy gives an overview similar to precision-recall curves, which makes it easy to compare different models with one another. However, in highly imbalanced datasets, where the classes have very different frequencies, the true positive rate becomes more important than specificity: there are often few negative instances compared to the size of the positive class, so a large number of false positives can accumulate if the decision threshold is not chosen carefully. In these cases, F1 scores can also suffer from overfitting, since they depend on model complexity, and their interpretation might therefore be difficult when dealing with complex systems.

Precision and recall as the risk threshold increases, in a dataset with 10k benign and 1k malignant instances.

In this case, the balanced random forest model performs very well in terms of F1 score compared to the other models, but its recall is not high enough for it to be used in practice. As seen from the true-positive-rate plots, the SVMs perform worse because they tend to increase the false-positive rate when trying to reduce the false-negative rate, which often happens in real-world problems where datasets are intrinsically unbalanced. There may also be a trade-off between precision and recall: changing the threshold can improve one metric at the cost of the other, depending on the needs of our application.

The figure below shows an example of this: a forest classifier is trained on different partitions of the dataset, and its precision and recall are plotted as a function of an increasing risk threshold.
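A rough sketch of how such a figure might be produced, assuming scikit-learn and matplotlib; the synthetic dataset and model settings below are placeholders, not the ones used for the original figure:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset standing in for the one described above
X, y = make_classification(n_samples=11000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# precision_recall_curve returns one more precision/recall value than thresholds
precision, recall, thresholds = precision_recall_curve(y_te, probs)
plt.plot(thresholds, precision[:-1], label="precision")
plt.plot(thresholds, recall[:-1], label="recall")
plt.xlabel("risk threshold")
plt.legend()
plt.show()
```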

As we can see, there is a point where the F1 score starts to increase, which means that we have reached the trade-off between precision and recall; judged on precision or specificity alone, a model with perfect specificity but only a few true positives would look like the ideal classifier for this problem. In some cases, it might be better to use other metrics, such as AUC or the G-mean, which take very imbalanced classes into account.

AUC: Area under the ROC curve

Analogously to the ROC curves that are often used to visualize model performance on binary classification tasks, we can also analyze classifier performance on multi-class problems using Receiver Operating Characteristic (ROC) curves. The area under the ROC curve (AUC), an out-of-sample measure of how well a model ranks the average test data point according to its true label, behaves similarly to accuracy on balanced datasets containing equal amounts of each class in the training set. However, AUC suffers from overfitting, just like the F1 score, on unbalanced datasets with a single positive class, since it then depends heavily on the model's performance on that one class; this is why several multi-class extensions of the measure exist.
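For reference, scikit-learn's `roc_auc_score` exposes two such multi-class extensions through its `multi_class` argument (one-vs-rest and one-vs-one averaging); the sketch below uses a synthetic three-class dataset as a stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical three-class dataset
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)

# One-vs-rest and one-vs-one multi-class extensions of AUC
print(roc_auc_score(y_te, probs, multi_class="ovr", average="macro"))
print(roc_auc_score(y_te, probs, multi_class="ovo", average="macro"))
```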

Mean squared error, logarithmic loss, or zero-one loss

In general, three measures can be used to evaluate probabilistic predictions: mean squared error, logarithmic loss (log-loss), and zero-one loss. Mean squared error measures the squared difference between the predicted probabilities and the true labels, log-loss applies a penalty greater than 1 to confidently wrong predictions, and zero-one loss applies a flat penalty of 1 to every incorrect prediction, which makes it less intuitive to interpret. In the following plot, we can see that these metrics behave differently as the risk threshold increases, as shown before, while the ROC curves tend to stay low because only a few instances satisfy our criterion.

Prediction bias vs. expected classification accuracy for cut-off thresholds from the balanced random forest model.

This property allows us to obtain the same class probabilities as those produced by the predictor, as shown in the figure below. In this case, we can see that for a specific threshold of 0.7, all the class probabilities sum to 1 regardless of how many classes there are, which is not always true for other metrics such as F1 or AUC, since they suffer from overfitting when applied to imbalanced datasets.
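A small, hedged comparison of the three losses on hypothetical predicted probabilities (assuming scikit-learn, and using the Brier score as the mean squared error of the probabilities):

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss, zero_one_loss

# Hypothetical true labels and predicted positive-class probabilities
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
p_pos  = np.array([0.9, 0.8, 0.3, 0.4, 0.2, 0.6, 0.7, 0.1])
y_pred = (p_pos >= 0.5).astype(int)  # hard labels at a 0.5 cut-off

print("log-loss:     ", log_loss(y_true, p_pos))         # penalizes confident mistakes
print("MSE / Brier:  ", brier_score_loss(y_true, p_pos))  # squared error of probabilities
print("zero-one loss:", zero_one_loss(y_true, y_pred))    # flat penalty of 1 per error
```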

Caveats and conclusion:

Multi-class extensions of performance measures such as the G-mean or averaged mutual information have been developed to account for class imbalance. Still, these tend to produce worse results than their binary counterparts, since they reduce inter-class variability by pooling together classes that were more likely to be predicted with high accuracy under the standard binary performance measure. Such metrics can be helpful when the classes are very similar, but in that case it might be better to use a different loss function or distribution that lets us extract more information about each class.

In conclusion, when there is a large gap between the TP and FP rates (e.g., more than 50%) and we are trying to maximize the classifier's overall performance, F1 scores tend to perform better than any other combination of precision and recall values, since they account for both types of errors when estimating generalization performance. ROC curves can behave similarly on imbalanced datasets, depending on how well balanced the dataset is, while squared error or zero-one loss are typically more sensitive to overfitting. If the class balance is satisfactory, either accuracy or AUC can be used; on more balanced datasets, multi-class extensions of these performance measures provide better results thanks to the increased inter-class variance.

Precision-recall curves for various class imbalance ratios

Model performance measures alone are not enough to evaluate the quality of classification models, since they estimate how well a model will classify future samples regardless of their content. This is why it is essential to know which specific information the predictor used to make its decision about each predicted class or point.

For example, using only the recall values at given TP thresholds ignores any potential misclassification of false-positive points, while relying solely on precision would hide many incorrect predictions caused by incorrect labels (and, therefore, inflated spam scores). To this end, it may be more informative to use F1 scores or per-class conditional probabilities, as shown below for some spam and ham emails represented as word histograms.

Spam and ham word distributions are shown as bars on the left and right, respectively, with true negatives in black, false positives in red, true positives in blue, and false negatives in green.

From these plots we can see that, depending on the classifier, some information can be used to discriminate between spam and ham messages that is not necessarily visible in the standard performance measures or ROC curves. We can therefore conclude that it may be essential to consider several models to understand better how these messages were classified, which might also help improve our predictions by further analyzing which words tend to appear with each label.

Zero-one loss vs. F1 for various ratios of classes on imbalanced data

The F1 score is a particular case of the more general weighted harmonic mean, which can be applied to any set of positive real numbers by taking the reciprocal of the weighted mean of the reciprocals of the values. Many metrics based on this concept have been developed to better capture human preferences; see, for example, Pérez et al. (2004). Within machine learning, the F1 score was derived from early information retrieval research, where precision and recall were used as objective functions in search engines with traditional or probabilistic ranking techniques (e.g., see Salton & McGill 1983). The harmonic mean was first introduced for imbalanced datasets by Cohen & Sackrowitz (2000) as a way to measure classifier performance while taking both types of errors into account.
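To make the connection concrete, the sketch below (assuming scikit-learn and toy labels) computes the F-beta score as a weighted harmonic mean of precision and recall, with recall given weight beta squared, and checks it against `fbeta_score`:

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: reciprocal of the weighted mean of reciprocals."""
    values, weights = np.asarray(values, float), np.asarray(weights, float)
    return weights.sum() / np.sum(weights / values)

# Hypothetical binary predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
beta = 2.0  # weight recall more heavily than precision

# F-beta is the harmonic mean of precision and recall with weights 1 and beta**2
print(weighted_harmonic_mean([p, r], [1.0, beta ** 2]))
print(fbeta_score(y_true, y_pred, beta=beta))  # should agree
```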

Notice that the kappa metric (Cohen 1960), unlike the accuracy score, corrects for chance agreement: it equals zero when the predicted labels agree with the ground truth no better than chance, and it reaches its maximum value of 1 when the agreement is perfect. It is possible to derive a loss-type function (Cohen 1960) based on the F1 score that can be used instead of zero-one loss or MSE, by simply replacing the class probabilities with binary values (i.e., predict spam if p(spam) > p(ham)). This works with most classification techniques, although there are exceptions, such as problems where one class is completely hidden or unrepresentable among the others.

The x-axis represents the F1 score, which ranges from 0 (poor discrimination) to 1 (perfect discrimination); a value around 0.5 indicates performance close to random guessing. For comparison, the true positive and false negative rates are also shown as vertical dashed lines, with the corresponding points labelled TP and FN, respectively. Notice that, unlike the measures shown previously on this website, both axes are logarithmic, to better capture the wide range of values that can occur on imbalanced datasets.

Summary and concluding remarks:

Although several metrics and methods for measuring classifier performance exist, we have focused on the most commonly used ones, which practitioners understand best. Test-set performance is usually measured after a training phase in which new models are created and evaluated using cross-validation. For imbalanced datasets that cannot be represented adequately by sampling (e.g., see Kontschieder et al. 2015), it is essential to consider the nature of the data when choosing an evaluation technique and when evaluating model parameters such as those shown above. It is also good practice to train several models so that more accurate predictions can be made in specific situations, including the many non-binary cases not considered here for lack of suitable tools. Moreover, one must consider the cost of errors, which may differ depending on who is misclassified and, in some cases, by how much: for example, classifying a legitimate (ham) client as spam may cause far more damage than letting the occasional spam message through.

Unfortunately, many datasets are still imbalanced, with high costs attached to each error, especially when dealing with privacy issues where false negatives are often not acceptable. As seen throughout this website, achieving good performance during training depends on several factors, including the nature of the data and the distribution of the classes, which change from problem to problem. Model evaluation techniques therefore aim to estimate predictive power given a specific set of features and labels chosen by humans, which can itself be a source of limitation and inaccuracy. For this reason, it is difficult to find a single optimal evaluation method suitable for all problems and datasets; instead, one must carefully choose an approach based on the identified goals and the available resources. As more research is done within machine learning, new tools may emerge with better ways to handle these issues by leveraging advances in other areas of science, including data mining, statistics, and information retrieval.

Luckily, for most classification problems, techniques such as stratified sampling capture imbalanced datasets at ratios close to their true distribution; however, they are not always applicable, especially when dealing with a large number of classes (e.g., more than 100). Although the examples shown here cover only binary classification, these methods can be applied to multi-class problems, often by changing only the loss function. Furthermore, many of them generalize to related problems such as regression and time-series forecasting, which we hope to cover in future posts on this website.
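A minimal sketch of stratified splitting and stratified cross-validation with scikit-learn, using a hypothetical 9:1 label vector:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical imbalanced labels (90% class 0, 10% class 1)
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

# A stratified hold-out split keeps the 9:1 ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
print(np.bincount(y_tr), np.bincount(y_te))  # roughly [720 80] and [180 20]

# Stratified cross-validation preserves the ratio in every fold
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    print(np.bincount(y[test_idx]))  # each test fold: ~[180 20]
```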
