
Machine Learning Explained: What is the F1 Score in Machine Learning & Deep Learning?

Jan. 18, 2024
24 min
Nathan Robinson
Product Owner
Nathan is a product leader with proven success in defining and building B2B, B2C, and B2B2C mobile, web, and wearable products. These products are used by millions and available in numerous languages and countries. Following his time at IBM Watson, he's focused on developing products that leverage artificial intelligence and machine learning, earning accolades such as Forbes' Tech to Watch and TechCrunch's Top AI Products.

Evaluating the performance of a machine learning model is critical for building robust and accurate solutions. While accuracy is a standard metric used for this purpose, it might not provide a complete picture for some machine learning models, especially in cases where the dataset is imbalanced or the consequences of false positives and false negatives vary. This is where metrics like precision and recall come into play. Achieving the right balance between these two metrics is crucial in machine learning, and the F1 Score offers a holistic evaluation of a model’s performance by combining precision and recall into a single metric.

    False & Actual Positives/Negatives

    In machine learning and statistics, the terms ‘true positive,’ ‘true negative,’ ‘false positive,’ and ‘false negative’ are crucial for describing the performance of binary classification models, such as those used in spam email detection, medical diagnosis, and fraud detection. These terms are typically associated with confusion matrices, an evaluation tool used to assess and refine the effectiveness of machine learning models. These terms are defined as follows:

    • True Positive (TP): The cases where a model correctly predicts a positive class. For example, a medical diagnosis model correctly identifying a patient with a disease.
    • True Negative (TN): The cases where a model correctly predicts a negative class. For example, a medical diagnosis model correctly identifying a healthy patient as not having a disease.
    • False Positive (FP): The cases where a model incorrectly predicts a positive class, while the actual class is negative. For example, the model falsely indicating a disease in a healthy patient.
    • False Negative (FN): The cases where a model incorrectly predicts a negative class, while the actual class is positive. For example, the model failing to detect a disease in an afflicted patient.

    Understanding these terms is crucial to evaluating how well a model performs in different real-world applications and fine-tuning its parameters to achieve the desired balance between false negatives and positives, depending on the specific problem and its associated costs or risks.
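
    To make these terms concrete, here is a minimal sketch of how the four counts can be pulled out of a confusion matrix with scikit-learn. The labels below are hypothetical and exist only for illustration; 1 stands for the positive class (e.g. "has the disease") and 0 for the negative class.

    ```python
    # Minimal sketch: deriving TP, TN, FP, FN from predictions with scikit-learn.
    # The labels are hypothetical; 1 = positive class, 0 = negative class.
    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

    # For labels ordered [0, 1], confusion_matrix returns [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=3, TN=3, FP=1, FN=1
    ```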

    The Need for Comprehensive Evaluation

    In many machine learning scenarios, relying solely on accuracy – the proportion of true results among the total number of cases examined – to assess model performance can be misleading. For instance, consider a model designed to identify a rare disease. If the disease occurs infrequently, a classifier that predominantly predicts ‘negative’ (no disease) could achieve high accuracy simply because most cases are negative, highlighting a class imbalance issue. This situation illustrates the limitations of using accuracy as the sole metric in scenarios where there is a significant disparity between class frequencies. To address this, other metrics such as precision and recall are used. These metrics offer a more comprehensive evaluation of a model’s performance.

    Precision

    Precision is the ratio of true positives to the total predicted positives, indicating how many of the model’s positive predictions were correct. Precision is mathematically calculated using the formula:

    Precision = TP / (TP + FP)

    Example: In a spam email classifier, precision would be the ratio of correctly classified spam emails to the total emails predicted as spam. If the model flagged 100 emails as spam and 90 of them were actual spam, the precision would be 90 / 100 = 0.90.

    Recall

    Recall is the ratio of true positives to the total actual positives, indicating how many of the actual positive cases the model correctly predicted. This metric is especially important when the positive class is rare or when missing a positive case is costly. Recall is mathematically calculated using the formula:

    Recall = TP / (TP + FN)

    Example: In a medical test for a rare disease, recall would be the ratio of individuals correctly identified as having the disease to the total individuals who actually have the disease. If there are 100 individuals with the disease, and the model correctly identifies 80 of them, the recall would be 80 / 100 = 0.80.
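
    As a quick sanity check, here is a minimal sketch of both formulas applied to the two examples above (the spam classifier and the disease screening); the code assumes only the counts given in the text.

    ```python
    # Minimal sketch of the precision and recall formulas from the examples above.
    def precision(tp: int, fp: int) -> float:
        return tp / (tp + fp)

    def recall(tp: int, fn: int) -> float:
        return tp / (tp + fn)

    # Spam example: 90 of the 100 emails flagged as spam were actual spam.
    print(precision(tp=90, fp=10))   # 0.9

    # Disease example: 80 of the 100 individuals with the disease were identified.
    print(recall(tp=80, fn=20))      # 0.8
    ```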

    What the F1 Score is in Machine Learning

    The F1 Score is a metric that combines precision and recall into a single value, offering a comprehensive assessment of a model’s performance that remains informative even when the data is imbalanced.

    Defining the F1 Score

    The F1 Score is the harmonic mean of precision and recall, mathematically calculated using the formula:

    F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

    The F1 Score ranges between 0 and 1, with higher values indicating better model performance. When precision and recall are both high and balanced, the F1 Score is high. When either metric is low, the F1 Score drops accordingly, because the harmonic mean penalizes imbalance between the two.

    Comparing the F1 Score with metrics like precision and recall individually can also help to assess model performance.

    Mathematical Example of F1 Score

    Suppose you have a binary classification problem where you are trying to identify whether an email is spam (positive class) or not spam (negative class). You have the following data:

    • True Positives (TP): 150 emails correctly identified as spam
    • False Positives (FP): 30 emails incorrectly identified as spam
    • False Negatives (FN): 20 spam emails incorrectly classified as not spam
    • True Negatives (TN): 800 emails correctly identified as not spam

    You can calculate precision and recall as follows:

    Precision = 150 / (150 + 30) = 150 / 180 ≈ 0.8333
    Recall = 150 / (150 + 20) = 150 / 170 ≈ 0.8824

    Using the values calculated for precision and recall, you can find the F1 Score:

    F1 Score = 2 × (0.8333 × 0.8824) / (0.8333 + 0.8824) ≈ 0.8571

    In this example, the F1 Score is approximately 0.8571, signifying that the model demonstrates strong performance in both precision (the proportion of actual spam among the emails the model flagged as spam) and recall (the proportion of actual spam emails the model correctly identified). An F1 Score of 0.8571 is generally indicative of good performance, especially in the context of spam email filtering. However, it’s important to consider the specific needs and context of the task at hand when interpreting this score, as different applications may have varying requirements for precision and recall. Also keep in mind that real-world datasets often have skewed class distributions, which affects how these metrics should be interpreted.
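
    The numbers above can be reproduced with a few lines of Python; this is just a sketch of the arithmetic using the counts from the example, not a general-purpose implementation.

    ```python
    # Reproducing the worked example: TP=150, FP=30, FN=20, TN=800.
    tp, fp, fn, tn = 150, 30, 20, 800

    precision = tp / (tp + fp)                           # 150/180 ≈ 0.8333
    recall = tp / (tp + fn)                              # 150/170 ≈ 0.8824
    f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.8571

    print(round(precision, 4), round(recall, 4), round(f1, 4))
    ```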

    Why Balance Precision & Recall

    Precision and recall often have an inverse relationship: tuning a model for higher precision tends to lower its recall, and vice versa. The F1 Score serves as a way to find a workable balance between these two metrics. In scenarios where both false positives and false negatives have significant consequences, the F1 Score becomes a valuable tool for evaluating model effectiveness, ensuring that the model is not optimizing one metric at the expense of the other.

    Let’s consider a medical diagnosis model as an example, where false positives and false negatives can have serious consequences.

    Suppose you have a model that predicts whether a patient has a rare and life-threatening disease, such as cancer. You have the following data:

    • True Positives (TP): The model correctly identifies patients with the disease
    • True Negatives (TN): The model correctly identifies patients who do not have the disease
    • False Positives (FP): The model incorrectly predicts a patient has the disease when they don’t
    • False Negatives (FN): The model incorrectly predicts that a patient does not have the disease when they do

    The consequences are as follows:

    • A false positive may lead to unnecessary stress, additional testing, and medical expenses for a patient who doesn’t have the disease. While this can cause undue anxiety, this false alarm is generally not life-threatening.
    • A false negative, on the other hand, is more serious and can delay essential treatment for a patient who actually has the disease. Failing to diagnose a patient who actually has the disease can result in significant health deterioration or even mortality, emphasizing the critical need to minimize false negatives in medical diagnostics.

    In such a critical scenario, you want to evaluate your model’s performance with high precision and high recall. In the case of this medical diagnosis model, a high F1 Score would indicate that the model is making very few errors (both false positives and false negatives) and its predictions are reliable.

    The F1 Score helps ensure that you strike the right balance between avoiding false negatives (missing actual cases of the disease) and keeping false positives (false alarms for patients without the condition) to a minimum, so the model’s predictions can be trusted.

    Finding the Right Balance

    Interpreting the F1 Score requires understanding the trade-offs between precision and recall. Depending on the problem, you might prioritize one metric over the other.

    For instance, in spam detection, a very high precision (avoiding false positives) might be prioritized to ensure that legitimate emails are not incorrectly marked as spam. However, this could potentially lower recall, leading to more spam emails getting through. Conversely, prioritizing recall (catching as many spam emails as possible) might lead to more legitimate emails being misclassified as spam.

    Therefore, the “right balance” is highly context-dependent and requires understanding the specific needs and consequences in the application domain and utilizing proper evaluation metrics. It involves fine-tuning the model and possibly adjusting its threshold for classifying positives to optimize the F1 Score in a way that aligns with the specific objectives and constraints of the task.
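
    One common way to do this in practice is to sweep the classification threshold on a validation set and keep the value that maximizes the F1 Score. The sketch below assumes a fitted scikit-learn-style classifier called model with a predict_proba method, plus hypothetical X_val and y_val arrays; none of these names come from the article.

    ```python
    # Hedged sketch: sweep the decision threshold and keep the one with the best F1.
    # `model`, `X_val`, and `y_val` are hypothetical placeholders for a fitted
    # classifier with predict_proba and a held-out validation set.
    import numpy as np
    from sklearn.metrics import f1_score

    probs = model.predict_proba(X_val)[:, 1]          # positive-class probabilities
    thresholds = np.linspace(0.05, 0.95, 19)

    scores = [f1_score(y_val, probs >= t) for t in thresholds]
    best_t = thresholds[int(np.argmax(scores))]
    print(f"best threshold = {best_t:.2f}, F1 = {max(scores):.3f}")
    ```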

    Use Cases & Interpretation

    The F1 Score finds practical applications in various machine learning scenarios. In fraud detection, the F1 Score helps models strike a delicate balance between catching fraudulent activities and avoiding the mislabeling of legitimate transactions, thus maintaining customer trust. In manufacturing, this metric is key to quality control, aiding in the efficient detection of defects to uphold product standards. For businesses monitoring social media, the F1 Score enables precise sentiment analysis, critical for understanding and responding to consumer opinions. The F1 Score is also used to create safe digital environments by effectively identifying and filtering inappropriate content.

    F1 Score vs. Other Metrics

    While accuracy is often the default metric for evaluating models, it doesn’t account for class imbalance. The F1 Score provides a more comprehensive evaluation by considering precision and recall. When evaluating a machine learning model, you should consider multiple metrics to gain a full understanding of its performance. Below are examples of other metrics used to evaluate ML model performance:

    Accuracy

    Accuracy is the overall correctness of the model’s predictions, i.e. the proportion of all predictions (positive and negative) that are correct. It is mathematically calculated using the formula:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

    Example: In a binary classification model for spam detection, accuracy is the ratio of all correct predictions (true positives and true negatives) to all predictions (true positives, true negatives, false positives, and false negatives). If the model correctly classifies 950 emails out of 1000, the accuracy is 95%.

    Specificity

    Specificity is defined as how many of the actual negative cases the model correctly predicted as negative. It is mathematically calculated using the formula:

    Specificity = TN / (TN + FP)

    Example: In a medical test for a disease, specificity would be the ratio of individuals correctly identified as disease-free to the total number of individuals who are disease-free. If there are 500 disease-free individuals, and the model correctly identifies 490 of them, the specificity would be 490/500 = 0.98.
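
    Both metrics are simple ratios of confusion-matrix counts. The sketch below reuses the specificity numbers from the example above; the TP/TN/FP/FN split for the spam accuracy example is an assumption chosen so that the totals match the 950-out-of-1000 figure in the text.

    ```python
    # Minimal sketch of accuracy and specificity as confusion-matrix ratios.
    def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
        return (tp + tn) / (tp + tn + fp + fn)

    def specificity(tn: int, fp: int) -> float:
        return tn / (tn + fp)

    # Spam example: 950 correct predictions out of 1000 (assumed split of counts).
    print(accuracy(tp=120, tn=830, fp=30, fn=20))   # 0.95

    # Disease example: 490 of 500 disease-free individuals correctly identified.
    print(specificity(tn=490, fp=10))               # 0.98
    ```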

    Receiver Operating Characteristic – Area Under the Curve (ROC-AUC)

    • Receiver Operating Characteristic (ROC) Curve: The ROC curve is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
    • Area Under the Curve (AUC): This is a measure of the entire two-dimensional area underneath the entire ROC curve. The higher the AUC, the better the model is at distinguishing between positive and negative cases.

    The ROC-AUC is defined as the area under the ROC curve, which shows the trade-off between true positive rate (sensitivity) and false positive rate. A ROC-AUC value close to 1 indicates a high level of diagnostic ability, while a value close to 0.5 suggests no discriminative ability (equivalent to random guessing).

    Example: Consider an ROC curve plotting the TPR against the FPR at various thresholds, together with a diagonal “No Skill” line that represents performance equivalent to random guessing.

    If the AUC for such a curve is 0.70, the model has a moderate ability to separate the classes: there is a 70% chance that it ranks a randomly chosen positive case above a randomly chosen negative one.
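
    In code, the ROC curve and its AUC are typically computed from the model's scores rather than its hard predictions. The sketch below uses scikit-learn with a small set of made-up labels and scores purely for illustration.

    ```python
    # Hedged sketch: computing an ROC curve and its AUC with scikit-learn.
    # The labels and scores are made up; in practice the scores would come from
    # something like model.predict_proba(X)[:, 1].
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = [0, 0, 1, 1, 0, 1, 0, 1]
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points on the ROC curve
    print(roc_auc_score(y_true, y_score))               # 0.75 for these toy values
    ```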

    F2 Score

    Similar to the F1 Score, the F2 Score balances precision and recall but places more emphasis on recall. This makes the F2 Score more useful in situations where missing a positive instance (FN) is more costly than incorrectly labeling a negative instance as positive (FP), as in medical diagnosis or fraud detection models. The F2 Score is mathematically calculated using the formula:

    F2 Score = 5 × (Precision × Recall) / (4 × Precision + Recall)

    The F2 Score ranges from 0 to 1, where 0 is the worst score and 1 is the best score:

    • A higher F2 Score indicates that the cases the model identifies as positive are largely correct (precision) and that most of the actual positive instances are being found (recall), with recall weighted more heavily.
    • An F2 Score of 0 indicates the worst performance, where either the precision or recall (or both) is zero. This would mean there are no true positives, or the system is failing to correctly identify any of the relevant instances.

    Example: Suppose you are performing a disease screening and have the following data:

    • True Positives (TP): 40 sick individuals correctly identified as sick
    • False Positives (FP): 10 healthy individuals incorrectly identified as sick
    • True Negatives (TN): 20 healthy individuals correctly identified as healthy
    • False Negatives (FN): 30 sick individuals incorrectly identified as healthy

    You can calculate precision and recall as follows:

    Precision = 40 / (40 + 10) = 0.80
    Recall = 40 / (40 + 30) ≈ 0.571

    Using the values calculated for precision and recall, you can calculate the F2 Score:

    F2 Score = 5 × (0.80 × 0.571) / (4 × 0.80 + 0.571) ≈ 0.61

    In this example, the F2 Score for this test is approximately 0.61, indicating that the test is reasonably good at identifying most of the actual positive cases, with an emphasis on minimizing false negatives.
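
    The same result can be reproduced directly from the counts in the example, using the general F-beta formula with beta = 2 (scikit-learn's fbeta_score with beta=2 computes the same quantity from raw label arrays).

    ```python
    # Reproducing the F2 example: TP=40, FP=10, FN=30 (TN is not used by F-beta).
    tp, fp, fn = 40, 10, 30

    precision = tp / (tp + fp)   # 40/50 = 0.80
    recall = tp / (tp + fn)      # 40/70 ≈ 0.571

    beta = 2
    f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    print(round(f2, 2))          # ≈ 0.61
    ```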

    Matthews Correlation Coefficient (MCC)

    The Matthews Correlation Coefficient (MCC) is a measure used to evaluate the quality of binary (two-class) classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes.

    The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction, and -1 indicates total disagreement between prediction and observation. The MCC is mathematically calculated using the formula:

    MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

    Example: You have a binary classification model that predicts whether an email is spam or not. After testing the model on a dataset, you obtain the following confusion matrix:

    • True Positives (TP): 50 emails correctly identified as spam
    • False Positives (FP): 5 emails incorrectly identified as spam
    • False Negatives (FN): 10 spam emails incorrectly classified as not spam
    • True Negatives (TN): 45 emails correctly identified as not spam

    You can calculate the MCC as follows:

    MCC = (50 × 45 - 5 × 10) / √((50 + 5)(50 + 10)(45 + 5)(45 + 10)) = 2200 / √9,075,000 ≈ 0.73

    The MCC for this example is approximately 0.73. This value indicates a good predictive quality of the model, as it is significantly higher than 0 (which would indicate a random prediction) and closer to 1, which represents a perfect prediction.
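
    The arithmetic behind that value looks like this; scikit-learn's matthews_corrcoef gives the same result when applied to raw label arrays, but here the formula is applied to the counts directly.

    ```python
    # Reproducing the MCC example: TP=50, FP=5, FN=10, TN=45.
    import math

    tp, fp, fn, tn = 50, 5, 10, 45

    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    print(round(numerator / denominator, 2))   # ≈ 0.73
    ```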

    These metrics provide a holistic view of a model’s performance and are often used in combination to make well-informed decisions about its suitability for a specific task. The choice of metrics depends on the nature of the problem and the relative importance of false positives and false negatives in the particular application.

    F1 Score for Multi-Class Classification

    The F1 Score can also be adapted for multi-class classification problems using two approaches: micro-average and macro-average.

    Micro-Average F1 Score

    The Micro-Average F1 Score is a method adapted for multi-class classification problems. This approach aggregates the contributions of all classes into a unified F1 Score. It is calculated by considering the total number of true positives (TPtotal), false positives (FPtotal), and false negatives (FNtotal) across all classes, treating the dataset as a single binary classification problem. Because every instance is weighted equally, classes contribute to the score in proportion to their frequency, which makes the micro-average a good summary of overall performance across the whole dataset, though it can be dominated by the most frequent classes.

    To calculate the Micro-Average F1 Score, you must first calculate the micro-averaged precision and recall. These are mathematically calculated using the formulas:

    Precision_micro = TPtotal / (TPtotal + FPtotal)
    Recall_micro = TPtotal / (TPtotal + FNtotal)

    The Micro-Average F1 Score can then be calculated from these precision and recall values using the following formula:

    Micro-Average F1 = 2 × (Precision_micro × Recall_micro) / (Precision_micro + Recall_micro)

    Macro-Average F1 Score

    Unlike the micro-average, the Macro-Average F1 Score method computes the F1 Score individually for each class and then computes the average across all classes. This approach treats all classes equally, regardless of their frequency in the dataset. It is particularly useful when you want to understand the model’s performance across all classes without giving more importance to the more frequently occurring classes. However, this might not be desirable in all cases, such as in applications where certain classes are more important than others.

    The Macro-Average F1 Score is computed by calculating the F1 Score for each class and taking the arithmetic mean of these individual F1 Scores:

    Macro-Average F1 = (F1_class1 + F1_class2 + … + F1_classN) / N

    Multi-Class Classification Example

    In this multi-class classification problem there are three classes: A, B, and C. The model’s performance on a test dataset is as follows:

    • Class A: TP = 80, FP = 20, FN = 30
    • Class B: TP = 60, FP = 40, FN = 20
    • Class C: TP = 70, FP = 10, FN = 40

    Micro-Average F1 Score

    • Total True Positives = 80 + 60 + 70 = 210
    • Total False Positives = 20 + 40 + 10 = 70
    • Total False Negatives = 30 + 20 + 40 = 90

    You can calculate the micro-averaged precision and recall as follows:

    Precision_micro = 210 / (210 + 70) = 0.75
    Recall_micro = 210 / (210 + 90) = 0.70

    Using the values calculated for precision and recall, you can find the F1 Score:

    Micro-Average F1 = 2 × (0.75 × 0.70) / (0.75 + 0.70) ≈ 0.724

    The Micro-Average F1 Score for this classification problem, considering classes A, B, and C together, is approximately 0.724. This score represents the overall performance of the model across all classes, with an emphasis on the frequency of each class.

    Macro-Average F1 Score

    Using the same per-class counts as in the Micro-Average F1 Score calculation, you can compute precision and recall for each class:

    • Class A: Precision = 0.80, Recall ≈ 0.727
    • Class B: Precision = 0.60, Recall = 0.75
    • Class C: Precision = 0.875, Recall ≈ 0.636

    Then compute the F1 Score for each class:

    • F1 Score (Class A) ≈ 0.762
    • F1 Score (Class B) ≈ 0.667
    • F1 Score (Class C) ≈ 0.737

    Then calculate the Macro-Average F1 Score:

    Macro-Average F1 = (0.762 + 0.667 + 0.737) / 3 ≈ 0.722

    In this example, the micro-average F1 Score represents the overall performance across all classes as a single value, while the macro-average F1 Score provides the average performance across individual classes, giving each class equal importance.
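
    Both averages can be reproduced from the per-class counts given above; this is a sketch of the arithmetic rather than a general implementation (with per-instance labels, scikit-learn's f1_score with average="micro" or average="macro" yields the same two numbers).

    ```python
    # Reproducing the multi-class example from the per-class (TP, FP, FN) counts.
    counts = {"A": (80, 20, 30), "B": (60, 40, 20), "C": (70, 10, 40)}

    def f1(p: float, r: float) -> float:
        return 2 * p * r / (p + r)

    # Micro-average: pool the counts across classes, then apply F1 once.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro_f1 = f1(tp / (tp + fp), tp / (tp + fn))

    # Macro-average: per-class F1, then an unweighted mean.
    per_class = [f1(t / (t + p), t / (t + n)) for t, p, n in counts.values()]
    macro_f1 = sum(per_class) / len(per_class)

    print(round(micro_f1, 3), round(macro_f1, 3))   # ≈ 0.724, ≈ 0.722
    ```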

    Limitations & Considerations

    While the F1 Score is a robust evaluation metric, it’s important to note its limitations. One major limitation is its sensitivity to class imbalances. In cases of highly imbalanced datasets, the F1 Score might not provide a meaningful representation of a model’s performance.

    Also, the F1 Score doesn’t account for the specific costs associated with a false positive or a false negative, which might vary based on the application. Comparing models based solely on their F1 Scores can be misleading, especially if the scores are close. The underlying precision and recall trade-offs can be quite different, suggesting different strengths and weaknesses in the models.

    Harness Machine Learning Technology in Your Business

    Ready to harness the power of machine learning? WestLink is your trusted partner in navigating the complexities of deep learning and artificial intelligence, including concepts like the F1 Score.

    Our machine learning and data science expertise can future-proof your company! Let us help you build machine learning models that propel your business forward, equipping you to tackle real-world challenges and foster a culture of innovation that drives growth and success.

    Questions?

    • What is the F1 Score in machine learning, and why is it important?
      The F1 Score is an important metric in machine learning that merges precision and recall into a single measure, providing an overall evaluation of a model's performance. This metric is vital for data scientists as it assists in gauging a model's effectiveness in accurately identifying positive cases while reducing both false positives and false negatives.
    • Why use the F1 Score instead of accuracy?
      The F1 score is often more reliable than accuracy in situations where there are imbalanced class distributions or when the costs associated with false positives and false negatives differ. This makes it a preferred metric in scenarios where class imbalance is a significant concern.
    • What values can the F1 Score range from, and how do you interpret them?
      The F1 Score spans from 0 to 1, where 1 represents the optimal score. A higher F1 Score signifies better model performance, indicating a harmonious balance between precision and recall. A score approaching 1 suggests that the model is proficient at accurately identifying positive instances and adept at reducing both false positives and false negatives. An F1 Score of 0 indicates the lowest possible performance, suggesting extremely poor precision, recall, or both. In this scenario, the model completely fails to effectively distinguish between positive and negative instances.
    • Can the F1 Score be used for imbalanced datasets?
      Yes, the F1 Score is particularly useful for imbalanced datasets where one class significantly outnumbers the other. It provides a balance between precision (the proportion of true positive results among all positive predictions) and recall (the proportion of true positive results among all actual positives), which is crucial in scenarios where one class significantly outnumbers another. This balance helps in evaluating models where simply measuring accuracy might be misleading due to the imbalance. However, relying solely on the F1 Score is not sufficient for a comprehensive evaluation of a model's performance on imbalanced datasets. Other metrics like precision, recall, and the area under the Receiver Operating Characteristic curve (AUC-ROC) also play a crucial role.
    • What are some applications of the F1 Score in machine learning?
      The F1 Score is used to assess ML models where a balance between precision and recall is important, such as in spam detection, medical diagnosis, fraud detection, sentiment analysis, image and video recognition, information retrieval, anomaly detection, churn prediction, and biometric verification.
    • Are there scenarios where the F1 Score might not be the best metric to use in machine learning?
      Yes, there are certain scenarios where the F1 Score might not be ideal. When true negatives are important: the F1 Score does not consider true negatives, so in cases where identifying negatives is as important as identifying positives (e.g., in certain types of anomaly detection), other metrics like accuracy or specificity might be more appropriate. Extremely imbalanced data: for datasets with a severe class imbalance, the F1 Score might still not adequately reflect the model's performance; metrics like the Matthews Correlation Coefficient or a weighted F1 Score might be more informative. Different costs of errors: if the costs of false positives and false negatives are significantly different, a customized cost-based metric might be more suitable than the F1 Score.