TP / FP / FN / TN
The counts of true positive (TP), false positive (FP), false negative (FN), and true negative (TN) ground truths and inferences are essential for summarizing model performance. These metrics are the building blocks of many other metrics, including accuracy, precision, and recall.
|True Positive||TP||An instance for which both predicted and actual values are positive|
|False Positive||FP||An instance for which predicted value is positive but actual value is negative|
|False Negative||FN||An instance for which predicted value is negative but actual value is positive|
|True Negative||TN||An instance for which both predicted and actual values are negative|
To compute these metrics, each inference is compared to a ground truth and categorized into one of the four groups. Let’s say we’re building a dog classifier that predicts whether an image has a dog or not:
| ||Positive Inference (Dog)||Negative Inference (No Dog)|
|Positive Ground Truth (Dog)||True Positive (TP)||False Negative (FN)|
|Negative Ground Truth (No Dog)||False Positive (FP)||True Negative (TN)|
Images of a dog are positive samples, and images without a dog are negative samples.
If a classifier predicts that there is a dog on a positive sample, that inference is a true positive (TP). If that classifier predicts that there isn’t a dog on a positive sample, that inference is a false negative (FN).
Similarly, if that classifier predicts that there is a dog on a negative sample, that inference is a false positive (FP). A negative inference on a negative sample is a true negative (TN).
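The four cases above can be sketched as a small Python function. This is a minimal illustration, not a reference implementation; the function name `categorize` and its boolean arguments are hypothetical.

```python
def categorize(actual_positive: bool, predicted_positive: bool) -> str:
    """Return 'TP', 'FP', 'FN', or 'TN' for one ground truth / inference pair."""
    if predicted_positive:
        # A positive inference is correct only on a positive sample.
        return "TP" if actual_positive else "FP"
    # A negative inference is correct only on a negative sample.
    return "FN" if actual_positive else "TN"


# (has_dog, predicted_dog) for four hypothetical images
pairs = [(True, True), (True, False), (False, True), (False, False)]
labels = [categorize(has_dog, predicted_dog) for has_dog, predicted_dog in pairs]
print(labels)  # ['TP', 'FN', 'FP', 'TN']
```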
The TP / FP / FN / TN metrics are long established and are mainly used to evaluate classification, detection, and segmentation models.
The implementation of these metrics is straightforward. That said, there are different guidelines and edge cases to be aware of for binary and multiclass problems as well as object detection and other workflows.
There are three types of classification workflows: binary, multiclass, and multi-label.
In a binary classification workflow, TP, FN, FP, and TN are computed from the following inputs:
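||Ground truth labels, where each label marks a sample as either positive or negative|
||Predicted confidence scores, where a higher score indicates a higher confidence of the sample being positive|
||Threshold value to compare against the inference’s confidence score, where a score greater than or equal to the threshold is a positive inference and a score below it is a negative inference|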
Should Threshold Be Inclusive or Exclusive?
A confidence threshold is defined as "the minimum score that the model will consider the inference to be positive (i.e. true)". Therefore, it is standard practice to treat inferences with a confidence score greater than or equal to the confidence threshold as positive.
With these inputs, the TP / FP / FN / TN metrics are defined:
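One way to write these definitions is sketched below. The symbols are assumed notation for illustration: $y_i \in \{0, 1\}$ is the ground truth label of sample $i$ (1 for positive), $s_i$ is its predicted confidence score, $T$ is the threshold, and $N$ is the number of samples. The comparison with $T$ is inclusive, following the convention described above.

$$
\begin{aligned}
\text{TP} &= \sum_{i=1}^{N} \mathbb{1}\left[\, y_i = 1 \ \land\ s_i \ge T \,\right] \\
\text{FP} &= \sum_{i=1}^{N} \mathbb{1}\left[\, y_i = 0 \ \land\ s_i \ge T \,\right] \\
\text{FN} &= \sum_{i=1}^{N} \mathbb{1}\left[\, y_i = 1 \ \land\ s_i < T \,\right] \\
\text{TN} &= \sum_{i=1}^{N} \mathbb{1}\left[\, y_i = 0 \ \land\ s_i < T \,\right]
\end{aligned}
$$

Every sample falls into exactly one of the four groups, so $\text{TP} + \text{FP} + \text{FN} + \text{TN} = N$.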
Example: Binary Classification
This example considers five samples with the following ground truths, inferences, and threshold:
Using the above formula for TP, FP, FN, and TN yields the following metrics:
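As a rough illustration of how such an example is computed, the Python sketch below counts the four metrics for five hypothetical samples. The labels, scores, and threshold of 0.5 are assumptions chosen for illustration, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(y_true, y_score, threshold):
    """Count TP / FP / FN / TN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for label, score in zip(y_true, y_score):
        # Inclusive threshold: a score equal to the threshold is positive.
        predicted_positive = score >= threshold
        if predicted_positive:
            counts["TP" if label == 1 else "FP"] += 1
        else:
            counts["FN" if label == 1 else "TN"] += 1
    return counts


# Hypothetical data: five samples
y_true = [1, 0, 1, 0, 1]
y_score = [0.9, 0.6, 0.3, 0.2, 0.5]
counts = confusion_counts(y_true, y_score, threshold=0.5)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 1}
```

Note that the last sample, with a score exactly equal to the threshold, counts as a positive inference.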
TP / FP / FN / TN metrics are computed a little differently in a multiclass classification workflow.
For a multiclass classification workflow, these four metrics are defined per class. This technique, also known as one-vs-rest (OvR), essentially evaluates each class as a binary classification problem.
Consider a classification problem where a given image belongs to either the Airplane, Boat, or Car class. Each of these TP / FP / FN / TN metrics is computed for each class. For class Airplane, the metrics are defined as follows:
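|True Positive||Any image predicted as an Airplane that is labeled as an Airplane|
|False Positive||Any image predicted as an Airplane that is not labeled as an Airplane|
|False Negative||Any image not predicted as an Airplane that is labeled as an Airplane|
|True Negative||Any image not predicted as an Airplane that is not labeled as an Airplane|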
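The one-vs-rest counting can be sketched in Python as follows. This is a minimal sketch under the assumption that each image has exactly one ground truth class and one predicted class; the helper name `one_vs_rest_counts` and the sample data are hypothetical.

```python
def one_vs_rest_counts(y_true, y_pred, classes):
    """Compute TP / FP / FN / TN per class, treating each class as a binary problem."""
    counts = {c: {"TP": 0, "FP": 0, "FN": 0, "TN": 0} for c in classes}
    for actual, predicted in zip(y_true, y_pred):
        for c in classes:
            if predicted == c and actual == c:
                counts[c]["TP"] += 1
            elif predicted == c:
                counts[c]["FP"] += 1  # predicted as c, labeled as another class
            elif actual == c:
                counts[c]["FN"] += 1  # labeled as c, predicted as another class
            else:
                counts[c]["TN"] += 1
    return counts


# Hypothetical ground truths and inferences for five images
y_true = ["Airplane", "Boat", "Car", "Airplane", "Boat"]
y_pred = ["Airplane", "Car", "Car", "Boat", "Boat"]
per_class = one_vs_rest_counts(y_true, y_pred, ["Airplane", "Boat", "Car"])
print(per_class["Airplane"])  # {'TP': 1, 'FP': 0, 'FN': 1, 'TN': 3}
```

A single misclassified image contributes to two classes at once: it is an FN for its ground truth class and an FP for the predicted class.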
In a multi-label classification workflow, TP / FP / FN / TN are computed per class, like in multiclass classification.
A sample is considered positive if its ground truth includes the class being evaluated; otherwise, it is a negative sample. The same logic applies to inferences. For example, if a classifier predicts that a sample belongs to classes Airplane and Boat, and the ground truth for that sample is only class Airplane, then the sample is a TP for class Airplane and an FP for class Boat.
A multi-label classification workflow can alternatively be thought of as a collection of binary classification workflows, one per class.
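A minimal Python sketch of this per-class view, assuming ground truths and inferences are given as sets of class names (the helper name `multilabel_counts` and the sample data are hypothetical):

```python
def multilabel_counts(true_labels, predicted_labels, classes):
    """Compute TP / FP / FN / TN per class when each sample can have several labels."""
    counts = {c: {"TP": 0, "FP": 0, "FN": 0, "TN": 0} for c in classes}
    for actual, predicted in zip(true_labels, predicted_labels):
        for c in classes:
            is_positive_sample = c in actual
            is_positive_inference = c in predicted
            if is_positive_inference:
                counts[c]["TP" if is_positive_sample else "FP"] += 1
            else:
                counts[c]["FN" if is_positive_sample else "TN"] += 1
    return counts


# One hypothetical sample: labeled Airplane only, predicted Airplane and Boat
true_labels = [{"Airplane"}]
predicted_labels = [{"Airplane", "Boat"}]
result = multilabel_counts(true_labels, predicted_labels, ["Airplane", "Boat", "Car"])
print(result["Airplane"])  # {'TP': 1, 'FP': 0, 'FN': 0, 'TN': 0}
print(result["Boat"])      # {'TP': 0, 'FP': 1, 'FN': 0, 'TN': 0}
```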
There are some differences in how these four metrics work for a detection workflow compared to a classification workflow. Rather than being computed at the sample level (e.g. per image), they're computed at the instance level (i.e. per object) for instances that the model is detecting. When given an image with multiple objects, each inference and each ground truth is assigned to one group, and the definitions of the terms are slightly altered:
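|True Positive||TP||Positive inference (confidence score at or above the threshold) that is matched with a ground truth|
|False Positive||FP||Positive inference (confidence score at or above the threshold) that is not matched with a ground truth|
|False Negative||FN||Ground truth that is not matched with an inference or that is matched with a negative inference (confidence score below the threshold)|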
Poorly defined for object detection!
In an object detection workflow, a true negative is any non-object that isn't detected as an object. This isn't well defined, so true negative isn't a commonly used metric in object detection.
Occasionally, in an object detection workflow, "true negative" is used to refer to any image that does not have any true positive or false positive inferences.
Let’s assume that a matching algorithm has already been run on all inferences and that the matched pairs and unmatched
ground truths and inferences are given. Consider the following variables, adapted from
||List of matched ground truth and inference bounding box pairs|
||List of unmatched ground truth bounding boxes|
||List of unmatched inference bounding boxes|
||Threshold used to filter valid inference bounding boxes based on their confidence scores|
Then these metrics are defined:
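A minimal Python sketch of these definitions, assuming the matching step has already produced the three lists. The data layout here is an assumption for illustration: each inference is a `(box, score)` tuple, each matched pair is `(ground_truth_box, (box, score))`, and `detection_counts` is a hypothetical helper name.

```python
def detection_counts(matched, unmatched_gt, unmatched_inf, threshold):
    """Count TP / FP / FN from the outputs of a bounding box matching step."""
    # Matched inference at or above the threshold: true positive.
    tp = sum(1 for _gt, (_box, score) in matched if score >= threshold)
    # Unmatched inference at or above the threshold: false positive.
    fp = sum(1 for _box, score in unmatched_inf if score >= threshold)
    # Ground truth with no match, or matched only with a below-threshold inference.
    fn = len(unmatched_gt) + sum(
        1 for _gt, (_box, score) in matched if score < threshold
    )
    return {"TP": tp, "FP": fp, "FN": fn}


# Hypothetical boxes as (x_min, y_min, x_max, y_max)
matched = [
    ((10, 10, 50, 50), ((12, 11, 49, 52), 0.9)),
    ((60, 60, 90, 90), ((58, 61, 92, 88), 0.4)),
]
unmatched_gt = [(100, 100, 120, 120)]
unmatched_inf = [((0, 0, 5, 5), 0.7), ((200, 200, 210, 210), 0.2)]

det = detection_counts(matched, unmatched_gt, unmatched_inf, threshold=0.5)
print(det)  # {'TP': 1, 'FP': 1, 'FN': 2}
```

Unmatched inferences below the threshold are dropped entirely: they are neither FP nor TN, which reflects why TN is not counted in this workflow.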
Example: Single-class Object Detection
This example includes two ground truths and two inferences, and when computed with an IoU threshold of 0.5 and confidence score threshold of 0.5 yields:
Like classification, a multiclass object detection workflow computes TP / FP / FN per class.
Example: Multiclass Object Detection
Similar to multiclass classification, TP / FP / FN are computed for class
Apple and class
Using an IoU threshold of 0.5 and a confidence score threshold of 0.5, this example yields:
Averaging Per-class Metrics
For problems with multiple classes, these TP / FP / FN / TN metrics are computed for each class. If you are looking for a single score that summarizes model performance across all classes, there are a few different ways to aggregate per-class metrics: macro, micro, and weighted.
Read more about these different averaging methods in the Averaging Methods guide.
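As a rough sketch of how the three aggregations differ, the Python below applies them to precision, computed as TP / (TP + FP). The per-class counts are hypothetical, and the conventions used here (support taken as TP + FN, precision set to 0.0 when a class has no positive inferences) are assumptions rather than definitions taken from the Averaging Methods guide.

```python
def precision(tp, fp):
    """Precision for one set of counts; 0.0 by convention when TP + FP is zero."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


# Hypothetical per-class counts
per_class_counts = {
    "Airplane": {"TP": 1, "FP": 0, "FN": 1},
    "Boat": {"TP": 1, "FP": 1, "FN": 1},
    "Car": {"TP": 1, "FP": 1, "FN": 0},
}

# Macro: average the per-class precision values, each class weighted equally.
per_class_precision = {
    name: precision(c["TP"], c["FP"]) for name, c in per_class_counts.items()
}
macro = sum(per_class_precision.values()) / len(per_class_precision)

# Micro: sum the counts across classes first, then compute precision once.
total_tp = sum(c["TP"] for c in per_class_counts.values())
total_fp = sum(c["FP"] for c in per_class_counts.values())
micro = precision(total_tp, total_fp)

# Weighted: average per-class precision, weighted by the number of
# ground truth positives for each class (TP + FN).
support = {name: c["TP"] + c["FN"] for name, c in per_class_counts.items()}
weighted = sum(
    per_class_precision[name] * support[name] for name in per_class_counts
) / sum(support.values())

print(round(macro, 3), round(micro, 3), round(weighted, 3))  # 0.667 0.6 0.7
```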
Limitations and Biases
TP, FP, FN, and TN assume that each sample or instance can be classified as either positive or negative, so they apply directly only to single-class applications. The workaround for multiple-class applications is to compute these metrics for each label using the one-vs-rest (OvR) strategy, treating each label as its own single-class problem.
Additionally, these four metrics don't take the model's confidence score into account: all inferences that pass the confidence score threshold are treated the same. For example, with a confidence score threshold of 0.5, an inference with a confidence score barely above the threshold (e.g. 0.55) is counted as a positive inference exactly like one with a very high confidence score (e.g. 0.99). To examine performance with confidence scores taken into account, consider plotting a histogram of the distribution of confidence scores.
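A minimal sketch of such a histogram in plain Python, assuming confidence scores lie in the range [0, 1]. The helper name `score_histogram` and the scores are hypothetical; a plotting library could be used in place of the text output.

```python
def score_histogram(scores, num_bins=10):
    """Count scores in equal-width bins over [0, 1]."""
    counts = [0] * num_bins
    for score in scores:
        # Clamp the index so that a score of exactly 1.0 lands in the last bin.
        index = min(int(score * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical confidence scores, all of which pass a threshold of 0.5
scores = [0.55, 0.99, 0.51, 0.97, 0.62, 0.93, 0.58]
hist = score_histogram(scores, num_bins=10)

for i, count in enumerate(hist):
    low, high = i / 10, (i + 1) / 10
    print(f"{low:.1f}-{high:.1f} | {'#' * count}")
```

In this made-up data, all seven inferences count identically toward TP or FP, yet the histogram shows two distinct groups: three scores just above the threshold and three close to 1.0.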