TP / FP / FN / TN#
The counts of true positive (TP), false positive (FP), false negative (FN), and true negative (TN) ground truths and inferences are essential for summarizing model performance. These metrics are the building blocks of many other metrics, including accuracy, precision, and recall.
Metric  Abbreviation  Description

True Positive  TP  An instance for which both predicted and actual values are positive
False Positive  FP  An instance for which predicted value is positive but actual value is negative
False Negative  FN  An instance for which predicted value is negative but actual value is positive
True Negative  TN  An instance for which both predicted and actual values are negative
To compute these metrics, each inference is compared to a ground truth and categorized into one of the four groups. Let’s say we’re building a dog classifier that predicts whether an image has a dog or not:
Positive Inference (Dog)  Negative Inference (No Dog)  

Positive Ground Truth (Dog)  True Positive (TP)  False Negative (FN) 
Negative Ground Truth (No Dog)  False Positive (FP)  True Negative (TN) 
Images of a dog are positive samples, and images without a dog are negative samples.
If a classifier predicts that there is a dog on a positive sample, that inference is a true positive (TP). If that classifier predicts that there isn’t a dog on a positive sample, that inference is a false negative (FN).
Similarly, if that classifier predicts that there is a dog on a negative sample, that inference is a false positive (FP). A negative inference on a negative sample is a true negative (TN).
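The four cases above can be sketched as a small function. This is a minimal illustration, not a reference implementation; the function name `categorize` and the four sample pairs are hypothetical.

```python
def categorize(is_dog: bool, predicted_dog: bool) -> str:
    """Return the group (TP, FN, FP, or TN) for one ground truth / inference pair."""
    if is_dog and predicted_dog:
        return "TP"  # dog in the image, classifier says dog
    if is_dog and not predicted_dog:
        return "FN"  # dog in the image, classifier says no dog
    if not is_dog and predicted_dog:
        return "FP"  # no dog in the image, classifier says dog
    return "TN"      # no dog in the image, classifier says no dog


# Hypothetical (ground truth, inference) pairs for four images.
pairs = [(True, True), (True, False), (False, True), (False, False)]
labels = [categorize(gt, pred) for gt, pred in pairs]
print(labels)  # ['TP', 'FN', 'FP', 'TN']
```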
Implementation Details#
The TP / FP / FN / TN metrics have been around for a long time and are mainly used to evaluate classification, detection, and segmentation models.
The implementation of these metrics is simple and straightforward. That said, there are different guidelines and edge cases to be aware of for binary and multiclass problems as well as object detection and other workflows.
Classification#
There are three types of classification workflows: binary, multiclass, and multilabel.
Binary#
In a binary classification workflow, TP, FN, FP, and TN are implemented as follows:
Variable  Type  Description

ground_truths  List[bool]  Ground truth labels, where True indicates a positive sample
inferences  List[float]  Predicted confidence scores, where a higher score indicates a higher confidence of the sample being positive
T  float  Threshold value to compare against the inference’s confidence score, where score >= T is positive
Should Threshold Be Inclusive or Exclusive?
A confidence threshold is defined as "the minimum score that the model will consider the inference to be positive (i.e. true)". Therefore, it is standard practice to treat inferences with a confidence score greater than or equal to the confidence threshold as positive.
With these inputs, the TP / FP / FN / TN metrics are defined as:
TP = sum(gt and inf >= T for gt, inf in zip(ground_truths, inferences))
FP = sum(not gt and inf >= T for gt, inf in zip(ground_truths, inferences))
FN = sum(gt and inf < T for gt, inf in zip(ground_truths, inferences))
TN = sum(not gt and inf < T for gt, inf in zip(ground_truths, inferences))
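As a runnable sketch, the formulas above can be wrapped in a function. The names `Counts` and `binary_counts` are hypothetical, and the six samples below are made up for illustration; they are not the samples from the example that follows.

```python
from typing import List, NamedTuple


class Counts(NamedTuple):
    tp: int
    fp: int
    fn: int
    tn: int


def binary_counts(ground_truths: List[bool], inferences: List[float], T: float) -> Counts:
    """Count TP / FP / FN / TN, treating score >= T as a positive inference."""
    pairs = list(zip(ground_truths, inferences))
    return Counts(
        tp=sum(gt and inf >= T for gt, inf in pairs),
        fp=sum(not gt and inf >= T for gt, inf in pairs),
        fn=sum(gt and inf < T for gt, inf in pairs),
        tn=sum(not gt and inf < T for gt, inf in pairs),
    )


# Illustrative inputs only (assumed values).
ground_truths = [True, False, True, True, False, False]
inferences = [0.9, 0.8, 0.5, 0.2, 0.4, 0.1]

counts = binary_counts(ground_truths, inferences, T=0.5)
print(counts)  # Counts(tp=2, fp=1, fn=1, tn=2)
```

Note that the third sample has a score exactly equal to the threshold and is counted as a positive inference, because the comparison is inclusive.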
Example: Binary Classification
This example considers five samples with the following ground truths, inferences, and threshold:
Using the above formula for TP, FP, FN, and TN yields the following metrics:
Multiclass#
TP / FP / FN / TN metrics are computed a little differently in a multiclass classification workflow.
For a multiclass classification workflow, these four metrics are defined per class. This technique, also known as one-vs-rest (OvR), essentially evaluates each class as a binary classification problem.
Consider a classification problem where a given image belongs to either the Airplane, Boat, or Car class. Each of these TP / FP / FN / TN metrics is computed for each class. For class Airplane, the metrics are defined as follows:
Metric  Example

True Positive  Any image predicted as an Airplane that is labeled as an Airplane
False Positive  Any image predicted as an Airplane that is not labeled as an Airplane (e.g. labeled as Boat but predicted as Airplane)
False Negative  Any image not predicted as an Airplane that is labeled as an Airplane (e.g. labeled as Airplane but predicted as Car)
True Negative  Any image not predicted as an Airplane that is not labeled as an Airplane (e.g. labeled as Boat but predicted as Boat or Car)
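The per-class definitions above can be sketched as follows. This is a minimal illustration that assumes each image has exactly one ground truth label and one predicted label; the function name `per_class_counts` and the five samples are hypothetical.

```python
from typing import Dict, List


def per_class_counts(
    ground_truths: List[str], inferences: List[str], classes: List[str]
) -> Dict[str, Dict[str, int]]:
    """Evaluate each class as its own binary problem (one-vs-rest)."""
    counts = {}
    for cls in classes:
        tp = fp = fn = tn = 0
        for gt, inf in zip(ground_truths, inferences):
            gt_positive = gt == cls    # sample is labeled as this class
            inf_positive = inf == cls  # sample is predicted as this class
            if gt_positive and inf_positive:
                tp += 1
            elif inf_positive:
                fp += 1
            elif gt_positive:
                fn += 1
            else:
                tn += 1
        counts[cls] = {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
    return counts


# Illustrative labels and predictions (assumed values).
ground_truths = ["Airplane", "Airplane", "Boat", "Boat", "Car"]
inferences = ["Airplane", "Car", "Airplane", "Boat", "Car"]

counts = per_class_counts(ground_truths, inferences, ["Airplane", "Boat", "Car"])
print(counts["Airplane"])  # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 2}
```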
Multilabel#
In a multilabel classification workflow, TP / FP / FN / TN are computed per class, like in multiclass classification.
A sample is considered to be a positive one if the ground truth includes the evaluating class; otherwise, it’s a negative sample. The same logic can be applied to the inferences, so, for example, if a classifier predicts that this sample belongs to class Airplane and Boat, and the ground truth for the same sample is only class Airplane, then this sample is considered to be a TP for class Airplane, and FP for class Boat.
A multilabel classification workflow can alternatively be thought of as a collection of binary classification workflows.
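A minimal sketch of this per-class view, assuming ground truths and inferences are given as sets of class names per sample. The function name `multilabel_counts` is hypothetical, and the Car class is added only to show a true negative.

```python
from typing import Dict, List, Set


def multilabel_counts(
    ground_truths: List[Set[str]], inferences: List[Set[str]], classes: List[str]
) -> Dict[str, Dict[str, int]]:
    """Treat each class as a separate binary problem over all samples."""
    counts = {cls: {"TP": 0, "FP": 0, "FN": 0, "TN": 0} for cls in classes}
    for gt, inf in zip(ground_truths, inferences):
        for cls in classes:
            gt_positive = cls in gt    # ground truth includes this class
            inf_positive = cls in inf  # classifier predicted this class
            if gt_positive and inf_positive:
                counts[cls]["TP"] += 1
            elif inf_positive:
                counts[cls]["FP"] += 1
            elif gt_positive:
                counts[cls]["FN"] += 1
            else:
                counts[cls]["TN"] += 1
    return counts


# The sample described above: labeled Airplane, predicted Airplane and Boat.
counts = multilabel_counts(
    ground_truths=[{"Airplane"}],
    inferences=[{"Airplane", "Boat"}],
    classes=["Airplane", "Boat", "Car"],
)
print(counts["Airplane"])  # {'TP': 1, 'FP': 0, 'FN': 0, 'TN': 0}
print(counts["Boat"])      # {'TP': 0, 'FP': 1, 'FN': 0, 'TN': 0}
```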
Object Detection#
There are some differences in how these four metrics work for a detection workflow compared to a classification workflow. Rather than being computed at the sample level (e.g. per image), they're computed at the instance level (i.e. per object) for instances that the model is detecting. When given an image with multiple objects, each inference and each ground truth is assigned to one group, and the definitions of the terms are slightly altered:
Metric  Abbreviation  Description

True Positive  TP  Positive inference (score >= T) that is matched with a ground truth
False Positive  FP  Positive inference (score >= T) that is not matched with a ground truth
False Negative  FN  Ground truth that is not matched with an inference or that is matched with a negative inference (score < T)
True Negative  TN  Poorly defined for object detection! In an object detection workflow, a true negative is any non-object that isn't detected as an object. This isn't well defined, and as such true negative isn't a commonly used metric in object detection. Occasionally, "true negative" is used to refer to any image that does not have any true positive or false positive inferences.
In an object detection workflow, checking for detection correctness requires a couple of other metrics (e.g., Intersection over Union (IoU) and Geometry Matching).
Single-class#
Let’s assume that a matching algorithm has already been run on all inferences and that the matched pairs and unmatched ground truths and inferences are given. Consider the following variables, adapted from match_inferences:
Variable  Type  Description

matched  List[Tuple[GT, Inf]]  List of matched ground truth and inference bounding box pairs
unmatched_gt  List[GT]  List of unmatched ground truth bounding boxes
unmatched_inf  List[Inf]  List of unmatched inference bounding boxes
T  float  Threshold used to filter valid inference bounding boxes based on their confidence scores
Then these metrics are defined as:
TP = sum(inf.score >= T for _, inf in matched)
FN = sum(inf.score < T for _, inf in matched) + len(unmatched_gt)
FP = sum(inf.score >= T for inf in unmatched_inf)
Example: Single-class Object Detection
Bounding Box  Score  IoU(\(\text{A}\))  IoU(\(\text{B}\)) 

\(\text{a}\)  0.98  0.9  0.0 
\(\text{b}\)  0.6  0.0  0.13 
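This example includes two ground truths (\(\text{A}\) and \(\text{B}\)) and two inferences (\(\text{a}\) and \(\text{b}\)). With an IoU threshold of 0.5 and a confidence score threshold of 0.5, \(\text{a}\) is matched with \(\text{A}\), \(\text{b}\) is an unmatched positive inference, and \(\text{B}\) is left unmatched, which yields: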
TP  FP  FN 

1  1  1 
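The counts in this example can be reproduced with a short sketch. It assumes the matching step has already produced the result implied by the IoU table (a matched with A, b and B unmatched). The `GT` and `Inf` classes here are hypothetical stand-ins, not the types used by match_inferences.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GT:
    name: str


@dataclass
class Inf:
    name: str
    score: float


def detection_counts(
    matched: List[Tuple[GT, Inf]],
    unmatched_gt: List[GT],
    unmatched_inf: List[Inf],
    T: float,
) -> Tuple[int, int, int]:
    """Return (TP, FP, FN), counting only the elements that satisfy each condition."""
    tp = sum(inf.score >= T for _, inf in matched)
    fn = sum(inf.score < T for _, inf in matched) + len(unmatched_gt)
    fp = sum(inf.score >= T for inf in unmatched_inf)
    return tp, fp, fn


# Assumed matching result at an IoU threshold of 0.5:
matched = [(GT("A"), Inf("a", 0.98))]  # IoU(a, A) = 0.9
unmatched_gt = [GT("B")]               # best IoU with B is 0.13
unmatched_inf = [Inf("b", 0.6)]        # b overlaps no ground truth enough

tp, fp, fn = detection_counts(matched, unmatched_gt, unmatched_inf, T=0.5)
print(tp, fp, fn)  # 1 1 1
```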
Multiclass#
As in classification, a multiclass object detection workflow computes TP / FP / FN per class.
Example: Multiclass Object Detection
Bounding Box  Class  Score  IoU(\(\text{A}\))

\(\text{A}\)  Apple  —  —
\(\text{a}\)  Apple  0.3  0.0
\(\text{b}\)  Banana  0.5  0.8
Similar to multiclass classification, TP / FP / FN are computed for class Apple and class Banana separately.
Using an IoU threshold of 0.5 and a confidence score threshold of 0.5, this example yields:
Class  TP  FP  FN

Apple  0  0  1
Banana  0  1  0
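The per-class result above can be sketched as follows. The greedy, class-aware matcher here is a simplified stand-in written for this example; it is an assumption, not the matching algorithm referenced earlier, and the function name `per_class_detection_counts` is hypothetical.

```python
from collections import defaultdict

# (name, class) for ground truths; (name, class, score) for inferences.
ground_truths = [("A", "Apple")]
inferences = [("a", "Apple", 0.3), ("b", "Banana", 0.5)]
# IoU between each inference and each ground truth, taken from the table.
iou = {("a", "A"): 0.0, ("b", "A"): 0.8}


def per_class_detection_counts(ground_truths, inferences, iou, iou_threshold, T):
    """Match within each class, then count TP / FP / FN for that class."""
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    classes = {c for _, c in ground_truths} | {c for _, c, _ in inferences}
    for cls in sorted(classes):
        free = {g for g, c in ground_truths if c == cls}  # unmatched ground truths
        infs = [(n, s) for n, c, s in inferences if c == cls]
        for name, score in sorted(infs, key=lambda x: -x[1]):
            # Greedy: take the free ground truth with the highest overlap.
            best = max(free, key=lambda g: iou.get((name, g), 0.0), default=None)
            if best is not None and iou.get((name, best), 0.0) >= iou_threshold:
                free.remove(best)
                # A match with a negative inference still leaves the ground truth missed.
                counts[cls]["TP" if score >= T else "FN"] += 1
            elif score >= T:
                counts[cls]["FP"] += 1
        counts[cls]["FN"] += len(free)
    return dict(counts)


result = per_class_detection_counts(ground_truths, inferences, iou, 0.5, 0.5)
print(result["Apple"])   # {'TP': 0, 'FP': 0, 'FN': 1}
print(result["Banana"])  # {'TP': 0, 'FP': 1, 'FN': 0}
```

Inference b overlaps ground truth A well (IoU 0.8), but because the classes differ it is never considered a match, so it counts as a false positive for Banana.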
Averaging Per-class Metrics#
For problems with multiple classes, these TP / FP / FN / TN metrics are computed for each class. If you are looking for a single score that summarizes model performance across all classes, there are a few different ways to aggregate per-class metrics: macro, micro, and weighted.
Read more about these different averaging methods in the Averaging Methods guide.
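As a rough sketch of how the three averaging methods differ, here they are applied to precision, computed from per-class TP / FP / FN counts. The counts are made up, and the definitions follow common usage (weighted averaging uses each class's number of ground truth positives, TP + FN, as its weight); the Averaging Methods guide remains the authoritative description.

```python
# Hypothetical per-class counts.
per_class = {
    "Airplane": {"TP": 8, "FP": 2, "FN": 2},
    "Boat": {"TP": 1, "FP": 1, "FN": 3},
}


def precision(c):
    """Precision = TP / (TP + FP), defined as 0.0 when there are no positive inferences."""
    denominator = c["TP"] + c["FP"]
    return c["TP"] / denominator if denominator else 0.0


# Macro: unweighted mean of the per-class precisions.
macro = sum(precision(c) for c in per_class.values()) / len(per_class)

# Micro: pool the counts across classes, then compute precision once.
total_tp = sum(c["TP"] for c in per_class.values())
total_fp = sum(c["FP"] for c in per_class.values())
micro = total_tp / (total_tp + total_fp)

# Weighted: mean of per-class precisions, weighted by support (TP + FN).
supports = {name: c["TP"] + c["FN"] for name, c in per_class.items()}
weighted = sum(precision(c) * supports[name] for name, c in per_class.items()) / sum(
    supports.values()
)

print(round(macro, 3), round(micro, 3), round(weighted, 3))  # 0.65 0.75 0.714
```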
Limitations and Biases#
TP, FP, FN, and TN are four metrics based on the assumption that each sample/instance can be classified as either positive or negative, so they can only be applied directly to single-class applications. The workaround for multiple-class applications is to compute these metrics for each label using the one-vs-rest (OvR) strategy, treating each label as a single-class problem.
Additionally, these four metrics don't take model confidence score into account. All inferences above the confidence score threshold are treated the same! For example, when using a confidence score threshold of 0.5, an inference with a confidence score barely above the threshold (e.g. 0.55) is treated the same as an inference with a very high confidence score (e.g. 0.99). In other words, any inference above the confidence threshold is considered as a positive inference. To examine performance taking confidence score into account, consider plotting a histogram of the distribution of confidence scores.
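A minimal sketch of this limitation, using made-up scores: every score at or above the threshold contributes the same single count, while a simple binned histogram shows how those scores are actually distributed. The `histogram` helper is hypothetical and uses only the standard library; a plotting library could render the same bins as a chart.

```python
from collections import Counter

# Illustrative confidence scores (assumed values).
scores = [0.55, 0.99, 0.51, 0.62, 0.97, 0.58, 0.30, 0.12]
T = 0.5

# All of these count equally as positive inferences, whether 0.55 or 0.99.
positives = [s for s in scores if s >= T]


def histogram(values, num_bins=5):
    """Count values in equal-width bins over [0, 1]; a score of 1.0 falls in the last bin."""
    bins = Counter(min(int(v * num_bins), num_bins - 1) for v in values)
    return [bins.get(i, 0) for i in range(num_bins)]


print(len(positives))        # 6
print(histogram(positives))  # [0, 0, 3, 1, 2]
```

Here half of the positive inferences sit in the 0.4 to 0.6 bin, just above the threshold, which the TP / FP counts alone would not reveal.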