TP / FP / FN / TN#

The counts of true positive (TP), false positive (FP), false negative (FN), and true negative (TN) ground truths and inferences are essential for summarizing model performance. These metrics are the building blocks of many other metrics, including accuracy, precision, and recall.

Metric		Description
True Positive	TP	An instance for which both predicted and actual values are positive
False Positive	FP	An instance for which predicted value is positive but actual value is negative
False Negative	FN	An instance for which predicted value is negative but actual value is positive
True Negative	TN	An instance for which both predicted and actual values are negative

To compute these metrics, each inference is compared to a ground truth and categorized into one of the four groups. Let’s say we’re building a dog classifier that predicts whether an image has a dog or not:

	Positive Inference (Dog)	Negative Inference (No Dog)
Positive Ground Truth (Dog)	True Positive (TP)	False Negative (FN)
Negative Ground Truth (No Dog)	False Positive (FP)	True Negative (TN)

Images of a dog are positive samples, and images without a dog are negative samples.

If a classifier predicts that there is a dog on a positive sample, that inference is a true positive (TP). If that classifier predicts that there isn’t a dog on a positive sample, that inference is a false negative (FN).

Similarly, if that classifier predicts that there is a dog on a negative sample, that inference is a false positive (FP). A negative inference on a negative sample is a true negative (TN).

Implementation Details#

The TP / FP / FN / TN metrics have been around for a long time and are mainly used to evaluate classification, detection, and segmentation models.

The implementation of these metrics is simple and straightforward. That said, there are different guidelines and edge cases to be aware of for binary and multiclass problems as well as object detection and other workflows.

Classification#

There are three types of classification workflows: binary, multiclass, and multi-label.

Binary#

In binary classification workflow, TP, FN, FP, and TN are implemented as follows:

Variable	Type	Description
`ground_truths`	`List[bool]`	Ground truth labels, where `True` indicates a positive sample
`inferences`	`List[float]`	Predicted confidence scores, where a higher score indicates a higher confidence of the sample being positive
`T`	`float`	Threshold value to compare against the inference’s confidence score, where `score >= T` is positive

Should Threshold Be Inclusive or Exclusive?

A confidence threshold is defined as "the minimum score that the model will consider the inference to be positive (i.e. true)". Therefore, it is a standard practice to consider inferences with confidence score greater than or equal to the confidence threshold as positive.

With these inputs, TP / FP/ FN / TN metrics are defined:

TP = sum(    gt and inf >= T for gt, inf in zip(ground_truths, inferences))
FP = sum(not gt and inf >= T for gt, inf in zip(ground_truths, inferences))
FN = sum(    gt and inf <  T for gt, inf in zip(ground_truths, inferences))
TN = sum(not gt and inf <  T for gt, inf in zip(ground_truths, inferences))

Example: Binary Classification

This example considers five samples with the following ground truths, inferences, and threshold:

ground_truths = [False, True, False, False, True]
inferences = [0.3, 0.2, 0.9, 0.4, 0.5]
T = 0.5

Using the above formula for TP, FP, FN, and TN yields the following metrics:

print(f"TP={TP}, FP={FP}, FN={FN}, TN={TN}")
# TP=1, FN=1, FP=1, TN=2

Multiclass#

TP / FP / FN / TN metrics are computed a little differently in a multiclass classification workflow.

For a multiclass classification workflow, these four metrics are defined per class. This technique, also known as one-vs-rest (OvR), essentially evaluates each class as a binary classification problem.

Consider a classification problem where a given image belongs to either the Airplane, Boat, or Car class. Each of these TP / FP / FN / TN metrics is computed for each class. For class Airplane, the metrics are defined as follows:

Metric	Example
True Positive	Any image predicted as an `Airplane` that is labeled as an `Airplane`
False Positive	Any image predicted as an `Airplane` that is not labeled as an `Airplane` (e.g. labeled as `Boat` but predicted as `Airplane`)
False Negative	Any image not predicted as an `Airplane` that is labeled as an `Airplane` (e.g. labeled as `Airplane` but predicted as `Car`)
True Negative	Any image not predicted as an `Airplane` that is not labeled as an `Airplane` (e.g. labeled as `Boat` but predicted as `Boat` or `Car`)

Multi-label#

In a multi-label classification workflow, TP / FP / FN / TN are computed per class, like in multiclass classification.

A sample is considered to be a positive one if the ground truth includes the evaluating class; otherwise, it’s a negative sample. The same logic can be applied to the inferences, so, for example, if a classifier predicts that this sample belongs to class Airplane and Boat, and the ground truth for the same sample is only class Airplane, then this sample is considered to be a TP for class Airplane, and FP for class Boat.

Multi-label classification workflow can alternately be thought of as a collection of binary classification workflows.

Object Detection#

There are some differences in how these four metrics work for a detection workflow compared to a classification workflow. Rather than being computed at the sample level (e.g. per image), they're computed at the instance level (i.e. per object) for instances that the model is detecting. When given an image with multiple objects, each inference and each ground truth is assigned to one group, and the definitions of the terms are slightly altered:

Metric		Description
True Positive	TP	Positive inference (`score >= T`) that is matched with a ground truth
False Positive	FP	Positive inference (`score >= T`) that is not matched with a ground truth
False Negative	FN	Ground truth that is not matched with an inference or that is matched with a negative inference (`score < T`)
True Negative	TN	Poorly defined for object detection! In object detection workflow, a true negative is any non-object that isn't detected as an object. This isn't well defined and as such true negative isn't a commonly used metric in object detection. Occasionally, for object detection workflow "true negative" is used to refer to any image that does not have any true positive or false positive inferences.

In object detection workflow, checking for detection correctness requires a couple of other metrics (e.g., Intersection over Union (IoU) and Geometry Matching).

Single-class#

Let’s assume that a matching algorithm has already been run on all inferences and that the matched pairs and unmatched ground truths and inferences are given. Consider the following variables, adapted from match_inferences:

Variable	Type	Description
`matched`	`List[Tuple[GT, Inf]]`	List of matched ground truth and inference bounding box pairs
`unmatched_gt`	`List[GT]`	List of unmatched ground truth bounding boxes
`unmatched_inf`	`List[Inf]`	List of unmatched inference bounding boxes
`T`	`float`	Threshold used to filter valid inference bounding boxes based on their confidence scores

Then these metrics are defined:

TP = len([inf.score >= T for _, inf in matched])
FN = len([inf.score <  T for _, inf in matched]) + len(unmatched_gt)
FP = len([inf.score >= T for inf in unmatched_inf])

Example: Single-class Object Detection

example legends example 1

Bounding Box	Score	IoU(\(\text{A}\))	IoU(\(\text{B}\))
\(\text{a}\)	0.98	0.9	0.0
\(\text{b}\)	0.6	0.0	0.13

This example includes two ground truths and two inferences, and when computed with an IoU threshold of 0.5 and confidence score threshold of 0.5 yields:

TP	FP	FN
1	1	1

Multiclass#

Like classification, multiclass object detection workflow compute TP / FP / FN per class.

Example: Multiclass Object Detection

example legends example 2

Bounding Box	Class	Score	IoU(\(\text{A}\))
\(\text{A}\)	`Apple`	—	—
\(\text{a}\)	`Apple`	0.3	0.0
\(\text{b}\)	`Banana`	0.5	0.8

Similar to multiclass classification, TP / FP / FN are computed for class Apple and class Banana separately.

Using an IoU threshold of 0.5 and a confidence score threshold of 0.5, this example yields:

Class	TP	FP	FN
`Apple`	0	0	1
`Banana`	0	1	0

Averaging Per-class Metrics#

For problems with multiple classes, these TP / FP / FN / TN metrics are computed for each class. If you are looking for a single score that summarizes model performance across all classes, there are a few different ways to aggregate per-class metrics: macro, micro, and weighted.

Read more about these different averaging methods in the Averaging Methods guide.

Limitations and Biases#

TP, FP, FN, and TN are four metrics based on the assumption that each sample/instance can be classified as a positive or a negative, thus they can only be applied to single-class applications. The workaround for multiple-class applications is to compute these metrics for each label using the one-vs-rest (OvR) strategy and then treat it as a single-class problem.

Additionally, these four metrics don't take model confidence score into account. All inferences above the confidence score threshold are treated the same! For example, when using a confidence score threshold of 0.5, an inference with a confidence score barely above the threshold (e.g. 0.55) is treated the same as an inference with a very high confidence score (e.g. 0.99). In other words, any inference above the confidence threshold is considered as a positive inference. To examine performance taking confidence score into account, consider plotting a histogram of the distribution of confidence scores.