Creating Tests

Testing with Kolena starts by creating tests: test cases and test suites.

Creating Test Cases

In this section, we'll go through the basic steps of transforming your existing benchmark datasets or test splits into test cases in Kolena. Let's assume we have a metadata.csv file with entries for each image (located in a cloud bucket at the provided locator) and its associated annotations.
Custom
Object Detection
Instance Segmentation
Classification
Face Recognition
When building a workflow, you have complete control over the data type and the ground truth/inference types being tested.
See the tutorial on building a workflow for details.
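As a rough illustration, the my_workflow module imported in the Custom examples below could define its own test sample, ground truth, and inference types and bind them together with kolena.workflow.define_workflow. This is only a sketch under the assumptions of the building-a-workflow tutorial (define_workflow, the Image test sample base, and the GroundTruth/Inference base classes); the class fields and workflow name are hypothetical.
from typing import List

from pydantic.dataclasses import dataclass

from kolena.workflow import GroundTruth as BaseGroundTruth
from kolena.workflow import Inference as BaseInference
from kolena.workflow import Image, define_workflow
from kolena.workflow.annotation import LabeledBoundingBox, ScoredLabeledBoundingBox

@dataclass(frozen=True)
class TestSample(Image):
    """Each test sample is an image, identified by its locator."""

@dataclass(frozen=True)
class GroundTruth(BaseGroundTruth):
    bboxes: List[LabeledBoundingBox]  # hypothetical field: annotated boxes for the image

@dataclass(frozen=True)
class Inference(BaseInference):
    bboxes: List[ScoredLabeledBoundingBox]  # hypothetical field: model detections for the image

# define_workflow binds the three types together and returns the workflow-specific
# TestCase, TestSuite, and Model classes that the Custom examples below import
_workflow, TestCase, TestSuite, Model = define_workflow(
    "Example Workflow", TestSample, GroundTruth, Inference
)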
In the Object Detection workflow, test cases may contain multiple object classes or may be scoped to a single class. Let's assume here that our metadata.csv file has the following format:
  • locator: location of the image in S3
  • label: label of the object described by this record's bounding box
  • min_x: x coordinate for top left corner of bounding box
  • min_y: y coordinate for top left corner of bounding box
  • max_x: x coordinate for bottom right corner of bounding box
  • max_y: y coordinate for bottom right corner of bounding box
There is one record in this table for each ground truth bounding box in the dataset, meaning a given locator may be present multiple times.
In this section we'll create the following test cases:
  1. Test case containing the entire dataset
  2. Test cases containing the entire dataset, broken down by class
  3. Test cases for each class, broken down by the number of ground truth bounding boxes in the image
First, some preliminaries:
from kolena.detection import TestImage, TestCase
from kolena.detection.ground_truth import BoundingBox
import pandas as pd
DATASET = "example-dataset"
df = pd.read_csv("metadata.csv")
1. Entire Dataset Test Case
all_images = [
    TestImage(locator, dataset=DATASET, ground_truths=[
        BoundingBox(record.label, (record.min_x, record.min_y), (record.max_x, record.max_y))
        for record in df_locator.itertuples()
    ])
    for locator, df_locator in df.groupby("locator")
]

complete_test_case = TestCase(
    f"complete {DATASET}",
    description=f"All images and all ground truths in {DATASET}",
    images=all_images,
)
2. Test Cases by Class
class_test_cases = [
    TestCase(
        f"{label} ({DATASET})",
        description=f"All images in {DATASET} with only {label} ground truths",
        images=[image.filter(lambda gt: gt.label == label) for image in all_images],
    )
    for label, df_label in df.groupby("label")
]
3. Test Cases by Class and Ground Truth Count
In this section, let's create test cases breaking down the dataset by class, and within a class, breaking down positive examples by the number of ground truth bounding boxes of that class.
ground_truth_count_buckets = [1, 2, 4, 1e9]
labels = set(df["label"])

ground_truth_count_test_cases = []
for label in labels:
    label_images = [image.filter(lambda gt: gt.label == label) for image in all_images]
    min_count = 0
    for max_count in ground_truth_count_buckets:
        test_case = TestCase(f"{label} ({min_count} < #gt <= {max_count}) ({DATASET})", images=[
            image for image in label_images
            if len(image.ground_truths) == 0  # include negative examples
            or min_count < len(image.ground_truths) <= max_count
        ])
        ground_truth_count_test_cases.append(test_case)
        min_count = max_count
In the Instance Segmentation workflow, test cases may contain multiple object classes or may be scoped to a single class. Let's assume here that our metadata.csv file has the following format:
  • locator: location of the image in S3
  • label: label of the object described by this record's segmentation mask
  • points: stringified list of [x, y] coordinates for the segmentation mask vertices
There is one record in this table for each ground truth segmentation mask in the dataset, meaning a given locator may be present multiple times.
In this section we'll create the following test cases:
  1. Test case containing the entire dataset
  2. Test cases containing the entire dataset, broken down by class
  3. Test cases for each class, broken down by the number of ground truth segmentation masks in the image
First, some preliminaries:
from kolena.detection import TestImage, TestCase
from kolena.detection.ground_truth import SegmentationMask
import pandas as pd
DATASET = "example-dataset"
df = pd.read_csv("metadata.csv")
1. Entire Dataset Test Case
import json
from typing import List, Tuple

# Converts points from the [[x1, y1], [x2, y2], ...] format in the csv file to
# an [(x1, y1), (x2, y2), ...] format.
def as_point_tuples(points: List[List[float]]) -> List[Tuple[float, float]]:
    return [(point[0], point[1]) for point in points]
all_images = [
    TestImage(locator, dataset=DATASET, ground_truths=[
        SegmentationMask(record.label, as_point_tuples(json.loads(record.points)))
        for record in df_locator.itertuples()
    ])
    for locator, df_locator in df.groupby("locator")
]

complete_test_case = TestCase(
    f"complete {DATASET}",
    description=f"All images and all ground truths in {DATASET}",
    images=all_images,
)
2. Test Cases by Class
class_test_cases = [
    TestCase(
        f"{label} ({DATASET})",
        description=f"All images in {DATASET} with only {label} ground truths",
        images=[image.filter(lambda gt: gt.label == label) for image in all_images],
    )
    for label, df_label in df.groupby("label")
]
3. Test Cases by Class and Ground Truth Count
In this section, let's create test cases breaking down the dataset by class, and within a class, breaking down positive examples by the number of ground truth segmentation masks of that class.
ground_truth_count_buckets = [1, 2, 4, 1e9]
labels = set(df["label"])

ground_truth_count_test_cases = []
for label in labels:
    label_images = [image.filter(lambda gt: gt.label == label) for image in all_images]
    min_count = 0
    for max_count in ground_truth_count_buckets:
        test_case = TestCase(f"{label} ({min_count} < #gt <= {max_count}) ({DATASET})", images=[
            image for image in label_images
            if len(image.ground_truths) == 0  # include negative examples
            or min_count < len(image.ground_truths) <= max_count
        ])
        ground_truth_count_test_cases.append(test_case)
        min_count = max_count
In the Classification workflow, test cases may contain multiple classes or may be scoped to a single class. Let's assume here that our metadata.csv file has the following format:
  • locator: location of the image in S3
  • label: classification label for this record's image
  • width: width of the image in pixels
  • height: height of the image in pixels
The dataset used here is a multi-class classification dataset, meaning that a single image belongs to exactly one of N classes.
In this section we'll create the following test cases:
  1. Test case containing the entire dataset
  2. Test cases containing the entire dataset, broken down by class
  3. Test cases for each class, broken down by image size
First, some preliminaries:
from kolena.classification import TestImage, TestCase
import pandas as pd
DATASET = "example-dataset"
df = pd.read_csv("metadata.csv")
1. Entire Dataset Test Case
all_images = [
    TestImage(record.locator, dataset=DATASET, labels=[record.label])
    for record in df.itertuples()
]

complete_test_case = TestCase(
    f"complete {DATASET}",
    description=f"All images of all classes in {DATASET}",
    images=all_images,
)
2. Test Cases by Class
class_test_cases = [
    TestCase(
        f"{label} ({DATASET})",
        description=f"All images in {DATASET} with only class {label} ground truths",
        images=[image.filter(lambda gt: gt == label) for image in all_images],
    )
    for label in set(df["label"])
]
3. Test Cases by Class and Image Size
In this section, let's create test cases breaking down the dataset by class, and within a class, breaking down positive examples by image dimensions.
sizes = [  # bucket images by pixel area
    ("xsmall", 10_000),
    ("small", 25_000),
    ("medium", 75_000),
    ("large", 250_000),
    ("xlarge", 1e9),
]
area_by_locator = {r.locator: r.width * r.height for r in df.itertuples()}

size_test_cases = []
for label in set(df["label"]):
    min_area = 0  # reset the lower bound of the area bucket for each class
    for size, max_area in sizes:
        images = [
            image.filter(lambda gt: gt == label) for image in all_images
            if min_area < area_by_locator[image.locator] <= max_area
        ]
        size_test_cases.append(TestCase(
            f"{label} (size: {size}, {min_area} < area <= {max_area}) ({DATASET})",
            images=images,
        ))
        min_area = max_area
In the Face Recognition (1:1) workflow, images first need to be registered before they can be used in test cases:
from kolena.fr import TestImages
import pandas as pd
DATASET = "example-dataset"
df_metadata = pd.read_csv("metadata.csv")
with TestImages.register() as registrar:
    for record in df_metadata.itertuples():
        tags = dict(age=record.age, race=record.race, gender=record.gender)
        registrar.add(record.locator, DATASET, record.width, record.height, tags=tags)
Here we're using the additional metadata columns age, race, and gender to attach tags to each image; these tags can be used elsewhere in the platform to sort and filter through these images.
These tags are purely optional, but attaching any available metadata as tags — even noisy, estimated annotations — lets the platform extract automated insights from your tests. This tag metadata can be updated at any time by re-registering images with TestImages.register.
With our images registered, it's time to create our first test cases! Let's create the following test cases from this dataset:
  1. Test case containing the entire dataset
  2. Test cases for each annotated (race, gender) pair
Test cases in the Face Recognition (1:1) workflow are defined as image pairs with a ground truth is_same value indicating if the two images contain the same person or different people. Let's assume we have a pairs.csv file defining these image pairs with columns locator_a, locator_b, is_same.
Let's start by creating the simple, full dataset test case:
from kolena.fr import TestCase
df_pair = pd.read_csv("pairs.csv")

# define simple, full dataset test case
complete_test_case = TestCase.create(f"complete {DATASET}", description=f"""
    All pairs defined in {DATASET} pairs.csv
""")

with complete_test_case.edit() as editor:
    for record in df_pair.itertuples():
        editor.add(record.locator_a, record.locator_b, record.is_same)
Now for the more complex case of demographic test cases:
import itertools

# define test cases broken down by (race, gender) metadata pairs
races = set(df_metadata["race"])
genders = set(df_metadata["gender"])
demographic_by_locator = {
    record.locator: (record.race, record.gender)
    for record in df_metadata.itertuples()
}

demographic_test_cases = []
for race, gender in itertools.product(races, genders):
    test_case = TestCase.create(f"{race}, {gender} ({DATASET})", description=f"""
        All pairs in {DATASET} with at least one member in
        demographic bucket (race, gender) ({race}, {gender})
    """)
    with test_case.edit() as editor:
        for record in df_pair.itertuples():
            if demographic_by_locator[record.locator_a] == (race, gender) \
                    or demographic_by_locator[record.locator_b] == (race, gender):
                editor.add(record.locator_a, record.locator_b, record.is_same)
    demographic_test_cases.append(test_case)
We now have a complete_test_case representing the entire dataset and demographic_test_cases with pairs broken down by (race, gender) demographic buckets.
Have additional metadata associated with an image? Attach it as metadata when creating TestImage objects to access this data when exploring results, running tests, and using the Studio to define new tests.
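For example, any extra columns in metadata.csv can be passed straight through when constructing images. Below is a minimal sketch using the Classification TestImage, assuming a metadata dictionary argument as described above; the city and camera_id columns are hypothetical.
from kolena.classification import TestImage

images_with_metadata = [
    TestImage(
        record.locator,
        dataset=DATASET,
        labels=[record.label],
        metadata=dict(city=record.city, camera_id=record.camera_id),  # hypothetical extra columns
    )
    for record in df.itertuples()
]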
In the above example we created only simple test cases, but more advanced test cases can be generated in a variety of ways:
  • Using explicitly annotated image tags, such as demographics labeled by a human annotation team or by additional classifier models
  • Using image characteristics, such as image resolution, brightness, etc.
  • By augmenting (or "perturbing") one or both of the images in the pair, applying functions such as resize, grayscale, targeted feature cutout, rotate, etc.
  • Using the Kolena Studio to search through your datasets for similar images and assemble fine-grained test cases for specific scenarios
  • Automatically deriving test cases from the results of previous models, e.g. creating a test case for all false matches from each model released into production to track specific performance improvements and regressions
  • And more! These are just a few of the ways test cases can be created; use them as a starting point for tests that help you better understand your models' behavior in the ways that matter to you.

Creating Test Suites

Test results in Kolena are always viewed within the context of a test suite. Think of a test suite as the way to group together related test cases for convenience.
With our test cases loaded into a list test_cases, creating a test suite is simple:
Custom
Object Detection / Instance Segmentation
Classification
Face Recognition
from my_workflow import TestSuite
test_suite = TestSuite(
    "example-test-suite",
    description="created in example documentation",
    test_cases=test_cases,
)
from kolena.detection import TestSuite
test_suite = TestSuite(
    "example-test-suite",
    description="created in example documentation",
    test_cases=test_cases,
)
from kolena.classification import TestSuite
test_suite = TestSuite(
    "example-test-suite",
    description="created in example documentation",
    test_cases=test_cases,
)
from kolena.fr import TestSuite
test_suite = TestSuite.create(
    "example-test-suite",
    description="created in example documentation",
)

with test_suite.edit() as editor:
    # FR test suites must contain at least one 'baseline' test case
    editor.add(test_cases[0], is_baseline=True)
    for test_case in test_cases[1:]:
        editor.add(test_case)
The test suite baseline concept allows flexibility in deciding which population is used to compute similarity score thresholds. See Face Recognition (1:1) documentation for details.

Editing Tests

All tests in Kolena are versioned and can be edited as desired. TestCase and TestSuite objects expose an edit method to add, remove, and update contents in a Python with context.
with TestSuite("X").edit() as editor:
    editor.add(TestCase("A"))  # add test case A to the suite
    editor.remove(TestCase("B"))  # remove test case B from the suite
    editor.merge(TestCase("C", version=5))  # update test case C to a specific version
No dataset is perfect — a dataset can always be cleaner, larger, and more diverse. We encourage the continual curation of your tests in the Kolena platform. A few seconds of effort here and there can greatly improve test set quality over time.
When editing tests, don't worry about losing the ability to compare apples to apples. In the web platform, two models tested on different versions of the same tests can be compared using only the examples they've both been tested on.

Editing Test Cases

Test cases can be edited to add, remove, or update the images they contain; a short sketch of such an edit follows the list below. Common reasons to edit a test case include:
  • Removing noisy images
  • Correcting ground truths
  • Adding newly labeled images
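For instance, a noisy image can be dropped and a corrected annotation merged in within a single edit session. Below is a minimal sketch using the kolena.detection editor, assuming it exposes remove alongside add and merge; the locators and the corrected box are hypothetical.
from kolena.detection import TestCase, TestImage
from kolena.detection.ground_truth import BoundingBox

with TestCase("A").edit() as editor:
    # drop an image whose annotations turned out to be unusable (hypothetical locator)
    editor.remove(TestImage("s3://bucket/noisy.jpg", dataset="example-dataset"))
    # merge in an image with corrected ground truths (hypothetical locator and box)
    editor.merge(TestImage(
        "s3://bucket/mislabeled.jpg",
        dataset="example-dataset",
        ground_truths=[BoundingBox("car", (10, 20), (110, 220))],
    ))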
To illustrate how to edit test cases, let's add each of the examples from test case B to test case A:
Custom
Object Detection / Instance Segmentation
Classification
Face Recognition
from my_workflow import TestCase
# load previously created test cases
test_case_a = TestCase.load("A")
test_case_b = TestCase.load("B")
with test_case_a.edit() as editor:
    for test_sample, ground_truth in test_case_b.iter_test_samples():
        editor.add(test_sample, ground_truth)
from kolena.detection import TestCase
# load previously created test cases
test_case_a = TestCase("A")
test_case_b = TestCase("B")
with test_case_a.edit() as editor:
    for image in test_case_b.iter_images():
        editor.merge(image)
from kolena.classification import TestCase
# load previously created test cases
test_case_a = TestCase("A")
test_case_b = TestCase("B")
with test_case_a.edit() as editor:
    for image in test_case_b.iter_images():
        editor.merge(image)
from kolena.fr import TestCase
test_case_a = TestCase.load_by_name("A")
test_case_b = TestCase.load_by_name("B")
with test_case_a.edit() as editor:
    for record in test_case_b.iter_data():
        editor.add(record.locator_a, record.locator_b, record.is_same)
After editing, test case A's version is automatically incremented. Any time test case A is loaded, the most recent version is returned, unless otherwise specified by providing version=N.
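A quick illustration using the detection API (the version number here is arbitrary):
from kolena.detection import TestCase

test_case_a_latest = TestCase("A")             # loads the most recent version of "A"
test_case_a_pinned = TestCase("A", version=2)  # loads a specific prior version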
Test suites containing an updated test case are not automatically updated. In the next section, we'll learn how to update test suites to increment test case versions, add, and remove test cases.

Editing Test Suites

Editing a test suite involves adding, removing, or updating the test cases it holds.
To illustrate how to edit a test suite, let's update test suite X to use the new version of test case A created in the above section, and remove the now-superfluous test case B.
Custom
Object Detection / Instance Segmentation
Classification
Face Recognition
from my_workflow import TestCase, TestSuite
with TestSuite("X").edit() as editor:
    editor.add(TestCase("A"))  # update test case "A" to latest version
    editor.remove(TestCase("B"))
from kolena.detection import TestCase, TestSuite
with TestSuite("X").edit() as editor:
    editor.merge(TestCase("A"))  # update test case "A" to latest version
    editor.remove(TestCase("B"))
from kolena.classification import TestCase, TestSuite
with TestSuite("X").edit() as editor:
    editor.merge(TestCase("A"))  # update test case "A" to latest version
    editor.remove(TestCase("B"))
from kolena.fr import TestCase, TestSuite
with TestSuite.load_by_name("X").edit() as editor:
    editor.merge(TestCase.load_by_name("A"))  # update test case "A" to latest version
    editor.remove(TestCase.load_by_name("B"))
Test suite X now has the newest version of test case A and no longer includes test case B.