Testing Models
Testing your models on Kolena is a simple process: load the images in your dataset, perform inference, and upload the inferences to the platform.
In Kolena, a model is a deterministic transformation from inputs to outputs. During testing, each image being tested is surfaced to your model exactly once. Regardless of how many test cases or test suites a given image belongs to, you only have to perform inference and upload results a single time.
kolena-client provides a test method for each workflow to handle the entire testing process. The test method requires an InferenceModel instance with method(s) implementing that model's deterministic transformation from inputs to outputs. With a model defined that implements the necessary method(s) to perform inference, testing is as simple as the following per-workflow examples.

Custom
For this example, let's assume our model is abstracted away to the my_model.infer method.

import my_model
from my_workflow import Model

model = Model("example-model", infer=my_model.infer, metadata=dict(
    description="any free-form metadata can be included in this dictionary",
))
Testing requires an Evaluator implementing the metrics computation specific to your workflow.

from my_workflow import MyEvaluator, MyEvaluatorConfiguration

evaluator = MyEvaluator(configurations=[
    MyEvaluatorConfiguration(example="a"),
    MyEvaluatorConfiguration(example="b"),
])
With our model and evaluator defined, testing is performed using the test method:

from my_workflow import TestSuite
from kolena.workflow import test

# test our model on previously-created test suite 'A'
test(model, TestSuite("A"), evaluator)
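For reference, a my_model.infer implementation might look like the following minimal sketch — TestSample and Inference here are assumed to be the types defined in my_workflow, and the locator field is illustrative:

# my_model.py -- a hypothetical implementation for illustration
from my_workflow import Inference, TestSample  # assumed workflow definitions

def infer(test_sample: TestSample) -> Inference:
    """Transform a TestSample into an Inference"""
    # Step 1: load the input referenced by `test_sample` (e.g. a locator field)
    # Step 2: perform inference
    # Step 3: wrap the raw model outputs in the workflow's Inference type and return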
Object Detection

from typing import List

from kolena.detection import InferenceModel, test, TestImage, TestSuite
from kolena.detection.inference import BoundingBox

def infer(test_image: TestImage) -> List[BoundingBox]:
    """Transform a TestImage into a list of BoundingBox inferences"""
    # Step 1: load image at `test_image.locator`
    # Step 2: perform inference
    # Step 3: transform inferences into BoundingBox objects and return

model = InferenceModel("example-detection-model", infer=infer)
test_suites = [TestSuite("A"), TestSuite("B")]
test(model, *test_suites)
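As an illustration of Step 3, raw detections might be wrapped like this — note that the exact BoundingBox constructor arguments shown are an assumption; consult the kolena.detection.inference API reference for the exact signature:

from kolena.detection.inference import BoundingBox

# hypothetical raw model output: (label, confidence, x_min, y_min, x_max, y_max)
def to_bounding_box(raw) -> BoundingBox:
    label, confidence, x_min, y_min, x_max, y_max = raw
    # constructor arguments here are an assumption -- check the API reference
    return BoundingBox(
        label=label,
        confidence=confidence,
        top_left=(x_min, y_min),
        bottom_right=(x_max, y_max),
    )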
Instance Segmentation

from typing import List

from kolena.detection import InferenceModel, test, TestImage, TestSuite
from kolena.detection.inference import SegmentationMask

def infer(test_image: TestImage) -> List[SegmentationMask]:
    """Transform a TestImage into a list of SegmentationMask inferences"""
    # Step 1: load image at `test_image.locator`
    # Step 2: perform inference
    # Step 3: transform inferences into SegmentationMask objects and return

model = InferenceModel("example-detection-model", infer=infer)
test_suites = [TestSuite("A"), TestSuite("B")]
test(model, *test_suites)
Classification

from typing import List, Tuple

from kolena.classification import InferenceModel, test, TestImage, TestSuite

def infer(test_image: TestImage) -> List[Tuple[str, float]]:
    """Transform a TestImage into a list of (label, confidence) inferences"""
    # Step 1: load image at `test_image.locator`
    # Step 2: perform inference and return

model = InferenceModel("example-classification-model", infer=infer)
test_suites = [TestSuite("A"), TestSuite("B")]
test(model, *test_suites)
Face Recognition (1:1)

from typing import Optional

import numpy as np

from kolena.fr import InferenceModel, test, TestSuite

def extract(locator: str) -> Optional[np.ndarray]:
    """Extract an embedding representing the face in the image"""
    # Step 1: load image at `locator`
    # Step 2: run model pipeline -- detect, align, and extract
    # Step 3: return extracted embedding, or None if no face was detected

def compare(embedding_a: np.ndarray, embedding_b: np.ndarray) -> float:
    """Compare two embeddings and generate a similarity score"""

model = InferenceModel("example-fr11-model", extract=extract, compare=compare)
test_suites = [TestSuite.load_by_name("A")]
test(model, *test_suites)
We recommend using this simplified test interface to start, and moving to the detailed TestRun interface later on if necessary.

Each workflow exports a TestRun object to provide more control over the flow of data during the testing process. When testing with TestRun, a normal Model object can be created without any infer implementation.

Object Detection
from kolena.detection import Model

model = Model("example-detection-model", metadata=dict(
    description="simple model descriptor (note no `infer` method is necessary)",
))
For this example, let's assume our model is abstracted away to the my_code.infer method that transforms a TestImage to a list of BoundingBox inferences.

from kolena.detection import TestRun, TestSuite
from my_code import infer  # model implementation

with TestRun(model, TestSuite("A"), TestSuite("B")) as test_run:
    # perform any batching, parallelization, etc. desired here
    for test_image in test_run.iter_images():
        test_run.add_inferences(test_image, infer(test_image))
Instance Segmentation

from kolena.detection import Model

model = Model("example-detection-model", metadata=dict(
    description="simple model descriptor (note no `infer` method is necessary)",
))
For this example, let's assume our model is abstracted away to the my_code.infer method that transforms a TestImage to a list of SegmentationMask inferences.

from kolena.detection import TestRun, TestSuite
from my_code import infer  # model implementation

with TestRun(model, TestSuite("A"), TestSuite("B")) as test_run:
    # perform any batching, parallelization, etc. desired here
    for test_image in test_run.iter_images():
        test_run.add_inferences(test_image, infer(test_image))
Classification

from kolena.classification import Model

model = Model("example-classification-model", metadata=dict(
    description="simple model descriptor (note no `infer` method is necessary)",
))
For this example, let's assume our model is abstracted away to the my_code.infer method that transforms a TestImage to a list of (label, confidence) inferences.

from kolena.classification import TestRun, TestSuite
from my_code import infer  # model implementation

with TestRun(model, TestSuite("A"), TestSuite("B")) as test_run:
    # perform any batching, parallelization, etc. desired here
    for test_image in test_run.iter_images():
        test_run.add_inferences(test_image, infer(test_image))
Face Recognition (1:1)

Testing for the face recognition 1:1 workflow is divided into two phases:

1. Extracting embeddings: each image in the test suite is surfaced. During this phase, your model detects bounding boxes surrounding faces in each image, estimates landmarks corresponding to the eyes, nose, and mouth of these faces, and extracts the embedding vectors used in the next phase to compute similarity scores between images.
2. Computing similarities: during this phase, the embeddings extracted in the previous phase are used to compute float similarity scores for each image pair defined in the test suite. These similarity scores are used to compute the metrics reported in the platform.
from kolena.fr import Model

model = Model("example-fr11-model", metadata=dict(
    description="simple model descriptor (note that no `extract` or `compare` methods are necessary)",
))
Next, let's start testing our model by creating a TestRun object:

from kolena.fr import TestRun, TestSuite

test_run = TestRun(model, TestSuite.load_by_name("A"))
If we have a model my_model with an extract method that takes a locator and produces an embedding, the embeddings extraction phase goes as follows:

import pandas as pd

import my_model  # model implementation providing `extract` and `similarity`

# load a dataframe of all images in the test suite(s)
# can also load batches by providing a `batch_size`
df_image = test_run.load_remaining_images()

# compute embeddings for each image and assemble the results dataframe
df_result = pd.DataFrame([
    (record.image_id, my_model.extract(record.locator))
    for record in df_image.itertuples()
], columns=["image_id", "embedding"])

# additional columns that may be populated as desired to provide extra
# debugging information that is visible in the web platform gallery
df_result[["bounding_box", "landmarks_input_image", "landmarks",
           "quality_input_image", "quality", "acceptability",
           "fr_input_image", "failure_reason"]] = None

test_run.upload_image_results(df_result)
The schema for each pd.DataFrame used during testing can be found at kolena.fr.datatypes.
Once we have extracted and uploaded embeddings for all images, we can complete our test run by computing similarities for all image pairs in this test suite:

# load embeddings extracted in the previous step and the image pairs
# defined in this test suite
df_embedding, df_pair = test_run.load_remaining_pairs()

# reindex embeddings in a dictionary for easier access
embs = {r.image_id: r.embedding for r in df_embedding.itertuples()}

# compute similarity scores for all pairs
df_pair["similarity"] = [
    my_model.similarity(embs[r.image_a_id], embs[r.image_b_id])
    for r in df_pair.itertuples()
]

test_run.upload_pair_results(df_pair[["image_pair_id", "similarity"]])
Reasons to use TestRun instead of test include (but are not limited to):

- Parallelization: loading all test examples upfront and dividing the inference tasks between multiple workers operating in parallel (see the sketch below)
- Batching: performing inference on multiple images at once instead of processing one-at-a-time
- Uploading additional information: certain models produce output metadata in addition to inferences. TestRun provides various hooks to upload this additional data to Kolena during testing
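For instance, here is a minimal parallelization sketch using the detection workflow — the my_code.infer function (assumed to be picklable) and the pool size are illustrative assumptions:

from multiprocessing import Pool

from kolena.detection import Model, TestRun, TestSuite
from my_code import infer  # assumed picklable: maps a TestImage to its inferences

model = Model("example-detection-model")
with TestRun(model, TestSuite("A")) as test_run:
    test_images = list(test_run.iter_images())  # load all test examples upfront
    with Pool(processes=4) as pool:
        # divide the inference work between worker processes
        all_inferences = pool.map(infer, test_images)
    for test_image, inferences in zip(test_images, all_inferences):
        test_run.add_inferences(test_image, inferences)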
Since you own your bucket and its layout, we recommend using a standard format such as s3://<bucket>/<dataset>/<relative-path>. With this format, during testing, if <relative-path> exists locally, you can parse the locator and load directly from disk rather than fetching from your remote bucket. Keeping a local mirror of your data mounted onto the machine(s) that run your tests helps to reduce network overhead during testing and speeds things up greatly.
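For example, a simple locator-resolution helper along these lines can be used inside your infer implementation — the mirror root path and the s3:// parsing here are assumptions for illustration:

import os

LOCAL_MIRROR_ROOT = "/data/mirror"  # hypothetical local mount of the bucket

def resolve_locator(locator: str) -> str:
    """Return a local path for a locator when mirrored, else the locator itself."""
    prefix = "s3://"
    if locator.startswith(prefix):
        # locator format assumption: s3://<bucket>/<dataset>/<relative-path>
        _bucket, _, relative_path = locator[len(prefix):].partition("/")
        local_path = os.path.join(LOCAL_MIRROR_ROOT, relative_path)
        if os.path.exists(local_path):
            return local_path  # load directly from disk
    return locator  # fall back to fetching from the remote bucket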
Use the TestRun interface to implement your own control flow during testing. For example:

with TestRun(model, test_suite) as test_run:
    test_images = test_run.load_images()  # load all images upfront
    for i in range(0, len(test_images), MyModel.BATCH_SIZE):
        batch = test_images[i:i + MyModel.BATCH_SIZE]
        inferences = MyModel.infer(batch)
        for test_image, inference in zip(batch, inferences):
            test_run.add_inferences(test_image, inference)
If I'm using a built-in workflow, can I use custom evaluation criteria to compute metrics for my model?

For built-in workflows, both the test and TestRun interfaces accept an optional argument custom_metrics_callback for this purpose. The callback is passed all inferences for each test case and should return a dictionary with metric name as key and metric value as value. Using the built-in object detection workflow as an example:
from typing import List, Tuple, Optional

from kolena.detection import CustomMetrics, TestImage, test
from kolena.detection.inference import BoundingBox

def custom_metrics(
    inferences: List[Tuple[TestImage, Optional[List[BoundingBox]]]]
) -> CustomMetrics:
    # replace with actual evaluation
    return {"my_metric": 0.5}

test(model, test_suite, custom_metrics_callback=custom_metrics)