
Testing in Kolena is broken down by the type of ML problem you're solving, called a workflow. Any ML problem that can be tested can be modeled as a workflow in Kolena.

Examples of workflows include:

With the kolena.workflow client module, any arbitrary ML problem can be defined as a workflow and tested on Kolena.

There are three main components of a workflow:

  1. Test Sample: the inputs to a model, e.g. image, video, document
  2. Ground Truth: the expected model outputs
  3. Inference: the actual model outputs

These three types can be thought of as the data model, or the schema, of a workflow.

Test Sample#

In Kolena, "test sample" is the general term for the input to a model.

For standard computer vision (CV) models, the test sample is often a single image. Video-based computer vision models would have a video test sample type, and stereo vision models would use image pairs. For natural language processing models, the test sample may be a document or text snippet.

When building a workflow, you can extend and compose these base test sample types as necessary, or use the base types directly if no customization is required.


Any additional information associated with a test sample, e.g. details about how it was collected, can be included as metadata. We recommend uploading any and all metadata that you have available, as metadata can be useful for searching through data in the Studio, interpreting model results, and creating new test cases.

from dataclasses import dataclass, field

from kolena.workflow import Document, Metadata

@dataclass(frozen=True)
class MyDocument(Document):
    # locator: str  # inherited from parent Document
    doc_id: int  # example of a field that is explicitly required
    metadata: Metadata = field(default_factory=dict)  # free-form, optional metadata
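
As a rough illustration of how free-form metadata can drive new test cases, the sketch below groups samples by one metadata key. It uses only the standard library: the `samples` records and the `group_by_metadata` helper are hypothetical and are not part of the Kolena client.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# hypothetical records: (locator, free-form metadata) pairs
samples: List[Tuple[str, Dict[str, Any]]] = [
    ("s3://bucket/a.txt", {"source": "web", "language": "en"}),
    ("s3://bucket/b.txt", {"source": "scan", "language": "en"}),
    ("s3://bucket/c.txt", {"source": "web", "language": "de"}),
]


def group_by_metadata(
    records: List[Tuple[str, Dict[str, Any]]], key: str
) -> Dict[Any, List[str]]:
    """Group sample locators by the value of one metadata key."""
    groups: Dict[Any, List[str]] = defaultdict(list)
    for locator, metadata in records:
        # samples that lack the key are grouped under None rather than dropped
        groups[metadata.get(key)].append(locator)
    return dict(groups)


by_source = group_by_metadata(samples, "source")
print(by_source)
```

Each resulting group is a candidate test case, e.g. one for web documents and one for scanned documents.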

Use pydantic dataclasses

When building a workflow, object definitions can use standard library dataclasses or Pydantic dataclasses. Pydantic adds runtime type validation and coercion, and can be used as a drop-in replacement for standard library dataclasses.
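
The difference can be seen in a small sketch that does not depend on Kolena. `PlainDocument` and `ValidatedDocument` are hypothetical stand-ins for a test sample type. The Pydantic half only runs if Pydantic is installed, and it assumes Pydantic's default (non-strict) settings, under which a numeric string is coerced to an `int`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlainDocument:
    # hypothetical stand-in for a test sample type; not a Kolena class
    locator: str
    doc_id: int


# The standard library records the annotation but never checks it,
# so a string is stored where an int was declared.
plain = PlainDocument(locator="s3://bucket/doc.txt", doc_id="7")
print(type(plain.doc_id).__name__)

try:
    from pydantic.dataclasses import dataclass as pydantic_dataclass
except ImportError:
    pydantic_dataclass = None  # Pydantic not installed; skip the comparison

if pydantic_dataclass is not None:

    @pydantic_dataclass(frozen=True)
    class ValidatedDocument:
        locator: str
        doc_id: int

    # Pydantic validates on construction and coerces compatible values.
    validated = ValidatedDocument(locator="s3://bucket/doc.txt", doc_id="7")
    print(type(validated.doc_id).__name__)
```

With the standard library alone, a mistyped field is only discovered later, when something downstream tries to use it.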

Composite Test Samples#

Kolena is not prescriptive about the shape of your ML problem. Test samples can be composed, using the Composite test sample type, to mirror the shape of your problem directly.

Consider the example of an autonomous vehicle application that uses four cameras, one for each of the front, right, rear, and left views:

from dataclasses import dataclass

from kolena.workflow import Composite, Image

@dataclass(frozen=True)
class QuadImage(Composite):
    front: Image
    right: Image
    rear: Image
    left: Image

How can I specify annotations on Composite test samples?

Image-level (or video-level, document-level, etc.) annotations can be specified when using composite test samples. To specify image-level objects in each of the four images, ground truth or inference definitions may look like this:

from dataclasses import dataclass
from typing import List

from kolena.workflow import DataObject, GroundTruth
from kolena.workflow.annotation import BoundingBox

@dataclass(frozen=True)
class SingleImageGroundTruth(DataObject):
    objects: List[BoundingBox]

@dataclass(frozen=True)
class QuadImageGroundTruth(GroundTruth):
    # attribute names match the attribute names in the test sample
    front: SingleImageGroundTruth
    right: SingleImageGroundTruth
    rear: SingleImageGroundTruth
    left: SingleImageGroundTruth
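
Because the pairing relies on attribute names lining up, a quick local check can catch a misspelled field. The sketch below uses plain standard library dataclasses as stand-ins for the Kolena types. `QuadSample`, `QuadTruth`, and `unmatched_fields` are hypothetical names, and whether Kolena performs a similar check itself is not covered here.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    # stand-in for a bounding box annotation
    top_left: Tuple[float, float]
    bottom_right: Tuple[float, float]


@dataclass(frozen=True)
class PerImageTruth:
    objects: List[Box]


@dataclass(frozen=True)
class QuadSample:
    # stand-in for the composite test sample; each field holds an image locator
    front: str
    right: str
    rear: str
    left: str


@dataclass(frozen=True)
class QuadTruth:
    front: PerImageTruth
    right: PerImageTruth
    rear: PerImageTruth
    left: PerImageTruth


def unmatched_fields(sample_type: type, truth_type: type) -> List[str]:
    """Return ground truth field names that have no same-named test sample field."""
    sample_names = {f.name for f in fields(sample_type)}
    return sorted(f.name for f in fields(truth_type) if f.name not in sample_names)


print(unmatched_fields(QuadSample, QuadTruth))  # prints []
```

A non-empty result is not necessarily an error, since a ground truth may also carry fields that apply to the composite as a whole, but it is worth a second look.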

Ground Truth#

The ground truth represents the expected output from a model when provided with a test sample. Ground truths are often manually annotated and are used to determine the correctness of model predictions.

In the Studio, ground truths are always displayed alongside their paired test samples. Any annotations, such as bounding boxes or polygons, are visualized on top of the test sample.

The contents of a ground truth are driven by the requirements of the workflow. Take this example for a multiclass object detection workflow:

from dataclasses import dataclass
from typing import List

from kolena.workflow import GroundTruth
from kolena.workflow.annotation import LabeledBoundingBox

@dataclass(frozen=True)
class MyGroundTruth(GroundTruth):
    objects: List[LabeledBoundingBox]

Where should additional information that isn't used for model evaluation live?

We recommend scoping the ground truth to only the data required for model evaluation. Any additional metadata, annotations, or assets associated with a test sample can be included as a part of the test sample itself or in its free-form metadata.

However, it isn't a strict requirement that ground truths only contain information used for model evaluation. Sometimes it makes sense to include additional information as optional fields inside a ground truth definition.


Inference#

A workflow's inference type contains the actual output produced by a model when given a test sample. Inferences are also referred to as "raw inferences," as they represent the raw output from a model.

The inference type and ground truth type for a workflow will often look very similar to one another.

Extending Annotation Types#

Annotation types can be extended to include additional fields, when necessary.

Consider the example of a Keypoints detection model that detects anywhere from 0 to N keypoints arrays when provided an image. Each keypoints array has an associated class label and confidence value. This model's inference type could be defined as follows:

from dataclasses import dataclass
from typing import List

from kolena.workflow import Inference
from kolena.workflow.annotation import Keypoints

@dataclass(frozen=True)
class ScoredLabeledKeypoints(Keypoints):
    # points: List[Tuple[float, float]]  # inherited from Keypoints
    score: float  # confidence score, between 0 and 1
    label: str  # predicted class

@dataclass(frozen=True)
class MyInference(Inference):
    predictions: List[ScoredLabeledKeypoints]


Models are treated as deterministic transformations from test samples to inferences. This means that, when testing in Kolena, a given model only needs to process a given test sample once. Kolena uses this to speed up testing: when a test sample belongs to multiple test cases, no compute is spent processing it more than once.

When calling test, only samples that do not already have inferences uploaded from the given model will be processed. To change this behavior and re-process all test samples, regardless of any uploaded inferences, use the reset flag:

from kolena.workflow import test

# all test samples are processed and inferences [re]uploaded when reset=True
test(model, test_suite, evaluator, reset=True)
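
The selection logic described above can be sketched in a few lines. This is an illustration of the idea only, not Kolena's implementation: `samples_to_process`, the `suite` dictionary, and the sample identifiers are all hypothetical.

```python
from typing import Dict, List, Set


def samples_to_process(
    test_cases: Dict[str, List[str]],
    already_inferred: Set[str],
    reset: bool = False,
) -> List[str]:
    """Pick the samples a model still has to run on for a suite of test cases."""
    # dict.fromkeys de-duplicates while preserving first-seen order, so a sample
    # shared by several test cases is only listed once
    unique = list(
        dict.fromkeys(s for samples in test_cases.values() for s in samples)
    )
    if reset:
        # reset=True ignores previously uploaded inferences
        return unique
    return [s for s in unique if s not in already_inferred]


suite = {"daytime": ["img1", "img2"], "rain": ["img2", "img3"]}
done = {"img1"}  # samples this model has already produced inferences for

print(samples_to_process(suite, done))  # ['img2', 'img3']
print(samples_to_process(suite, done, reset=True))  # ['img1', 'img2', 'img3']
```

Here `img2` appears in both test cases but is processed once, and `img1` is skipped unless `reset=True`.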

Defining a Workflow#

With test sample, ground truth, and inference types declared, defining a workflow provides the TestCase, TestSuite, and Model definitions to use when creating tests and testing models with this workflow:

from kolena.workflow import define_workflow

from my_workflow import MyTestSample, MyGroundTruth, MyInference

_, TestCase, TestSuite, Model = define_workflow(
    "My Example Workflow",