Quickstart: Building a Workflow

In this quickstart tutorial we'll learn how to use the kolena.workflow workflow builder definitions to test a Keypoints Detection model on the 300-W facial keypoints dataset. This demonstration will show us how we can build a workflow to test any arbitrary ML problem on Kolena.

Getting Started

With the kolena-client Python client installed, first let's initialize a client session:
import os
import kolena
kolena.initialize(os.environ["KOLENA_TOKEN"], verbose=True)
The data used in this tutorial is publicly available in the kolena-public-datasets S3 bucket in a metadata.csv file:
import pandas as pd
DATASET = "300-W"
BUCKET = "s3://kolena-public-datasets"
df = pd.read_csv(f"{BUCKET}/{DATASET}/meta/metadata.csv")
To load CSVs directly from S3, make sure to install the s3fs Python module (pip3 install s3fs[boto3]) and set up AWS credentials.
This metadata.csv file describes a keypoints detection dataset with the following columns:
  • locator: location of the image in S3
  • normalization_factor: per-image factor used to normalize the keypoint error, so that errors are comparable across images of different sizes. Common choices include the Euclidean distance between two reference points or the length of the image diagonal.
  • points: stringified list of coordinates corresponding to the [x, y] coordinates of the keypoints ground truths
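As a rough illustration of the two normalization techniques mentioned above, the sketch below computes both candidates for a single image. The function names and the sample coordinates are hypothetical; they are not part of the dataset or of the Kolena API.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def interocular_distance(left_eye: Point, right_eye: Point) -> float:
    """Euclidean distance between two reference points, e.g. the outer eye corners."""
    return math.dist(left_eye, right_eye)


def image_diagonal(width: int, height: int) -> float:
    """Length of the image diagonal in pixels."""
    return math.hypot(width, height)


# Hypothetical values for illustration only.
eye_factor = interocular_distance((130.0, 200.0), (250.0, 200.0))
diagonal_factor = image_diagonal(640, 480)
print(eye_factor, diagonal_factor)
```

Either value could serve as the normalization_factor for an image, as long as the same technique is used consistently across the dataset.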
Each locator is present exactly one time and contains the keypoints ground truth for that image. In this tutorial, we're implementing our workflow with support for only a single keypoints instance per image, but we could easily adapt our ground truth, inference, and metrics types to accommodate a variable number of keypoints arrays per image.
For brevity, the 300-W dataset has been pared down to only 5 keypoints: outermost corner of each eye, bottom of nose, and corners of the mouth.
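To make the file layout concrete, here is a minimal sketch that parses rows shaped like the ones described above using only the standard library. The two sample rows and their locators are made up, and the sketch assumes the points column holds JSON-compatible text.

```python
import csv
import io
import json

# Hypothetical two-row excerpt in the same shape as metadata.csv.
SAMPLE_CSV = '''locator,normalization_factor,points
s3://example-bucket/a.png,120.5,"[[10.0, 20.0], [30.0, 20.0], [20.0, 35.0], [12.0, 50.0], [28.0, 50.0]]"
s3://example-bucket/b.png,98.0,"[[11.0, 21.0], [31.0, 21.0], [21.0, 36.0], [13.0, 51.0], [29.0, 51.0]]"
'''

records = []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    # The points column is a stringified list of [x, y] pairs.
    points = [tuple(p) for p in json.loads(row["points"])]
    records.append((row["locator"], float(row["normalization_factor"]), points))

# Each locator should appear exactly once, with five (x, y) keypoints.
locators = [locator for locator, _, _ in records]
assert len(locators) == len(set(locators))
assert all(len(points) == 5 for _, _, points in records)
```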

Step 1: Defining Data Types

When building your own workflow you have control over the Test Sample (e.g. image), Ground Truth (e.g. 5-element facial keypoints array), and Inference types used in your project.

Test Sample Type

For the purposes of this tutorial, let's assume our model takes a single image as input along with an optional bounding box around the face in question, produced by an upstream model in our pipeline. We can import and extend kolena.workflow.Image for this purpose:
from dataclasses import dataclass
from typing import Optional

from kolena.workflow import Image
from kolena.workflow.annotation import BoundingBox

@dataclass(frozen=True)
class TestSample(Image):
    bbox: Optional[BoundingBox] = None

Ground Truth and Inference Types

Next, let's define our ground truth and inference types. Note that our model produces not only a keypoints array, but an associated confidence value that we may use to ignore low-confidence predictions:
from kolena.workflow import GroundTruth as GT, Inference as Inf
from kolena.workflow.annotation import Keypoints

@dataclass(frozen=True)
class GroundTruth(GT):
    keypoints: Keypoints
    # In order to compute normalized error, some normalization factor describing
    # the size of the face in the image is required.
    normalization_factor: float

@dataclass(frozen=True)
class Inference(Inf):
    keypoints: Keypoints
    confidence: float
Finally, let's declare our workflow object with the types we've just defined:
from kolena.workflow import define_workflow

# use these TestCase, TestSuite, and Model definitions to create and run tests
wf, TestCase, TestSuite, Model = define_workflow(
    "Facial Keypoints", TestSample, GroundTruth, Inference
)

Step 2: Defining Metrics

With our core data types defined, the next step is to lay out our evaluation criteria: our metrics.
Kolena exposes three levels of metrics when building a workflow:

Test Sample Metrics

Test Sample Metrics are metrics computed from a single test sample and its associated ground truths and inferences.
For the keypoints detection workflow, an example metric may be normalized mean error (NME), the normalized distance between the ground truth and inference keypoints.
from kolena.workflow import MetricsTestSample

@dataclass(frozen=True)
class TestSampleMetrics(MetricsTestSample):
    normalized_mean_error: float
    # If the normalized mean error is above some configured threshold, this test
    # sample is considered an "alignment failure".
    alignment_failure: bool
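For reference, one common formulation of NME is the mean Euclidean distance between matching keypoints divided by the normalization factor. The sketch below implements that formulation in plain Python; the helper name, the sample points, and the 0.04 threshold are assumptions for illustration and are not taken from the Kolena API.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def normalized_mean_error(
    gt_points: Sequence[Point],
    inf_points: Sequence[Point],
    normalization_factor: float,
) -> float:
    """Mean Euclidean distance between matching keypoints, divided by the factor."""
    if len(gt_points) != len(inf_points) or not gt_points:
        raise ValueError("expected two non-empty keypoint lists of equal length")
    distances = [math.dist(gt, inf) for gt, inf in zip(gt_points, inf_points)]
    return sum(distances) / len(distances) / normalization_factor


# Hypothetical example: every predicted point is offset by (3, 4), i.e. 5 pixels.
gt = [(10.0, 10.0), (50.0, 10.0), (30.0, 30.0)]
pred = [(x + 3.0, y + 4.0) for x, y in gt]
nme = normalized_mean_error(gt, pred, normalization_factor=100.0)
alignment_failure = nme > 0.04  # threshold chosen arbitrarily for the example
```

The two resulting values correspond to the normalized_mean_error and alignment_failure fields of the test sample metrics.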

Test Case Metrics

Test Case metrics are aggregate metrics computed across a population. All of your standard evaluation metrics should go here — things like accuracy, precision, recall, or any other aggregate metrics that apply to your problem.
For keypoints, we care about the mean NME and alignment failure rate across the different test samples in a test case:
from kolena.workflow import MetricsTestCase

@dataclass(frozen=True)
class TestCaseMetrics(MetricsTestCase):
    mean_nme: float
    alignment_failure_rate: float
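A minimal sketch of the aggregation step, assuming per-sample results are already available. SampleResult is a stand-in defined here for illustration, not a Kolena type, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SampleResult:
    """Stand-in for per-sample metrics; not a Kolena type."""
    normalized_mean_error: float
    alignment_failure: bool


def aggregate(results: List[SampleResult]) -> Dict[str, float]:
    """Compute mean NME and alignment failure rate over one test case."""
    n = len(results)
    if n == 0:
        raise ValueError("cannot aggregate an empty test case")
    return {
        "mean_nme": sum(r.normalized_mean_error for r in results) / n,
        "alignment_failure_rate": sum(r.alignment_failure for r in results) / n,
    }


# Hypothetical per-sample results.
results = [
    SampleResult(0.02, False),
    SampleResult(0.04, False),
    SampleResult(0.12, True),
    SampleResult(0.06, True),
]
summary = aggregate(results)
```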
You can also define test case plots for a workflow that can be visualized in the web platform. See the API documentation for details.

Test Suite Metrics

Test Suite Metrics are metrics that measure performance across different test cases.
An example Test Suite Metric may be variance across test cases, allowing you to measure and penalize model bias across different subsets of your dataset, like demographics.
from kolena.workflow import MetricsTestSuite

@dataclass(frozen=True)
class TestSuiteMetrics(MetricsTestSuite):
    variance_mean_nme: float
    variance_alignment_failure_rate: float
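As a sketch of how such suite-level values might be derived, the snippet below takes the variance of per-test-case metrics with the standard library. The test case names and numbers are hypothetical, and population variance is used here as an assumption; sample variance would be an equally reasonable choice.

```python
from statistics import pvariance

# Hypothetical test-case-level metrics, e.g. one entry per demographic subset.
mean_nme_by_test_case = {"subset-a": 0.04, "subset-b": 0.06, "subset-c": 0.08}
failure_rate_by_test_case = {"subset-a": 0.10, "subset-b": 0.20, "subset-c": 0.30}

# Population variance across test cases: a low value suggests the model
# performs similarly on every subset.
variance_mean_nme = pvariance(list(mean_nme_by_test_case.values()))
variance_alignment_failure_rate = pvariance(list(failure_rate_by_test_case.values()))
```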

Step 3: Creating Tests

With our data already in an S3 bucket and metadata loaded into memory, we can start creating test cases!
Let's create a simple test case containing the entire dataset:
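import json

test_samples = [TestSample(locator) for locator in df["locator"]]
# points is a stringified list of [x, y] pairs, so parse it before building Keypoints
ground_truths = [
    GroundTruth(Keypoints(json.loads(record.points)), record.normalization_factor)
    for record in df.itertuples()
]
ts_with_gt = list(zip(test_samples, ground_truths))
complete_test_case = TestCase(f"{DATASET} :: basic", test_samples=ts_with_gt)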
In this tutorial we created only a single simple test case, but more advanced test cases can be generated in a variety of fast and scalable ways. See Creating Test Cases for details.
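One simple way to think about more targeted test cases is to stratify the dataset by some attribute. The sketch below groups locators into coarse face-size buckets by normalization factor using plain Python; the rows, the cutoffs, and the naming scheme are all assumptions for illustration, and it does not call the Kolena API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (locator, normalization_factor) pairs; hypothetical values for illustration.
rows: List[Tuple[str, float]] = [
    ("s3://example-bucket/a.png", 45.0),
    ("s3://example-bucket/b.png", 98.0),
    ("s3://example-bucket/c.png", 210.0),
    ("s3://example-bucket/d.png", 52.5),
]


def face_size_bucket(normalization_factor: float) -> str:
    """Assign a coarse size label; the cutoffs are arbitrary assumptions."""
    if normalization_factor < 60.0:
        return "small"
    if normalization_factor < 150.0:
        return "medium"
    return "large"


buckets: Dict[str, List[str]] = defaultdict(list)
for locator, factor in rows:
    buckets[face_size_bucket(factor)].append(locator)

# Each bucket could then back its own test case, named along these lines.
test_case_names = {name: f"300-W :: {name} faces" for name in buckets}
```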
Now that we have a basic test case for our entire dataset let's create a test suite for it:
test_suite = TestSuite(f"{DATASET} :: basic", test_cases=[complete_test_case])

Step 4: Running Tests

With basic tests defined for the 300-W dataset, we're almost ready to start testing our models.

Implementing an Evaluator

Core to the testing process is the kolena.workflow.Evaluator implementation that computes the metrics defined in step 2. Usually, an evaluator implementation simply plugs your existing metrics computation logic into the Evaluator interface.
Evaluators can have arbitrary configuration, allowing you to evaluate model performance under a variety of conditions. For this keypoints example, perhaps we want to compute performance at a few different NME threshold values, as this threshold drives the alignment_failure metric.
from kolena.workflow import EvaluatorConfiguration

@dataclass(frozen=True)
class NmeThreshold(EvaluatorConfiguration):
    # threshold for NME above which an image is considered an "alignment failure"
    threshold: float

    def display_name(self):
        return f"nme-threshold-{self.threshold}"
A full Evaluator implementation is left as an exercise for the reader. The stub below returns random metrics so that the rest of the tutorial can run end to end:
from random import random

from kolena.workflow import Evaluator

class KeypointsEvaluator(Evaluator):
    """See API documentation for details."""

    def compute_test_case_metrics(self, test_case, inferences, metrics, configuration=None):
        # Generate dummy metrics for demo purposes.
        return TestCaseMetrics(random(), random())

    def compute_test_sample_metrics(self, test_case, inferences, metrics, configuration=None):
        # Generate dummy metrics for demo purposes.
        return [
            (i[0], TestSampleMetrics(random(), bool(random() > 0.5)))
            for i in inferences
        ]

evaluator = KeypointsEvaluator(configurations=[
    NmeThreshold(0.01), NmeThreshold(0.05), NmeThreshold(0.1),
])
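To make the role of the threshold concrete, the sketch below evaluates the same set of per-sample NME values at the three thresholds configured above and reports the alignment failure rate for each. It is plain Python with hypothetical NME values; the two helper functions are illustrative and are not part of the Evaluator interface.

```python
from typing import Dict, List


def display_name(threshold: float) -> str:
    """Mirror the naming scheme used by the configuration's display_name."""
    return f"nme-threshold-{threshold}"


def failure_rate(nmes: List[float], threshold: float) -> float:
    """Fraction of samples whose NME is strictly above the threshold."""
    return sum(nme > threshold for nme in nmes) / len(nmes)


# Hypothetical per-sample NME values.
nmes = [0.005, 0.02, 0.03, 0.07, 0.2]
rates: Dict[str, float] = {
    display_name(t): failure_rate(nmes, t) for t in (0.01, 0.05, 0.1)
}
for name, rate in rates.items():
    print(name, rate)
```

A stricter (lower) threshold classifies more samples as alignment failures, which is why it can be useful to evaluate under several configurations at once.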

Running tests

To test our models, we can define an infer function that maps the TestSample object we defined above into an Inference:
from random import randint, random

def infer(test_sample: TestSample) -> Inference:
    """
    1. load the image pointed to at `test_sample.locator`
    2. pass the image to our model and transform its output into an `Inference` object
    """
    # Generate a dummy inference for demo purposes.
    return Inference(Keypoints([(randint(100, 400), randint(100, 400)) for _ in range(5)]), random())

model = Model("example-model-name", infer=infer, metadata=dict(
    description="Any freeform metadata can go here",
    # It may be helpful to include information about the model's environment,
    # training methodology, responsible party, etc.
))
We now have the pieces in place to run tests on our new workflow:
from kolena.workflow import test
test(model, test_suite, evaluator)
That wraps up the testing process! We can now visit the web platform to analyze and debug our model's performance on this test suite.


In this quickstart tutorial we learned how to build a workflow for an arbitrary ML problem, using a facial keypoints model as an example. We created new tests, tested our models on Kolena, and learned how to customize evaluation to fit our exact expectations.
This tutorial just scratches the surface of what's possible with Kolena and covered a fraction of the kolena-client API — now that we're up and running, we can think about ways to create more detailed tests, improve existing tests, and dive deep into model behaviors.