Quickstart: Building a Workflow
In this quickstart tutorial, we'll learn how to use the workflow builder definitions in kolena.workflow to test a keypoints detection model on the 300-W facial keypoints dataset. This demonstration shows how to build a workflow to test any arbitrary ML problem on Kolena.

import os

import kolena

kolena.initialize(os.environ["KOLENA_TOKEN"], verbose=True)
The data used in this tutorial is publicly available in the kolena-public-datasets S3 bucket in a metadata.csv file:

import pandas as pd

DATASET = "300-W"
BUCKET = "s3://kolena-public-datasets"
df = pd.read_csv(f"{BUCKET}/{DATASET}/meta/metadata.csv")

To load CSVs directly from S3, make sure to install the s3fs Python module:

pip3 install s3fs[boto3]
and set up AWS credentials (e.g. with aws configure or the standard AWS environment variables).

This metadata.csv file describes a keypoints detection dataset with the following columns:

- locator: location of the image in S3
- normalization_factor: normalization factor for the image, used to normalize the error on a per-image basis. Common techniques for computing it include the Euclidean distance between two reference points or the diagonal measurement of the image.
- points: stringified list of [x, y] coordinates of the ground truth keypoints

Each locator is present exactly once and contains the keypoints ground truth for that image. In this tutorial, we're implementing our workflow with support for only a single keypoints instance per image, but we could easily adapt our ground truth, inference, and metrics types to accommodate a variable number of keypoints arrays per image.

For brevity, the 300-W dataset has been pared down to only 5 keypoints: the outermost corner of each eye, the bottom of the nose, and the corners of the mouth.
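To sanity-check what we've loaded, we can peek at these columns and parse a single points entry (a minimal sketch using only the columns described above):

import json

# peek at the columns described above
print(df[["locator", "normalization_factor", "points"]].head())

# each points entry is a JSON-encoded list of [x, y] pairs
first_points = json.loads(df.loc[0, "points"])
print(f"{len(first_points)} keypoints: {first_points}")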
When building your own workflow, you have control over the Test Sample (e.g. image), Ground Truth (e.g. 5-element facial keypoints array), and Inference types used in your project.
For the purposes of this tutorial, let's assume our model takes a single image as input along with an optional bounding box around the face in question, produced by an upstream model in our pipeline. We can import and extend kolena.workflow.Image for this purpose:

from dataclasses import dataclass
from typing import Optional

from kolena.workflow import Image
from kolena.workflow.annotation import BoundingBox

@dataclass(frozen=True)
class TestSample(Image):
    bbox: Optional[BoundingBox] = None
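For example, a single test sample could be constructed like this (a sketch; the locator and bounding box coordinates are illustrative placeholders, not values from the dataset):

example_sample = TestSample(
    locator=f"{BUCKET}/{DATASET}/images/example.png",  # placeholder locator
    bbox=BoundingBox(top_left=(50, 75), bottom_right=(250, 300)),  # optional face box from an upstream model
)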
Next, let's define our ground truth and inference types. Note that our model produces not only a keypoints array, but an associated confidence value that we may use to ignore low-confidence predictions:
from kolena.workflow import GroundTruth as GT, Inference as Inf
from kolena.workflow.annotation import Keypoints
@dataclass(frozen=True)
class GroundTruth(GT):
    keypoints: Keypoints

    # In order to compute normalized error, some normalization factor describing
    # the size of the face in the image is required.
    normalization_factor: float

@dataclass(frozen=True)
class Inference(Inf):
    keypoints: Keypoints
    confidence: float
Finally, let's declare our workflow object with the types we've just defined:
from kolena.workflow import define_workflow
# use these TestCase, TestSuite, and Model definitions to create and run tests
wf, TestCase, TestSuite, Model = define_workflow(
"Facial Keypoints", TestSample, GroundTruth, Inference
)
With our core data types defined, the next step is to lay out our evaluation criteria: our metrics.
Kolena exposes three levels of metrics when building a workflow:
Test Sample Metrics are metrics computed from a single test sample and its associated ground truths and inferences.
For the keypoints detection workflow, an example metric is normalized mean error (NME): the mean distance between the ground truth and inference keypoints, normalized by the per-image normalization factor.
from kolena.workflow import MetricsTestSample
@dataclass(frozen=True)
class TestSampleMetrics(MetricsTestSample):
    normalized_mean_error: float

    # If the normalized mean error is above some configured threshold, this test
    # sample is considered an "alignment failure".
    alignment_failure: bool
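As a sketch of how these per-sample values might be computed (compute_nme is a hypothetical helper, not part of kolena, and the threshold would come from the evaluator configuration defined below):

import numpy as np

def compute_nme(gt: Keypoints, inf: Keypoints, normalization_factor: float) -> float:
    # mean Euclidean distance between corresponding keypoints, normalized per image
    distances = np.linalg.norm(np.array(gt.points) - np.array(inf.points), axis=1)
    return float(np.mean(distances) / normalization_factor)

def compute_sample_metrics(gt: GroundTruth, inf: Inference, threshold: float) -> TestSampleMetrics:
    nme = compute_nme(gt.keypoints, inf.keypoints, gt.normalization_factor)
    return TestSampleMetrics(normalized_mean_error=nme, alignment_failure=nme > threshold)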
Test Case metrics are aggregate metrics computed across a population. All of your standard evaluation metrics should go here — things like accuracy, precision, recall, or any other aggregate metrics that apply to your problem.
For keypoints, we care about the mean NME and alignment failure rate across the different test samples in a test case:
from kolena.workflow import MetricsTestCase
@dataclass(frozen=True)
class TestCaseMetrics(MetricsTestCase):
    mean_nme: float
    alignment_failure_rate: float
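These aggregates can be computed directly from the per-sample metrics, for example (a minimal sketch assuming a non-empty list):

from typing import List

def compute_aggregate_metrics(sample_metrics: List[TestSampleMetrics]) -> TestCaseMetrics:
    n = len(sample_metrics)
    return TestCaseMetrics(
        mean_nme=sum(m.normalized_mean_error for m in sample_metrics) / n,
        alignment_failure_rate=sum(m.alignment_failure for m in sample_metrics) / n,
    )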
You can also define test case plots for a workflow that can be visualized in the web platform. See the API documentation for details.
Test Suite Metrics are metrics that measure performance across different test cases.
An example Test Suite Metric may be variance across test cases, allowing you to measure and penalize model bias across different subsets of your dataset, like demographics.
from kolena.workflow import MetricsTestSuite
@dataclass(frozen=True)
class TestSuiteMetrics(MetricsTestSuite):
    variance_mean_nme: float
    variance_alignment_failure_rate: float
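One way to compute these from the per-test-case metrics (a sketch using population variance; any variance definition that fits your needs works equally well):

import statistics
from typing import List

def compute_suite_metrics(test_case_metrics: List[TestCaseMetrics]) -> TestSuiteMetrics:
    return TestSuiteMetrics(
        variance_mean_nme=statistics.pvariance([m.mean_nme for m in test_case_metrics]),
        variance_alignment_failure_rate=statistics.pvariance(
            [m.alignment_failure_rate for m in test_case_metrics]
        ),
    )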
With our data already in an S3 bucket and metadata loaded into memory, we can start creating test cases!
Let's create a simple test case containing the entire dataset:
import json
test_samples = [TestSample(locator) for locator in df["locator"]]
ground_truths = [
    GroundTruth(
        keypoints=Keypoints(points=json.loads(record.points)),
        normalization_factor=record.normalization_factor,
    )
    for record in df.itertuples()
]
ts_with_gt = list(zip(test_samples, ground_truths))
complete_test_case = TestCase(f"{DATASET} :: basic", test_samples=ts_with_gt)
In this tutorial we created only a single simple test case, but more advanced test cases can be generated in a variety of fast and scalable ways. See Creating Test Cases for details.
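As one illustration, we could carve a sub-test case out of the same samples using any criterion available in our ground truths or metadata (a sketch; the normalization_factor threshold below is arbitrary and purely illustrative):

# illustrative only: a test case covering the subset of images with larger faces
large_face_pairs = [(ts, gt) for ts, gt in ts_with_gt if gt.normalization_factor > 100]
large_face_test_case = TestCase(f"{DATASET} :: large faces", test_samples=large_face_pairs)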
Now that we have a basic test case for our entire dataset let's create a test suite for it:
test_suite = TestSuite(f"{DATASET} :: basic", test_cases=[
    complete_test_case
])
With basic tests defined for the 300-W dataset, we're almost ready to start testing our models.
Core to the testing process is the kolena.workflow.Evaluator implementation that computes the metrics defined above. Usually, an evaluator implementation simply plugs your existing metrics computation logic into the Evaluator interface.

Evaluators can have arbitrary configuration, allowing you to evaluate model performance under a variety of conditions. For this keypoints example, perhaps we want to compute performance at a few different NME threshold values, as this threshold drives the alignment_failure metric.

from kolena.workflow import EvaluatorConfiguration
@dataclass(frozen=True)
class NmeThreshold(EvaluatorConfiguration):
    # threshold for NME above which an image is considered an "alignment failure"
    threshold: float

    def display_name(self):
        return f"nme-threshold-{self.threshold}"
A full Evaluator implementation is left as an exercise for the reader; for this tutorial, we'll stub one out that returns dummy metric values:
from random import random
from kolena.workflow import Evaluator
class KeypointsEvaluator(Evaluator):
    """See API documentation for details."""

    def compute_test_case_metrics(self, test_case, inferences, metrics, configuration=None):
        # Generate dummy metrics for demo purposes.
        return TestCaseMetrics(random(), random())

    def compute_test_sample_metrics(self, test_case, inferences, metrics, configuration=None):
        # Generate dummy metrics for demo purposes.
        return [
            (i[0], TestSampleMetrics(random(), bool(random() > 0.5)))
            for i in inferences
        ]
evaluator = KeypointsEvaluator(configurations=[
    NmeThreshold(0.01), NmeThreshold(0.05), NmeThreshold(0.1),
])
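If we wanted real metrics instead of dummy values, the same interface might be filled in roughly as follows (a sketch that reuses the hypothetical compute_nme helper from earlier and assumes, as the stub above does, that each element of inferences is a (test_sample, ground_truth, inference) tuple and that metrics is the list of per-sample metrics):

class RealKeypointsEvaluator(Evaluator):
    def compute_test_sample_metrics(self, test_case, inferences, metrics, configuration=None):
        # fall back to an arbitrary threshold when no configuration is provided
        threshold = configuration.threshold if configuration is not None else 0.05
        results = []
        for test_sample, ground_truth, inference in inferences:
            nme = compute_nme(ground_truth.keypoints, inference.keypoints, ground_truth.normalization_factor)
            results.append((test_sample, TestSampleMetrics(nme, nme > threshold)))
        return results

    def compute_test_case_metrics(self, test_case, inferences, metrics, configuration=None):
        # aggregate the per-sample metrics computed above
        n = len(metrics)
        return TestCaseMetrics(
            mean_nme=sum(m.normalized_mean_error for m in metrics) / n,
            alignment_failure_rate=sum(m.alignment_failure for m in metrics) / n,
        )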
To test our models, we can define an infer function that maps the TestSample object we defined above into an Inference:

from random import randint
def infer(test_sample: TestSample) -> Inference:
    """
    A real implementation would:
    1. load the image pointed to at `test_sample.locator`
    2. pass the image to our model and transform its output into an `Inference` object
    """
    # Generate a dummy inference for demo purposes.
    return Inference(Keypoints([(randint(100, 400), randint(100, 400)) for _ in range(5)]), random())
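Outside of this demo, the body would follow the docstring; load_image and run_model below are hypothetical placeholders for your own image I/O and model code:

def infer(test_sample: TestSample) -> Inference:
    # hypothetical helpers: load_image fetches the image from S3,
    # run_model returns (points, confidence) for an image and optional face box
    image = load_image(test_sample.locator)
    points, confidence = run_model(image, bbox=test_sample.bbox)
    return Inference(keypoints=Keypoints(points=points), confidence=confidence)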
model = Model("example-model-name", infer=infer, metadata=dict(
description="Any freeform metadata can go here",
hint="""
It may be helpful to include information about the model's environment,
training methodology, responsible party, etc.
""",
))
We now have the pieces in place to run tests on our new workflow:
from kolena.workflow import test
test(model, test_suite, evaluator)
That wraps up the testing process! We can now visit the web platform to analyze and debug our model's performance on this test suite.
In this quickstart tutorial we learned how to build a workflow for an arbitrary ML problem, using a facial keypoints model as an example. We created new tests, tested our models on Kolena, and learned how to customize evaluation to fit our exact expectations.
This tutorial just scratches the surface of what's possible with Kolena and covers only a fraction of the kolena-client API. Now that we're up and running, we can think about ways to create more detailed tests, improve existing tests, and dive deep into model behaviors.