Testing with Kolena

Kolena is a machine learning testing and debugging platform to surface hidden model behaviors and take the mystery out of model development. Kolena helps you:
  • Perform high-resolution model evaluation
  • Understand and track behavioral improvements and regressions
  • Meaningfully communicate model capabilities
  • Automate model testing and deployment workflows
Kolena organizes your test data, stores and visualizes your model evaluations, and provides tooling to craft better tests. You interface with it through the web app and programmatically via the kolena Python client.


Kolena helps you test your ML models more effectively. Jump right in with the quickstart guide.

Why Kolena?

Current ML evaluation techniques are falling short. Engineers run inference on arbitrarily split benchmark datasets, spend weeks staring at error graphs to evaluate their models, and ultimately produce a global metric that fails to capture the true behavior of the model.
Models exhibit highly variable performance across different subclasses. A global metric gives you a high-level picture of performance but doesn't tell you what you really want to know: what sort of behavior can I expect from my models?
To answer this question you need a higher-resolution picture of model performance. Not "how well does my model perform on class X," but "in what scenarios does my model perform well for class X?"
Looking at the global metric, Model A seems far inferior to Model B.
In the above example, looking only at a global metric (e.g. F1 score), we'd almost certainly choose to deploy Model B.
But what if the "High Occlusion" scenario isn't important for our product? Most of Model A's failures are from that scenario, and it outperforms Model B in more important scenarios like "Front View." Meanwhile, Model B's underperformance in the "Front View" scenario, a highly important scenario, is masked by improved performance in unimportant scenarios.
Everything you know about your model's behavior prior to pushing to production is learned from your tests. Fine-grained tests teach you what you need to learn before a model hits production.
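To make this concrete, here is a small self-contained sketch of how a global F1 score can mask per-scenario behavior. The scenario names mirror the Model A vs. Model B example above, but all counts are illustrative, not real results:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw true positive / false positive / false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical (tp, fp, fn) counts per scenario for two models.
scenarios = {
    "Front View":     {"A": (90, 5, 5),   "B": (70, 15, 15)},
    "High Occlusion": {"A": (20, 40, 40), "B": (80, 10, 10)},
}

for model in ("A", "B"):
    # Global F1: pool counts across all scenarios (micro-averaged).
    tp, fp, fn = (sum(scenarios[s][model][i] for s in scenarios) for i in range(3))
    print(f"Model {model} global F1: {f1(tp, fp, fn):.2f}")
    for name, counts in scenarios.items():
        print(f"  {name}: {f1(*counts[model]):.2f}")
```

With these numbers, Model B's global F1 is higher, yet Model A clearly wins in the "Front View" scenario — exactly the kind of behavior a single global metric hides.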
Now... why Kolena? Two reasons:
  1. Managing fine-grained tests is a tedious data engineering task, especially as your dataset grows and your understanding of your domain develops.
  2. Creating fine-grained tests is labor-intensive, typically involving manual annotation of countless images, a costly and time-consuming process.
We built Kolena to solve these two problems.

Core Concepts

Kolena's web platform lets you explore high-resolution model test results, debug behaviors, and communicate performance effectively to stakeholders, whether internal or external, technical or non-technical.
There are two main components of Kolena:
  1. The web app
  2. The Python client, kolena
Testing on Kolena is centered around a few central concepts:
Workflow
Type of computer vision problem, e.g. Object Detection or Classification
Test Case
Well-scoped benchmark dataset against which metrics are computed, assessing one facet of model behavior
Test Suite
Grouping of test cases, attempting to answer a specific set of questions about model performance
Model
Deterministic transformation turning an input into a set of predictions
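To make the relationships between these concepts concrete, here is a plain-Python sketch. Note that these are not the kolena client's actual classes — just an illustrative model of how test cases, test suites, and models fit together:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class TestCase:
    """A well-scoped benchmark slice assessing one facet of model behavior."""
    name: str
    samples: List[str]  # e.g. image locators

@dataclass
class TestSuite:
    """A grouping of test cases answering a set of questions about performance."""
    name: str
    test_cases: List[TestCase] = field(default_factory=list)

@dataclass
class Model:
    """A deterministic transformation turning an input into predictions."""
    name: str
    infer: Callable[[str], list]

def run(model: Model, suite: TestSuite) -> Dict[str, list]:
    # Evaluate the model on every test case in the suite, keyed by case name,
    # so metrics can later be computed per test case rather than globally.
    return {tc.name: [model.infer(s) for s in tc.samples] for tc in suite.test_cases}
```

Because results are keyed by test case, each slice of the suite can be scored independently — the structure that makes the fine-grained evaluation described above possible.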

Testing with Kolena

Learn how to use Kolena to test your models effectively.

Advanced Usage

Check out the Advanced Usage documentation to learn how to further streamline the process of testing on Kolena.

Read More

  • Best Practices for ML Model Testing (Kolena Blog)
  • Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging (arXiv:1909.12475)
  • No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems (arXiv:2011.12945)