Testing with Kolena
Kolena is a machine learning testing and debugging platform that surfaces hidden model behaviors and takes the mystery out of model development. Kolena helps you:
- Perform high-resolution model evaluation
- Understand and track behavioral improvements and regressions
- Meaningfully communicate model capabilities
- Automate model testing and deployment workflows
Kolena organizes your test data, stores and visualizes your model evaluations, and provides tooling to craft better tests. You interface with it through the web at app.kolena.io and programmatically via the `kolena` Python client. Kolena helps you test your ML models more effectively. Jump right in with the quickstart guide.
Current ML evaluation techniques are falling short. Engineers run inference on arbitrarily split benchmark datasets, spend weeks staring at error graphs to evaluate their models, and ultimately produce a global metric that fails to capture the true behavior of the model.
Models exhibit highly variable performance across different subclasses. A global metric gives you a high-level picture of performance but doesn't tell you what you really want to know: what sort of behavior can I expect from my models?
To answer this question you need a higher-resolution picture of model performance. Not "how well does my model perform on class X," but "in what scenarios does my model perform well for class X?"

Looking at the global metric, Model A seems far inferior to Model B.
In the above example, looking only at the global metric (e.g. F1 score), we'd almost certainly choose to deploy Model B.
But what if the "High Occlusion" scenario isn't important for our product? Most of Model A's failures come from that scenario, and it outperforms Model B in more important scenarios like "Front View." Meanwhile, Model B's underperformance in the highly important "Front View" scenario is masked by improved performance in unimportant ones.
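To make the difference concrete, here is a minimal sketch (not Kolena code) of what fine-grained evaluation looks like. The scenario names, labels, and record layout below are hypothetical, and scikit-learn's `f1_score` stands in for whatever metric you care about; the point is simply that grouping samples by scenario before scoring surfaces behavior that a single global number hides.

```python
from collections import defaultdict
from sklearn.metrics import f1_score  # any F1 implementation works here

# Hypothetical records: (scenario, ground_truth_label, predicted_label)
records = [
    ("Front View",     1, 1), ("Front View",     1, 0), ("Front View",     0, 0),
    ("High Occlusion", 1, 0), ("High Occlusion", 0, 0), ("High Occlusion", 1, 1),
    # ... one record per sample in the benchmark
]

# Global metric: one number, averaged over every scenario at once.
y_true = [gt for _, gt, _ in records]
y_pred = [pred for _, _, pred in records]
print(f"global F1: {f1_score(y_true, y_pred):.3f}")

# Fine-grained metrics: group samples by scenario and score each group separately.
by_scenario = defaultdict(list)
for scenario, gt, pred in records:
    by_scenario[scenario].append((gt, pred))

for scenario, pairs in by_scenario.items():
    gts, preds = zip(*pairs)
    print(f"{scenario:>15} F1: {f1_score(gts, preds):.3f}")
```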
Everything you know about your model's behavior before it reaches production is learned from your tests. Fine-grained tests teach you what you need to know before a model hits production.
Now... why Kolena? Two reasons:
1. Managing fine-grained tests is a tedious data engineering task, especially as your dataset grows and your understanding of your domain develops
2. Creating fine-grained tests is labor-intensive, typically requiring manual annotation of countless images, a costly and time-consuming process
We built Kolena to solve these two problems.
The Kolena testing platform is a web application for exploring high-resolution model test results, debugging model behaviors, and communicating performance effectively with internal and external stakeholders, both technical and non-technical.
There are two main components of Kolena:
1. The web platform at app.kolena.io, for exploring, debugging, and sharing model test results
2. The `kolena` Python client, for programmatically creating tests and submitting model evaluations
Testing on Kolena is centered around a few central concepts:
Concept | Description
---|---
Test Case | A well-scoped benchmark dataset against which metrics are computed, assessing one facet of model behavior
Test Suite | A grouping of test cases, attempting to answer a specific set of questions about model performance
Model | A deterministic transformation turning an input into a set of predictions
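As a mental model only (this is not the `kolena` client's API; every name below is hypothetical), the three concepts map onto simple structures: a test case is a named set of samples, a test suite is a named collection of test cases, and a model is a deterministic function from an input to predictions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Illustrative sketch of the concepts above -- NOT the `kolena` client API.
# All class and function names here are hypothetical.

@dataclass(frozen=True)
class Sample:
    locator: str       # e.g. a URL or path to an image
    ground_truth: int  # ground-truth label for this sample

@dataclass(frozen=True)
class TestCase:
    name: str                  # e.g. "front view" or "high occlusion"
    samples: Sequence[Sample]

@dataclass(frozen=True)
class TestSuite:
    name: str                      # e.g. "pedestrian detection: occlusion levels"
    test_cases: Sequence[TestCase]

# A model is treated as a deterministic transformation from an input to predictions.
Model = Callable[[Sample], int]

def evaluate(model: Model, suite: TestSuite) -> Dict[str, float]:
    """Run the model on every test case in the suite and report per-case accuracy."""
    return {
        case.name: sum(model(s) == s.ground_truth for s in case.samples) / len(case.samples)
        for case in suite.test_cases
    }
```

Kolena manages these abstractions for you, so test cases and suites stay consistent as your dataset grows instead of living in ad hoc evaluation scripts.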
Learn how to use Kolena to test your models effectively.
Check out the Advanced Usage documentation to learn how to further streamline the process of testing on Kolena.
For further reading on hidden stratification and fine-grained evaluation:
- Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging (arXiv:1909.12475)
- No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems (arXiv:2011.12945)