At its core, a test case in Kolena is a benchmark dataset against which metrics are computed.
Test cases can be as small or as large as necessary — in some circumstances, a test case containing hundreds of thousands of examples is desirable. In other cases, a test case with a handful of laser-focused examples is best.
Types of test cases include:
- Unit Tests: examples focusing on a particular scenario or subclass, e.g. "cars from the rear in low lighting"
- Regression Tests: examples that a particular model training push improved (i.e. a squashed bug), or examples from one segment of the "long tail"
- Integration Tests: examples mirroring deployment conditions in terms of composition, distribution, preprocessing, etc.
  - Integration tests may also be subsets of normal tests run both pre-packaging and post-packaging to ensure that operations such as quantization do not unacceptably degrade performance
How a test case is defined varies by workflow:
When building your own workflow, a test case may contain any number (>0) of test samples of any arbitrary data type, such as images with annotations or documents, each with ground truth(s) specific to your workflow.
Object Detection / Instance Segmentation
In the Object Detection and Instance Segmentation workflows, a test case contains any number (>0) of test images, each with zero or more ground truth bounding boxes/segmentation masks defining the extent and label of each object being bounded.
Classification
In the Classification workflow, a test case contains any number (>0) of test images, each with zero or more ground truth labels corresponding to the class(es) the image belongs to.
Face Recognition 1:1
In the Face Recognition 1:1 workflow, a test case contains any number (>0) of image pairs, each with a ground truth label indicating whether the two images do or do not contain the same person.
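The common pattern across workflows is a test case pairing each test sample with its ground truths. A minimal sketch of that shape for object detection, using plain dataclasses (these structures and field names are invented for illustration and are not Kolena's actual API):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical structures illustrating the test case pattern above.
@dataclass
class BoundingBox:
    label: str
    top_left: Tuple[int, int]
    bottom_right: Tuple[int, int]

@dataclass
class TestSample:
    locator: str  # e.g. a URL or path pointing at the image
    ground_truths: List[BoundingBox] = field(default_factory=list)  # zero or more

@dataclass
class TestCase:
    name: str
    samples: List[TestSample]

# An object detection test case: each image carries zero or more ground truth boxes
test_case = TestCase(
    name="cars from the rear in low lighting",
    samples=[
        TestSample("s3://data/img_001.jpg", [BoundingBox("car", (10, 20), (110, 220))]),
        TestSample("s3://data/img_002.jpg", []),  # negative example: no cars present
    ],
)
```

The same shape generalizes to the other workflows by swapping the ground truth type: class labels for classification, a same-person flag on image pairs for face recognition.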
Metrics computed against aggregate benchmarks don't tell the full story.
If you are an engineer, data scientist, product manager, team lead, or anyone else who builds or sells ML products, you've likely asked (or been asked):
- What are my model's failure modes (bugs)?
- If I've trained a new model, what in particular has improved or regressed from my previous model?
- If my new model has improved from 96% F1 score to 98%, has it improved across all scenarios?
- Which model should we deploy?
- How do I know if my model is ready to be deployed?
Test cases help you give direct, repeatable, and methodical answers to these questions.
Test cases of different sizes tell you different things about model performance.
At one end, you will likely want to create a test case containing your entire benchmark. Metrics computed against this test case will then be the aggregate metrics you're used to looking at.
At the other end, if you're testing, for example, whether a model can be demoed under certain circumstances, it's reasonable to create test cases with only a few dozen examples. While good performance on these tiny test cases doesn't guarantee good performance more broadly, poor performance is a strong indicator that your model isn't ready to demo under those circumstances.
Create as many test cases as you can meaningfully define. When exploring results, you can always filter down to only certain test cases.
If your images are annotated with additional metadata, make test cases for each metadata value. If your images are from different sources, make test cases broken down by source. If your images vary widely in size, make test cases stratified by image dimensions.
Each test case increases the resolution of your results. Higher resolution isn't necessarily better — adding noise doesn't help — but higher resolution is necessary to form a detailed understanding of your model's performance.
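The stratification described above can be sketched as a simple grouping step. The sample dicts, field names, and naming scheme below are invented for illustration:

```python
from collections import defaultdict

# Invented example dataset: each sample carries metadata to stratify on
samples = [
    {"locator": "img_001.jpg", "source": "dashcam", "width": 640},
    {"locator": "img_002.jpg", "source": "dashcam", "width": 1920},
    {"locator": "img_003.jpg", "source": "drone", "width": 1920},
]

def stratify(samples, name, key):
    """Group samples into one candidate test case per distinct value of key(sample)."""
    cases = defaultdict(list)
    for sample in samples:
        cases[f"{name} :: {key(sample)}"].append(sample)
    return dict(cases)

# One set of test cases per metadata field, all cut from the same dataset
by_source = stratify(samples, "source", lambda s: s["source"])
by_size = stratify(samples, "size", lambda s: "large" if s["width"] >= 1280 else "small")
```

Each call produces a different set of cuts through the same samples; every additional cut adds resolution to your results.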
Certain metrics for each workflow are dependent on negative examples.
For example, precision depends on both the true positives and the false positives predicted by your model. If a test case has no negative examples, your model has little opportunity to produce false positives, so its precision will look artificially high. Precision is therefore not a particularly meaningful metric for that test case!
In general, we encourage including a mix of both positive and negative examples within each test case to avoid being misled by metrics.
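The effect is easy to see with the precision formula itself. The true/false positive counts below are invented purely to illustrate the point:

```python
# Precision = TP / (TP + FP); illustrative counts only
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

# Positives-only test case: almost nothing to falsely detect, so precision looks great
p_positives_only = precision(tp=95, fp=2)

# Mixed test case: negative examples surface the model's false positives
p_mixed = precision(tp=95, fp=30)
```

The same model scores roughly 0.98 on the positives-only cut and 0.76 once negatives are included, which is why a mix of both is needed for the metric to mean anything.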
Samples can, and often should, appear in multiple test cases. We encourage defining many test cases that take different cuts through the same dataset.
For example, if training a car detector, you may want to stratify your data by lighting, occlusion, distance, type, and color using different test cases within the same test suite. The same image would appear in test cases within each of those categories.
No matter how many different test cases a single image belongs to, during testing you only need to process the image once and upload its inferences to Kolena once.
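In other words, inference cost scales with unique samples, not test case memberships. A sketch of that deduplication, with test case names, contents, and the stand-in model all invented for illustration:

```python
# Invented overlapping test cases: three unique images across six memberships
test_cases = {
    "lighting :: low": ["img_001.jpg", "img_002.jpg"],
    "occlusion :: heavy": ["img_002.jpg", "img_003.jpg"],
    "color :: red": ["img_001.jpg", "img_003.jpg"],
}

def model_infer(locator):
    return {"locator": locator, "boxes": []}  # stand-in for a real detector

# Deduplicate before running the model: 3 inference calls, not 6
unique_images = sorted({img for images in test_cases.values() for img in images})
inferences = {img: model_infer(img) for img in unique_images}
```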
Update your tests regularly! No dataset is perfect — we encourage making incremental improvements whenever possible to remove noise, improve ground truths, and add new examples.
You don't have to worry that updating tests will break your ability to compare apples to apples. If two models have been tested on different versions of the same test cases, you can use the web platform to compare performance on only the examples both models have been tested on.
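Conceptually, that comparison restricts both result sets to their shared samples before computing metrics. A minimal sketch with invented per-sample correctness values (not Kolena's actual implementation):

```python
# Results from two models tested on different versions of a test case:
# sample locator -> whether the prediction was correct (invented data)
model_a = {"img_001.jpg": True, "img_002.jpg": False, "img_003.jpg": True}
model_b = {"img_002.jpg": True, "img_003.jpg": True, "img_004.jpg": False}

# Apples-to-apples subset: only samples both models were tested on
common = sorted(model_a.keys() & model_b.keys())

accuracy_a = sum(model_a[s] for s in common) / len(common)
accuracy_b = sum(model_b[s] for s in common) / len(common)
```

Samples unique to either version (here the invented `img_001.jpg` and `img_004.jpg`) are simply excluded from the head-to-head comparison.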