Quickstart: Face Recognition (1:1)

In this quickstart tutorial we'll use the Labeled Faces in the Wild dataset and the deepface Python library to create and run tests for the Face Recognition (1:1) workflow.

Getting Started

With the kolena-client Python client installed, let's first initialize a client session:
import os
import kolena
kolena.initialize(os.environ["KOLENA_TOKEN"], verbose=True)
The data used in this tutorial is publicly available in the kolena-public-datasets S3 bucket in metadata.csv and pairs.csv.gz files:
import pandas as pd
DATASET = "labeled-faces-in-the-wild"
BUCKET = "s3://kolena-public-datasets"
df_metadata = pd.read_csv(f"{BUCKET}/{DATASET}/meta/metadata.csv")
df_pairs = pd.read_csv(f"{BUCKET}/{DATASET}/meta/pairs.csv.gz")
To load CSVs directly from S3, make sure to install the s3fs Python module:
pip3 install s3fs[boto3]
These files describe a 1:1 face recognition dataset with the following format:
  • metadata.csv: table containing one record for each image in the dataset, with the following columns:
    • person: name of the person depicted in the image
    • locator: the bucket locator of the image in the shared bucket, e.g. s3://kolena-public-datasets/labeled-faces-in-the-wild/imgs/AJ_Cook/AJ_Cook_0001.jpg
    • width: width of the image, in pixels
    • height: height of the image, in pixels
    • age: estimated age of the person depicted in the image
    • race: estimated race of the person depicted in the image
    • gender: estimated gender of the person depicted in the image
  • pairs.csv.gz: GZipped table containing one record for each image pair in the dataset, with the following columns:
    • locator_a: the bucket locator for the left image in the pair. This image must exist in metadata.csv
    • locator_b: the bucket locator for the right image in the pair. This image must exist in metadata.csv
    • is_same: boolean indicating whether the two images contain the same person or different people
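Since every locator in pairs.csv.gz must also exist in metadata.csv, it can be worth checking this before registering anything. Below is a minimal sketch of such a check on small in-memory tables shaped like the two files described above. The sample rows and the helper name find_unregistered_locators are illustrative assumptions, not part of the dataset or of the kolena-client API.

```python
import pandas as pd


def find_unregistered_locators(df_metadata: pd.DataFrame, df_pairs: pd.DataFrame) -> set:
    """Return locators referenced by the pairs table that have no record in the metadata table."""
    known = set(df_metadata["locator"])
    referenced = set(df_pairs["locator_a"]) | set(df_pairs["locator_b"])
    return referenced - known


# toy tables using the column names described above (hypothetical rows)
sample_metadata = pd.DataFrame(
    {
        "person": ["Person_A", "Person_B"],
        "locator": ["s3://example-bucket/a.jpg", "s3://example-bucket/b.jpg"],
    }
)
sample_pairs = pd.DataFrame(
    {
        "locator_a": ["s3://example-bucket/a.jpg", "s3://example-bucket/a.jpg"],
        "locator_b": ["s3://example-bucket/b.jpg", "s3://example-bucket/c.jpg"],
        "is_same": [False, False],
    }
)

missing = find_unregistered_locators(sample_metadata, sample_pairs)
print(missing)
```

An empty result means every pair refers only to images described in the metadata table. The same function can be called on the df_metadata and df_pairs tables loaded earlier.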

Step 1: Registering Data

Integration begins with the registration of images used to run tests.
Images are not uploaded to Kolena; they live in your shared bucket only. However, to use an image in a test, you must first register it with the platform by providing its locator and a few pieces of additional metadata:
from import TestImages
with TestImages.register() as registrar:
    for record in df_metadata.itertuples():
        tags = dict(age=record.age, race=record.race, gender=record.gender)
        registrar.add(record.locator, DATASET, record.width, record.height, tags=tags)

Step 2: Creating Tests

Now that we've registered data, let's define test cases. A test case is a set of image pairs with a left image (locator_a), a right image (locator_b), and a boolean value (is_same) indicating if the two images contain the same person or different people.
Pairs defined with is_same=True are considered Genuine Pairs. Pairs defined with is_same=False are considered Imposter Pairs.
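To make the distinction concrete, here is a small sketch that splits a pairs table into genuine and imposter pairs with pandas. The rows are made up for illustration, and the column names follow the pairs.csv.gz format described earlier.

```python
import pandas as pd

# hypothetical pairs table with the same columns as pairs.csv.gz
sample_pairs = pd.DataFrame(
    {
        "locator_a": ["a_1.jpg", "a_1.jpg", "b_1.jpg", "c_1.jpg"],
        "locator_b": ["a_2.jpg", "b_1.jpg", "c_1.jpg", "c_2.jpg"],
        "is_same": [True, False, False, True],
    }
)

# genuine pairs depict the same person, imposter pairs depict different people
genuine_pairs = sample_pairs[sample_pairs["is_same"]]
imposter_pairs = sample_pairs[~sample_pairs["is_same"]]

print(len(genuine_pairs), len(imposter_pairs))
```

Checking these two counts early is useful because the number of imposter pairs limits how small a false match rate can be measured reliably.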
Let's define a simple test case containing the entire pairs table in df_pairs:
from import TestCase
test_case = TestCase.create(
    f"complete {DATASET}",
    description=f"complete set of pairs defined in {DATASET}",
)
with test_case.edit() as editor:
    for record in df_pairs.itertuples():
        editor.add(record.locator_a, record.locator_b, record.is_same)
In this tutorial we created only a single simple test case, but more advanced test cases can be generated in a variety of fast and scalable ways, such as using the demographic metadata associated with our images to create test cases broken down by age, race, and gender. See Creating Test Cases for details.
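As a rough illustration of a demographic breakdown, the sketch below groups pairs by a metadata column using only pandas. It assumes a pair belongs to a group only when both of its images share the same value for that column, which is one possible convention and not necessarily the one used in Creating Test Cases. The helper name pairs_by_tag and the sample rows are hypothetical.

```python
import pandas as pd


def pairs_by_tag(df_metadata: pd.DataFrame, df_pairs: pd.DataFrame, tag: str) -> dict:
    """Group pairs by the value of `tag`, keeping only pairs whose two images share that value."""
    tag_by_locator = df_metadata.set_index("locator")[tag]
    tag_a = df_pairs["locator_a"].map(tag_by_locator)
    tag_b = df_pairs["locator_b"].map(tag_by_locator)
    shared = tag_a == tag_b  # pairs with mismatched or missing values are dropped
    tagged = df_pairs[shared].assign(**{tag: tag_a[shared]})
    return {value: group.drop(columns=tag) for value, group in tagged.groupby(tag)}


# hypothetical metadata and pairs tables
sample_metadata = pd.DataFrame(
    {
        "locator": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "gender": ["woman", "woman", "man", "man"],
    }
)
sample_pairs = pd.DataFrame(
    {
        "locator_a": ["a.jpg", "c.jpg", "a.jpg"],
        "locator_b": ["b.jpg", "d.jpg", "c.jpg"],
        "is_same": [True, False, False],
    }
)

groups = pairs_by_tag(sample_metadata, sample_pairs, "gender")
for value, group in groups.items():
    print(value, len(group))
```

Each resulting table could then be added to its own test case with the same TestCase.create and editor.add pattern used for the complete test case above.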
Now that we've defined our test cases, the next step is to create a test suite:
from import TestSuite
test_suite = TestSuite.create(
    f"{DATASET} simple benchmark",
    description=f"simple test suite derived from {DATASET}",
)
with test_suite.edit() as editor:
    editor.add(test_case, is_baseline=True)
The similarity score threshold used to compute metrics is calculated based on target False Match Rates (FMR). A prediction is considered a false match if the computed similarity between two images in an imposter pair is higher than this similarity threshold. This similarity threshold is automatically calculated such that the false match rate on the test suite baseline is no higher than the selected target FMR value.
The default settings compute metrics at target FMR values of 0.1, 0.01, 0.001, 0.0001, 0.00001, and 0.000001. This means performance metrics are available at thresholds corresponding to one false match in ten imposter pairs, one false match in one hundred imposter pairs, all the way up to one false match in one million imposter pairs (provided there are enough examples in the baseline to calculate this value).
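To build intuition for how a threshold follows from a target FMR, here is a minimal NumPy sketch. It is not Kolena's implementation: it simply picks one of the observed imposter scores so that the fraction of imposter scores strictly above it does not exceed the target. The function names and sample scores are assumptions made for illustration.

```python
import numpy as np


def threshold_at_target_fmr(imposter_scores, target_fmr: float) -> float:
    """Pick a similarity threshold whose false match rate does not exceed target_fmr."""
    scores = np.sort(np.asarray(imposter_scores, dtype=float))[::-1]  # highest first
    if scores.size == 0:
        raise ValueError("at least one imposter score is required")
    # maximum number of imposter pairs allowed to score above the threshold
    allowed = int(np.floor(target_fmr * scores.size))
    # at most `allowed` scores are strictly above the (allowed + 1)-th highest score
    return float(scores[min(allowed, scores.size - 1)])


def false_match_rate(imposter_scores, threshold: float) -> float:
    """Fraction of imposter pairs whose similarity is higher than the threshold."""
    return float(np.mean(np.asarray(imposter_scores, dtype=float) > threshold))


# hypothetical similarity scores for ten imposter pairs
sample_imposter_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

threshold = threshold_at_target_fmr(sample_imposter_scores, target_fmr=0.1)
observed_fmr = false_match_rate(sample_imposter_scores, threshold)
print(threshold, observed_fmr)
```

With only ten imposter pairs, any target below 0.1 permits zero false matches, so the sketch falls back to the highest imposter score. This is why a baseline needs enough imposter pairs before the smaller target FMR values become meaningful.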

Step 3: Running Tests

With data registered and tests defined in the platform, the final piece of the integration is to run tests.
Testing for the face recognition 1:1 workflow is divided into two phases:
  1. Extracting embeddings: each image in the test suite is surfaced. During this phase your model detects bounding boxes surrounding faces in each image, estimates landmarks corresponding to the eyes, nose, and mouth of these faces, and extracts embedding vectors used in the next phase to compute similarity scores between images.
  2. Computing similarities: during this phase, the embeddings extracted in the previous phase are used to compute float similarity scores for each image pair defined in the test suite. These similarity scores are used to compute the metrics reported in the platform.
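The two phases can be pictured with the following self-contained sketch, which uses toy stand-ins for the model. It is only a conceptual outline and not how the platform runs tests: the run_two_phases helper, the toy embeddings, and the choice to record None for a pair when an embedding is missing are all assumptions of the sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

import numpy as np

Pair = Tuple[str, str]


def run_two_phases(
    pairs: List[Pair],
    extract: Callable[[str], Optional[np.ndarray]],
    compare: Callable[[np.ndarray, np.ndarray], float],
) -> Dict[Pair, Optional[float]]:
    # Phase 1: extract one embedding per unique image, even if it appears in many pairs
    locators = sorted({locator for pair in pairs for locator in pair})
    embeddings = {locator: extract(locator) for locator in locators}

    # Phase 2: compute a similarity score for each pair from the stored embeddings
    scores: Dict[Pair, Optional[float]] = {}
    for locator_a, locator_b in pairs:
        emb_a, emb_b = embeddings[locator_a], embeddings[locator_b]
        if emb_a is None or emb_b is None:
            scores[(locator_a, locator_b)] = None  # assumption: no score without both embeddings
        else:
            scores[(locator_a, locator_b)] = compare(emb_a, emb_b)
    return scores


# toy stand-ins for a real model (hypothetical two-dimensional embeddings)
toy_embeddings = {
    "img_1": np.array([1.0, 0.0]),
    "img_2": np.array([1.0, 0.0]),
    "img_3": np.array([0.0, 1.0]),
}


def toy_extract(locator: str) -> Optional[np.ndarray]:
    return toy_embeddings.get(locator)  # None simulates a failed extraction


def toy_compare(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    # cosine similarity
    return float(np.inner(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))


toy_pairs = [("img_1", "img_2"), ("img_1", "img_3"), ("img_1", "img_4")]
toy_scores = run_two_phases(toy_pairs, toy_extract, toy_compare)
print(toy_scores)
```

Separating the phases means each image is processed once regardless of how many pairs it belongs to, which matters when the same image appears in many pairs.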
These phases are abstracted away using
Let's define a model using Facenet512 from the deepface Python library:
from typing import Optional
from deepface import DeepFace
from import InferenceModel
import numpy as np
from PIL import Image
import s3fs
s3 = s3fs.S3FileSystem()
model_name = "Facenet512"
extractor = DeepFace.build_model(model_name)
def extract(locator: str) -> Optional[np.ndarray]:
    """Extract an embedding from the provided image representing the depicted face."""
    with as fp:
        img = np.asarray(
    return np.asarray(DeepFace.represent(img, model=extractor, enforce_detection=False))

def compare(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Compare two extracted embedding vectors to produce a float similarity score."""
    # cosine similarity
    return np.inner(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
model_descriptor = f"{model_name} (enforce_detection=False, distance_metric=cosine)"
model_metadata = dict(enforce_detection=False, similarity_metric="cosine", library="deepface")
model = InferenceModel.create(name=model_descriptor, metadata=model_metadata, extract=extract, compare=compare)
Finally, run the test:
from import test
test(model, test_suite)
Our test run is now complete! We can now visit the web platform to analyze and debug our model's performance on this test suite.


In this quickstart tutorial we learned how to create new tests for face recognition (1:1) datasets and how to test FR models on Kolena.
What we learned here just scratches the surface of what's possible with Kolena and covered a fraction of the kolena-client API — now that we're up and running, we can think about ways to create more detailed tests, improve existing tests, and dive deep into model behaviors.