connectomics.metrics

Evaluation metrics for PyTorch Connectomics.

This package provides comprehensive evaluation metrics:
  • metrics_seg.py: Segmentation metrics (Adapted Rand, VOI, instance matching)

  • metrics_skel.py: Skeleton-based metrics for curvilinear structures

Note: PyTorch Lightning handles training monitoring and logging.

Import patterns:

from connectomics.metrics import AdaptedRandError, VariationOfInformation
from connectomics.metrics import evaluate_image_pair
from connectomics.evaluation import evaluate_directory
from connectomics.metrics.segmentation_numpy import adapted_rand, instance_matching

class connectomics.metrics.AdaptedRandError(return_all_stats=False, dist_sync_on_step=False)[source]

Torchmetrics-style wrapper around the numpy-based adapted Rand implementation.

This wrapper lets us accumulate scores during Lightning test_step without manual numpy<->torch conversions in the training loop.

Parameters
  • return_all_stats (bool) – If True, also compute and return precision and recall

  • dist_sync_on_step (bool) – Whether to sync across distributed processes on each step

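Examples

A minimal sketch of how this wrapper is meant to be used inside a Lightning test loop (the SegModel class, predict_labels helper, batch keys, and log name are illustrative assumptions; only AdaptedRandError and the standard torchmetrics update()/compute()/reset() calls come from this page):

>>> import pytorch_lightning as pl
>>> from connectomics.metrics import AdaptedRandError
>>> class SegModel(pl.LightningModule):
...     def __init__(self):
...         super().__init__()
...         self.arand = AdaptedRandError(return_all_stats=False)
...     def test_step(self, batch, batch_idx):
...         preds = self.predict_labels(batch["image"])  # hypothetical helper
...         self.arand.update(preds, batch["label"])     # accumulate per batch
...     def on_test_epoch_end(self):
...         self.log("test/adapted_rand", self.arand.compute())
...         self.arand.reset()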

compute()[source]

Compute the final adapted Rand error from the accumulated state.

This method automatically synchronizes state variables when running in a distributed backend.

Return type

Tensor

update(preds, target)[source]

Accumulate the metric state with a new batch of predictions and targets.

Parameters
  • preds (Tensor) – predicted segmentation labels

  • target (Tensor) – ground-truth segmentation labels

Return type

None

class connectomics.metrics.InstanceAccuracy(thresh=0.5, criterion='iou', dist_sync_on_step=False)[source]

Torchmetrics-style wrapper around instance_matching for instance-level accuracy.

Instance accuracy measures the fraction of correctly detected instances:

accuracy = TP / (TP + FP + FN)

Where:
  • TP (True Positives): Number of GT instances correctly matched to predictions

  • FP (False Positives): Number of predicted instances not matched to GT

  • FN (False Negatives): Number of GT instances not matched to predictions

Matching is based on IoU threshold (default 0.5).

Higher values are better (1.0 = perfect detection).

This wrapper lets us accumulate scores during Lightning test_step without manual numpy<->torch conversions in the training loop.


Parameters
  • thresh (float) – IoU threshold for counting a match (default 0.5)

  • criterion (str) – matching criterion (default 'iou')

  • dist_sync_on_step (bool) – Whether to sync across distributed processes on each step
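Examples

A self-contained toy check of the accuracy formula above (a hedged sketch: it assumes update() accepts integer-valued label tensors, consistent with the instance_matching function this class wraps). Two GT instances with one recovered exactly gives TP=1, FP=0, FN=1, hence accuracy = 1/2:

>>> import torch
>>> from connectomics.metrics import InstanceAccuracy
>>> gt = torch.zeros((64, 64), dtype=torch.int64)
>>> gt[5:15, 5:15] = 1
>>> gt[30:40, 30:40] = 2
>>> pred = torch.zeros_like(gt)
>>> pred[5:15, 5:15] = 1                 # exact match for instance 1 (IoU = 1.0)
>>> metric = InstanceAccuracy(thresh=0.5, criterion='iou')
>>> metric.update(pred, gt)
>>> metric.compute()                     # expected: tensor(0.5)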

compute()[source]

Return instance-level accuracy: TP / (TP + FP + FN).

Return type

Tensor

update(preds, target)[source]

Accumulate true positive, false positive, and false negative counts for a new batch.

Parameters
  • preds (Tensor) – predicted instance label volume

  • target (Tensor) – ground-truth instance label volume

Return type

None

class connectomics.metrics.InstanceAccuracySimple(thresh=0.5, criterion='iou', dist_sync_on_step=False)[source]

Torchmetrics-style wrapper for relaxed instance-level accuracy (NO Hungarian matching).

WARNING: This is a RELAXED metric for debugging/analysis only, NOT for benchmark ranking. Unlike InstanceAccuracy, this does NOT use optimal bipartite matching.

Simple counting approach:
  • Count all (GT, Pred) pairs with IoU >= threshold as TP

  • fp = n_pred - tp

  • fn = n_true - tp

  • accuracy = tp / (tp + fp + fn)

This metric is useful for:
  • Quick debugging and sanity checks

  • Understanding raw overlap statistics

  • Comparing with strict Hungarian-based metrics

Higher values are better (1.0 = perfect detection).

This wrapper lets us accumulate scores during Lightning test_step without manual numpy<->torch conversions in the training loop.


Parameters
  • thresh (float) – IoU threshold for counting a pair as TP (default 0.5)

  • criterion (str) – matching criterion (default 'iou')

  • dist_sync_on_step (bool) – Whether to sync across distributed processes on each step
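Examples

The relaxed counting above is simple enough to reproduce directly. This is an illustrative numpy re-implementation for intuition, not the module's actual code:

>>> import numpy as np
>>> def relaxed_accuracy(y_true, y_pred, thresh=0.5):
...     true_ids = [i for i in np.unique(y_true) if i != 0]
...     pred_ids = [j for j in np.unique(y_pred) if j != 0]
...     tp = 0
...     for i in true_ids:
...         for j in pred_ids:
...             inter = np.sum((y_true == i) & (y_pred == j))
...             union = np.sum((y_true == i) | (y_pred == j))
...             if union > 0 and inter / union >= thresh:
...                 tp += 1              # every qualifying pair counts; no matching
...     fp, fn = len(pred_ids) - tp, len(true_ids) - tp
...     return tp / max(tp + fp + fn, 1)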

compute()[source]

Return relaxed instance-level accuracy: TP / (TP + FP + FN).

Return type

Tensor

compute_f1()[source]

Return instance-level F1: 2*TP / (2*TP + FP + FN).

Return type

Tensor

compute_precision()[source]

Return instance-level precision: TP / (TP + FP).

Return type

Tensor

compute_recall()[source]

Return instance-level recall: TP / (TP + FN).

Return type

Tensor

update(preds, target)[source]

Accumulate relaxed TP/FP/FN counts for a new batch.

Parameters
  • preds (Tensor) – predicted instance label volume

  • target (Tensor) – ground-truth instance label volume

Return type

None

class connectomics.metrics.VariationOfInformation(dist_sync_on_step=False)[source]

Torchmetrics-style wrapper around the numpy-based VOI implementation.

VOI (Variation of Information) measures the information-theoretic distance between two clusterings. It decomposes into:
  • VOI Split (H(X|Y)): Over-segmentation error (false splits)

  • VOI Merge (H(Y|X)): Under-segmentation error (false merges)

Lower values are better (0 = perfect match).

This wrapper lets us accumulate scores during Lightning test_step without manual numpy<->torch conversions in the training loop.


Parameters

dist_sync_on_step (bool) – Whether to sync across distributed processes on each step
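Examples

A minimal standalone sketch (assuming update() accepts integer-valued label tensors, like the other wrappers in this module; the toy labels are arbitrary):

>>> import torch
>>> from connectomics.metrics import VariationOfInformation
>>> gt = torch.randint(0, 5, (32, 32))      # toy ground-truth labels
>>> pred = torch.randint(0, 5, (32, 32))    # toy predicted labels
>>> voi = VariationOfInformation()
>>> voi.update(pred, gt)
>>> voi.compute()           # total VOI = split + merge (0 = perfect)
>>> voi.compute_split()     # over-segmentation component
>>> voi.compute_merge()     # under-segmentation component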

compute()[source]

Return total VOI (split + merge).

Return type

Tensor

compute_merge()[source]

Return VOI merge (under-segmentation error).

Return type

Tensor

compute_split()[source]

Return VOI split (over-segmentation error).

Return type

Tensor

update(preds, target)[source]

Accumulate VOI split and merge terms for a new batch.

Parameters
  • preds (Tensor) – predicted segmentation labels

  • target (Tensor) – ground-truth segmentation labels

Return type

None

connectomics.metrics.adapted_rand(seg, gt, all_stats=False)[source]

Compute Adapted Rand error as defined by the SNEMI3D contest [1]

Formula is given as 1 - the maximal F-score of the Rand index (excluding the zero component of the original labels). Adapted from the SNEMI3D MATLAB script, hence the strange style.

Parameters
  • seg (np.ndarray) – the segmentation to score, where each value is the label at that point

  • gt (np.ndarray) – the ground truth to score against, same shape as seg, where each value is a label

  • all_stats (bool, optional) – whether to also return precision and recall as a 3-tuple with rand_error

Returns
  • are (float) – The adapted Rand error, equal to $1 - \frac{2pr}{p + r}$, where $p$ and $r$ are the precision and recall described below.

  • prec (float, optional) – The adapted Rand precision. (Only returned when all_stats is True.)

  • rec (float, optional) – The adapted Rand recall. (Only returned when all_stats is True.)

[1]: http://brainiac2.mit.edu/SNEMI3D/evaluation
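Examples

A quick sanity check (the toy labels are arbitrary; a segmentation identical to the ground truth must give an error of 0 and precision/recall of 1):

>>> import numpy as np
>>> from connectomics.metrics import adapted_rand
>>> gt = np.zeros((64, 64), dtype=np.uint16)
>>> gt[10:30, 10:30] = 1
>>> gt[40:60, 40:60] = 2
>>> seg = gt.copy()                          # perfect segmentation
>>> adapted_rand(seg, gt)                    # expected: 0.0
>>> adapted_rand(seg, gt, all_stats=True)    # expected: (0.0, 1.0, 1.0)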

connectomics.metrics.evaluate_image_pair(pred, gt, threshold=128, dilation_size=5)[source]

Evaluate single prediction-ground truth pair.

Parameters
  • pred (ndarray) – Prediction mask (0-255 range)

  • gt (ndarray) – Ground truth mask (0-255 range)

  • threshold (int) – Threshold for binarizing prediction. Default: 128

  • dilation_size (int) – Dilation size for skeleton matching. Default: 5

Returns

Tuple of (iou, correctness, completeness, quality) metrics
  • Returns (1.0, 1.0, 1.0, 1.0) if GT is empty

  • All values in range [0.0, 1.0]

Return type

Tuple[float, float, float, float]
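Examples

A hedged sketch with synthetic masks in the documented 0-255 range (the ribbon shapes and values are arbitrary; only the function signature comes from this page):

>>> import numpy as np
>>> from connectomics.metrics import evaluate_image_pair
>>> gt = np.zeros((128, 128), dtype=np.uint8)
>>> gt[60:68, 10:118] = 255                  # thin horizontal structure
>>> pred = np.zeros_like(gt)
>>> pred[60:68, 10:100] = 255                # prediction misses the right end
>>> iou, correctness, completeness, quality = evaluate_image_pair(
...     pred, gt, threshold=128, dilation_size=5)
>>> all(0.0 <= v <= 1.0 for v in (iou, correctness, completeness, quality))
True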

connectomics.metrics.instance_matching(y_true, y_pred, thresh=0.5, criterion='iou', report_matches=False)[source]

Calculate detection/instance segmentation metrics between ground truth and predictions.

Currently, the following metrics are implemented:

‘fp’, ‘tp’, ‘fn’, ‘precision’, ‘recall’, ‘accuracy’, ‘f1’, ‘criterion’, ‘thresh’, ‘n_true’, ‘n_pred’, ‘mean_true_score’, ‘mean_matched_score’, ‘panoptic_quality’

Corresponding objects of y_true and y_pred are counted as true positives (tp), false positives (fp), and false negatives (fn) when their intersection over union (IoU) >= thresh (for the default criterion=’iou’; other criteria can be selected).

  • mean_matched_score is the mean IoU of matched true positives

  • mean_true_score is the mean IoU of matched true positives, but normalized by the total number of GT objects rather than the number of matches

  • panoptic_quality defined as in Eq. 1 of Kirillov et al. “Panoptic Segmentation”, CVPR 2019

Parameters
  • y_true (ndarray) – ground truth label image (integer valued)

  • y_pred (ndarray) – predicted label image (integer valued)

  • thresh (float) – threshold for matching criterion (default 0.5)

  • criterion (string) – matching criterion (default IoU)

  • report_matches (bool) – if True, additionally calculate matched_pairs and matched_scores (returns gt-pred pairs even when scores are below ‘thresh’)

Return type

Matching object with different metrics as attributes

Examples

>>> import numpy as np
>>> y_true = np.zeros((100,100), np.uint16)
>>> y_true[10:20,10:20] = 1
>>> y_pred = np.roll(y_true,5,axis = 0)
>>> stats = instance_matching(y_true, y_pred)
>>> print(stats)
Matching(criterion='iou', thresh=0.5, fp=1, tp=0, fn=1, precision=0,
         recall=0, accuracy=0, f1=0, n_true=1, n_pred=1,
         mean_true_score=0.0, mean_matched_score=0.0, panoptic_quality=0.0)

connectomics.metrics.instance_matching_simple(y_true, y_pred, thresh=0.5, criterion='iou')[source]

Calculate relaxed instance segmentation metrics without Hungarian matching.

WARNING: This is a RELAXED metric for debugging/analysis only, NOT for benchmark ranking. Unlike instance_matching(), this does NOT use optimal bipartite matching (Hungarian algorithm). Instead, it simply counts all (GT, Pred) pairs with IoU >= threshold as true positives.

This metric is useful for:
  • Quick debugging and sanity checks

  • Understanding raw overlap statistics

  • Comparing with strict Hungarian-based metrics

Metrics computed:

‘tp’, ‘fp’, ‘fn’, ‘precision’, ‘recall’, ‘accuracy’, ‘f1’, ‘criterion’, ‘thresh’, ‘n_true’, ‘n_pred’

Parameters
  • y_true (ndarray) – ground truth label image (integer valued)

  • y_pred (ndarray) – predicted label image (integer valued)

  • thresh (float) – threshold for matching criterion (default 0.5)

  • criterion (string) – matching criterion (default ‘iou’)

Return type

Dictionary with metrics (tp, fp, fn, precision, recall, accuracy, f1, etc.)

Examples

>>> import numpy as np
>>> y_true = np.zeros((100,100), np.uint16)
>>> y_true[10:20,10:20] = 1
>>> y_pred = np.roll(y_true, 5, axis=0)
>>> stats = instance_matching_simple(y_true, y_pred)
>>> print(f"Accuracy: {stats['accuracy']:.3f}")