connectomics.data¶
Datasets¶
Dataset module for PyTorch Connectomics.
Provides patch-sampling datasets for volumetric EM data:
- CachedVolumeDataset: loads volumes into RAM, crops with numpy
- LazyZarrVolumeDataset: lazy zarr reads (low memory)
- MonaiFilenameDataset: loads pre-tiled images from JSON
- Multi-dataset wrappers: Weighted, Stratified, Uniform concat
- class connectomics.data.datasets.CachedVolumeDataset(image_paths, label_paths=None, label_aux_paths=None, mask_paths=None, patch_size=(112, 112, 112), iter_num=500, transforms=None, pre_cache_transforms=None, mode='train', pad_size=None, pad_mode='reflect', max_attempts=10, foreground_threshold=0.05, crop_to_nonzero_mask=False, sample_nonzero_mask=False)[source]¶
Cached volume dataset that loads volumes once and crops in memory.
Dramatically speeds up training by:
1. Loading all volumes into memory once during init
2. Performing random crops from cached volumes during iteration
3. Applying augmentations to crops (not full volumes)
- Parameters
image_paths (List[str]) – List of image volume paths.
label_paths (Optional[List[str]]) – List of label volume paths (None entries OK).
mask_paths (Optional[List[str]]) – List of mask volume paths (None entries OK).
patch_size (Tuple[int, ...]) – Size of random crops (z, y, x) or (y, x).
iter_num (int) – Number of iterations per epoch.
transforms (Optional[Compose]) – MONAI transforms applied after cropping.
pre_cache_transforms (Optional[Any]) – One-time transforms applied before caching.
mode (str) – ‘train’ or ‘val’.
pad_size (Optional[Tuple[int, ...]]) – Padding to apply to each spatial dimension.
pad_mode (str) – Padding mode (‘reflect’, ‘constant’, etc.).
max_attempts (int) – Max foreground sampling retries.
foreground_threshold (float) – Min foreground fraction to accept a patch.
crop_to_nonzero_mask (bool) – Constrain crops to intersect mask bounding box.
sample_nonzero_mask (bool) – Center crops on random nonzero mask voxels.
label_aux_paths (Optional[List[str]]) –
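A minimal numpy sketch of the cache-then-crop pattern this class implements (load once, then draw random in-bounds patches per iteration); this is an illustration of the idea, not the actual CachedVolumeDataset implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.random((64, 96, 96), dtype=np.float32)  # cached in RAM once
patch_size = (32, 32, 32)

def random_crop(vol, size, rng):
    """Sample a random patch that lies fully inside the volume."""
    start = [int(rng.integers(0, d - s + 1)) for d, s in zip(vol.shape, size)]
    slices = tuple(slice(st, st + sz) for st, sz in zip(start, size))
    return vol[slices]

patch = random_crop(volume, patch_size, rng)
print(patch.shape)  # (32, 32, 32)
```

Because the full volume stays resident, each iteration costs only a numpy slice plus the per-patch augmentations.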
- class connectomics.data.datasets.LazyH5VolumeDataset(image_paths, label_paths=None, label_aux_paths=None, mask_paths=None, patch_size=(112, 112, 112), iter_num=500, transforms=None, mode='train', max_attempts=10, foreground_threshold=0.0, transpose_axes=None)[source]¶
Lazy HDF5 dataset that samples random crops directly from .h5 files.
Mirrors LazyZarrVolumeDataset but opens HDF5 stores instead of Zarr stores. Paths may point at a file (“vol.h5”), in which case the first dataset in the file is used, or include an explicit dataset key (“vol.h5/main”).
- Parameters
- class connectomics.data.datasets.LazyZarrVolumeDataset(image_paths, label_paths=None, label_aux_paths=None, mask_paths=None, patch_size=(112, 112, 112), iter_num=500, transforms=None, mode='train', max_attempts=10, foreground_threshold=0.0, transpose_axes=None)[source]¶
Lazy zarr dataset that samples random crops directly from zarr stores.
Notes:
- Input image arrays may be 3D or 4D (channel-last or channel-first).
- Label/mask arrays are expected to be 3D (or 4D with singleton channel).
- Output is channel-first: image/label/mask shapes are [C, D, H, W].
- Parameters
- class connectomics.data.datasets.MonaiFilenameDataset(json_path, transforms=None, mode='train', images_key='images', labels_key='masks', base_path_key='base_path', train_val_split=None, random_seed=42, use_labels=True)[source]¶
MONAI dataset for loading individual images from JSON file lists.
JSON format:
{
    "base_path": "/path/to/data",
    "images": ["relative/path/to/image1.png", ...],
    "masks": ["relative/path/to/mask1.png", ...]
}
- Parameters
json_path (str) – Path to JSON file containing file lists.
transforms (Optional[Compose]) – MONAI transforms pipeline.
mode (str) – ‘train’, ‘val’, or ‘test’.
images_key (str) – Key in JSON for image file list.
labels_key (str) – Key in JSON for label file list.
base_path_key (str) – Key in JSON for base path.
train_val_split (Optional[float]) – Fraction for train split (0.0-1.0).
random_seed (int) – Random seed for train/val split.
use_labels (bool) – Whether to load labels.
data – input data to load and transform to generate dataset for model.
transform – a callable, sequence of callables, or None. If transform is not a Compose instance, it will be wrapped in a Compose instance. Sequences of callables are applied in order, and if None is passed, the data is returned as is.
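A stdlib-only sketch of writing the JSON layout this dataset expects (key names match the documented defaults; the paths are placeholders):

```python
import json
import os
import tempfile

# Hypothetical file list in the documented format: base_path plus
# relative image/mask paths under the default keys.
spec = {
    "base_path": "/path/to/data",
    "images": ["imgs/slice_0000.png", "imgs/slice_0001.png"],
    "masks": ["masks/slice_0000.png", "masks/slice_0001.png"],
}
json_path = os.path.join(tempfile.mkdtemp(), "files.json")
with open(json_path, "w") as f:
    json.dump(spec, f, indent=2)

with open(json_path) as f:
    loaded = json.load(f)
print(sorted(loaded))  # ['base_path', 'images', 'masks']
```

The resulting `json_path` is what you would pass as `json_path` to MonaiFilenameDataset.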
- class connectomics.data.datasets.PatchDataset(patch_size, iter_num=500, transforms=None, mode='train', max_attempts=10, foreground_threshold=0.0)[source]¶
Abstract base for datasets that sample random patches from volumes.
- Subclasses must implement:
_crop_volumes(vol_idx, pos) -> dict with “image” and optional “label”/”mask”
_has_labels(vol_idx) -> bool
Subclasses must populate self.volume_sizes during __init__.
- Provides:
__getitem__ with foreground-aware retry loop
set_epoch / get_sampling_fingerprint for validation reseeding
Shared crop position sampling via crop_sampling.py
- Parameters
- class connectomics.data.datasets.StratifiedConcatDataset(datasets, length=None)[source]¶
Concatenate datasets with stratified (round-robin) sampling.
Ensures balanced sampling across datasets by cycling through them. This is useful when you want equal representation from each dataset regardless of their actual sizes.
- Parameters
datasets (List[Dataset]) – List of datasets to concatenate
length (Optional[int]) – Total number of samples per epoch. Default: sum of dataset lengths
Example
>>> from connectomics.data.datasets import StratifiedConcatDataset
>>> dataset1 = Dataset1(size=100)
>>> dataset2 = Dataset2(size=200)
>>> stratified = StratifiedConcatDataset([dataset1, dataset2])
>>> # Will sample: dataset1[0], dataset2[0], dataset1[1], dataset2[1], ...
>>> # Ensures equal representation even though dataset2 is 2x larger
- class connectomics.data.datasets.UniformConcatDataset(datasets, length=None)[source]¶
Concatenate datasets with uniform random sampling.
Samples uniformly from all datasets combined, giving equal probability to each individual sample across all datasets. This is equivalent to WeightedConcatDataset with weights proportional to dataset sizes.
- Parameters
datasets (List[Dataset]) – List of datasets to concatenate
length (Optional[int]) – Total number of samples per epoch. Default: sum of dataset lengths
Example
>>> from connectomics.data.datasets import UniformConcatDataset
>>> dataset1 = Dataset1(size=100)
>>> dataset2 = Dataset2(size=200)
>>> uniform = UniformConcatDataset([dataset1, dataset2])
>>> # Each sample has equal probability (1/300) regardless of source dataset
- class connectomics.data.datasets.WeightedConcatDataset(datasets, weights, length=None)[source]¶
Concatenate multiple datasets and sample from them with specified weights.
Unlike torch.utils.data.ConcatDataset which samples proportionally to dataset sizes, this class samples according to specified weights. This is particularly useful for domain adaptation where you want to control the ratio of synthetic vs. real data regardless of dataset sizes.
- Parameters
Example
>>> from connectomics.data.datasets import WeightedConcatDataset
>>> synthetic_data = SyntheticDataset(size=10000)
>>> real_data = RealDataset(size=1000)
>>> # 80% synthetic, 20% real (regardless of actual sizes)
>>> mixed = WeightedConcatDataset(
...     datasets=[synthetic_data, real_data],
...     weights=[0.8, 0.2],
...     length=5000  # 5000 samples per epoch
... )
>>> # Each batch will be 80% synthetic, 20% real on average
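A numpy sketch of the sampling scheme described above, assuming the wrapper first picks a dataset by weight and then a sample index within it (an illustration of the semantics, not the class's code):

```python
import numpy as np

rng = np.random.default_rng(42)
sizes = [10000, 1000]   # synthetic, real
weights = [0.8, 0.2]    # sampling probabilities, independent of sizes

# Pick a source dataset per draw according to the weights, then a
# uniform sample index within the chosen dataset.
dataset_ids = rng.choice(len(sizes), size=5000, p=weights)
sample_ids = [int(rng.integers(0, sizes[d])) for d in dataset_ids]

frac_synthetic = float((dataset_ids == 0).mean())
print(round(frac_synthetic, 2))  # close to 0.8 on average
```

Note the empirical fraction fluctuates around 0.8; the weights control the expected mix, not each individual batch.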
- connectomics.data.datasets.compute_total_samples(volume_sizes, patch_size, stride)[source]¶
Compute total number of samples across multiple volumes.
- Parameters
- Returns
Tuple of (total_samples, samples_per_volume)
- total_samples: Total number of possible patches across all volumes
- samples_per_volume: List of sample counts per volume
- Return type
Examples
>>> volume_sizes = [(165, 768, 1024)]
>>> patch_size = (112, 112, 112)
>>> stride = (1, 1, 1)
>>> total, per_vol = compute_total_samples(volume_sizes, patch_size, stride)
>>> print(f"Total samples: {total}")
>>> # Total samples: 32,391,414 (54 * 657 * 913)
- connectomics.data.datasets.count_volume(data_size, patch_size, stride)[source]¶
Calculate the number of patches that can be extracted from a volume.
This function computes how many non-overlapping or overlapping patches of a given size can be extracted from a volume using a specified stride.
- Parameters
- Returns
Array of shape (3,) containing the number of patches along each dimension
- Return type
ndarray
Examples
>>> data_size = np.array([165, 768, 1024])
>>> patch_size = np.array([112, 112, 112])
>>> stride = np.array([1, 1, 1])
>>> count = count_volume(data_size, patch_size, stride)
>>> # count = [54, 657, 913] along z, y, x
>>> total_samples = np.prod(count)  # Total possible patches
Note
The formula is: 1 + ceil((data_size - patch_size) / stride). This matches the legacy PyTorch Connectomics v1 implementation.
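The per-axis formula written out in plain Python, checked against the example counts above:

```python
import math

def count_axis(data_size: int, patch_size: int, stride: int) -> int:
    """Patches along one axis: 1 + ceil((data_size - patch_size) / stride)."""
    return 1 + math.ceil((data_size - patch_size) / stride)

counts = [count_axis(d, p, s)
          for d, p, s in zip((165, 768, 1024), (112, 112, 112), (1, 1, 1))]
print(counts)  # [54, 657, 913]
```

With stride 1 this reduces to `data_size - patch_size + 1`, the number of valid start positions per axis.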
- connectomics.data.datasets.create_data_dicts_from_paths(image_paths, label_paths=None, label_aux_paths=None, mask_paths=None)[source]¶
Create MONAI-style data dictionaries from file paths.
- Parameters
- Returns
List of dictionaries with ‘image’, ‘label’, ‘label_aux’, and/or ‘mask’ keys
- Return type
- connectomics.data.datasets.create_filename_datasets(json_path, train_transforms=None, val_transforms=None, train_val_split=0.9, random_seed=42, images_key='images', labels_key='masks', use_labels=True)[source]¶
Create train and val datasets from a single JSON.
- connectomics.data.datasets.crop_volume(volume, size, start, pad_mode='reflect')[source]¶
Crop a subvolume from a volume using numpy slicing.
If the crop extends past volume bounds, pads to the exact requested size.
- Parameters
volume (ndarray) – Input volume (C, D, H, W) or (C, H, W) or without channel dim.
size (Tuple[int, ...]) – Crop size (d, h, w) for 3D or (h, w) for 2D.
start (Tuple[int, ...]) – Start position matching size dimensions.
pad_mode (str) – Padding mode – “reflect” for images, “constant” for labels/masks.
- Returns
Cropped volume with exact requested size.
- Return type
ndarray
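A numpy sketch of the crop-then-pad behavior described above: slice the in-bounds region, then pad to the exact requested size. These are assumed semantics (handling only crops that overrun the upper bounds, with non-negative starts), not the library implementation:

```python
import numpy as np

def crop_with_pad(volume, size, start, pad_mode="reflect"):
    """Crop volume[start : start+size], padding past-the-end overflow."""
    stops = [min(st + sz, dim) for st, sz, dim in zip(start, size, volume.shape)]
    region = volume[tuple(slice(st, sp) for st, sp in zip(start, stops))]
    # Pad each axis at the high end so the output has the exact size.
    pad = [(0, sz - (sp - st)) for st, sp, sz in zip(start, stops, size)]
    return np.pad(region, pad, mode=pad_mode)

vol = np.arange(4 * 5 * 6).reshape(4, 5, 6)
patch = crop_with_pad(vol, size=(3, 3, 3), start=(2, 3, 4))
print(patch.shape)  # (3, 3, 3)
```

As the docstring suggests, `"reflect"` suits intensity images while `"constant"` (zero fill) is the safer choice for labels and masks, where reflected IDs would be misleading.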
- connectomics.data.datasets.split_volume_train_val(volume_shape, train_ratio=0.8, axis=0, min_val_size=None)[source]¶
Split a volume into training and validation regions along a specified axis.
This follows DeepEM’s approach of spatial splitting where:
- First 80% (or specified ratio) of volume is used for training
- Last 20% is used for validation
- Split is along Z-axis by default (axis=0 for [D,H,W] volumes)
- Parameters
- Returns
Tuple of (train_slices, val_slices), where train_slices is a tuple of slices selecting the training region and val_slices is a tuple of slices selecting the validation region.
Example
>>> volume_shape = (100, 256, 256)  # [D, H, W]
>>> train_slices, val_slices = split_volume_train_val(volume_shape, train_ratio=0.8)
>>> # train_slices = (slice(0, 80), slice(None), slice(None))
>>> # val_slices = (slice(80, 100), slice(None), slice(None))
Augmentations¶
MONAI-native augmentation interface for PyTorch Connectomics.
This module provides pure MONAI transforms for connectomics-specific data augmentation, enabling seamless integration with MONAI Compose pipelines.
- class connectomics.data.augmentation.RandAxisPermuted(*args, **kwargs)[source]¶
Randomly permute the three spatial axes of a cubic 3D volume.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
- class connectomics.data.augmentation.RandCopyPasted(*args, **kwargs)[source]¶
Random Copy-Paste — copies transformed objects to non-overlapping regions.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
- class connectomics.data.augmentation.RandCutBlurd(*args, **kwargs)[source]¶
Random CutBlur — downsample+upsample cuboid regions for super-resolution learning.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
- class connectomics.data.augmentation.RandCutNoised(*args, **kwargs)[source]¶
Random cut noise — adds noise to random cuboid regions.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
- class connectomics.data.augmentation.RandMisAlignmentd(*args, **kwargs)[source]¶
Random misalignment augmentation simulating EM section alignment artifacts.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
- class connectomics.data.augmentation.RandMissingPartsd(*args, **kwargs)[source]¶
Random missing parts — creates rectangular holes in sections.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
- class connectomics.data.augmentation.RandMissingSectiond(*args, **kwargs)[source]¶
Random missing section augmentation with paper-style fill values.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
- class connectomics.data.augmentation.RandMixupd(*args, **kwargs)[source]¶
Random Mixup — linear interpolation between batch samples.
Warning: This transform requires a batch dimension (ndim >= 4) and at least 2 samples along that dimension. In standard per-sample MONAI pipelines (where each dict is one sample with ndim=3), this is a no-op. For true cross-sample mixup, use a collate-level or batch-level transform instead.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
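Per the warning above, true mixup needs a batch dimension. A collate-level numpy sketch (a hypothetical helper, not the library transform): blend each sample with a randomly permuted partner using a Beta-distributed coefficient:

```python
import numpy as np

def mixup_batch(x, alpha=0.2, rng=None):
    """Blend each sample in a batch with a shuffled partner."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))     # mixing coefficient in [0, 1]
    perm = rng.permutation(x.shape[0])      # partner assignment
    return lam * x + (1.0 - lam) * x[perm], lam

batch = np.random.default_rng(0).random((4, 1, 8, 8, 8))  # [B, C, D, H, W]
mixed, lam = mixup_batch(batch, rng=np.random.default_rng(1))
print(mixed.shape)  # (4, 1, 8, 8, 8)
```

The same `lam` and permutation would also be applied to the labels when training with mixed targets.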
- class connectomics.data.augmentation.RandMotionBlurd(*args, **kwargs)[source]¶
Legacy name for paper-style out-of-focus Gaussian blur augmentation.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
- class connectomics.data.augmentation.RandRotate90Alld(*args, **kwargs)[source]¶
Apply random quarter-turn rotations over all three 3D plane pairs.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
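A numpy sketch of quarter-turn rotation over the three 3D plane pairs ((0, 1), (0, 2), (1, 2)); the transform above is assumed to pick a plane and a number of quarter turns at random:

```python
import numpy as np

rng = np.random.default_rng(0)
vol = rng.random((16, 16, 16))  # cubic volume, so any plane rotation is shape-safe

planes = [(0, 1), (0, 2), (1, 2)]       # the three spatial plane pairs
plane = planes[int(rng.integers(0, 3))]  # random plane
k = int(rng.integers(0, 4))              # 0-3 quarter turns
rotated = np.rot90(vol, k=k, axes=plane)
print(rotated.shape)  # (16, 16, 16)
```

Cubic patches are what make all three plane pairs interchangeable; on anisotropic patches, rotations out of the xy plane would change the output shape.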
- class connectomics.data.augmentation.RandSliceDropZd(*args, **kwargs)[source]¶
Clearer alias for the legacy z-only missing-section augmentation.
- class connectomics.data.augmentation.RandSliceDropd(*args, **kwargs)[source]¶
BANIS-style per-slice dropping along one sampled spatial axis.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
- class connectomics.data.augmentation.RandSliceShiftZd(*args, **kwargs)[source]¶
Clearer alias for the legacy z-only misalignment augmentation.
- class connectomics.data.augmentation.RandSliceShiftd(*args, **kwargs)[source]¶
BANIS-style independent per-slice in-plane shifts along one sampled axis.
- Parameters
- randomize(_=None)[source]¶
Within this method, self.R should be used, instead of np.random, to introduce random factors. All self.R calls happen here so that we have a better chance of identifying errors in syncing the random state. This method can generate the random factors based on properties of the input data.
- Parameters
_ (Any) –
- Return type
None
- connectomics.data.augmentation.build_test_transforms(cfg, keys=None, mode='test')[source]¶
Build test/tune inference transforms from Hydra config.
Similar to validation transforms but WITHOUT cropping to enable sliding window inference on full volumes.
- connectomics.data.augmentation.build_train_transforms(cfg, keys=None, skip_loading=False)[source]¶
Build training transforms from Hydra config.
I/O¶
I/O utilities for PyTorch Connectomics.
- Organization:
io.py - Format-specific I/O (HDF5, TIFF, PNG, NIfTI)
transforms.py - MONAI-compatible data loading transforms
tiles.py - Tile-based operations for large datasets
utils.py - RGB/seg conversion, mask splitting
- class connectomics.data.io.LoadVolumed(*args, **kwargs)[source]¶
MONAI loader for connectomics volume data.
Loads HDF5, TIFF, PNG, NIfTI files and ensures channel-first format with a channel dimension.
- connectomics.data.io.get_vol_shape(filename, dataset=None)[source]¶
Get volume shape without loading data.
Returns shape consistent with what read_volume would produce: (D, H, W) or (C, D, H, W).
- connectomics.data.io.read_hdf5(filename, dataset=None, slice_obj=None)[source]¶
Read data from HDF5 file.
- connectomics.data.io.read_images(filename_pattern, image_type='image')[source]¶
Read multiple images from a glob pattern.
Returns stacked array with shape (N, H, W) or (N, H, W, C).
- connectomics.data.io.read_volume(filename, dataset=None, drop_channel=False)[source]¶
Load volumetric data (HDF5, TIFF, PNG, NIfTI).
Returns array with shape (D, H, W) or (C, D, H, W).
- connectomics.data.io.rgb_to_seg(rgb)[source]¶
Convert VAST RGB segmentation format to IDs.
Each pixel’s RGB values are combined to create a unique 24-bit segmentation ID.
- Parameters
rgb (ndarray) –
- Return type
ndarray
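A numpy sketch of the 24-bit packing described above; the exact byte order used by rgb_to_seg (here R as the most significant byte) is an assumption:

```python
import numpy as np

def rgb_to_id(rgb):
    """Pack an (..., 3) uint8 RGB array into 24-bit segmentation IDs."""
    r = rgb[..., 0].astype(np.uint32)
    g = rgb[..., 1].astype(np.uint32)
    b = rgb[..., 2].astype(np.uint32)
    return (r << 16) | (g << 8) | b  # id = R*65536 + G*256 + B

pixel = np.array([[[1, 2, 3]]], dtype=np.uint8)
print(rgb_to_id(pixel))  # [[66051]] since 1*65536 + 2*256 + 3 = 66051
```

Casting to uint32 before shifting matters: shifting uint8 values left by 16 would overflow.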
- connectomics.data.io.save_volume(filename, volume, dataset='main', file_format=None)[source]¶
Save volumetric data in specified format.
- connectomics.data.io.volume_exists(filename, dataset=None)[source]¶
Return True when a volume path can be opened by this IO layer.