Data Loading
============

Data Augmentation
-----------------

PyTorch Connectomics uses MONAI dictionary transforms for augmentation. The
common path is to configure augmentations in YAML and let the Lightning data
factory build the transform pipeline:

.. code-block:: python

    from connectomics.config import load_config
    from connectomics.data.augmentation import build_train_transforms

    cfg = load_config("tutorials/minimal.yaml")
    transforms = build_train_transforms(cfg, keys=["image", "label"], skip_loading=True)

    # image and label are in-memory arrays keyed by the standard dictionary names.
    sample = {"image": image, "label": label}
    augmented = transforms(sample)

For custom pipelines, compose MONAI transforms with the connectomics-specific
``*d`` dictionary transforms:

.. code-block:: python

    from monai.transforms import Compose, RandFlipd

    from connectomics.data.augmentation import RandCutBlurd, RandMisAlignmentd

    transforms = Compose([
        RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
        RandMisAlignmentd(keys=["image", "label"], prob=0.5, displacement=16),
        RandCutBlurd(keys=["image"], prob=0.7, length_ratio=0.6),
    ])

    sample = {"image": image, "label": label}
    augmented = transforms(sample)

The standard keys are ``image``, ``label``, ``label_aux``, and ``mask``.
Spatial transforms that receive multiple keys sample one random transform and
apply it consistently to every specified key.

Augmentations are configured under ``data.augmentation``:

.. code-block:: yaml

    default:
      data:
        augmentation:
          profile: aug_standard
          misalignment:
            enabled: true
            prob: 0.5
            displacement: 16

Each transform has an ``enabled`` flag. To turn off a specific transform, set:

.. code-block:: yaml

    default:
      data:
        augmentation:
          misalignment:
            enabled: false

Rejection Sampling
------------------

Rejection sampling in the dataloader is applied for the following two purposes:

**1 - Adding more attention to sparse targets**

For some datasets/tasks, the foreground mask is sparse in the volume (*e.g.*,
`synapse detection <../tutorials/synapse/index.html>`_). Therefore we perform
rejection sampling to decrease the ratio of (or completely avoid) regions
without foreground pixels. Such a design lets the model pay more attention to
the foreground pixels to alleviate false negatives (but may introduce more
false positives). Configure rejection sampling under ``data.dataloader``:

.. code-block:: yaml

    default:
      data:
        dataloader:
          reject_sampling:
            size_thres: 1000
            p: 0.95

The ``size_thres: 1000`` key-value pair means that if a randomly sampled
volume contains more than 1,000 non-background voxels, the volume is
considered a foreground volume and is returned by the rejection sampling
function. If it contains fewer than 1,000 such voxels, the function rejects it
with probability ``p: 0.95`` and samples another volume. ``size_thres``
defaults to -1, which disables rejection sampling.

**2 - Handling partially annotated data**

Some datasets are only partially labeled, and the unlabeled regions should not
contribute to the loss calculation. In that case, the user can specify the
path to the valid mask using ``data.train.mask`` and ``data.val.mask``. The
valid mask volume should have the same shape as the label volume, with
non-zero values denoting annotated regions. A sampled volume with a valid
ratio of less than 0.5 will be rejected by default.
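For example, a minimal sketch of the mask configuration, using the
``data.train.mask`` and ``data.val.mask`` fields described above (the HDF5
file names are placeholders):

.. code-block:: yaml

    default:
      data:
        train:
          mask: datasets/train_valid_mask.h5  # placeholder path
        val:
          mask: datasets/val_valid_mask.h5    # placeholder path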
Filename and Lazy Datasets
--------------------------

The old ``TileDataset`` path has been removed. Large datasets now use one of
the current dataset implementations exported from
:mod:`connectomics.data.datasets`:

- :class:`connectomics.data.datasets.CachedVolumeDataset` for volumes that fit
  in RAM.
- :class:`connectomics.data.datasets.LazyH5VolumeDataset` and
  :class:`connectomics.data.datasets.LazyZarrVolumeDataset` for crop-on-read
  HDF5/Zarr training without preloading the full volume.
- :class:`connectomics.data.datasets.MonaiFilenameDataset` for pre-tiled
  PNG/TIFF-style file lists.

For filename-based datasets, prepare a JSON file with image and label paths:

.. code-block:: python

    import json
    from pathlib import Path

    root = Path("path/to/dataset")
    n_images = 2000

    # Image and mask paths are stored relative to base_path.
    data_dict = {
        "base_path": str(root),
        "images": [f"images/im{idx:04d}.png" for idx in range(n_images)],
        "masks": [f"labels/seg{idx:04d}.png" for idx in range(n_images)],
    }

    js_path = "filename_dataset.json"
    with open(js_path, "w") as fp:
        json.dump(data_dict, fp)

Then select the filename dataset in the Hydra config:

.. code-block:: yaml

    default:
      data:
        train:
          dataset_type: filename
          json: filename_dataset.json
          image_key: images
          label_key: masks
          split_ratio: 0.9

For large HDF5 or Zarr volumes, prefer lazy crop-on-read instead of file
tiling:

.. code-block:: yaml

    default:
      data:
        dataloader:
          use_lazy_h5: true  # or: use_lazy_zarr: true
          patch_size: [128, 128, 128]

The Lightning data factory chooses the concrete dataset from these config
fields:

.. code-block:: python

    from connectomics.config import load_config
    from connectomics.training.lightning import create_datamodule

    cfg = load_config("tutorials/minimal.yaml")
    datamodule = create_datamodule(cfg)

Handling 2D Data
----------------

We provide two ways to run inference with a trained 2D model. The first is to
load a 3D volume directly; the inference pipeline then predicts each slice
one-by-one and stacks the predictions back into a 3D volume. For
representations that depend on the dimensionality of the inputs (*e.g.*, an
affinity map has three channels for 3D masks but only two channels for 2D
masks), the number of output channels is consistent with the 2D model.

The second way is to load 2D PNG or TIFF images directly. Below are the
configurations for streaming 2D inputs at inference time:

.. code-block:: yaml

    test:
      data:
        test:
          dataset_type: filename
          json: datasets/test_files.json
        dataloader:
          patch_size: [1, 256, 256]

The filename JSON should list every input image:

.. code-block:: json

    {
        "base_path": "/data/test",
        "images": [
            "slice_0001.png",
            "slice_0002.png",
            "slice_0003.png",
            "slice_0004.png"
        ]
    }

A useful Linux command to list the PNG images in a folder (with absolute
paths) is:

.. code-block:: console

    ls -d $(pwd -P)/*.png > path.txt
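Equivalently, a short Python sketch can collect the slice names and write the
filename JSON shown above; the folder and output paths are placeholders:

.. code-block:: python

    import json
    from pathlib import Path

    # Placeholder test folder; adjust to your data location.
    root = Path("/data/test")

    # Collect slice file names relative to base_path, sorted for a stable order.
    images = sorted(p.name for p in root.glob("*.png"))

    data_dict = {"base_path": str(root), "images": images}
    with open("datasets/test_files.json", "w") as fp:
        json.dump(data_dict, fp, indent=2)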