Skip to content

Datasets

The dataset_keyword field in your config selects the dataset and automatically wires up the correct data loader, image transformer, and neural network architecture.

Quick start

Pick a keyword from the tables below, drop it into "dataset_keyword" in your config, and InteFL handles the rest — partitioning, transforms, and model selection.


Unified CNN architecture

All image datasets (except CIFAR-100) use the MedMNISTCNN architecture — a configurable CNN with dataset-specific input dimensions, channel counts, and layer widths. The model registry in intellifl/network_models/__init__.py maps each dataset_keyword to the correct constructor parameters. CIFAR-100 uses DynamicCNN, which is configured from HuggingFace dataset metadata.

Image datasets

FEMNIST

Federated version of the EMNIST handwritten character dataset.

Keyword Partitioning Classes Network
femnist_iid IID (reduced) 10 MedMNISTCNN (28×28 grayscale)
femnist_niid Non-IID (full) 62 MedMNISTCNN (28×28 grayscale)

FLAIR

Federated Learning Annotated Image Recognition dataset (satellite imagery).

Keyword Classes Network
flair 2 MedMNISTCNN (256×256 RGB)

ITS (Intelligent Transportation Systems)

Traffic scene image classification.

Keyword Classes Network
its 10 MedMNISTCNN (224×224 RGB)

CIFAR-100

100 classes of 32×32 RGB images (fine-grained classification). Downloaded automatically from HuggingFace Hub (uoft-cs/cifar100).

Keyword Classes Network Source
cifar100 100 (fine labels) DynamicCNN HuggingFace Hub

Lung Cancer Photos

Chest CT scan images for lung cancer classification.

Keyword Classes Network
lung_photos 4 MedMNISTCNN (224×224 grayscale)

MedMNIST

A collection of standardised biomedical image classification benchmarks. All MedMNIST datasets use the MedMNISTCNN architecture with dataset-specific configurations.

Keyword Modality Classes Network
bloodmnist Blood cell microscopy (RGB) 8 MedMNISTCNN
breastmnist Ultrasound (grayscale) 2 MedMNISTCNN
dermamnist Dermatoscopy (RGB) 7 MedMNISTCNN
octmnist Retinal OCT (grayscale) 4 MedMNISTCNN
organamnist Abdominal CT — axial (grayscale) 11 MedMNISTCNN
organcmnist Abdominal CT — coronal (grayscale) 11 MedMNISTCNN
organsmnist Abdominal CT — sagittal (grayscale) 11 MedMNISTCNN
pathmnist Colon pathology (RGB) 9 MedMNISTCNN
pneumoniamnist Chest X-ray (grayscale) 2 MedMNISTCNN
retinamnist Fundus photography (RGB) 5 MedMNISTCNN
tissuemnist Kidney cortex microscopy (grayscale) 8 MedMNISTCNN

Text datasets

Text datasets use a BERT-family transformer backbone (configured via llm_model).

MedQuAD

Medical question-answer pairs for masked language modelling.

Keyword Task Vocabulary domain
medquad MLM medical

HuggingFace text datasets

These are downloaded automatically from the HuggingFace Hub.

Keyword HF path Task Vocabulary domain
financial_phrasebank gtfintechlab/financial_phrasebank_sentences_allagree Classification financial
lexglue coastalcph/lex_glue (LEDGAR subset) Classification legal
pubmed_classification_20k ml4pubmed/pubmed-classification-20k Classification medical
medal cyrilzakka/pubmed-medline MLM medical

HuggingFace cache

Downloaded datasets are cached under ./cache/huggingface. Set the HF_HOME environment variable to change the cache location.


Dataset partitioning

Each dataset is split into num_of_clients partitions, one per virtual client. The training_subset_fraction field controls what fraction of each partition is used for training (the rest is used for validation).

Partitioning strategies

HuggingFace and custom text datasets support configurable partitioning via the partitioning_strategy and partitioning_params config fields.

Strategy Description Parameters
iid Balanced, shuffled, even distribution across clients.
dirichlet Heterogeneous (non-IID) distribution using a Dirichlet prior. alpha (default 0.5; lower = more heterogeneous, higher = more uniform)
pathological Extreme non-IID — each client receives only K classes. num_classes_per_partition (default 2)

Example:

Dirichlet partitioning config
{
  "dataset_keyword": "uoft-cs/cifar10",
  "partitioning_strategy": "dirichlet",
  "partitioning_params": {
    "alpha": 0.5
  }
}

Dataset source location

The mapping between dataset_keyword and its local directory is defined in config/dataset_keyword_to_dataset_dir.json. HuggingFace datasets (dataset_source: "huggingface") are downloaded automatically; local datasets must be placed in the corresponding directory first.

Adding a new dataset

To add a custom dataset, create a directory under datasets/, add an entry to config/dataset_keyword_to_dataset_dir.json, and implement a matching dataset handler in intellifl/dataset_handlers/.