pytorch-datastream

Simple dataset to dataloader library for pytorch.

Quick Example

from datastream import Dataset, Datastream

dataset = (
    Dataset.from_subscriptable([1, 2, 3])
    .map(lambda number: number + 1)
)

assert dataset[-1] == 4

data_loader = (
    Datastream(dataset)
    .data_loader(batch_size=16, n_batches_per_epoch=100)
)

assert len(next(iter(data_loader))) == 16

Features

Simple, readable dataset pipeline creation
Built-in support for:
Imbalanced datasets
Oversampling / stratification
Weighted sampling
Easy conversion to PyTorch DataLoader
Testable examples in documentation
Type hints and Pydantic validation
Clean, maintainable codebase

Installation

Install with poetry:

poetry add pytorch-datastream

Or with pip:

pip install pytorch-datastream

Next Steps

Check out the Getting Started guide
See the Dataset and Datastream API references