Dataset

A Dataset[T] is a mapping that allows pipelining of functions in a readable syntax; indexing it returns an example of type T.

from datastream import Dataset

fruits_and_cost = (
    ('apple', 5),
    ('pear', 7),
    ('banana', 14),
    ('kiwi', 100),
)

dataset = (
    Dataset.from_subscriptable(fruits_and_cost)
    .starmap(lambda fruit, cost: (
        fruit,
        cost * 2,
    ))
)

assert dataset[2] == ('banana', 28)

Class Methods

from_subscriptable

from_subscriptable(data: Subscriptable[T]) -> Dataset[T]

Create a Dataset from a subscriptable object, i.e. one that implements __getitem__ and __len__.

Parameters

  • data: Any object that implements __getitem__ and __len__

Returns

  • A new Dataset instance

Notes

Should only be used for simple examples, as a Dataset created with this method does not support methods that require a source dataframe, such as Dataset.split and Dataset.subset.
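
Examples

A minimal example; indexing forwards to the underlying subscriptable:

from datastream import Dataset

dataset = Dataset.from_subscriptable([4, 5, 6])

assert len(dataset) == 3
assert dataset[0] == 4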

from_dataframe

from_dataframe(df: pd.DataFrame) -> Dataset[pd.Series]

Create a Dataset based on a pandas.DataFrame.

Parameters

  • df: Source pandas DataFrame

Returns

  • A new Dataset instance where __getitem__ returns a row from the dataframe

Notes

Dataset.map should be given a function that takes a row from the dataframe as input.

Examples

import pandas as pd
from datastream import Dataset

dataset = (
    Dataset.from_dataframe(pd.DataFrame(dict(
        number=[1, 2, 3]
    )))
    .map(lambda row: row['number'] + 1)
)

assert dataset[-1] == 4

from_paths

from_paths(paths: List[str], pattern: str) -> Dataset[pd.Series]

Create a Dataset from a list of paths, using a regex pattern with named groups to extract information from each path.

Parameters

  • paths: List of file paths
  • pattern: Regex pattern with named groups to extract information from paths

Returns

  • A new Dataset instance where __getitem__ returns a row from the generated dataframe

Notes

Dataset.map should be given a function that takes a row from the dataframe as input.

Examples

from datastream import Dataset

image_paths = ["dataset/damage/1.png"]
dataset = (
    Dataset.from_paths(image_paths, pattern=r".*/(?P<class_name>\w+)/(?P<index>\d+).png")
    .map(lambda row: row["class_name"])
)

assert dataset[-1] == 'damage'

Instance Methods

map

map(self, function: Callable[[T], U]) -> Dataset[U]

Creates a new dataset with the function added to the dataset pipeline.

Parameters

  • function: Function to apply to each example

Returns

  • A new Dataset with the mapping function added to the pipeline

Examples

from datastream import Dataset

dataset = (
    Dataset.from_subscriptable([1, 2, 3])
    .map(lambda number: number + 1)
)

assert dataset[-1] == 4

starmap

starmap(self, function: Callable[..., U]) -> Dataset[U]

Creates a new dataset with the function added to the dataset pipeline.

Parameters

  • function: Function that accepts multiple arguments unpacked from the pipeline output

Returns

  • A new Dataset with the mapping function added to the pipeline

Notes

The dataset's pipeline should return an iterable that will be expanded as arguments to the mapped function.

Examples

from datastream import Dataset

dataset = (
    Dataset.from_subscriptable([1, 2, 3])
    .map(lambda number: (number, number + 1))
    .starmap(lambda number, plus_one: number + plus_one)
)

assert dataset[-1] == 7

subset

subset(self, function: Callable[[pd.DataFrame], pd.Series]) -> Dataset[T]

Select a subset of the dataset using a function that receives the source dataframe as input.

Parameters

  • function: Function that takes a DataFrame and returns a boolean mask

Returns

  • A new Dataset containing only the selected examples

Notes

This method can still be called after other operations such as map, since the selection is made on the source dataframe.

Examples

import pandas as pd
from datastream import Dataset

dataset = (
    Dataset.from_dataframe(pd.DataFrame(dict(
        number=[1, 2, 3]
    )))
    .map(lambda row: row['number'])
    .subset(lambda dataframe: dataframe['number'] <= 2)
)

assert dataset[-1] == 2

split

split(
    self,
    key_column: str,
    proportions: Dict[str, float],
    stratify_column: Optional[str] = None,
    filepath: Optional[str] = None,
    seed: Optional[int] = None,
) -> Dict[str, Dataset[T]]

Split dataset into multiple parts.

Parameters

  • key_column: Column to use as unique identifier for examples
  • proportions: Dictionary mapping split names to proportions
  • stratify_column: Optional column to use for stratification
  • filepath: Optional path to save/load split configuration
  • seed: Optional random seed for reproducibility

Returns

  • Dictionary mapping split names to Dataset instances

Notes

Optionally, you can stratify on a column in the source dataframe or save the split to a json file. If you are sure that the split strategy will not change, then you can safely use a seed instead of a filepath.

Saved splits can continue from an earlier split and adapt to:

  • New examples
  • A changed test size
  • Examples removed from the dataset
  • New stratification

Examples

import numpy as np
import pandas as pd
from datastream import Dataset

split_datasets = (
    Dataset.from_dataframe(pd.DataFrame(dict(
        index=np.arange(100),
        number=np.arange(100),
    )))
    .map(lambda row: row['number'])
    .split(
        key_column='index',
        proportions=dict(train=0.8, test=0.2),
        seed=700,
    )
)
assert len(split_datasets['train']) == 80
assert split_datasets['test'][0] == 3
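
A sketch of the stratified, file-backed variant described in the notes, assuming filepath points to a json file that is created on the first run and continued from on later runs:

import numpy as np
import pandas as pd
from datastream import Dataset

split_datasets = (
    Dataset.from_dataframe(pd.DataFrame(dict(
        index=np.arange(100),
        label=np.arange(100) % 2,
    )))
    .map(lambda row: row['label'])
    .split(
        key_column='index',
        proportions=dict(train=0.8, test=0.2),
        stratify_column='label',
        filepath='split.json',  # assumed: split saved here and reused on later runs
    )
)

assert len(split_datasets['train']) + len(split_datasets['test']) == 100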

zip_index

zip_index(self) -> Dataset[Tuple[T, int]]

Zip the output with its underlying Dataset index.

Returns

  • A new Dataset where each example is a tuple of (output, index)

Examples

from datastream import Dataset

dataset = Dataset.from_subscriptable([4, 5, 6]).zip_index()
assert dataset[0] == (4, 0)

cache

cache(self, key_column: str) -> Dataset[T]

Cache the intermediate step in memory, keyed on a column from the source dataframe.

Parameters

  • key_column: Column to use as cache key

Returns

  • A new Dataset with caching enabled

Examples

import pandas as pd
from datastream import Dataset

df = pd.DataFrame({'key': ['a', 'b'], 'value': [1, 2]})
dataset = Dataset.from_dataframe(df).cache('key')
assert dataset[0]['value'] == 1
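
A sketch of the intended use: caching the output of an assumed expensive mapping step, keyed on a column from the source dataframe (expensive_transform is a hypothetical stand-in):

import pandas as pd
from datastream import Dataset

def expensive_transform(row):
    # hypothetical stand-in for a slow preprocessing step
    return row['value'] * 10

df = pd.DataFrame({'key': ['a', 'b'], 'value': [1, 2]})
dataset = (
    Dataset.from_dataframe(df)
    .map(expensive_transform)
    .cache('key')  # assumed: repeated access to the same key reuses the cached output
)

assert dataset[0] == 10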

Static Methods

concat

concat(datasets: List[Dataset[T]]) -> Dataset[T]

Concatenate multiple datasets into one.

Parameters

  • datasets: List of datasets to concatenate

Returns

  • A new Dataset combining all input datasets

Notes

If you have multiple data sources, consider using Datastream.merge instead, as it allows you to control the number of samples drawn from each source in the training batches.

Examples

from datastream import Dataset

dataset1 = Dataset.from_subscriptable([1, 2])
dataset2 = Dataset.from_subscriptable([3, 4])
combined = Dataset.concat([dataset1, dataset2])
assert len(combined) == 4
assert combined[2] == 3

combine

combine(datasets: List[Dataset]) -> Dataset[Tuple]

Zip multiple datasets together so that all combinations of examples are possible.

Parameters

  • datasets: List of datasets to combine

Returns

  • A new Dataset yielding tuples of all possible combinations

Notes

Creates tuples like (example1, example2, ...) for all possible combinations (i.e. the cartesian product).

Examples

from datastream import Dataset

dataset1 = Dataset.from_subscriptable([1, 2])
dataset2 = Dataset.from_subscriptable([3, 4])
combined = Dataset.combine([dataset1, dataset2])
assert len(combined) == 4  # 2 * 2 = 4 combinations
assert combined[0] == (1, 3)  # First combination