Dataset
A Dataset[T]
is a mapping that allows pipelining of functions in a readable syntax returning an example of type T
.
from datastream import Dataset
fruits_and_cost = (
('apple', 5),
('pear', 7),
('banana', 14),
('kiwi', 100),
)
dataset = (
Dataset.from_subscriptable(fruits_and_cost)
.starmap(lambda fruit, cost: (
fruit,
cost * 2,
))
)
assert dataset[2] == ('banana', 28)
Class Methods
from_subscriptable
Create Dataset
based on subscriptable i.e. implements __getitem__
and __len__
.
Parameters
data
: Any object that implements__getitem__
and__len__
Returns
- A new Dataset instance
Notes
Should only be used for simple examples as a Dataset
created with this method does not support methods that require a source dataframe like Dataset.split
and Dataset.subset
.
from_dataframe
Create Dataset
based on pandas.DataFrame
.
Parameters
df
: Source pandas DataFrame
Returns
- A new Dataset instance where
__getitem__
returns a row from the dataframe
Notes
Dataset.map
should be given a function that takes a row from the dataframe as input.
Examples
import pandas as pd
from datastream import Dataset
dataset = (
Dataset.from_dataframe(pd.DataFrame(dict(
number=[1, 2, 3]
)))
.map(lambda row: row['number'] + 1)
)
assert dataset[-1] == 4
from_paths
Create Dataset
from paths using regex pattern that extracts information from the path itself.
Parameters
paths
: List of file pathspattern
: Regex pattern with named groups to extract information from paths
Returns
- A new Dataset instance where
__getitem__
returns a row from the generated dataframe
Notes
Dataset.map
should be given a function that takes a row from the dataframe as input.
Examples
from datastream import Dataset
image_paths = ["dataset/damage/1.png"]
dataset = (
Dataset.from_paths(image_paths, pattern=r".*/(?P<class_name>\w+)/(?P<index>\d+).png")
.map(lambda row: row["class_name"])
)
assert dataset[-1] == 'damage'
Instance Methods
map
Creates a new dataset with the function added to the dataset pipeline.
Parameters
function
: Function to apply to each example
Returns
- A new Dataset with the mapping function added to the pipeline
Examples
from datastream import Dataset
dataset = (
Dataset.from_subscriptable([1, 2, 3])
.map(lambda number: number + 1)
)
assert dataset[-1] == 4
starmap
Creates a new dataset with the function added to the dataset pipeline.
Parameters
function
: Function that accepts multiple arguments unpacked from the pipeline output
Returns
- A new Dataset with the mapping function added to the pipeline
Notes
The dataset's pipeline should return an iterable that will be expanded as arguments to the mapped function.
Examples
from datastream import Dataset
dataset = (
Dataset.from_subscriptable([1, 2, 3])
.map(lambda number: (number, number + 1))
.starmap(lambda number, plus_one: number + plus_one)
)
assert dataset[-1] == 7
subset
Select a subset of the dataset using a function that receives the source dataframe as input.
Parameters
function
: Function that takes a DataFrame and returns a boolean mask
Returns
- A new Dataset containing only the selected examples
Notes
This function can still be called after multiple operations such as mapping functions as it uses the source dataframe.
Examples
import pandas as pd
from datastream import Dataset
dataset = (
Dataset.from_dataframe(pd.DataFrame(dict(
number=[1, 2, 3]
)))
.map(lambda row: row['number'])
.subset(lambda dataframe: dataframe['number'] <= 2)
)
assert dataset[-1] == 2
split
split(
self,
key_column: str,
proportions: Dict[str, float],
stratify_column: Optional[str] = None,
filepath: Optional[str] = None,
seed: Optional[int] = None,
) -> Dict[str, Dataset[T]]
Split dataset into multiple parts.
Parameters
key_column
: Column to use as unique identifier for examplesproportions
: Dictionary mapping split names to proportionsstratify_column
: Optional column to use for stratificationfilepath
: Optional path to save/load split configurationseed
: Optional random seed for reproducibility
Returns
- Dictionary mapping split names to Dataset instances
Notes
Optionally you can stratify on a column in the source dataframe or save the split to a json file. If you are sure that the split strategy will not change then you can safely use a seed instead of a filepath.
Saved splits can continue from the old split and handle:
- New examples
- Changing test size
- Adapt after removing examples from dataset
- Adapt to new stratification
Examples
import numpy as np
import pandas as pd
from datastream import Dataset
split_datasets = (
Dataset.from_dataframe(pd.DataFrame(dict(
index=np.arange(100),
number=np.arange(100),
)))
.map(lambda row: row['number'])
.split(
key_column='index',
proportions=dict(train=0.8, test=0.2),
seed=700,
)
)
assert len(split_datasets['train']) == 80
assert split_datasets['test'][0] == 3
zip_index
Zip the output with its underlying Dataset index.
Returns
- A new Dataset where each example is a tuple of
(output, index)
Examples
from datastream import Dataset
dataset = Dataset.from_subscriptable([4, 5, 6]).zip_index()
assert dataset[0] == (4, 0)
cache
Cache intermediate step in-memory based on key column.
Parameters
key_column
: Column to use as cache key
Returns
- A new Dataset with caching enabled
Examples
import pandas as pd
from datastream import Dataset
df = pd.DataFrame({'key': ['a', 'b'], 'value': [1, 2]})
dataset = Dataset.from_dataframe(df).cache('key')
assert dataset[0]['value'] == 1
concat
Concatenate multiple datasets together.
Parameters
datasets
: List of datasets to concatenate
Returns
- A new Dataset combining all input datasets
Notes
Consider using Datastream.merge
if you have multiple data sources instead as it allows you to control the number of samples from each source in the training batches.
Examples
from datastream import Dataset
dataset1 = Dataset.from_subscriptable([1, 2])
dataset2 = Dataset.from_subscriptable([3, 4])
combined = Dataset.concat([dataset1, dataset2])
assert len(combined) == 4
assert combined[2] == 3
combine
Zip multiple datasets together so that all combinations of examples are possible.
Parameters
datasets
: List of datasets to combine
Returns
- A new Dataset yielding tuples of all possible combinations
Notes
Creates tuples like (example1, example2, ...)
for all possible combinations (i.e. the cartesian product).