# Dataset
Fondant helps you build datasets by providing a set of operations to load, transform, and write data. With Fondant, you can use both reusable components and custom components, and chain them to create datasets.
## Load a Fondant dataset
You can initialise a dataset from a previous run by using the `read` method.

View a detailed reference of the `Dataset.read()` method:
Read a dataset from a manifest file.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
manifest_path | str | The path to the manifest file. | required |
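Conceptually, `Dataset.read()` picks up where a previous run left off: the manifest records where each field of that run was written. The sketch below uses a simplified, hypothetical manifest layout (the real Fondant manifest schema differs) purely to illustrate what the method resolves:

```python
import json
import os
import tempfile

# Hypothetical, simplified manifest layout (the real Fondant manifest
# schema differs): it maps each field to the location written by a
# previous run.
manifest = {
    "run_id": "my_dataset-run-1",
    "fields": {"text": {"location": "/data/my_dataset/text"}},
}

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "manifest.json")
    with open(manifest_path, "w") as f:
        json.dump(manifest, f)

    # Dataset.read(manifest_path) starts from such a file instead of
    # re-running the components that produced the data.
    with open(manifest_path) as f:
        loaded = json.load(f)

print(loaded["fields"]["text"]["location"])
```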
## Build a dataset
Start by creating a `dataset.py` file and adding the following code:

```python
import pyarrow as pa

from fondant.dataset import Dataset

dataset = Dataset.create(
    "load_from_parquet",
    arguments={
        "dataset_uri": "path/to/dataset",
        "n_rows_to_load": 100,
    },
    produces={
        "text": pa.string()
    },
    dataset_name="my_dataset"
)
```
This code initializes a `Dataset` instance with a load component, which reads data from an external source into the dataset.

View a detailed reference of the `Dataset.create()` method:
Read data using the provided component.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
ref | Any | The name of a reusable component, or the path to the directory containing a containerized component, or a lightweight component class. | required |
produces | Optional[Dict[str, Union[str, DataType]]] | A mapping to update the fields produced by the operation as defined in the component spec. The keys are the names of the fields to be produced by the component, while the values are the type of the field, or the name of the field to map from the dataset. | None |
arguments | Optional[Dict[str, Any]] | A dictionary containing the argument name and value for the operation. | None |
input_partition_rows | Optional[Union[int, str]] | The number of rows to load per partition. Set to override the automatic partitioning. | None |
resources | Optional[Resources] | The resources to assign to the operation. | None |
cache | Optional[bool] | Set to False to disable caching, True by default. | True |
dataset_name | Optional[str] | The name of the dataset. | None |

Returns:

Type | Description |
---|---|
Dataset | An intermediate dataset. |
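The `produces` mapping either renames an output field or pins its type. The renaming behaviour can be pictured with plain dicts (an illustrative sketch with a hypothetical component spec, not Fondant internals):

```python
# Default output field names from a hypothetical component spec.
spec_fields = {"text": "text", "width": "width"}

# User-supplied produces mapping: rename 'width' in the output dataset.
produces = {"width": "custom_width"}

# Fields not present in the mapping keep their default names.
output_names = {field: produces.get(field, field) for field in spec_fields}
print(output_names)  # {'text': 'text', 'width': 'custom_width'}
```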
The `create` method does not execute your component yet, but adds it to the execution graph. It returns a lazy `Dataset` instance which you can use to chain transform components.
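The laziness can be pictured with a minimal stand-in class (not Fondant's real internals) that only records operations until something materializes them:

```python
# Minimal sketch of the lazy-graph idea: each apply() only records the
# operation; nothing runs until materialize() is called.
class LazyDataset:
    def __init__(self, graph=None):
        self.graph = graph or []

    def apply(self, name):
        # Return a new lazy instance with the operation appended.
        return LazyDataset(self.graph + [name])

    def materialize(self):
        # Stand-in for actually running the graph with a runner.
        return " -> ".join(self.graph)

ds = LazyDataset().apply("load_from_parquet").apply("embed_text")
print(ds.materialize())  # load_from_parquet -> embed_text
```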
## Adding transform components
```python
from fondant.dataset import Resources

dataset = dataset.apply(
    "embed_text",
    resources=Resources(
        accelerator_number=1,
        accelerator_name="GPU",
    )
)
```
The `apply` method also returns a lazy `Dataset` which you can use to chain additional components.

The `apply` method also provides additional configuration options for how to execute the component. For instance, you can provide a `Resources` definition to specify the hardware it should run on. In this case, we want to leverage a GPU to run our embedding model. Depending on the runner, you can choose the type of GPU as well.
View a detailed reference of the `Dataset.apply()` method:
Apply the provided component on the dataset.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
ref | Any | The name of a reusable component, or the path to the directory containing a custom component, or a lightweight component class. | required |
workspace | | The workspace to operate in. | required |
consumes | Optional[Dict[str, Union[str, DataType]]] | A mapping to update the fields consumed by the operation as defined in the component spec. The keys are the names of the fields to be received by the component, while the values are the type of the field, or the name of the field to map from the input dataset. For example, if a component spec expects 'text' and 'image' fields, a consumes mapping of {"text": "custom_text"} sources the 'text' field from the dataset's 'custom_text' field, while 'image' is still sourced from the 'image' field by default, since it is not specified in the custom mapping. | None |
produces | Optional[Dict[str, Union[str, DataType]]] | A mapping to update the fields produced by the operation as defined in the component spec. The keys are the names of the fields to be produced by the component, while the values are the type of the field, or the name that should be used to write the field to the output dataset. For example, a produces mapping of {"width": "custom_width"} stores the 'width' field under the name 'custom_width' in the output dataset, while 'text' keeps its name since it is not specified in the custom mapping. Alternatively, the values can define the data type of the output data, in which case the field is produced with the specified type. | None |
arguments | Optional[Dict[str, Any]] | A dictionary containing the argument name and value for the operation. | None |
input_partition_rows | Optional[Union[int, str]] | The number of rows to load per partition. Set to override the automatic partitioning. | None |
resources | Optional[Resources] | The resources to assign to the operation. | None |
cache | Optional[bool] | Set to False to disable caching, True by default. | True |

Returns:

Type | Description |
---|---|
Dataset | An intermediate dataset. |
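The `consumes` override described above can likewise be pictured with plain dicts (an illustrative sketch with a hypothetical component spec, not the Fondant implementation):

```python
# Default field sources from a hypothetical component spec: each consumed
# field is read from the dataset field of the same name.
spec_consumes = {"text": "text", "image": "image"}

# User-supplied consumes mapping: source 'text' from 'custom_text' instead.
custom = {"text": "custom_text"}

# Fields absent from the custom mapping keep their defaults.
resolved = {**spec_consumes, **custom}
print(resolved)  # {'text': 'custom_text', 'image': 'image'}
```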
## Adding a write component
The final step is to write our data to its destination.

```python
dataset = dataset.write(
    "write_to_hf_hub",
    arguments={
        "username": "user",
        "dataset_name": "dataset",
        "hf_token": "xxx",
    }
)
```
View a detailed reference of the `Dataset.write()` method:
Write the dataset using the provided component.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
ref | Any | The name of a reusable component, or the path to the directory containing a custom component, or a lightweight component class. | required |
workspace | | The workspace to operate in. | required |
consumes | Optional[Dict[str, Union[str, DataType]]] | A mapping to update the fields consumed by the operation as defined in the component spec. The keys are the names of the fields to be received by the component, while the values are the type of the field, or the name of the field to map from the input dataset. | None |
arguments | Optional[Dict[str, Any]] | A dictionary containing the argument name and value for the operation. | None |
input_partition_rows | Optional[Union[int, str]] | The number of rows to load per partition. Set to override the automatic partitioning. | None |
resources | Optional[Resources] | The resources to assign to the operation. | None |
cache | Optional[bool] | Set to False to disable caching, True by default. | True |

Returns:

Type | Description |
---|---|
Dataset | An intermediate dataset. |
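`input_partition_rows` appears on `create`, `apply`, and `write` alike; it caps how many rows a component processes per partition. A rough illustration of the chunking idea (Fondant delegates the real partitioning to its execution backend):

```python
def partition(rows, rows_per_partition):
    """Split rows into consecutive chunks of at most rows_per_partition."""
    return [
        rows[i:i + rows_per_partition]
        for i in range(0, len(rows), rows_per_partition)
    ]

# Five rows with at most two rows per partition yields three partitions.
print(partition(list(range(5)), 2))  # [[0, 1], [2, 3], [4]]
```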
## Materialize the dataset
Once all your components are added to your dataset, you can use different runners to materialize your dataset.
**Important:** When using other runners you will need to make sure that your new environment has access to:
- The working directory of your workflow (as mentioned above)
- The images used in your pipeline (make sure you have access to the registries where the images are stored)
The dataset ref can be a reference to the file containing your dataset, a variable containing your dataset, or a factory function that will create your dataset.
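Those three kinds of refs could be resolved along these lines (a simplified sketch, not the runners' actual resolution code):

```python
def resolve_ref(ref):
    # Factory function: call it to build the dataset.
    if callable(ref):
        return ref()
    # String: treat it as the path of the file defining the dataset.
    if isinstance(ref, str):
        return f"dataset defined in {ref}"
    # Otherwise assume it is already a dataset instance.
    return ref

print(resolve_ref("dataset.py"))
print(resolve_ref(lambda: "built by factory"))
```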
The working directory can be:
- A remote cloud location (S3, GCS, Azure Blob storage): For the local runner, make sure that your local credentials or service account have read/write access to the designated working directory and that you provide them to the dataset. For the Vertex, Sagemaker, and Kubeflow runners, make sure that the service account attached to those runners has read/write access.
- A local directory: only valid for the local runner, points to a local directory. This is useful for local development.