
Dataset

Fondant helps you build datasets by providing a set of operations to load, transform, and write data. With Fondant, you can use both reusable components and custom components, and chain them to create datasets.

Load a Fondant dataset

You can initialise a dataset from a previous run by using the read method.

from fondant.dataset import Dataset

dataset = Dataset.read("path/to/manifest.json")
View a detailed reference of the Dataset.read() method

Read a dataset from a manifest file.

Parameters:

  • manifest_path (str), required: The path to the manifest file.

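A dataset read from a manifest is a lazy Dataset like any other, so it can be extended with further operations before being materialized again. A minimal sketch, assuming an existing manifest and the reusable embed_text component used later on this page:

from fondant.dataset import Dataset

# Continue from a previous run and append another transform to the execution graph.
dataset = Dataset.read("path/to/manifest.json")
dataset = dataset.apply("embed_text")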

Build a dataset

Start by creating a dataset.py file and adding the following code.

import pyarrow as pa

from fondant.dataset import Dataset

dataset = Dataset.create(
    "load_from_parquet",
    arguments={
        "dataset_uri": "path/to/dataset",
        "n_rows_to_load": 100,
    },
    produces={
        "text": pa.string()
    },
    dataset_name="my_dataset"
)

This code initializes a Dataset instance with a load component, which reads the source data into the dataset.

View a detailed reference of the Dataset.create() method

Read data using the provided component.

Parameters:

  • ref (Any), required: The name of a reusable component, the path to the directory containing a containerized component, or a lightweight component class.
  • produces (Optional[Dict[str, Union[str, DataType]]]), default None: A mapping to update the fields produced by the operation as defined in the component spec. The keys are the names of the fields produced by the component, while the values are the type of the field, or the name of the field to map from the dataset.
  • arguments (Optional[Dict[str, Any]]), default None: A dictionary containing the argument name and value for the operation.
  • input_partition_rows (Optional[Union[int, str]]), default None: The number of rows to load per partition. Set to override the automatic partitioning.
  • resources (Optional[Resources]), default None: The resources to assign to the operation.
  • cache (Optional[bool]), default True: Set to False to disable caching.
  • dataset_name (Optional[str]), default None: The name of the dataset.

Returns:

  • Dataset: An intermediate dataset.


The create method does not execute your component yet, but adds the component to the execution graph. It returns a lazy Dataset instance which you can use to chain transform components.
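Because each call returns a lazy Dataset, operations can also be chained fluently in a single expression. A minimal sketch, reusing the arguments above and the embed_text component introduced in the next section:

import pyarrow as pa

from fondant.dataset import Dataset

# Nothing is executed here: each call only adds a node to the execution graph.
dataset = (
    Dataset.create(
        "load_from_parquet",
        arguments={
            "dataset_uri": "path/to/dataset",
            "n_rows_to_load": 100,
        },
        produces={"text": pa.string()},
        dataset_name="my_dataset",
    )
    .apply("embed_text")
)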

Adding transform components

from fondant.dataset import Resources

dataset = dataset.apply(
    "embed_text",
    resources=Resources(
        accelerator_number=1,
        accelerator_name="GPU",
    )
)

The apply method also returns a lazy Dataset which you can use to chain additional components.

The apply method also provides additional configuration options for how the component is executed. You can, for instance, provide a Resources definition to specify the hardware it should run on. In this case, we want to leverage a GPU to run our embedding model. Depending on the runner, you can choose the type of GPU as well.
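As a sketch of what choosing a specific GPU type could look like: the accelerator identifier below is an assumed Vertex AI-style name, not confirmed by this page, so check your runner's documentation for the values it accepts.

from fondant.dataset import Resources

dataset = dataset.apply(
    "embed_text",
    resources=Resources(
        accelerator_number=1,
        # Assumed accelerator identifier; the accepted names depend on the runner.
        accelerator_name="NVIDIA_TESLA_T4",
    )
)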

View a detailed reference of the Dataset.apply() method

Apply the provided component on the dataset.

Parameters:

  • ref (Any), required: The name of a reusable component, the path to the directory containing a custom component, or a lightweight component class.

  • workspace, required: The workspace to operate in.

  • consumes (Optional[Dict[str, Union[str, DataType]]]), default None: A mapping to update the fields consumed by the operation as defined in the component spec. The keys are the names of the fields to be received by the component, while the values are the type of the field, or the name of the field to map from the input dataset.

    Suppose we have a component spec that expects the following fields:

    ...
    consumes:
        text:
            type: string
        image:
            type: binary
    ...

    To override the default mapping and specify that the 'text' field should be sourced from the 'custom_text' field in the input dataset, the consumes mapping can be defined as follows:

    consumes = {
        "text": "custom_text"
    }

    In this example, the 'text' field will be sourced from 'custom_text', while 'image' will still be sourced from the 'image' field, since it is not specified in the custom mapping.

  • produces (Optional[Dict[str, Union[str, DataType]]]), default None: A mapping to update the fields produced by the operation as defined in the component spec. The keys are the names of the fields to be produced by the component, while the values are the type of the field, or the name that should be used to write the field to the output dataset.

    Suppose we have a component spec that produces the following fields:

    ...
    produces:
        text:
            type: string
        width:
            type: int
    ...

    To customize the field names during the production step, the produces mapping can be defined as follows:

    produces = {
        "width": "custom_width",
    }

    In this example, the 'text' field keeps its default name since it is not specified in the custom mapping, while the 'width' field will be stored as 'custom_width' in the output dataset.

    Alternatively, the produces mapping can define the data type of the output data:

    produces = {
        "width": pa.float32(),
    }

    In this example, the 'text' field retains its type 'string', while the 'width' field will be produced as type float in the output dataset.

  • arguments (Optional[Dict[str, Any]]), default None: A dictionary containing the argument name and value for the operation.

  • input_partition_rows (Optional[Union[int, str]]), default None: The number of rows to load per partition. Set to override the automatic partitioning.

  • resources (Optional[Resources]), default None: The resources to assign to the operation.

  • cache (Optional[bool]), default True: Set to False to disable caching.

Returns:

  • Dataset: An intermediate dataset.


Adding a write component

The final step is to write our data to its destination.

dataset = dataset.write(
    "write_to_hf_hub",
    arguments={
        "username": "user",
        "dataset_name": "dataset",
        "hf_token": "xxx",
    }
)
View a detailed reference of the Dataset.write() method

Write the dataset using the provided component.

Parameters:

  • ref (Any), required: The name of a reusable component, the path to the directory containing a custom component, or a lightweight component class.
  • workspace, required: The workspace to operate in.
  • consumes (Optional[Dict[str, Union[str, DataType]]]), default None: A mapping to update the fields consumed by the operation as defined in the component spec. The keys are the names of the fields to be received by the component, while the values are the type of the field, or the name of the field to map from the input dataset.
  • arguments (Optional[Dict[str, Any]]), default None: A dictionary containing the argument name and value for the operation.
  • input_partition_rows (Optional[Union[int, str]]), default None: The number of rows to load per partition. Set to override the automatic partitioning.
  • resources (Optional[Resources]), default None: The resources to assign to the operation.
  • cache (Optional[bool]), default True: Set to False to disable caching.

Returns:

  • Dataset: An intermediate dataset.


Materialize the dataset

Once all your components are added to your dataset, you can use different runners to materialize your dataset.

IMPORTANT

When using other runners, you will need to make sure that the environment they run in has access to:

  • The working directory of your workflow (see below)
  • The images used in your pipeline (make sure you have access to the registries where the images are stored)
Using the CLI:

Local runner:

fondant run local <dataset_ref> --working_directory <path_to_working_directory>

Vertex runner:

fondant run vertex <dataset_ref> \
 --project-id $PROJECT_ID \
 --project-region $PROJECT_REGION \
 --service-account $SERVICE_ACCOUNT \
 --working_directory <path_to_working_directory>

SageMaker runner:

fondant run sagemaker <dataset_ref> \
 --role-arn <sagemaker_role_arn> \
 --working_directory <path_to_working_directory>

Kubeflow runner:

fondant run kubeflow <dataset_ref> --working_directory <path_to_working_directory>
Using the Python SDK:

Local runner:

from fondant.dataset.runner import DockerRunner

runner = DockerRunner()
runner.run(input=<dataset_ref>, working_directory=<path_to_working_directory>)

Vertex runner:

from fondant.dataset.runner import VertexRunner

runner = VertexRunner()
runner.run(input=<dataset_ref>, working_directory=<path_to_working_directory>)

SageMaker runner:

from fondant.dataset.runner import SageMakerRunner

runner = SageMakerRunner()
runner.run(input=<dataset_ref>, role_arn=<sagemaker_role_arn>,
           working_directory=<path_to_working_directory>)

Kubeflow runner:

from fondant.dataset.runner import KubeFlowRunner

runner = KubeFlowRunner(host=<kubeflow_host>)
runner.run(input=<dataset_ref>)

The dataset ref can be a reference to the file containing your dataset, a variable containing your dataset, or a factory function that will create your dataset.
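For example, with the Python SDK, a factory function can be passed as the dataset ref. A hypothetical sketch (the file, function, and directory names are illustrative), assuming the runner calls the function to obtain the dataset:

from fondant.dataset import Dataset
from fondant.dataset.runner import DockerRunner

def make_dataset() -> Dataset:
    # Build and return the lazy dataset; the runner takes care of materializing it.
    return Dataset.read("path/to/manifest.json")

runner = DockerRunner()
runner.run(input=make_dataset, working_directory="./artifacts")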

The working directory can be:

  • A remote cloud location (S3, GCS, Azure Blob Storage): for the local runner, make sure that your local credentials or service account have read/write access to the designated working directory and that you provide them to the dataset. For the Vertex, SageMaker, and Kubeflow runners, make sure that the service account attached to those runners has read/write access.
  • A local directory: only valid for the local runner and useful for local development. Both options are sketched below.
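A minimal sketch of both options with the local runner (the bucket and directory paths are illustrative; the dataset ref is passed here as a file reference, as described above):

from fondant.dataset.runner import DockerRunner

runner = DockerRunner()

# Remote working directory: requires credentials with read/write access to the bucket.
runner.run(input="dataset.py", working_directory="gs://my-bucket/fondant-artifacts")

# Local working directory: handy for local development.
runner.run(input="dataset.py", working_directory="./local-artifacts")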