
Dataset

Fondant helps you build datasets by providing a set of operations to load, transform, and write data. With Fondant, you can use both reusable components and custom components, and chain them to create datasets.

Load a Fondant dataset

You can initialise a dataset from a previous run by using the read method.

from fondant.dataset import Dataset

dataset = Dataset.read("path/to/manifest.json")
View a detailed reference of the Dataset.read() method

Read a dataset from a manifest file.

Parameters:

  • manifest_path (str), required: The path to the manifest file.

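A dataset read from a manifest is a lazy Dataset like any other, so it can be extended with further operations before being materialized again. A minimal sketch, assuming an existing manifest and the reusable embed_text component used later on this page:

from fondant.dataset import Dataset

# Continue from a previous run and append another transform to the execution graph.
dataset = Dataset.read("path/to/manifest.json")
dataset = dataset.apply("embed_text")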

Build a dataset

Start by creating a dataset.py file and adding the following code.

import pyarrow as pa

from fondant.dataset import Dataset

dataset = Dataset.create(
    "load_from_parquet",
    arguments={
        "dataset_uri": "path/to/dataset",
        "n_rows_to_load": 100,
    },
    produces={
        "text": pa.string()
    },
    dataset_name="my_dataset"
)

This code initializes a Dataset instance with a load component, which reads the source data into the dataset.

View a detailed reference of the Dataset.create() method

Read data using the provided component.

Parameters:

  • ref (Any), required: The name of a reusable component, the path to the directory containing a containerized component, or a lightweight component class.
  • produces (Optional[Dict[str, Union[str, DataType]]]), default None: A mapping to update the fields produced by the operation as defined in the component spec. The keys are the names of the fields produced by the component, while the values are the type of the field, or the name of the field to map from the dataset.
  • arguments (Optional[Dict[str, Any]]), default None: A dictionary containing the argument name and value for the operation.
  • input_partition_rows (Optional[Union[int, str]]), default None: The number of rows to load per partition. Set to override the automatic partitioning.
  • resources (Optional[Resources]), default None: The resources to assign to the operation.
  • cache (Optional[bool]), default True: Set to False to disable caching.
  • dataset_name (Optional[str]), default None: The name of the dataset.

Returns:

  • Dataset: An intermediate dataset.


The create method does not execute your component yet, but adds the component to the execution graph. It returns a lazy Dataset instance which you can use to chain transform components.
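Because each call returns a lazy Dataset, operations can also be chained fluently in a single expression. A minimal sketch, reusing the arguments above and the embed_text component introduced in the next section:

import pyarrow as pa

from fondant.dataset import Dataset

# Nothing is executed here: each call only adds a node to the execution graph.
dataset = (
    Dataset.create(
        "load_from_parquet",
        arguments={
            "dataset_uri": "path/to/dataset",
            "n_rows_to_load": 100,
        },
        produces={"text": pa.string()},
        dataset_name="my_dataset",
    )
    .apply("embed_text")
)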

Adding transform components

from fondant.dataset import Resources

dataset = dataset.apply(
    "embed_text",
    resources=Resources(
        accelerator_number=1,
        accelerator_name="GPU",
    )
)

The apply method also returns a lazy Dataset which you can use to chain additional components.

The apply method also provides additional configuration options for how the component is executed. You can, for instance, provide a Resources definition to specify the hardware it should run on. In this case, we want to leverage a GPU to run our embedding model. Depending on the runner, you can choose the type of GPU as well.
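As a sketch of what choosing a specific GPU type could look like: the accelerator identifier below is an assumed Vertex AI-style name, not confirmed by this page, so check your runner's documentation for the values it accepts.

from fondant.dataset import Resources

dataset = dataset.apply(
    "embed_text",
    resources=Resources(
        accelerator_number=1,
        # Assumed accelerator identifier; the accepted names depend on the runner.
        accelerator_name="NVIDIA_TESLA_T4",
    )
)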

View a detailed reference of the Dataset.apply() method

Apply the provided component on the dataset.

Parameters:

  • ref (Any), required: The name of a reusable component, the path to the directory containing a custom component, or a lightweight component class.

  • workspace, required: The workspace to operate in.

  • consumes (Optional[Dict[str, Union[str, DataType]]]), default None: A mapping to update the fields consumed by the operation as defined in the component spec. The keys are the names of the fields to be received by the component, while the values are the type of the field, or the name of the field to map from the input dataset.

    Suppose we have a component spec that expects the following fields:

    ...
    consumes:
        text:
            type: string
        image:
            type: binary
    ...

    To override the default mapping and specify that the 'text' field should be sourced from the 'custom_text' field in the input dataset, the consumes mapping can be defined as follows:

    consumes = {
        "text": "custom_text"
    }

    In this example, the 'text' field will be sourced from 'custom_text', while 'image' will still be sourced from the 'image' field, since it is not specified in the custom mapping.

  • produces (Optional[Dict[str, Union[str, DataType]]]), default None: A mapping to update the fields produced by the operation as defined in the component spec. The keys are the names of the fields to be produced by the component, while the values are the type of the field, or the name that should be used to write the field to the output dataset.

    Suppose we have a component spec that produces the following fields:

    ...
    produces:
        text:
            type: string
        width:
            type: int
    ...

    To customize the field names during the production step, the produces mapping can be defined as follows:

    produces = {
        "width": "custom_width",
    }

    In this example, the 'text' field keeps its default name since it is not specified in the custom mapping, while the 'width' field will be stored as 'custom_width' in the output dataset.

    Alternatively, the produces mapping can define the data type of the output data:

    produces = {
        "width": pa.float32(),
    }

    In this example, the 'text' field retains its type 'string', while the 'width' field will be produced as type float in the output dataset.

  • arguments (Optional[Dict[str, Any]]), default None: A dictionary containing the argument name and value for the operation.

  • input_partition_rows (Optional[Union[int, str]]), default None: The number of rows to load per partition. Set to override the automatic partitioning.

  • resources (Optional[Resources]), default None: The resources to assign to the operation.

  • cache (Optional[bool]), default True: Set to False to disable caching.

Returns:

  • Dataset: An intermediate dataset.


Adding a write component

The final step is to write our data to its destination.

dataset = dataset.write(
    "write_to_hf_hub",
    arguments={
        "username": "user",
        "dataset_name": "dataset",
        "hf_token": "xxx",
    }
)
View a detailed reference of the Dataset.write() method

Write the dataset using the provided component.

Parameters:

  • ref (Any), required: The name of a reusable component, the path to the directory containing a custom component, or a lightweight component class.
  • workspace, required: The workspace to operate in.
  • consumes (Optional[Dict[str, Union[str, DataType]]]), default None: A mapping to update the fields consumed by the operation as defined in the component spec. The keys are the names of the fields to be received by the component, while the values are the type of the field, or the name of the field to map from the input dataset.
  • arguments (Optional[Dict[str, Any]]), default None: A dictionary containing the argument name and value for the operation.
  • input_partition_rows (Optional[Union[int, str]]), default None: The number of rows to load per partition. Set to override the automatic partitioning.
  • resources (Optional[Resources]), default None: The resources to assign to the operation.
  • cache (Optional[bool]), default True: Set to False to disable caching.

Returns:

  • Dataset: An intermediate dataset.


Materialize the dataset

Once all your components are added to your dataset, you can use different runners to materialize your dataset.

IMPORTANT

When using other runners, you will need to make sure that the environment they run in has access to:

  • The working directory of your workflow (see below)
  • The images used in your pipeline (make sure you have access to the registries where the images are stored)
Using the CLI:

Local runner:

fondant run local <dataset_ref> --working_directory <path_to_working_directory>

Vertex runner:

fondant run vertex <dataset_ref> \
 --project-id $PROJECT_ID \
 --project-region $PROJECT_REGION \
 --service-account $SERVICE_ACCOUNT \
 --working_directory <path_to_working_directory>

SageMaker runner:

fondant run sagemaker <dataset_ref> \
 --role-arn <sagemaker_role_arn> \
 --working_directory <path_to_working_directory>

Kubeflow runner:

fondant run kubeflow <dataset_ref> --working_directory <path_to_working_directory>
Using the Python SDK:

Local runner:

from fondant.dataset.runner import DockerRunner

runner = DockerRunner()
runner.run(input=<dataset_ref>, working_directory=<path_to_working_directory>)

Vertex runner:

from fondant.dataset.runner import VertexRunner

runner = VertexRunner()
runner.run(input=<dataset_ref>, working_directory=<path_to_working_directory>)

SageMaker runner:

from fondant.dataset.runner import SageMakerRunner

runner = SageMakerRunner()
runner.run(input=<dataset_ref>, role_arn=<sagemaker_role_arn>,
           working_directory=<path_to_working_directory>)

Kubeflow runner:

from fondant.dataset.runner import KubeFlowRunner

runner = KubeFlowRunner(host=<kubeflow_host>)
runner.run(input=<dataset_ref>)

The dataset ref can be a reference to the file containing your dataset, a variable containing your dataset, or a factory function that will create your dataset.
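For example, with the Python SDK, a factory function can be passed as the dataset ref. A hypothetical sketch (the file, function, and directory names are illustrative), assuming the runner calls the function to obtain the dataset:

from fondant.dataset import Dataset
from fondant.dataset.runner import DockerRunner

def make_dataset() -> Dataset:
    # Build and return the lazy dataset; the runner takes care of materializing it.
    return Dataset.read("path/to/manifest.json")

runner = DockerRunner()
runner.run(input=make_dataset, working_directory="./artifacts")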

The working directory can be:

  • A remote cloud location (S3, GCS, Azure Blob Storage): for the local runner, make sure that your local credentials or service account have read/write access to the designated working directory and that you provide them to the dataset. For the Vertex, SageMaker, and Kubeflow runners, make sure that the service account attached to those runners has read/write access.
  • A local directory: only valid for the local runner and useful for local development. Both options are sketched below.
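A minimal sketch of both options with the local runner (the bucket and directory paths are illustrative; the dataset ref is passed here as a file reference, as described above):

from fondant.dataset.runner import DockerRunner

runner = DockerRunner()

# Remote working directory: requires credentials with read/write access to the bucket.
runner.run(input="dataset.py", working_directory="gs://my-bucket/fondant-artifacts")

# Local working directory: handy for local development.
runner.run(input="dataset.py", working_directory="./local-artifacts")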