Components#
Fondant makes it easy to build datasets collaboratively by leveraging reusable components. Fondant provides many components out of the box (overview), but you can also define your own custom components.
The anatomy of a component#
A component is completely defined by its script, specification, docker image, the data it consumes and produces, and the arguments it takes. The definition of the script is similar for all types of components. All other aspects of the component are defined in different ways, depending on the type of component. Continue reading to learn more about the different types of components and how to define them.
Component script#
The logic should be implemented as a class, inheriting from one of the base Component classes offered by Fondant.
There are three main types of components:
- LoadComponent: Load data and initialise a dataset from an external data source
- TransformComponent: Implement a single transformation step to transform data in your dataset
- WriteComponent: Write your dataset to an external data sink
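To make the three roles concrete, here is a dependency-free sketch of how they chain together. It does not use Fondant's API: `load`, `transform`, and `write` are hypothetical plain functions, and lists of dictionaries stand in for dataset partitions.

```python
from typing import Dict, Iterable, Iterator, List

Partition = List[Dict[str, int]]  # stand-in for one chunk of a dataset


def load(values: Iterable[int], partition_size: int) -> Iterator[Partition]:
    """Load role: initialise a dataset from an external source (here, an iterable)."""
    chunk: Partition = []
    for value in values:
        chunk.append({"x": value})
        if len(chunk) == partition_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def transform(partitions: Iterable[Partition]) -> Iterator[Partition]:
    """Transform role: apply a single transformation step to every partition."""
    for partition in partitions:
        yield [{"x": row["x"] * 2} for row in partition]


def write(partitions: Iterable[Partition], sink: List[Dict[str, int]]) -> None:
    """Write role: push the dataset to an external sink (here, a list)."""
    for partition in partitions:
        sink.extend(partition)


sink: List[Dict[str, int]] = []
write(transform(load(range(5), partition_size=2)), sink)
```

In a real Fondant dataset each role is a component class rather than a function, but the flow of data from source, through transformation steps, to sink follows the same shape.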
The easiest way to implement a TransformComponent is to subclass the provided PandasTransformComponent. This component streams your data and offers it in memory-sized chunks as pandas dataframes.
import pandas as pd
from fondant.component import PandasTransformComponent


class ExampleComponent(PandasTransformComponent):

    def __init__(self, *, argument1, argument2) -> None:
        """
        Args:
            argumentX: An argument passed to the component
        """
        # Initialize your component here based on the arguments

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        """Implement your custom logic in this single method

        Args:
            dataframe: A Pandas dataframe containing one partition of your data

        Returns:
            A pandas dataframe containing the transformed data
        """
The __init__ method is called once for each component class, with the custom arguments defined in the args section of the component. This is a good place to perform costly initializations such as opening network connections, loading models, or parsing a config file. Doing so avoids re-initializing these resources each time the transform method is invoked.
The transform method is called multiple times, each time receiving a pandas dataframe with one partition of your data loaded in memory.
The dataframes passed to the transform method contain the data specified in the consumes section of the component. If a component defines that it consumes an image field, this data can be accessed using dataframe["image"].
The transform method should return a single dataframe, with the columns complying with the schema defined by the produces section of the component specification.
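The lifecycle described above (one `__init__` call, then one `transform` call per partition) can be sketched without Fondant or pandas. Everything here is illustrative: `UppercaseText`, its `suffix` argument, and the `text` and `text_upper` fields are hypothetical, and lists of dictionaries stand in for pandas dataframes.

```python
from typing import Dict, List

Rows = List[Dict[str, str]]  # stand-in for one pandas dataframe partition


class UppercaseText:
    """Hypothetical component: consumes a `text` field, produces `text_upper`."""

    init_calls = 0  # counts constructions, to show that setup runs only once

    def __init__(self, *, suffix: str) -> None:
        # Costly setup (models, connections, config parsing) would go here.
        UppercaseText.init_calls += 1
        self.suffix = suffix

    def transform(self, rows: Rows) -> Rows:
        # Called once per partition; returns rows shaped like the produced schema.
        return [{"text_upper": row["text"].upper() + self.suffix} for row in rows]


partitions = [
    [{"text": "a"}, {"text": "b"}],
    [{"text": "c"}],
]

component = UppercaseText(suffix="!")  # arguments are passed once
results = [component.transform(p) for p in partitions]  # one call per partition
```

Keeping state on `self` in `__init__` is what lets every later `transform` call reuse it without paying the setup cost again.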
Configuring Dask#
You can configure the Dask client based on the needs of your component by overriding the dask_client method:
import os

from dask.distributed import Client, LocalCluster
from fondant.component import PandasTransformComponent


class Component(PandasTransformComponent):

    def dask_client(self) -> Client:
        """Initialize the dask client to use for this component."""
        cluster = LocalCluster(
            processes=True,
            n_workers=os.cpu_count(),
            threads_per_worker=1,
        )
        return Client(cluster)
The default configuration uses a LocalCluster with processes, as many workers as there are logical CPUs available, and one thread per worker.
Some components might perform better using threads or a different combination of threads and processes. To use multiple GPUs, you can use a LocalCUDACluster.
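As a rough illustration of the process-versus-thread trade-off, the sketch below builds the keyword arguments for `LocalCluster` for both layouts. The helper `local_cluster_kwargs` and its `use_threads` flag are hypothetical names; only `processes`, `n_workers`, and `threads_per_worker` are actual `LocalCluster` parameters. The block does not import Dask, so it runs on its own.

```python
import os
from typing import Any, Dict


def local_cluster_kwargs(use_threads: bool = False) -> Dict[str, Any]:
    """Return keyword arguments intended for dask.distributed.LocalCluster."""
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    if use_threads:
        # A single worker running one thread per logical CPU.
        return {"processes": False, "n_workers": 1, "threads_per_worker": cpus}
    # Mirrors the default described above: one single-threaded process per CPU.
    return {"processes": True, "n_workers": cpus, "threads_per_worker": 1}


process_config = local_cluster_kwargs()
thread_config = local_cluster_kwargs(use_threads=True)
```

Inside an overridden dask_client method, the result could be passed along as `LocalCluster(**local_cluster_kwargs(use_threads=True))`. Whether threads actually help depends on the workload, so treat this as a starting point for experimentation.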
Component types#
We can distinguish two different types of components:
- Custom components are completely defined and implemented by the user. There are two ways to define a custom component:
  - Lightweight Components: Create a component from a self-contained Python function. This is the easiest way to create a custom component. It allows you to define a component without having to build a custom docker image or define a component specification.
  - Containerized Components: You can build your code into a docker image and write an accompanying component specification that refers to it. This is used for more complex components that require additional dependencies (e.g. GPU support).
- Reusable components can be used out of the box and can be loaded from the Fondant Hub. They are containerized components that are defined by the Fondant team or the community.
Custom components#
Lightweight Components#
To define a lightweight component, you can create a self-contained Python function that implements the logic of your component.
from fondant.component import PandasTransformComponent
from fondant.dataset import lightweight_component
import pandas as pd
import pyarrow as pa


@lightweight_component
class AddNumber(PandasTransformComponent):

    def __init__(self, n: int):
        self.n = n

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe["x"] = dataframe["x"].map(lambda x: x + self.n)
        return dataframe
You can apply a custom component to your dataset by passing in the reference to the component class containing your script.
See our best practices on creating a lightweight component.
Containerized Components#
To define your own containerized component, you can build your code into a docker image and write an accompanying component specification that refers to it.
A typical file structure for a custom component looks like this:
|- components
|  |- custom_component
|     |- src
|     |  |- main.py
|     |- Dockerfile
|     |- fondant_component.yaml
|     |- requirements.txt
|- dataset.py
The Dockerfile is used to build the code into a docker image, which is then referred to in the fondant_component.yaml.
name: Custom component
description: This is a custom component
image: custom_component:latest
You can apply a custom component to your dataset by passing in the path to the directory containing your fondant_component.yaml.
dataset = dataset.apply(
    component_dir="components/custom_component",
    arguments={
        "arg": "value"
    }
)
See our best practices on creating a containerized component.
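Since the component is located by its directory path, a small pre-flight check can catch a wrong path or a missing file before the dataset runs. The sketch below is not part of Fondant: `missing_component_files` and `EXPECTED_FILES` are hypothetical names, and the file list is simply taken from the typical layout shown above.

```python
from pathlib import Path
from typing import List

# Files from the typical layout, relative to the component directory.
EXPECTED_FILES = (
    "src/main.py",
    "Dockerfile",
    "fondant_component.yaml",
    "requirements.txt",
)


def missing_component_files(component_dir: str) -> List[str]:
    """Return the expected files that are absent from `component_dir`."""
    root = Path(component_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_component_files("components/custom_component")` returns an empty list when all four files are present. Your own components may legitimately use a different layout, in which case the list of expected files should be adjusted.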
Reusable components#
Reusable components are out-of-the-box containerized components from the Fondant Hub that you can easily add to your dataset.
You can find an overview of the available reusable components on the Fondant hub. Check their documentation for information on which arguments they accept and which data they consume and produce.