Component Hub#
Below you can find the reusable components offered by Fondant.
Data loading
Load from csv
Description#
Component that loads a dataset from a csv file
Inputs / outputs#
Consumes#
This component does not consume data.
Produces#
This component can produce additional fields. See the usage example below on how to define the fields and their schemas.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
dataset_uri | str | The remote path to the csv file(s) containing the dataset | / |
column_separator | str | Define the column separator of the csv file | / |
column_name_mapping | dict | Mapping of the consumed dataset | / |
n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing dataset workflows on a small scale | / |
index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.create(
"load_from_csv",
arguments={
# Add arguments
# "dataset_uri": ,
# "column_separator": ,
# "column_name_mapping": {},
# "n_rows_to_load": 0,
# "index_column": ,
},
produces={
<field_name>: <field_schema>,
..., # Add fields
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Load from files
Description#
This component loads data from files in a local or remote (AWS S3, Azure Blob storage, GCS) location. It supports the following formats: .zip, gzip, tar and tar.gz.
Inputs / outputs#
Consumes#
This component does not consume data.
Produces#
This component produces:
- filename: string
- content: binary
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
directory_uri | str | Local or remote path to the directory containing the files | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.create(
"load_from_files",
arguments={
# Add arguments
# "directory_uri": ,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Load from Hugging Face hub
Description#
Component that loads a dataset from the Hugging Face hub
Inputs / outputs#
Consumes#
This component does not consume data.
Produces#
This component can produce additional fields. See the usage example below on how to define the fields and their schemas.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
dataset_name | str | Name of dataset on the hub | / |
column_name_mapping | dict | Mapping of the consumed hub dataset to fondant column names | / |
image_column_names | list | Optional argument, a list containing the original image column names in case the dataset on the hub contains them. Used to format the image from HF hub format to a byte string. | / |
n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |
Usage#
You can apply this component to your dataset using the following code:
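A minimal sketch, mirroring the other load components; the registered component name load_from_hf_hub is assumed here, and the argument placeholders follow the table above.
from fondant.dataset import Dataset
dataset = Dataset.create(
"load_from_hf_hub",
arguments={
# Add arguments
# "dataset_name": ,
# "column_name_mapping": {},
# "image_column_names": [],
# "n_rows_to_load": 0,
# "index_column": ,
},
produces={
<field_name>: <field_schema>,
..., # Add fields
},
)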
Load from parquet
Description#
Component that loads a dataset from a parquet URI
Inputs / outputs#
Consumes#
This component does not consume data.
Produces#
This component can produce additional fields. See the usage example below on how to define the fields and their schemas.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
dataset_uri | str | The remote path to the parquet file/folder containing the dataset | / |
column_name_mapping | dict | Mapping of the consumed dataset | / |
n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |
Usage#
You can apply this component to your dataset using the following code:
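A minimal sketch, mirroring the other load components; the registered component name load_from_parquet is assumed here, and the argument placeholders follow the table above.
from fondant.dataset import Dataset
dataset = Dataset.create(
"load_from_parquet",
arguments={
# Add arguments
# "dataset_uri": ,
# "column_name_mapping": {},
# "n_rows_to_load": 0,
# "index_column": ,
},
produces={
<field_name>: <field_schema>,
..., # Add fields
},
)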
Load from pdf
Description#
Load pdf data stored locally or remotely using langchain loaders.
Inputs / outputs#
Consumes#
This component does not consume data.
Produces#
This component produces:
- pdf_path: string
- file_name: string
- text: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
pdf_path | str | The path to a pdf file or a folder containing pdf files to load. Can be a local path or a remote path. If the path is remote, the loader class will be determined by the scheme of the path. | / |
n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |
n_partitions | int | Number of partitions of the dask dataframe. If not specified, the number of partitions will be equal to the number of CPU cores. Set to high values if the data is large and the pipeline is running out of memory. | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.create(
"load_from_pdf",
arguments={
# Add arguments
# "pdf_path": ,
# "n_rows_to_load": 0,
# "index_column": ,
# "n_partitions": 0,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Data retrieval
Download images
Description#
Component that downloads images from a list of URLs.
This component takes in image URLs and downloads the images, along with some metadata (like their height and width). The images are stored in a new column as bytes objects. This component also resizes the images using the resizer function from the img2dataset library.
Inputs / outputs#
Consumes#
This component consumes:
- image_url: string
Produces#
This component produces:
- image: binary
- image_width: int32
- image_height: int32
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
timeout | int | Maximum time (in seconds) to wait when trying to download an image. | 10 |
retries | int | Number of times to retry downloading an image if it fails. | / |
n_connections | int | Number of concurrent connections opened per process. Decrease this number if you are running into timeout errors. A lower number of connections can increase the success rate but lower the throughput. | 100 |
image_size | int | Size of the images after resizing. | 256 |
resize_mode | str | Resize mode to use. One of "no", "keep_ratio", "center_crop", "border". | border |
resize_only_if_bigger | bool | If True, resize only if image is bigger than image_size. | / |
min_image_size | int | Minimum size of the images. | / |
max_aspect_ratio | float | Maximum aspect ratio of the images. | 99.9 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"download_images",
arguments={
# Add arguments
# "timeout": 10,
# "retries": 0,
# "n_connections": 100,
# "image_size": 256,
# "resize_mode": "border",
# "resize_only_if_bigger": False,
# "min_image_size": 0,
# "max_aspect_ratio": 99.9,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Retrieve from FAISS by embedding
Description#
Retrieve images from a Faiss index. The component should reference a Faiss image dataset, which includes both the Faiss index and a dataset of image URLs. The input dataset contains embeddings which will be used to retrieve similar images.
Inputs / outputs#
Consumes#
This component consumes:
- embedding: list
Produces#
This component produces:
- image_url: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
url_mapping_path | str | URL of the image mapping dataset | / |
faiss_index_path | str | URL of the Faiss index | / |
num_images | int | Number of images that will be retrieved for each prompt | 2 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"retrieve_from_faiss_by_embedding",
arguments={
# Add arguments
# "url_mapping_path": ,
# "faiss_index_path": ,
# "num_images": 2,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Retrieve from FAISS by prompt
Description#
Retrieve images from a Faiss index. The component should reference a Faiss image dataset, which includes both the Faiss index and a dataset of image URLs. The input dataset consists of a list of prompts. These prompts will be embedded using a CLIP model, and similar images will be retrieved from the index.
Inputs / outputs#
Consumes#
This component consumes:
- prompt: string
Produces#
This component produces:
- image_url: string
- prompt: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
url_mapping_path | str | URL of the image mapping dataset | / |
faiss_index_path | str | URL of the Faiss index | / |
clip_model | str | CLIP model name to use for the retrieval | laion/CLIP-ViT-B-32-laion2B-s34B-b79K |
num_images | int | Number of images that will be retrieved for each prompt | 2 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"retrieve_from_faiss_by_prompt",
arguments={
# Add arguments
# "url_mapping_path": ,
# "faiss_index_path": ,
# "clip_model": "laion/CLIP-ViT-B-32-laion2B-s34B-b79K",
# "num_images": 2,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Retrieve LAION by embedding
Description#
This component retrieves image URLs from LAION-5B based on a set of CLIP embeddings. It can be used to find images similar to the embedded images / captions.
Inputs / outputs#
Consumes#
This component consumes:
- embedding: list
Produces#
This component produces:
- image_url: string
- embedding_id: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
num_images | int | Number of images to retrieve for each prompt | / |
aesthetic_score | int | Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). | 9 |
aesthetic_weight | float | Weight of the aesthetic embedding when added to the query, between 0 and 1 | 0.5 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"retrieve_laion_by_embedding",
arguments={
# Add arguments
# "num_images": 0,
# "aesthetic_score": 9,
# "aesthetic_weight": 0.5,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Retrieve LAION by prompt
Description#
This component retrieves image URLs from the LAION-5B dataset based on text prompts. The retrieval itself is done based on CLIP embeddings similarity between the prompt sentences and the captions in the LAION dataset.
This component doesn’t return the actual images, only URLs.
Inputs / outputs#
Consumes#
This component consumes:
- prompt: string
Produces#
This component produces:
- image_url: string
- prompt_id: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
num_images | int | Number of images to retrieve for each prompt | / |
aesthetic_score | int | Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). | 9 |
aesthetic_weight | float | Weight of the aesthetic embedding when added to the query, between 0 and 1 | 0.5 |
url | str | The url of the backend clip retrieval service, defaults to the public service | https://knn.laion.ai/knn-service |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"retrieve_laion_by_prompt",
arguments={
# Add arguments
# "num_images": 0,
# "aesthetic_score": 9,
# "aesthetic_weight": 0.5,
# "url": "https://knn.laion.ai/knn-service",
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Data writing
Index AWS OpenSearch
Description#
Component that takes embeddings of text snippets and indexes them into an AWS OpenSearch vector database.
Inputs / outputs#
Consumes#
This component consumes:
- text: string
- embedding: list
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
host | str | The Cluster endpoint of the AWS OpenSearch cluster where the embeddings will be indexed. E.g. "my-test-domain.us-east-1.aoss.amazonaws.com" | / |
region | str | The AWS region where the OpenSearch cluster is located. If not specified, the default region will be used. | / |
index_name | str | The name of the index in the AWS OpenSearch cluster where the embeddings will be stored. | / |
index_body | dict | Parameters that specify index settings, mappings, and aliases for newly created index. | / |
port | int | The port number to connect to the AWS OpenSearch cluster. | 443 |
use_ssl | bool | A boolean flag indicating whether to use SSL/TLS for the connection to the OpenSearch cluster. | True |
verify_certs | bool | A boolean flag indicating whether to verify SSL certificates when connecting to the OpenSearch cluster. | True |
pool_maxsize | int | The maximum size of the connection pool to the AWS OpenSearch cluster. | 20 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(...)
dataset.write(
"index_aws_opensearch",
arguments={
# Add arguments
# "host": ,
# "region": ,
# "index_name": ,
# "index_body": {},
# "port": 443,
# "use_ssl": True,
# "verify_certs": True,
# "pool_maxsize": 20,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Index Qdrant
Description#
A Fondant component to load textual data and embeddings into a Qdrant database. NOTE: A Qdrant collection has to be created in advance with the appropriate configurations. https://qdrant.tech/documentation/concepts/collections/
Inputs / outputs#
Consumes#
This component consumes:
- text: string
- embedding: list
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
collection_name | str | The name of the Qdrant collection to upsert data into. | / |
location | str | The location of the Qdrant instance. | / |
batch_size | int | The batch size to use when uploading points to Qdrant. | 64 |
parallelism | int | The number of parallel workers to use when uploading points to Qdrant. | 1 |
url | str | Either host or str of 'Optional[scheme], host, Optional[port], Optional[prefix]'. | / |
port | int | Port of the REST API interface. | 6333 |
grpc_port | int | Port of the gRPC interface. | 6334 |
prefer_grpc | bool | If true - use gRPC interface whenever possible in custom methods. | / |
https | bool | If true - use HTTPS(SSL) protocol. | / |
api_key | str | API key for authentication in Qdrant Cloud. | / |
prefix | str | If set, add prefix to the REST URL path. | / |
timeout | int | Timeout for API requests. | / |
host | str | Host name of Qdrant service. If url and host are not set, defaults to 'localhost'. | / |
path | str | Persistence path for QdrantLocal. Eg. local_data/qdrant | / |
force_disable_check_same_thread | bool | Force disable check_same_thread for QdrantLocal sqlite connection. | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(...)
dataset.write(
"index_qdrant",
arguments={
# Add arguments
# "collection_name": ,
# "location": ,
# "batch_size": 64,
# "parallelism": 1,
# "url": ,
# "port": 6333,
# "grpc_port": 6334,
# "prefer_grpc": False,
# "https": False,
# "api_key": ,
# "prefix": ,
# "timeout": 0,
# "host": ,
# "path": ,
# "force_disable_check_same_thread": False,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Index Weaviate
Description#
Component that takes text or embeddings of text snippets and indexes them into a Weaviate vector database.
To run the component with text snippets as input, the component needs to be connected to a previous component that outputs text snippets.
Running with text as input#
import pyarrow as pa
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset.write(
"index_weaviate",
arguments={
"weaviate_url": "http://localhost:8080",
"class_name": "my_class",
"vectorizer": "text2vec-openai",
"additional_headers" : {
"X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY"
}
},
consumes={
"text": pa.string()
}
)
Running with embedding as input#
import pyarrow as pa
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"embed_text",
arguments={...},
consumes={
"text": "text",
},
)
dataset.write(
"index_weaviate",
arguments={
"weaviate_url": "http://localhost:8080",
"class_name": "my_class",
},
consumes={
"embedding": pa.list_(pa.float32())
}
)
Inputs / outputs#
Consumes#
This component can consume additional fields. See the usage example below on how to define a field name for additional fields.
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
weaviate_url | str | The URL of the weaviate instance. | http://localhost:8080 |
batch_size | int | The batch size to be used. Parameter of weaviate.batch.Batch().configure(). | 100 |
dynamic | bool | Whether to use dynamic batching or not. Parameter of weaviate.batch.Batch().configure(). | True |
num_workers | int | The maximal number of concurrent threads to run batch import. Parameter of weaviate.batch.Batch().configure(). | 2 |
overwrite | bool | Whether to overwrite/re-create the existing weaviate class and its embeddings. | / |
class_name | str | The name of the weaviate class that will be created and used to store the embeddings. Should follow the weaviate naming conventions. | / |
additional_config | dict | Additional configuration to pass to the weaviate client. | / |
additional_headers | dict | Additional headers to pass to the weaviate client. | / |
vectorizer | str | Which vectorizer to use. You can find the available vectorizers in the weaviate documentation: https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules. Set this to None if you want to insert your own embeddings. | / |
module_config | dict | The configuration of the vectorizer module. You can find the available configuration options in the weaviate documentation: https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules. Set this to None if you want to insert your own embeddings. | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(...)
dataset.write(
"index_weaviate",
arguments={
# Add arguments
# "weaviate_url": "http://localhost:8080",
# "batch_size": 100,
# "dynamic": True,
# "num_workers": 2,
# "overwrite": False,
# "class_name": ,
# "additional_config": {},
# "additional_headers": {},
# "vectorizer": ,
# "module_config": {},
},
consumes={
<field_name>: <dataset_field_name>,
..., # Add fields
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Write to file
Description#
A Fondant component to write a dataset to file on a local machine or to a cloud storage bucket. The dataset can be written as csv or parquet.
Inputs / outputs#
Consumes#
This component can consume additional fields. See the usage example below on how to define a field name for additional fields.
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
path | str | Path to store the dataset, whether it's a local path or a cloud storage bucket, must be specified. A separate filename will be generated for each partition. If you are using the local runner and export the data to a local directory, ensure that you mount the path to the directory using the --extra-volumes argument. | / |
format | str | Format for storing the dataframe; can be either csv or parquet. Parquet is used by default. The CSV files contain the column as a header and use a comma as a delimiter. | parquet |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(...)
dataset.write(
"write_to_file",
arguments={
# Add arguments
# "path": ,
# "format": "parquet",
},
consumes={
<field_name>: <dataset_field_name>,
..., # Add fields
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Write to Hugging Face hub
Description#
Component that writes a dataset to the Hugging Face hub
Inputs / outputs#
Consumes#
This component can consume additional fields. See the usage example below on how to define a field name for additional fields.
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
hf_token | str | The hugging face token used to write to the hub | / |
username | str | The username under which to upload the dataset | / |
dataset_name | str | The name of the dataset to upload | / |
image_column_names | list | A list containing the image column names. Used to format the images to HF hub format | / |
column_name_mapping | dict | Mapping of the consumed fondant column names to the written hub column names | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(...)
dataset.write(
"write_to_hf_hub",
arguments={
# Add arguments
# "hf_token": ,
# "username": ,
# "dataset_name": ,
# "image_column_names": [],
# "column_name_mapping": {},
},
consumes={
<field_name>: <dataset_field_name>,
..., # Add fields
},
)
Image processing
Caption images
Description#
This component captions images using a BLIP model from the Hugging Face hub
Inputs / outputs#
Consumes#
This component consumes:
- image: binary
Produces#
This component produces:
- caption: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
model_id | str | Id of the BLIP model on the Hugging Face hub | Salesforce/blip-image-captioning-base |
batch_size | int | Batch size to use for inference | 8 |
max_new_tokens | int | Maximum token length of each caption | 50 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"caption_images",
arguments={
# Add arguments
# "model_id": "Salesforce/blip-image-captioning-base",
# "batch_size": 8,
# "max_new_tokens": 50,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Crop images
Description#
This component crops out image borders. This is typically useful when working with graphical images that have single-color borders (e.g. logos, icons, etc.).
The component takes an image and calculates which color is most present in the border. It then crops the image in order to minimize this single-color border. The padding argument will add extra border to the image before cropping it, in order to avoid cutting off parts of the image. The resulting crop will always be square. If a crop is not possible, the component will return the original image.
Examples#
Examples of image cropping by removing the single-color border. Left side is original image, right side is border-cropped image.
Inputs / outputs#
Consumes#
This component consumes:
- images_data: binary
Produces#
This component produces:
- image: binary
- image_width: int32
- image_height: int32
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
cropping_threshold | int | Threshold parameter used for detecting borders. A lower (negative) parameter results in a more performant border detection, but can cause overcropping. Default is -30 | -30 |
padding | int | Padding for the image cropping. The padding is added to all borders of the image. | 10 |
Usage#
You can apply this component to your dataset using the following code:
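A minimal sketch, following the pattern of the other image components; the component name crop_images is assumed here, and the defaults are taken from the table above.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"crop_images",
arguments={
# Add arguments
# "cropping_threshold": -30,
# "padding": 10,
},
)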
Embed images
Description#
Component that generates CLIP embeddings from images
Inputs / outputs#
Consumes#
This component consumes:
- image: binary
Produces#
This component produces:
- embedding: list
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
model_id | str | Model id of a CLIP model on the Hugging Face hub | openai/clip-vit-large-patch14 |
batch_size | int | Batch size to use when embedding | 8 |
Usage#
You can apply this component to your dataset using the following code:
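A minimal sketch; the component name embed_images is assumed here, and the defaults are taken from the table above.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"embed_images",
arguments={
# Add arguments
# "model_id": "openai/clip-vit-large-patch14",
# "batch_size": 8,
},
)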
Extract image resolution
Description#
Component that extracts image resolution data from the images
Inputs / outputs#
Consumes#
This component consumes:
- image: binary
Produces#
This component produces:
- image: binary
- image_width: int32
- image_height: int32
Arguments#
This component takes no arguments.
Usage#
You can apply this component to your dataset using the following code:
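A minimal sketch; the component name extract_image_resolution is assumed here. Since the component takes no arguments, it can be applied directly.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"extract_image_resolution",
)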
Filter image resolution
Description#
Component that filters images based on minimum size and max aspect ratio
Inputs / outputs#
Consumes#
This component consumes:
- image_width: int32
- image_height: int32
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
min_image_dim | int | Minimum image dimension | / |
max_aspect_ratio | float | Maximum aspect ratio | / |
Usage#
You can apply this component to your dataset using the following code:
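A minimal sketch; the component name filter_image_resolution is assumed here, and the argument placeholders follow the table above.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"filter_image_resolution",
arguments={
# Add arguments
# "min_image_dim": 0,
# "max_aspect_ratio": 0.0,
},
)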
Resize images
Description#
Component that resizes images based on given width and height
Inputs / outputs#
Consumes#
This component consumes:
- image: binary
Produces#
This component produces:
- image: binary
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
resize_width | int | The width to resize to | / |
resize_height | int | The height to resize to | / |
Usage#
You can apply this component to your dataset using the following code:
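A minimal sketch; the component name resize_images is assumed here, and the argument placeholders follow the table above.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"resize_images",
arguments={
# Add arguments
# "resize_width": 0,
# "resize_height": 0,
},
)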
Segment images
Description#
Component that creates segmentation masks for images using a model from the Hugging Face hub
Inputs / outputs#
Consumes#
This component consumes:
- image: binary
Produces#
This component produces:
- segmentation_map: binary
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
model_id | str | id of the model on the Hugging Face hub | openmmlab/upernet-convnext-small |
batch_size | int | batch size to use | 8 |
Usage#
You can apply this component to your dataset using the following code:
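A minimal sketch; the component name segment_images is assumed here, and the defaults are taken from the table above.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"segment_images",
arguments={
# Add arguments
# "model_id": "openmmlab/upernet-convnext-small",
# "batch_size": 8,
},
)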
Text processing
Chunk text
Description#
Component that chunks text into smaller segments
This component takes a body of text and chunks it into smaller chunks. The id of the returned dataset consists of the id of the original document followed by the chunk index.
Different chunking strategies can be used. The default is to use the "recursive" strategy which recursively splits the text into smaller chunks until the chunk size is reached.
More information about the different chunking strategies can be found here:
- https://python.langchain.com/docs/modules/data_connection/document_transformers/
- https://www.pinecone.io/learn/chunking-strategies/
Inputs / outputs#
Consumes#
This component consumes:
- text: string
Produces#
This component produces:
- text: string
- original_document_id: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
chunk_strategy | str | The strategy to use for chunking the text. One of ['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter', 'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter', 'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter', 'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character', 'NLTK', 'SpaCy'] | RecursiveCharacterTextSplitter |
chunk_kwargs | dict | The arguments to pass to the chunking strategy | / |
language_text_splitter | str | The programming language to use for splitting text into sentences if "language" is selected as the splitter. Check https://python.langchain.com/docs/modules/data_connection/document_transformers/code_splitter for more information on supported languages. | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"chunk_text",
arguments={
# Add arguments
# "chunk_strategy": "RecursiveCharacterTextSplitter",
# "chunk_kwargs": {},
# "language_text_splitter": ,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Embed text
Description#
Component that generates embeddings of text passages.
Inputs / outputs#
Consumes#
This component consumes:
- text: string
Produces#
This component produces:
- embedding: list
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
model_provider | str | The provider of the model - corresponding to langchain embedding classes. Currently the following providers are supported: aleph_alpha, cohere, huggingface, openai, vertexai. | huggingface |
model | str | The model to generate embeddings from. Choose an available model name to pass to the model provider's langchain embedding class. | / |
api_keys | dict | The API keys to use for the model provider that are written to environment variables. Pass only the keys required by the model provider, or conveniently pass all keys you will ever need. Pay attention to how to name the dictionary keys so that they can be used by the model provider. | / |
auth_kwargs | dict | Additional keyword arguments required for api initialization/authentication. | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"embed_text",
arguments={
# Add arguments
# "model_provider": "huggingface",
# "model": ,
# "api_keys": {},
# "auth_kwargs": {},
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Filter language
Description#
A component that filters text based on the provided language.
Inputs / outputs#
Consumes#
This component consumes:
- text: string
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
language | str | A valid language code or identifier (e.g., "en", "fr", "de"). | en |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"filter_language",
arguments={
# Add arguments
# "language": "en",
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Filter text length
Description#
A component that filters out text passages based on their length.
Inputs / outputs#
Consumes#
This component consumes:
- text: string
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
min_characters_length | int | Minimum number of characters | / |
min_words_length | int | Minimum number of words | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"filter_text_length",
arguments={
# Add arguments
# "min_characters_length": 0,
# "min_words_length": 0,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Generate minhash
Description#
A component that generates minhashes of text.
Inputs / outputs#
Consumes#
This component consumes:
- text: string
Produces#
This component produces:
- minhash: list
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
shingle_ngram_size | int | Defines the n-gram size used for shingle generation | 3 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"generate_minhash",
arguments={
# Add arguments
# "shingle_ngram_size": 3,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run: