
Component Hub#

Below you can find the reusable components offered by Fondant. The usage snippets assume an existing pipeline object (an instance of fondant.pipeline.Pipeline) to which each component operation is added.

Data loading

Load from files

Description#

This component loads data from files in a local or remote (AWS S3, Azure Blob Storage, GCS) location. It supports the following formats: .zip, gzip, tar, and tar.gz.
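As a rough illustration of the produced schema (filename: string, content: binary), the zip case can be sketched with the Python standard library. This is a simplified sketch only; the actual component also handles gzip and tar archives as well as remote filesystems:

```python
import io
import zipfile

def load_from_zip(buffer: bytes):
    """Yield (filename, content) pairs from a zip archive, mirroring the
    component's output fields (filename: string, content: binary)."""
    with zipfile.ZipFile(io.BytesIO(buffer)) as archive:
        for name in archive.namelist():
            if not name.endswith("/"):  # skip directory entries
                yield name, archive.read(name)

# Build a small in-memory archive to demonstrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("a.txt", b"hello")
    archive.writestr("b.txt", b"world")

files = dict(load_from_zip(buf.getvalue()))
```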

Inputs / outputs#

This component consumes no data.

This component produces:

  • file
    • filename: string
    • content: binary

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
directory_uri str Local or remote path to the directory containing the files /

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


load_from_files_op = ComponentOp.from_registry(
    name="load_from_files",
    arguments={
        # Add arguments
        # "directory_uri": ,
    }
)
pipeline.add_op(load_from_files_op, dependencies=[...])  # Add previous component as dependency

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Load from hub

Description#

Component that loads a dataset from the Hugging Face hub

Inputs / outputs#

This component consumes no data.

This component produces:

  • dummy_variable
    • data: binary

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
dataset_name str Name of dataset on the hub /
column_name_mapping dict Mapping of the consumed hub dataset to fondant column names /
image_column_names list Optional argument: a list containing the original image column names, in case the dataset on the hub contains them. Used to format the images from HF hub format to a byte string. /
n_rows_to_load int Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale /
index_column str Column to set index to in the load component, if not specified a default globally unique index will be set /

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


load_from_hf_hub_op = ComponentOp.from_registry(
    name="load_from_hf_hub",
    arguments={
        # Add arguments
        # "dataset_name": ,
        # "column_name_mapping": {},
        # "image_column_names": [],
        # "n_rows_to_load": 0,
        # "index_column": ,
    }
)
pipeline.add_op(load_from_hf_hub_op, dependencies=[...])  # Add previous component as dependency
Load from parquet

Description#

Component that loads a dataset from a Parquet URI

Inputs / outputs#

This component consumes no data.

This component produces:

  • dummy_variable
    • data: binary

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
dataset_uri str The remote path to the parquet file/folder containing the dataset /
column_name_mapping dict Mapping of the consumed dataset column names to fondant column names /
n_rows_to_load int Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale /
index_column str Column to set index to in the load component, if not specified a default globally unique index will be set /

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


load_from_parquet_op = ComponentOp.from_registry(
    name="load_from_parquet",
    arguments={
        # Add arguments
        # "dataset_uri": ,
        # "column_name_mapping": {},
        # "n_rows_to_load": 0,
        # "index_column": ,
    }
)
pipeline.add_op(load_from_parquet_op, dependencies=[...])  # Add previous component as dependency

Data retrieval

Embedding based LAION retrieval

Description#

This component retrieves image URLs from LAION-5B based on a set of CLIP embeddings. It can be used to find images similar to the embedded images / captions.

Inputs / outputs#

This component consumes:

  • embeddings
    • data: list

This component produces:

  • images
    • url: string

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
num_images int Number of images to retrieve for each embedding /
aesthetic_score int Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). 9
aesthetic_weight float Weight of the aesthetic embedding when added to the query, between 0 and 1 0.5
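To give an intuition for the aesthetic arguments, the sketch below blends a query embedding with an aesthetic embedding via a weighted average and re-normalizes the result. This is purely illustrative; the exact formula used by the retrieval backend may differ:

```python
import math

def blend_query(embedding, aesthetic_embedding, aesthetic_weight=0.5):
    # Weighted average of query and aesthetic embeddings (illustrative only),
    # followed by L2 normalization so the result is a unit vector again.
    mixed = [(1 - aesthetic_weight) * q + aesthetic_weight * a
             for q, a in zip(embedding, aesthetic_embedding)]
    norm = math.sqrt(sum(x * x for x in mixed)) or 1.0
    return [x / norm for x in mixed]

query = blend_query([1.0, 0.0], [0.0, 1.0], aesthetic_weight=0.5)
```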

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


embedding_based_laion_retrieval_op = ComponentOp.from_registry(
    name="embedding_based_laion_retrieval",
    arguments={
        # Add arguments
        # "num_images": 0,
        # "aesthetic_score": 9,
        # "aesthetic_weight": 0.5,
    }
)
pipeline.add_op(embedding_based_laion_retrieval_op, dependencies=[...])  # Add previous component as dependency
Prompt based LAION retrieval

Description#

This component retrieves image URLs from the LAION-5B dataset based on text prompts. The retrieval itself is done based on CLIP embeddings similarity between the prompt sentences and the captions in the LAION dataset.

This component doesn’t return the actual images, only URLs.

Inputs / outputs#

This component consumes:

  • prompts
    • text: string

This component produces:

  • images
    • url: string

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
num_images int Number of images to retrieve for each prompt /
aesthetic_score int Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). 9
aesthetic_weight float Weight of the aesthetic embedding when added to the query, between 0 and 1 0.5
url str The URL of the backend CLIP retrieval service; defaults to the public service. https://knn.laion.ai/knn-service

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


prompt_based_laion_retrieval_op = ComponentOp.from_registry(
    name="prompt_based_laion_retrieval",
    arguments={
        # Add arguments
        # "num_images": 0,
        # "aesthetic_score": 9,
        # "aesthetic_weight": 0.5,
        # "url": "https://knn.laion.ai/knn-service",
    }
)
pipeline.add_op(prompt_based_laion_retrieval_op, dependencies=[...])  # Add previous component as dependency

Data writing

Index Weaviate

Description#

Component that takes embeddings of text snippets and indexes them into a Weaviate vector database.

Inputs / outputs#

This component consumes:

  • text
    • data: string
    • embedding: list

This component produces no data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
weaviate_url str The URL of the Weaviate instance. http://localhost:8080
batch_size int The batch size to be used. Parameter of weaviate.batch.Batch().configure(). 100
dynamic bool Whether to use dynamic batching or not. Parameter of weaviate.batch.Batch().configure(). True
num_workers int The maximal number of concurrent threads to run batch import. Parameter of weaviate.batch.Batch().configure(). 2
overwrite bool Whether to overwrite / re-create the existing Weaviate class and its embeddings. /
class_name str The name of the Weaviate class that will be created and used to store the embeddings. Should follow the Weaviate naming conventions. /
vectorizer str Which vectorizer to use. You can find the available vectorizers in the Weaviate documentation: https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules. Set this to None if you want to insert your own embeddings. /

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


index_weaviate_op = ComponentOp.from_registry(
    name="index_weaviate",
    arguments={
        # Add arguments
        # "weaviate_url": "http://localhost:8080",
        # "batch_size": 100,
        # "dynamic": True,
        # "num_workers": 2,
        # "overwrite": False,
        # "class_name": ,
        # "vectorizer": ,
    }
)
pipeline.add_op(index_weaviate_op, dependencies=[...])  # Add previous component as dependency
Write to hub

Description#

Component that writes a dataset to the Hugging Face hub

Inputs / outputs#

This component consumes:

  • dummy_variable
    • data: binary

This component produces no data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
hf_token str The hugging face token used to write to the hub /
username str The username under which to upload the dataset /
dataset_name str The name of the dataset to upload /
image_column_names list A list containing the image column names. Used to format the images to HF hub format /
column_name_mapping dict Mapping of the consumed fondant column names to the written hub column names /

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


write_to_hf_hub_op = ComponentOp.from_registry(
    name="write_to_hf_hub",
    arguments={
        # Add arguments
        # "hf_token": ,
        # "username": ,
        # "dataset_name": ,
        # "image_column_names": [],
        # "column_name_mapping": {},
    }
)
pipeline.add_op(write_to_hf_hub_op, dependencies=[...])  # Add previous component as dependency

Image processing

Caption images

Description#

This component captions images using a BLIP model from the Hugging Face hub

Inputs / outputs#

This component consumes:

  • images
    • data: binary

This component produces:

  • captions
    • text: string

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
model_id str Id of the BLIP model on the Hugging Face hub Salesforce/blip-image-captioning-base
batch_size int Batch size to use for inference 8
max_new_tokens int Maximum token length of each caption 50

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


caption_images_op = ComponentOp.from_registry(
    name="caption_images",
    arguments={
        # Add arguments
        # "model_id": "Salesforce/blip-image-captioning-base",
        # "batch_size": 8,
        # "max_new_tokens": 50,
    }
)
pipeline.add_op(caption_images_op, dependencies=[...])  # Add previous component as dependency

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Download images

Description#

Component that downloads images from a list of URLs.

This component takes in image URLs as input and downloads the images, along with some metadata (like their height and width). The images are stored in a new column as bytes objects. This component also resizes the images using the resizer function from the img2dataset library.

Inputs / outputs#

This component consumes:

  • images
    • url: string

This component produces:

  • images
    • data: binary
    • width: int32
    • height: int32

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
timeout int Maximum time (in seconds) to wait when trying to download an image. 10
retries int Number of times to retry downloading an image if it fails. /
n_connections int Number of concurrent connections opened per process. Decrease this number if you are running into timeout errors. A lower number of connections can increase the success rate but lower the throughput. 100
image_size int Size of the images after resizing. 256
resize_mode str Resize mode to use. One of "no", "keep_ratio", "center_crop", "border". border
resize_only_if_bigger bool If True, resize only if image is bigger than image_size. /
min_image_size int Minimum size of the images. /
max_aspect_ratio float Maximum aspect ratio of the images. 99.9

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


download_images_op = ComponentOp.from_registry(
    name="download_images",
    arguments={
        # Add arguments
        # "timeout": 10,
        # "retries": 0,
        # "n_connections": 100,
        # "image_size": 256,
        # "resize_mode": "border",
        # "resize_only_if_bigger": False,
        # "min_image_size": 0,
        # "max_aspect_ratio": 99.9,
    }
)
pipeline.add_op(download_images_op, dependencies=[...])  # Add previous component as dependency

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Embed images

Description#

Component that generates CLIP embeddings from images

Inputs / outputs#

This component consumes:

  • images
    • data: binary

This component produces:

  • embeddings
    • data: list

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
model_id str Model id of a CLIP model on the Hugging Face hub openai/clip-vit-large-patch14
batch_size int Batch size to use when embedding 8

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


embed_images_op = ComponentOp.from_registry(
    name="embed_images",
    arguments={
        # Add arguments
        # "model_id": "openai/clip-vit-large-patch14",
        # "batch_size": 8,
    }
)
pipeline.add_op(embed_images_op, dependencies=[...])  # Add previous component as dependency
Filter image resolution

Description#

Component that filters images based on minimum size and max aspect ratio
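The filter condition can be sketched as a simple predicate over the consumed width and height; the component's exact predicate is an assumption here:

```python
def keep_image(width: int, height: int,
               min_image_dim: int, max_aspect_ratio: float) -> bool:
    # Keep an image only if its smallest dimension is large enough and its
    # aspect ratio (long side divided by short side) is not too extreme.
    if min(width, height) < min_image_dim:
        return False
    return max(width, height) / min(width, height) <= max_aspect_ratio
```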

Inputs / outputs#

This component consumes:

  • images
    • width: int32
    • height: int32

This component produces no data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
min_image_dim int Minimum image dimension /
max_aspect_ratio float Maximum aspect ratio /

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


filter_image_resolution_op = ComponentOp.from_registry(
    name="filter_image_resolution",
    arguments={
        # Add arguments
        # "min_image_dim": 0,
        # "max_aspect_ratio": 0.0,
    }
)
pipeline.add_op(filter_image_resolution_op, dependencies=[...])  # Add previous component as dependency
Image cropping

Description#

This component crops out image borders. This is typically useful when working with graphical images that have single-color borders (e.g. logos, icons).

The component takes an image and calculates which color is most present in the border. It then crops the image in order to minimize this single-color border. The padding argument will add extra border to the image before cropping it, in order to avoid cutting off parts of the image. The resulting crop will always be square. If a crop is not possible, the component will return the original image.
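The idea can be sketched on a toy 2D grid of pixel values: detect the dominant border color, then crop to the bounding box of everything else, expanded by the padding. This is a simplified sketch only; the real component operates on image bytes and keeps the resulting crop square:

```python
from collections import Counter

def crop_borders(image, padding=1):
    # The most common value along the four borders is taken as the border color.
    height, width = len(image), len(image[0])
    border = (image[0] + image[-1]
              + [row[0] for row in image] + [row[-1] for row in image])
    border_color = Counter(border).most_common(1)[0][0]
    # Bounding box of all non-border-colored pixels, expanded by `padding`.
    coords = [(y, x) for y in range(height) for x in range(width)
              if image[y][x] != border_color]
    if not coords:
        return image  # no crop possible: return the original image
    ys, xs = [c[0] for c in coords], [c[1] for c in coords]
    top, bottom = max(min(ys) - padding, 0), min(max(ys) + padding, height - 1)
    left, right = max(min(xs) - padding, 0), min(max(xs) + padding, width - 1)
    return [row[left:right + 1] for row in image[top:bottom + 1]]

image = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
cropped = crop_borders(image, padding=0)
```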

Examples#

Examples of image cropping by removing the single-color border. Left side is original image, right side is border-cropped image.


Inputs / outputs#

This component consumes:

  • images
    • data: binary

This component produces:

  • images
    • data: binary
    • width: int32
    • height: int32

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
cropping_threshold int Threshold parameter used for detecting borders. A lower (negative) threshold results in a more aggressive border detection, but can cause overcropping. -30
padding int Padding for the image cropping. The padding is added to all borders of the image. 10

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


image_cropping_op = ComponentOp.from_registry(
    name="image_cropping",
    arguments={
        # Add arguments
        # "cropping_threshold": -30,
        # "padding": 10,
    }
)
pipeline.add_op(image_cropping_op, dependencies=[...])  # Add previous component as dependency
Image resolution extraction

Description#

Component that extracts image resolution data from the images

Inputs / outputs#

This component consumes:

  • images
    • data: binary

This component produces:

  • images
    • data: binary
    • width: int32
    • height: int32

Arguments#

This component takes no arguments.

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


image_resolution_extraction_op = ComponentOp.from_registry(
    name="image_resolution_extraction",
    arguments={
        # Add arguments
    }
)
pipeline.add_op(image_resolution_extraction_op, dependencies=[...])  # Add previous component as dependency
Resize images

Description#

Component that resizes images based on given width and height

Inputs / outputs#

This component consumes:

  • images
    • data: binary

This component produces:

  • images
    • data: binary

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
resize_width int The width to resize to /
resize_height int The height to resize to /

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


resize_images_op = ComponentOp.from_registry(
    name="resize_images",
    arguments={
        # Add arguments
        # "resize_width": 0,
        # "resize_height": 0,
    }
)
pipeline.add_op(resize_images_op, dependencies=[...])  # Add previous component as dependency
Segment images

Description#

Component that creates segmentation masks for images using a model from the Hugging Face hub

Inputs / outputs#

This component consumes:

  • images
    • data: binary

This component produces:

  • segmentations
    • data: binary

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
model_id str Id of the model on the Hugging Face hub openmmlab/upernet-convnext-small
batch_size int Batch size to use 8

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


segment_images_op = ComponentOp.from_registry(
    name="segment_images",
    arguments={
        # Add arguments
        # "model_id": "openmmlab/upernet-convnext-small",
        # "batch_size": 8,
    }
)
pipeline.add_op(segment_images_op, dependencies=[...])  # Add previous component as dependency

Text processing

Chunk text

Description#

Component that chunks text into smaller segments

This component takes a body of text and splits it into smaller chunks. The id of each returned chunk consists of the id of the original document followed by the chunk index.
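A minimal character-based sketch of this chunking scheme (the "_" separator in the chunk id is an assumption, and chunk_size must exceed chunk_overlap):

```python
def chunk_text(doc_id: str, text: str, chunk_size: int, chunk_overlap: int):
    # Slide a window of chunk_size characters, stepping by
    # chunk_size - chunk_overlap so consecutive chunks overlap.
    step = chunk_size - chunk_overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append((f"{doc_id}_{index}", text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("doc1", "abcdefghij", chunk_size=4, chunk_overlap=2)
```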

Inputs / outputs#

This component consumes:

  • text
    • data: string

This component produces:

  • text
    • data: string
    • original_document_id: string

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
chunk_size int Maximum size of chunks to return /
chunk_overlap int Overlap in characters between chunks /

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


chunk_text_op = ComponentOp.from_registry(
    name="chunk_text",
    arguments={
        # Add arguments
        # "chunk_size": 0,
        # "chunk_overlap": 0,
    }
)
pipeline.add_op(chunk_text_op, dependencies=[...])  # Add previous component as dependency

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Embed text

Description#

Component that generates embeddings of text passages.

Inputs / outputs#

This component consumes:

  • text
    • data: string

This component produces:

  • text
    • data: string
    • embedding: list

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
model_provider str The provider of the model - corresponding to langchain embedding classes. Currently the following providers are supported: aleph_alpha, cohere, huggingface, openai, vertexai. huggingface
model str The model to generate embeddings from. Choose an available model name to pass to the model provider's langchain embedding class. /
api_keys dict The API keys to use for the model provider, written to environment variables. Pass only the keys required by the model provider, or conveniently pass all keys you will ever need. Pay attention to how you name the dictionary keys so that they can be used by the model provider. /
auth_kwargs dict Additional keyword arguments required for api initialization/authentication. /
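The api_keys behavior described above amounts to exporting each entry as an environment variable before the provider client is initialized. A sketch (the key name shown is an example, not a requirement of this component):

```python
import os

def export_api_keys(api_keys: dict) -> None:
    # Write each API key to an environment variable so the provider's
    # client library can pick it up; the dictionary keys must match the
    # variable names the provider expects (e.g. "OPENAI_API_KEY").
    for name, value in api_keys.items():
        os.environ[name] = value

export_api_keys({"OPENAI_API_KEY": "sk-placeholder"})
```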

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


embed_text_op = ComponentOp.from_registry(
    name="embed_text",
    arguments={
        # Add arguments
        # "model_provider": "huggingface",
        # "model": ,
        # "api_keys": {},
        # "auth_kwargs": {},
    }
)
pipeline.add_op(embed_text_op, dependencies=[...])  # Add previous component as dependency

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Filter text length

Description#

A component that filters out text passages based on their length
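The filter can be expressed as a simple predicate over a text passage; the exact predicate used by the component is an assumption here:

```python
def keep_text(text: str, min_characters_length: int, min_words_length: int) -> bool:
    # Keep a passage only if it is long enough both in characters and in words.
    return (len(text) >= min_characters_length
            and len(text.split()) >= min_words_length)
```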

Inputs / outputs#

This component consumes:

  • text
    • data: string

This component produces no data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
min_characters_length int Minimum number of characters /
min_words_length int Minimum number of words /

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


filter_text_length_op = ComponentOp.from_registry(
    name="filter_text_length",
    arguments={
        # Add arguments
        # "min_characters_length": 0,
        # "min_words_length": 0,
    }
)
pipeline.add_op(filter_text_length_op, dependencies=[...])  # Add previous component as dependency

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Filter languages

Description#

A component that filters text based on the provided language.

Inputs / outputs#

This component consumes:

  • text
    • data: string

This component produces no data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
language str A valid language code or identifier (e.g., "en", "fr", "de"). en

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


language_filter_op = ComponentOp.from_registry(
    name="language_filter",
    arguments={
        # Add arguments
        # "language": "en",
    }
)
pipeline.add_op(language_filter_op, dependencies=[...])  # Add previous component as dependency

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

MinHash generator

Description#

A component that generates minhashes of text.
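Conceptually, a minhash signature keeps, for each of several hash functions, the minimum hash over a text's shingles; near-duplicate texts then share many signature entries. A self-contained sketch (the component itself typically relies on a dedicated library such as datasketch, so treat this as illustration only):

```python
import hashlib

def shingles(text: str, ngram_size: int = 3):
    # Word-level ngram shingles of the text.
    words = text.split()
    return {" ".join(words[i:i + ngram_size])
            for i in range(len(words) - ngram_size + 1)}

def minhash(text: str, ngram_size: int = 3, num_perm: int = 8):
    # One seeded hash function per signature entry; keep the minimum
    # hash value observed over all shingles.
    signature = []
    for seed in range(num_perm):
        hashes = (
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text, ngram_size)
        )
        signature.append(min(hashes))
    return signature

a = minhash("the quick brown fox jumps over the lazy dog")
b = minhash("the quick brown fox jumps over the lazy dog")
```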

Inputs / outputs#

This component consumes:

  • text
    • data: string

This component produces:

  • text
    • minhash: list

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
shingle_ngram_size int Size of the ngram used for shingle generation 3

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


minhash_generator_op = ComponentOp.from_registry(
    name="minhash_generator",
    arguments={
        # Add arguments
        # "shingle_ngram_size": 3,
    }
)
pipeline.add_op(minhash_generator_op, dependencies=[...])  # Add previous component as dependency

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Normalize text

Description#

This component implements several text normalization techniques to clean and preprocess textual data:

  • Apply lowercasing: Converts all text to lowercase
  • Remove unnecessary whitespaces: Eliminates extra whitespace between words, e.g. tabs
  • Apply NFC normalization: Converts characters to their canonical representation
  • Remove commonly seen patterns in webpages, following the implementation of Penedo et al.
  • Remove punctuation: Strips punctuation marks from the text

These text normalization techniques are valuable for preparing text data before using it to train large language models.
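The listed steps, apart from the line-based webpage-pattern filtering, can be sketched with the standard library:

```python
import string
import unicodedata

def normalize_text(text: str) -> str:
    # NFC normalization, lowercasing, punctuation stripping, and
    # whitespace collapsing; the line-based filtering of webpage
    # patterns (Penedo et al.) is omitted from this sketch.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

result = normalize_text("Hello,\tWorld!  ")
```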

Inputs / outputs#

This component consumes:

  • text
    • data: string

This component produces no data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
remove_additional_whitespaces bool If true, remove all additional whitespace such as tabs. /
apply_nfc bool If true, apply NFC normalization. /
normalize_lines bool If true, analyze documents line by line and apply various rules to discard or edit lines. Used to remove common patterns in webpages, e.g. counters. /
do_lowercase bool If true, apply lowercasing. /
remove_punctuation str If true, punctuation will be removed. /

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import ComponentOp


normalize_text_op = ComponentOp.from_registry(
    name="normalize_text",
    arguments={
        # Add arguments
        # "remove_additional_whitespaces": False,
        # "apply_nfc": False,
        # "normalize_lines": False,
        # "do_lowercase": False,
        # "remove_punctuation": ,
    }
)
pipeline.add_op(normalize_text_op, dependencies=[...])  # Add previous component as dependency

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test