Component Hub#

Below you can find the reusable components offered by Fondant.

Data loading

Load from csv

Description#

Component that loads a dataset from a csv file

Inputs / outputs#

Consumes#

This component does not consume data.

Produces#

This component can produce additional fields. The produces argument defines a mapping that updates the fields produced by the operation as defined in the component spec: the keys are the names of the fields to be produced by the component, while the values are the types that should be used to write the output dataset.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
dataset_uri str The remote path to the csv file(s) containing the dataset /
column_separator str Define the column separator of the csv file /
column_name_mapping dict Mapping of the consumed dataset /
n_rows_to_load int Optional argument that defines the number of rows to load. Useful for testing dataset workflows on a small scale /
index_column str Column to set index to in the load component, if not specified a default globally unique index will be set /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.create(
    "load_from_csv",
    arguments={
        # Add arguments
        # "dataset_uri": ,
        # "column_separator": ,
        # "column_name_mapping": {},
        # "n_rows_to_load": 0,
        # "index_column": ,
    },
    produces={
         <field_name>: <field_schema>,
         ..., # Add fields
    },
)
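
For example, a minimal sketch of loading a remote CSV into two typed fields might look as follows (the URI and field names here are hypothetical placeholders):

import pyarrow as pa

from fondant.dataset import Dataset


dataset = Dataset.create(
    "load_from_csv",
    arguments={
        "dataset_uri": "s3://my-bucket/data/*.csv",  # hypothetical remote path
        "column_separator": ",",
        "n_rows_to_load": 100,
    },
    produces={
        "text": pa.string(),   # field names and types are placeholders
        "label": pa.string(),
    },
)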

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Load from files

Description#

This component loads data from files in a local or remote (AWS S3, Azure Blob storage, GCS) location. It supports the following formats: .zip, gzip, tar and tar.gz.

Inputs / outputs#

Consumes#

This component does not consume data.

Produces#

This component produces:

  • filename: string
  • content: binary

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
directory_uri str Local or remote path to the directory containing the files /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.create(
    "load_from_files",
    arguments={
        # Add arguments
        # "directory_uri": ,
    },
)
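
As a minimal sketch, pointing the component at a hypothetical zip archive on GCS could look like this:

from fondant.dataset import Dataset


dataset = Dataset.create(
    "load_from_files",
    arguments={
        # hypothetical remote path; local paths and S3/Azure URIs work the same way
        "directory_uri": "gs://my-bucket/archives/documents.zip",
    },
)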

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Load from Hugging Face hub

Description#

Component that loads a dataset from the Hugging Face hub

Inputs / outputs#

Consumes#

This component does not consume data.

Produces#

This component can produce additional fields. The produces argument defines a mapping that updates the fields produced by the operation as defined in the component spec: the keys are the names of the fields to be produced by the component, while the values are the types that should be used to write the output dataset.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
dataset_name str Name of dataset on the hub /
column_name_mapping dict Mapping of the consumed hub dataset to fondant column names /
image_column_names list Optional argument, a list containing the original image column names in case the dataset on the hub contains them. Used to format the image from HF hub format to a byte string. /
n_rows_to_load int Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale /
index_column str Column to set index to in the load component, if not specified a default globally unique index will be set /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.create(
    "load_from_hf_hub",
    arguments={
        # Add arguments
        # "dataset_name": ,
        # "column_name_mapping": {},
        # "image_column_names": [],
        # "n_rows_to_load": 0,
        # "index_column": ,
    },
    produces={
         <field_name>: <field_schema>,
         ..., # Add fields
    },
)
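
For example, a hedged sketch of loading an image dataset from the hub, converting the image column to bytes and producing typed fields (the dataset name and field names are placeholders):

import pyarrow as pa

from fondant.dataset import Dataset


dataset = Dataset.create(
    "load_from_hf_hub",
    arguments={
        "dataset_name": "user/my-image-dataset",  # hypothetical dataset on the hub
        "image_column_names": ["image"],
        "n_rows_to_load": 1000,
    },
    produces={
        "image": pa.binary(),
        "caption": pa.string(),
    },
)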

Load from parquet

Description#

Component that loads a dataset from a parquet uri

Inputs / outputs#

Consumes#

This component does not consume data.

Produces#

This component can produce additional fields. The produces argument defines a mapping that updates the fields produced by the operation as defined in the component spec: the keys are the names of the fields to be produced by the component, while the values are the types that should be used to write the output dataset.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
dataset_uri str The remote path to the parquet file/folder containing the dataset /
column_name_mapping dict Mapping of the consumed dataset /
n_rows_to_load int Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale /
index_column str Column to set index to in the load component, if not specified a default globally unique index will be set /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.create(
    "load_from_parquet",
    arguments={
        # Add arguments
        # "dataset_uri": ,
        # "column_name_mapping": {},
        # "n_rows_to_load": 0,
        # "index_column": ,
    },
    produces={
         <field_name>: <field_schema>,
         ..., # Add fields
    },
)

Load from pdf

Description#

Load pdf data stored locally or remotely using langchain loaders.

Inputs / outputs#

Consumes#

This component does not consume data.

Produces#

This component produces:

  • pdf_path: string
  • file_name: string
  • text: string

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
pdf_path str The path to a pdf file or a folder containing pdf files to load. Can be a local path or a remote path. If the path is remote, the loader class will be determined by the scheme of the path. /
n_rows_to_load int Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale /
index_column str Column to set index to in the load component, if not specified a default globally unique index will be set /
n_partitions int Number of partitions of the dask dataframe. If not specified, the number of partitions will be equal to the number of CPU cores. Set to high values if the data is large and the pipeline is running out of memory. /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.create(
    "load_from_pdf",
    arguments={
        # Add arguments
        # "pdf_path": ,
        # "n_rows_to_load": 0,
        # "index_column": ,
        # "n_partitions": 0,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Data retrieval

Download images

Description#

Component that downloads images from a list of URLs.

This component takes in image URLs as input and downloads the images, along with some metadata (like their height and width). The images are stored in a new column as bytes objects. This component also resizes the images using the resizer function from the img2dataset library.

Inputs / outputs#

Consumes#

This component consumes:

  • image_url: string

Produces#

This component produces:

  • image: binary
  • image_width: int32
  • image_height: int32

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
timeout int Maximum time (in seconds) to wait when trying to download an image. 10
retries int Number of times to retry downloading an image if it fails. /
n_connections int Number of concurrent connections opened per process. Decrease this number if you are running into timeout errors. A lower number of connections can increase the success rate but lower the throughput. 100
image_size int Size of the images after resizing. 256
resize_mode str Resize mode to use. One of "no", "keep_ratio", "center_crop", "border". border
resize_only_if_bigger bool If True, resize only if image is bigger than image_size. /
min_image_size int Minimum size of the images. /
max_aspect_ratio float Maximum aspect ratio of the images. 99.9

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "download_images",
    arguments={
        # Add arguments
        # "timeout": 10,
        # "retries": 0,
        # "n_connections": 100,
        # "image_size": 256,
        # "resize_mode": "border",
        # "resize_only_if_bigger": False,
        # "min_image_size": 0,
        # "max_aspect_ratio": 99.9,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Retrieve from FAISS by embedding

Description#

Retrieve images from a Faiss index. The component should reference a Faiss image dataset, which includes both the Faiss index and a dataset of image URLs. The input dataset contains embeddings which will be used to retrieve similar images.

Inputs / outputs#

Consumes#

This component consumes:

  • embedding: list

Produces#

This component produces:

  • image_url: string

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
url_mapping_path str URL of the image mapping dataset /
faiss_index_path str URL of the Faiss index /
num_images int Number of images that will be retrieved for each prompt 2

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "retrieve_from_faiss_by_embedding",
    arguments={
        # Add arguments
        # "url_mapping_path": ,
        # "faiss_index_path": ,
        # "num_images": 2,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Retrieve from FAISS by prompt

Description#

Retrieve images from a Faiss index. The component should reference a Faiss image dataset, which includes both the Faiss index and a dataset of image URLs. The input dataset consists of a list of prompts. These prompts will be embedded using a CLIP model, and similar images will be retrieved from the index.

Inputs / outputs#

Consumes#

This component consumes:

  • prompt: string

Produces#

This component produces:

  • image_url: string
  • prompt: string

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
url_mapping_path str URL of the image mapping dataset /
faiss_index_path str URL of the Faiss index /
clip_model str Clip model name to use for the retrieval laion/CLIP-ViT-B-32-laion2B-s34B-b79K
num_images int Number of images that will be retrieved for each prompt 2

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "retrieve_from_faiss_by_prompt",
    arguments={
        # Add arguments
        # "url_mapping_path": ,
        # "faiss_index_path": ,
        # "clip_model": "laion/CLIP-ViT-B-32-laion2B-s34B-b79K",
        # "num_images": 2,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Retrieve LAION by embedding

Description#

This component retrieves image URLs from LAION-5B based on a set of CLIP embeddings. It can be used to find images similar to the embedded images / captions.

Inputs / outputs#

Consumes#

This component consumes:

  • embedding: list

Produces#

This component produces:

  • image_url: string
  • embedding_id: string

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
num_images int Number of images to retrieve for each prompt /
aesthetic_score int Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). 9
aesthetic_weight float Weight of the aesthetic embedding when added to the query, between 0 and 1 0.5

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "retrieve_laion_by_embedding",
    arguments={
        # Add arguments
        # "num_images": 0,
        # "aesthetic_score": 9,
        # "aesthetic_weight": 0.5,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Retrieve LAION by prompt

Description#

This component retrieves image URLs from the LAION-5B dataset based on text prompts. The retrieval itself is done based on CLIP embeddings similarity between the prompt sentences and the captions in the LAION dataset.

This component doesn’t return the actual images, only URLs.

Inputs / outputs#

Consumes#

This component consumes:

  • prompt: string

Produces#

This component produces:

  • image_url: string
  • prompt_id: string

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
num_images int Number of images to retrieve for each prompt /
aesthetic_score int Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). 9
aesthetic_weight float Weight of the aesthetic embedding when added to the query, between 0 and 1 0.5
url str The url of the backend clip retrieval service, defaults to the public service https://knn.laion.ai/knn-service

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "retrieve_laion_by_prompt",
    arguments={
        # Add arguments
        # "num_images": 0,
        # "aesthetic_score": 9,
        # "aesthetic_weight": 0.5,
        # "url": "https://knn.laion.ai/knn-service",
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Data writing

Index AWS OpenSearch

Description#

Component that takes embeddings of text snippets and indexes them into an AWS OpenSearch vector database.

Inputs / outputs#

Consumes#

This component consumes:

  • text: string
  • embedding: list

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
host str The Cluster endpoint of the AWS OpenSearch cluster where the embeddings will be indexed. E.g. "my-test-domain.us-east-1.aoss.amazonaws.com" /
region str The AWS region where the OpenSearch cluster is located. If not specified, the default region will be used. /
index_name str The name of the index in the AWS OpenSearch cluster where the embeddings will be stored. /
index_body dict Parameters that specify index settings, mappings, and aliases for newly created index. /
port int The port number to connect to the AWS OpenSearch cluster. 443
use_ssl bool A boolean flag indicating whether to use SSL/TLS for the connection to the OpenSearch cluster. True
verify_certs bool A boolean flag indicating whether to verify SSL certificates when connecting to the OpenSearch cluster. True
pool_maxsize int The maximum size of the connection pool to the AWS OpenSearch cluster. 20

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(...)

dataset.write(
    "index_aws_opensearch",
    arguments={
        # Add arguments
        # "host": ,
        # "region": ,
        # "index_name": ,
        # "index_body": {},
        # "port": 443,
        # "use_ssl": True,
        # "verify_certs": True,
        # "pool_maxsize": 20,
    },
)
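
The index_body follows the regular OpenSearch index-creation body. As a hedged sketch for a k-NN index (the field names and the embedding dimension are assumptions and must match your data):

index_body = {
    "settings": {"index": {"knn": True}},  # enable k-NN search on the index
    "mappings": {
        "properties": {
            "embedding": {"type": "knn_vector", "dimension": 384},  # assumed field name and dimension
            "text": {"type": "text"},
        },
    },
}

This dictionary can then be passed as the index_body argument in the write call shown above.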

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Index Qdrant

Description#

A Fondant component to load textual data and embeddings into a Qdrant database. Note: a Qdrant collection has to be created in advance with the appropriate configuration, see https://qdrant.tech/documentation/concepts/collections/.
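
As a minimal sketch of that prerequisite (assuming the qdrant-client package and 384-dimensional embeddings; the collection name and vector size are placeholders):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")  # hypothetical local instance

# Create the collection once, before running the Fondant workflow.
client.create_collection(
    collection_name="my_collection",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)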

Inputs / outputs#

Consumes#

This component consumes:

  • text: string
  • embedding: list

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
collection_name str The name of the Qdrant collection to upsert data into. /
location str The location of the Qdrant instance. /
batch_size int The batch size to use when uploading points to Qdrant. 64
parallelism int The number of parallel workers to use when uploading points to Qdrant. 1
url str Either host or str of 'Optional[scheme], host, Optional[port], Optional[prefix]'. /
port int Port of the REST API interface. 6333
grpc_port int Port of the gRPC interface. 6334
prefer_grpc bool If true - use gRPC interface whenever possible in custom methods. /
https bool If true - use HTTPS(SSL) protocol. /
api_key str API key for authentication in Qdrant Cloud. /
prefix str If set, add prefix to the REST URL path. /
timeout int Timeout for API requests. /
host str Host name of Qdrant service. If url and host are not set, defaults to 'localhost'. /
path str Persistence path for QdrantLocal. Eg. local_data/qdrant /
force_disable_check_same_thread bool Force disable check_same_thread for QdrantLocal sqlite connection. /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(...)

dataset.write(
    "index_qdrant",
    arguments={
        # Add arguments
        # "collection_name": ,
        # "location": ,
        # "batch_size": 64,
        # "parallelism": 1,
        # "url": ,
        # "port": 6333,
        # "grpc_port": 6334,
        # "prefer_grpc": False,
        # "https": False,
        # "api_key": ,
        # "prefix": ,
        # "timeout": 0,
        # "host": ,
        # "path": ,
        # "force_disable_check_same_thread": False,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Index Weaviate

Description#

Component that takes text or embeddings of text snippets and indexes them into a Weaviate vector database.

To run the component with text snippets as input, the component needs to be connected to a previous component that outputs text snippets.

Running with text as input#

import pyarrow as pa
from fondant.dataset import Dataset

dataset = Dataset.read(...)

dataset.write(
    "index_weaviate",
    arguments={
        "weaviate_url": "http://localhost:8080",
        "class_name": "my_class",
        "vectorizer": "text2vec-openai",
        "additional_headers" : {
            "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY"
        }
    },
    consumes={
        "text": pa.string()
    }
)

Running with embedding as input#

import pyarrow as pa
from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "embed_text",
    arguments={...},
    consumes={
        "text": "text",
    },
)

dataset.write(
    "index_weaviate",
    arguments={
        "weaviate_url": "http://localhost:8080",
        "class_name": "my_class",
    },
    consumes={
        "embedding": pa.list_(pa.float32())
    }
)

Inputs / outputs#

Consumes#

This component can consume additional fields. The consumes argument defines a mapping that updates the fields consumed by the operation as defined in the component spec: the keys are the names of the fields to be received by the component, while the values are the names of the fields to map from the input dataset.

See the usage example below on how to define a field name for additional fields.

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
weaviate_url str The URL of the weaviate instance. http://localhost:8080
batch_size int The batch size to be used. Parameter of weaviate.batch.Batch().configure(). 100
dynamic bool Whether to use dynamic batching or not. Parameter of weaviate.batch.Batch().configure(). True
num_workers int The maximal number of concurrent threads to run batch import. Parameter of weaviate.batch.Batch().configure(). 2
overwrite bool Whether to overwrite / re-create the existing weaviate class and its embeddings. /
class_name str The name of the weaviate class that will be created and used to store the embeddings. Should follow the weaviate naming conventions. /
additional_config dict Additional configuration to pass to the weaviate client. /
additional_headers dict Additional headers to pass to the weaviate client. /
vectorizer str Which vectorizer to use. You can find the available vectorizers in the weaviate documentation: https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules. Set this to None if you want to insert your own embeddings. /
module_config dict The configuration of the vectorizer module. You can find the available configuration options in the weaviate documentation: https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules. Set this to None if you want to insert your own embeddings. /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(...)

dataset.write(
    "index_weaviate",
    arguments={
        # Add arguments
        # "weaviate_url": "http://localhost:8080",
        # "batch_size": 100,
        # "dynamic": True,
        # "num_workers": 2,
        # "overwrite": False,
        # "class_name": ,
        # "additional_config": {},
        # "additional_headers": {},
        # "vectorizer": ,
        # "module_config": {},
    },
    consumes={
         <field_name>: <dataset_field_name>,
         ..., # Add fields
     },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Write to file

Description#

A Fondant component to write a dataset to file on a local machine or to a cloud storage bucket. The dataset can be written as csv or parquet.

Inputs / outputs#

Consumes#

This component can consume additional fields. The consumes argument defines a mapping that updates the fields consumed by the operation as defined in the component spec: the keys are the names of the fields to be received by the component, while the values are the names of the fields to map from the input dataset.

See the usage example below on how to define a field name for additional fields.

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
path str Path to store the dataset, whether it's a local path or a cloud storage bucket, must be specified. A separate filename will be generated for each partition. If you are using the local runner and export the data to a local directory, ensure that you mount the path to the directory using the --extra-volumes argument. /
format str Format for storing the dataframe, either csv or parquet. Parquet is used by default. The CSV files contain the column names as a header and use a comma as a delimiter. parquet

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(...)

dataset.write(
    "write_to_file",
    arguments={
        # Add arguments
        # "path": ,
        # "format": "parquet",
    },
    consumes={
         <field_name>: <dataset_field_name>,
         ..., # Add fields
     },
)
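
For example, a hedged sketch that writes a text field to CSV files in a local directory (the path and field name are placeholders; remember to mount the path with --extra-volumes when using the local runner):

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(...)

dataset.write(
    "write_to_file",
    arguments={
        "path": "/data/export",  # hypothetical local path
        "format": "csv",
    },
    consumes={
        "text": "text",  # write the dataset field "text" as the output column "text"
    },
)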

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Write to Hugging Face hub

Description#

Component that writes a dataset to the Hugging Face hub

Inputs / outputs#

Consumes#

This component can consume additional fields. The consumes argument defines a mapping that updates the fields consumed by the operation as defined in the component spec: the keys are the names of the fields to be received by the component, while the values are the names of the fields to map from the input dataset.

See the usage example below on how to define a field name for additional fields.

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
hf_token str The hugging face token used to write to the hub /
username str The username under which to upload the dataset /
dataset_name str The name of the dataset to upload /
image_column_names list A list containing the image column names. Used to format the image to the HF hub format /
column_name_mapping dict Mapping of the consumed fondant column names to the written hub column names /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(...)

dataset.write(
    "write_to_hf_hub",
    arguments={
        # Add arguments
        # "hf_token": ,
        # "username": ,
        # "dataset_name": ,
        # "image_column_names": [],
        # "column_name_mapping": {},
    },
    consumes={
         <field_name>: <dataset_field_name>,
         ..., # Add fields
     },
)
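
A hedged sketch of pushing an image dataset to the hub (the token, username, dataset name, and field names are placeholders):

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(...)

dataset.write(
    "write_to_hf_hub",
    arguments={
        "hf_token": "hf_...",  # hypothetical token with write access
        "username": "my-username",
        "dataset_name": "my-dataset",
        "image_column_names": ["image"],
    },
    consumes={
        "image": "image",      # map the dataset field "image" to the hub column "image"
        "caption": "caption",
    },
)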

Image processing

Caption images

Description#

This component captions images using a BLIP model from the Hugging Face hub

Inputs / outputs#

Consumes#

This component consumes:

  • image: binary

Produces#

This component produces:

  • caption: string

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
model_id str Id of the BLIP model on the Hugging Face hub Salesforce/blip-image-captioning-base
batch_size int Batch size to use for inference 8
max_new_tokens int Maximum token length of each caption 50

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "caption_images",
    arguments={
        # Add arguments
        # "model_id": "Salesforce/blip-image-captioning-base",
        # "batch_size": 8,
        # "max_new_tokens": 50,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Crop images

Description#

This component crops out image borders. This is typically useful when working with graphical images that have single-color borders (e.g. logos, icons, etc.).

The component takes an image and calculates which color is most present in the border. It then crops the image in order to minimize this single-color border. The padding argument will add extra border to the image before cropping it, in order to avoid cutting off parts of the image. The resulting crop will always be square. If a crop is not possible, the component will return the original image.

Examples#

Examples of image cropping by removing the single-color border. Left side is the original image, right side is the border-cropped image.

Inputs / outputs#

Consumes#

This component consumes:

  • images_data: binary

Produces#

This component produces:

  • image: binary
  • image_width: int32
  • image_height: int32

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
cropping_threshold int Threshold parameter used for detecting borders. A lower (negative) parameter results in a more performant border detection, but can cause overcropping. -30
padding int Padding for the image cropping. The padding is added to all borders of the image. 10

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "crop_images",
    arguments={
        # Add arguments
        # "cropping_threshold": -30,
        # "padding": 10,
    },
)

Embed images

Description#

Component that generates CLIP embeddings from images

Inputs / outputs#

Consumes#

This component consumes:

  • image: binary

Produces#

This component produces:

  • embedding: list

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
model_id str Model id of a CLIP model on the Hugging Face hub openai/clip-vit-large-patch14
batch_size int Batch size to use when embedding 8

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "embed_images",
    arguments={
        # Add arguments
        # "model_id": "openai/clip-vit-large-patch14",
        # "batch_size": 8,
    },
)

Extract image resolution

Description#

Component that extracts image resolution data from the images

Inputs / outputs#

Consumes#

This component consumes:

  • image: binary

Produces#

This component produces:

  • image: binary
  • image_width: int32
  • image_height: int32

Arguments#

This component takes no arguments.

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "extract_image_resolution",
    arguments={
        # Add arguments
    },
)

Filter image resolution

Description#

Component that filters images based on minimum size and max aspect ratio

Inputs / outputs#

Consumes#

This component consumes:

  • image_width: int32
  • image_height: int32

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
min_image_dim int Minimum image dimension /
max_aspect_ratio float Maximum aspect ratio /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "filter_image_resolution",
    arguments={
        # Add arguments
        # "min_image_dim": 0,
        # "max_aspect_ratio": 0.0,
    },
)

Resize images

Description#

Component that resizes images based on given width and height

Inputs / outputs#

Consumes#

This component consumes:

  • image: binary

Produces#

This component produces:

  • image: binary

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
resize_width int The width to resize to /
resize_height int The height to resize to /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "resize_images",
    arguments={
        # Add arguments
        # "resize_width": 0,
        # "resize_height": 0,
    },
)

Segment images

Description#

Component that creates segmentation masks for images using a model from the Hugging Face hub

Inputs / outputs#

Consumes#

This component consumes:

  • image: binary

Produces#

This component produces:

  • segmentation_map: binary

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
model_id str id of the model on the Hugging Face hub openmmlab/upernet-convnext-small
batch_size int batch size to use 8

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "segment_images",
    arguments={
        # Add arguments
        # "model_id": "openmmlab/upernet-convnext-small",
        # "batch_size": 8,
    },
)

Text processing

Chunk text

Description#

Component that chunks text into smaller segments

This component takes a body of text and splits it into small chunks. The id of the returned dataset consists of the id of the original document followed by the chunk index.

Different chunking strategies can be used. The default is to use the "recursive" strategy which recursively splits the text into smaller chunks until the chunk size is reached.

More information about the different chunking strategies can be found here:

  • https://python.langchain.com/docs/modules/data_connection/document_transformers/
  • https://www.pinecone.io/learn/chunking-strategies/

Inputs / outputs#

Consumes#

This component consumes:

  • text: string

Produces#

This component produces:

  • text: string
  • original_document_id: string

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
chunk_strategy str The strategy to use for chunking the text. One of ['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter', 'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter', 'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter', 'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character', 'NLTK', 'SpaCy'] RecursiveCharacterTextSplitter
chunk_kwargs dict The arguments to pass to the chunking strategy /
language_text_splitter str The programming language to use for splitting text into sentences if "language" is selected as the splitter. Check https://python.langchain.com/docs/modules/data_connection/document_transformers/code_splitter for more information on supported languages. /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "chunk_text",
    arguments={
        # Add arguments
        # "chunk_strategy": "RecursiveCharacterTextSplitter",
        # "chunk_kwargs": {},
        # "language_text_splitter": ,
    },
)
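
For the default recursive strategy, the chunk size and overlap are passed through chunk_kwargs, which are forwarded to the underlying langchain splitter. A hedged sketch (the values are illustrative):

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "chunk_text",
    arguments={
        "chunk_strategy": "RecursiveCharacterTextSplitter",
        # forwarded to RecursiveCharacterTextSplitter; sizes are illustrative
        "chunk_kwargs": {"chunk_size": 512, "chunk_overlap": 32},
    },
)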

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Embed text

Description#

Component that generates embeddings of text passages.

Inputs / outputs#

Consumes#

This component consumes:

  • text: string

Produces#

This component produces:

  • embedding: list

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
model_provider str The provider of the model - corresponding to langchain embedding classes. Currently the following providers are supported: aleph_alpha, cohere, huggingface, openai, vertexai. huggingface
model str The model to generate embeddings from. Choose an available model name to pass to the model provider's langchain embedding class. /
api_keys dict The API keys to use for the model provider that are written to environment variables. Pass only the keys required by the model provider, or conveniently pass all keys you will ever need. Pay attention to how you name the dictionary keys so that they can be used by the model provider. /
auth_kwargs dict Additional keyword arguments required for api initialization/authentication. /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "embed_text",
    arguments={
        # Add arguments
        # "model_provider": "huggingface",
        # "model": ,
        # "api_keys": {},
        # "auth_kwargs": {},
    },
)
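
For example, to embed with an OpenAI model, the API key is passed via api_keys and written to an environment variable with the name the provider expects. A hedged sketch (the key name and model name follow the OpenAI/langchain convention and are assumptions):

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "embed_text",
    arguments={
        "model_provider": "openai",
        "model": "text-embedding-ada-002",         # assumed model name
        "api_keys": {"OPENAI_API_KEY": "sk-..."},  # key name expected by the OpenAI provider
    },
)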

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Filter language

Description#

A component that filters text based on the provided language.

Inputs / outputs#

Consumes#

This component consumes:

  • text: string

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
language str A valid language code or identifier (e.g., "en", "fr", "de"). en

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "filter_language",
    arguments={
        # Add arguments
        # "language": "en",
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Filter text length

Description#

A component that filters out text based on its length

Inputs / outputs#

Consumes#

This component consumes:

  • text: string

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
min_characters_length int Minimum number of characters /
min_words_length int Minimum number of words /

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "filter_text_length",
    arguments={
        # Add arguments
        # "min_characters_length": 0,
        # "min_words_length": 0,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Generate minhash

Description#

A component that generates minhashes of text.

Inputs / outputs#

Consumes#

This component consumes:

  • text: string

Produces#

This component produces:

  • minhash: list

Arguments#

The component takes the following arguments to alter its behavior:

argument type description default
shingle_ngram_size int Define size of ngram used for the shingle generation 3

Usage#

You can apply this component to your dataset using the following code:

from fondant.dataset import Dataset


dataset = Dataset.read(...)

dataset = dataset.apply(
    "generate_minhash",
    arguments={
        # Add arguments
        # "shingle_ngram_size": 3,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test