Component Hub#

Below you can find the reusable components offered by Fondant.

Data loading

Load from csv

Description#

Component that loads a dataset from a csv file

Inputs / outputs#

Consumes#

This component does not consume data.

Produces#

This component can produce additional fields. The produces argument defines a mapping that updates the fields produced by the operation as defined in the component spec: the keys are the names of the fields to be produced by the component, while the values are the types that should be used to write the output dataset.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| dataset_uri | str | The remote path to the csv file(s) containing the dataset | / |
| column_separator | str | Define the column separator of the csv file | / |
| column_name_mapping | dict | Mapping of the consumed dataset | / |
| n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
| index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
    "load_from_csv",
    arguments={
        # Add arguments
        # "dataset_uri": ,
        # "column_separator": ,
        # "column_name_mapping": {},
        # "n_rows_to_load": 0,
        # "index_column": ,
    },
    produces={
         <field_name>: <field_schema>,
         ..., # Add fields
    },
)
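For illustration, here is a filled-in version of the snippet above. The bucket path, separator, and field name are hypothetical, and the field schema is given as a pyarrow type, which is how Fondant typically expresses field schemas:

import pyarrow as pa

# `pipeline` is the Pipeline instance created above
dataset = pipeline.read(
    "load_from_csv",
    arguments={
        "dataset_uri": "gs://my-bucket/datasets/reviews.csv",  # hypothetical remote path
        "column_separator": ";",
        "n_rows_to_load": 1000,  # load only a sample while testing
    },
    produces={
        "text": pa.string(),  # expose the CSV column "text" as a string field
    },
)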

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Load from files

Description#

This component loads data from files in a local or remote (AWS S3, Azure Blob storage, GCS) location. It supports the following formats: .zip, gzip, tar and tar.gz.

Inputs / outputs#

Consumes#

This component does not consume data.

Produces#

This component produces:

  • filename: string
  • content: binary

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| directory_uri | str | Local or remote path to the directory containing the files | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
    "load_from_files",
    arguments={
        # Add arguments
        # "directory_uri": ,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Load from Hugging Face hub

Description#

Component that loads a dataset from the Hugging Face hub

Inputs / outputs#

Consumes#

This component does not consume data.

Produces#

This component can produce additional fields. The produces argument defines a mapping that updates the fields produced by the operation as defined in the component spec: the keys are the names of the fields to be produced by the component, while the values are the types that should be used to write the output dataset.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| dataset_name | str | Name of dataset on the hub | / |
| column_name_mapping | dict | Mapping of the consumed hub dataset to fondant column names | / |
| image_column_names | list | Optional argument, a list containing the original image column names in case the dataset on the hub contains them. Used to format the image from HF hub format to a byte string. | / |
| n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
| index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
    "load_from_hf_hub",
    arguments={
        # Add arguments
        # "dataset_name": ,
        # "column_name_mapping": {},
        # "image_column_names": [],
        # "n_rows_to_load": 0,
        # "index_column": ,
    },
    produces={
         <field_name>: <field_schema>,
         ..., # Add fields
    },
)
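As an illustrative sketch, the snippet below loads a hypothetical image-caption dataset from the hub. The dataset name and column names are placeholders, and the assumed direction of column_name_mapping is hub column name to Fondant field name:

import pyarrow as pa

dataset = pipeline.read(
    "load_from_hf_hub",
    arguments={
        "dataset_name": "some-user/some-image-dataset",  # placeholder dataset id
        "column_name_mapping": {"image": "image", "text": "caption"},  # hub column -> Fondant field (assumed)
        "image_column_names": ["image"],  # convert HF image format to byte strings
        "n_rows_to_load": 100,
    },
    produces={
        "image": pa.binary(),
        "caption": pa.string(),
    },
)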

Load from parquet

Description#

Component that loads a dataset from a parquet uri

Inputs / outputs#

Consumes#

This component does not consume data.

Produces#

This component can produce additional fields. The produces argument defines a mapping that updates the fields produced by the operation as defined in the component spec: the keys are the names of the fields to be produced by the component, while the values are the types that should be used to write the output dataset.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| dataset_uri | str | The remote path to the parquet file/folder containing the dataset | / |
| column_name_mapping | dict | Mapping of the consumed dataset | / |
| n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
| index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
    "load_from_parquet",
    arguments={
        # Add arguments
        # "dataset_uri": ,
        # "column_name_mapping": {},
        # "n_rows_to_load": 0,
        # "index_column": ,
    },
    produces={
         <field_name>: <field_schema>,
         ..., # Add fields
    },
)

Load from pdf

Description#

Load PDF data stored locally or remotely using LangChain loaders.

Inputs / outputs#

Consumes#

This component does not consume data.

Produces#

This component produces:

  • pdf_path: string
  • file_name: string
  • text: string

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| pdf_path | str | The path to a pdf file or a folder containing pdf files to load. Can be a local path or a remote path. If the path is remote, the loader class will be determined by the scheme of the path. | / |
| n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
| index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |
| n_partitions | int | Number of partitions of the dask dataframe. If not specified, the number of partitions will be equal to the number of CPU cores. Set to high values if the data is large and the pipeline is running out of memory. | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
    "load_from_pdf",
    arguments={
        # Add arguments
        # "pdf_path": ,
        # "n_rows_to_load": 0,
        # "index_column": ,
        # "n_partitions": 0,
    },
)
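For example, a minimal sketch that loads all PDFs from a hypothetical S3 prefix; the path and partition count are placeholders:

dataset = pipeline.read(
    "load_from_pdf",
    arguments={
        "pdf_path": "s3://my-bucket/reports/",  # hypothetical folder containing pdf files
        "n_rows_to_load": 10,                   # small sample for testing
        "n_partitions": 4,                      # increase if the pipeline runs out of memory
    },
)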

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Load with LlamaHub

Description#

Load data using a LlamaHub loader. For available loaders, check the LlamaHub.

Inputs / outputs#

Consumes#

This component does not consume data.

Produces#

This component can produce additional fields. The produces argument defines a mapping that updates the fields produced by the operation as defined in the component spec: the keys are the names of the fields to be produced by the component, while the values are the types that should be used to write the output dataset.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| loader_class | str | The name of the LlamaIndex loader class to use. Make sure to provide the name and not the id. The name is passed to llama_index.download_loader to download the specified loader. | / |
| loader_kwargs | str | Keyword arguments to pass when instantiating the loader class. Check the documentation of the loader to check which arguments it accepts. | / |
| load_kwargs | str | Keyword arguments to pass to the .load() method of the loader. Check the documentation of the loader to check which arguments it accepts. | / |
| additional_requirements | list | Some loaders require additional dependencies to be installed. You can specify those here. Use a format accepted by pip install, e.g. "pypdf" or "pypdf==3.17.1". Unfortunately, additional requirements for LlamaIndex loaders are not documented well, but if a dependency is missing, a clear error message will be thrown. | / |
| n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
| index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
    "load_with_llamahub",
    arguments={
        # Add arguments
        # "loader_class": ,
        # "loader_kwargs": ,
        # "load_kwargs": ,
        # "additional_requirements": [],
        # "n_rows_to_load": 0,
        # "index_column": ,
    },
    produces={
         <field_name>: <field_schema>,
         ..., # Add fields
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Data retrieval

Download images

Description#

Component that downloads images from a list of URLs.

This component takes in image URLs as input and downloads the images, along with some metadata (like their height and width). The images are stored in a new column as bytes objects. This component also resizes the images using the resizer function from the img2dataset library.

Inputs / outputs#

Consumes#

This component consumes:

  • image_url: string

Produces#

This component produces:

  • image: binary
  • image_width: int32
  • image_height: int32

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| timeout | int | Maximum time (in seconds) to wait when trying to download an image | 10 |
| retries | int | Number of times to retry downloading an image if it fails. | / |
| n_connections | int | Number of concurrent connections opened per process. Decrease this number if you are running into timeout errors. A lower number of connections can increase the success rate but lower the throughput. | 100 |
| image_size | int | Size of the images after resizing. | 256 |
| resize_mode | str | Resize mode to use. One of "no", "keep_ratio", "center_crop", "border". | border |
| resize_only_if_bigger | bool | If True, resize only if image is bigger than image_size. | / |
| min_image_size | int | Minimum size of the images. | / |
| max_aspect_ratio | float | Maximum aspect ratio of the images. | 99.9 |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "download_images",
    arguments={
        # Add arguments
        # "timeout": 10,
        # "retries": 0,
        # "n_connections": 100,
        # "image_size": 256,
        # "resize_mode": "border",
        # "resize_only_if_bigger": False,
        # "min_image_size": 0,
        # "max_aspect_ratio": 99.9,
    },
)
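As an illustration, the sketch below applies the component to a dataset that already contains an image_url field (for instance one produced by a LAION retrieval component); the argument values are just examples:

dataset = dataset.apply(
    "download_images",
    arguments={
        "timeout": 10,
        "retries": 2,
        "image_size": 512,
        "resize_mode": "center_crop",
        "resize_only_if_bigger": True,
    },
)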

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

retrieve_from_weaviate

Description#

Component that retrieves chunks from a weaviate vector database

Inputs / outputs#

Consumes#

This component consumes:

  • embedding: list

Produces#

This component produces:

  • retrieved_chunks: list

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| weaviate_url | str | The URL of the weaviate instance. | http://localhost:8080 |
| class_name | str | The name of the weaviate class that will be queried | / |
| top_k | int | Number of chunks to retrieve | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "retrieve_from_weaviate",
    arguments={
        # Add arguments
        # "weaviate_url": "http://localhost:8080",
        # "class_name": ,
        # "top_k": 0,
    },
)
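For illustration, a filled-in sketch assuming a local weaviate instance and a class that was created by an earlier indexing step; the class name is hypothetical:

dataset = dataset.apply(
    "retrieve_from_weaviate",
    arguments={
        "weaviate_url": "http://localhost:8080",
        "class_name": "MyDocuments",  # hypothetical class created by the index_weaviate component
        "top_k": 5,                   # return the 5 most similar chunks per embedding
    },
)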

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Retrieve LAION by embedding

Description#

This component retrieves image URLs from LAION-5B based on a set of CLIP embeddings. It can be used to find images similar to the embedded images / captions.

Inputs / outputs#

Consumes#

This component consumes:

  • embedding: list

Produces#

This component produces:

  • image_url: string
  • embedding_id: string

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| num_images | int | Number of images to retrieve for each prompt | / |
| aesthetic_score | int | Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). | 9 |
| aesthetic_weight | float | Weight of the aesthetic embedding when added to the query, between 0 and 1 | 0.5 |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "retrieve_laion_by_embedding",
    arguments={
        # Add arguments
        # "num_images": 0,
        # "aesthetic_score": 9,
        # "aesthetic_weight": 0.5,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Retrieve LAION by prompt

Description#

This component retrieves image URLs from the LAION-5B dataset based on text prompts. The retrieval itself is done based on CLIP embeddings similarity between the prompt sentences and the captions in the LAION dataset.

This component doesn’t return the actual images, only URLs.

Inputs / outputs#

Consumes#

This component consumes:

  • prompt: string

Produces#

This component produces:

  • image_url: string
  • prompt_id: string

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| num_images | int | Number of images to retrieve for each prompt | / |
| aesthetic_score | int | Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). | 9 |
| aesthetic_weight | float | Weight of the aesthetic embedding when added to the query, between 0 and 1 | 0.5 |
| url | str | The url of the backend clip retrieval service, defaults to the public service | https://knn.laion.ai/knn-service |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "retrieve_laion_by_prompt",
    arguments={
        # Add arguments
        # "num_images": 0,
        # "aesthetic_score": 9,
        # "aesthetic_weight": 0.5,
        # "url": "https://knn.laion.ai/knn-service",
    },
)
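As an example, the sketch below retrieves a small number of URLs per prompt while slightly biasing the query towards aesthetic images; the values are illustrative only:

dataset = dataset.apply(
    "retrieve_laion_by_prompt",
    arguments={
        "num_images": 10,        # URLs to retrieve per prompt
        "aesthetic_score": 7,
        "aesthetic_weight": 0.3,
        # "url" is omitted, so the public knn.laion.ai service is used
    },
)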

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Data writing

Index AWS OpenSearch

Description#

Component that takes embeddings of text snippets and indexes them into an AWS OpenSearch vector database.

Inputs / outputs#

Consumes#

This component consumes:

  • text: string
  • embedding: list

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| host | str | The Cluster endpoint of the AWS OpenSearch cluster where the embeddings will be indexed. E.g. "my-test-domain.us-east-1.aoss.amazonaws.com" | / |
| region | str | The AWS region where the OpenSearch cluster is located. If not specified, the default region will be used. | / |
| index_name | str | The name of the index in the AWS OpenSearch cluster where the embeddings will be stored. | / |
| index_body | dict | Parameters that specify index settings, mappings, and aliases for newly created index. | / |
| port | int | The port number to connect to the AWS OpenSearch cluster. | 443 |
| use_ssl | bool | A boolean flag indicating whether to use SSL/TLS for the connection to the OpenSearch cluster. | True |
| verify_certs | bool | A boolean flag indicating whether to verify SSL certificates when connecting to the OpenSearch cluster. | True |
| pool_maxsize | int | The maximum size of the connection pool to the AWS OpenSearch cluster. | 20 |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(...)

dataset.write(
    "index_aws_opensearch",
    arguments={
        # Add arguments
        # "host": ,
        # "region": ,
        # "index_name": ,
        # "index_body": {},
        # "port": 443,
        # "use_ssl": True,
        # "verify_certs": True,
        # "pool_maxsize": 20,
    },
)
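For illustration, a hedged sketch of a write step; the endpoint, region, index name, and index body are placeholders and should match your own OpenSearch setup:

dataset.write(
    "index_aws_opensearch",
    arguments={
        "host": "my-test-domain.us-east-1.aoss.amazonaws.com",  # placeholder cluster endpoint
        "region": "us-east-1",
        "index_name": "my-embeddings-index",
        "index_body": {"settings": {"index": {"knn": True}}},  # illustrative k-NN index settings
        "port": 443,
        "use_ssl": True,
    },
)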

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Index Qdrant

Description#

A Fondant component to load textual data and embeddings into a Qdrant database. NOTE: A Qdrant collection has to be created in advance with the appropriate configuration. See https://qdrant.tech/documentation/concepts/collections/ for details.

Inputs / outputs#

Consumes#

This component consumes:

  • text: string
  • embedding: list

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| collection_name | str | The name of the Qdrant collection to upsert data into. | / |
| location | str | The location of the Qdrant instance. | / |
| batch_size | int | The batch size to use when uploading points to Qdrant. | 64 |
| parallelism | int | The number of parallel workers to use when uploading points to Qdrant. | 1 |
| url | str | Either host or str of 'Optional[scheme], host, Optional[port], Optional[prefix]'. | / |
| port | int | Port of the REST API interface. | 6333 |
| grpc_port | int | Port of the gRPC interface. | 6334 |
| prefer_grpc | bool | If true - use gRPC interface whenever possible in custom methods. | / |
| https | bool | If true - use HTTPS(SSL) protocol. | / |
| api_key | str | API key for authentication in Qdrant Cloud. | / |
| prefix | str | If set, add prefix to the REST URL path. | / |
| timeout | int | Timeout for API requests. | / |
| host | str | Host name of Qdrant service. If url and host are not set, defaults to 'localhost'. | / |
| path | str | Persistence path for QdrantLocal. E.g. local_data/qdrant | / |
| force_disable_check_same_thread | bool | Force disable check_same_thread for QdrantLocal sqlite connection. | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(...)

dataset.write(
    "index_qdrant",
    arguments={
        # Add arguments
        # "collection_name": ,
        # "location": ,
        # "batch_size": 64,
        # "parallelism": 1,
        # "url": ,
        # "port": 6333,
        # "grpc_port": 6334,
        # "prefer_grpc": False,
        # "https": False,
        # "api_key": ,
        # "prefix": ,
        # "timeout": 0,
        # "host": ,
        # "path": ,
        # "force_disable_check_same_thread": False,
    },
)
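For illustration, a sketch writing to a hypothetical Qdrant Cloud instance; the collection (which must already exist), URL, and API key are placeholders:

dataset.write(
    "index_qdrant",
    arguments={
        "collection_name": "my_collection",                   # must be created in Qdrant beforehand
        "url": "https://my-cluster.example.cloud.qdrant.io",  # placeholder cluster URL
        "api_key": "QDRANT_API_KEY_PLACEHOLDER",              # placeholder, don't hardcode real keys
        "batch_size": 64,
        "parallelism": 2,
    },
)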

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Index Weaviate

Description#

Component that takes embeddings of text snippets and indexes them into a weaviate vector database.

Inputs / outputs#

Consumes#

This component consumes:

  • text: string
  • embedding: list

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| weaviate_url | str | The URL of the weaviate instance. | http://localhost:8080 |
| batch_size | int | The batch size to be used. Parameter of weaviate.batch.Batch().configure(). | 100 |
| dynamic | bool | Whether to use dynamic batching or not. Parameter of weaviate.batch.Batch().configure(). | True |
| num_workers | int | The maximal number of concurrent threads to run batch import. Parameter of weaviate.batch.Batch().configure(). | 2 |
| overwrite | bool | Whether to overwrite / re-create the existing weaviate class and its embeddings. | / |
| class_name | str | The name of the weaviate class that will be created and used to store the embeddings. Should follow the weaviate naming conventions. | / |
| vectorizer | str | Which vectorizer to use. You can find the available vectorizers in the weaviate documentation: https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules. Set this to None if you want to insert your own embeddings. | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(...)

dataset.write(
    "index_weaviate",
    arguments={
        # Add arguments
        # "weaviate_url": "http://localhost:8080",
        # "batch_size": 100,
        # "dynamic": True,
        # "num_workers": 2,
        # "overwrite": False,
        # "class_name": ,
        # "vectorizer": ,
    },
)
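As an illustrative sketch, assuming a local weaviate instance and embeddings computed earlier in the pipeline (so no vectorizer module is needed); the class name is a placeholder:

dataset.write(
    "index_weaviate",
    arguments={
        "weaviate_url": "http://localhost:8080",
        "class_name": "MyDocuments",  # placeholder, follow the weaviate naming conventions
        "overwrite": True,            # re-create the class if it already exists
        "batch_size": 100,
        "vectorizer": None,           # embeddings are provided by the pipeline itself
    },
)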

Write to Hugging Face hub

Description#

Component that writes a dataset to the Hugging Face hub

Inputs / outputs#

Consumes#

This component can consume additional fields. The consumes argument defines a mapping that updates the fields consumed by the operation as defined in the component spec: the keys are the names of the fields to be received by the component, while the values are the names of the fields to map from the input dataset.

See the usage example below on how to define a field name for additional fields.

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| hf_token | str | The hugging face token used to write to the hub | / |
| username | str | The username under which to upload the dataset | / |
| dataset_name | str | The name of the dataset to upload | / |
| image_column_names | list | A list containing the image column names. Used to format the images to HF hub format | / |
| column_name_mapping | dict | Mapping of the consumed fondant column names to the written hub column names | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(...)

dataset.write(
    "write_to_hf_hub",
    arguments={
        # Add arguments
        # "hf_token": ,
        # "username": ,
        # "dataset_name": ,
        # "image_column_names": [],
        # "column_name_mapping": {},
    },
    consumes={
         <field_name>: <dataset_field_name>,
         ..., # Add fields
     },
)
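For illustration, a filled-in sketch; the token, username, and dataset name are placeholders, and the consumes mapping simply forwards same-named fields from the dataset:

dataset.write(
    "write_to_hf_hub",
    arguments={
        "hf_token": "hf_placeholder_token",      # placeholder, use your own token
        "username": "my-hf-username",            # hypothetical account
        "dataset_name": "my-processed-dataset",
        "image_column_names": ["image"],
    },
    consumes={
        "image": "image",      # component field: dataset field
        "caption": "caption",
    },
)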

Image processing

Caption images

Description#

This component captions images using a BLIP model from the Hugging Face hub

Inputs / outputs#

Consumes#

This component consumes:

  • image: binary

Produces#

This component produces:

  • caption: string

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| model_id | str | Id of the BLIP model on the Hugging Face hub | Salesforce/blip-image-captioning-base |
| batch_size | int | Batch size to use for inference | 8 |
| max_new_tokens | int | Maximum token length of each caption | 50 |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "caption_images",
    arguments={
        # Add arguments
        # "model_id": "Salesforce/blip-image-captioning-base",
        # "batch_size": 8,
        # "max_new_tokens": 50,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Crop images

Description#

This component crops out image borders. This is typically useful when working with graphical images that have single-color borders (e.g. logos, icons, etc.).

The component takes an image and calculates which color is most present in the border. It then crops the image in order to minimize this single-color border. The padding argument will add extra border to the image before cropping it, in order to avoid cutting off parts of the image. The resulting crop will always be square. If a crop is not possible, the component will return the original image.

Examples#

Examples of image cropping by removing the single-color border. Left side is original image, right side is border-cropped image.


Inputs / outputs#

Consumes#

This component consumes:

  • images_data: binary

Produces#

This component produces:

  • image: binary
  • image_width: int32
  • image_height: int32

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| cropping_threshold | int | Threshold parameter used for detecting borders. A lower (negative) parameter results in a more performant border detection, but can cause overcropping. | -30 |
| padding | int | Padding for the image cropping. The padding is added to all borders of the image. | 10 |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "crop_images",
    arguments={
        # Add arguments
        # "cropping_threshold": -30,
        # "padding": 10,
    },
)

Embed images

Description#

Component that generates CLIP embeddings from images

Inputs / outputs#

Consumes#

This component consumes:

  • image: binary

Produces#

This component produces:

  • embedding: list

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| model_id | str | Model id of a CLIP model on the Hugging Face hub | openai/clip-vit-large-patch14 |
| batch_size | int | Batch size to use when embedding | 8 |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "embed_images",
    arguments={
        # Add arguments
        # "model_id": "openai/clip-vit-large-patch14",
        # "batch_size": 8,
    },
)

Extract image resolution

Description#

Component that extracts image resolution data from the images

Inputs / outputs#

Consumes#

This component consumes:

  • image: binary

Produces#

This component produces:

  • image: binary
  • image_width: int32
  • image_height: int32

Arguments#

This component takes no arguments.

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "extract_image_resolution",
    arguments={
        # Add arguments
    },
)

Filter image resolution

Description#

Component that filters images based on minimum size and max aspect ratio

Inputs / outputs#

Consumes#

This component consumes:

  • image_width: int32
  • image_height: int32

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| min_image_dim | int | Minimum image dimension | / |
| max_aspect_ratio | float | Maximum aspect ratio | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "filter_image_resolution",
    arguments={
        # Add arguments
        # "min_image_dim": 0,
        # "max_aspect_ratio": 0.0,
    },
)

Resize images

Description#

Component that resizes images based on given width and height

Inputs / outputs#

Consumes#

This component consumes:

  • image: binary

Produces#

This component produces:

  • image: binary

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| resize_width | int | The width to resize to | / |
| resize_height | int | The height to resize to | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "resize_images",
    arguments={
        # Add arguments
        # "resize_width": 0,
        # "resize_height": 0,
    },
)

Segment images

Description#

Component that creates segmentation masks for images using a model from the Hugging Face hub

Inputs / outputs#

Consumes#

This component consumes:

  • image: binary

Produces#

This component produces:

  • segmentation_map: binary

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| model_id | str | Id of the model on the Hugging Face hub | openmmlab/upernet-convnext-small |
| batch_size | int | Batch size to use | 8 |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "segment_images",
    arguments={
        # Add arguments
        # "model_id": "openmmlab/upernet-convnext-small",
        # "batch_size": 8,
    },
)

Text processing

Chunk text

Description#

Component that chunks text into smaller segments

This component takes a body of text and splits it into smaller chunks. The id of the returned dataset consists of the id of the original document followed by the chunk index.

Different chunking strategies can be used. The default is to use the "recursive" strategy which recursively splits the text into smaller chunks until the chunk size is reached.

More information about the different chunking strategies can be found here:

  • https://python.langchain.com/docs/modules/data_connection/document_transformers/
  • https://www.pinecone.io/learn/chunking-strategies/

Inputs / outputs#

Consumes#

This component consumes:

  • text: string

Produces#

This component produces:

  • text: string
  • original_document_id: string

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| chunk_strategy | str | The strategy to use for chunking the text. One of ['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter', 'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter', 'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter', 'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character', 'NLTK', 'SpaCy'] | RecursiveCharacterTextSplitter |
| chunk_kwargs | dict | The arguments to pass to the chunking strategy | / |
| language_text_splitter | str | The programming language to use for splitting text into sentences if "language" is selected as the splitter. Check https://python.langchain.com/docs/modules/data_connection/document_transformers/code_splitter for more information on supported languages. | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "chunk_text",
    arguments={
        # Add arguments
        # "chunk_strategy": "RecursiveCharacterTextSplitter",
        # "chunk_kwargs": {},
        # "language_text_splitter": ,
    },
)
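For example, a sketch using the default recursive strategy with explicit chunk sizes; the chunk_kwargs shown are assumed to be forwarded to the underlying langchain splitter:

dataset = dataset.apply(
    "chunk_text",
    arguments={
        "chunk_strategy": "RecursiveCharacterTextSplitter",
        "chunk_kwargs": {"chunk_size": 512, "chunk_overlap": 32},  # passed to the splitter
    },
)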

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Embed text

Description#

Component that generates embeddings of text passages.

Inputs / outputs#

Consumes#

This component consumes:

  • text: string

Produces#

This component produces:

  • embedding: list

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| model_provider | str | The provider of the model - corresponding to langchain embedding classes. Currently the following providers are supported: aleph_alpha, cohere, huggingface, openai, vertexai. | huggingface |
| model | str | The model to generate embeddings from. Choose an available model name to pass to the model provider's langchain embedding class. | / |
| api_keys | dict | The API keys to use for the model provider, written to environment variables. Pass only the keys required by the model provider, or conveniently pass all keys you will ever need. Pay attention to how the dictionary keys are named so that they can be used by the model provider. | / |
| auth_kwargs | dict | Additional keyword arguments required for api initialization/authentication. | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "embed_text",
    arguments={
        # Add arguments
        # "model_provider": "huggingface",
        # "model": ,
        # "api_keys": {},
        # "auth_kwargs": {},
    },
)
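As an illustration, two hedged variants: a Hugging Face model that needs no key, and an OpenAI model whose key is passed via api_keys (the model names and key value are placeholders):

# variant 1: hypothetical Hugging Face embedding model, no API key required
dataset = dataset.apply(
    "embed_text",
    arguments={
        "model_provider": "huggingface",
        "model": "all-MiniLM-L6-v2",  # placeholder model name
    },
)

# variant 2: OpenAI embeddings, key written to the environment by the component
dataset = dataset.apply(
    "embed_text",
    arguments={
        "model_provider": "openai",
        "model": "text-embedding-ada-002",
        "api_keys": {"OPENAI_API_KEY": "sk-placeholder"},  # placeholder key
    },
)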

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Evaluate ragas

Description#

Component that evaluates the retriever using RAGAS

Inputs / outputs#

Consumes#

This component consumes:

  • question: string
  • retrieved_chunks: list

Produces#

This component can produce additional fields. The produces argument defines a mapping that updates the fields produced by the operation as defined in the component spec: the keys are the names of the fields to be produced by the component, while the values are the types that should be used to write the output dataset.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| llm_module_name | str | Module from which the LLM is imported. Defaults to langchain.llms | langchain.chat_models |
| llm_class_name | str | Name of the selected llm | ChatOpenAI |
| llm_kwargs | dict | Arguments of the selected llm | {'model_name': 'gpt-3.5-turbo'} |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "evaluate_ragas",
    arguments={
        # Add arguments
        # "llm_module_name": "langchain.chat_models",
        # "llm_class_name": "ChatOpenAI",
        # "llm_kwargs": {'model_name': 'gpt-3.5-turbo'},
    },
    produces={
         <field_name>: <field_schema>,
         ..., # Add fields
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Filter language

Description#

A component that filters text based on the provided language.

Inputs / outputs#

Consumes#

This component consumes:

  • text: string

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| language | str | A valid language code or identifier (e.g., "en", "fr", "de"). | en |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "filter_language",
    arguments={
        # Add arguments
        # "language": "en",
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Filter text length

Description#

A component that filters out text based on its length

Inputs / outputs#

Consumes#

This component consumes:

  • text: string

Produces#

This component does not produce data.

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| min_characters_length | int | Minimum number of characters | / |
| min_words_length | int | Minimum number of words | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "filter_text_length",
    arguments={
        # Add arguments
        # "min_characters_length": 0,
        # "min_words_length": 0,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Generate minhash

Description#

A component that generates minhashes of text.

Inputs / outputs#

Consumes#

This component consumes:

  • text: string

Produces#

This component produces:

  • minhash: list

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| shingle_ngram_size | int | Define size of ngram used for the shingle generation | 3 |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "generate_minhash",
    arguments={
        # Add arguments
        # "shingle_ngram_size": 3,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test

Normalize text

Description#

This component implements several text normalization techniques to clean and preprocess textual data:

  • Apply lowercasing: Converts all text to lowercase
  • Remove unnecessary whitespaces: Eliminates extra spaces between words, e.g. tabs
  • Apply NFC normalization: Converts characters to their canonical representation
  • Remove commonly seen patterns in webpages, following the implementation of Penedo et al.
  • Remove punctuation: Strips punctuation marks from the text

These text normalization techniques are valuable for preparing text data before using it for the training of large language models.

Inputs / outputs#

Consumes#

This component consumes:

  • text: string

Produces#

This component produces:

  • text: string

Arguments#

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| remove_additional_whitespaces | bool | If true, remove all additional whitespace such as tabs. | / |
| apply_nfc | bool | If true, apply NFC normalization | / |
| normalize_lines | bool | If true, analyze documents line-by-line and apply various rules to discard or edit lines. Used to remove common patterns in webpages, e.g. counters | / |
| do_lowercase | bool | If true, apply lowercasing | / |
| remove_punctuation | str | If true, punctuation will be removed | / |

Usage#

You can add this component to your pipeline using the following code:

from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset = dataset.apply(
    "normalize_text",
    arguments={
        # Add arguments
        # "remove_additional_whitespaces": False,
        # "apply_nfc": False,
        # "normalize_lines": False,
        # "do_lowercase": False,
        # "remove_punctuation": ,
    },
)

Testing#

You can run the tests using docker with BuildKit. From this directory, run:

docker build . --target test