Component Hub#
Below you can find the reusable components offered by Fondant.
Data loading
Load from csv
Description#
Component that loads a dataset from a csv file
Inputs / outputs#
Consumes#
This component does not consume data.
Produces#
This component can produce additional fields. See the usage example below on how to define the fields and their schemas.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
dataset_uri | str | The remote path to the csv file(s) containing the dataset | / |
column_separator | str | Define the column separator of the csv file | / |
column_name_mapping | dict | Mapping of the consumed dataset | / |
n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing dataset workflows on a small scale | / |
index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.create(
"load_from_csv",
arguments={
# Add arguments
# "dataset_uri": ,
# "column_separator": ,
# "column_name_mapping": {},
# "n_rows_to_load": 0,
# "index_column": ,
},
produces={
<field_name>: <field_schema>,
..., # Add fields
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Load from files
Description#
This component loads data from files in a local or remote (AWS S3, Azure Blob storage, GCS) location. It supports the following formats: .zip, gzip, tar and tar.gz.
Inputs / outputs#
Consumes#
This component does not consume data.
Produces#
This component produces:
- filename: string
- content: binary
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
directory_uri | str | Local or remote path to the directory containing the files | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.create(
"load_from_files",
arguments={
# Add arguments
# "directory_uri": ,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Load from Hugging Face hub
Description#
Component that loads a dataset from the Hugging Face hub
Inputs / outputs#
Consumes#
This component does not consume data.
Produces#
This component can produce additional fields. See the usage example below on how to define the fields and their schemas.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
dataset_name | str | Name of dataset on the hub | / |
column_name_mapping | dict | Mapping of the consumed hub dataset to fondant column names | / |
image_column_names | list | Optional argument, a list containing the original image column names in case the dataset on the hub contains them. Used to format the image from HF hub format to a byte string. | / |
n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |
Usage#
You can apply this component to your dataset using the following code:
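A minimal sketch, mirroring the other load components; the registered component name load_from_hf_hub is assumed here, and the argument placeholders follow the table above.
from fondant.dataset import Dataset
dataset = Dataset.create(
"load_from_hf_hub",
arguments={
# Add arguments
# "dataset_name": ,
# "column_name_mapping": {},
# "image_column_names": [],
# "n_rows_to_load": 0,
# "index_column": ,
},
produces={
<field_name>: <field_schema>,
..., # Add fields
},
)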
Load from parquet
Description#
Component that loads a dataset from a parquet URI
Inputs / outputs#
Consumes#
This component does not consume data.
Produces#
This component can produce additional fields. See the usage example below on how to define the fields and their schemas.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
dataset_uri | str | The remote path to the parquet file/folder containing the dataset | / |
column_name_mapping | dict | Mapping of the consumed dataset | / |
n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |
Usage#
You can apply this component to your dataset using the following code:
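A minimal sketch, mirroring the other load components; the registered component name load_from_parquet is assumed here, and the argument placeholders follow the table above.
from fondant.dataset import Dataset
dataset = Dataset.create(
"load_from_parquet",
arguments={
# Add arguments
# "dataset_uri": ,
# "column_name_mapping": {},
# "n_rows_to_load": 0,
# "index_column": ,
},
produces={
<field_name>: <field_schema>,
..., # Add fields
},
)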
Load from pdf
Description#
Load pdf data stored locally or remotely using langchain loaders.
Inputs / outputs#
Consumes#
This component does not consume data.
Produces#
This component produces:
- pdf_path: string
- file_name: string
- text: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
pdf_path | str | The path to a pdf file or a folder containing pdf files to load. Can be a local path or a remote path. If the path is remote, the loader class will be determined by the scheme of the path. | / |
n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale | / |
index_column | str | Column to set index to in the load component, if not specified a default globally unique index will be set | / |
n_partitions | int | Number of partitions of the dask dataframe. If not specified, the number of partitions will be equal to the number of CPU cores. Set to high values if the data is large and the pipeline is running out of memory. | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.create(
"load_from_pdf",
arguments={
# Add arguments
# "pdf_path": ,
# "n_rows_to_load": 0,
# "index_column": ,
# "n_partitions": 0,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Data retrieval
Download images
Description#
Component that downloads images from a list of URLs.
This component takes in image URLs and downloads the images, along with some metadata (like their height and width). The images are stored in a new column as bytes objects. This component also resizes the images using the resizer function from the img2dataset library.
Inputs / outputs#
Consumes#
This component consumes:
- image_url: string
Produces#
This component produces:
- image: binary
- image_width: int32
- image_height: int32
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
timeout | int | Maximum time (in seconds) to wait when trying to download an image. | 10 |
retries | int | Number of times to retry downloading an image if it fails. | / |
n_connections | int | Number of concurrent connections opened per process. Decrease this number if you are running into timeout errors. A lower number of connections can increase the success rate but lower the throughput. | 100 |
image_size | int | Size of the images after resizing. | 256 |
resize_mode | str | Resize mode to use. One of "no", "keep_ratio", "center_crop", "border". | border |
resize_only_if_bigger | bool | If True, resize only if image is bigger than image_size. | / |
min_image_size | int | Minimum size of the images. | / |
max_aspect_ratio | float | Maximum aspect ratio of the images. | 99.9 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"download_images",
arguments={
# Add arguments
# "timeout": 10,
# "retries": 0,
# "n_connections": 100,
# "image_size": 256,
# "resize_mode": "border",
# "resize_only_if_bigger": False,
# "min_image_size": 0,
# "max_aspect_ratio": 99.9,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Retrieve from FAISS by embedding
Description#
Retrieve images from a Faiss index. The component should reference a Faiss image dataset, which includes both the Faiss index and a dataset of image URLs. The input dataset contains embeddings which will be used to retrieve similar images.
Inputs / outputs#
Consumes#
This component consumes:
- embedding: list
Produces#
This component produces:
- image_url: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
url_mapping_path | str | URL of the image mapping dataset | / |
faiss_index_path | str | URL of the Faiss index | / |
num_images | int | Number of images that will be retrieved for each prompt | 2 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"retrieve_from_faiss_by_embedding",
arguments={
# Add arguments
# "url_mapping_path": ,
# "faiss_index_path": ,
# "num_images": 2,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Retrieve from FAISS by prompt
Description#
Retrieve images from a Faiss index. The component should reference a Faiss image dataset, which includes both the Faiss index and a dataset of image URLs. The input dataset consists of a list of prompts. These prompts will be embedded using a CLIP model, and similar images will be retrieved from the index.
Inputs / outputs#
Consumes#
This component consumes:
- prompt: string
Produces#
This component produces:
- image_url: string
- prompt: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
url_mapping_path | str | URL of the image mapping dataset | / |
faiss_index_path | str | URL of the Faiss index | / |
clip_model | str | CLIP model name to use for the retrieval | laion/CLIP-ViT-B-32-laion2B-s34B-b79K |
num_images | int | Number of images that will be retrieved for each prompt | 2 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"retrieve_from_faiss_by_prompt",
arguments={
# Add arguments
# "url_mapping_path": ,
# "faiss_index_path": ,
# "clip_model": "laion/CLIP-ViT-B-32-laion2B-s34B-b79K",
# "num_images": 2,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Retrieve LAION by embedding
Description#
This component retrieves image URLs from LAION-5B based on a set of CLIP embeddings. It can be used to find images similar to the embedded images / captions.
Inputs / outputs#
Consumes#
This component consumes:
- embedding: list
Produces#
This component produces:
- image_url: string
- embedding_id: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
num_images | int | Number of images to retrieve for each prompt | / |
aesthetic_score | int | Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). | 9 |
aesthetic_weight | float | Weight of the aesthetic embedding when added to the query, between 0 and 1 | 0.5 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"retrieve_laion_by_embedding",
arguments={
# Add arguments
# "num_images": 0,
# "aesthetic_score": 9,
# "aesthetic_weight": 0.5,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Retrieve LAION by prompt
Description#
This component retrieves image URLs from the LAION-5B dataset based on text prompts. The retrieval itself is done based on CLIP embeddings similarity between the prompt sentences and the captions in the LAION dataset.
This component doesn’t return the actual images, only URLs.
Inputs / outputs#
Consumes#
This component consumes:
- prompt: string
Produces#
This component produces:
- image_url: string
- prompt_id: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
num_images | int | Number of images to retrieve for each prompt | / |
aesthetic_score | int | Aesthetic embedding to add to the query embedding, between 0 and 9 (higher is prettier). | 9 |
aesthetic_weight | float | Weight of the aesthetic embedding when added to the query, between 0 and 1 | 0.5 |
url | str | The url of the backend clip retrieval service, defaults to the public service | https://knn.laion.ai/knn-service |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"retrieve_laion_by_prompt",
arguments={
# Add arguments
# "num_images": 0,
# "aesthetic_score": 9,
# "aesthetic_weight": 0.5,
# "url": "https://knn.laion.ai/knn-service",
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Data writing
Index AWS OpenSearch
Description#
Component that takes embeddings of text snippets and indexes them into an AWS OpenSearch vector database.
Inputs / outputs#
Consumes#
This component consumes:
- text: string
- embedding: list
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
host | str | The Cluster endpoint of the AWS OpenSearch cluster where the embeddings will be indexed. E.g. "my-test-domain.us-east-1.aoss.amazonaws.com" | / |
region | str | The AWS region where the OpenSearch cluster is located. If not specified, the default region will be used. | / |
index_name | str | The name of the index in the AWS OpenSearch cluster where the embeddings will be stored. | / |
index_body | dict | Parameters that specify index settings, mappings, and aliases for newly created index. | / |
port | int | The port number to connect to the AWS OpenSearch cluster. | 443 |
use_ssl | bool | A boolean flag indicating whether to use SSL/TLS for the connection to the OpenSearch cluster. | True |
verify_certs | bool | A boolean flag indicating whether to verify SSL certificates when connecting to the OpenSearch cluster. | True |
pool_maxsize | int | The maximum size of the connection pool to the AWS OpenSearch cluster. | 20 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(...)
dataset.write(
"index_aws_opensearch",
arguments={
# Add arguments
# "host": ,
# "region": ,
# "index_name": ,
# "index_body": {},
# "port": 443,
# "use_ssl": True,
# "verify_certs": True,
# "pool_maxsize": 20,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Index Qdrant
Description#
A Fondant component to load textual data and embeddings into a Qdrant database. NOTE: A Qdrant collection has to be created in advance with the appropriate configurations. https://qdrant.tech/documentation/concepts/collections/
Inputs / outputs#
Consumes#
This component consumes:
- text: string
- embedding: list
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
collection_name | str | The name of the Qdrant collection to upsert data into. | / |
location | str | The location of the Qdrant instance. | / |
batch_size | int | The batch size to use when uploading points to Qdrant. | 64 |
parallelism | int | The number of parallel workers to use when uploading points to Qdrant. | 1 |
url | str | Either host or str of 'Optional[scheme], host, Optional[port], Optional[prefix]'. | / |
port | int | Port of the REST API interface. | 6333 |
grpc_port | int | Port of the gRPC interface. | 6334 |
prefer_grpc | bool | If true - use gRPC interface whenever possible in custom methods. | / |
https | bool | If true - use HTTPS(SSL) protocol. | / |
api_key | str | API key for authentication in Qdrant Cloud. | / |
prefix | str | If set, add prefix to the REST URL path. | / |
timeout | int | Timeout for API requests. | / |
host | str | Host name of Qdrant service. If url and host are not set, defaults to 'localhost'. | / |
path | str | Persistence path for QdrantLocal. Eg. local_data/qdrant | / |
force_disable_check_same_thread | bool | Force disable check_same_thread for QdrantLocal sqlite connection. | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(...)
dataset.write(
"index_qdrant",
arguments={
# Add arguments
# "collection_name": ,
# "location": ,
# "batch_size": 64,
# "parallelism": 1,
# "url": ,
# "port": 6333,
# "grpc_port": 6334,
# "prefer_grpc": False,
# "https": False,
# "api_key": ,
# "prefix": ,
# "timeout": 0,
# "host": ,
# "path": ,
# "force_disable_check_same_thread": False,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Index Weaviate
Description#
Component that takes text or embeddings of text snippets and indexes them into a Weaviate vector database.
To run the component with text snippets as input, the component needs to be connected to a previous component that outputs text snippets.
Running with text as input#
import pyarrow as pa
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset.write(
"index_weaviate",
arguments={
"weaviate_url": "http://localhost:8080",
"class_name": "my_class",
"vectorizer": "text2vec-openai",
"additional_headers" : {
"X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY"
}
},
consumes={
"text": pa.string()
}
)
Running with embedding as input#
import pyarrow as pa
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"embed_text",
arguments={...},
consumes={
"text": "text",
},
)
dataset.write(
"index_weaviate",
arguments={
"weaviate_url": "http://localhost:8080",
"class_name": "my_class",
},
consumes={
"embedding": pa.list_(pa.float32())
}
)
Inputs / outputs#
Consumes#
This component can consume additional fields. See the usage example below on how to define a field name for additional fields.
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
weaviate_url | str | The URL of the weaviate instance. | http://localhost:8080 |
batch_size | int | The batch size to be used. Parameter of weaviate.batch.Batch().configure(). | 100 |
dynamic | bool | Whether to use dynamic batching or not. Parameter of weaviate.batch.Batch().configure(). | True |
num_workers | int | The maximal number of concurrent threads to run batch import. Parameter of weaviate.batch.Batch().configure(). | 2 |
overwrite | bool | Whether to overwrite/re-create the existing weaviate class and its embeddings. | / |
class_name | str | The name of the weaviate class that will be created and used to store the embeddings. Should follow the weaviate naming conventions. | / |
additional_config | dict | Additional configuration to pass to the weaviate client. | / |
additional_headers | dict | Additional headers to pass to the weaviate client. | / |
vectorizer | str | Which vectorizer to use. You can find the available vectorizers in the weaviate documentation: https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules. Set this to None if you want to insert your own embeddings. | / |
module_config | dict | The configuration of the vectorizer module. You can find the available configuration options in the weaviate documentation: https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules. Set this to None if you want to insert your own embeddings. | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(...)
dataset.write(
"index_weaviate",
arguments={
# Add arguments
# "weaviate_url": "http://localhost:8080",
# "batch_size": 100,
# "dynamic": True,
# "num_workers": 2,
# "overwrite": False,
# "class_name": ,
# "additional_config": {},
# "additional_headers": {},
# "vectorizer": ,
# "module_config": {},
},
consumes={
<field_name>: <dataset_field_name>,
..., # Add fields
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Write to file
Description#
A Fondant component to write a dataset to file on a local machine or to a cloud storage bucket. The dataset can be written as csv or parquet.
Inputs / outputs#
Consumes#
This component can consume additional fields. See the usage example below on how to define a field name for additional fields.
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
path | str | Path to store the dataset, whether it's a local path or a cloud storage bucket, must be specified. A separate filename will be generated for each partition. If you are using the local runner and export the data to a local directory, ensure that you mount the path to the directory using the --extra-volumes argument. | / |
format | str | Format for storing the dataframe; can be either csv or parquet. Parquet is used by default. The CSV files contain the column as a header and use a comma as a delimiter. | parquet |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(...)
dataset.write(
"write_to_file",
arguments={
# Add arguments
# "path": ,
# "format": "parquet",
},
consumes={
<field_name>: <dataset_field_name>,
..., # Add fields
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Write to Hugging Face hub
Description#
Component that writes a dataset to the Hugging Face hub
Inputs / outputs#
Consumes#
This component can consume additional fields. See the usage example below on how to define a field name for additional fields.
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
hf_token | str | The hugging face token used to write to the hub | / |
username | str | The username under which to upload the dataset | / |
dataset_name | str | The name of the dataset to upload | / |
image_column_names | list | A list containing the image column names. Used to format the images to HF hub format | / |
column_name_mapping | dict | Mapping of the consumed fondant column names to the written hub column names | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(...)
dataset.write(
"write_to_hf_hub",
arguments={
# Add arguments
# "hf_token": ,
# "username": ,
# "dataset_name": ,
# "image_column_names": [],
# "column_name_mapping": {},
},
consumes={
<field_name>: <dataset_field_name>,
..., # Add fields
},
)
Image processing
Caption images
Description#
This component captions images using a BLIP model from the Hugging Face hub
Inputs / outputs#
Consumes#
This component consumes:
- image: binary
Produces#
This component produces:
- caption: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
model_id | str | Id of the BLIP model on the Hugging Face hub | Salesforce/blip-image-captioning-base |
batch_size | int | Batch size to use for inference | 8 |
max_new_tokens | int | Maximum token length of each caption | 50 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"caption_images",
arguments={
# Add arguments
# "model_id": "Salesforce/blip-image-captioning-base",
# "batch_size": 8,
# "max_new_tokens": 50,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Crop images
Description#
This component crops out image borders. This is typically useful when working with graphical images that have single-color borders (e.g. logos, icons, etc.).
The component takes an image and calculates which color is most present in the border. It then crops the image in order to minimize this single-color border. The padding argument will add extra border to the image before cropping it, in order to avoid cutting off parts of the image. The resulting crop will always be square. If a crop is not possible, the component will return the original image.
Examples#
Examples of image cropping by removing the single-color border. Left side is original image, right side is border-cropped image.
Inputs / outputs#
Consumes#
This component consumes:
- images_data: binary
Produces#
This component produces:
- image: binary
- image_width: int32
- image_height: int32
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
cropping_threshold | int | Threshold parameter used for detecting borders. A lower (negative) parameter results in a more performant border detection, but can cause overcropping. Default is -30 | -30 |
padding | int | Padding for the image cropping. The padding is added to all borders of the image. | 10 |
Usage#
You can apply this component to your dataset using the following code:
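A minimal sketch, following the pattern of the other image components; the component name crop_images is assumed here, and the defaults are taken from the table above.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"crop_images",
arguments={
# Add arguments
# "cropping_threshold": -30,
# "padding": 10,
},
)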
Embed images
Description#
Component that generates CLIP embeddings from images
Inputs / outputs#
Consumes#
This component consumes:
- image: binary
Produces#
This component produces:
- embedding: list
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
model_id | str | Model id of a CLIP model on the Hugging Face hub | openai/clip-vit-large-patch14 |
batch_size | int | Batch size to use when embedding | 8 |
Usage#
You can apply this component to your dataset using the following code:
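A minimal sketch; the component name embed_images is assumed here, and the defaults are taken from the table above.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"embed_images",
arguments={
# Add arguments
# "model_id": "openai/clip-vit-large-patch14",
# "batch_size": 8,
},
)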
Extract image resolution
Description#
Component that extracts image resolution data from the images
Inputs / outputs#
Consumes#
This component consumes:
- image: binary
Produces#
This component produces:
- image: binary
- image_width: int32
- image_height: int32
Arguments#
This component takes no arguments.
Usage#
You can apply this component to your dataset using the following code:
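A minimal sketch; the component name extract_image_resolution is assumed here. Since the component takes no arguments, it can be applied directly.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"extract_image_resolution",
)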
Filter image resolution
Description#
Component that filters images based on minimum size and max aspect ratio
Inputs / outputs#
Consumes#
This component consumes:
- image_width: int32
- image_height: int32
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
min_image_dim | int | Minimum image dimension | / |
max_aspect_ratio | float | Maximum aspect ratio | / |
Usage#
You can apply this component to your dataset using the following code:
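A minimal sketch; the component name filter_image_resolution is assumed here, and the argument placeholders follow the table above.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"filter_image_resolution",
arguments={
# Add arguments
# "min_image_dim": 0,
# "max_aspect_ratio": 0.0,
},
)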
Resize images
Description#
Component that resizes images based on given width and height
Inputs / outputs#
Consumes#
This component consumes:
- image: binary
Produces#
This component produces:
- image: binary
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
resize_width | int | The width to resize to | / |
resize_height | int | The height to resize to | / |
Usage#
You can apply this component to your dataset using the following code:
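A minimal sketch; the component name resize_images is assumed here, and the argument placeholders follow the table above.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"resize_images",
arguments={
# Add arguments
# "resize_width": 0,
# "resize_height": 0,
},
)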
Segment images
Description#
Component that creates segmentation masks for images using a model from the Hugging Face hub
Inputs / outputs#
Consumes#
This component consumes:
- image: binary
Produces#
This component produces:
- segmentation_map: binary
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
model_id | str | id of the model on the Hugging Face hub | openmmlab/upernet-convnext-small |
batch_size | int | batch size to use | 8 |
Usage#
You can apply this component to your dataset using the following code:
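A minimal sketch; the component name segment_images is assumed here, and the defaults are taken from the table above.
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"segment_images",
arguments={
# Add arguments
# "model_id": "openmmlab/upernet-convnext-small",
# "batch_size": 8,
},
)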
Text processing
Chunk text
Description#
Component that chunks text into smaller segments
This component takes a body of text and chunks it into smaller chunks. The id of the returned dataset consists of the id of the original document followed by the chunk index.
Different chunking strategies can be used. The default is to use the "recursive" strategy which recursively splits the text into smaller chunks until the chunk size is reached.
More information about the different chunking strategies can be found here:
- https://python.langchain.com/docs/modules/data_connection/document_transformers/
- https://www.pinecone.io/learn/chunking-strategies/
Inputs / outputs#
Consumes#
This component consumes:
- text: string
Produces#
This component produces:
- text: string
- original_document_id: string
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
chunk_strategy | str | The strategy to use for chunking the text. One of ['RecursiveCharacterTextSplitter', 'HTMLHeaderTextSplitter', 'CharacterTextSplitter', 'Language', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter', 'SentenceTransformersTokenTextSplitter', 'LatexTextSplitter', 'SpacyTextSplitter', 'TokenTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'character', 'NLTK', 'SpaCy'] | RecursiveCharacterTextSplitter |
chunk_kwargs | dict | The arguments to pass to the chunking strategy | / |
language_text_splitter | str | The programming language to use for splitting text into sentences if "language" is selected as the splitter. Check https://python.langchain.com/docs/modules/data_connection/document_transformers/code_splitter for more information on supported languages. | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"chunk_text",
arguments={
# Add arguments
# "chunk_strategy": "RecursiveCharacterTextSplitter",
# "chunk_kwargs": {},
# "language_text_splitter": ,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Embed text
Description#
Component that generates embeddings of text passages.
Inputs / outputs#
Consumes#
This component consumes:
- text: string
Produces#
This component produces:
- embedding: list
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
model_provider | str | The provider of the model - corresponding to langchain embedding classes. Currently the following providers are supported: aleph_alpha, cohere, huggingface, openai, vertexai. | huggingface |
model | str | The model to generate embeddings from. Choose an available model name to pass to the model provider's langchain embedding class. | / |
api_keys | dict | The API keys to use for the model provider that are written to environment variables. Pass only the keys required by the model provider, or conveniently pass all keys you will ever need. Pay attention to how to name the dictionary keys so that they can be used by the model provider. | / |
auth_kwargs | dict | Additional keyword arguments required for api initialization/authentication. | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"embed_text",
arguments={
# Add arguments
# "model_provider": "huggingface",
# "model": ,
# "api_keys": {},
# "auth_kwargs": {},
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Filter language
Description#
A component that filters text based on the provided language.
Inputs / outputs#
Consumes#
This component consumes:
- text: string
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
language | str | A valid language code or identifier (e.g., "en", "fr", "de"). | en |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"filter_language",
arguments={
# Add arguments
# "language": "en",
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Filter text length
Description#
A component that filters out text passages based on their length.
Inputs / outputs#
Consumes#
This component consumes:
- text: string
Produces#
This component does not produce data.
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
min_characters_length | int | Minimum number of characters | / |
min_words_length | int | Minimum number of words | / |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"filter_text_length",
arguments={
# Add arguments
# "min_characters_length": 0,
# "min_words_length": 0,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run:
Generate minhash
Description#
A component that generates minhashes of text.
Inputs / outputs#
Consumes#
This component consumes:
- text: string
Produces#
This component produces:
- minhash: list
Arguments#
The component takes the following arguments to alter its behavior:
argument | type | description | default |
---|---|---|---|
shingle_ngram_size | int | Defines the n-gram size used for shingle generation | 3 |
Usage#
You can apply this component to your dataset using the following code:
from fondant.dataset import Dataset
dataset = Dataset.read(...)
dataset = dataset.apply(
"generate_minhash",
arguments={
# Add arguments
# "shingle_ngram_size": 3,
},
)
Testing#
You can run the tests using docker with BuildKit. From this directory, run: