Blog

Building a Datacomp CLIP index with Fondant

Large (image) datasets are often unwieldy to use due to their sheer size. Assume for instance that we would like to extract all the cat images from such a dataset. We would have to look at every image to determine whether it's a cat image. And if we want to extract all the dog images next, we need to look at every image again.

Instead, we can look at every image once, and calculate a (CLIP) embedding representing its content. Combining these embeddings into an index, we can efficiently search through the dataset with a query, finding specific images, without having to look at each one.
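To make the idea concrete, here is a minimal sketch in plain Python. It is not Fondant's or LAION's implementation: the three-dimensional vectors, file names, and helper functions are all hypothetical stand-ins for real CLIP embeddings, which have hundreds of dimensions and come from a model. The embeddings are computed and stored once, and every later query only compares vectors.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def build_index(embeddings):
    """Store one normalized embedding per image id (computed once, up front)."""
    return {image_id: normalize(vec) for image_id, vec in embeddings.items()}


def search(index, query_embedding, top_k=2):
    """Return the top_k image ids ranked by cosine similarity to the query."""
    query = normalize(query_embedding)
    scored = [
        (sum(q * v for q, v in zip(query, vec)), image_id)
        for image_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [image_id for _, image_id in scored[:top_k]]


# Hypothetical toy embeddings standing in for model-generated CLIP vectors.
toy_embeddings = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "dog_1.jpg": [0.1, 0.9, 0.0],
    "car_1.jpg": [0.0, 0.1, 0.9],
}
index = build_index(toy_embeddings)

# Two different queries reuse the same index; no image is inspected again.
cat_results = search(index, [1.0, 0.0, 0.0])
dog_results = search(index, [0.0, 1.0, 0.0], top_k=1)
print(cat_results, dog_results)
```

This toy search still compares the query against every stored vector, which is already far cheaper than re-running a model on each image. Indices for datasets with millions of images usually avoid even that by using approximate nearest-neighbour search.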

CLIP index

This is what LAION did for their LAION-5B dataset, and the index is what made a dataset of that size practical to use, as we did in our ControlNet example. Unfortunately, the LAION-5B dataset and index have been taken offline (temporarily), and there aren't any alternatives. This is why we built an index for the Datacomp-12M dataset. While it is a lot smaller than LAION-5B, it should already enable a lot of use cases again, and can hopefully be the start towards building indices for more and larger datasets.

Let's tune RAG pipelines with Fondant

Retrieval Augmented Generation (RAG) has quickly become the go-to architecture for providing large language models (LLMs) with specific knowledge. Optimizing a custom setup, however, can take days of searching for the right set of parameters and system configuration.

We have created an example use case to show how you can enhance your RAG setup by using Fondant. Check out the resources:

Fondant 0.8: Simplification, SageMaker, RAG, and more!

Hi all, we released Fondant 0.8, which brings some major new features and improvements:

  • πŸ“ We simplified and improved the way datasets are stored and accessed
  • 🚀 The interface to compose a Fondant pipeline is now simpler and more powerful
  • 🌐 AWS SageMaker is now supported as an execution framework for Fondant pipelines
  • πŸ” The Fondant explorer was improved, especially for text and document data
  • 📚 We released a RAG tuning repository powered by Fondant

Read on for more details!

Fondant 0.6 brings Vertex AI support and more

Hi all, we released Fondant 0.6, which brings some major new features and improvements:

🌀 Vertex AI is now supported as a backend for pipeline execution.

To submit your pipeline, simply run the command fondant run vertex. To see the possible configuration options, run fondant run vertex --help.

25 million Creative Commons image dataset released

Fondant is an open-source project that aims to simplify and speed up large-scale data processing by making containerized components reusable across pipelines and execution environments, and shareable within the community.

A current challenge for generative AI is compliance with copyright laws. For this reason, Fondant has developed a data-processing pipeline to create a dataset of 500 million Creative Commons images, intended for training a latent diffusion image generation model that respects copyright. Today, as a first step, we are releasing a sample dataset of 25 million images and invite the open source community to collaborate on further refinement steps.