Building a Datacomp CLIP index with Fondant
Large (image) datasets are often unwieldy to use due to their sheer size. Assume for instance that we would like to extract all the cat images from such a dataset. We would have to look at every image to classify if it's a cat image or not. And if we want to extract all the dog images next, we again need to look at every image.
Instead, we can look at every image once, and calculate a (CLIP) embedding representing its content. Combining these embeddings into an index, we can efficiently search through the dataset with a query, finding specific images, without having to look at each one.
This is what LAION did for their LAION-5b dataset, which made it possible to use, like we did in our ControlNet example. Unfortunately, the LAION-5b dataset and index have been taken offline (temporarily) and there aren't any alternatives. This is why we built an index for the Datacomp-12M dataset. While it is a lot smaller than LAION-5b, it should already enable a lot of use cases again, and can hopefully be the start towards building indices for more and larger datasets.