25 million Creative Commons image dataset released

Fondant is an open-source project that aims to enable compliant, large-scale data processing in a simple and cost-efficient way. As a first step, we have developed a pipeline to create a Creative Commons image dataset, and we are releasing a first sample of 25 million images together with a call to action to help develop additional data processing pipelines.

Fondant simplifies and speeds up large-scale data processing by making self-contained pipeline components reusable across pipelines and infrastructures, and shareable within the community. By offering a library of ready-to-use, off-the-shelf components and a standardized way of building them and combining them with custom components, it significantly reduces the time required to build and maintain data processing infrastructure for generative AI applications in production.
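The idea of chaining reusable components with custom ones can be illustrated with a minimal sketch in plain Python. This is not Fondant's API: the names `filter_by_license`, `add_caption_placeholder` and `run_pipeline`, as well as the record fields `url`, `license` and `caption`, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Set

# Hypothetical record shape: one dict per image, e.g. {"url": ..., "license": ...}
Record = Dict[str, str]
# A component takes a stream of records and yields a (possibly changed) stream.
Component = Callable[[Iterable[Record]], Iterator[Record]]


def filter_by_license(allowed: Set[str]) -> Component:
    """Build a reusable component that keeps only records with an allowed license."""
    def run(records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            if record.get("license") in allowed:
                yield record
    return run


def add_caption_placeholder(records: Iterable[Record]) -> Iterator[Record]:
    """Custom component: attach an empty caption field to be filled in later."""
    for record in records:
        yield {**record, "caption": ""}


def run_pipeline(records: Iterable[Record], components: List[Component]) -> List[Record]:
    """Chain components so that each one consumes the previous one's output."""
    for component in components:
        records = component(records)
    return list(records)
```

Because every component shares the same input and output shape in this sketch, a generic filter and a custom captioning step can be combined in any order, or reused in another pipeline, without changes to either.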

Supported by Flanders Innovation & Entrepreneurship and the European AI service provider ML6, Fondant developed a pipeline to create a dataset of over 500 million Creative Commons-licensed images from Common Crawl, intended for training an image-generation model that respects copyright. We are now releasing a first sample of 25 million images, with tools to download, explore and process the data. We invite developers and data enthusiasts to collaborate on large-scale data processing pipelines by building custom components for advanced filtering and captioning, and to contribute to the core framework. We are also looking for feedback on the framework’s usability and suggestions for improvement. Contact us at info@fondant.ai or join our Discord to help realize this vision.

Creative Commons is a non-profit organization which provides licenses that allow other creators to reuse one’s work under certain conditions. Common Crawl is a non-profit organization which publishes monthly archives of the public Internet.