TheoCoombes/crawlingathome

A client library for LAION's effort to filter CommonCrawl with CLIP, building a large-scale image-text dataset.

Score: 39/100 (Emerging)

This client library helps individuals or organizations contribute to the creation of large-scale image-text datasets, like those used for training AI models. You provide computing resources (CPU or GPU) and receive raw web data to process. Your output is filtered image-text pairs and progress updates, which contribute to a public dataset. This tool is for researchers, data scientists, or citizen scientists interested in curating vast web data for AI.
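The contribution loop described above (receive raw web data, filter image-text pairs with CLIP, report results) can be sketched roughly as below. This is a hypothetical outline only, not the actual crawlingathome client API: `clip_similarity` and `filter_pairs` are placeholder names, and the 0.3 similarity cutoff is an assumed value.

```python
# Hypothetical sketch of one filtering step in a worker's loop.
# Function names and the threshold are placeholders, not the real
# crawlingathome client API.

THRESHOLD = 0.3  # assumed CLIP similarity cutoff

def clip_similarity(url: str, caption: str) -> float:
    # Placeholder: a real worker would score the (image, caption)
    # pair with a CLIP model; here we fake a score from caption length
    # so the sketch runs without any model or network access.
    return min(len(caption) / 100, 1.0)

def filter_pairs(pairs):
    # Keep only pairs whose similarity clears the threshold.
    return [p for p in pairs if clip_similarity(*p) >= THRESHOLD]

shard = [
    ("https://example.com/a.jpg", "a short caption that describes the image well"),
    ("https://example.com/b.jpg", "cat"),
]
kept = filter_pairs(shard)
```

A real worker would then upload the surviving pairs and request the next shard from the tracker.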

No commits in the last 6 months.

Use this if you want to contribute computational power to help build a massive, high-quality image-text dataset by processing web archives.

Not ideal if you need an off-the-shelf dataset, or if you're looking for a general-purpose web crawler for your own private projects.

data-curation machine-learning-datasets distributed-computing AI-training-data web-data-processing
Stale (6 months) · No package · No dependents
Maintenance: 0/25
Adoption: 7/25
Maturity: 16/25
Community: 16/25


Stars: 32
Forks: 7
Language: Python
License: MIT
Last pushed: Mar 21, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/TheoCoombes/crawlingathome"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
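The curl command above can be reproduced in Python using only the standard library. The endpoint URL is the one given above; the small helper below just assembles it from its category and repository parts (the helper name `quality_url` is ours, not part of the API).

```python
from urllib.parse import quote
from urllib.request import urlopen
import json

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, repo: str) -> str:
    # Build the API URL for a repository's quality report.
    # The repo slug keeps its "/" separator; other characters are escaped.
    return f"{BASE}/{quote(category)}/{quote(repo, safe='/')}"

url = quality_url("ml-frameworks", "TheoCoombes/crawlingathome")

# Fetching requires network access (100 requests/day without a key):
# data = json.loads(urlopen(url).read())
```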