TheoCoombes/crawlingathome
A client library for LAION's effort to filter CommonCrawl with CLIP, building a large-scale image-text dataset.
This client library lets individuals and organizations contribute to building large-scale image-text datasets, such as those used to train AI models. You provide computing resources (CPU or GPU) and receive raw web data to process. Your client returns filtered image-text pairs and progress updates, which feed into a public dataset. The tool is aimed at researchers, data scientists, and citizen scientists interested in curating web-scale data for AI.
No commits in the last 6 months.
Use this if you want to contribute computational power to help build a massive, high-quality image-text dataset by processing web archives.
Not ideal if you need a pre-packaged, off-the-shelf dataset or are looking for a tool to perform web crawling for your own private projects.
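The filtering step described above can be illustrated with a minimal sketch. It does not use the crawlingathome API, which is not shown on this page. It assumes that image and caption embeddings have already been computed (for example by CLIP) and keeps only pairs whose cosine similarity meets a threshold. The record layout, the function names, and the default threshold of 0.3 are all hypothetical choices made for illustration.

```python
import math
from typing import Iterable, List, Sequence, Tuple

# Hypothetical record layout: (image_url, caption, image_embedding, text_embedding).
# The embeddings are assumed to be precomputed by a model such as CLIP.
Pair = Tuple[str, str, Sequence[float], Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector carries no direction, so treat it as no match.
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs: Iterable[Pair], threshold: float = 0.3) -> List[Tuple[str, str]]:
    """Keep (url, caption) pairs whose embeddings are similar enough.

    The default threshold is an illustrative assumption, not a value
    taken from the project.
    """
    kept = []
    for url, caption, image_vec, text_vec in pairs:
        if cosine_similarity(image_vec, text_vec) >= threshold:
            kept.append((url, caption))
    return kept
```

With toy two-dimensional embeddings, a pair whose vectors point the same way is kept and a pair whose vectors are orthogonal is dropped. A real worker would also need to download images, compute embeddings, and report progress, none of which is covered here.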
Stars: 32
Forks: 7
Language: Python
License: MIT
Category
Last pushed: Mar 21, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/TheoCoombes/crawlingathome"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
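The same request can be made from Python using only the standard library. This is a sketch under stated assumptions: the URL pattern (base path, then category, then owner/repo) is inferred from the single curl example above, the helper names are hypothetical, and the response is returned as raw text because its format is not documented here. How an API key would be supplied is not stated, so the sketch makes unauthenticated requests only.

```python
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; the pattern that follows it
# (category, then owner/repo) is an assumption inferred from that one URL.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, repo: str) -> str:
    """Build the request URL for a repository given as 'owner/name'."""
    return f"{BASE_URL}/{quote(category, safe='')}/{quote(repo, safe='/')}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body as text.

    The response format is not documented on this page, so no parsing
    is attempted here.
    """
    url = build_quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_quality("ml-frameworks", "TheoCoombes/crawlingathome")` issues the same GET request as the curl command. Unauthenticated requests count against the 100 requests/day limit mentioned above.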
Higher-rated alternatives
rom1504/img2dataset
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M...
devrimcavusoglu/pybboxes
Light weight toolkit for bounding boxes providing conversion between bounding box types and...
PyRetri/PyRetri
Open source deep learning based unsupervised image retrieval toolbox built on PyTorch🔥
Particle1904/DatasetHelpers
Dataset Helper program to automatically select, re scale and tag Datasets (composed of image and...
salesforce/LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence