TheoCoombes/crawlingathome

A client library for LAION's effort to filter CommonCrawl with CLIP, building a large-scale image-text dataset.

Score: 39/100 (Emerging)

This client library helps individuals or organizations contribute to the creation of large-scale image-text datasets, like those used for training AI models. You provide computing resources (CPU or GPU) and receive raw web data to process. Your output is filtered image-text pairs and progress updates, which contribute to a public dataset. This tool is for researchers, data scientists, or citizen scientists interested in curating vast web data for AI.
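The contribution loop described above (receive raw web data, filter image-text pairs with CLIP, report results) can be sketched roughly as below. This is a hypothetical outline only, not the actual crawlingathome client API: `clip_similarity` and `filter_pairs` are placeholder names, and the 0.3 similarity cutoff is an assumed value.

```python
# Hypothetical sketch of one filtering step in a worker's loop.
# Function names and the threshold are placeholders, not the real
# crawlingathome client API.

THRESHOLD = 0.3  # assumed CLIP similarity cutoff

def clip_similarity(url: str, caption: str) -> float:
    # Placeholder: a real worker would score the (image, caption)
    # pair with a CLIP model; here we fake a score from caption length
    # so the sketch runs without any model or network access.
    return min(len(caption) / 100, 1.0)

def filter_pairs(pairs):
    # Keep only pairs whose similarity clears the threshold.
    return [p for p in pairs if clip_similarity(*p) >= THRESHOLD]

shard = [
    ("https://example.com/a.jpg", "a short caption that describes the image well"),
    ("https://example.com/b.jpg", "cat"),
]
kept = filter_pairs(shard)
```

A real worker would then upload the surviving pairs and request the next shard from the tracker.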

No commits in the last 6 months.

Use this if you want to contribute computational power to help build a massive, high-quality image-text dataset by processing web archives.

Not ideal if you need an off-the-shelf dataset, or if you're looking for a general-purpose web crawler for your own private projects.

data-curation machine-learning-datasets distributed-computing AI-training-data web-data-processing
Stale (6 months) · No package · No dependents
Maintenance: 0/25
Adoption: 7/25
Maturity: 16/25
Community: 16/25


Stars: 32
Forks: 7
Language: Python
License: MIT
Last pushed: Mar 21, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/TheoCoombes/crawlingathome"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
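The curl command above can be reproduced in Python using only the standard library. The endpoint URL is the one given above; the small helper below just assembles it from its category and repository parts (the helper name `quality_url` is ours, not part of the API).

```python
from urllib.parse import quote
from urllib.request import urlopen
import json

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, repo: str) -> str:
    # Build the API URL for a repository's quality report.
    # The repo slug keeps its "/" separator; other characters are escaped.
    return f"{BASE}/{quote(category)}/{quote(repo, safe='/')}"

url = quality_url("ml-frameworks", "TheoCoombes/crawlingathome")

# Fetching requires network access (100 requests/day without a key):
# data = json.loads(urlopen(url).read())
```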