huggingface/OBELICS
Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
This project helps researchers and developers create their own large-scale, high-quality datasets of web documents containing both images and text. It takes raw web archives (WARC files) as input and processes them to extract, clean, and deduplicate interleaved image-text content. This is for data scientists, machine learning engineers, and AI researchers who need vast amounts of structured web data for training models.
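The extract, clean, and deduplicate stages described above can be illustrated with a small standard-library sketch. This is not the OBELICS code: it works on in-memory HTML strings rather than WARC files, and the names `InterleavedExtractor`, `extract`, and `deduplicate` are hypothetical helpers invented for this illustration. Exact-match hashing stands in for whatever deduplication strategy the real pipeline uses.

```python
import hashlib
from html.parser import HTMLParser


class InterleavedExtractor(HTMLParser):
    """Hypothetical helper: collects text and image URLs in document order."""

    def __init__(self):
        super().__init__()
        self.items = []        # ("text", str) or ("image", url) tuples
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.items.append(("image", src))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # cleaning step: collapse whitespace
            if text:                       # cleaning step: drop empty runs
                self.items.append(("text", text))


def extract(html):
    """Return the interleaved text/image sequence for one HTML page."""
    parser = InterleavedExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


def deduplicate(documents):
    """Keep the first occurrence of each exactly identical document."""
    seen = set()
    unique = []
    for doc in documents:
        key = hashlib.sha256(repr(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Toy input: the first two pages differ only in whitespace.
pages = [
    "<p>A cat.</p><img src='cat.png'><p>It   sleeps.</p>",
    "<p>A cat.</p><img src='cat.png'><p>It sleeps.</p>",
    "<script>var x = 1;</script><p>A dog.</p>",
]
docs = deduplicate([extract(page) for page in pages])
```

Because whitespace is normalized before hashing, the first two toy pages collapse into one document. A production pipeline at this scale would likely need near-duplicate detection and image-level filtering, which this sketch does not attempt.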
211 stars. No commits in the last 6 months.
Use this if you need to build a custom, massive dataset of web pages with finely extracted image-text pairings for training advanced multimodal AI models.
Not ideal if you are looking for a pre-packaged dataset for immediate use, as this provides the tools to build one from scratch.
Stars: 211
Forks: 11
Language: Python
License: Apache-2.0
Category:
Last pushed: Aug 28, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/huggingface/OBELICS"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000 requests/day.
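The same request can be made from Python with the standard library. This is a minimal sketch: only the single URL shown above comes from this page, so the `category/owner/repo` path pattern in `build_quality_url` is an assumption generalized from it, and both function names are hypothetical. The response format is not documented here, so the body is returned as raw text.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example; the path layout beyond that one
# URL is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the endpoint URL, percent-encoding each path segment."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the listing data and return the response body as text."""
    url = build_quality_url(category, owner, repo)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


url = build_quality_url("ml-frameworks", "huggingface", "OBELICS")
print(url)
# To perform the request (needs network access):
# body = fetch_quality("ml-frameworks", "huggingface", "OBELICS")
```

With the keyless limit of 100 requests/day, callers polling many repositories may want to cache responses locally.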
Higher-rated alternatives
rom1504/img2dataset
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M...
devrimcavusoglu/pybboxes
Light weight toolkit for bounding boxes providing conversion between bounding box types and...
PyRetri/PyRetri
Open source deep learning based unsupervised image retrieval toolbox built on PyTorch🔥
Particle1904/DatasetHelpers
Dataset Helper program to automatically select, re scale and tag Datasets (composed of image and...
salesforce/LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence