huggingface/OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.

/ 100

Emerging

This project helps researchers and developers create their own large-scale, high-quality datasets of web documents containing both images and text. It takes raw web archives (WARC files) as input and processes them to extract, clean, and deduplicate interleaved image-text content. This is for data scientists, machine learning engineers, and AI researchers who need vast amounts of structured web data for training models.
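The extract-and-deduplicate stages described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the OBELICS code: it assumes the HTML has already been decoded from a WARC record, the names `InterleavedExtractor`, `extract_interleaved` and `deduplicate` are hypothetical, exact-hash matching stands in for whatever deduplication the project actually performs, and no cleaning or filtering is attempted.

```python
import hashlib
from html.parser import HTMLParser


class InterleavedExtractor(HTMLParser):
    """Hypothetical helper: collects text chunks and image URLs in document order."""

    def __init__(self):
        super().__init__()
        self.items = []  # entries are ("text", str) or ("image", url)

    def handle_starttag(self, tag, attrs):
        # Record each image at the position where it appears in the page.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.items.append(("image", src))

    def handle_data(self, data):
        # Collapse whitespace and skip empty chunks.
        # Note: script/style contents are NOT filtered out in this sketch.
        text = " ".join(data.split())
        if text:
            self.items.append(("text", text))


def extract_interleaved(html):
    """Return the interleaved text/image sequence for one HTML page."""
    parser = InterleavedExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


def deduplicate(documents):
    """Drop exact duplicate documents, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for doc in documents:
        key = hashlib.sha256(repr(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Calling `extract_interleaved` on each page and passing the resulting list to `deduplicate` yields one ordered text/image sequence per unique page. A real pipeline would also need WARC parsing, content filtering and near-duplicate detection, none of which are shown here.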

211 stars. No commits in the last 6 months.

Use this if you need to build a custom, massive dataset of web pages with finely extracted image-text pairings for training advanced multimodal AI models.

Not ideal if you are looking for a pre-packaged dataset for immediate use, as this provides the tools to build one from scratch.

AI research, web data extraction, multimodal AI, large-scale datasets, dataset creation

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 10 / 25

Stars

211

Forks

Language

Python

License

Apache-2.0

Higher-rated alternatives

rom1504/img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M...

devrimcavusoglu/pybboxes

Light weight toolkit for bounding boxes providing conversion between bounding box types and...

PyRetri/PyRetri

Open source deep learning based unsupervised image retrieval toolbox built on PyTorch🔥

Particle1904/DatasetHelpers

Dataset Helper program to automatically select, re scale and tag Datasets (composed of image and...

salesforce/LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
