huggingface/OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.

36 / 100 Emerging

This project helps researchers and developers create their own large-scale, high-quality datasets of web documents containing both images and text. It takes raw web archives (WARC files) as input and processes them to extract, clean, and deduplicate interleaved image-text content. This is for data scientists, machine learning engineers, and AI researchers who need vast amounts of structured web data for training models.
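The extract, clean, and deduplicate stages described above can be illustrated with a small toy sketch. This is not the OBELICS code: it uses only the Python standard library, works on inline HTML strings rather than WARC files, and deduplicates by exact content hash only. The names `InterleavedExtractor`, `extract`, and `deduplicate` are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser
import hashlib


class InterleavedExtractor(HTMLParser):
    """Collect text and image sources in the order they appear (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []        # ("text", str) or ("image", url) tuples
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.items.append(("image", src))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Cleaning step: collapse runs of whitespace, drop empty chunks.
            text = " ".join(data.split())
            if text:
                self.items.append(("text", text))


def extract(html):
    """Turn one HTML page into an interleaved list of text and image items."""
    parser = InterleavedExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


def deduplicate(documents):
    """Keep the first occurrence of each document, compared by content hash."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(repr(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Assumed sample input: two identical pages and one distinct page.
pages = [
    "<html><body><p>A cat.</p><img src='cat.jpg'><p>It  sleeps.</p></body></html>",
    "<html><body><p>A cat.</p><img src='cat.jpg'><p>It  sleeps.</p></body></html>",
    "<html><body><script>x=1</script><p>A dog.</p></body></html>",
]
docs = deduplicate([extract(page) for page in pages])
```

A pipeline at the scale described here would also need streaming WARC parsing, quality filtering, and near-duplicate detection, none of which this sketch attempts.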

211 stars. No commits in the last 6 months.

Use this if you need to build a custom, massive dataset of web pages with finely extracted image-text pairings for training advanced multimodal AI models.

Not ideal if you are looking for a pre-packaged dataset for immediate use, as this provides the tools to build one from scratch.

AI research, web data extraction, multimodal AI, large-scale datasets, dataset creation
Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 10 / 25


Stars 211
Forks 11
Language Python
License Apache-2.0
Last pushed Aug 28, 2024
Commits (30d) 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/huggingface/OBELICS"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
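
The same request can be made from Python. The sketch below uses only the standard library and reuses the URL from the curl command above; the helper names `build_quality_url` and `fetch_quality` are hypothetical, and the assumption that the endpoint returns JSON is not confirmed by this page.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Assemble the endpoint URL for one repository (hypothetical helper)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(url, timeout=10.0):
    """Fetch and decode the response, ASSUMING the body is JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_quality_url("ml-frameworks", "huggingface", "OBELICS")
```

Calling `fetch_quality(url)` performs the network request, so each call counts against the daily limit.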