clovaai/webvicob

Official Implementation of Web-based Visual Corpus Builder (Webvicob), ICDAR 2023

36 / 100 (Emerging)

This tool helps researchers and data scientists build large collections of images with text annotations. It takes raw Wikipedia HTML dumps and generates a 'visual corpus': images of web pages together with precise text labels. The corpus is designed for training and evaluating models that understand documents visually, which makes it useful for work in computer vision or natural language processing.

109 stars. No commits in the last 6 months.

Use this if you need to create a large-scale, visually rich dataset of documents for training AI models, especially for tasks involving both text and layout understanding.

Not ideal if you need a dataset of non-document images, or if your primary need is for text-only data without visual context.

document-intelligence, computer-vision, dataset-generation, AI-training-data, information-extraction
Stale 6m · No Package · No Dependents
Maintenance 0 / 25
Adoption 9 / 25
Maturity 16 / 25
Community 11 / 25


Stars: 109
Forks: 8
Language: Python
License: Apache-2.0
Last pushed: Oct 24, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/clovaai/webvicob"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
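
The curl request above can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `quality_url` and `fetch_quality` are hypothetical. It assumes that the `nlp` path segment is a category slug and that the endpoint returns JSON; neither is confirmed by this page, so the response is printed as-is rather than read by field name.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(owner, repo, category="nlp"):
    """Build the request URL (hypothetical helper).

    Assumption: the segment after /quality/ ("nlp" in the curl
    example) is a category slug that precedes owner/repo.
    """
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(owner, repo, category="nlp", timeout=10):
    """Fetch and parse the response (hypothetical helper).

    Assumption: the body is JSON. If it is not, json.loads raises
    a ValueError instead of returning guessed data.
    """
    url = quality_url(owner, repo, category)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return json.loads(body)


if __name__ == "__main__":
    # Performs one network request against the public endpoint.
    data = fetch_quality("clovaai", "webvicob")
    print(json.dumps(data, indent=2))
```

Keeping URL construction separate from the network call makes it possible to check the URL without spending any of the 100 keyless requests per day.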