clovaai/webvicob
Official Implementation of Web-based Visual Corpus Builder (Webvicob), ICDAR 2023
This tool helps researchers and data scientists quickly build large collections of images with text annotations. It takes raw Wikipedia HTML dumps and generates a 'visual corpus': images of rendered web pages paired with precise text labels. The corpus is designed for training and evaluating visual document understanding models, which makes it useful for work in computer vision and natural language processing.
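To make the idea of a 'visual corpus' entry concrete, here is a minimal sketch of what one sample could look like in memory. This is not Webvicob's actual output format. The class names, the field names, and the assumption that each label carries a pixel bounding box are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class TextLabel:
    # Hypothetical label: a text string plus its pixel box on the rendered page.
    text: str
    box: Tuple[int, int, int, int]  # assumed order: (left, top, right, bottom)


@dataclass
class PageSample:
    # Hypothetical corpus entry: one rendered page image and its text labels.
    image_path: str
    labels: List[TextLabel] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses; tuples become JSON arrays.
        return json.dumps(asdict(self), ensure_ascii=False)


sample = PageSample(
    image_path="pages/example.png",  # placeholder path, not a real file
    labels=[
        TextLabel("Hello", (10, 12, 58, 30)),
        TextLabel("world", (64, 12, 110, 30)),
    ],
)
record = sample.to_json()
```

Keeping the image reference and its labels in one record means a training loader can read a sample without joining separate files, though the real tool may organize its output differently.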
109 stars. No commits in the last 6 months.
Use this if you need to create a large-scale, visually rich dataset of documents for training AI models, especially for tasks involving both text and layout understanding.
Not ideal if you need a dataset of non-document images, or if your primary need is for text-only data without visual context.
Stars: 109
Forks: 8
Language: Python
License: Apache-2.0
Category:
Last pushed: Oct 24, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/clovaai/webvicob"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
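The same request can be made from Python with only the standard library. The sketch below assumes the path layout is category/owner/repo, which is inferred from the single curl example above, and that the endpoint returns JSON. The response schema is not documented here, so the result is returned unparsed beyond JSON decoding. The function names are illustrative.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    # Path layout is inferred from the one curl example; treat it as an assumption.
    parts = (quote(part, safe="") for part in (category, owner, repo))
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    # Assumes a JSON response body; no fields are relied on because the schema is unknown.
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_quality("nlp", "clovaai", "webvicob") would issue the request shown in the curl command. With the keyless limit of 100 requests/day, caching responses locally is a sensible precaution.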
Higher-rated alternatives
deepdoctection/deepdoctection
A Repo For Document AI
deanmalmgren/textract
extract text from any document. no muss. no fuss.
eikek/docspell
Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources...
zzzDavid/ICDAR-2019-SROIE
ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction
clovaai/donut
Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic...