clovaai/webvicob

Official Implementation of Web-based Visual Corpus Builder (Webvicob), ICDAR 2023

/ 100

Emerging

This tool helps researchers and data scientists build large collections of images with text annotations. It takes raw Wikipedia HTML dumps and generates a 'visual corpus': images of web pages paired with precise text labels. The corpus is designed for training and evaluating models that understand documents visually, which makes it relevant to work in computer vision and natural language processing.
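To make the idea of a visual corpus entry concrete, here is a minimal Python sketch of one page image paired with word-level text labels. It is illustrative only: the names `WordBox` and `PageSample`, the field layout, and the file path are assumptions made for this example, not Webvicob's actual output schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json


@dataclass
class WordBox:
    """One text label: a word and its pixel bounding box in the page image (hypothetical layout)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class PageSample:
    """One corpus entry: a page image plus its word-level labels (hypothetical layout)."""
    image_path: str
    width: int
    height: int
    words: List[WordBox] = field(default_factory=list)

    def boxes_in_bounds(self) -> bool:
        # Every box must have positive area and lie inside the image.
        return all(
            0 <= w.left < w.right <= self.width
            and 0 <= w.top < w.bottom <= self.height
            for w in self.words
        )


sample = PageSample(
    image_path="pages/example.jpg",  # hypothetical path, for illustration only
    width=1280,
    height=720,
    words=[
        WordBox("Hello", 40, 30, 120, 60),
        WordBox("world", 130, 30, 215, 60),
    ],
)

# Serialize the entry to JSON, a common way to store annotations next to images.
record = json.dumps(asdict(sample))
```

A check such as `sample.boxes_in_bounds()` is one inexpensive way to catch labels that fall outside the image before a sample is used for training.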

109 stars. No commits in the last 6 months.

Use this if you need to create a large-scale, visually rich dataset of documents for training AI models, especially for tasks involving both text and layout understanding.

Not ideal if you need a dataset of non-document images, or if your primary need is for text-only data without visual context.

document-intelligence computer-vision dataset-generation AI-training-data information-extraction

Stale 6m No Package No Dependents

Maintenance 0 / 25

Adoption 9 / 25

Maturity 16 / 25

Community 11 / 25


Stars

109

Forks

Language

Python

License

Apache-2.0

Higher-rated alternatives

deepdoctection/deepdoctection

A Repo For Document AI

deanmalmgren/textract

extract text from any document. no muss. no fuss.

eikek/docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources...

zzzDavid/ICDAR-2019-SROIE

ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction

clovaai/donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic...
