doc-analysis/ReadingBank
ReadingBank: A Benchmark Dataset for Reading Order Detection
This dataset helps researchers and developers working on document understanding determine the natural reading order of text on a page. It provides 500,000 document images, each paired with its word sequence in reading order and the corresponding word coordinates, extracted from Microsoft Word documents. It is aimed at researchers improving automated document processing and information extraction.
117 stars. No commits in the last 6 months.
Use this if you are developing or evaluating machine learning models that need to accurately extract text in the correct human reading order from visually complex documents.
Not ideal if you are looking for a tool to process your own documents directly, as this is a dataset for model training and research, not an end-user application.
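To illustrate why reading order needs more than word coordinates, here is a minimal sketch of a naive baseline that sorts words top-to-bottom and then left-to-right. The record layout `(word, x_left, y_top)`, the `naive_order` helper, and the toy two-column page are all hypothetical and chosen for illustration; they are not the dataset's actual schema or an official baseline.

```python
from typing import List, Tuple

# Hypothetical record: (word, x_left, y_top). NOT the dataset's real schema.
Word = Tuple[str, int, int]


def naive_order(words: List[Word]) -> List[str]:
    """Sort words top-to-bottom, then left-to-right, ignoring page layout."""
    ranked = sorted(words, key=lambda rec: (rec[2], rec[1]))
    return [word for word, _x, _y in ranked]


# Toy two-column page: the left column reads "A B", the right column "C D".
# The list is written in the intended (ground-truth) reading order.
page: List[Word] = [
    ("A", 10, 10),
    ("B", 10, 30),
    ("C", 200, 10),
    ("D", 200, 30),
]

reading_order = [word for word, _x, _y in page]
predicted = naive_order(page)

print("ground truth:", reading_order)
print("naive sort:  ", predicted)
```

On this toy page the coordinate sort interleaves the two columns, which is the kind of error a reading order model trained on such data is meant to avoid.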
Stars: 117
Forks: 4
Language: —
License: —
Category:
Last pushed: Aug 26, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/doc-analysis/ReadingBank"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
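The same request can be made from Python using only the standard library. This is a sketch under assumptions: the response format is not documented here, so the code tries to parse JSON and falls back to the raw text. The `build_request` and `fetch` helper names and the `Accept` header are illustrative choices, not part of the API.

```python
import json
import urllib.request

# URL copied verbatim from the curl command above.
URL = "https://pt-edge.onrender.com/api/v1/quality/nlp/doc-analysis/ReadingBank"


def build_request(url: str = URL) -> urllib.request.Request:
    """Build a GET request; the Accept header is an assumption, not required."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def fetch(url: str = URL, timeout: float = 10.0):
    """Send the request and return parsed JSON, or raw text if it is not JSON."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # The response format is assumed, so keep the raw body as a fallback.
        return body


# Usage (performs a network call and counts against the daily limit):
#     data = fetch()
#     print(data)
```

Keeping request construction separate from the network call makes it possible to inspect the request without spending any of the daily quota.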
Higher-rated alternatives
google/langfun: OO for LLMs
tanaos/artifex: Small Language Model Inference, Fine-Tuning and Observability. No GPU, no labeled data needed.
preligens-lab/textnoisr: Adding random noise to a text dataset, and controlling very accurately the quality of the result
vulnerability-lookup/VulnTrain: A tool to generate datasets and models based on vulnerabilities descriptions from @Vulnerability-Lookup.
masakhane-io/masakhane-mt: Machine Translation for Africa