doc-analysis/ReadingBank

ReadingBank: A Benchmark Dataset for Reading Order Detection


Experimental

This dataset helps researchers and developers working on document understanding determine the natural reading order of text on a page. It provides 500,000 document images along with their correct word sequences and coordinates, extracted from Microsoft Word documents. It is aimed at researchers improving automated document processing and information extraction.

117 stars. No commits in the last 6 months.

Use this if you are developing or evaluating machine learning models that need to accurately extract text in the correct human reading order from visually complex documents.

Not ideal if you are looking for a tool to process your own documents directly, as this is a dataset for model training and research, not an end-user application.
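To make the reading-order task concrete, here is a minimal sketch of a naive baseline that a trained model would be compared against: sort words top to bottom, then left to right. The `(word, x0, y0, x1, y1)` record layout, the `naive_reading_order` name, and the `line_tolerance` parameter are assumptions for illustration only. They are not the dataset's actual file format or the authors' method.

```python
from typing import List, Tuple

# Hypothetical record layout (an assumption, not the dataset's real schema):
# (word, x0, y0, x1, y1) with the origin at the top-left of the page.
Word = Tuple[str, int, int, int, int]


def naive_reading_order(words: List[Word], line_tolerance: int = 5) -> List[str]:
    """Order words top-to-bottom, then left-to-right within each line."""
    # Sort by top edge first, then by left edge.
    ordered = sorted(words, key=lambda w: (w[2], w[1]))

    # Group words whose top edge is close to the first word of the current line.
    lines: List[List[Word]] = []
    for w in ordered:
        if lines and abs(w[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])

    # Within each line, read left to right.
    return [w[0] for line in lines for w in sorted(line, key=lambda w: w[1])]


sample = [
    ("World", 60, 11, 100, 20),
    ("Hello", 10, 10, 50, 20),
    ("Next", 10, 30, 40, 40),
]
print(naive_reading_order(sample))
```

A heuristic like this tends to break on multi-column pages, tables, and forms, because it reads straight across columns. Those are the visually complex layouts where a labeled reading-order dataset becomes useful.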

document-processing information-extraction natural-language-processing computer-vision optical-character-recognition

No License · Stale 6m · No Package · No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 8 / 25

Community 7 / 25


Stars: 117

Forks

Language: —

License: —

Higher-rated alternatives

google/langfun

OO for LLMs

tanaos/artifex

Small Language Model Inference, Fine-Tuning and Observability. No GPU, no labeled data needed.

preligens-lab/textnoisr

Adding random noise to a text dataset, and controlling very accurately the quality of the result

vulnerability-lookup/VulnTrain

A tool to generate datasets and models based on vulnerabilities descriptions from @Vulnerability-Lookup.

masakhane-io/masakhane-mt

Machine Translation for Africa
