Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks

Repository for code underlying the paper 'Assessing the Impact of OCR Quality on Downstream NLP Tasks'

Emerging

This project helps digital humanities researchers and computational linguists understand how the quality of Optical Character Recognition (OCR) affects text analysis results. It takes OCR'd historical documents, specifically 19th-century newspapers, and compares them to human-corrected versions to assess the impact on tasks like part-of-speech tagging, named entity recognition, and topic modeling. The results indicate how reliable these NLP tasks are when applied to historical, imperfectly digitized texts.
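As a rough illustration of the comparison described above, the sketch below scores an OCR'd sentence against its human-corrected version. It is not the repository's code: the function names and the sample sentences are hypothetical, and the similarity measures come from Python's standard `difflib` rather than from whatever metrics the paper uses.

```python
from difflib import SequenceMatcher


def char_similarity(ocr_text: str, gold_text: str) -> float:
    """Character-level similarity in [0, 1] between OCR output and corrected text."""
    # autojunk=False disables difflib's heuristic that ignores frequent characters.
    return SequenceMatcher(None, ocr_text, gold_text, autojunk=False).ratio()


def token_agreement(ocr_text: str, gold_text: str) -> float:
    """Fraction of corrected (gold) tokens that survive unchanged in the OCR output."""
    ocr_tokens = ocr_text.split()
    gold_tokens = gold_text.split()
    if not gold_tokens:
        return 1.0
    matcher = SequenceMatcher(None, ocr_tokens, gold_tokens, autojunk=False)
    # Sum the sizes of aligned runs of identical tokens.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gold_tokens)


# Hypothetical example pair: two tokens are corrupted by OCR errors.
gold = "The Queen arrived in London on Tuesday"
ocr = "Tbe Queen arrivcd in London on Tuesday"

print(f"character similarity: {char_similarity(ocr, gold):.3f}")
print(f"token agreement:      {token_agreement(ocr, gold):.3f}")
```

Scores like these could be used to group documents into OCR quality bands before running the same NLP task on both versions and comparing the outputs.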

No commits in the last 6 months.

Use this if you are working with historical documents converted from images to text via OCR and need to understand how OCR errors might be skewing your text analysis results.

Not ideal if you are looking for a general-purpose OCR tool or a solution for improving OCR quality itself, as this project focuses on evaluating its impact.

digital-humanities computational-linguistics historical-research text-analysis archive-digitization

Stale (6 months), no package, no dependents

Maintenance 0 / 25

Adoption 5 / 25

Maturity 16 / 25

Community 13 / 25

Language: Jupyter Notebook

License: CC-BY-4.0

Higher-rated alternatives

google/langfun: OO for LLMs

tanaos/artifex: Small Language Model Inference, Fine-Tuning and Observability. No GPU, no labeled data needed.

preligens-lab/textnoisr: Adding random noise to a text dataset, and controlling very accurately the quality of the result

vulnerability-lookup/VulnTrain: A tool to generate datasets and models based on vulnerabilities descriptions from @Vulnerability-Lookup.

masakhane-io/masakhane-mt: Machine Translation for Africa