Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks

Repository for code underlying the paper 'Assessing the Impact of OCR Quality on Downstream NLP Tasks'

Score: 34 / 100 (Emerging)

This project helps digital humanities researchers and computational linguists understand how the quality of Optical Character Recognition (OCR) affects text analysis results. It takes OCR'd historical documents, specifically 19th-century newspapers, and compares them to human-corrected versions to assess the impact on tasks like part-of-speech tagging, named entity recognition, and topic modeling. The output provides insights into the reliability of NLP tasks when working with historical, imperfectly digitized texts.
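One common way to quantify the gap between an OCR'd text and its human-corrected version is a character error rate based on edit distance. The sketch below is illustrative only: it is not taken from the repository, and the function names `levenshtein` and `character_error_rate` are hypothetical. It assumes the two strings are already aligned at the document or passage level.

```python
def levenshtein(a, b):
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    # previous[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, corrected_text):
    """Edit distance from OCR text to the corrected text, normalised by the corrected length."""
    if not corrected_text:
        raise ValueError("corrected_text must not be empty")
    return levenshtein(ocr_text, corrected_text) / len(corrected_text)
```

A lower value means the OCR output is closer to the corrected text. Passages could then be grouped by this rate before comparing NLP results across quality bands, though how the paper itself bands quality is not stated here.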

No commits in the last 6 months.

Use this if you are working with historical documents converted from images to text via OCR and need to understand how OCR errors might be skewing your text analysis results.

Not ideal if you are looking for a general-purpose OCR tool or a solution for improving OCR quality itself, as this project focuses on evaluating its impact.

digital-humanities computational-linguistics historical-research text-analysis archive-digitization
Stale 6m No Package No Dependents
Maintenance 0 / 25
Adoption 5 / 25
Maturity 16 / 25
Community 13 / 25


Stars: 9
Forks: 2
Language: Jupyter Notebook
License: CC-BY-4.0
Last pushed: Oct 16, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
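The same request can be made from Python with the standard library. This is a minimal sketch under two assumptions that the listing does not confirm: that the path follows a category/owner/repository pattern (inferred from the single URL above) and that the endpoint returns JSON. The helper names `build_quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the request URL; the category/owner/repo layout is an assumption."""
    parts = (quote(part, safe="") for part in (category, owner, repo))
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the quality record without an API key, assuming a JSON response body."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "Living-with-machines", "lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks")` issues one request, which counts against the daily limit. The response fields are not documented here, so inspect the returned object before relying on any key.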