Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks
Repository for code underlying the paper 'Assessing the Impact of OCR Quality on Downstream NLP Tasks'
This project helps digital humanities researchers and computational linguists understand how Optical Character Recognition (OCR) quality affects text analysis results. It takes OCR'd historical documents, specifically 19th-century newspapers, and compares them with human-corrected versions to measure the effect on tasks such as part-of-speech tagging, named entity recognition, and topic modeling. The results indicate how reliable these NLP tasks are on historical, imperfectly digitized text.
No commits in the last 6 months.
Use this if you are working with historical documents converted from images to text via OCR and need to understand how OCR errors might be skewing your text analysis results.
Not ideal if you are looking for a general-purpose OCR tool or a solution for improving OCR quality itself, as this project focuses on evaluating its impact.
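As a rough illustration of the kind of comparison described above, the sketch below scores an OCR'd sentence against a human-corrected version. It is not the repository's code or the paper's method: the function names, the choice of character error rate (Levenshtein distance divided by reference length), the token-overlap measure, and the sample sentences are all assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, corrected_text: str) -> float:
    """Edit distance normalised by the length of the corrected text."""
    if not corrected_text:
        raise ValueError("corrected_text must not be empty")
    return edit_distance(ocr_text, corrected_text) / len(corrected_text)


def token_overlap(ocr_text: str, corrected_text: str) -> float:
    """Jaccard overlap of lowercased whitespace tokens (a crude proxy
    for how much a token-based downstream task would still see)."""
    ocr_tokens = set(ocr_text.lower().split())
    ref_tokens = set(corrected_text.lower().split())
    union = ocr_tokens | ref_tokens
    return len(ocr_tokens & ref_tokens) / len(union) if union else 1.0


# Hypothetical sample pair, invented for illustration only.
corrected = "The ship arrived at Liverpool on Tuesday"
ocr = "Tbe ship arrivcd at Liverpoo1 on Tuesday"

cer = character_error_rate(ocr, corrected)
overlap = token_overlap(ocr, corrected)
print(f"CER: {cer:.3f}, token overlap: {overlap:.2f}")
```

Three character substitutions give a low character error rate here, yet they damage three of the seven tokens, including the place name. That gap is why token-level tasks such as named entity recognition can degrade faster than a character-level score suggests.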
Stars: 9
Forks: 2
Language: Jupyter Notebook
License: CC-BY-4.0
Category:
Last pushed: Oct 16, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
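The same request can be made from Python. This is a minimal sketch using only the standard library; the helper names `build_url` and `fetch_quality` are hypothetical, and the response is assumed to be JSON, which the listing does not confirm.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def build_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper)."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for a repository.

    Assumption: the endpoint returns a JSON body. If it does not,
    json.loads raises an error and the raw bytes should be inspected.
    """
    with urllib.request.urlopen(build_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# data = fetch_quality(
#     "Living-with-machines",
#     "lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks",
# )
```

Keeping URL construction separate from the network call makes the request easy to check without spending any of the daily quota.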
Higher-rated alternatives
google/langfun: OO for LLMs
tanaos/artifex: Small Language Model Inference, Fine-Tuning and Observability. No GPU, no labeled data needed.
preligens-lab/textnoisr: Adding random noise to a text dataset, and controlling very accurately the quality of the result
vulnerability-lookup/VulnTrain: A tool to generate datasets and models based on vulnerabilities descriptions from @Vulnerability-Lookup.
masakhane-io/masakhane-mt: Machine Translation for Africa