JonnoB/scrambledtext
A Python library for creating synthetic corrupted OCR text using a Markov process
This tool helps researchers and developers working with OCR to generate realistic synthetic corrupted text. It takes in pairs of original and OCR-scanned texts and learns common error patterns like character substitutions or deletions. You then feed it clean text, and it outputs new text that looks like it's been processed by a faulty OCR system, complete with adjustable error rates.
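The learn-then-corrupt workflow described above can be sketched roughly as follows. This is a minimal illustration written for this page, not scrambledtext's actual API: the function names (`align_actions`, `learn`, `corrupt`), the `error_scale` parameter, and the three-state model (correct, substitute, delete) are all assumptions. It aligns each clean/OCR pair with Python's standard `difflib`, counts how often one state follows another, and then replays those transitions over clean text. Insertions are ignored to keep the sketch short.

```python
import difflib
import random
from collections import Counter, defaultdict

# Hypothetical state set for this sketch; insertions are deliberately left out.
STATES = ("correct", "substitute", "delete")


def align_actions(clean, noisy):
    """Yield (char, action, replacement) for every character of `clean`."""
    matcher = difflib.SequenceMatcher(None, clean, noisy, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for ch in clean[i1:i2]:
                yield ch, "correct", ch
        elif tag == "replace":
            src, dst = clean[i1:i2], noisy[j1:j2]
            for k, ch in enumerate(src):
                if k < len(dst):
                    yield ch, "substitute", dst[k]
                else:
                    yield ch, "delete", ""
        elif tag == "delete":
            for ch in clean[i1:i2]:
                yield ch, "delete", ""
        # "insert" opcodes are skipped in this sketch.


def learn(pairs):
    """Count state transitions and per-character substitutions."""
    transitions = defaultdict(Counter)    # previous state -> next-state counts
    substitutions = defaultdict(Counter)  # clean char -> replacement counts
    for clean, noisy in pairs:
        prev = "correct"
        for ch, action, repl in align_actions(clean, noisy):
            transitions[prev][action] += 1
            if action == "substitute":
                substitutions[ch][repl] += 1
            prev = action
    return transitions, substitutions


def corrupt(text, transitions, substitutions, error_scale=1.0, seed=None):
    """Walk the learned chain over `text`; `error_scale` scales error weights."""
    rng = random.Random(seed)
    state = "correct"
    out = []
    for ch in text:
        counts = transitions.get(state) or Counter()
        weights = [
            counts.get(s, 0) * (1.0 if s == "correct" else error_scale)
            for s in STATES
        ]
        if sum(weights) <= 0:
            state = "correct"  # no usable statistics: leave the character alone
        else:
            state = rng.choices(STATES, weights=weights)[0]
        if state == "substitute" and ch in substitutions:
            options = substitutions[ch]
            out.append(rng.choices(list(options), weights=list(options.values()))[0])
        elif state == "delete":
            continue
        else:
            out.append(ch)  # correct, or no substitution seen for this character
    return "".join(out)


# Toy training pairs (made up for illustration).
pairs = [("the cat sat on the mat", "tbe cat sat on tlie mat"),
         ("modern machine", "rnodem machine")]
transitions, substitutions = learn(pairs)
print(corrupt("the machine sat on the mat", transitions, substitutions, seed=0))
```

Setting `error_scale` above 1.0 makes errors more likely and setting it to 0 returns the input unchanged, which is one simple way to get the adjustable error rate mentioned above. The real library may parameterise this differently.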
No commits in the last 6 months.
Use this if you need to create large datasets of text with simulated OCR errors to train language models or test OCR correction algorithms.
Not ideal if you need to fix existing OCR errors in documents: this tool only generates synthetic errors and does not correct them.
Stars: 9
Forks: 1
Language: Python
License: MIT
Category:
Last pushed: Apr 30, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/JonnoB/scrambledtext"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
google/langfun
OO for LLMs
tanaos/artifex
Small Language Model Inference, Fine-Tuning and Observability. No GPU, no labeled data needed.
preligens-lab/textnoisr
Adding random noise to a text dataset, and controlling very accurately the quality of the result
vulnerability-lookup/VulnTrain
A tool to generate datasets and models based on vulnerability descriptions from @Vulnerability-Lookup.
masakhane-io/masakhane-mt
Machine Translation for Africa