JonnoB/scrambledtext

A Python library for creating synthetic corrupted OCR text using a Markov process

31 / 100 Emerging

This tool helps researchers and developers working with OCR to generate realistic synthetic corrupted text. It takes in pairs of original and OCR-scanned texts and learns common error patterns like character substitutions or deletions. You then feed it clean text, and it outputs new text that looks like it's been processed by a faulty OCR system, complete with adjustable error rates.
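The description above can be illustrated with a small sketch. This is not scrambledtext's actual API or implementation, which this page does not show. It is a minimal toy version of the general idea, assuming a two-state Markov chain (correct vs. error) and character alignment via the standard library's `difflib`. The function names `learn_substitutions` and `corrupt`, and the parameters `p_enter`, `p_stay`, and `p_delete`, are hypothetical.

```python
import random
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def learn_substitutions(pairs):
    """Count character substitutions seen in (clean, ocr) text pairs."""
    table = defaultdict(Counter)
    for clean, ocr in pairs:
        matcher = SequenceMatcher(None, clean, ocr, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only equal-length 'replace' blocks give a clear one-to-one mapping.
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                for c, o in zip(clean[i1:i2], ocr[j1:j2]):
                    table[c][o] += 1
    return table


def corrupt(text, table, p_enter=0.1, p_stay=0.3, p_delete=0.2, seed=None):
    """Corrupt text with a two-state Markov chain (hypothetical parameters).

    p_enter: chance of an error after a correct character.
    p_stay: chance of another error after an error (errors cluster).
    p_delete: within the error state, chance of dropping the character.
    """
    rng = random.Random(seed)
    out = []
    in_error = False
    for ch in text:
        # Markov step: the error probability depends on the previous state.
        in_error = rng.random() < (p_stay if in_error else p_enter)
        if not in_error:
            out.append(ch)
        elif rng.random() < p_delete:
            continue  # deletion: emit nothing for this character
        elif table.get(ch):
            # Substitution sampled in proportion to the learned counts.
            choices, weights = zip(*table[ch].items())
            out.append(rng.choices(choices, weights=weights)[0])
        else:
            out.append(ch)  # no learned substitution for this character
    return "".join(out)


# Illustrative usage with a made-up training pair.
table = learn_substitutions([("the cat", "tbe cat")])
print(corrupt("the hat on the chair", table, p_enter=0.2, seed=0))
```

In this sketch, raising `p_enter` increases the overall error rate, while `p_stay` controls how strongly errors cluster together.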

No commits in the last 6 months.

Use this if you need to create large datasets of text with simulated OCR errors to train language models or test OCR correction algorithms.

Not ideal if you need to fix existing OCR errors in documents; this tool only generates synthetic errors and does not correct them.

OCR-simulation natural-language-processing-development text-data-augmentation document-analysis
Stale 6m No Package No Dependents
Maintenance 2 / 25
Adoption 5 / 25
Maturity 16 / 25
Community 8 / 25

Stars 9
Forks 1
Language Python
License MIT
Last pushed Apr 30, 2025
Commits (30d) 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/JonnoB/scrambledtext"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
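The same request can be made from Python with the standard library. This is a sketch under two assumptions not confirmed by this page: that the path segments after `quality/` are category, owner, and repository, and that the endpoint returns JSON. The helper names `build_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category, owner, repo):
    """Build the endpoint URL (assumed layout: category/owner/repo)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch and decode the response, assuming the body is JSON."""
    url = build_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "JonnoB", "scrambledtext")` requests the same URL as the curl command above and counts against the same daily limit.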