microsoft/genalog
Genalog is an open-source, cross-platform Python package for generating synthetic document images with custom degradations and text alignment capabilities.
This tool helps machine learning engineers and data scientists create realistic synthetic document images from plain text and HTML templates. It takes your text and layout designs, then applies various visual degradations to mimic scanned documents with noise, blur, and other imperfections. The output is a dataset of diverse document images that can be used for training and evaluating optical character recognition (OCR) models.
346 stars. No commits in the last 6 months. Available on PyPI.
Use this if you need to generate large, varied datasets of document images with controlled noise for training and testing OCR systems or document processing pipelines.
Not ideal if you're looking for an off-the-shelf OCR solution or simply want to extract text from existing images without needing to create synthetic data.
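To make the idea of "controlled degradations" concrete, here is a minimal sketch in pure Python of a degradation pipeline applied to a tiny grayscale page. It does not use genalog's API: the function names (`box_blur`, `salt_and_pepper`, `degrade`) and the list-of-steps layout are hypothetical and chosen only for illustration. A real dataset would start from rendered document images, not a hand-built 8x8 grid.

```python
import random

def box_blur(img):
    """3x3 mean filter; edge pixels average over their in-bounds neighbours."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out

def salt_and_pepper(img, amount, rng):
    """Set roughly `amount` (0..1) of the pixels to pure black or pure white."""
    out = [row[:] for row in img]  # copy so the input stays untouched
    for y in range(len(out)):
        for x in range(len(out[0])):
            if rng.random() < amount:
                out[y][x] = rng.choice((0.0, 255.0))
    return out

def degrade(img, steps):
    """Apply each (function, kwargs) step in order and return the result."""
    for fn, kwargs in steps:
        img = fn(img, **kwargs)
    return img

# A white 8x8 "page" with one short black stroke standing in for text.
page = [[255.0] * 8 for _ in range(8)]
for x in range(2, 6):
    page[4][x] = 0.0

rng = random.Random(0)  # fixed seed so the noise is reproducible
steps = [
    (box_blur, {}),
    (salt_and_pepper, {"amount": 0.1, "rng": rng}),
]
noisy = degrade(page, steps)
```

Keeping each degradation as a separate step with its own parameters is what makes the noise "controlled": the same clean page can be run through different step lists to produce varied training samples.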
Stars: 346
Forks: 35
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Jan 18, 2024
Commits (30d): 0
Dependencies: 16
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/generative-ai/microsoft/genalog"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
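The same request can be made from Python with only the standard library. This is a sketch under assumptions: the page does not document the response format, so decoding it as JSON is a guess, and reading the path segments as category, owner, and repository is inferred from the one example URL above. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category, owner, repo):
    """Build the endpoint URL, mirroring the layout of the curl example."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"

def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the record for one repository.

    Assumes the endpoint returns a JSON body; that is not confirmed here.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example call (performs a network request):
#   data = fetch_quality("generative-ai", "microsoft", "genalog")
```

With the keyless limit of 100 requests/day, it makes sense to cache results locally when looping over many repositories.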
Related tools
sdv-dev/SDV: Synthetic data generation for tabular data
sdv-dev/SDGym: Benchmarking synthetic data generation methods.
NVIDIA-NeMo/DataDesigner: 🎨 NeMo Data Designer: A general library for generating high-quality synthetic data from scratch...
AlexanderVNikitin/tsgm: Generation and evaluation of synthetic time series datasets (also, augmentations,...
mostly-ai/mostlyai: Synthetic Data SDK ✨