microsoft/genalog

Genalog is an open-source, cross-platform Python package for generating synthetic document images with custom degradations and text alignment capabilities.

Established

This tool helps machine learning engineers and data scientists create realistic synthetic document images from plain text and HTML templates. It takes your text and layout designs, then applies various visual degradations to mimic scanned documents with noise, blur, and other imperfections. The output is a dataset of diverse document images that can be used for training and evaluating optical character recognition (OCR) models.
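The degradation step described above can be illustrated with a minimal, standard-library-only sketch. It does not use Genalog's API: every function name here (`render_blank_page`, `draw_text_bar`, `box_blur`, `salt_and_pepper`, `degrade`) is hypothetical, and a dark bar on a white grid stands in for rendered text. The idea is only to show how blur followed by seeded noise turns a clean page into a reproducible "scanned-looking" variant.

```python
import random


def render_blank_page(width, height, background=255):
    """Return a grayscale page as a list of rows (255 = white paper)."""
    return [[background] * width for _ in range(height)]


def draw_text_bar(page, top, left, bar_width, bar_height, ink=0):
    """Hypothetical stand-in for rendered text: paint a dark bar in place."""
    for y in range(top, top + bar_height):
        for x in range(left, left + bar_width):
            page[y][x] = ink
    return page


def box_blur(page):
    """Average each pixel with its in-bounds 3x3 neighbourhood."""
    height, width = len(page), len(page[0])
    blurred = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        total += page[ny][nx]
                        count += 1
            blurred[y][x] = total // count
    return blurred


def salt_and_pepper(page, amount, seed):
    """Flip roughly `amount` of the pixels to pure black or white."""
    rng = random.Random(seed)  # seeded so a dataset can be regenerated
    noisy = [row[:] for row in page]
    for y in range(len(noisy)):
        for x in range(len(noisy[0])):
            if rng.random() < amount:
                noisy[y][x] = rng.choice((0, 255))
    return noisy


def degrade(page, amount=0.05, seed=0):
    """Apply blur, then noise; the input page is left unmodified."""
    return salt_and_pepper(box_blur(page), amount, seed)


# Usage: one clean page, one degraded variant.
page = draw_text_bar(render_blank_page(20, 10), top=4, left=5,
                     bar_width=10, bar_height=2)
degraded = degrade(page, amount=0.05, seed=1)
```

Varying `amount` and `seed` over the same clean page yields many distinct degraded images that share one ground-truth text, which is the property that makes this kind of data useful for OCR training and evaluation.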

346 stars. No commits in the last 6 months. Available on PyPI.

Use this if you need to generate large, varied datasets of document images with controlled noise for training and testing OCR systems or document processing pipelines.

Not ideal if you're looking for an off-the-shelf OCR solution or simply want to extract text from existing images without needing to create synthetic data.

document-processing OCR-training synthetic-data-generation computer-vision ML-data-preparation

Stale 6m

Maintenance 0 / 25

Adoption 10 / 25

Maturity 25 / 25

Community 16 / 25

Stars

346

Forks

Language

Jupyter Notebook

License

MIT

Related tools

sdv-dev/SDV

Synthetic data generation for tabular data

sdv-dev/SDGym

Benchmarking synthetic data generation methods.

NVIDIA-NeMo/DataDesigner

🎨 NeMo Data Designer: A general library for generating high-quality synthetic data from scratch...

AlexanderVNikitin/tsgm

Generation and evaluation of synthetic time series datasets (also, augmentations,...

mostly-ai/mostlyai

Synthetic Data SDK ✨
