CambioML/uniflow

LLM-based text extraction from unstructured data such as PDFs, Word documents, and HTML pages. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!

/ 100

Established

This project helps data scientists quickly turn messy, unstructured documents such as PDFs, Word files, and HTML pages into clean, usable text datasets for training large language models (LLMs). It takes your raw documents and, using various LLMs, extracts the relevant information, transforms it into a structured format such as question-answer pairs, and helps create datasets for advanced LLM training techniques. Data scientists use it to prepare high-quality, privacy-preserved datasets efficiently, which speeds up LLM development.
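The extract-then-transform flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not uniflow's API: `split_into_chunks`, `build_qa_dataset`, and `fake_generate_qa` are hypothetical names, the input is assumed to be text already extracted from a document, and the LLM call is replaced by a stub so the example runs offline.

```python
import json
from typing import Callable, Dict, List


def split_into_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Group blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def build_qa_dataset(
    text: str,
    generate_qa: Callable[[str], Dict[str, str]],
    max_chars: int = 500,
) -> List[Dict[str, str]]:
    """Turn extracted text into context/question/answer records."""
    records: List[Dict[str, str]] = []
    for chunk in split_into_chunks(text, max_chars):
        qa = generate_qa(chunk)  # in practice this would be an LLM call
        records.append(
            {"context": chunk, "question": qa["question"], "answer": qa["answer"]}
        )
    return records


def fake_generate_qa(chunk: str) -> Dict[str, str]:
    """Hypothetical stand-in for an LLM: answers with the first sentence."""
    first_sentence = chunk.split(".")[0].strip()
    return {"question": "What does this passage state first?", "answer": first_sentence}


if __name__ == "__main__":
    sample = "Uniflow extracts text.\n\nIt then builds datasets."
    for record in build_qa_dataset(sample, fake_generate_qa, max_chars=30):
        print(json.dumps(record))
```

Passing the generator as a callable keeps the chunking and record-building steps independent of any particular model, so the stub can be swapped for a real LLM client without touching the rest of the sketch.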

234 stars. No commits in the last 6 months.

Use this if you are a data scientist struggling to extract and transform information from diverse document types into structured datasets suitable for LLM fine-tuning and training.

Not ideal if you mainly need basic keyword search or simple text extraction and have no use for LLM-based transformation or dataset generation.

LLM-training data-preparation document-processing NLP-engineering financial-data-extraction

Stale 6m No Package No Dependents

Maintenance 2 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 22 / 25

Stars

234

Forks

Language

Python

License

Apache-2.0

Related tools

NanoNets/docstrange

Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple...

th1nhhdk/local_ai_ocr

A local, offline (after initial setup), portable OCR software that can process images and PDF...

Dicklesworthstone/llm_aided_ocr

Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking,...

emcf/thepipe

Get clean data from tricky documents, powered by vision-language models ⚡

langstruct-ai/langstruct

Extract structured data from any content using LLMs.
