3DCF-Labs/doc2dataset

3DCF / doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.

Score: 40 / 100 (Emerging)

This project helps AI platform teams and data analysts convert various document types like PDFs, Markdown, HTML, or CSVs into structured datasets for large language models (LLMs). It processes your raw documents and outputs ready-to-use datasets for tasks like Q&A, summarization, or RAG, ensuring numeric accuracy. This tool is for anyone building or fine-tuning LLMs who needs reliable, pre-processed document data.

Use this if you need to reliably transform a wide range of documents into clean, structured datasets for training or operating large language models, especially if numeric integrity is critical.

Not ideal if you only need a simple PDF text extractor or a basic document viewer; this tool is designed for advanced LLM data preparation.

LLM data preparation, document processing, numeric integrity, AI platform engineering, fintech data analysis
No Package, No Dependents
Maintenance 10 / 25
Adoption 8 / 25
Maturity 13 / 25
Community 9 / 25

Stars: 56
Forks: 5
Language: Rust
License: Apache-2.0
Last pushed: Feb 10, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/3DCF-Labs/doc2dataset"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
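
The same request can be made from a script. The following is a minimal Python sketch using only the standard library. It reuses the endpoint from the curl command above; the helper names `quality_url` and `fetch_quality` are hypothetical, and the assumption that the response body is JSON is not confirmed by this page. How a key is supplied is not documented here, so the sketch only covers keyless access.

```python
import json
import urllib.request
from urllib.parse import quote

# Endpoint prefix taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper)."""
    # Percent-encode each path segment so unusual names cannot break the URL.
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record for a repository.

    Assumption: the API returns a JSON body. The field names are not
    documented on this page, so the decoded value is returned as-is.
    """
    request = urllib.request.Request(
        quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("3DCF-Labs", "doc2dataset")` issues one request against the keyless quota of 100 requests/day, so a script that polls many repositories may want to cache the results locally.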