3DCF-Labs/doc2dataset

3DCF / doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.

/ 100

Emerging

This project helps AI platform teams and data analysts convert documents such as PDF, Markdown, HTML, or CSV files into structured datasets for large language models (LLMs). It takes raw documents and outputs ready-to-use datasets for tasks like Q&A, summarization, or RAG, with NumGuard checks for numeric integrity. It is aimed at anyone building or fine-tuning LLMs who needs reliable, pre-processed document data.

Use this if you need to reliably transform a wide range of documents into clean, structured datasets for training or operating large language models, especially if numeric integrity is critical.

Not ideal if you only need a simple PDF text extractor or a basic document viewer; this tool is designed for advanced LLM data preparation.

LLM data preparation, document processing, numeric integrity, AI platform engineering, fintech data analysis

No Package, No Dependents

Maintenance 10 / 25

Adoption 8 / 25

Maturity 13 / 25

Community 9 / 25

Language: Rust

License: Apache-2.0

Higher-rated alternatives

thiswillbeyourgithub/wdoc

Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype,...

Arterning/DeepParseX

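DeepParseX is a powerful multimodal document parsing and knowledge management platform that supports intelligent parsing of PDF, Word, Excel, PPT, image, video, audio, and other file formats, automatically extracts key information, and builds...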

NoEdgeAI/pdfdeal

A python wrapper for the Doc2X API and comes with native texts processing (to improve PDF recall...

laxmimerit/RAGWire

Production-grade RAG toolkit — ingest PDFs, DOCX, XLSX into Qdrant with LLM metadata extraction,...

David-Lolly/ViewRAG

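An illustrated PDF RAG system that supports layout-aware chunking, in-depth chart and figure understanding, and precise visual source tracing. Multimodal PDF RAG: Features layout-aware chunking,...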
