3DCF-Labs/doc2dataset
3DCF / doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.
This project helps AI platform teams and data analysts convert documents such as PDF, Markdown, HTML, and CSV files into structured datasets for large language models (LLMs). It takes raw documents as input and outputs ready-to-use datasets for tasks like Q&A, summarization, or RAG, with NumGuard checks to preserve numeric integrity. It is aimed at anyone building or fine-tuning LLMs who needs reliable, pre-processed document data.
Use this if you need to reliably transform a wide range of documents into clean, structured datasets for training or operating large language models, especially if numeric integrity is critical.
Not ideal if you only need a simple PDF text extractor or a basic document viewer; this tool is designed for advanced LLM data preparation.
Stars: 56
Forks: 5
Language: Rust
License: Apache-2.0
Category:
Last pushed: Feb 10, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/3DCF-Labs/doc2dataset"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
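The same request can be made from a script. The sketch below is a minimal illustration in Python using only the standard library. It assumes the URL pattern generalizes from the single curl example above (base path followed by owner and repository name). The listing does not document the response format, so the helper returns the raw body as text rather than parsing it. The function names `build_quality_url` and `fetch_quality` are hypothetical and not part of any published client.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path copied from the curl example; treating "rag" as a fixed
# path segment is an assumption, since only one URL is shown.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/rag"


def build_quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper)."""
    # Percent-encode each segment so unusual characters cannot alter the path.
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body as text.

    The response schema is not documented in the listing, so nothing is
    parsed here; UTF-8 decoding is also an assumption.
    """
    with urlopen(build_quality_url(owner, repo), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Prints the URL only; call fetch_quality(...) to issue a real request,
    # which counts against the daily request limit.
    print(build_quality_url("3DCF-Labs", "doc2dataset"))
```

Keeping URL construction separate from the network call makes it possible to check the URL without spending any of the daily request quota.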
Higher-rated alternatives
thiswillbeyourgithub/wdoc
Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype,...
Arterning/DeepParseX
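DeepParseX is a powerful multimodal document parsing and knowledge management platform. It supports intelligent parsing of many file formats, including PDF, Word, Excel, PPT, images, video, and audio, automatically extracts key information, and builds...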
NoEdgeAI/pdfdeal
A Python wrapper for the Doc2X API that comes with native text processing (to improve PDF recall...
laxmimerit/RAGWire
Production-grade RAG toolkit — ingest PDFs, DOCX, XLSX into Qdrant with LLM metadata extraction,...
David-Lolly/ViewRAG
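A PDF RAG system that combines text and images: supports layout-aware chunking, deep understanding of figures and tables, and precise visual source tracing. Multimodal PDF RAG: Features layout-aware chunking,...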