ssbuild/aigc_data
Share data, prompt data, pretraining data
This project provides a collection of large-scale text datasets for training large language models. It offers raw textual data, some requiring cleaning, that can be used as input for developing AI models capable of generating human-like text. It is designed for researchers, AI engineers, and data scientists working on advanced natural language processing tasks.
No commits in the last 6 months.
Use this if you are building or fine-tuning large language models and need access to extensive pre-training datasets in English and Chinese.
Not ideal if you need ready-to-use, perfectly clean, or domain-specific datasets for simpler machine learning tasks.
Stars: 36
Forks: 6
Language: Python
License: Apache-2.0
Category:
Last pushed: Nov 30, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/ssbuild/aigc_data"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
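The same request can be made from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint returns a JSON body, which the listing does not state; the helper names `build_quality_url` and `fetch_quality` are hypothetical and exist only for illustration.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def build_quality_url(owner: str, repo: str) -> str:
    """Build the request URL for a repository (hypothetical helper)."""
    # Percent-encode each path segment so unusual names cannot break the URL.
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for a repository.

    Assumption: the response body is JSON. The listing does not
    document the response format, so adjust the parsing if needed.
    """
    request = urllib.request.Request(
        build_quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Performs a real network call; counts against the daily request limit.
    print(fetch_quality("ssbuild", "aigc_data"))
```

Keeping URL construction separate from the network call makes the request easy to inspect or log before it is sent, which helps when staying under the daily request limit.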
Higher-rated alternatives
yarin-zhang/AI-Gist
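✨ AI Gist is a privacy-first AI prompt management tool that aims to get the most value out of your personal collection of AI prompts. Core features include variable substitution, Jinja templates, AI generation and tuning, version history, and cloud backup.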
uniai-lab/uniai
AI models all in one!
win4r/AISuperDomain
Aila(AI超元域): The premier AI integration tool for Windows, macOS, and Android. Ask once, get...
NitroRCr/AIaW
AI as Workspace - An elegant AI chat client. Full-featured, lightweight. Support multiple...
Jun-Murakami/AI-Browser
Client app for ChatGPT, Gemini, Claude, Kimi, DeepSeek, Grok, Nani !?, Felo, Cody, JENOVA,...