YJiangcm/WebR

[ACL 2025] Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction

29 / 100 (Experimental)

This project helps AI developers create high-quality instruction-following datasets for training large language models (LLMs). Given raw web documents as input, it automatically generates structured instruction-response pairs. It is aimed at AI engineers and researchers working to improve LLM instruction-following performance on specific tasks.

No commits in the last 6 months.

Use this if you need to synthesize large, diverse instruction-tuning datasets from unstructured web content to enhance your LLM's ability to follow complex instructions.

Not ideal if you already have curated instruction-response pairs or are looking for a pre-trained LLM without needing to generate new training data.

LLM training data synthesis · natural language processing · AI model development
No License · Stale 6m · No Package · No Dependents
Maintenance 2 / 25
Adoption 5 / 25
Maturity 8 / 25
Community 14 / 25

Stars

11

Forks

3

Language

Python

License

None

Last pushed

May 15, 2025

Commits (30d)

0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/YJiangcm/WebR"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
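
The curl example above can also be called from a script. A minimal Python sketch, assuming only the endpoint shape shown in the curl command (the JSON response fields are not documented here, so none are hard-coded):

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(ecosystem: str, owner: str, repo: str) -> str:
    """Build the report URL for a repo, matching the curl example above."""
    return f"{BASE}/{ecosystem}/{owner}/{repo}"


def fetch_quality(ecosystem: str, owner: str, repo: str) -> dict:
    """Fetch and decode the JSON quality report (field names not documented here)."""
    with urllib.request.urlopen(quality_url(ecosystem, owner, repo)) as resp:
        return json.load(resp)


# Example URL for this repo:
# quality_url("nlp", "YJiangcm", "WebR")
# → "https://pt-edge.onrender.com/api/v1/quality/nlp/YJiangcm/WebR"
```

No API key is required at the free tier, so `fetch_quality("nlp", "YJiangcm", "WebR")` should work as-is within the 100 requests/day limit.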