YJiangcm/WebR
[ACL 2025] Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
This project helps AI developers build high-quality instruction-following datasets for training large language models (LLMs). Given raw web documents as input, it automatically generates structured instruction-response pairs. It is aimed at AI engineers and researchers improving LLM performance on instruction-following tasks.
No commits in the last 6 months.
Use this if you need to synthesize large, diverse instruction-tuning datasets from unstructured web content to enhance your LLM's ability to follow complex instructions.
Not ideal if you already have curated instruction-response pairs or are looking for a pre-trained LLM without needing to generate new training data.
Stars
11
Forks
3
Language
Python
License
—
Category
—
Last pushed
May 15, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/YJiangcm/WebR"
Open to everyone: 100 requests/day with no key required. Get a free key for 1,000 requests/day.
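The same request can be made programmatically. A minimal Python sketch, using only the standard library: the endpoint URL comes from the curl command above, but the helper names (`quality_url`, `fetch_quality`) and the assumption that the endpoint returns a JSON object are illustrative, not part of the documented API.

```python
import json
import urllib.request

# Base endpoint taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def quality_url(owner: str, repo: str) -> str:
    """Build the API URL for a given GitHub owner/repo pair."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the quality record (requires network access).

    Assumes the endpoint returns a JSON object; the field names in the
    response are not documented here, so inspect the result before use.
    """
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the URL for this repository; call fetch_quality() to hit the API.
    print(quality_url("YJiangcm", "WebR"))
```

Within the free tier (100 requests/day), no header or key is needed; with a key you would add it per the provider's instructions.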
Higher-rated alternatives
n-waves/multifit
The code to reproduce results from paper "MultiFiT: Efficient Multi-lingual Language Model...
princeton-nlp/SimCSE
[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
yxuansu/SimCTG
[NeurIPS'22 Spotlight] A Contrastive Framework for Neural Text Generation
alibaba-edu/simple-effective-text-matching
Source code of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".
Shark-NLP/OpenICL
OpenICL is an open-source framework to facilitate research, development, and prototyping of...