YJiangcm/WebR
[ACL 2025] Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
This project helps AI developers build high-quality instruction-following datasets for training large language models (LLMs). Given raw web documents as input, it automatically generates structured instruction-response pairs. It is aimed at AI engineers and researchers improving LLM performance on instruction-following tasks.
No commits in the last 6 months.
Use this if you need to synthesize large, diverse instruction-tuning datasets from unstructured web content to enhance your LLM's ability to follow complex instructions.
Not ideal if you already have curated instruction-response pairs or are looking for a pre-trained LLM without needing to generate new training data.
Stars
11
Forks
3
Language
Python
License
—
Category
—
Last pushed
May 15, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/YJiangcm/WebR"
Open to everyone: 100 requests/day with no key required. Get a free key for 1,000 requests/day.
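The same request can be made programmatically. A minimal Python sketch, using only the standard library: the endpoint URL comes from the curl command above, but the helper names (`quality_url`, `fetch_quality`) and the assumption that the endpoint returns a JSON object are illustrative, not part of the documented API.

```python
import json
import urllib.request

# Base endpoint taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/nlp"


def quality_url(owner: str, repo: str) -> str:
    """Build the API URL for a given GitHub owner/repo pair."""
    return f"{API_BASE}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch and decode the quality record (requires network access).

    Assumes the endpoint returns a JSON object; the field names in the
    response are not documented here, so inspect the result before use.
    """
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the URL for this repository; call fetch_quality() to hit the API.
    print(quality_url("YJiangcm", "WebR"))
```

Within the free tier (100 requests/day), no header or key is needed; with a key you would add it per the provider's instructions.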
Higher-rated alternatives
n-waves/multifit
The code to reproduce results from paper "MultiFiT: Efficient Multi-lingual Language Model...
princeton-nlp/SimCSE
[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
yxuansu/SimCTG
[NeurIPS'22 Spotlight] A Contrastive Framework for Neural Text Generation
alibaba-edu/simple-effective-text-matching
Source code of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".
Shark-NLP/OpenICL
OpenICL is an open-source framework to facilitate research, development, and prototyping of...