TIGER-AI-Lab/VisualWebInstruct
The official repo for "VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search" [EMNLP25]
This project helps AI researchers and developers scale up multimodal reasoning data. It uses Google Image Search to find web pages containing images similar to a set of seed images, then extracts and synthesizes question-answer pairs from over 700,000 unique web sources. The output is a large, high-quality dataset of nearly 900,000 visual and text QA pairs, which can be used to improve the reasoning abilities of Vision-Language Models.
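Conceptually, the collection pipeline can be sketched as below. This is a minimal illustration only: every function name, signature, and body is a placeholder assumption, not the repo's actual code.

# Minimal sketch of a VisualWebInstruct-style collection pipeline.
# All names and bodies below are illustrative placeholders; the
# repo's real code, prompts, and APIs will differ.

def search_similar_pages(image_path: str) -> list[str]:
    # Placeholder for the image-search step: the paper uses Google
    # Image Search to find pages containing visually similar images.
    return []

def extract_qa_candidates(url: str) -> list[dict]:
    # Placeholder for scraping a page and pulling out raw
    # question/answer text and any associated images.
    return []

def synthesize_qa(candidates: list[dict]) -> list[dict]:
    # Placeholder for filtering and rewriting raw candidates into
    # clean QA pairs (e.g., with an LLM-based cleaning pass).
    return candidates

def build_dataset(seed_images: list[str]) -> list[dict]:
    qa_pairs: list[dict] = []
    for image in seed_images:
        for url in search_similar_pages(image):
            qa_pairs.extend(synthesize_qa(extract_qa_candidates(url)))
    return qa_pairs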
Use this if you are an AI researcher or machine learning engineer looking to enhance your Vision-Language Models' reasoning capabilities by providing them with a diverse, large-scale multimodal dataset.
Not ideal if you are looking for a ready-to-use application or a model for direct inference, as this project focuses on dataset generation and model training.
Stars: 38
Forks: 1
Language: Python
License: MIT
Last pushed: Feb 01, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/TIGER-AI-Lab/VisualWebInstruct"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
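The same endpoint can also be called from Python; below is a minimal sketch using the requests library (the response schema is not documented on this page, so the payload is printed as raw JSON):

import requests

# Quality-metrics endpoint shown above; no key needed for up to
# 100 requests/day.
URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "transformers/TIGER-AI-Lab/VisualWebInstruct")

resp = requests.get(URL, timeout=30)
resp.raise_for_status()  # surface HTTP errors (e.g., rate limiting)
print(resp.json())       # schema undocumented here; print raw JSON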
Higher-rated alternatives
DaoD/INTERS
This is the repository for our paper "INTERS: Unlocking the Power of Large Language Models in...
declare-lab/instruct-eval
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca...
Haiyang-W/TokenFormer
[ICLR2025 Spotlight] Official Implementation of TokenFormer: Rethinking Transformer Scaling...
hkust-nlp/deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
kehanlu/DeSTA2
Code and model for ICASSP 2025 Paper "Developing Instruction-Following Speech Language Model...