thubZ09/vision-language-model-research
Hub for researchers exploring VLMs and Multimodal Learning :)
This hub is a resource for AI researchers and practitioners working with multimodal data, focused on how models understand and generate content from both images and text. It provides a curated collection of vision-language models, datasets, and benchmarks, so researchers can quickly find the information and tools needed to build or evaluate systems that combine visual and textual inputs.
Use this if you are an AI researcher or machine learning engineer exploring or developing systems that analyze and connect information from both images and text, such as for visual question answering or image captioning.
Not ideal if you are a non-technical end-user looking for a ready-to-use application or a high-level overview of AI without technical details.
Stars: 62
Forks: 5
Language: —
License: MIT
Category: ML Frameworks
Last pushed: Feb 25, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/thubZ09/vision-language-model-research"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
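The same data can be fetched programmatically. Below is a minimal sketch in Python using the requests library, assuming the endpoint returns a JSON body; the response schema is not documented on this page, so the script simply prints whatever comes back.

import requests

# Same endpoint as the curl example above; the anonymous tier allows
# 100 requests/day, so no API key is set here.
URL = (
    "https://pt-edge.onrender.com/api/v1/quality/"
    "ml-frameworks/thubZ09/vision-language-model-research"
)

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # fail loudly on 4xx/5xx responses

data = resp.json()  # assumes a JSON response; exact schema not documented here
print(data)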
Higher-rated alternatives
open-mmlab/mmpretrain
OpenMMLab Pre-training Toolbox and Benchmark
facebookresearch/mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis
Papers, code and datasets about deep learning and multi-modal learning for video analysis
KaiyangZhou/pytorch-vsumm-reinforce
Unsupervised video summarization with deep reinforcement learning (AAAI'18)
adambielski/siamese-triplet
Siamese and triplet networks with online pair/triplet mining in PyTorch