Vision Language Models ML Frameworks

Frameworks and implementations for multimodal models that combine vision and language capabilities, including vision-language transformers, image-text generation, and visual question answering systems. Does NOT include single-modality models, general computer vision frameworks, or task-specific applications like document OCR or license plate recognition.

There are 111 vision-language model frameworks tracked. Seven score above 50 (the established tier). The highest-rated is open-mmlab/mmpretrain at 60/100 with 3,837 stars. Only 1 of the top 10 is actively maintained.

Get all 111 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=vision-language-models&limit=111"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
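If you prefer to query the endpoint from a script, the request URL can be assembled from the three documented parameters (`domain`, `subcategory`, `limit`). This is a minimal sketch; the response schema beyond "projects as JSON" is not documented here, so the fetch is shown but left commented rather than assuming field names:

```python
import urllib.parse

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_query_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the dataset query URL from its documented parameters."""
    params = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE}?{params}"

# Raise limit to cover the full set (keyless access is capped at 100 requests/day):
url = build_query_url("ml-frameworks", "vision-language-models", limit=111)
# import json, urllib.request
# data = json.load(urllib.request.urlopen(url))  # schema not shown in this listing
```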

| # | Framework | Description | Score | Tier |
|---|-----------|-------------|-------|------|
| 1 | open-mmlab/mmpretrain | OpenMMLab Pre-training Toolbox and Benchmark | 60 | Established |
| 2 | facebookresearch/mmf | A modular framework for vision & language multimodal research from Facebook... | 58 | Established |
| 3 | HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis | Papers, code and datasets about deep learning and multi-modal learning for... | 51 | Established |
| 4 | KaiyangZhou/pytorch-vsumm-reinforce | Unsupervised video summarization with deep reinforcement learning (AAAI'18) | 51 | Established |
| 5 | adambielski/siamese-triplet | Siamese and triplet networks with online pair/triplet mining in PyTorch | 51 | Established |
| 6 | kuanghuei/SCAN | PyTorch source code for "Stacked Cross Attention for Image-Text Matching" (ECCV 2018) | 50 | Established |
| 7 | friedrichor/Awesome-Multimodal-Papers | A curated list of awesome multimodal studies. | 50 | Established |
| 8 | batra-mlp-lab/visdial | [CVPR 2017] Torch code for Visual Dialog | 49 | Emerging |
| 9 | pliang279/awesome-multimodal-ml | Reading list for research topics in multimodal machine learning | 48 | Emerging |
| 10 | kezhang-cs/Video-Summarization-with-LSTM | Implementation of our ECCV 2016 paper (Video Summarization with Long... | 48 | Emerging |
| 11 | vbalnt/tfeat | TFeat descriptor models for BMVC 2016 paper "Learning local feature... | 47 | Emerging |
| 12 | codebyshibsankar/image_triplet_loss | Image similarity using triplet loss | 47 | Emerging |
| 13 | kyegomez/HRTX | Multi-Modal Multi-Embodied Hivemind-like Iteration of RTX-2 | 47 | Emerging |
| 14 | pliang279/MultiBench | [NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning | 47 | Emerging |
| 15 | willxxy/awesome-mmps | Corpus of resources for multimodal machine learning with physiological... | 46 | Emerging |
| 16 | kyegomez/Med-PaLM | Towards Generalist Biomedical AI | 45 | Emerging |
| 17 | nekhtiari/image-similarity-measures | :chart_with_upwards_trend: Implementation of eight evaluation metrics to... | 45 | Emerging |
| 18 | mlfoundations/open_flamingo | An open-source framework for training large multimodal models. | 45 | Emerging |
| 19 | landskape-ai/triplet-attention | Official PyTorch implementation for "Rotate to Attend: Convolutional Triplet... | 44 | Emerging |
| 20 | Cloud-CV/VQA | CloudCV Visual Question Answering Demo | 44 | Emerging |
| 21 | OpenBioLink/ThoughtSource | A central, open resource for data and tools related to chain-of-thought... | 43 | Emerging |
| 22 | Cadene/vqa.pytorch | Visual Question Answering in PyTorch | 43 | Emerging |
| 23 | thubZ09/vision-language-model-research | Hub for researchers exploring VLMs and multimodal learning | 43 | Emerging |
| 24 | thuiar/MIntRec | MIntRec: A New Dataset for Multimodal Intent Recognition (ACM MM 2022) | 42 | Emerging |
| 25 | aioz-ai/CFR_VQA | Coarse-to-Fine Reasoning for Visual Question Answering (CVPRW'22) | 42 | Emerging |
| 26 | maruya24/pytorch_robotics_transformer | A PyTorch re-implementation of RT-1 (Robotics Transformer) | 42 | Emerging |
| 27 | kyegomez/Fuyu | Implementation of Adept's Fuyu, an all-new multi-modality model, in PyTorch | 41 | Emerging |
| 28 | ManifoldRG/NEKO | Implementation of a GATO-style generalist multimodal model capable of image,... | 41 | Emerging |
| 29 | abhshkdz/neural-vqa | :grey_question: Visual Question Answering in Torch | 41 | Emerging |
| 30 | mlbio-epfl/joint-inference | [ICLR 2025] Large (Vision) Language Models are Unsupervised In-Context Learners | 40 | Emerging |
| 31 | thswodnjs3/CSTA | The official code of "CSTA: CNN-based Spatiotemporal Attention for Video... | 40 | Emerging |
| 32 | IBM/AdaMML | Official implementation of AdaMML. https://arxiv.org/abs/2105.05165 | 40 | Emerging |
| 33 | aioz-ai/MICCAI21_MMQ | Multiple Meta-model Quantifying for Medical Visual Question Answering (MICCAI 2021) | 40 | Emerging |
| 34 | monjurulkarim/DSTA | Implementation code for the paper "A Dynamic Spatial-temporal... | 40 | Emerging |
| 35 | yuanze-lin/REVIVE | [NeurIPS 2022] Official code for REVIVE: Regional Visual Representation... | 39 | Emerging |
| 36 | jingyi0000/VLM_survey | Collection of awesome vision-language models for vision tasks | 39 | Emerging |
| 37 | TIGER-AI-Lab/VideoScore | Official repo for "VideoScore: Building Automatic Metrics to Simulate... | 38 | Emerging |
| 38 | abhshkdz/neural-vqa-attention | :question: Attention-based Visual Question Answering in Torch | 38 | Emerging |
| 39 | zchuz/CoT-Reasoning-Survey | [ACL 2024] A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future | 36 | Emerging |
| 40 | williamcfrancis/Visual-Question-Answering-using-Stacked-Attention-Networks | PyTorch implementation of VQA using Stacked Attention Networks: Multimodal... | 36 | Emerging |
| 41 | subho406/OmniNet | Official PyTorch implementation of "OmniNet: A unified architecture for... | 36 | Emerging |
| 42 | real-stanford/semantic-abstraction | [CoRL 2022] This repository contains code for generating relevancies,... | 35 | Emerging |
| 43 | RManLuo/MAMDR | Official code implementation for ICDE 23 paper MAMDR: A Model Agnostic... | 35 | Emerging |
| 44 | pranv/ARC | Code for Attentive Recurrent Comparators | 34 | Emerging |
| 45 | tgxs002/wikiscenes | Towers of Babel: Combining Images, Language, and 3D Geometry for Learning... | 34 | Emerging |
| 46 | nerdimite/neuro-symbolic-ai-soc | Neuro-Symbolic Visual Question Answering on Sort-of-CLEVR using PyTorch | 34 | Emerging |
| 47 | pliang279/MultiViz | [ICLR 2023] MultiViz: Towards Visualizing and Understanding Multimodal Models | 34 | Emerging |
| 48 | invictus717/MiCo | [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale | 34 | Emerging |
| 49 | AlwaysFHao/TiM4Rec | [Neurocomputing 2025] The code for the paper "TiM4Rec: An Efficient... | 33 | Emerging |
| 50 | Jakobovski/decoupled-multimodal-learning | A decoupled, generative, unsupervised, multimodal neural architecture. | 33 | Emerging |
| 51 | neulab/CulturalGround | This repository provides the official resources for the EMNLP 2025 paper... | 33 | Emerging |
| 52 | imneonizer/pytorch-triplet-loss | Birds 400-Species Image Classification using PyTorch Metric Learning... | 32 | Emerging |
| 53 | Skyyyy0920/MTNet | Code implementation for our paper "Learning Time Slot Preferences via... | 32 | Emerging |
| 54 | Rishit-dagli/Astroformer | This repository contains the official implementation of Astroformer, an ICLR... | 32 | Emerging |
| 55 | kyegomez/AutoRT | Implementation of AutoRT: "AutoRT: Embodied Foundation Models for Large... | 32 | Emerging |
| 56 | Soumya-Chakraborty/Unsupervised-video-summarization-with-deep-GAN-reinforcement-learning | Unsupervised video summarization with deep (GAN) reinforcement learning | 32 | Emerging |
| 57 | tensorpix/benchmarking-cv-models | Benchmark computer vision ML models in 3 minutes | 32 | Emerging |
| 58 | etornam45/vl-jepa | This VL-JEPA implementation takes direct inspiration from the original VL-JEPA paper | 30 | Emerging |
| 59 | cpystan/WSI-VQA | [ECCV 2024] Official implementation of "WSI-VQA: Interpreting Whole Slide... | 30 | Emerging |
| 60 | AceCHQ/MMIQ | This repo contains evaluation code for the MM-IQ benchmark. | 30 | Emerging |
| 61 | lilygeorgescu/MHCA | Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for... | 29 | Experimental |
| 62 | liveseongho/Awesome-Video-Language-Understanding | A survey on video and language understanding. | 29 | Experimental |
| 63 | ntkhoa95/multimodal-for-vision | Vision Framework: A modular multi-agent system for computer vision tasks,... | 29 | Experimental |
| 64 | le-liang/Multimodal-Wireless | Python scripts and assets related to the Multimodal-Wireless dataset. The... | 29 | Experimental |
| 65 | fansunqi/VideoTool | Official repository for NeurIPS'25 paper "Tool-Augmented Spatiotemporal... | 29 | Experimental |
| 66 | Peachypie98/CBAM | CBAM: Convolutional Block Attention Module for CIFAR100 on VGG19 | 29 | Experimental |
| 67 | yousefkotp/Visual-Question-Answering | A lightweight deep learning model with a web application to answer... | 29 | Experimental |
| 68 | vtu81/NaiveVQA | A Visual Question Answering model implemented in MindSpore and PyTorch. The... | 28 | Experimental |
| 69 | zamaex96/Hybrid-CNN-LSTM-with-Spatial-Attention | This documents the training and evaluation of a hybrid CNN-LSTM attention... | 28 | Experimental |
| 70 | VQA-Team/Visual-Question-Answering | The project is an Android application aimed to help the visually impaired by... | 27 | Experimental |
| 71 | kyegomez/NeVA | The open-source implementation of "NeVA: NeMo Vision and Language Assistant" | 27 | Experimental |
| 72 | uakarsh/med-vqa | An approach for solving the problem of medical visual question answering | 27 | Experimental |
| 73 | RobotiXX/multimodal-fusion-network | This repository contains all the code for parsing, transforming and training... | 27 | Experimental |
| 74 | kyegomez/MultiModal-ToT | Multi-Modal Tree of Thoughts for DALLE-3-like auto self-improvement | 27 | Experimental |
| 75 | schwettmann/visual-vocab | PyTorch-based tools for constructing a vocabulary of visual concepts in a GAN. | 27 | Experimental |
| 76 | naamiinepal/tunevlseg | [ACCV 2024] TuneVLSeg: Prompt Tuning Benchmark for Vision-Language... | 25 | Experimental |
| 77 | yuhui-zh15/VLMClassifier | Official implementation of "Why are Visually-Grounded Language Models Bad at... | 25 | Experimental |
| 78 | projectayre/ayre | Visual Question Answering with an added novel semantic analysis approach.... | 24 | Experimental |
| 79 | clear-nus/MuMMI | Multi-Modal Mutual Information (MuMMI) Training for Robust Self-Supervised... | 24 | Experimental |
| 80 | SriramPingali/Multi-Modal-Recommendation-System | Official code for the paper "Towards developing a Multi Modal Video... | 23 | Experimental |
| 81 | aaaastark/hybrid-model-with-cnn-lstm-python | Hybrid model with CNN and LSTM for the VMD dataset using Python | 23 | Experimental |
| 82 | kyegomez/VisionLLaMA | Implementation of VisionLLaMA from the paper "VisionLLaMA: A Unified LLaMA... | 22 | Experimental |
| 83 | rkl71/MambaRec | [CIKM 2025] Source code for "Modality Alignment with Multi-scale Bilateral... | 22 | Experimental |
| 84 | fansunqi/AKeyS | Agentic Keyframe Search for Video Question Answering | 22 | Experimental |
| 85 | ankitsharma-tech/Image-Triplet-Loss | Image similarity using triplet loss. | 22 | Experimental |
| 86 | iluvn01/VFMTok | 🖼️ Leverage vision foundation models to transform visual data into effective... | 22 | Experimental |
| 87 | guyyariv/vLMIG | This repo contains the official PyTorch implementation of vLMIG: Improving... | 22 | Experimental |
| 88 | anujanegi/VQA | Visual Question Answering system | 21 | Experimental |
| 89 | RobinDong/tiny_multimodal | Tiny and simple implementation of multimodal models | 20 | Experimental |
| 90 | google/crossmodal-3600 | Crossmodal-3600 dataset | 20 | Experimental |
| 91 | cronenberg64/VLM-arch | Systematic benchmarking of modern vision backbones under small-data... | 20 | Experimental |
| 92 | Dafterfly/Quick_Vilt | A CLI and GUI for using the Vision-and-Language Transformer (ViLT) model for... | 19 | Experimental |
| 93 | alsaniie/Image-Similarity-Index-SSIM-analysis- | In image processing, an image similarity index, also known as a similarity... | 19 | Experimental |
| 94 | YeLuoSuiYou/openstorypp | We introduce OpenStory++, a large-scale open-domain dataset focusing on... | 19 | Experimental |
| 95 | Hodasia/Awesome-Vision-Language-Finetune | Awesome list of vision-language prompt papers | 19 | Experimental |
| 96 | lyuchenyang/Semantic-aware-VideoQA | Code for ACL SRW 2023 paper "Semantic-aware Dynamic... | 19 | Experimental |
| 97 | aiden200/VLM_Implementation | Implementing a video language model from scratch | 19 | Experimental |
| 98 | ved1beta/Paligemma | Vision language model | 19 | Experimental |
| 99 | holylovenia/awesome-multimodal-convai | Paper reading list for multimodal conversational AI | 19 | Experimental |
| 100 | MohEsmail143/vizwiz-visual-question-answering | An implementation of the paper "Less is More", which was used to attempt the... | 17 | Experimental |
| 101 | Gurumurthy30/multimodal-gpt2-demo | A lightweight multimodal model combining GPT-2 and Vision Transformer for... | 15 | Experimental |
| 102 | Soumya-Chakraborty/VL-JEPA | VL-JEPA Joint Embedding Predictive Architecture for vision-language... | 15 | Experimental |
| 103 | yuhui-zh15/drml | Official code release for "Diagnosing and Rectifying Vision Models using... | 15 | Experimental |
| 104 | TAU-VAILab/isbertblind | This repository is for the paper "Is BERT Blind? Exploring the Effect of... | 14 | Experimental |
| 105 | jesusp1234/multimodal-benchmarks | 🎯 Benchmark retrieval systems across video, image, audio, and documents with... | 14 | Experimental |
| 106 | anggaumhar/dynamicvl | 🌆 Benchmark multimodal large language models to enhance understanding of... | 14 | Experimental |
| 107 | ipoukoumondi/IWR-Bench | 🌐 Evaluate LVLMs' ability to reconstruct dynamic, interactive webpages from... | 14 | Experimental |
| 108 | darkmax159159357/TypeR-models | ⚠️ DEPRECATED: merged into darkmax159159357/TypeR. See main repo for all... | 13 | Experimental |
| 109 | MichiganNLP/wildqa | WildQA website code | 13 | Experimental |
| 110 | soominmyung/Pairwise_Siamese_transformer | Pairwise Preference Learning with Siamese Transformer Encoders | 13 | Experimental |
| 111 | retkowsky/ViLT | Visual Question Answering with ViLT | 12 | Experimental |