Vision Language Models
Tools and implementations for multimodal AI models that combine vision and language processing for tasks like VQA, image captioning, and visual reasoning. Does NOT include general multimodal fusion, text-to-image generation, or single-modality models.
There are 66 vision language models tracked, 2 of which score above 50 (the established tier). The highest-rated is kyegomez/RT-X at 51/100 with 237 stars.
Get all 66 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=vision-language-models&limit=20"
Open to everyone: 100 requests/day with no key needed. Get a free API key for 1,000 requests/day.
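For scripted access, the same endpoint can be queried from Python. The sketch below is a minimal example built only from the curl command above: it assumes the `limit` query parameter can be raised to cover all 66 projects, and the response schema and any API-key header name are not documented here, so those parts are placeholders rather than a definitive client.

```python
import requests

# Quality dataset endpoint shown in the curl example above.
URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"

params = {
    "domain": "transformers",
    "subcategory": "vision-language-models",
    "limit": 66,  # assumption: limit can be raised beyond 20 to return all tracked projects
}

headers = {}
# headers["X-API-Key"] = "YOUR_KEY"  # placeholder: the real header name for the free key is not documented here

resp = requests.get(URL, params=params, headers=headers, timeout=30)
resp.raise_for_status()
data = resp.json()

# The exact payload shape is not documented here, so just inspect it first.
print(type(data), str(data)[:200])
```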
| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | kyegomez/RT-X | Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open... | 51 | Established |
| 2 | kyegomez/PALI3 | Implementation of PALI3 from the paper PALI-3 VISION LANGUAGE MODELS:... | | Established |
| 3 | chuanyangjin/MMToM-QA | [Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind... | | Emerging |
| 4 | lyuchenyang/Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text... | | Emerging |
| 5 | Muennighoff/vilio | Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle | | Emerging |
| 6 | kyegomez/PALM-E | Implementation of "PaLM-E: An Embodied Multimodal Language Model" | | Emerging |
| 7 | kyegomez/RT-2 | Democratization of RT-2 "RT-2: New model translates vision and language into action" | | Emerging |
| 8 | ahmetkumass/yolo-gen | Train YOLO + VLM with one command. Auto-generate vision-language training... | | Emerging |
| 9 | princeton-nlp/CharXiv | [NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in... | | Emerging |
| 10 | kyegomez/SSM-As-VLM-Bridge | An exploration into leveraging SSM's as Bridge/Adapter Layers for VLM | | Emerging |
| 11 | amazon-science/crossmodal-contrastive-learning | CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video... | | Emerging |
| 12 | kyegomez/qformer | Implementation of Qformer from BLIP2 in Zeta Lego blocks. | | Emerging |
| 13 | kyegomez/MGQA | The open source implementation of the multi grouped query attention by the... | | Emerging |
| 14 | kyegomez/MM1 | PyTorch Implementation of the paper "MM1: Methods, Analysis & Insights from... | | Emerging |
| 15 | SuyogKamble/simpleVLM | building a simple VLM. Implementing LlaMA-SmolLM2 from scratch + SigLip2... | | Emerging |
| 16 | alantess/gtrxl-torch | Gated Transformer Model for Computer Vision | | Emerging |
| 17 | kyegomez/PALI | Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model" | | Emerging |
| 18 | deepmancer/vlm-toolbox | Vision-Language Models Toolbox: Your all-in-one solution for multimodal... | | Emerging |
| 19 | ziqipang/RandAR | [CVPR 2025 (Oral)] Open implementation of "RandAR" | | Emerging |
| 20 | logic-OT/BobVLM | BobVLM - A 1.5B multimodal model built from scratch and pre-trained on a... | | Emerging |
| 21 | YeonwooSung/vision-search | Image search engine | | Emerging |
| 22 | DestroyerDarkNess/fastvlm-webgpu | Real-time video captioning powered by FastVLM | | Emerging |
| 23 | zerovl/ZeroVL | [ECCV2022] Contrastive Vision-Language Pre-training with Limited Resources | | Emerging |
| 24 | ola-krutrim/Chitrarth | Chitrarth: Bridging Vision and Language for a Billion People | | Emerging |
| 25 | kyegomez/MobileVLM | Implementation of the LDP module block in PyTorch and Zeta from the paper:... | | Emerging |
| 26 | HLTCHKUST/VG-GPLMs | The code repository for EMNLP 2021 paper "Vision Guided Generative... | | Emerging |
| 27 | Skyline-9/Visionary-Vids | Multi-modal transformer approach for natural language query based joint... | | Emerging |
| 28 | kyegomez/MMCA | The open source community's implementation of the all-new Multi-Modal Causal... | | Emerging |
| 29 | ViLab-UCSD/LaGTran_ICML2024 | Code and models for the ICML 2024 paper "Tell, Don't Show!: Language... | | Emerging |
| 30 | VectorInstitute/VLDBench | VLDBench: A large-scale benchmark for evaluating Vision-Language Models... | | Emerging |
| 31 | SCZwangxiao/RTQ-MM2023 | ACM Multimedia 2023 (Oral) - RTQ: Rethinking Video-language Understanding... | | Emerging |
| 32 | kyegomez/MMCA-MGQA | Experiments around using Multi-Modal Casual Attention with Multi-Grouped... | | Experimental |
| 33 | eltoto1219/vltk | A toolkit for vision-language processing to support the increasing... | | Experimental |
| 34 | ChartMimic/ChartMimic | [ICLR 2025] ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability... | | Experimental |
| 35 | declare-lab/MM-Align | [EMNLP 2022] This repository contains the official implementation of the... | | Experimental |
| 36 | raminguyen/LLMP2 | Evaluating "Graphical Perception" with Multimodal Large Language Models | | Experimental |
| 37 | krohling/nl-act | Integrating Natural Language Instructions into the Action Chunking... | | Experimental |
| 38 | vonexel/smog | Pytorch implementation of Semantic Motion Generation - 3D-motion synthesis... | | Experimental |
| 39 | kaylode/vqa-transformer | Visual Question Answering using Transformer and Bottom-Up attention.... | | Experimental |
| 40 | kyegomez/MultiModalCrossAttn | The open source implementation of the cross attention mechanism from the... | | Experimental |
| 41 | o-messai/fastVLM | An implementation of FastVLM/LLaVA or any llm/vlm model using FastAPI... | | Experimental |
| 42 | Victorwz/VaLM | VaLM: Visually-augmented Language Modeling. ICLR 2023. | | Experimental |
| 43 | AIDC-AI/Wings | The code repository for "Wings: Learning Multimodal LLMs without Text-only... | | Experimental |
| 44 | baohuyvanba/Vision-Zephyr | Vision-Zephyr: a multimodal LLM for Visual Commonsense Reasoning - CLIP-ViT +... | | Experimental |
| 45 | shreydan/VLM-OD | experimental: finetune smolVLM on COCO (without any special... | | Experimental |
| 46 | TheMasterOfDisasters/SmolVLM | SmolVLM WebUI & API - Easy-to-Run Vision-Language Model | | Experimental |
| 47 | wklee610/VLM-Model-fastapi | A reusable FastAPI module for serving and integrating Vision-Language Models (VLM) | | Experimental |
| 48 | zalkklop/LVSM | Official code for "LVSM: A Large View Synthesis Model with Minimal 3D... | | Experimental |
| 49 | rahuldevmuraleedharan/Neural-Navigator | Multi-modal Transformer that fuses vision and language to generate... | | Experimental |
| 50 | MaxLSB/mini-paligemma2 | Minimalist implementation of PaliGemma 2 & PaliGemma VLM from scratch | | Experimental |
| 51 | michelecafagna26/VinVL | Original VinVL (and Oscar) repo with API designed for an easy inference | | Experimental |
| 52 | PRITHIVSAKTHIUR/Doc-VLMs-exp | An experimental document-focused Vision-Language Model application that... | | Experimental |
| 53 | tristandb8/PyTorch-PaliGemma-2 | PyTorch implementation of PaliGemma 2 | | Experimental |
| 54 | XavierSpycy/CAT-ImageTextIntegrator | An innovative deep learning framework leveraging the CAT (Convolutions,... | | Experimental |
| 55 | telota/imagines-nummorum-vlm-data-extraction | A computer vision system for automated analysis of index cards from a... | | Experimental |
| 56 | lyuchenyang/Efficient-VideoQA | Code for ACL SustaiNLP 2023 paper "Is a Video worth n × n Images? A Highly... | | Experimental |
| 57 | Soheil-jafari/Language-Guided-Endoscopy-Localization | Open-vocabulary temporal localization in endoscopic video with... | | Experimental |
| 58 | orshkuri/vqa-qformer-comparison | A benchmark and analysis of QFormer, Cross Attention, and Concat models for... | | Experimental |
| 59 | ab3llini/Transformer-VQA | Transformer-based VQA system capable of generating unconstrained, open-ended... | | Experimental |
| 60 | E1ims/math-vlm-finetune-pipeline | Transcribe handwritten math into accurate LaTeX using a modular... | | Experimental |
| 61 | buhsnn/Vision-Language-Model | Vision-language model combining a ResNet18 vision encoder with a GPT-2... | | Experimental |
| 62 | shreydan/simpleVLM | building a simple VLM. Implementing LlaMA-SmolLM2 from scratch + SigLip2... | | Experimental |
| 63 | praveena2j/LAVViT | "ICASSP 2025": Latent Audio-Visual Vision Transformers for Speaker Verification | | Experimental |
| 64 | AbdulDD/UnifiedVQA | The repository host codes, link to datasets and models for our research... | | Experimental |
| 65 | tejas-54/Visual-Search-Engine-Using-VLM | Visual Search Engine using VLM (Vision-Language Model) A... | | Experimental |
| 66 | ycchen218/VisionQA-Llama2-OWLViT | This is a multimodal model design for the Vision Question Answering (VQA)... | | Experimental |