Multimodal Vision-Language Transformer Models

This category tracks 110 multimodal vision-language models. Seven score 50 or higher, placing them in the Established tier; the rest fall into Emerging (30-49) or Experimental (29 and below). The highest-rated is KimMeen/Time-LLM at 56/100, with 2,563 stars.

Get all 110 projects as JSON:

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=multimodal-vision-language&limit=110"

Open to everyone: 100 requests/day with no key needed, or get a free key for 1,000 requests/day.
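
If you prefer to consume the endpoint programmatically, here is a minimal Python sketch that makes the same request and prints each project's score, tier, and name. The response field names used below (`projects`, `name`, `score`, `tier`) are assumptions about the JSON shape, not documented fields; inspect the raw payload before relying on them.

```python
# Minimal sketch of consuming the quality endpoint on the keyless tier.
# NOTE: the response field names below ("projects", "name", "score",
# "tier") are assumptions -- check the actual JSON shape first.
import requests

API_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"

params = {
    "domain": "transformers",
    "subcategory": "multimodal-vision-language",
    "limit": 110,  # fetch the full list in one call
}

resp = requests.get(API_URL, params=params, timeout=30)
resp.raise_for_status()  # fail loudly on rate limiting or server errors
data = resp.json()

# Assumed shape: {"projects": [{"name": ..., "score": ..., "tier": ...}, ...]}
for project in data.get("projects", []):
    print(f"{project['score']:>3}  {project['tier']:<12}  {project['name']}")
```

From there, filtering locally (for example, keeping only Established projects) is a one-line list comprehension.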

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | KimMeen/Time-LLM | [ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting... | 56 | Established |
| 2 | om-ai-lab/VLM-R1 | Solve Visual Understanding with Reinforced VLMs | 54 | Established |
| 3 | bytedance/SALMONN | SALMONN family: A suite of advanced multi-modal LLMs | 54 | Established |
| 4 | NVlabs/OmniVinci | OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and... | 51 | Established |
| 5 | fixie-ai/ultravox | A fast multimodal LLM for real-time voice | 51 | Established |
| 6 | bytedance/video-SALMONN-2 | video-SALMONN 2 is a powerful audio-visual large language model (LLM) that... | 50 | Established |
| 7 | cruiseresearchgroup/SensorLLM | [EMNLP 2025] Official implementation of "SensorLLM: Aligning Large Language... | 50 | Established |
| 8 | deepseek-ai/Janus | Janus-Series: Unified Multimodal Understanding and Generation Models | 47 | Emerging |
| 9 | showlab/Show-o | [ICLR & NeurIPS 2025] Repository for the Show-o series, One Single Transformer... | 47 | Emerging |
| 10 | ictnlp/LLaMA-Omni | LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction... | 47 | Emerging |
| 11 | THU-SI/Spatial-MLLM | [NeurIPS 2025] Official implementation of Spatial-MLLM: Boosting MLLM... | 46 | Emerging |
| 12 | deepglint/unicom | Large-Scale Visual Representation Model | 45 | Emerging |
| 13 | JAMESYJL/ShapeLLM-Omni | [NeurIPS 2025 Spotlight] A Native Multimodal LLM for 3D Generation and Understanding | 44 | Emerging |
| 14 | InternLM/CapRL | [ICLR 2026] An official implementation of "CapRL: Stimulating Dense Image... | 43 | Emerging |
| 15 | nv-tlabs/LLaMA-Mesh | Unifying 3D Mesh Generation with Language Models | 42 | Emerging |
| 16 | tosiyuki/LLaVA-JP | LLaVA-JP is a Japanese VLM trained with the LLaVA method | 42 | Emerging |
| 17 | jshilong/GPT4RoI | (ECCVW 2025) GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | 41 | Emerging |
| 18 | mlvlab/Flipped-VQA | Large Language Models are Temporal and Causal Reasoners for Video Question... | 41 | Emerging |
| 19 | antoyang/FrozenBiLM | [NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional... | 41 | Emerging |
| 20 | kohjingyu/gill | 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with... | 41 | Emerging |
| 21 | OpenGVLab/VisionLLM | VisionLLM Series | 41 | Emerging |
| 22 | kohjingyu/fromage | 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to... | 41 | Emerging |
| 23 | VITA-MLLM/Freeze-Omni | ✨✨ Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with... | 41 | Emerging |
| 24 | MIV-XJTU/JanusVLN | [ICLR 2026] Official implementation for "JanusVLN: Decoupling Semantics and... | 41 | Emerging |
| 25 | TIGER-AI-Lab/QuickVideo | Quick Long Video Understanding [TMLR 2025] | 41 | Emerging |
| 26 | VPGTrans/VPGTrans | Code for VPGTrans: Transfer Visual Prompt Generator across LLMs. VL-LLaMA,... | 40 | Emerging |
| 27 | FoundationVision/UniTok | [NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding | 40 | Emerging |
| 28 | Fsoft-AIC/Grasp-Anything | Dataset and code for the ICRA 2024 paper "Grasp-Anything: Large-scale Grasp... | 40 | Emerging |
| 29 | boheumd/MA-LMM | [CVPR 2024] MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term... | 40 | Emerging |
| 30 | TIGER-AI-Lab/Vamba | Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid... | 39 | Emerging |
| 31 | qizekun/ShapeLLM | [ECCV 2024] ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | 39 | Emerging |
| 32 | baaivision/EVE | EVE Series: Encoder-Free Vision-Language Models from BAAI | 38 | Emerging |
| 33 | sshh12/multi_token | Embed arbitrary modalities (images, audio, documents, etc.) into large... | 38 | Emerging |
| 34 | iflytek/VLE | VLE: Vision-Language Encoder (a vision-language multimodal pre-trained model) | 38 | Emerging |
| 35 | JinhaoLee/WCA | [ICML 2024] Visual-Text Cross Alignment: Refining the Similarity Score in... | 38 | Emerging |
| 36 | InnovatorLM/Innovator-VL | Fully Open-source Multimodal Language Models for Science Discovery | 38 | Emerging |
| 37 | JosefAlbers/VL-JEPA | VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) in MLX | 38 | Emerging |
| 38 | ximinng/LLM4SVG | [CVPR 2025] Official implementation for "Empowering LLMs to Understand and... | 37 | Emerging |
| 39 | fangyuan-ksgk/Mini-LLaVA | A minimal implementation of a LLaVA-style VLM with interleaved image & text &... | 37 | Emerging |
| 40 | zd11024/NaviLLM | [CVPR 2024] Code for the paper "Towards Learning a Generalist Model for... | 37 | Emerging |
| 41 | joslefaure/HERMES | [ICCV'25] HERMES: temporal-coHERent long-forM understanding with Episodes... | 37 | Emerging |
| 42 | SALT-NLP/LLaVAR | Code/data for the paper "LLaVAR: Enhanced Visual Instruction Tuning for... | 37 | Emerging |
| 43 | MME-Benchmarks/Video-MME | ✨✨ [CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark... | 36 | Emerging |
| 44 | vbdi/divprune | [CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large... | 35 | Emerging |
| 45 | Tanveer81/ReVisionLLM | Official implementation of ReVisionLLM: Recursive... | 35 | Emerging |
| 46 | umbertocappellazzo/Llama-AVSR | Official PyTorch implementation of "Large Language Models are Strong... | 35 | Emerging |
| 47 | ziqipang/LM4VisualEncoding | [ICLR 2024 Spotlight] "Frozen Transformers in Language Models are... | 35 | Emerging |
| 48 | Wangbiao2/R1-Track | R1-Track: Direct Application of MLLMs to Visual Object Tracking via... | 35 | Emerging |
| 49 | ExplainableML/Vision_by_Language | [ICLR 2024] Official repository for "Vision-by-Language for Training-Free... | 35 | Emerging |
| 50 | ExplainableML/WaffleCLIP | Official repository for the ICCV 2023 paper "Waffling around for... | 35 | Emerging |
| 51 | TencentARC/ST-LLM | [ECCV 2024 🔥] Official implementation of the paper "ST-LLM: Large Language... | 34 | Emerging |
| 52 | Hon-Wong/VoRA | [Fully open] [Encoder-free MLLM] Vision as LoRA | 34 | Emerging |
| 53 | kkahatapitiya/LangRepo | Code for our ACL 2025 paper "Language Repository for Long Video Understanding" | 34 | Emerging |
| 54 | xinyanghuang7/Basic-Visual-Language-Model | Build a simple, basic multimodal large model from scratch 🤖 | 33 | Emerging |
| 55 | haesleinhuepf/vlm-pictionary | Play Pictionary with vision-language models! | 33 | Emerging |
| 56 | yuecao0119/MMFuser | Official implementation of the paper "MMFuser: Multimodal Multi-Layer... | 33 | Emerging |
| 57 | Wang-ML-Lab/multimodal-needle-in-a-haystack | [NAACL 2025 Oral] Multimodal Needle in a Haystack (MMNeedle): Benchmarking... | 33 | Emerging |
| 58 | YunzeMan/Lexicon3D | [NeurIPS 2024] Lexicon3D: Probing Visual Foundation Models for Complex 3D... | 33 | Emerging |
| 59 | peacelwh/VT-FSL | [NeurIPS 2025] VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning | 32 | Emerging |
| 60 | Flagro/OmniModKit | Multimodal LLM toolkit | 32 | Emerging |
| 61 | AntonGuan/TimeOmni-1 | [ICLR 2026] Official implementation of "🦙 TimeOmni-1: Incentivizing Complex... | 32 | Emerging |
| 62 | baldoarbol/BodyShapeGPT | Fine-tuned LLMs generate accurate 3D human avatars from textual descriptions... | 31 | Emerging |
| 63 | tenghuilee/ScalingCapFusedVisionLM | Relating token count to the performance of a vision-language model | 31 | Emerging |
| 64 | ParadoxZW/LLaVA-UHD-Better | A bug-free and improved implementation of LLaVA-UHD, based on the code from... | 31 | Emerging |
| 65 | mbzuai-oryx/Video-LLaVA | PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models | 30 | Emerging |
| 66 | HYUNJS/STTM | [ICCV 2025] Multi-Granular Spatio-Temporal Token Merging for Training-Free... | 30 | Emerging |
| 67 | Jacksonlark/open-mllms | Open LLMs for multimodal tasks | 30 | Emerging |
| 68 | WisconsinAIVision/YoLLaVA | 🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant (NeurIPS 2024) | 29 | Experimental |
| 69 | Victorwz/MLM_Filter | Official implementation of our paper "Finetuned Multimodal Language Models... | 29 | Experimental |
| 70 | cokeshao/HoliTom | [NeurIPS 2025] HoliTom: Holistic Token Merging for Fast Video Large Language Models | 29 | Experimental |
| 71 | agentic-learning-ai-lab/lifelong-memory | Code for LifelongMemory: Leveraging LLMs for Answering Queries in Long-form... | 29 | Experimental |
| 72 | zengqunzhao/Exp-CLIP | [WACV'25 Oral] Enhancing Zero-Shot Facial Expression Recognition by LLM... | 29 | Experimental |
| 73 | 2toinf/IVM | [NeurIPS 2024] The official implementation of "Instruction-Guided Visual Masking" | 29 | Experimental |
| 74 | astra-vision/LatteCLIP | [WACV 2025] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts | 28 | Experimental |
| 75 | UCSC-VLAA/Sight-Beyond-Text | [TMLR 2024] Official implementation of "Sight Beyond Text: Multi-Modal... | 27 | Experimental |
| 76 | lizhaoliu-Lec/CG-VLM | Official repo for Contrastive Vision-Language Alignment Makes... | 27 | Experimental |
| 77 | SlytherinGe/RSTeller | Vision-Language Dataset for Remote Sensing | 27 | Experimental |
| 78 | fatemehpesaran310/Text2Chart31 | Official PyTorch implementation of "Text2Chart31: Instruction Tuning for... | 26 | Experimental |
| 79 | kyegomez/AudioFlamingo | Implementation of the model "AudioFlamingo" from the paper "Audio Flamingo:... | 26 | Experimental |
| 80 | ProGamerGov/VLM-Captioning-Tools | Python scripts for captioning images with VLMs | 26 | Experimental |
| 81 | MYMY-young/DelimScaling | [ICLR 2026] Official implementation of "Enhancing Multi-Image Understanding... | 26 | Experimental |
| 82 | hpfield/Text2Touch | [CoRL 2025] Tactile In-Hand Manipulation with LLM-Designed Reward Functions | 25 | Experimental |
| 83 | smsnobin77/Awesome-Multimodal-Unlearning | A survey of multimodal unlearning across vision,... | 24 | Experimental |
| 84 | Blinorot/ALARM | Official implementation of "ALARM: Audio-Language Alignment for Reasoning Models" | 24 | Experimental |
| 85 | InternRobotics/Grounded_3D-LLM | Code & data for Grounded 3D-LLM with Referent Tokens | 24 | Experimental |
| 86 | showlab/VisInContext | Official implementation of Leveraging Visual Tokens for Extended Text... | 24 | Experimental |
| 87 | paxnea/LLM-multimodal-nudging | Zero-Shot Learning for Multimodal Nudging | 23 | Experimental |
| 88 | Letian2003/MM_INF | An efficient multi-modal instruction-following data synthesis tool and the... | 23 | Experimental |
| 89 | InternLM/Visual-ERM | Official implementation of "Visual-ERM: Reward Modeling for Visual Equivalence" | 23 | Experimental |
| 90 | ChenDelong1999/polite-flamingo | 🦩 Official repository of the paper "Visual Instruction Tuning with Polite... | 22 | Experimental |
| 91 | termehtaheri/SAR-LM | Official implementation of "SAR-LM: Symbolic Audio Reasoning with Large... | 22 | Experimental |
| 92 | MariyamSiddiqui/Zero-shot-image-to-text-generation-with-BLIP-2 | Zero-shot image-to-text generation using Salesforce's BLIP-2 model... | 21 | Experimental |
| 93 | yueying-teng/generate-language-image-instruction-following-data | Mistral-assisted visual instruction data generation following LLaVA | 21 | Experimental |
| 94 | yophis/partial-yarn | Partial YaRN and VLAT: techniques for efficiently extending audio context of... | 21 | Experimental |
| 95 | zhudotexe/kani-vision | Kani extension for supporting vision-language models (VLMs). Comes with... | 20 | Experimental |
| 96 | Traffic-Alpha/VLMLight | Official implementation of VLMLight | 20 | Experimental |
| 97 | bagh2178/GC-VLN | [CoRL 2025] GC-VLN: Instruction as Graph Constraints for Training-free... | 20 | Experimental |
| 98 | claws-lab/projection-in-MLLMs | Code and data for the ACL 2024 paper "Cross-Modal Projection in Multimodal... | 19 | Experimental |
| 99 | Jshulgach/Grounded-SAM-2-Stream | Track anything in streaming video with Grounding DINO, SAM 2, and an LLM | 19 | Experimental |
| 100 | OpenM3D/M3DBench | [ECCV 2024] M3DBench introduces a comprehensive 3D instruction-following... | 19 | Experimental |
| 101 | ai4ce/LLM4VPR | Can multimodal LLMs help visual place recognition? | 19 | Experimental |
| 102 | nkkbr/ViCA | Official implementation of ViCA2 (Visuospatial Cognitive... | 18 | Experimental |
| 103 | scb-10x/partial-yarn | Partial YaRN and VLAT: techniques for efficiently extending audio context of... | 18 | Experimental |
| 104 | KDEGroup/MMICT | Source code for the TOMM'24 paper "MMICT: Boosting Multi-Modal Fine-Tuning with... | 17 | Experimental |
| 105 | egeozsoy/ORacle | Official code for the paper "ORacle: Large Vision-Language Models for... | 14 | Experimental |
| 106 | ikun-llm/ikun-V | Multimodal Vision-Language Model 👁️ | 14 | Experimental |
| 107 | M3-IT/YING-VLM | Vision large language models trained on the M3IT instruction-tuning dataset | 14 | Experimental |
| 108 | claws-lab/MMSoc | We introduce MM-Soc, a comprehensive benchmark designed to evaluate MLLMs'... | 12 | Experimental |
| 109 | ExplainableML/ZS-A2T | [GCPR 2023] Zero-shot Translation of Attention Patterns in VQA Models to... | 11 | Experimental |
| 110 | AmirMansurian/NoConceptLeftBehind | [ICASSP'26] No Concept Left Behind: Test-Time Optimization for Compositional... | 11 | Experimental |
