Multimodal Vision-Language Transformer Models

This category tracks 110 multimodal vision-language models. Seven score 50 or higher, placing them in the Established tier; the rest fall into Emerging (30-49) or Experimental (29 and below). The highest-rated is KimMeen/Time-LLM at 56/100, with 2,563 stars.

Get all 110 projects as JSON:

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=multimodal-vision-language&limit=110"

Open to everyone: 100 requests/day with no key needed, or get a free key for 1,000 requests/day.
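
If you prefer to consume the endpoint programmatically, here is a minimal Python sketch that makes the same request and prints each project's score, tier, and name. The response field names used below (`projects`, `name`, `score`, `tier`) are assumptions about the JSON shape, not documented fields; inspect the raw payload before relying on them.

```python
# Minimal sketch of consuming the quality endpoint on the keyless tier.
# NOTE: the response field names below ("projects", "name", "score",
# "tier") are assumptions -- check the actual JSON shape first.
import requests

API_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"

params = {
    "domain": "transformers",
    "subcategory": "multimodal-vision-language",
    "limit": 110,  # fetch the full list in one call
}

resp = requests.get(API_URL, params=params, timeout=30)
resp.raise_for_status()  # fail loudly on rate limiting or server errors
data = resp.json()

# Assumed shape: {"projects": [{"name": ..., "score": ..., "tier": ...}, ...]}
for project in data.get("projects", []):
    print(f"{project['score']:>3}  {project['tier']:<12}  {project['name']}")
```

From there, filtering locally (for example, keeping only Established projects) is a one-line list comprehension.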

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | KimMeen/Time-LLM | [ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting... | 56 | Established |
| 2 | om-ai-lab/VLM-R1 | Solve Visual Understanding with Reinforced VLMs | 54 | Established |
| 3 | bytedance/SALMONN | SALMONN family: A suite of advanced multi-modal LLMs | 54 | Established |
| 4 | NVlabs/OmniVinci | OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and... | 51 | Established |
| 5 | fixie-ai/ultravox | A fast multimodal LLM for real-time voice | 51 | Established |
| 6 | bytedance/video-SALMONN-2 | video-SALMONN 2 is a powerful audio-visual large language model (LLM) that... | 50 | Established |
| 7 | cruiseresearchgroup/SensorLLM | [EMNLP 2025] Official implementation of "SensorLLM: Aligning Large Language... | 50 | Established |
| 8 | deepseek-ai/Janus | Janus-Series: Unified Multimodal Understanding and Generation Models | 47 | Emerging |
| 9 | showlab/Show-o | [ICLR & NeurIPS 2025] Repository for the Show-o series, One Single Transformer... | 47 | Emerging |
| 10 | ictnlp/LLaMA-Omni | LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction... | 47 | Emerging |
| 11 | THU-SI/Spatial-MLLM | [NeurIPS 2025] Official implementation of Spatial-MLLM: Boosting MLLM... | 46 | Emerging |
| 12 | deepglint/unicom | Large-Scale Visual Representation Model | 45 | Emerging |
| 13 | JAMESYJL/ShapeLLM-Omni | [NeurIPS 2025 Spotlight] A Native Multimodal LLM for 3D Generation and Understanding | 44 | Emerging |
| 14 | InternLM/CapRL | [ICLR 2026] An official implementation of "CapRL: Stimulating Dense Image... | 43 | Emerging |
| 15 | nv-tlabs/LLaMA-Mesh | Unifying 3D Mesh Generation with Language Models | 42 | Emerging |
| 16 | tosiyuki/LLaVA-JP | LLaVA-JP is a Japanese VLM trained with the LLaVA method | 42 | Emerging |
| 17 | jshilong/GPT4RoI | (ECCVW 2025) GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | 41 | Emerging |
| 18 | mlvlab/Flipped-VQA | Large Language Models are Temporal and Causal Reasoners for Video Question... | 41 | Emerging |
| 19 | antoyang/FrozenBiLM | [NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional... | 41 | Emerging |
| 20 | kohjingyu/gill | 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with... | 41 | Emerging |
| 21 | OpenGVLab/VisionLLM | VisionLLM Series | 41 | Emerging |
| 22 | kohjingyu/fromage | 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to... | 41 | Emerging |
| 23 | VITA-MLLM/Freeze-Omni | ✨✨ Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with... | 41 | Emerging |
| 24 | MIV-XJTU/JanusVLN | [ICLR 2026] Official implementation for "JanusVLN: Decoupling Semantics and... | 41 | Emerging |
| 25 | TIGER-AI-Lab/QuickVideo | Quick Long Video Understanding [TMLR 2025] | 41 | Emerging |
| 26 | VPGTrans/VPGTrans | Code for VPGTrans: Transfer Visual Prompt Generator across LLMs. VL-LLaMA,... | 40 | Emerging |
| 27 | FoundationVision/UniTok | [NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding | 40 | Emerging |
| 28 | Fsoft-AIC/Grasp-Anything | Dataset and code for the ICRA 2024 paper "Grasp-Anything: Large-scale Grasp... | 40 | Emerging |
| 29 | boheumd/MA-LMM | [CVPR 2024] MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term... | 40 | Emerging |
| 30 | TIGER-AI-Lab/Vamba | Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid... | 39 | Emerging |
| 31 | qizekun/ShapeLLM | [ECCV 2024] ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | 39 | Emerging |
| 32 | baaivision/EVE | EVE Series: Encoder-Free Vision-Language Models from BAAI | 38 | Emerging |
| 33 | sshh12/multi_token | Embed arbitrary modalities (images, audio, documents, etc.) into large... | 38 | Emerging |
| 34 | iflytek/VLE | VLE: Vision-Language Encoder (a vision-language multimodal pre-trained model) | 38 | Emerging |
| 35 | JinhaoLee/WCA | [ICML 2024] Visual-Text Cross Alignment: Refining the Similarity Score in... | 38 | Emerging |
| 36 | InnovatorLM/Innovator-VL | Fully Open-source Multimodal Language Models for Science Discovery | 38 | Emerging |
| 37 | JosefAlbers/VL-JEPA | VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) in MLX | 38 | Emerging |
| 38 | ximinng/LLM4SVG | [CVPR 2025] Official implementation for "Empowering LLMs to Understand and... | 37 | Emerging |
| 39 | fangyuan-ksgk/Mini-LLaVA | A minimal implementation of a LLaVA-style VLM with interleaved image & text &... | 37 | Emerging |
| 40 | zd11024/NaviLLM | [CVPR 2024] Code for the paper "Towards Learning a Generalist Model for... | 37 | Emerging |
| 41 | joslefaure/HERMES | [ICCV'25] HERMES: temporal-coHERent long-forM understanding with Episodes... | 37 | Emerging |
| 42 | SALT-NLP/LLaVAR | Code/data for the paper "LLaVAR: Enhanced Visual Instruction Tuning for... | 37 | Emerging |
| 43 | MME-Benchmarks/Video-MME | ✨✨ [CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark... | 36 | Emerging |
| 44 | vbdi/divprune | [CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large... | 35 | Emerging |
| 45 | Tanveer81/ReVisionLLM | Official implementation of ReVisionLLM: Recursive... | 35 | Emerging |
| 46 | umbertocappellazzo/Llama-AVSR | Official PyTorch implementation of "Large Language Models are Strong... | 35 | Emerging |
| 47 | ziqipang/LM4VisualEncoding | [ICLR 2024 Spotlight] "Frozen Transformers in Language Models are... | 35 | Emerging |
| 48 | Wangbiao2/R1-Track | R1-Track: Direct Application of MLLMs to Visual Object Tracking via... | 35 | Emerging |
| 49 | ExplainableML/Vision_by_Language | [ICLR 2024] Official repository for "Vision-by-Language for Training-Free... | 35 | Emerging |
| 50 | ExplainableML/WaffleCLIP | Official repository for the ICCV 2023 paper "Waffling around for... | 35 | Emerging |
| 51 | TencentARC/ST-LLM | [ECCV 2024 🔥] Official implementation of the paper "ST-LLM: Large Language... | 34 | Emerging |
| 52 | Hon-Wong/VoRA | [Fully open] [Encoder-free MLLM] Vision as LoRA | 34 | Emerging |
| 53 | kkahatapitiya/LangRepo | Code for our ACL 2025 paper "Language Repository for Long Video Understanding" | 34 | Emerging |
| 54 | xinyanghuang7/Basic-Visual-Language-Model | Build a simple, basic multimodal large model from scratch 🤖 | 33 | Emerging |
| 55 | haesleinhuepf/vlm-pictionary | Play Pictionary with vision-language models! | 33 | Emerging |
| 56 | yuecao0119/MMFuser | Official implementation of the paper "MMFuser: Multimodal Multi-Layer... | 33 | Emerging |
| 57 | Wang-ML-Lab/multimodal-needle-in-a-haystack | [NAACL 2025 Oral] Multimodal Needle in a Haystack (MMNeedle): Benchmarking... | 33 | Emerging |
| 58 | YunzeMan/Lexicon3D | [NeurIPS 2024] Lexicon3D: Probing Visual Foundation Models for Complex 3D... | 33 | Emerging |
| 59 | peacelwh/VT-FSL | [NeurIPS 2025] VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning | 32 | Emerging |
| 60 | Flagro/OmniModKit | Multimodal LLM toolkit | 32 | Emerging |
| 61 | AntonGuan/TimeOmni-1 | [ICLR 2026] Official implementation of "🦙 TimeOmni-1: Incentivizing Complex... | 32 | Emerging |
| 62 | baldoarbol/BodyShapeGPT | Fine-tuned LLMs generate accurate 3D human avatars from textual descriptions... | 31 | Emerging |
| 63 | tenghuilee/ScalingCapFusedVisionLM | Relating token count to the performance of a vision-language model | 31 | Emerging |
| 64 | ParadoxZW/LLaVA-UHD-Better | A bug-free and improved implementation of LLaVA-UHD, based on the code from... | 31 | Emerging |
| 65 | mbzuai-oryx/Video-LLaVA | PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models | 30 | Emerging |
| 66 | HYUNJS/STTM | [ICCV 2025] Multi-Granular Spatio-Temporal Token Merging for Training-Free... | 30 | Emerging |
| 67 | Jacksonlark/open-mllms | Open LLMs for multimodal tasks | 30 | Emerging |
| 68 | WisconsinAIVision/YoLLaVA | 🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant (NeurIPS 2024) | 29 | Experimental |
| 69 | Victorwz/MLM_Filter | Official implementation of our paper "Finetuned Multimodal Language Models... | 29 | Experimental |
| 70 | cokeshao/HoliTom | [NeurIPS 2025] HoliTom: Holistic Token Merging for Fast Video Large Language Models | 29 | Experimental |
| 71 | agentic-learning-ai-lab/lifelong-memory | Code for LifelongMemory: Leveraging LLMs for Answering Queries in Long-form... | 29 | Experimental |
| 72 | zengqunzhao/Exp-CLIP | [WACV'25 Oral] Enhancing Zero-Shot Facial Expression Recognition by LLM... | 29 | Experimental |
| 73 | 2toinf/IVM | [NeurIPS 2024] The official implementation of "Instruction-Guided Visual Masking" | 29 | Experimental |
| 74 | astra-vision/LatteCLIP | [WACV 2025] LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts | 28 | Experimental |
| 75 | UCSC-VLAA/Sight-Beyond-Text | [TMLR 2024] Official implementation of "Sight Beyond Text: Multi-Modal... | 27 | Experimental |
| 76 | lizhaoliu-Lec/CG-VLM | Official repo for Contrastive Vision-Language Alignment Makes... | 27 | Experimental |
| 77 | SlytherinGe/RSTeller | Vision-Language Dataset for Remote Sensing | 27 | Experimental |
| 78 | fatemehpesaran310/Text2Chart31 | Official PyTorch implementation of "Text2Chart31: Instruction Tuning for... | 26 | Experimental |
| 79 | kyegomez/AudioFlamingo | Implementation of the model "AudioFlamingo" from the paper "Audio Flamingo:... | 26 | Experimental |
| 80 | ProGamerGov/VLM-Captioning-Tools | Python scripts for captioning images with VLMs | 26 | Experimental |
| 81 | MYMY-young/DelimScaling | [ICLR 2026] Official implementation of "Enhancing Multi-Image Understanding... | 26 | Experimental |
| 82 | hpfield/Text2Touch | [CoRL 2025] Tactile In-Hand Manipulation with LLM-Designed Reward Functions | 25 | Experimental |
| 83 | smsnobin77/Awesome-Multimodal-Unlearning | A survey of multimodal unlearning across vision,... | 24 | Experimental |
| 84 | Blinorot/ALARM | Official implementation of "ALARM: Audio-Language Alignment for Reasoning Models" | 24 | Experimental |
| 85 | InternRobotics/Grounded_3D-LLM | Code & data for Grounded 3D-LLM with Referent Tokens | 24 | Experimental |
| 86 | showlab/VisInContext | Official implementation of Leveraging Visual Tokens for Extended Text... | 24 | Experimental |
| 87 | paxnea/LLM-multimodal-nudging | Zero-Shot Learning for Multimodal Nudging | 23 | Experimental |
| 88 | Letian2003/MM_INF | An efficient multi-modal instruction-following data synthesis tool and the... | 23 | Experimental |
| 89 | InternLM/Visual-ERM | Official implementation of "Visual-ERM: Reward Modeling for Visual Equivalence" | 23 | Experimental |
| 90 | ChenDelong1999/polite-flamingo | 🦩 Official repository of the paper "Visual Instruction Tuning with Polite... | 22 | Experimental |
| 91 | termehtaheri/SAR-LM | Official implementation of "SAR-LM: Symbolic Audio Reasoning with Large... | 22 | Experimental |
| 92 | MariyamSiddiqui/Zero-shot-image-to-text-generation-with-BLIP-2 | Zero-shot image-to-text generation using Salesforce's BLIP-2 model... | 21 | Experimental |
| 93 | yueying-teng/generate-language-image-instruction-following-data | Mistral-assisted visual instruction data generation following LLaVA | 21 | Experimental |
| 94 | yophis/partial-yarn | Partial YaRN and VLAT: techniques for efficiently extending audio context of... | 21 | Experimental |
| 95 | zhudotexe/kani-vision | Kani extension for supporting vision-language models (VLMs). Comes with... | 20 | Experimental |
| 96 | Traffic-Alpha/VLMLight | Official implementation of VLMLight | 20 | Experimental |
| 97 | bagh2178/GC-VLN | [CoRL 2025] GC-VLN: Instruction as Graph Constraints for Training-free... | 20 | Experimental |
| 98 | claws-lab/projection-in-MLLMs | Code and data for the ACL 2024 paper "Cross-Modal Projection in Multimodal... | 19 | Experimental |
| 99 | Jshulgach/Grounded-SAM-2-Stream | Track anything in streaming video with Grounding DINO, SAM 2, and an LLM | 19 | Experimental |
| 100 | OpenM3D/M3DBench | [ECCV 2024] M3DBench introduces a comprehensive 3D instruction-following... | 19 | Experimental |
| 101 | ai4ce/LLM4VPR | Can multimodal LLMs help visual place recognition? | 19 | Experimental |
| 102 | nkkbr/ViCA | Official implementation of ViCA2 (Visuospatial Cognitive... | 18 | Experimental |
| 103 | scb-10x/partial-yarn | Partial YaRN and VLAT: techniques for efficiently extending audio context of... | 18 | Experimental |
| 104 | KDEGroup/MMICT | Source code for the TOMM'24 paper "MMICT: Boosting Multi-Modal Fine-Tuning with... | 17 | Experimental |
| 105 | egeozsoy/ORacle | Official code for the paper "ORacle: Large Vision-Language Models for... | 14 | Experimental |
| 106 | ikun-llm/ikun-V | Multimodal Vision-Language Model 👁️ | 14 | Experimental |
| 107 | M3-IT/YING-VLM | Vision large language models trained on the M3IT instruction-tuning dataset | 14 | Experimental |
| 108 | claws-lab/MMSoc | We introduce MM-Soc, a comprehensive benchmark designed to evaluate MLLMs'... | 12 | Experimental |
| 109 | ExplainableML/ZS-A2T | [GCPR 2023] Zero-shot Translation of Attention Patterns in VQA Models to... | 11 | Experimental |
| 110 | AmirMansurian/NoConceptLeftBehind | [ICASSP'26] No Concept Left Behind: Test-Time Optimization for Compositional... | 11 | Experimental |
