Vision Language Models

Tools and implementations for multimodal AI models that combine vision and language processing for tasks like VQA, image captioning, and visual reasoning. Does NOT include general multimodal fusion, text-to-image generation, or single-modality models.

This category tracks 66 vision-language models; two score above 50 (the Established tier). The highest-rated is kyegomez/RT-X at 51/100 with 237 stars.

Get all 66 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=vision-language-models&limit=66"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
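
If you would rather query the endpoint from Python than from curl, here is a minimal sketch mirroring the request above. Treat it as a sketch: the response field names (projects, name, score, tier) and the X-API-Key header are assumptions about an undocumented payload, so inspect the actual response before relying on them.

import requests

# Same endpoint and query parameters as the curl example above.
URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"
params = {
    "domain": "transformers",
    "subcategory": "vision-language-models",
    "limit": 66,  # one page covering every tracked project
}

# ASSUMPTION: the header name below is illustrative only.
# Unauthenticated requests work up to 100/day, so headers can stay empty.
headers = {}  # e.g. {"X-API-Key": "<your-free-key>"} for 1,000/day

resp = requests.get(URL, params=params, headers=headers, timeout=30)
resp.raise_for_status()
data = resp.json()

# ASSUMPTION: the field names below are guesses at the schema.
# Print the raw payload first to confirm what the API actually returns.
for project in data.get("projects", []):
    print(project.get("name"), project.get("score"), project.get("tier"))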

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | kyegomez/RT-X | Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open... | 51 | Established |
| 2 | kyegomez/PALI3 | Implementation of PALI3 from the paper PALI-3 VISION LANGUAGE MODELS:... | 51 | Established |
| 3 | chuanyangjin/MMToM-QA | [🏆 Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind... | 47 | Emerging |
| 4 | lyuchenyang/Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text... | 45 | Emerging |
| 5 | Muennighoff/vilio | 🄶 Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle | 45 | Emerging |
| 6 | kyegomez/PALM-E | Implementation of "PaLM-E: An Embodied Multimodal Language Model" | 45 | Emerging |
| 7 | kyegomez/RT-2 | Democratization of RT-2 "RT-2: New model translates vision and language into action" | 45 | Emerging |
| 8 | ahmetkumass/yolo-gen | Train YOLO + VLM with one command. Auto-generate vision-language training... | 42 | Emerging |
| 9 | princeton-nlp/CharXiv | [NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in... | 41 | Emerging |
| 10 | kyegomez/SSM-As-VLM-Bridge | An exploration into leveraging SSMs as Bridge/Adapter Layers for VLM | 39 | Emerging |
| 11 | amazon-science/crossmodal-contrastive-learning | CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video... | 39 | Emerging |
| 12 | kyegomez/qformer | Implementation of Qformer from BLIP2 in Zeta Lego blocks. | 38 | Emerging |
| 13 | kyegomez/MGQA | The open source implementation of the multi grouped query attention by the... | 37 | Emerging |
| 14 | kyegomez/MM1 | PyTorch Implementation of the paper "MM1: Methods, Analysis & Insights from... | 37 | Emerging |
| 15 | SuyogKamble/simpleVLM | building a simple VLM. Implementing LlaMA-SmolLM2 from scratch + SigLip2... | 36 | Emerging |
| 16 | alantess/gtrxl-torch | Gated Transformer Model for Computer Vision | 36 | Emerging |
| 17 | kyegomez/PALI | Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model" | 36 | Emerging |
| 18 | deepmancer/vlm-toolbox | Vision-Language Models Toolbox: Your all-in-one solution for multimodal... | 35 | Emerging |
| 19 | ziqipang/RandAR | [CVPR 2025 (Oral)] Open implementation of "RandAR" | 35 | Emerging |
| 20 | logic-OT/BobVLM | BobVLM – A 1.5B multimodal model built from scratch and pre-trained on a... | 35 | Emerging |
| 21 | YeonwooSung/vision-search | Image search engine | 34 | Emerging |
| 22 | DestroyerDarkNess/fastvlm-webgpu | Real-time video captioning powered by FastVLM | 34 | Emerging |
| 23 | zerovl/ZeroVL | [ECCV2022] Contrastive Vision-Language Pre-training with Limited Resources | 34 | Emerging |
| 24 | ola-krutrim/Chitrarth | Chitrarth: Bridging Vision and Language for a Billion People | 32 | Emerging |
| 25 | kyegomez/MobileVLM | Implementation of the LDP module block in PyTorch and Zeta from the paper:... | 31 | Emerging |
| 26 | HLTCHKUST/VG-GPLMs | The code repository for EMNLP 2021 paper "Vision Guided Generative... | 31 | Emerging |
| 27 | Skyline-9/Visionary-Vids | Multi-modal transformer approach for natural language query based joint... | 31 | Emerging |
| 28 | kyegomez/MMCA | The open source community's implementation of the all-new Multi-Modal Causal... | 30 | Emerging |
| 29 | ViLab-UCSD/LaGTran_ICML2024 | Code and models for the ICML 2024 paper "Tell, Don't Show!: Language... | 30 | Emerging |
| 30 | VectorInstitute/VLDBench | VLDBench: A large-scale benchmark for evaluating Vision-Language Models... | 30 | Emerging |
| 31 | SCZwangxiao/RTQ-MM2023 | ACM Multimedia 2023 (Oral) - RTQ: Rethinking Video-language Understanding... | 30 | Emerging |
| 32 | kyegomez/MMCA-MGQA | Experiments around using Multi-Modal Causal Attention with Multi-Grouped... | 29 | Experimental |
| 33 | eltoto1219/vltk | A toolkit for vision-language processing to support the increasing... | 29 | Experimental |
| 34 | ChartMimic/ChartMimic | [ICLR 2025] ChartMimic: Evaluating LMM’s Cross-Modal Reasoning Capability... | 29 | Experimental |
| 35 | declare-lab/MM-Align | [EMNLP 2022] This repository contains the official implementation of the... | 29 | Experimental |
| 36 | raminguyen/LLMP2 | Evaluating ‘Graphical Perception’ with Multimodal Large Language Models | 29 | Experimental |
| 37 | krohling/nl-act | Integrating Natural Language Instructions into the Action Chunking... | 29 | Experimental |
| 38 | vonexel/smog | Pytorch implementation of Semantic Motion Generation - 3D-motion synthesis... | 28 | Experimental |
| 39 | kaylode/vqa-transformer | Visual Question Answering using Transformer and Bottom-Up attention.... | 28 | Experimental |
| 40 | kyegomez/MultiModalCrossAttn | The open source implementation of the cross attention mechanism from the... | 26 | Experimental |
| 41 | o-messai/fastVLM | An implementation of FastVLM/LLaVA or any llm/vlm model using FastAPI... | 25 | Experimental |
| 42 | Victorwz/VaLM | VaLM: Visually-augmented Language Modeling. ICLR 2023. | 23 | Experimental |
| 43 | AIDC-AI/Wings | The code repository for "Wings: Learning Multimodal LLMs without Text-only... | 23 | Experimental |
| 44 | baohuyvanba/Vision-Zephyr | Vision-Zephyr: a multimodal LLM for Visual Commonsense Reasoning—CLIP-ViT +... | 23 | Experimental |
| 45 | shreydan/VLM-OD | experimental: finetune smolVLM on COCO (without any special tokens) | 22 | Experimental |
| 46 | TheMasterOfDisasters/SmolVLM | SmolVLM WebUI & API – Easy-to-Run Vision-Language Model | 22 | Experimental |
| 47 | wklee610/VLM-Model-fastapi | A reusable FastAPI module for serving and integrating Vision-Language Models (VLM) | 22 | Experimental |
| 48 | zalkklop/LVSM | Official code for "LVSM: A Large View Synthesis Model with Minimal 3D... | 22 | Experimental |
| 49 | rahuldevmuraleedharan/Neural-Navigator | Multi-modal Transformer that fuses vision and language to generate... | 21 | Experimental |
| 50 | MaxLSB/mini-paligemma2 | Minimalist implementation of PaliGemma 2 & PaliGemma VLM from scratch | 21 | Experimental |
| 51 | michelecafagna26/VinVL | Original VinVL (and Oscar) repo with API designed for an easy inference | 20 | Experimental |
| 52 | PRITHIVSAKTHIUR/Doc-VLMs-exp | An experimental document-focused Vision-Language Model application that... | 20 | Experimental |
| 53 | tristandb8/PyTorch-PaliGemma-2 | PyTorch implementation of PaliGemma 2 | 19 | Experimental |
| 54 | XavierSpycy/CAT-ImageTextIntegrator | An innovative deep learning framework leveraging the CAT (Convolutions,... | 19 | Experimental |
| 55 | telota/imagines-nummorum-vlm-data-extraction | A computer vision system for automated analysis of index cards from a... | 19 | Experimental |
| 56 | lyuchenyang/Efficient-VideoQA | Code for ACL SustaiNLP 2023 paper "Is a Video worth n × n Images? A Highly... | 18 | Experimental |
| 57 | Soheil-jafari/Language-Guided-Endoscopy-Localization | Open-vocabulary temporal localization in endoscopic video with... | 18 | Experimental |
| 58 | orshkuri/vqa-qformer-comparison | A benchmark and analysis of QFormer, Cross Attention, and Concat models for... | 18 | Experimental |
| 59 | ab3llini/Transformer-VQA | Transformer-based VQA system capable of generating unconstrained, open-ended... | 17 | Experimental |
| 60 | E1ims/math-vlm-finetune-pipeline | 📐 Transcribe handwritten math into accurate LaTeX using a modular... | 15 | Experimental |
| 61 | buhsnn/Vision-Language-Model | Vision-language model combining a ResNet18 vision encoder with a GPT-2... | 14 | Experimental |
| 62 | shreydan/simpleVLM | building a simple VLM. Implementing LlaMA-SmolLM2 from scratch + SigLip2... | 13 | Experimental |
| 63 | praveena2j/LAVViT | [ICASSP 2025] Latent Audio-Visual Vision Transformers for Speaker Verification | 13 | Experimental |
| 64 | AbdulDD/UnifiedVQA | The repository hosts codes, link to datasets and models for our research... | 11 | Experimental |
| 65 | tejas-54/Visual-Search-Engine-Using-VLM | Visual Search Engine using VLM (Vision-Language Model) A... | 11 | Experimental |
| 66 | ycchen218/VisionQA-Llama2-OWLViT | This is a multimodal model designed for the Vision Question Answering (VQA)... | 11 | Experimental |
