Vision Language Models ML Frameworks

Frameworks and implementations for multimodal models that combine vision and language capabilities, including vision-language transformers, image-text generation, and visual question answering systems. Does NOT include single-modality models, general computer vision frameworks, or task-specific applications like document OCR or license plate recognition.

There are 111 vision-language model frameworks tracked. Seven score above 50 (the established tier). The highest-rated is open-mmlab/mmpretrain at 60/100 with 3,837 stars. Only 1 of the top 10 is actively maintained.

Get all 111 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=ml-frameworks&subcategory=vision-language-models&limit=111"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
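If you prefer to query the endpoint from a script, the request URL can be assembled from the three documented parameters (`domain`, `subcategory`, `limit`). This is a minimal sketch; the response schema beyond "projects as JSON" is not documented here, so the fetch is shown but left commented rather than assuming field names:

```python
import urllib.parse

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_query_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the dataset query URL from its documented parameters."""
    params = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE}?{params}"

# Raise limit to cover the full set (keyless access is capped at 100 requests/day):
url = build_query_url("ml-frameworks", "vision-language-models", limit=111)
# import json, urllib.request
# data = json.load(urllib.request.urlopen(url))  # schema not shown in this listing
```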

| # | Framework | Description | Score | Tier |
|---|-----------|-------------|-------|------|
| 1 | open-mmlab/mmpretrain | OpenMMLab Pre-training Toolbox and Benchmark | 60 | Established |
| 2 | facebookresearch/mmf | A modular framework for vision & language multimodal research from Facebook... | 58 | Established |
| 3 | HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis | Papers, code and datasets about deep learning and multi-modal learning for... | 51 | Established |
| 4 | KaiyangZhou/pytorch-vsumm-reinforce | Unsupervised video summarization with deep reinforcement learning (AAAI'18) | 51 | Established |
| 5 | adambielski/siamese-triplet | Siamese and triplet networks with online pair/triplet mining in PyTorch | 51 | Established |
| 6 | kuanghuei/SCAN | PyTorch source code for "Stacked Cross Attention for Image-Text Matching" (ECCV 2018) | 50 | Established |
| 7 | friedrichor/Awesome-Multimodal-Papers | A curated list of awesome multimodal studies. | 50 | Established |
| 8 | batra-mlp-lab/visdial | [CVPR 2017] Torch code for Visual Dialog | 49 | Emerging |
| 9 | pliang279/awesome-multimodal-ml | Reading list for research topics in multimodal machine learning | 48 | Emerging |
| 10 | kezhang-cs/Video-Summarization-with-LSTM | Implementation of our ECCV 2016 paper (Video Summarization with Long... | 48 | Emerging |
| 11 | vbalnt/tfeat | TFeat descriptor models for BMVC 2016 paper "Learning local feature... | 47 | Emerging |
| 12 | codebyshibsankar/image_triplet_loss | Image similarity using triplet loss | 47 | Emerging |
| 13 | kyegomez/HRTX | Multi-Modal Multi-Embodied Hivemind-like Iteration of RTX-2 | 47 | Emerging |
| 14 | pliang279/MultiBench | [NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning | 47 | Emerging |
| 15 | willxxy/awesome-mmps | Corpus of resources for multimodal machine learning with physiological... | 46 | Emerging |
| 16 | kyegomez/Med-PaLM | Towards Generalist Biomedical AI | 45 | Emerging |
| 17 | nekhtiari/image-similarity-measures | :chart_with_upwards_trend: Implementation of eight evaluation metrics to... | 45 | Emerging |
| 18 | mlfoundations/open_flamingo | An open-source framework for training large multimodal models. | 45 | Emerging |
| 19 | landskape-ai/triplet-attention | Official PyTorch implementation for "Rotate to Attend: Convolutional Triplet... | 44 | Emerging |
| 20 | Cloud-CV/VQA | CloudCV Visual Question Answering Demo | 44 | Emerging |
| 21 | OpenBioLink/ThoughtSource | A central, open resource for data and tools related to chain-of-thought... | 43 | Emerging |
| 22 | Cadene/vqa.pytorch | Visual Question Answering in PyTorch | 43 | Emerging |
| 23 | thubZ09/vision-language-model-research | Hub for researchers exploring VLMs and multimodal learning | 43 | Emerging |
| 24 | thuiar/MIntRec | MIntRec: A New Dataset for Multimodal Intent Recognition (ACM MM 2022) | 42 | Emerging |
| 25 | aioz-ai/CFR_VQA | Coarse-to-Fine Reasoning for Visual Question Answering (CVPRW'22) | 42 | Emerging |
| 26 | maruya24/pytorch_robotics_transformer | A PyTorch re-implementation of RT-1 (Robotics Transformer) | 42 | Emerging |
| 27 | kyegomez/Fuyu | Implementation of Adept's Fuyu, an all-new multi-modality model, in PyTorch | 41 | Emerging |
| 28 | ManifoldRG/NEKO | Implementation of a GATO-style generalist multimodal model capable of image,... | 41 | Emerging |
| 29 | abhshkdz/neural-vqa | :grey_question: Visual Question Answering in Torch | 41 | Emerging |
| 30 | mlbio-epfl/joint-inference | [ICLR 2025] Large (Vision) Language Models are Unsupervised In-Context Learners | 40 | Emerging |
| 31 | thswodnjs3/CSTA | The official code of "CSTA: CNN-based Spatiotemporal Attention for Video... | 40 | Emerging |
| 32 | IBM/AdaMML | Official implementation of AdaMML. https://arxiv.org/abs/2105.05165 | 40 | Emerging |
| 33 | aioz-ai/MICCAI21_MMQ | Multiple Meta-model Quantifying for Medical Visual Question Answering (MICCAI 2021) | 40 | Emerging |
| 34 | monjurulkarim/DSTA | Implementation code for the paper "A Dynamic Spatial-temporal... | 40 | Emerging |
| 35 | yuanze-lin/REVIVE | [NeurIPS 2022] Official code for REVIVE: Regional Visual Representation... | 39 | Emerging |
| 36 | jingyi0000/VLM_survey | Collection of awesome vision-language models for vision tasks | 39 | Emerging |
| 37 | TIGER-AI-Lab/VideoScore | Official repo for "VideoScore: Building Automatic Metrics to Simulate... | 38 | Emerging |
| 38 | abhshkdz/neural-vqa-attention | :question: Attention-based Visual Question Answering in Torch | 38 | Emerging |
| 39 | zchuz/CoT-Reasoning-Survey | [ACL 2024] A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future | 36 | Emerging |
| 40 | williamcfrancis/Visual-Question-Answering-using-Stacked-Attention-Networks | PyTorch implementation of VQA using Stacked Attention Networks: Multimodal... | 36 | Emerging |
| 41 | subho406/OmniNet | Official PyTorch implementation of "OmniNet: A unified architecture for... | 36 | Emerging |
| 42 | real-stanford/semantic-abstraction | [CoRL 2022] This repository contains code for generating relevancies,... | 35 | Emerging |
| 43 | RManLuo/MAMDR | Official code implementation for ICDE 23 paper MAMDR: A Model Agnostic... | 35 | Emerging |
| 44 | pranv/ARC | Code for Attentive Recurrent Comparators | 34 | Emerging |
| 45 | tgxs002/wikiscenes | Towers of Babel: Combining Images, Language, and 3D Geometry for Learning... | 34 | Emerging |
| 46 | nerdimite/neuro-symbolic-ai-soc | Neuro-Symbolic Visual Question Answering on Sort-of-CLEVR using PyTorch | 34 | Emerging |
| 47 | pliang279/MultiViz | [ICLR 2023] MultiViz: Towards Visualizing and Understanding Multimodal Models | 34 | Emerging |
| 48 | invictus717/MiCo | [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale | 34 | Emerging |
| 49 | AlwaysFHao/TiM4Rec | [Neurocomputing 2025] The code for the paper "TiM4Rec: An Efficient... | 33 | Emerging |
| 50 | Jakobovski/decoupled-multimodal-learning | A decoupled, generative, unsupervised, multimodal neural architecture. | 33 | Emerging |
| 51 | neulab/CulturalGround | This repository provides the official resources for the EMNLP 2025 paper... | 33 | Emerging |
| 52 | imneonizer/pytorch-triplet-loss | Birds 400-Species Image Classification using PyTorch Metric Learning... | 32 | Emerging |
| 53 | Skyyyy0920/MTNet | Code implementation for our paper "Learning Time Slot Preferences via... | 32 | Emerging |
| 54 | Rishit-dagli/Astroformer | This repository contains the official implementation of Astroformer, an ICLR... | 32 | Emerging |
| 55 | kyegomez/AutoRT | Implementation of AutoRT: "AutoRT: Embodied Foundation Models for Large... | 32 | Emerging |
| 56 | Soumya-Chakraborty/Unsupervised-video-summarization-with-deep-GAN-reinforcement-learning | Unsupervised video summarization with deep (GAN) reinforcement learning | 32 | Emerging |
| 57 | tensorpix/benchmarking-cv-models | Benchmark computer vision ML models in 3 minutes | 32 | Emerging |
| 58 | etornam45/vl-jepa | This VL-JEPA implementation takes direct inspiration from the original VL-JEPA paper | 30 | Emerging |
| 59 | cpystan/WSI-VQA | [ECCV 2024] Official implementation of "WSI-VQA: Interpreting Whole Slide... | 30 | Emerging |
| 60 | AceCHQ/MMIQ | This repo contains evaluation code for the MM-IQ benchmark. | 30 | Emerging |
| 61 | lilygeorgescu/MHCA | Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for... | 29 | Experimental |
| 62 | liveseongho/Awesome-Video-Language-Understanding | A survey on video and language understanding. | 29 | Experimental |
| 63 | ntkhoa95/multimodal-for-vision | Vision Framework: A modular multi-agent system for computer vision tasks,... | 29 | Experimental |
| 64 | le-liang/Multimodal-Wireless | Python scripts and assets related to the Multimodal-Wireless dataset. The... | 29 | Experimental |
| 65 | fansunqi/VideoTool | Official repository for NeurIPS'25 paper "Tool-Augmented Spatiotemporal... | 29 | Experimental |
| 66 | Peachypie98/CBAM | CBAM: Convolutional Block Attention Module for CIFAR100 on VGG19 | 29 | Experimental |
| 67 | yousefkotp/Visual-Question-Answering | A lightweight deep learning model with a web application to answer... | 29 | Experimental |
| 68 | vtu81/NaiveVQA | A Visual Question Answering model implemented in MindSpore and PyTorch. The... | 28 | Experimental |
| 69 | zamaex96/Hybrid-CNN-LSTM-with-Spatial-Attention | This documents the training and evaluation of a hybrid CNN-LSTM attention... | 28 | Experimental |
| 70 | VQA-Team/Visual-Question-Answering | The project is an Android application aimed to help the visually impaired by... | 27 | Experimental |
| 71 | kyegomez/NeVA | The open-source implementation of "NeVA: NeMo Vision and Language Assistant" | 27 | Experimental |
| 72 | uakarsh/med-vqa | An approach for solving the problem of medical visual question answering | 27 | Experimental |
| 73 | RobotiXX/multimodal-fusion-network | This repository contains all the code for parsing, transforming and training... | 27 | Experimental |
| 74 | kyegomez/MultiModal-ToT | Multi-Modal Tree of Thoughts for DALLE-3-like auto self-improvement | 27 | Experimental |
| 75 | schwettmann/visual-vocab | PyTorch-based tools for constructing a vocabulary of visual concepts in a GAN. | 27 | Experimental |
| 76 | naamiinepal/tunevlseg | [ACCV 2024] TuneVLSeg: Prompt Tuning Benchmark for Vision-Language... | 25 | Experimental |
| 77 | yuhui-zh15/VLMClassifier | Official implementation of "Why are Visually-Grounded Language Models Bad at... | 25 | Experimental |
| 78 | projectayre/ayre | Visual Question Answering with an added novel semantic analysis approach.... | 24 | Experimental |
| 79 | clear-nus/MuMMI | Multi-Modal Mutual Information (MuMMI) Training for Robust Self-Supervised... | 24 | Experimental |
| 80 | SriramPingali/Multi-Modal-Recommendation-System | Official code for the paper "Towards developing a Multi Modal Video... | 23 | Experimental |
| 81 | aaaastark/hybrid-model-with-cnn-lstm-python | Hybrid model with CNN and LSTM for the VMD dataset using Python | 23 | Experimental |
| 82 | kyegomez/VisionLLaMA | Implementation of VisionLLaMA from the paper "VisionLLaMA: A Unified LLaMA... | 22 | Experimental |
| 83 | rkl71/MambaRec | [CIKM 2025] Source code for "Modality Alignment with Multi-scale Bilateral... | 22 | Experimental |
| 84 | fansunqi/AKeyS | Agentic Keyframe Search for Video Question Answering | 22 | Experimental |
| 85 | ankitsharma-tech/Image-Triplet-Loss | Image similarity using triplet loss. | 22 | Experimental |
| 86 | iluvn01/VFMTok | 🖼️ Leverage vision foundation models to transform visual data into effective... | 22 | Experimental |
| 87 | guyyariv/vLMIG | This repo contains the official PyTorch implementation of vLMIG: Improving... | 22 | Experimental |
| 88 | anujanegi/VQA | Visual Question Answering system | 21 | Experimental |
| 89 | RobinDong/tiny_multimodal | Tiny and simple implementation of multimodal models | 20 | Experimental |
| 90 | google/crossmodal-3600 | Crossmodal-3600 dataset | 20 | Experimental |
| 91 | cronenberg64/VLM-arch | Systematic benchmarking of modern vision backbones under small-data... | 20 | Experimental |
| 92 | Dafterfly/Quick_Vilt | A CLI and GUI for using the Vision-and-Language Transformer (ViLT) model for... | 19 | Experimental |
| 93 | alsaniie/Image-Similarity-Index-SSIM-analysis- | In image processing, an image similarity index, also known as a similarity... | 19 | Experimental |
| 94 | YeLuoSuiYou/openstorypp | We introduce OpenStory++, a large-scale open-domain dataset focusing on... | 19 | Experimental |
| 95 | Hodasia/Awesome-Vision-Language-Finetune | Awesome list of vision-language prompt papers | 19 | Experimental |
| 96 | lyuchenyang/Semantic-aware-VideoQA | Code for ACL SRW 2023 paper "Semantic-aware Dynamic... | 19 | Experimental |
| 97 | aiden200/VLM_Implementation | Implementing a video language model from scratch | 19 | Experimental |
| 98 | ved1beta/Paligemma | Vision language model | 19 | Experimental |
| 99 | holylovenia/awesome-multimodal-convai | Paper reading list for multimodal conversational AI | 19 | Experimental |
| 100 | MohEsmail143/vizwiz-visual-question-answering | An implementation of the paper "Less is More", which was used to attempt the... | 17 | Experimental |
| 101 | Gurumurthy30/multimodal-gpt2-demo | A lightweight multimodal model combining GPT-2 and Vision Transformer for... | 15 | Experimental |
| 102 | Soumya-Chakraborty/VL-JEPA | VL-JEPA Joint Embedding Predictive Architecture for vision-language... | 15 | Experimental |
| 103 | yuhui-zh15/drml | Official code release for "Diagnosing and Rectifying Vision Models using... | 15 | Experimental |
| 104 | TAU-VAILab/isbertblind | This repository is for the paper "Is BERT Blind? Exploring the Effect of... | 14 | Experimental |
| 105 | jesusp1234/multimodal-benchmarks | 🎯 Benchmark retrieval systems across video, image, audio, and documents with... | 14 | Experimental |
| 106 | anggaumhar/dynamicvl | 🌆 Benchmark multimodal large language models to enhance understanding of... | 14 | Experimental |
| 107 | ipoukoumondi/IWR-Bench | 🌐 Evaluate LVLMs' ability to reconstruct dynamic, interactive webpages from... | 14 | Experimental |
| 108 | darkmax159159357/TypeR-models | ⚠️ DEPRECATED: merged into darkmax159159357/TypeR. See main repo for all... | 13 | Experimental |
| 109 | MichiganNLP/wildqa | WildQA website code | 13 | Experimental |
| 110 | soominmyung/Pairwise_Siamese_transformer | Pairwise Preference Learning with Siamese Transformer Encoders | 13 | Experimental |
| 111 | retkowsky/ViLT | Visual Question Answering with ViLT | 12 | Experimental |