Multimodal Vision Language LLM Tools

LLMs designed for understanding and generating content across vision, audio, video, and temporal modalities. Includes models that process images, videos, 3D shapes, and audio alongside text. Does NOT include single-modality tools, general text-only LLMs, or tools that only caption/describe without deeper reasoning.

74 multimodal vision-language tools are tracked; three score above 50 (the Established tier). The highest-rated is jingyaogong/minimind-v at 63/100 with 6,712 stars, and 2 of the top 10 are actively maintained.

Get all 74 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=multimodal-vision-language&limit=74"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
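A minimal Python sketch of building the request URL and filtering the returned projects by tier. The query parameters match the curl example above; the response field names used here (`projects`, `tool`, `score`) are assumptions based on the table below, not a documented schema, so adjust them to the JSON the API actually returns.

```python
from urllib.parse import urlencode

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 74) -> str:
    # Assemble the query string; parameter names match the curl example above.
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

def established_only(payload: dict) -> list[str]:
    # Keep tools scoring above 50 (the "Established" tier).
    # Field names ("projects", "tool", "score") are assumed, not documented.
    return [p["tool"] for p in payload.get("projects", []) if p["score"] > 50]

# Hand-made payload mirroring the table's top rows, for illustration only:
sample = {"projects": [
    {"tool": "jingyaogong/minimind-v", "score": 63},
    {"tool": "zai-org/GLM-TTS", "score": 49},
]}
print(build_url("llm-tools", "multimodal-vision-language"))
print(established_only(sample))  # only minimind-v clears the 50-point bar
```

Fetching the URL with `curl` or `urllib.request.urlopen` and passing the decoded JSON to `established_only` would reproduce the Established tier shown below, assuming that response shape holds.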

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | jingyaogong/minimind-v | 🚀 Train a 26M-parameter visual multimodal VLM from scratch in 1 hour! 🌏 | 63 | Established |
| 2 | SkyworkAI/Skywork-R1V | Skywork-R1V is an advanced multimodal AI model series developed by Skywork... | 51 | Established |
| 3 | roboflow/vision-ai-checkup | Take your LLM to the optometrist. | 51 | Established |
| 4 | zai-org/GLM-TTS | GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward... | 49 | Emerging |
| 5 | NExT-GPT/NExT-GPT | Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large... | 48 | Emerging |
| 6 | EvolvingLMMs-Lab/NEO | NEO Series: Native Vision-Language Models from First Principles | 48 | Emerging |
| 7 | OpenGVLab/InternVL | [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to... | 47 | Emerging |
| 8 | EvolvingLMMs-Lab/LLaVA-OneVision-1.5 | Fully Open Framework for Democratized Multimodal Training | 47 | Emerging |
| 9 | huangwl18/VoxPoser | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | 47 | Emerging |
| 10 | InternLM/InternLM-XComposer | InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for... | 46 | Emerging |
| 11 | OpenGVLab/Ask-Anything | [CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And... | 45 | Emerging |
| 12 | ihp-lab/Face-LLaVA | [WACV 2026] Face-LLaVA: Facial Expression and Attribute Understanding... | 45 | Emerging |
| 13 | JIA-Lab-research/MGM | Official repo for "Mini-Gemini: Mining the Potential of Multi-modality... | 45 | Emerging |
| 14 | EvolvingLMMs-Lab/Otter | 🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of... | 44 | Emerging |
| 15 | connorkapoor/Palmetto | A simple web-based CAD workbench for discovering and creating DFM (Design... | 44 | Emerging |
| 16 | OceanGPT/OceanGPT | [沧渊] [ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks | 43 | Emerging |
| 17 | bagh2178/SG-Nav | [NeurIPS 2024] SG-Nav: Online 3D Scene Graph Prompting for LLM-based... | 42 | Emerging |
| 18 | thuml/iVideoGPT | Official repository for "iVideoGPT: Interactive VideoGPTs are Scalable World... | 42 | Emerging |
| 19 | LLaVA-VL/LLaVA-Plus-Codebase | LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills | 42 | Emerging |
| 20 | JIA-Lab-research/LLMGA | This project is the official implementation of 'LLMGA: Multimodal Large... | 41 | Emerging |
| 21 | FusionBrainLab/OmniFusion | OmniFusion, a multimodal model to communicate using text and images | 41 | Emerging |
| 22 | YvanYin/DrivingWorld | Code for "DrivingWorld: Constructing World Model for Autonomous Driving via... | 41 | Emerging |
| 23 | tincans-ai/gazelle | Joint speech-language model - respond directly to audio! | 41 | Emerging |
| 24 | yuanze-lin/Olympus | [CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router... | 41 | Emerging |
| 25 | PKU-YuanGroup/Chat-UniVi | [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers... | 41 | Emerging |
| 26 | SALT-NLP/Sketch2Code | Code for the paper: Sketch2Code: Evaluating Vision-Language Models for... | 40 | Emerging |
| 27 | dimitrismallis/CAD-Assistant | Code for our ICCV 2025 paper "CAD-Assistant: Tool-Augmented VLLMs as Generic... | 39 | Emerging |
| 28 | MooreThreads/MooER | MooER: Moore-threads Open Omni model for speech-to-speech intERaction... | 39 | Emerging |
| 29 | Pointcept/GPT4Point | [CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language... | 39 | Emerging |
| 30 | H-Freax/ThinkGrasp | [CoRL2024] ThinkGrasp: A Vision-Language System for Strategic Part Grasping... | 38 | Emerging |
| 31 | greenland-dream/video-understanding | This repository provides core code for managing large volumes of video... | 38 | Emerging |
| 32 | wgcyeo/WorldMM | [CVPR 2026] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning | 36 | Emerging |
| 33 | mbzuai-oryx/LLaVA-pp | 🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3) | 36 | Emerging |
| 34 | Open3DA/LL3DA | [CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D... | 36 | Emerging |
| 35 | isjinghao/OralGPT | [NeurIPS'25 \| CVPR'26] The official repo of OralGPT & MMOral Bench. | 35 | Emerging |
| 36 | om-ai-lab/ZoomEye | [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming... | 35 | Emerging |
| 37 | worldbench/VideoLucy | [NeurIPS 2025] Deep Memory Backtracking for Long Video Understanding | 34 | Emerging |
| 38 | FuxiaoLiu/MMC | [NAACL 2024] MMC: Advancing Multimodal Chart Understanding with LLM... | 33 | Emerging |
| 39 | luxus180/LLaVA-OneVision-1.5 | 🛠️ Build and train multimodal models easily with LLaVA-OneVision 1.5, an... | 32 | Emerging |
| 40 | Hiram31/CADialogue | Official implementation of "CADialogue: A Multimodal LLM-Powered... | 32 | Emerging |
| 41 | WisconsinAIVision/YoChameleon | 🦎 Yo'Chameleon: Your Personalized Chameleon (CVPR 2025) | 32 | Emerging |
| 42 | bigai-nlco/VideoTGB | [EMNLP 2024] A Video Chat Agent with Temporal Prior | 32 | Emerging |
| 43 | nuldertien/PathBLIP-2 | This repository contains all code to support the paper: "On the Importance... | 31 | Emerging |
| 44 | showlab/VLog | [CVPR 2025] Video Narration as Vocabulary & Video as Long Document | 31 | Emerging |
| 45 | XduSyL/EventGPT | 🔥[CVPR2025] EventGPT: Event Stream Understanding with Multimodal Large... | 31 | Emerging |
| 46 | yifanlu0227/ChatSim | [CVPR2024 Highlight] Editable Scene Simulation for Autonomous Driving via... | 31 | Emerging |
| 47 | Piero24/VLM-Object-Detection | A pipeline for object detection and segmentation using a Vision-Language... | 31 | Emerging |
| 48 | ShareGPT4Omni/ShareGPT4Video | [NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving... | 31 | Emerging |
| 49 | Hyeongkeun/LAVCap | Official Pytorch Implementation of 'LAVCap: LLM-based Audio-Visual... | 30 | Emerging |
| 50 | ZPider0/Multimodal | 🎤 Transform speech and text with this lightweight Python toolkit for... | 28 | Experimental |
| 51 | OmniMMI/OmniMMI | [CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in... | 28 | Experimental |
| 52 | DonaldTrump-coder/Informative-Scene-Reconstruction-App | A local software and cloud service system that integrates 3D functionalities... | 28 | Experimental |
| 53 | anymodality/anymodality | AnyModality is an open-source library to simplify MultiModal LLM inference... | 27 | Experimental |
| 54 | ShareGPT4Omni/ShareGPT4V | [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions | 27 | Experimental |
| 55 | whwu95/FreeVA | FreeVA: Offline MLLM as Training-Free Video Assistant | 27 | Experimental |
| 56 | timmylucy/GLM-ASR | 🔊 Enhance speech recognition with GLM-ASR-Nano-2512, a high-performance... | 27 | Experimental |
| 57 | hamedR96/User-VLM | Personalized Vision Language Models for Social Human-Robot Interactions | 26 | Experimental |
| 58 | SiyuWang0906/CAD-GPT | [AAAI2025] CAD-GPT: Synthesising CAD Construction Sequence with Spatial... | 26 | Experimental |
| 59 | Toommo2/Text2CAD | 🚀 Convert natural language to real CAD artifacts with Text2CAD, an... | 25 | Experimental |
| 60 | InternRobotics/VLM-Grounder | [CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | 24 | Experimental |
| 61 | alexander-moore/vlm | Composition of Multimodal Language Models From Scratch | 24 | Experimental |
| 62 | Pittawat2542/driving-assessment-distillation | This repository contains the code and data for the paper "Speed Up!... | 23 | Experimental |
| 63 | Atomic-man007/blip-vision-language | BLIP is a novel Vision-Language Pre-training (VLP) framework designed to... | 22 | Experimental |
| 64 | OpenShapeLab/ShapeGPT | ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model, a... | 21 | Experimental |
| 65 | engindeniz/vitis | [ICCV 2023 CLVL Workshop] Zero-Shot and Few-Shot Video Question Answering... | 21 | Experimental |
| 66 | Jeremyyny/Value-Spectrum | Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value... | 20 | Experimental |
| 67 | PrateekJannu/Vision-GPT | Coding a Multi-Modal vision model like GPT-4o from scratch, inspired by... | 17 | Experimental |
| 68 | oncescuandreea/audio_egovlp | This is the official codebase used for obtaining the results in the ICASSP... | 17 | Experimental |
| 69 | sonkd/Visual-Question-Answering-on-VizWiz | Visual Question Answering on VizWiz, A Generative CLIP + LSTM Approach with... | 14 | Experimental |
| 70 | david-s-martinez/Dex-GAN-Grasp | DexGANGrasp: Dexterous Generative Adversarial Grasping Synthesis for... | 13 | Experimental |
| 71 | ShareGPT4Omni/ShareGPT4Omni | ShareGPT4Omni: Towards Building Omni Large Multi-modal Models with... | 13 | Experimental |
| 72 | luisrui/Modality-Interference-in-MLLMs | The source code for the paper "Diagnosing and Mitigating Modality... | 13 | Experimental |
| 73 | lemonmindyes/ThinkCLIP | Lightweight CLIP framework built with ViT + GPT encoders for vision-language... | 12 | Experimental |
| 74 | RajGothi/Visual-Entities-Empowered-Zero-Shot-Image-to-Text-Generation-Transfer-Across-Domains | Visual Entities Empowered Zero-Shot Image-to-Text Generation Transfer Across Domains | 11 | Experimental |