Multimodal Vision-Language LLM Tools
LLMs designed for understanding and generating content across vision, audio, video, and temporal modalities. Includes models that process images, videos, 3D shapes, and audio alongside text. Does NOT include single-modality tools, general text-only LLMs, or tools that only caption/describe without deeper reasoning.
This list tracks 74 multimodal vision-language tools. Three score above 50 (the established tier). The highest-rated is jingyaogong/minimind-v at 63/100, with 6,712 stars. Two of the top 10 are actively maintained.
Get the projects as JSON (the example below returns the top 20; raise `limit` to fetch all 74):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=multimodal-vision-language&limit=20"
```

Open to everyone: 100 requests/day with no key needed; a free key raises that to 1,000/day.
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | jingyaogong/minimind-v | 🚀 Train a 26M-parameter visual multimodal VLM from scratch in 1 hour! 🌏 | 63 | Established |
| 2 | SkyworkAI/Skywork-R1V | Skywork-R1V is an advanced multimodal AI model series developed by Skywork... | | Established |
| 3 | roboflow/vision-ai-checkup | Take your LLM to the optometrist. | | Established |
| 4 | zai-org/GLM-TTS | GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward... | | Emerging |
| 5 | NExT-GPT/NExT-GPT | Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large... | | Emerging |
| 6 | EvolvingLMMs-Lab/NEO | NEO Series: Native Vision-Language Models from First Principles | | Emerging |
| 7 | OpenGVLab/InternVL | [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to... | | Emerging |
| 8 | EvolvingLMMs-Lab/LLaVA-OneVision-1.5 | Fully Open Framework for Democratized Multimodal Training | | Emerging |
| 9 | huangwl18/VoxPoser | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | | Emerging |
| 10 | InternLM/InternLM-XComposer | InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for... | | Emerging |
| 11 | OpenGVLab/Ask-Anything | [CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And... | | Emerging |
| 12 | ihp-lab/Face-LLaVA | [WACV 2026] Face-LLaVA: Facial Expression and Attribute Understanding... | | Emerging |
| 13 | JIA-Lab-research/MGM | Official repo for "Mini-Gemini: Mining the Potential of Multi-modality... | | Emerging |
| 14 | EvolvingLMMs-Lab/Otter | 🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of... | | Emerging |
| 15 | connorkapoor/Palmetto | A simple web-based CAD workbench for discovering and creating DFM (Design... | | Emerging |
| 16 | OceanGPT/OceanGPT | [沧渊] [ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks | | Emerging |
| 17 | bagh2178/SG-Nav | [NeurIPS 2024] SG-Nav: Online 3D Scene Graph Prompting for LLM-based... | | Emerging |
| 18 | thuml/iVideoGPT | Official repository for "iVideoGPT: Interactive VideoGPTs are Scalable World... | | Emerging |
| 19 | LLaVA-VL/LLaVA-Plus-Codebase | LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills | | Emerging |
| 20 | JIA-Lab-research/LLMGA | This project is the official implementation of 'LLMGA: Multimodal Large... | | Emerging |
| 21 | FusionBrainLab/OmniFusion | OmniFusion, a multimodal model to communicate using text and images | | Emerging |
| 22 | YvanYin/DrivingWorld | Code for "DrivingWorld: Constructing World Model for Autonomous Driving via... | | Emerging |
| 23 | tincans-ai/gazelle | Joint speech-language model - respond directly to audio! | | Emerging |
| 24 | yuanze-lin/Olympus | [CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router... | | Emerging |
| 25 | PKU-YuanGroup/Chat-UniVi | [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers... | | Emerging |
| 26 | SALT-NLP/Sketch2Code | Code for the paper: Sketch2Code: Evaluating Vision-Language Models for... | | Emerging |
| 27 | dimitrismallis/CAD-Assistant | Code for our ICCV 2025 paper "CAD-Assistant: Tool-Augmented VLLMs as Generic... | | Emerging |
| 28 | MooreThreads/MooER | MooER: Moore-threads Open Omni model for speech-to-speech intERaction.... | | Emerging |
| 29 | Pointcept/GPT4Point | [CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language... | | Emerging |
| 30 | H-Freax/ThinkGrasp | [CoRL2024] ThinkGrasp: A Vision-Language System for Strategic Part Grasping... | | Emerging |
| 31 | greenland-dream/video-understanding | This repository provides core code for managing large volumes of video... | | Emerging |
| 32 | wgcyeo/WorldMM | [CVPR 2026] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning | | Emerging |
| 33 | mbzuai-oryx/LLaVA-pp | 🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3) | | Emerging |
| 34 | Open3DA/LL3DA | [CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D... | | Emerging |
| 35 | isjinghao/OralGPT | [NeurIPS'25 \| CVPR'26] The official repo of OralGPT & MMOral Bench. | | Emerging |
| 36 | om-ai-lab/ZoomEye | [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming... | | Emerging |
| 37 | worldbench/VideoLucy | [NeurIPS 2025] Deep Memory Backtracking for Long Video Understanding | | Emerging |
| 38 | FuxiaoLiu/MMC | [NAACL 2024] MMC: Advancing Multimodal Chart Understanding with LLM... | | Emerging |
| 39 | luxus180/LLaVA-OneVision-1.5 | 🛠️ Build and train multimodal models easily with LLaVA-OneVision 1.5, an... | | Emerging |
| 40 | Hiram31/CADialogue | Official implementation of "CADialogue: A Multimodal LLM-Powered... | | Emerging |
| 41 | WisconsinAIVision/YoChameleon | 🦎 Yo'Chameleon: Your Personalized Chameleon (CVPR 2025) | | Emerging |
| 42 | bigai-nlco/VideoTGB | [EMNLP 2024] A Video Chat Agent with Temporal Prior | | Emerging |
| 43 | nuldertien/PathBLIP-2 | This repository contains all code to support the paper: "On the Importance... | | Emerging |
| 44 | showlab/VLog | [CVPR 2025] Video Narration as Vocabulary & Video as Long Document | | Emerging |
| 45 | XduSyL/EventGPT | 🔥[CVPR2025] EventGPT: Event Stream Understanding with Multimodal Large... | | Emerging |
| 46 | yifanlu0227/ChatSim | [CVPR2024 Highlight] Editable Scene Simulation for Autonomous Driving via... | | Emerging |
| 47 | Piero24/VLM-Object-Detection | A pipeline for object detection and segmentation using a Vision-Language... | | Emerging |
| 48 | ShareGPT4Omni/ShareGPT4Video | [NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving... | | Emerging |
| 49 | Hyeongkeun/LAVCap | Official Pytorch Implementation of 'LAVCap: LLM-based Audio-Visual... | | Emerging |
| 50 | ZPider0/Multimodal | 🎤 Transform speech and text with this lightweight Python toolkit for... | | Experimental |
| 51 | OmniMMI/OmniMMI | [CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in... | | Experimental |
| 52 | DonaldTrump-coder/Informative-Scene-Reconstruction-App | A local software and cloud service system that integrates 3D functionalities... | | Experimental |
| 53 | anymodality/anymodality | AnyModality is an open-source library to simplify MultiModal LLM inference... | | Experimental |
| 54 | ShareGPT4Omni/ShareGPT4V | [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions | | Experimental |
| 55 | whwu95/FreeVA | FreeVA: Offline MLLM as Training-Free Video Assistant | | Experimental |
| 56 | timmylucy/GLM-ASR | 🔊 Enhance speech recognition with GLM-ASR-Nano-2512, a high-performance... | | Experimental |
| 57 | hamedR96/User-VLM | Personalized Vision Language Models for Social Human-Robot Interactions | | Experimental |
| 58 | SiyuWang0906/CAD-GPT | [AAAI2025] CAD-GPT: Synthesising CAD Construction Sequence with Spatial... | | Experimental |
| 59 | Toommo2/Text2CAD | 🚀 Convert natural language to real CAD artifacts with Text2CAD, an... | | Experimental |
| 60 | InternRobotics/VLM-Grounder | [CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | | Experimental |
| 61 | alexander-moore/vlm | Composition of Multimodal Language Models From Scratch | | Experimental |
| 62 | Pittawat2542/driving-assessment-distillation | This repository contains the code and data for the paper "Speed Up!... | | Experimental |
| 63 | Atomic-man007/blip-vision-language | BLIP is a novel Vision-Language Pre-training (VLP) framework designed to... | | Experimental |
| 64 | OpenShapeLab/ShapeGPT | ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model, a... | | Experimental |
| 65 | engindeniz/vitis | [ICCV 2023 CLVL Workshop] Zero-Shot and Few-Shot Video Question Answering... | | Experimental |
| 66 | Jeremyyny/Value-Spectrum | Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value... | | Experimental |
| 67 | PrateekJannu/Vision-GPT | Coding a Multi-Modal vision model like GPT-4o from scratch, inspired by... | | Experimental |
| 68 | oncescuandreea/audio_egovlp | This is the official codebase used for obtaining the results in the ICASSP... | | Experimental |
| 69 | sonkd/Visual-Question-Answering-on-VizWiz | Visual Question Answering on VizWiz, A Generative CLIP + LSTM Approach with... | | Experimental |
| 70 | david-s-martinez/Dex-GAN-Grasp | DexGANGrasp: Dexterous Generative Adversarial Grasping Synthesis for... | | Experimental |
| 71 | ShareGPT4Omni/ShareGPT4Omni | ShareGPT4Omni: Towards Building Omni Large Multi-modal Models with... | | Experimental |
| 72 | luisrui/Modality-Interference-in-MLLMs | The source code for the paper "Diagnosing and Mitigating Modality... | | Experimental |
| 73 | lemonmindyes/ThinkCLIP | Lightweight CLIP framework built with ViT + GPT encoders for vision-language... | | Experimental |
| 74 | RajGothi/Visual-Entities-Empowered-Zero-Shot-Image-to-Text-Generation-Transfer-Across-Domains | Visual Entities Empowered Zero-Shot Image-to-Text Generation Transfer Across Domains | | Experimental |