Multimodal Vision-Language LLM Tools
LLMs designed for understanding and generating content across vision, audio, video, and temporal modalities. Includes models that process images, videos, 3D shapes, and audio alongside text. Does NOT include single-modality tools, general text-only LLMs, or tools that only caption/describe without deeper reasoning.
This list tracks 74 multimodal vision-language tools. Three score above 50 (the established tier). The highest-rated is jingyaogong/minimind-v at 63/100, with 6,712 stars. Two of the top 10 are actively maintained.
Get the projects as JSON (the example below returns the top 20; raise `limit` to fetch all 74):

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=multimodal-vision-language&limit=20"
```

Open to everyone: 100 requests/day with no key needed; a free key raises that to 1,000/day.
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | jingyaogong/minimind-v | 🚀 Train a 26M-parameter visual multimodal VLM from scratch in 1 hour! 🌏 | 63 | Established |
| 2 | SkyworkAI/Skywork-R1V | Skywork-R1V is an advanced multimodal AI model series developed by Skywork... | | Established |
| 3 | roboflow/vision-ai-checkup | Take your LLM to the optometrist. | | Established |
| 4 | zai-org/GLM-TTS | GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward... | | Emerging |
| 5 | NExT-GPT/NExT-GPT | Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large... | | Emerging |
| 6 | EvolvingLMMs-Lab/NEO | NEO Series: Native Vision-Language Models from First Principles | | Emerging |
| 7 | OpenGVLab/InternVL | [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to... | | Emerging |
| 8 | EvolvingLMMs-Lab/LLaVA-OneVision-1.5 | Fully Open Framework for Democratized Multimodal Training | | Emerging |
| 9 | huangwl18/VoxPoser | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | | Emerging |
| 10 | InternLM/InternLM-XComposer | InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for... | | Emerging |
| 11 | OpenGVLab/Ask-Anything | [CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And... | | Emerging |
| 12 | ihp-lab/Face-LLaVA | [WACV 2026] Face-LLaVA: Facial Expression and Attribute Understanding... | | Emerging |
| 13 | JIA-Lab-research/MGM | Official repo for "Mini-Gemini: Mining the Potential of Multi-modality... | | Emerging |
| 14 | EvolvingLMMs-Lab/Otter | 🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of... | | Emerging |
| 15 | connorkapoor/Palmetto | A simple web-based CAD workbench for discovering and creating DFM (Design... | | Emerging |
| 16 | OceanGPT/OceanGPT | [沧渊] [ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks | | Emerging |
| 17 | bagh2178/SG-Nav | [NeurIPS 2024] SG-Nav: Online 3D Scene Graph Prompting for LLM-based... | | Emerging |
| 18 | thuml/iVideoGPT | Official repository for "iVideoGPT: Interactive VideoGPTs are Scalable World... | | Emerging |
| 19 | LLaVA-VL/LLaVA-Plus-Codebase | LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills | | Emerging |
| 20 | JIA-Lab-research/LLMGA | This project is the official implementation of 'LLMGA: Multimodal Large... | | Emerging |
| 21 | FusionBrainLab/OmniFusion | OmniFusion, a multimodal model to communicate using text and images | | Emerging |
| 22 | YvanYin/DrivingWorld | Code for "DrivingWorld: Constructing World Model for Autonomous Driving via... | | Emerging |
| 23 | tincans-ai/gazelle | Joint speech-language model - respond directly to audio! | | Emerging |
| 24 | yuanze-lin/Olympus | [CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router... | | Emerging |
| 25 | PKU-YuanGroup/Chat-UniVi | [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers... | | Emerging |
| 26 | SALT-NLP/Sketch2Code | Code for the paper: Sketch2Code: Evaluating Vision-Language Models for... | | Emerging |
| 27 | dimitrismallis/CAD-Assistant | Code for our ICCV 2025 paper "CAD-Assistant: Tool-Augmented VLLMs as Generic... | | Emerging |
| 28 | MooreThreads/MooER | MooER: Moore-threads Open Omni model for speech-to-speech intERaction.... | | Emerging |
| 29 | Pointcept/GPT4Point | [CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language... | | Emerging |
| 30 | H-Freax/ThinkGrasp | [CoRL2024] ThinkGrasp: A Vision-Language System for Strategic Part Grasping... | | Emerging |
| 31 | greenland-dream/video-understanding | This repository provides core code for managing large volumes of video... | | Emerging |
| 32 | wgcyeo/WorldMM | [CVPR 2026] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning | | Emerging |
| 33 | mbzuai-oryx/LLaVA-pp | 🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3) | | Emerging |
| 34 | Open3DA/LL3DA | [CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D... | | Emerging |
| 35 | isjinghao/OralGPT | [NeurIPS'25 \| CVPR'26] The official repo of OralGPT & MMOral Bench. | | Emerging |
| 36 | om-ai-lab/ZoomEye | [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming... | | Emerging |
| 37 | worldbench/VideoLucy | [NeurIPS 2025] Deep Memory Backtracking for Long Video Understanding | | Emerging |
| 38 | FuxiaoLiu/MMC | [NAACL 2024] MMC: Advancing Multimodal Chart Understanding with LLM... | | Emerging |
| 39 | luxus180/LLaVA-OneVision-1.5 | 🛠️ Build and train multimodal models easily with LLaVA-OneVision 1.5, an... | | Emerging |
| 40 | Hiram31/CADialogue | Official implementation of "CADialogue: A Multimodal LLM-Powered... | | Emerging |
| 41 | WisconsinAIVision/YoChameleon | 🦎 Yo'Chameleon: Your Personalized Chameleon (CVPR 2025) | | Emerging |
| 42 | bigai-nlco/VideoTGB | [EMNLP 2024] A Video Chat Agent with Temporal Prior | | Emerging |
| 43 | nuldertien/PathBLIP-2 | This repository contains all code to support the paper: "On the Importance... | | Emerging |
| 44 | showlab/VLog | [CVPR 2025] Video Narration as Vocabulary & Video as Long Document | | Emerging |
| 45 | XduSyL/EventGPT | 🔥[CVPR2025] EventGPT: Event Stream Understanding with Multimodal Large... | | Emerging |
| 46 | yifanlu0227/ChatSim | [CVPR2024 Highlight] Editable Scene Simulation for Autonomous Driving via... | | Emerging |
| 47 | Piero24/VLM-Object-Detection | A pipeline for object detection and segmentation using a Vision-Language... | | Emerging |
| 48 | ShareGPT4Omni/ShareGPT4Video | [NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving... | | Emerging |
| 49 | Hyeongkeun/LAVCap | Official Pytorch Implementation of 'LAVCap: LLM-based Audio-Visual... | | Emerging |
| 50 | ZPider0/Multimodal | 🎤 Transform speech and text with this lightweight Python toolkit for... | | Experimental |
| 51 | OmniMMI/OmniMMI | [CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in... | | Experimental |
| 52 | DonaldTrump-coder/Informative-Scene-Reconstruction-App | A local software and cloud service system that integrates 3D functionalities... | | Experimental |
| 53 | anymodality/anymodality | AnyModality is an open-source library to simplify MultiModal LLM inference... | | Experimental |
| 54 | ShareGPT4Omni/ShareGPT4V | [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions | | Experimental |
| 55 | whwu95/FreeVA | FreeVA: Offline MLLM as Training-Free Video Assistant | | Experimental |
| 56 | timmylucy/GLM-ASR | 🔊 Enhance speech recognition with GLM-ASR-Nano-2512, a high-performance... | | Experimental |
| 57 | hamedR96/User-VLM | Personalized Vision Language Models for Social Human-Robot Interactions | | Experimental |
| 58 | SiyuWang0906/CAD-GPT | [AAAI2025] CAD-GPT: Synthesising CAD Construction Sequence with Spatial... | | Experimental |
| 59 | Toommo2/Text2CAD | 🚀 Convert natural language to real CAD artifacts with Text2CAD, an... | | Experimental |
| 60 | InternRobotics/VLM-Grounder | [CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | | Experimental |
| 61 | alexander-moore/vlm | Composition of Multimodal Language Models From Scratch | | Experimental |
| 62 | Pittawat2542/driving-assessment-distillation | This repository contains the code and data for the paper "Speed Up!... | | Experimental |
| 63 | Atomic-man007/blip-vision-language | BLIP is a novel Vision-Language Pre-training (VLP) framework designed to... | | Experimental |
| 64 | OpenShapeLab/ShapeGPT | ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model, a... | | Experimental |
| 65 | engindeniz/vitis | [ICCV 2023 CLVL Workshop] Zero-Shot and Few-Shot Video Question Answering... | | Experimental |
| 66 | Jeremyyny/Value-Spectrum | Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value... | | Experimental |
| 67 | PrateekJannu/Vision-GPT | Coding a Multi-Modal vision model like GPT-4o from scratch, inspired by... | | Experimental |
| 68 | oncescuandreea/audio_egovlp | This is the official codebase used for obtaining the results in the ICASSP... | | Experimental |
| 69 | sonkd/Visual-Question-Answering-on-VizWiz | Visual Question Answering on VizWiz, A Generative CLIP + LSTM Approach with... | | Experimental |
| 70 | david-s-martinez/Dex-GAN-Grasp | DexGANGrasp: Dexterous Generative Adversarial Grasping Synthesis for... | | Experimental |
| 71 | ShareGPT4Omni/ShareGPT4Omni | ShareGPT4Omni: Towards Building Omni Large Multi-modal Models with... | | Experimental |
| 72 | luisrui/Modality-Interference-in-MLLMs | The source code for the paper "Diagnosing and Mitigating Modality... | | Experimental |
| 73 | lemonmindyes/ThinkCLIP | Lightweight CLIP framework built with ViT + GPT encoders for vision-language... | | Experimental |
| 74 | RajGothi/Visual-Entities-Empowered-Zero-Shot-Image-to-Text-Generation-Transfer-Across-Domains | Visual Entities Empowered Zero-Shot Image-to-Text Generation Transfer Across Domains | | Experimental |