The Multimodal Directory
Quality-scored directory of 39 multimodal AI tools, updated daily. Every tool is scored on maintenance, adoption, maturity, and community signals.
Vision-language models, cross-modal retrieval, and multimodal learning tools — combining text, image, audio, and video understanding in unified systems.
| Score range | Tools |
|---|---|
| 70–100 | 1 |
| 50–69 | 9 |
| 30–49 | 16 |
| 10–29 | 13 |
Top tools by quality score
| # | Tool | Description | Score |
|---|---|---|---|
| 1 | starVLA/starVLA | StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing | |
| 2 | vortex-data/vortex | An extensible, state-of-the-art framework for columnar compression, and the... | |
| 3 | motis-project/motis | Multimodal routing, geocoding, and map tiles | |
| 4 | zai-org/GLM-V | GLM-4.6V/4.5V/4.1V-Thinking: Towards Versatile Multimodal Reasoning with... | |
| 5 | neka-nat/cad3dify | 2D to 3D CAD Conversion Using VLM | |
| 6 | batmanlab/Mammo-CLIP | [MICCAI 2024, top 11%] Official PyTorch implementation of Mammo-CLIP: A... | |
| 7 | opendatalab/mineru-vl-utils | A Python package for interacting with the MinerU Vision-Language Model. | |
| 8 | EMob-Lab/MnMS | Agent-based Multimodal Urban Mobility Simulator resulting from the ERC MAGnUM project | |
| 9 | GerrySant/multimodalhugs | MultimodalHugs is an extension of Hugging Face that offers a generalized... | |
| 10 | withceleste/celeste-python | Open source, type-safe primitives for multi-modal AI. All modalities, all... | |
| 11 | cloudglue/cloudglue-js | Official JavaScript / TypeScript SDK for Cloudglue API | |
| 12 | EvolvingLMMs-Lab/LongVT | [CVPR 2026] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling | |
| 13 | om-ai-lab/GroundVLP | GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language... | |
| 14 | Jinfeng-Xu/Awesome-Multimodal-Recommender-Systems | [TMM'26] Continuously Updated Awesome Multimodal Recommendation Paper List | |
| 15 | anam-org/metaxy | Pluggable sample-level metadata versioning for incremental multimodal pipelines. | |
| 16 | eduardosanzb/escribano | AI-powered session intelligence tool - transcribes Cap recordings with Whisper | |
| 17 | yunncheng/MMRL | [CVPR 2025 & IJCV2026] Official PyTorch Code for "MMRL: Multi-Modal... | |
| 18 | Mellow-Artificial-Intelligence/open-xtract | Extract structured data from documents, images, audio, and video using LLMs. | |
| 19 | ComfyUI-Kelin/ComfyUI-LLMs-Toolkit | ComfyUI custom nodes for DeepSeek, Qwen, GPT, and other OpenAI-compatible... | |
| 20 | MING-ZCH/CII-Bench | [ACL 2025] Can MLLMs Understand the Deep Implication Behind Chinese Images? | |