kyegomez/MultiModalMamba
A novel implementation fusing ViT with Mamba into a fast, agile, and high-performance multi-modal model. Powered by Zeta, the simplest AI framework ever.
This project offers an advanced AI model that can understand and process both text and images simultaneously, like a person who can read and see at the same time. You can feed it a combination of written information and visual data, and it will generate an integrated interpretation. It's designed for data scientists and AI researchers who need to build systems that make sense of different types of information together.
Use this if you are building an AI application that needs to understand context from both text and images, such as for content analysis or advanced search engines.
Not ideal if your task only involves a single data type (like just text or just images) or if you need a simpler, less customizable model.
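To make the "ViT fused with Mamba" idea above concrete, here is a minimal, self-contained PyTorch sketch of the fusion pattern: image patches go through a ViT-style patch embedding, text tokens are embedded, and the two sequences are concatenated and processed by one shared backbone. All class and parameter names are illustrative, not the repository's actual API, and a standard TransformerEncoder stands in for the repository's Mamba (selective state-space) block; see the project's README for the real interface.

import torch
from torch import nn

class MultiModalFusionSketch(nn.Module):
    """Toy ViT + sequence-backbone fusion. Illustrative only, not the repo's API."""

    def __init__(self, vocab_size=10000, dim=256, image_size=224, patch_size=16,
                 depth=4, heads=8):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # ViT-style patch embedding: split the image into patches, project each to `dim`.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.patch_pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Text token embedding.
        self.token_embed = nn.Embedding(vocab_size, dim)
        # Shared backbone over the concatenated [image patches | text tokens] sequence.
        # NOTE: a TransformerEncoder stands in here for the Mamba (selective SSM) block.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_logits = nn.Linear(dim, vocab_size)

    def forward(self, text_ids, image):
        # (B, 3, H, W) -> (B, num_patches, dim)
        img_tokens = self.patch_embed(image).flatten(2).transpose(1, 2) + self.patch_pos
        txt_tokens = self.token_embed(text_ids)
        fused = torch.cat([img_tokens, txt_tokens], dim=1)  # one multimodal sequence
        fused = self.backbone(fused)
        # Predict over the text positions only (the last text_ids.shape[1] tokens).
        return self.to_logits(fused[:, -text_ids.shape[1]:, :])

# Toy forward pass: one image plus a short token sequence.
model = MultiModalFusionSketch()
text_ids = torch.randint(0, 10000, (1, 32))
image = torch.randn(1, 3, 224, 224)
logits = model(text_ids, image)
print(logits.shape)  # torch.Size([1, 32, 10000])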
Stars: 465
Forks: 25
Language: Python
License: MIT
Category:
Last pushed: Feb 13, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/kyegomez/MultiModalMamba"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
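If you would rather query the endpoint from code than from curl, the sketch below makes the same anonymous request (100 requests/day, no key) with Python's requests library. The response schema is not documented on this page, so it simply pretty-prints whatever JSON comes back.

import json
import requests

# Same endpoint as the curl example above; anonymous access allows 100 requests/day.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/kyegomez/MultiModalMamba"

resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))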
Higher-rated alternatives
NVlabs/MambaVision
[CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone
sign-language-translator/sign-language-translator
Python library & framework to build custom translators for the hearing-impaired and translate...
kyegomez/Jamba
PyTorch Implementation of Jamba: "Jamba: A Hybrid Transformer-Mamba Language Model"
autonomousvision/transfuser
[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving;...
dali92002/DocEnTR
DocEnTr: An end-to-end document image enhancement transformer - ICPR 2022