kyegomez/VLM-Mamba
We introduce VLM-Mamba, the first Vision-Language Model built entirely on State Space Models (SSMs), specifically leveraging the Mamba architecture.
This project gives AI developers a way to build Vision-Language Models (VLMs) that understand both images and text. The model takes image and text tokens as input and produces integrated vision-language outputs, supporting tasks such as image captioning and visual question answering. It is aimed at AI researchers and machine-learning engineers building more efficient multi-modal systems.
Use this if you are an AI developer building vision-language models and want a significantly smaller memory footprint and faster inference than traditional Transformer-based models.
Not a good fit if you want a ready-to-use, pre-trained VLM for end-user applications and do not want to engage in model development.
Stars: 14
Forks: 1
Language: Python
License: MIT
Last pushed: Jan 05, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/kyegomez/VLM-Mamba"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
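The `curl` command above targets one repository. Assuming the endpoint follows the path pattern `/api/v1/quality/<category>/<owner>/<repo>` (inferred from the single example URL shown, not from documented API behavior), the URL for any repo can be built programmatically; here is a minimal sketch:

```python
from urllib.parse import quote

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_api_url(category: str, owner: str, repo: str) -> str:
    """Build the quality-API URL for a repository.

    The path pattern is an assumption inferred from the one example
    URL above; the service's actual routing is not documented here.
    """
    return f"{API_BASE}/{quote(category)}/{quote(owner)}/{quote(repo)}"

url = quality_api_url("transformers", "kyegomez", "VLM-Mamba")
print(url)
# → https://pt-edge.onrender.com/api/v1/quality/transformers/kyegomez/VLM-Mamba
```

The JSON payload could then be fetched with `urllib.request.urlopen(url)`, subject to the 100-requests/day anonymous limit noted above.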
Higher-rated alternatives
NVlabs/MambaVision
[CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone
sign-language-translator/sign-language-translator
Python library & framework to build custom translators for the hearing-impaired and translate...
kyegomez/Jamba
PyTorch Implementation of Jamba: "Jamba: A Hybrid Transformer-Mamba Language Model"
autonomousvision/transfuser
[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving;...
kyegomez/MultiModalMamba
A novel implementation of fusing ViT with Mamba into a fast, agile, and high performance...