yuecao0119/MMFuser
The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". MMFuser addresses the limitations of current MLLMs in capturing complex image details by integrating multi-layer features from Vision Transformers (ViTs) in a simple yet efficient way.
This project helps researchers and developers working with Large Multimodal Models (LMMs) improve how their models understand complex image details alongside text. It fuses the visual features an LMM's vision encoder already produces across multiple layers into richer, more semantically aligned image representations. The primary users are AI/ML researchers and engineers developing or fine-tuning LMMs for tasks that require precise image interpretation.
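To make the multi-layer fusion idea concrete, here is a toy sketch of the general pattern: features from the final ViT layer act as queries that attend over features gathered from several shallower layers, and the result is added back residually. This is an illustrative NumPy sketch, not the paper's actual architecture; the function names, single-head attention, and residual combination are assumptions for demonstration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_multilayer_features(layer_feats, query_feats):
    """Toy multi-layer fusion: final-layer features (queries) attend over
    features concatenated from several shallower ViT layers (keys/values)."""
    # layer_feats: list of (num_patches, dim) arrays from shallow/middle layers
    # query_feats: (num_patches, dim) array from the final layer
    kv = np.concatenate(layer_feats, axis=0)          # (L * num_patches, dim)
    d = query_feats.shape[-1]
    attn = softmax(query_feats @ kv.T / np.sqrt(d))   # (num_patches, L * num_patches)
    fused = attn @ kv                                 # (num_patches, dim)
    return query_feats + fused                        # residual combination

rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 32)) for _ in range(3)]  # 3 shallow layers
final = rng.standard_normal((16, 32))                       # final-layer features
out = fuse_multilayer_features(layers, final)
print(out.shape)  # (16, 32)
```

The fused output keeps the final layer's shape, so it can drop into an existing LMM pipeline in place of the single-layer features.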
No commits in the last 6 months.
Use this if you need to enhance the ability of your multimodal AI models to interpret fine-grained visual details while maintaining strong textual alignment.
Not ideal if you are looking for an off-the-shelf application or a solution that does not involve modifying an existing LMM's architecture.
Stars: 64
Forks: 5
Language: Python
License: Apache-2.0
Category:
Last pushed: Nov 05, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/yuecao0119/MMFuser"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
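If you prefer calling the endpoint from code rather than curl, a minimal helper can build the same URL for any repository. This is a hypothetical helper, not part of the API's official client; the response schema is not documented here, so the sketch only constructs the request URL.

```python
def quality_api_url(owner: str, repo: str,
                    base: str = "https://pt-edge.onrender.com/api/v1/quality/transformers") -> str:
    """Build the quality-API URL shown above for a given GitHub owner/repo."""
    return f"{base}/{owner}/{repo}"

print(quality_api_url("yuecao0119", "MMFuser"))
# https://pt-edge.onrender.com/api/v1/quality/transformers/yuecao0119/MMFuser
```

Fetch the URL with any HTTP client (e.g. `urllib.request` or `requests`); without a key you are limited to 100 requests/day.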
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of " 🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice