claws-lab/projection-in-MLLMs
Code and data for the ACL 2024 paper 'Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space'.
This project helps AI researchers and practitioners understand how Multimodal Large Language Models (MLLMs) process visual information. It provides code and datasets to analyze how visual attributes are translated into textual space within these models, using datasets from domains like agriculture, dermatology, and humanitarian response. Users input images and their associated labels to fine-tune and evaluate MLLMs.
No commits in the last 6 months.
Use this if you are a researcher or advanced practitioner working with MLLMs and want to rigorously evaluate and understand how these models integrate visual data with language.
Not ideal if you are looking for a ready-to-use application or a general-purpose MLLM for immediate deployment, as this is a research-focused toolkit.
Stars: 19
Forks: 1
Language: Python
License: —
Category: —
Last pushed: Jul 21, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/claws-lab/projection-in-MLLMs"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
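If you would rather query the endpoint from code than from curl, here is a minimal sketch in Python. It assumes the third-party `requests` package; the response schema and the mechanism for passing an API key are not documented here, so the example simply pretty-prints whatever JSON the endpoint returns.

import json
import requests

# Same endpoint as the curl example above.
URL = ("https://pt-edge.onrender.com/api/v1/quality/"
       "transformers/claws-lab/projection-in-MLLMs")

def fetch_repo_record(url: str = URL) -> dict:
    # Plain GET with a timeout; raise_for_status() surfaces rate-limit
    # or server errors (4xx/5xx) as exceptions instead of silent failures.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # The response schema is not documented here, so just pretty-print it.
    print(json.dumps(fetch_repo_record(), indent=2))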
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice