ziqipang/LM4VisualEncoding
[ICLR 2024 (Spotlight)] "Frozen Transformers in Language Models are Effective Visual Encoder Layers"
This project gives machine learning researchers and practitioners a novel way to strengthen visual models: it inserts frozen transformer blocks from pre-trained language models into existing visual encoders. Although these layers were trained only on text, they help the model pick out and attend to informative visual features, improving classification accuracy on images, point clouds, and actions.
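A minimal PyTorch sketch of the idea, not the repository's actual code: a frozen transformer block taken from a pre-trained LM is appended to a stack of visual encoder layers, with small trainable linear projections bridging the two feature widths. The names llm_block, vit_dim, and llm_dim are illustrative assumptions.

import torch
import torch.nn as nn

class FrozenLMVisualEncoder(nn.Module):
    # vit_blocks: trainable visual encoder layers (e.g. ViT blocks)
    # llm_block: a pre-trained LM transformer layer mapping
    #            (batch, tokens, llm_dim) -> (batch, tokens, llm_dim)
    def __init__(self, vit_blocks: nn.ModuleList, llm_block: nn.Module,
                 vit_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.vit_blocks = vit_blocks
        self.proj_in = nn.Linear(vit_dim, llm_dim)   # trainable bridge in
        self.llm_block = llm_block
        self.proj_out = nn.Linear(llm_dim, vit_dim)  # trainable bridge out
        for p in self.llm_block.parameters():        # keep the LM layer frozen
            p.requires_grad = False

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        for blk in self.vit_blocks:                  # ordinary visual encoding
            tokens = blk(tokens)
        hidden = self.proj_in(tokens)                # project to LM width
        hidden = self.llm_block(hidden)              # frozen LM layer refines tokens
        return self.proj_out(hidden)                 # project back for the task head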
246 stars. No commits in the last 6 months.
Use this if you are a machine learning researcher or engineer working on visual recognition tasks and want to explore innovative methods to boost model accuracy by leveraging language model capabilities.
Not ideal if you are looking for a plug-and-play solution for non-visual tasks or if you are not comfortable with modifying existing deep learning architectures.
Stars: 246
Forks: 8
Language: Python
License: MIT
Category: transformers
Last pushed: Jan 17, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ziqipang/LM4VisualEncoding"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
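For scripted access, a hedged Python sketch of the same call using the requests library; the shape of the returned JSON is an assumption, so the example simply prints whatever comes back.

import requests

# Same endpoint as the curl command above; no API key needed at the free tier.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/ziqipang/LM4VisualEncoding"
resp = requests.get(url, timeout=10)
resp.raise_for_status()   # fail loudly on HTTP errors or rate limiting
print(resp.json())        # inspect the returned quality fields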
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of " 🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice