ziqipang/LM4VisualEncoding
[ICLR 2024 (Spotlight)] "Frozen Transformers in Language Models are Effective Visual Encoder Layers"
This project gives machine learning researchers and practitioners a novel way to strengthen visual models: it inserts frozen transformer blocks from pre-trained language models into existing visual encoders. Although these layers were trained only on text, they help the model pick out and attend to informative visual features, improving classification accuracy on images, point clouds, and actions.
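A minimal PyTorch sketch of the idea, not the repository's actual code: a frozen transformer block taken from a pre-trained LM is appended to a stack of visual encoder layers, with small trainable linear projections bridging the two feature widths. The names llm_block, vit_dim, and llm_dim are illustrative assumptions.

import torch
import torch.nn as nn

class FrozenLMVisualEncoder(nn.Module):
    # vit_blocks: trainable visual encoder layers (e.g. ViT blocks)
    # llm_block: a pre-trained LM transformer layer mapping
    #            (batch, tokens, llm_dim) -> (batch, tokens, llm_dim)
    def __init__(self, vit_blocks: nn.ModuleList, llm_block: nn.Module,
                 vit_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.vit_blocks = vit_blocks
        self.proj_in = nn.Linear(vit_dim, llm_dim)   # trainable bridge in
        self.llm_block = llm_block
        self.proj_out = nn.Linear(llm_dim, vit_dim)  # trainable bridge out
        for p in self.llm_block.parameters():        # keep the LM layer frozen
            p.requires_grad = False

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        for blk in self.vit_blocks:                  # ordinary visual encoding
            tokens = blk(tokens)
        hidden = self.proj_in(tokens)                # project to LM width
        hidden = self.llm_block(hidden)              # frozen LM layer refines tokens
        return self.proj_out(hidden)                 # project back for the task head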
246 stars. No commits in the last 6 months.
Use this if you are a machine learning researcher or engineer working on visual recognition tasks and want to explore innovative methods to boost model accuracy by leveraging language model capabilities.
Not ideal if you are looking for a plug-and-play solution for non-visual tasks or if you are not comfortable with modifying existing deep learning architectures.
Stars: 246
Forks: 8
Language: Python
License: MIT
Category: transformers
Last pushed: Jan 17, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ziqipang/LM4VisualEncoding"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
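For scripted access, a hedged Python sketch of the same call using the requests library; the shape of the returned JSON is an assumption, so the example simply prints whatever comes back.

import requests

# Same endpoint as the curl command above; no API key needed at the free tier.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/ziqipang/LM4VisualEncoding"
resp = requests.get(url, timeout=10)
resp.raise_for_status()   # fail loudly on HTTP errors or rate limiting
print(resp.json())        # inspect the returned quality fields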
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of " 🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice