Yxxxb/VoCo-LLaMA
[CVPR'2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
VoCo-LLaMA helps researchers and developers who work with large language models process visual data more efficiently. It compresses hundreds of vision tokens, such as those produced from a video's frame sequence, into a single compact "VoCo" token, letting existing large language models interpret visual information without being overwhelmed by its volume. It is aimed at those building advanced vision-language systems.
203 stars. No commits in the last 6 months.
Use this if you are a researcher or developer aiming to integrate large visual datasets into large language models while significantly reducing computational overhead.
Not ideal if you want a user-friendly tool for general-purpose image compression or video editing.
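To make the compression idea above concrete, here is a minimal NumPy sketch of the attention-masking scheme VoCo-style compression relies on: the sequence is laid out as [vision tokens | VoCo token | text tokens], and the causal mask is modified so text tokens cannot attend to raw vision tokens directly, forcing all visual information through the VoCo token. This is a conceptual illustration only, not code from the repository; the function name and token counts are hypothetical.

```python
import numpy as np

def voco_attention_mask(n_vision: int, n_voco: int, n_text: int) -> np.ndarray:
    """Sketch of a VoCo-style causal attention mask.

    Layout: [vision tokens | VoCo token(s) | text tokens].
    Vision tokens attend causally among themselves, VoCo tokens
    attend to all vision tokens, and text tokens attend to VoCo
    and earlier text tokens but NOT to raw vision tokens, so the
    VoCo token(s) become the only pathway for visual information.
    """
    n = n_vision + n_voco + n_text
    mask = np.tril(np.ones((n, n), dtype=bool))  # standard causal mask
    # Cut the direct path from text tokens back to vision tokens.
    text_start = n_vision + n_voco
    mask[text_start:, :n_vision] = False
    return mask

m = voco_attention_mask(n_vision=4, n_voco=1, n_text=3)
assert m[4, :4].all()       # the VoCo token still sees every vision token
assert m[5, 4]              # text tokens see the VoCo token...
assert not m[5, :4].any()   # ...but not the raw vision tokens
```

At inference time the vision tokens can then be dropped entirely, which is where the compute and memory savings come from.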
Stars
203
Forks
11
Language
Python
License
Apache-2.0
Category
Last pushed
Jun 18, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Yxxxb/VoCo-LLaMA"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
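For scripted use, the curl call above can be reproduced in Python. The helper below only reconstructs the documented endpoint URL; the assumption that the path generalizes as `/quality/<category>/<owner>/<repo>` is inferred from the single URL shown and is not confirmed API documentation.

```python
from urllib.parse import quote

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    # Hypothetical generalization of the documented endpoint:
    # /quality/<category>/<owner>/<repo>
    return f"{API_BASE}/{quote(category)}/{quote(owner)}/{quote(repo)}"

url = quality_url("transformers", "Yxxxb", "VoCo-LLaMA")
assert url == ("https://pt-edge.onrender.com/api/v1/quality/"
               "transformers/Yxxxb/VoCo-LLaMA")
```

Fetching the URL with any HTTP client (e.g. `urllib.request.urlopen(url)`) then returns the repo's quality data, subject to the 100 requests/day anonymous limit noted above.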
Higher-rated alternatives
TinyLLaVA/TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
zjunlp/EasyInstruct
[ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs.
rese1f/MovieChat
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
NVlabs/Eagle
Eagle: Frontier Vision-Language Models with Data-Centric Strategies