ictnlp/LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
This project offers a unified large multimodal model that efficiently processes and understands both images and videos. It takes visual inputs (still images or video clips) and produces detailed descriptions or answers to questions about their content. Researchers and developers who use large language models to analyze visual data will find it useful when inference speed and computational cost matter.
562 stars. No commits in the last 6 months.
Use this if you need a highly efficient model to integrate visual understanding into your large language model applications, particularly for scenarios requiring fast processing of images and long videos with minimal computational resources.
Not ideal if you are looking for a simple, out-of-the-box application for everyday visual analysis and do not want to integrate a model into a larger language-model system or manage its computational resources.
Stars: 562
Forks: 30
Language: Python
License: Apache-2.0
Category:
Last pushed: Jun 29, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ictnlp/LLaVA-Mini"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
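For programmatic access, the same endpoint can be queried from Python. Below is a minimal sketch using the requests library; it assumes the endpoint returns JSON (the exact response fields are not documented on this page), so it simply pretty-prints whatever comes back.

import json
import requests

# Endpoint from the curl example above; no API key is needed for up to 100 requests/day.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/ictnlp/LLaVA-Mini"

resp = requests.get(URL, timeout=30)
resp.raise_for_status()  # fail loudly on HTTP errors

# Assumption: the body is JSON; field names are not listed on this page,
# so the sketch just pretty-prints the parsed payload.
print(json.dumps(resp.json(), indent=2))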
Higher-rated alternatives
TinyLLaVA/TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
zjunlp/EasyInstruct
[ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs.
rese1f/MovieChat
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
NVlabs/Eagle
Eagle: Frontier Vision-Language Models with Data-Centric Strategies