fangyuan-ksgk/Mini-LLaVA
A minimal implementation of a LLaVA-style VLM with interleaved image, video, and text processing.
This tool helps AI developers and researchers quickly adapt a large language model such as Llama 3.1 to understand images, videos, and text together. You provide a language model and mixed media inputs (images, video clips, text), and it produces a model capable of multimodal reasoning across all of those data types. It is aimed at developers building or experimenting with AI applications that must interpret complex visual and textual information jointly (a usage sketch follows the notes below).
No commits in the last 6 months.
Use this if you are an AI developer looking to give a language model the ability to 'see' and interpret images and videos alongside text, using a streamlined and easy-to-understand implementation.
Not ideal if you want an off-the-shelf application that analyzes your visual and text data directly, without integrating or customizing a language model.
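This page does not document Mini-LLaVA's own API, so as a rough illustration of the LLaVA-style interleaved image-and-text pattern the repo implements, here is a minimal sketch using a Hugging Face transformers LLaVA checkpoint. The model ID, prompt template, and image URL are stand-ins, not part of Mini-LLaVA:

# Minimal LLaVA-style inference sketch using Hugging Face transformers.
# This is NOT Mini-LLaVA's API; it only illustrates the interleaved
# image + text prompt pattern that LLaVA-style VLMs use.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed public checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Any RGB image works here; this COCO URL is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The <image> token marks where visual tokens are interleaved with text.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))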
Stars
98
Forks
9
Language
Python
License
MIT
Category
Last pushed
Dec 17, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/fangyuan-ksgk/Mini-LLaVA"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
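The same endpoint can also be called from Python; a short sketch with the requests library follows. The response schema is not documented on this page, so the JSON field names are unknown; inspect the returned data before relying on any of them.

import requests

# Fetch quality metrics for this repo from the pt-edge API (endpoint
# copied from the curl example above). No API key is needed at the
# free tier; the JSON schema is an assumption, so print it first.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/fangyuan-ksgk/Mini-LLaVA"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(resp.json())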
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice