fangyuan-ksgk/Mini-LLaVA
A minimal implementation of a LLaVA-style VLM with interleaved image, video, and text processing.
This tool helps AI developers and researchers quickly adapt a large language model such as Llama 3.1 to understand images, videos, and text together. You provide a language model and mixed media inputs (images, video clips, text), and it produces a model capable of multimodal reasoning across all of those data types. It is aimed at developers building or experimenting with AI applications that must interpret complex visual and textual information jointly (a usage sketch follows the notes below).
No commits in the last 6 months.
Use this if you are an AI developer looking to give a language model the ability to 'see' and interpret images and videos alongside text, using a streamlined and easy-to-understand implementation.
Not ideal if you want an off-the-shelf application that analyzes your visual and text data directly, without integrating or customizing a language model.
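This page does not document Mini-LLaVA's own API, so as a rough illustration of the LLaVA-style interleaved image-and-text pattern the repo implements, here is a minimal sketch using a Hugging Face transformers LLaVA checkpoint. The model ID, prompt template, and image URL are stand-ins, not part of Mini-LLaVA:

# Minimal LLaVA-style inference sketch using Hugging Face transformers.
# This is NOT Mini-LLaVA's API; it only illustrates the interleaved
# image + text prompt pattern that LLaVA-style VLMs use.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed public checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Any RGB image works here; this COCO URL is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The <image> token marks where visual tokens are interleaved with text.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))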
Stars
98
Forks
9
Language
Python
License
MIT
Category
Last pushed
Dec 17, 2024
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/fangyuan-ksgk/Mini-LLaVA"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
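The same endpoint can also be called from Python; a short sketch with the requests library follows. The response schema is not documented on this page, so the JSON field names are unknown; inspect the returned data before relying on any of them.

import requests

# Fetch quality metrics for this repo from the pt-edge API (endpoint
# copied from the curl example above). No API key is needed at the
# free tier; the JSON schema is an assumption, so print it first.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/fangyuan-ksgk/Mini-LLaVA"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(resp.json())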
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice