fangyuan-ksgk/Mini-LLaVA

A minimal implementation of a LLaVA-style VLM with interleaved image, text, and video processing.

Score: 37/100 (Emerging)

This tool helps AI developers and researchers quickly adapt a large language model such as Llama 3.1 to process images, videos, and text together. You provide a language model and media inputs (images, video clips, text), and it produces a model capable of multimodal reasoning across all of them. It's designed for people building or experimenting with AI applications that need to interpret visual and textual information jointly.
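The repository's public API is not documented on this page, but the core idea all LLaVA-style models share is easy to sketch: a small projector maps frozen vision-encoder features into the language model's embedding space, and the resulting "image tokens" are interleaved with ordinary text embeddings before the LLM runs. The PyTorch sketch below is illustrative only; VisionProjector and every dimension here are hypothetical, not Mini-LLaVA's actual code.

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    # Maps vision-encoder patch features into the LLM embedding space.
    # Hypothetical module for illustration, not Mini-LLaVA's API.
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        return self.proj(patch_feats)

# Toy dimensions; a real setup might pair a CLIP-style encoder (~1024-d)
# with a Llama-scale LLM (~4096-d).
vision_dim, llm_dim = 1024, 4096
projector = VisionProjector(vision_dim, llm_dim)

# Random tensors stand in for a frozen vision encoder's output and the
# LLM's token-embedding layer.
image_feats = torch.randn(1, 576, vision_dim)   # one image, 576 patches
text_embeds = torch.randn(1, 32, llm_dim)       # 32 text-token embeddings

# Interleave: [text prefix] [image tokens] [text suffix] -> LLM input.
image_tokens = projector(image_feats)
llm_inputs = torch.cat(
    [text_embeds[:, :16], image_tokens, text_embeds[:, 16:]], dim=1
)
print(llm_inputs.shape)  # torch.Size([1, 608, 4096])

Video follows the same pattern: each sampled frame is encoded and projected, and its tokens are spliced into the sequence at the frame's position in the conversation.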

No commits in the last 6 months.

Use this if you are an AI developer looking to give a language model the ability to 'see' and interpret images and videos alongside text, using a streamlined and easy-to-understand implementation.

Not ideal if you are looking for an off-the-shelf application to directly analyze your visual and text data without needing to integrate or customize a language model.

Tags: AI model development, multimodal AI, large language models, computer vision, video understanding
Stale (6m) · No Package · No Dependents
Maintenance: 0/25
Adoption: 9/25
Maturity: 16/25
Community: 12/25

Stars: 98
Forks: 9
Language: Python
License: MIT
Last pushed: Dec 17, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/fangyuan-ksgk/Mini-LLaVA"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
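The same endpoint can also be called programmatically; below is a minimal sketch using Python's requests library. It assumes the endpoint returns a JSON body, and since the response schema is not documented on this page, the payload is simply printed for inspection.

import requests

URL = (
    "https://pt-edge.onrender.com/api/v1/quality/"
    "transformers/fangyuan-ksgk/Mini-LLaVA"
)

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # surfaces 4xx/5xx errors, e.g. daily rate limits
data = resp.json()       # assumes a JSON body; schema not documented here
print(data)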