LLaVA and llama-multimodal-vqa

LLaVA is a foundational vision-language instruction-tuning framework; llama-multimodal-vqa builds on it by adapting its techniques to the Llama 3 architecture and VQA tasks.

                 LLaVA                                    llama-multimodal-vqa
Overall score    47 (Emerging)                            41 (Emerging)
Maintenance      0/25                                     0/25
Adoption         10/25                                    8/25
Maturity         16/25                                    16/25
Community        21/25                                    17/25
Stars            24,554                                   51
Forks            2,745                                    11
Downloads        –                                        –
Commits (30d)    0                                        0
Language         Python                                   Python
License          Apache-2.0                               MIT
Flags            Stale 6m, No Package, No Dependents      Stale 6m, No Package, No Dependents

About LLaVA

haotian-liu/LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

LLaVA helps you understand and interact with images using natural language. You provide an image and ask questions or give instructions about its content, and the model generates descriptive text or answers grounded in the image. This is ideal for anyone who needs to extract insights from visuals, such as researchers analyzing images, content creators generating descriptions, or operations teams monitoring visual data.

image-analysis visual-intelligence content-description multimodal-interaction visual-question-answering
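
As a concrete illustration of the ask-a-question-about-an-image workflow described above, here is a minimal inference sketch using the community Hugging Face port of LLaVA-1.5 (llava-hf/llava-1.5-7b-hf via transformers' LlavaForConditionalGeneration) rather than the repo's own CLI. The image URL and question are placeholders.

```python
# Minimal VQA sketch with the Hugging Face port of LLaVA-1.5.
# Assumes transformers >= 4.36; the image URL and question are illustrative.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image and pose a natural-language question about it.
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)
prompt = "USER: <image>\nWhat is happening in this picture? ASSISTANT:"  # LLaVA-1.5 chat format

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```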

About llama-multimodal-vqa

AdrianBZG/llama-multimodal-vqa

Multimodal Instruction Tuning for Llama 3

This project helps AI developers adapt the Llama 3 language model to understand and respond to questions that require both text and image input. You provide a dataset containing image-text pairs and corresponding question-answer conversations. The output is a fine-tuned Llama 3 model capable of visual question answering. This is for AI engineers or researchers building custom multimodal AI applications.

ai-model-training multimodal-ai visual-question-answering large-language-models custom-ai-development
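
The repo's internals aren't shown here, so the following is only a sketch of the LLaVA-style recipe it adapts for Llama 3: a frozen CLIP vision tower, a small trainable projector, and a Llama 3 backbone whose input embeddings are prefixed with projected image tokens. The class and function names (VisionProjector, build_inputs) are hypothetical, not this repo's API, and the Llama 3 checkpoint is gated on Hugging Face.

```python
# Sketch of LLaVA-style multimodal instruction tuning for Llama 3.
# Illustrative only: VisionProjector and build_inputs are hypothetical names,
# not llama-multimodal-vqa's actual API.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class VisionProjector(nn.Module):
    """Maps frozen CLIP patch embeddings into Llama 3's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.proj(patches)

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
vision.requires_grad_(False)  # vision tower stays frozen during tuning
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
projector = VisionProjector(vision.config.hidden_size, llm.config.hidden_size)

def build_inputs(pixel_values: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
    """Prefix projected image tokens to the text embeddings (LLaVA-style)."""
    patches = vision(pixel_values).last_hidden_state      # [B, N_patches, vision_dim]
    image_tokens = projector(patches)                     # [B, N_patches, llm_dim]
    text_embeds = llm.get_input_embeddings()(text_ids)    # [B, T, llm_dim]
    return torch.cat([image_tokens, text_embeds], dim=1)

# Training would feed the combined embeddings to the LLM and compute the usual
# causal-LM loss, with labels masked (-100) over image and question positions.
```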

Scores updated daily from GitHub, PyPI, and npm data.