xinyanghuang7/Basic-Visual-Language-Model
Build a simple, basic multimodal large model from scratch 🤖
This project lets AI researchers and machine learning engineers build a custom multimodal large language model from scratch. You supply image datasets (such as COCO or AI Challenger) and the corresponding textual annotations, and the project produces a trained model that can understand images and generate responses about them. It is aimed at practitioners who want to build vision-language capabilities for specific applications.
No commits in the last 6 months.
Use this if you need to train your own vision-language model with specialized datasets to achieve domain-specific visual comprehension and dialogue capabilities.
Not ideal if you're looking for an off-the-shelf tool to simply use a multimodal model without any training or model architecture modifications.
Stars: 47
Forks: 9
Language: Python
License: —
Category: —
Last pushed: Jun 19, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/xinyanghuang7/Basic-Visual-Language-Model"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
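For scripted access, the same endpoint can be called from Python. Below is a minimal sketch using the requests library; it assumes the endpoint returns a JSON body (the response schema is not documented here) and uses the no-key free tier shown above.

import requests

# Same endpoint as the curl example above; no API key is needed
# at the free tier (100 requests/day).
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/xinyanghuang7/Basic-Visual-Language-Model"

resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors (e.g., rate limiting)

data = resp.json()  # assumes a JSON response; schema not documented here
print(data)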
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of " 🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice