xinyanghuang7/Basic-Visual-Language-Model
Build a simple, basic multimodal large model from scratch 🤖
This project lets AI researchers and machine learning engineers build a custom multimodal large language model from scratch. You supply image datasets (such as COCO or AI Challenger) and the corresponding textual annotations, and the project produces a trained model that can understand images and generate responses about them. It is aimed at practitioners who want to build vision-language capabilities for specific applications.
No commits in the last 6 months.
Use this if you need to train your own vision-language model with specialized datasets to achieve domain-specific visual comprehension and dialogue capabilities.
Not ideal if you're looking for an off-the-shelf tool to simply use a multimodal model without any training or model architecture modifications.
Stars: 47
Forks: 9
Language: Python
License: —
Category: —
Last pushed: Jun 19, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/xinyanghuang7/Basic-Visual-Language-Model"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
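For scripted access, the same endpoint can be called from Python. Below is a minimal sketch using the requests library; it assumes the endpoint returns a JSON body (the response schema is not documented here) and uses the no-key free tier shown above.

import requests

# Same endpoint as the curl example above; no API key is needed
# at the free tier (100 requests/day).
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/xinyanghuang7/Basic-Visual-Language-Model"

resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors (e.g., rate limiting)

data = resp.json()  # assumes a JSON response; schema not documented here
print(data)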
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of " 🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice