logic-OT/BobVLM
BobVLM – A 1.5B multimodal model built from scratch and pre-trained on a single P100 GPU, capable of image description and moderate question answering. 🤗🎉
BobVLM helps you understand what's in an image and answer questions about it. You provide an image (as a file, a URL, or directly from your program) along with a question or request, and it returns a detailed text description or an answer. This tool is for developers who need to integrate image understanding into their applications.
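To make the input shape concrete, here is a minimal sketch of how a caller might bundle an image reference and a question before handing them to the model. This is a hypothetical helper for illustration only; the function name, payload keys, and structure are assumptions, not the repo's actual API.

```python
# Hypothetical helper: package an image reference (file path or URL) with a
# question, the two inputs BobVLM's description above says it accepts.
# The payload shape here is an assumption for illustration, not BobVLM's API.
def build_vlm_request(image_ref: str, question: str) -> dict:
    # Distinguish remote URLs from local file paths by scheme prefix.
    source = "url" if image_ref.startswith(("http://", "https://")) else "file"
    return {
        "image": {"type": source, "ref": image_ref},
        "prompt": question,
    }

request = build_vlm_request("https://example.com/cat.jpg", "What animal is this?")
```

In practice you would pass the loaded image and prompt through the repo's own processor and model classes; consult the repository's README for the exact calls.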
No commits in the last 6 months.
Use this if you are a developer looking for an open-source, resource-efficient vision-language model to add image description and basic question answering to your applications.
Not ideal if you need highly detailed answers to complex questions or reliable analysis of close-up images, animations, or images outside of general scene descriptions.
Stars
11
Forks
3
Language
Python
License
MIT
Category
Last pushed
Feb 17, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/logic-OT/BobVLM"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
kyegomez/RT-X
Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
lyuchenyang/Macaw-LLM
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle