logic-OT/BobVLM
BobVLM – A 1.5B multimodal model built from scratch and pre-trained on a single P100 GPU, capable of image description and moderate question answering. 🤗🎉
BobVLM helps you understand what's in an image and answer questions about it. You provide an image (as a file, a URL, or directly from your program) along with a question or request, and it returns a detailed text description or an answer. This tool is for developers who need to integrate image understanding into their applications.
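To make the input shape concrete, here is a minimal sketch of how a caller might bundle an image reference and a question before handing them to the model. This is a hypothetical helper for illustration only; the function name, payload keys, and structure are assumptions, not the repo's actual API.

```python
# Hypothetical helper: package an image reference (file path or URL) with a
# question, the two inputs BobVLM's description above says it accepts.
# The payload shape here is an assumption for illustration, not BobVLM's API.
def build_vlm_request(image_ref: str, question: str) -> dict:
    # Distinguish remote URLs from local file paths by scheme prefix.
    source = "url" if image_ref.startswith(("http://", "https://")) else "file"
    return {
        "image": {"type": source, "ref": image_ref},
        "prompt": question,
    }

request = build_vlm_request("https://example.com/cat.jpg", "What animal is this?")
```

In practice you would pass the loaded image and prompt through the repo's own processor and model classes; consult the repository's README for the exact calls.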
No commits in the last 6 months.
Use this if you are a developer looking for an open-source, resource-efficient vision-language model to add image description and basic question answering to your applications.
Not ideal if you need highly detailed answers to complex questions or reliable analysis of close-up images, animations, or images outside of general scene descriptions.
Stars
11
Forks
3
Language
Python
License
MIT
Category
Last pushed
Feb 17, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/logic-OT/BobVLM"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
kyegomez/RT-X
Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
lyuchenyang/Macaw-LLM
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle