MaxLSB/mini-paligemma2
Minimalist implementation of PaliGemma 2 & PaliGemma VLM from scratch
This project provides a direct way to use Google's PaliGemma 2 and PaliGemma models for understanding images and text together. You feed it an image and a text prompt (like 'Caption' or 'Detect tiger'), and it outputs a description, an answer to a question about the image, or highlights detected objects. This tool is for researchers or practitioners who need to quickly integrate advanced multimodal AI capabilities for image analysis into their workflows.
No commits in the last 6 months.
Use this if you need to perform tasks like image captioning, visual question answering, or object detection by combining image and text inputs.
Not ideal if you need a conversational AI that remembers previous interactions or if you need to fine-tune a model without a pre-built pipeline.
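The tasks above are selected purely through the text prompt. As a hedged sketch, here is a tiny helper that builds PaliGemma-style task prefixes; the exact strings this repo expects are an assumption based on the upstream PaliGemma conventions ("caption en", "detect <object>", "answer en <question>"), not taken from this repository's code.

```python
# Hypothetical helper: build a PaliGemma-style task prompt.
# Prompt formats are assumed from the upstream PaliGemma conventions,
# not verified against MaxLSB/mini-paligemma2 itself.
def build_prompt(task: str, arg: str = "", lang: str = "en") -> str:
    """Return the text prompt for a given task.

    task: "caption" (image captioning), "detect" (object detection),
          or "vqa" (visual question answering).
    arg:  the object name for "detect", or the question for "vqa".
    """
    if task == "caption":
        return f"caption {lang}"
    if task == "detect":
        return f"detect {arg}"
    if task == "vqa":
        return f"answer {lang} {arg}"
    raise ValueError(f"unknown task: {task}")

print(build_prompt("detect", "tiger"))   # prints: detect tiger
print(build_prompt("caption"))           # prints: caption en
```

The returned string would be paired with an image and fed to the model; detection prompts make the model emit location tokens rather than free-form text.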
Stars: 13
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Feb 22, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/MaxLSB/mini-paligemma2"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
Higher-rated alternatives
kyegomez/RT-X
Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
lyuchenyang/Macaw-LLM
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle