MaxLSB/mini-paligemma2
Minimalist implementation of PaliGemma 2 & PaliGemma VLM from scratch
This project provides a direct way to use Google's PaliGemma 2 and PaliGemma models for understanding images and text together. You feed it an image and a text prompt (like 'Caption' or 'Detect tiger'), and it outputs a description, an answer to a question about the image, or highlights detected objects. This tool is for researchers or practitioners who need to quickly integrate advanced multimodal AI capabilities for image analysis into their workflows.
No commits in the last 6 months.
Use this if you need to perform tasks like image captioning, visual question answering, or object detection by combining image and text inputs.
Not ideal if you need a conversational AI that remembers previous interactions or if you need to fine-tune a model without a pre-built pipeline.
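The tasks above are selected purely through the text prompt. As a hedged sketch, here is a tiny helper that builds PaliGemma-style task prefixes; the exact strings this repo expects are an assumption based on the upstream PaliGemma conventions ("caption en", "detect <object>", "answer en <question>"), not taken from this repository's code.

```python
# Hypothetical helper: build a PaliGemma-style task prompt.
# Prompt formats are assumed from the upstream PaliGemma conventions,
# not verified against MaxLSB/mini-paligemma2 itself.
def build_prompt(task: str, arg: str = "", lang: str = "en") -> str:
    """Return the text prompt for a given task.

    task: "caption" (image captioning), "detect" (object detection),
          or "vqa" (visual question answering).
    arg:  the object name for "detect", or the question for "vqa".
    """
    if task == "caption":
        return f"caption {lang}"
    if task == "detect":
        return f"detect {arg}"
    if task == "vqa":
        return f"answer {lang} {arg}"
    raise ValueError(f"unknown task: {task}")

print(build_prompt("detect", "tiger"))   # prints: detect tiger
print(build_prompt("caption"))           # prints: caption en
```

The returned string would be paired with an image and fed to the model; detection prompts make the model emit location tokens rather than free-form text.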
Stars: 13
Forks: —
Language: Python
License: MIT
Category: —
Last pushed: Feb 22, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/MaxLSB/mini-paligemma2"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.
Higher-rated alternatives
kyegomez/RT-X
Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
lyuchenyang/Macaw-LLM
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle