kyegomez/PALI3
Implementation of PALI3 from the paper "PaLI-3 Vision Language Models: Smaller, Faster, Stronger".
This project lets researchers and developers experiment with and build on a vision-language model that takes an image and a text prompt as input and generates relevant text output, combining visual and textual information. It is aimed at AI researchers and practitioners building applications that require joint understanding of images and language.
Available on PyPI.
Use this if you are a researcher or developer who needs to experiment with or integrate a state-of-the-art vision-language model for tasks like image captioning or visual question answering.
Not ideal if you are a non-technical user looking for a ready-to-use application, since integrating the model requires programming knowledge.
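Since the package is published on PyPI, a typical workflow looks roughly like the sketch below. This is a minimal sketch under assumptions, not the repo's documented API: the module path `pali3.main`, the class name `Pali3`, the call signature, and the tensor shapes are all hypothetical; check the repository README for the actual entry points.

```python
# Hypothetical usage sketch: module path, class name, call signature, and
# tensor shapes are assumptions, not the repo's documented interface.
import torch
from pali3.main import Pali3  # assumed entry point; see the repo README

model = Pali3()

# Dummy inputs: one RGB image and a batch of tokenized prompt IDs.
image = torch.randn(1, 3, 256, 256)               # (batch, channels, height, width)
prompt_tokens = torch.randint(0, 256, (1, 1024))  # (batch, sequence_length)

# Assumed forward call: image + text prompt in, text logits/tokens out.
output = model(image, prompt_tokens)
print(output.shape)
```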
Stars: 146
Forks: 4
Language: Python
License: MIT
Category: transformers
Last pushed: Jan 17, 2026
Commits (30d): 0
Dependencies: 5
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/kyegomez/PALI3"
Open to everyone: 100 requests/day with no API key; a free key raises the limit to 1,000 requests/day.
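The same endpoint can be called from Python instead of curl. The sketch below assumes the endpoint returns a JSON body; the response schema is not documented here, so the code just prints the raw payload.

```python
# Fetch this repo's quality data from the API. The JSON schema of the
# response is an assumption; we only print the raw payload.
import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/kyegomez/PALI3"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # surface HTTP errors (e.g., 429 when rate-limited)

print(resp.json())  # assumed: JSON mirroring the stats shown above
```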
Related models
kyegomez/RT-X
PyTorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
lyuchenyang/Macaw-LLM
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle
kyegomez/PALM-E
Implementation of "PaLM-E: An Embodied Multimodal Language Model"