kaylode/vqa-transformer
Visual Question Answering using Transformer and Bottom-Up attention, implemented in PyTorch.
This project answers natural language questions about images: given an image and a text question as input, it outputs a concise, one-word answer. It is designed for AI researchers and machine learning engineers developing or evaluating image-understanding capabilities.
No commits in the last 6 months.
Use this if you are an AI researcher or machine learning engineer exploring how Transformer architectures and bottom-up attention perform on Visual Question Answering tasks.
Not ideal if you need a production-ready, highly accurate VQA system for a broad range of real-world applications, or one that produces multi-word answers.
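This page does not document the repo's own entry points, so the sketch below only illustrates the general VQA pattern (image and question in, one-word answer out) using Hugging Face's ViLT VQA model from the transformers library; the checkpoint name and API shown are not from kaylode/vqa-transformer.

from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Load a pretrained VQA model (ViLT, not this repo's model) from the Hugging Face hub.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("example.jpg")  # any RGB image on disk (hypothetical path)
question = "What color is the car?"

# Encode the image-question pair and pick the highest-scoring answer class.
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)  # VQA answers are typically a single word, e.g. "red"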
Stars: 10
Forks: 1
Language: Python
License: MIT
Category: Transformers
Last pushed: Oct 11, 2021
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/kaylode/vqa-transformer"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
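The same endpoint can be queried from Python. The response schema is not documented on this page, so this sketch simply prints whatever JSON the API returns:

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/kaylode/vqa-transformer"
resp = requests.get(url, timeout=10)  # keyless access allows up to 100 requests/day
resp.raise_for_status()
print(resp.json())  # response fields are not documented here, so just inspect the payload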
Higher-rated alternatives
kyegomez/RT-X
PyTorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
lyuchenyang/Macaw-LLM
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle