mlpc-ucsd/BLIVA
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
Need to extract information or answer questions from images containing a lot of text, like charts, documents, or social media posters? BLIVA processes an image and a text question to give you accurate answers, even when the image is packed with words. This is ideal for anyone working with visual data that includes complex textual elements, such as researchers analyzing charts or marketers reviewing ad creatives.
260 stars. No commits in the last 6 months.
Use this if you need to reliably get answers to specific questions by 'reading' both the visual and textual content within an image.
Not ideal if your primary need is general image description without any text-based querying, or if you only process images with minimal to no text.
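For a sense of what a typical query looks like in code, here is a minimal Python sketch of asking BLIVA a question about a text-rich image. It assumes the repo follows the LAVIS-style load_model_and_preprocess API common to InstructBLIP-derived projects; the model name "bliva_vicuna", the module path, and the generate() call are assumptions, so consult the repo README for the exact entry points.

# A minimal sketch of asking BLIVA a question about a text-rich image.
# ASSUMPTION: the repo exposes a LAVIS-style loader; names below are
# illustrative, not confirmed against the repo.
import torch
from PIL import Image
from bliva.models import load_model_and_preprocess  # assumed module path

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and its matching image preprocessor (assumed names).
model, vis_processors, _ = load_model_and_preprocess(
    name="bliva_vicuna", model_type="vicuna7b", is_eval=True, device=device
)

raw_image = Image.open("chart.png").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Ask a question that requires reading text embedded in the image.
answer = model.generate({"image": image, "prompt": "What is the y-axis label?"})
print(answer)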
Stars: 260
Forks: 25
Language: Python
License: BSD-3-Clause
Category: transformers
Last pushed: Apr 14, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/mlpc-ucsd/BLIVA"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
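If you prefer Python over curl, the same endpoint can be queried with the requests library. The response is assumed to be JSON; its field names are not documented on this page, so the sketch simply prints the payload for inspection.

import requests

# Fetch the quality data shown on this page. The endpoint comes from the
# curl example above; the JSON schema is undocumented here, so we just
# print the payload rather than assume field names.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/mlpc-ucsd/BLIVA"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(resp.json())  # inspect the output to see the actual field names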
Higher-rated alternatives
TinyLLaVA/TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
zjunlp/EasyInstruct
[ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs.
rese1f/MovieChat
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
NVlabs/Eagle
Eagle: Frontier Vision-Language Models with Data-Centric Strategies