mlpc-ucsd/BLIVA
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
Need to extract information or answer questions from images containing a lot of text, like charts, documents, or social media posters? BLIVA processes an image and a text question to give you accurate answers, even when the image is packed with words. This is ideal for anyone working with visual data that includes complex textual elements, such as researchers analyzing charts or marketers reviewing ad creatives.
260 stars. No commits in the last 6 months.
Use this if you need to reliably get answers to specific questions by 'reading' both the visual and textual content within an image.
Not ideal if your primary need is general image description without any text-based querying, or if you only process images with minimal to no text.
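For a sense of what a typical query looks like in code, here is a minimal Python sketch of asking BLIVA a question about a text-rich image. It assumes the repo follows the LAVIS-style load_model_and_preprocess API common to InstructBLIP-derived projects; the model name "bliva_vicuna", the module path, and the generate() call are assumptions, so consult the repo README for the exact entry points.

# A minimal sketch of asking BLIVA a question about a text-rich image.
# ASSUMPTION: the repo exposes a LAVIS-style loader; names below are
# illustrative, not confirmed against the repo.
import torch
from PIL import Image
from bliva.models import load_model_and_preprocess  # assumed module path

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and its matching image preprocessor (assumed names).
model, vis_processors, _ = load_model_and_preprocess(
    name="bliva_vicuna", model_type="vicuna7b", is_eval=True, device=device
)

raw_image = Image.open("chart.png").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Ask a question that requires reading text embedded in the image.
answer = model.generate({"image": image, "prompt": "What is the y-axis label?"})
print(answer)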
Stars: 260
Forks: 25
Language: Python
License: BSD-3-Clause
Category: transformers
Last pushed: Apr 14, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/mlpc-ucsd/BLIVA"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
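If you prefer Python over curl, the same endpoint can be queried with the requests library. The response is assumed to be JSON; its field names are not documented on this page, so the sketch simply prints the payload for inspection.

import requests

# Fetch the quality data shown on this page. The endpoint comes from the
# curl example above; the JSON schema is undocumented here, so we just
# print the payload rather than assume field names.
url = "https://pt-edge.onrender.com/api/v1/quality/transformers/mlpc-ucsd/BLIVA"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(resp.json())  # inspect the output to see the actual field names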
Higher-rated alternatives
TinyLLaVA/TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
zjunlp/EasyInstruct
[ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs.
rese1f/MovieChat
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
NVlabs/Eagle
Eagle: Frontier Vision-Language Models with Data-Centric Strategies