iflytek/VLE
VLE: Vision-Language Encoder (a vision-language multimodal pre-trained model)
VLE is a vision-language pre-trained model for anyone who needs to understand the relationship between images and text. Given an image and a question about it, VLE answers the question; given an image and a piece of text, it scores how well they match. This makes it suited to tasks like visual question answering and retrieving images from text descriptions.
194 stars. No commits in the last 6 months.
Use this if you need to analyze images and text together to answer questions, understand visual context, or match visual content with descriptions.
Not ideal if your task only involves text or only images, without needing to understand their combined meaning.
Stars: 194
Forks: 15
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 13, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/iflytek/VLE"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of " 🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice