showlab/VisInContext
Official implementation of "Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning".
This tool helps researchers and AI practitioners extend the amount of text context their multi-modal models can process. It takes existing multi-modal models and datasets and represents additional in-context text as visual tokens, expanding the effective textual input capacity. The result is a model that can understand and respond to much longer text inputs, which is particularly useful when combining large language models with images.
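For intuition, here is a minimal sketch of the general text-as-image idea behind the paper, not the repository's actual code: long in-context text is rasterized into an image and encoded with an off-the-shelf vision encoder, so it reaches the model as visual tokens instead of consuming the text context window. The rendering helper and the CLIP checkpoint below are illustrative assumptions, not the authors' choices.

import torch
from PIL import Image, ImageDraw
from transformers import CLIPImageProcessor, CLIPVisionModel

def render_text_as_image(text, width=448, height=448):
    # Hypothetical helper: rasterize the text onto a white canvas.
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).multiline_text((4, 4), text, fill="black")
    return img

# Illustrative encoder choice; the repo may use a different backbone.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

image = render_text_as_image("...long in-context text...")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    visual_tokens = encoder(**inputs).last_hidden_state  # (1, patches + 1, dim)
# These visual tokens can then be interleaved with ordinary text embeddings.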
No commits in the last 6 months.
Use this if you are building or evaluating multi-modal AI models and frequently encounter limitations due to short text context windows.
Not ideal if your primary goal is to improve image generation quality rather than extending textual understanding within multi-modal models.
Stars: 28
Forks: 3
Language: Python
License: —
Category:
Last pushed: Oct 30, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/showlab/VisInContext"
Open to everyone: 100 requests/day with no API key. A free key raises the limit to 1,000 requests/day.
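A quick sketch of fetching the same data from Python, assuming only that the endpoint returns JSON (the response schema is not documented here, so no specific fields are accessed):

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/showlab/VisInContext"
resp = requests.get(url, timeout=10)  # no key needed up to 100 requests/day
resp.raise_for_status()
data = resp.json()  # field names depend on the API's schema
print(data)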
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming..."
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice