showlab/VisInContext
Official implementation of "Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning".
This tool helps researchers and AI practitioners extend the amount of text context their multi-modal models can process. It takes existing multi-modal models and datasets and represents additional in-context text as visual tokens, expanding the effective textual input capacity. The result is a model that can understand and respond to much longer text inputs, which is particularly useful when combining large language models with images.
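For intuition, here is a minimal sketch of the general text-as-image idea behind the paper, not the repository's actual code: long in-context text is rasterized into an image and encoded with an off-the-shelf vision encoder, so it reaches the model as visual tokens instead of consuming the text context window. The rendering helper and the CLIP checkpoint below are illustrative assumptions, not the authors' choices.

import torch
from PIL import Image, ImageDraw
from transformers import CLIPImageProcessor, CLIPVisionModel

def render_text_as_image(text, width=448, height=448):
    # Hypothetical helper: rasterize the text onto a white canvas.
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).multiline_text((4, 4), text, fill="black")
    return img

# Illustrative encoder choice; the repo may use a different backbone.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

image = render_text_as_image("...long in-context text...")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    visual_tokens = encoder(**inputs).last_hidden_state  # (1, patches + 1, dim)
# These visual tokens can then be interleaved with ordinary text embeddings.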
No commits in the last 6 months.
Use this if you are building or evaluating multi-modal AI models and frequently encounter limitations due to short text context windows.
Not ideal if your primary goal is to improve image generation quality rather than extending textual understanding within multi-modal models.
Stars: 28
Forks: 3
Language: Python
License: —
Category:
Last pushed: Oct 30, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/showlab/VisInContext"
Open to everyone: 100 requests/day with no API key. A free key raises the limit to 1,000 requests/day.
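A quick sketch of fetching the same data from Python, assuming only that the endpoint returns JSON (the response schema is not documented here, so no specific fields are accessed):

import requests

url = "https://pt-edge.onrender.com/api/v1/quality/transformers/showlab/VisInContext"
resp = requests.get(url, timeout=10)  # no key needed up to 100 requests/day
resp.raise_for_status()
data = resp.json()  # field names depend on the API's schema
print(data)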
Higher-rated alternatives
KimMeen/Time-LLM
[ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming..."
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice