NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
OmniVinci helps you understand and reason about video, audio, and text together, much as humans do. You give it mixed media (videos, audio clips, or text) and it generates descriptions and answers grounded in all of the inputs jointly. It is aimed at researchers, AI developers, and anyone building applications that must interpret complex, real-world multimedia; a minimal usage sketch follows the guidance below.
Use this if you need an AI model that can jointly analyze and respond to queries involving intertwined visual, auditory, and textual information.
Not ideal if your task only involves processing a single type of media (e.g., just text or just images) or if you require an extremely lightweight model for edge devices.
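The sketch below shows how a model like this is typically queried. It is not OmniVinci's documented API: the model ID ("nvidia/omnivinci"), the processor's video/audio keyword arguments, and the generation call are assumptions modeled on the common Hugging Face trust_remote_code pattern for custom multimodal models; check the repository README for the real entry points.

from transformers import AutoModel, AutoProcessor

# Hypothetical model ID; confirm against the NVlabs/OmniVinci README.
MODEL_ID = "nvidia/omnivinci"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, device_map="auto")

# Pack all three modalities into one input batch. The `videos` and `audio`
# keyword names are assumptions; processor signatures vary between repos.
inputs = processor(
    text="Describe this clip using both what you see and what you hear.",
    videos="example_clip.mp4",
    audio="example_clip.wav",
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])

The point the sketch illustrates is that a single processor call packs video, audio, and text into one batch, so the model reasons over all modalities jointly rather than answering per modality and fusing the results afterwards.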
Stars: 639
Forks: 51
Language: Python
License: Apache-2.0
Last pushed: Feb 26, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NVlabs/OmniVinci"
Open to everyone: 100 requests/day with no API key. A free key raises the limit to 1,000 requests/day.
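The same endpoint can be called programmatically. Here is a minimal Python sketch that assumes only the URL from the curl example above; the JSON field names are not documented on this page, so the code simply prints whatever the service returns.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/NVlabs/OmniVinci"

resp = requests.get(URL, timeout=10)  # anonymous access: 100 requests/day
resp.raise_for_status()

# Inspect the keys before depending on them; the schema is undocumented here.
for key, value in resp.json().items():
    print(f"{key}: {value}")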
Related models
KimMeen/Time-LLM
[ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
fixie-ai/ultravox
A fast multimodal LLM for real-time voice
bytedance/video-SALMONN-2
video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates...