bytedance/video-SALMONN-2
video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions. It was developed by the Department of Electronic Engineering at Tsinghua University together with ByteDance.
This project helps content creators, marketers, and educators by automatically generating high-quality captions for videos, taking into account both what is seen and heard. You provide video files, and it outputs detailed, accurate captions that enhance accessibility and understanding. It's designed for anyone needing to quickly and efficiently caption video content.
Use this if you need to generate descriptive captions for video content, leveraging both visual and audio cues for better accuracy and detail.
Not ideal if you primarily need to transcribe spoken dialogue and don't need detailed descriptions of on-screen actions or sounds.
Stars: 167
Forks: 19
Language: Python
License: Apache-2.0
Category:
Last pushed: Feb 23, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/bytedance/video-SALMONN-2"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
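The same endpoint can be called from Python instead of curl. This is a minimal sketch using only the standard library; the JSON response shape and the "X-API-Key" header name are assumptions, not documented behavior of the service.

```python
# Minimal sketch of calling the quality API from Python.
# Assumptions: the endpoint returns JSON, and an API key (if used)
# is passed via an "X-API-Key" header -- check the service docs.
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(ecosystem: str, repo: str) -> str:
    """Build the endpoint URL, e.g. for transformers/bytedance/video-SALMONN-2."""
    return f"{BASE}/{ecosystem}/{repo}"


def fetch_quality(ecosystem: str, repo: str, api_key: str = "") -> dict:
    """Fetch quality data; pass api_key for the higher 1,000/day limit."""
    req = urllib.request.Request(quality_url(ecosystem, repo))
    if api_key:
        req.add_header("X-API-Key", api_key)  # assumed header name
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `fetch_quality("transformers", "bytedance/video-SALMONN-2")` requests the same URL as the curl command above.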
Related models
- KimMeen/Time-LLM: [ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming...
- om-ai-lab/VLM-R1: Solve Visual Understanding with Reinforced VLMs
- bytedance/SALMONN: SALMONN family: A suite of advanced multi-modal LLMs
- NVlabs/OmniVinci: OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
- fixie-ai/ultravox: A fast multimodal LLM for real-time voice