SALMONN and video-SALMONN-2
SALMONN and video-SALMONN-2 are ecosystem siblings: SALMONN is a foundational multi-modal LLM framework, and video-SALMONN-2 is a specialized extension that applies the same architecture to audio-visual video understanding tasks.
About SALMONN
bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
This project offers tools to build or use multi-modal AI models that understand audio, video, and text inputs and generate text in response. It supports tasks such as creating detailed video captions, answering questions about video content, and evaluating the quality of spoken audio. It is useful for anyone who needs to process and interpret complex multimedia data for content analysis, media management, or accessibility.
About video-SALMONN-2
bytedance/video-SALMONN-2
video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions. It was developed by the Department of Electronic Engineering at Tsinghua University and ByteDance.
This project helps content creators, marketers, and educators by automatically generating high-quality captions for videos, taking into account both what is seen and heard. You provide video files, and it outputs detailed, accurate captions that enhance accessibility and understanding. It's designed for anyone needing to quickly and efficiently caption video content.