bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
This project offers tools to build or use advanced AI models that can understand and generate text from various types of input, including audio and video. It helps with tasks like creating detailed captions for videos, answering questions about video content, or evaluating the quality of spoken audio. People who need to process and interpret complex multimedia data for tasks such as content analysis, media management, or accessibility will find this useful.
Use this if you need to develop or implement AI systems that can accurately process and respond to information presented in video, audio, and text formats.
Not ideal if you are looking for a simple, off-the-shelf application for basic text-only processing or image recognition.
Stars
1,392
Forks
112
Language
—
License
Apache-2.0
Category
Last pushed
Feb 03, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/bytedance/SALMONN"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
Related models
KimMeen/Time-LLM
[ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice
bytedance/video-SALMONN-2
video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates...