bytedance/SALMONN
SALMONN family: A suite of advanced multi-modal LLMs
This project offers tools to build or use advanced AI models that can understand and generate text from various types of input, including audio and video. It helps with tasks like creating detailed captions for videos, answering questions about video content, or evaluating the quality of spoken audio. People who need to process and interpret complex multimedia data for tasks such as content analysis, media management, or accessibility will find this useful.
Use this if you need to develop or implement AI systems that can accurately process and respond to information presented in video, audio, and text formats.
Not ideal if you are looking for a simple, off-the-shelf application for basic text-only processing or image recognition.
Stars
1,392
Forks
112
Language
—
License
Apache-2.0
Category
Last pushed
Feb 03, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/bytedance/SALMONN"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
Related models
KimMeen/Time-LLM
[ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming...
om-ai-lab/VLM-R1
Solve Visual Understanding with Reinforced VLMs
NVlabs/OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
fixie-ai/ultravox
A fast multimodal LLM for real-time voice
bytedance/video-SALMONN-2
video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates...