MOSS-TTSD and MOSS-Speech
MOSS-TTSD covers the generation side (turning dialogue scripts into expressive speech), while MOSS-Speech covers the conversational side (taking spoken input and replying with spoken output directly), making them complementary components of an end-to-end voice conversation pipeline.
About MOSS-TTSD
OpenMOSS/MOSS-TTSD
MOSS-TTSD is a spoken dialogue generation model designed for expressive multi-speaker synthesis. It features long-context modeling, flexible speaker control, and multilingual support, while enabling zero-shot voice cloning from short audio references.
This project helps content creators transform dialogue scripts into dynamic, expressive spoken conversations with multiple distinct speakers. You provide a script and a short audio reference for each speaker, and it generates up to 60 minutes of natural-sounding, long-form spoken dialogue. It's ideal for producers of podcasts, audiobooks, commentary, and dubbed content.
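To make the script-plus-references workflow concrete, here is a minimal sketch of preparing a multi-speaker dialogue request. The structures and function names (SpeakerRef, build_dialogue_request) are illustrative assumptions, not the actual MOSS-TTSD API; see the OpenMOSS/MOSS-TTSD repository for the real inference entry points.

```python
# Hypothetical preparation sketch for MOSS-TTSD-style dialogue synthesis.
# All names here are placeholders; only the overall shape (script lines paired
# with per-speaker reference clips for zero-shot cloning) comes from the text above.

from dataclasses import dataclass


@dataclass
class SpeakerRef:
    """A speaker identity defined by a short reference audio clip."""
    name: str
    reference_wav: str  # path to a few seconds of this speaker's voice


def build_dialogue_request(script_path: str, speakers: list[SpeakerRef]) -> dict:
    """Turn a plain-text script of 'Speaker: line' rows into a request payload.

    Each turn is attached to its speaker's reference clip so the model can
    clone that voice zero-shot.
    """
    by_name = {s.name: s for s in speakers}
    turns = []
    with open(script_path, encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw or ":" not in raw:
                continue  # skip blank lines and stage directions
            name, text = raw.split(":", 1)
            speaker = by_name[name.strip()]
            turns.append({
                "speaker": speaker.name,
                "reference_wav": speaker.reference_wav,
                "text": text.strip(),
            })
    return {"turns": turns}


if __name__ == "__main__":
    speakers = [
        SpeakerRef("Host", "refs/host_5s.wav"),
        SpeakerRef("Guest", "refs/guest_5s.wav"),
    ]
    request = build_dialogue_request("podcast_script.txt", speakers)
    # A real pipeline would now pass `request` to the MOSS-TTSD model and write
    # the resulting long-form dialogue audio (up to ~60 minutes) to disk.
    print(f"Prepared {len(request['turns'])} dialogue turns for synthesis.")
```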
About MOSS-Speech
OpenMOSS/MOSS-Speech
MOSS-Speech is a true speech-to-speech large language model that operates without text guidance.
This project helps create direct, natural voice-to-voice interactions for spoken applications. You provide spoken input, and it responds directly with spoken output, without ever converting to text in between. It's designed for anyone building interactive voice assistants, dialogue systems, or real-time spoken translation tools.
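The sketch below shows the shape of a single voice-assistant turn built on this idea: audio in, audio out, with no text transcript produced in between. The function respond_in_speech is a placeholder assumption standing in for whatever inference entry point MOSS-Speech actually exposes.

```python
# Hypothetical interaction-loop sketch for a speech-to-speech assistant.
# `respond_in_speech` is a stand-in for the real MOSS-Speech inference call;
# the point illustrated is the loop shape, not the model's actual API.

from pathlib import Path


def respond_in_speech(input_wav: bytes) -> bytes:
    """Placeholder for a MOSS-Speech call: spoken input in, spoken reply out."""
    raise NotImplementedError("wire this to the real MOSS-Speech inference code")


def conversation_turn(user_audio_path: str, reply_audio_path: str) -> None:
    """One turn of a voice assistant: read the user's audio, save the spoken reply.

    No intermediate text is produced or stored anywhere in this loop.
    """
    user_audio = Path(user_audio_path).read_bytes()
    reply_audio = respond_in_speech(user_audio)
    Path(reply_audio_path).write_bytes(reply_audio)


if __name__ == "__main__":
    conversation_turn("mic_capture.wav", "assistant_reply.wav")
```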