stepfun-ai/Step-Audio-EditX

A powerful 3B-parameter, LLM-based audio editing model trained with reinforcement learning. It excels at editing emotion, speaking style, and paralinguistics, and features robust zero-shot text-to-speech.

54 / 100 (Established)

This tool helps content creators, voice actors, and marketers refine spoken audio. You input text and define desired emotions, speaking styles, or paralinguistic elements. The output is natural-sounding synthetic speech that precisely conveys the intended tone, ideal for generating expressive voiceovers or dialogue. It also supports zero-shot text-to-speech for various languages.

884 stars. Actively maintained with 1 commit in the last 30 days.

Use this if you need fine-grained control over the emotional tone, speaking style, and specific human sounds (like laughter or sighs) in your synthetic speech or voiceovers.

Not ideal if you're looking for simple, unedited text-to-speech without needing to adjust nuanced emotional or stylistic elements.

audio-production voiceover-creation digital-content-creation marketing-audio e-learning-narration
No Package · No Dependents
Maintenance 16 / 25
Adoption 10 / 25
Maturity 13 / 25
Community 15 / 25
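
The listing does not state how the overall score is derived, but the four category scores above are consistent with a simple sum. The sketch below only checks that arithmetic; the variable names are illustrative, and the summing rule is an assumption rather than the site's documented formula.

```python
# Category scores copied from the listing above, each out of 25.
subscores = {
    "Maintenance": 16,
    "Adoption": 10,
    "Maturity": 13,
    "Community": 15,
}
MAX_PER_CATEGORY = 25

# Assumption: the overall score is the plain sum of the category scores.
total = sum(subscores.values())
max_total = MAX_PER_CATEGORY * len(subscores)

print(f"{total} / {max_total}")  # 54 / 100
```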

Stars: 884
Forks: 61
Language: Python
License: Apache-2.0
Last pushed: Mar 16, 2026
Commits (30d): 1

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/stepfun-ai/Step-Audio-EditX"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
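
The same request can be made from Python using only the standard library. This is a minimal sketch: the helper names `quality_url` and `fetch_quality` are hypothetical, the category/owner/repo path layout is inferred from the single URL shown above, and the response is assumed to be JSON because the listing does not document its format.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example in the listing.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (path layout inferred from the curl example)."""
    parts = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Send a GET request and decode the body, assuming it is JSON."""
    request = urllib.request.Request(
        quality_url(category, owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_quality("voice-ai", "stepfun-ai", "Step-Audio-EditX")` issues the request shown in the curl command. With the keyless limit of 100 requests/day, caching the result locally is a sensible default for scripts that run repeatedly.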