ExplainableML/ZerAuCap
[NeurIPS 2023 - ML for Audio Workshop (Oral)] Zero-shot audio captioning with audio-language model guidance and audio context keywords
This project helps audio content creators and analysts automatically generate descriptive text captions for sound events, like ambient noise or human actions, without needing to manually label extensive datasets. It takes raw audio files as input and outputs concise, descriptive text captions, making it ideal for anyone who needs to quickly understand or catalog large collections of audio recordings.
No commits in the last 6 months.
Use this if you need to automatically generate clear, descriptive text summaries for various non-speech audio clips, significantly reducing manual effort in audio annotation or content understanding.
Not ideal if your primary goal is transcribing spoken language, as this tool is specifically designed for environmental sounds and actions, not speech-to-text conversion.
Stars: 18
Forks: 1
Language: Python
License: —
Category: Voice AI
Last pushed: Nov 30, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/ExplainableML/ZerAuCap"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
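The same lookup can be scripted. A minimal Python sketch using only the standard library; the helper names are illustrative, and the assumption that the endpoint returns JSON is mine, not documented above:

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category: str, owner: str, repo: str) -> str:
    """Build the quality-API URL for a given repository."""
    return f"{API_BASE}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch quality data for a repo (assumes a JSON response body)."""
    with urllib.request.urlopen(build_url(category, owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Same request as the curl example, within the keyless 100/day tier.
    print(build_url("voice-ai", "ExplainableML", "ZerAuCap"))
```

An API key, once obtained, would presumably be passed as a header or query parameter; the page above does not specify which, so that part is left out.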
Higher-rated alternatives
canopyai/Orpheus-TTS
Towards Human-Sounding Speech
lifeiteng/vall-e
PyTorch implementation of VALL-E(Zero-Shot Text-To-Speech), Reproduced Demo...
Plachtaa/VALL-E-X
An open source implementation of Microsoft's VALL-E X zero-shot TTS model. Demo is available in...
umbertocappellazzo/Omni-AVSR
Official PyTorch implementation of "Omni-AVSR: Towards Unified Multimodal Speech Recognition...
primepake/learnable-speech
Text-to-speech with a learnable audio encoder, without alignment to a transcript reference