ExplainableML/ZerAuCap
[NeurIPS 2023 - ML for Audio Workshop (Oral)] Zero-shot audio captioning with audio-language model guidance and audio context keywords
This project helps audio content creators and analysts automatically generate descriptive text captions for sound events, like ambient noise or human actions, without needing to manually label extensive datasets. It takes raw audio files as input and outputs concise, descriptive text captions, making it ideal for anyone who needs to quickly understand or catalog large collections of audio recordings.
No commits in the last 6 months.
Use this if you need to automatically generate clear, descriptive text summaries for various non-speech audio clips, significantly reducing manual effort in audio annotation or content understanding.
Not ideal if your primary goal is transcribing spoken language, as this tool is specifically designed for environmental sounds and actions, not speech-to-text conversion.
Stars: 18
Forks: 1
Language: Python
License: —
Category: Voice AI
Last pushed: Nov 30, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/ExplainableML/ZerAuCap"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
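The same lookup can be scripted. A minimal Python sketch using only the standard library; the helper names are illustrative, and the assumption that the endpoint returns JSON is mine, not documented above:

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category: str, owner: str, repo: str) -> str:
    """Build the quality-API URL for a given repository."""
    return f"{API_BASE}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch quality data for a repo (assumes a JSON response body)."""
    with urllib.request.urlopen(build_url(category, owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Same request as the curl example, within the keyless 100/day tier.
    print(build_url("voice-ai", "ExplainableML", "ZerAuCap"))
```

An API key, once obtained, would presumably be passed as a header or query parameter; the page above does not specify which, so that part is left out.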
Higher-rated alternatives
canopyai/Orpheus-TTS
Towards Human-Sounding Speech
lifeiteng/vall-e
PyTorch implementation of VALL-E(Zero-Shot Text-To-Speech), Reproduced Demo...
Plachtaa/VALL-E-X
An open source implementation of Microsoft's VALL-E X zero-shot TTS model. Demo is available in...
umbertocappellazzo/Omni-AVSR
Official PyTorch implementation of "Omni-AVSR: Towards Unified Multimodal Speech Recognition...
primepake/learnable-speech
Text-to-speech with a learnable audio encoder, without alignment to a transcript reference