atosystem/SpeechCLIP
SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model. Accepted to IEEE SLT 2022.
SpeechCLIP aligns spoken audio with images and text in a shared representation space. Given speech, images, or text, it produces numerical representations (embeddings) that capture their meaning, so semantically related inputs across modalities end up close together. This is useful for researchers and developers building AI systems that need to understand and connect spoken language with visual content or written text.
119 stars. No commits in the last 6 months.
Use this if you are a researcher or AI developer working on understanding how spoken words relate to images and text, or creating systems that search across these different data types.
Not ideal if you are looking for a ready-to-use application for transcribing audio, captioning images, or doing direct speech recognition without integrating across modalities.
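To illustrate the core idea behind shared embeddings, here is a minimal sketch of cross-modal matching via cosine similarity. The vectors below are toy values, not SpeechCLIP's actual API or outputs; real embeddings are high-dimensional and come from the model's encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (hypothetical values for illustration only)
speech_emb = [0.9, 0.1, 0.2]   # e.g. an audio clip of "a dog barking"
image_emb  = [0.8, 0.2, 0.1]   # e.g. a photo of a dog
text_emb   = [0.1, 0.9, 0.3]   # e.g. the caption "a red car"

print(cosine_similarity(speech_emb, image_emb))  # high: related content
print(cosine_similarity(speech_emb, text_emb))   # lower: unrelated content
```

Retrieval across modalities then reduces to ranking candidates by this similarity score against the query embedding.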
Stars
119
Forks
8
Language
Python
License
BSD-3-Clause
Category
Voice AI
Last pushed
Nov 25, 2022
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/atosystem/SpeechCLIP"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
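The same endpoint can also be queried from Python using only the standard library. This is a sketch built from the URL shown above; the response schema is not documented here, so the fetch is left commented out rather than assuming particular fields.

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, repo: str) -> str:
    """Build the quality-API URL for a repository, e.g. ('voice-ai', 'atosystem/SpeechCLIP')."""
    return f"{API_BASE}/{category}/{repo}"

url = quality_url("voice-ai", "atosystem/SpeechCLIP")
print(url)

# Uncomment to fetch live data (no key needed, 100 requests/day):
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
#     print(data)
```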