atosystem/SpeechCLIP

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model, Accepted to IEEE SLT 2022

36 / 100 (Emerging)

This project integrates spoken audio with images and text by aligning them in a shared embedding space. You provide spoken audio, images, or text, and it generates numerical representations (embeddings) that capture their meaning, so related inputs from different modalities land close together. This is useful for researchers and developers building AI systems that need to connect spoken language with visual content or written text.
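To make the embedding idea concrete, here is a minimal sketch of the retrieval pattern such a model enables: rank candidate images by cosine similarity to one speech embedding. The similarity-ranking function is self-contained; `load_speechclip`, `encode_speech`, and `encode_images` in the trailing comments are hypothetical stand-ins, not this repo's actual entry points — check the README for the real checkpoint-loading and encoding calls.

```python
import torch
import torch.nn.functional as F

def rank_images_by_speech(speech_emb: torch.Tensor,
                          image_embs: torch.Tensor) -> torch.Tensor:
    """Return image indices sorted by cosine similarity to a speech embedding.

    speech_emb: (D,) embedding of one spoken utterance.
    image_embs: (N, D) embeddings of N candidate images.
    """
    # Normalize so a plain dot product equals cosine similarity,
    # which is how CLIP-style models compare modalities.
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    scores = image_embs @ speech_emb  # (N,) similarity score per image
    return torch.argsort(scores, descending=True)

# Hypothetical usage once the model has produced embeddings
# (all three calls below are assumed names, not the repo's API):
# model = load_speechclip("checkpoint.ckpt")
# speech_emb = model.encode_speech(waveform)
# image_embs = model.encode_images(image_batch)
# best_first = rank_images_by_speech(speech_emb, image_embs)
```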

119 stars. No commits in the last 6 months.

Use this if you are a researcher or AI developer working on understanding how spoken words relate to images and text, or creating systems that search across these different data types.

Not ideal if you are looking for a ready-to-use application for captioning images, or for plain speech-to-text transcription that does not need cross-modal integration.

multimodal-AI speech-processing computer-vision natural-language-processing AI-research
Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 10 / 25

Stars: 119
Forks: 8
Language: Python
License: BSD-3-Clause
Last pushed: Nov 25, 2022
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/atosystem/SpeechCLIP"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
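For scripted use, the same endpoint can be fetched in Python with only the standard library. This is a sketch under one stated assumption: the field names `score` and `stars` are guesses at the response schema, which the API does not document here — inspect `data` to see what it actually returns.

```python
import json
import urllib.request

# Same endpoint as the curl example above.
url = "https://pt-edge.onrender.com/api/v1/quality/voice-ai/atosystem/SpeechCLIP"

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)  # parse the JSON response body

# Assumed field names; adjust after inspecting the real response.
print(data.get("score"), data.get("stars"))
```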