atosystem/SpeechCLIP

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model, Accepted to IEEE SLT 2022

36 / 100 (Emerging)

This project integrates spoken audio with images and text by aligning them in a shared embedding space. You provide spoken audio, images, or text, and it generates numerical representations (embeddings) that capture their meaning, so related inputs from different modalities land close together. This is useful for researchers and developers building AI systems that need to connect spoken language with visual content or written text.
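To make the embedding idea concrete, here is a minimal sketch of the retrieval pattern such a model enables: rank candidate images by cosine similarity to one speech embedding. The similarity-ranking function is self-contained; `load_speechclip`, `encode_speech`, and `encode_images` in the trailing comments are hypothetical stand-ins, not this repo's actual entry points — check the README for the real checkpoint-loading and encoding calls.

```python
import torch
import torch.nn.functional as F

def rank_images_by_speech(speech_emb: torch.Tensor,
                          image_embs: torch.Tensor) -> torch.Tensor:
    """Return image indices sorted by cosine similarity to a speech embedding.

    speech_emb: (D,) embedding of one spoken utterance.
    image_embs: (N, D) embeddings of N candidate images.
    """
    # Normalize so a plain dot product equals cosine similarity,
    # which is how CLIP-style models compare modalities.
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    scores = image_embs @ speech_emb  # (N,) similarity score per image
    return torch.argsort(scores, descending=True)

# Hypothetical usage once the model has produced embeddings
# (all three calls below are assumed names, not the repo's API):
# model = load_speechclip("checkpoint.ckpt")
# speech_emb = model.encode_speech(waveform)
# image_embs = model.encode_images(image_batch)
# best_first = rank_images_by_speech(speech_emb, image_embs)
```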

119 stars. No commits in the last 6 months.

Use this if you are a researcher or AI developer working on understanding how spoken words relate to images and text, or creating systems that search across these different data types.

Not ideal if you are looking for a ready-to-use application for captioning images, or for plain speech-to-text transcription that does not need cross-modal integration.

multimodal-AI speech-processing computer-vision natural-language-processing AI-research
Stale (6m) · No Package · No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 10 / 25

Stars: 119
Forks: 8
Language: Python
License: BSD-3-Clause
Last pushed: Nov 25, 2022
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/atosystem/SpeechCLIP"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
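For scripted use, the same endpoint can be fetched in Python with only the standard library. This is a sketch under one stated assumption: the field names `score` and `stars` are guesses at the response schema, which the API does not document here — inspect `data` to see what it actually returns.

```python
import json
import urllib.request

# Same endpoint as the curl example above.
url = "https://pt-edge.onrender.com/api/v1/quality/voice-ai/atosystem/SpeechCLIP"

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)  # parse the JSON response body

# Assumed field names; adjust after inspecting the real response.
print(data.get("score"), data.get("stars"))
```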