tristan-mcinnis/Multimodal-voice-assistant

This project is a multi-modal AI voice assistant. It uses LM Studio, the OpenAI API, or Claude Code as its language-model backend, WhisperModel for audio transcription, plus speech recognition, clipboard extraction, and image processing to respond to user prompts.

Quality score: 40 / 100 (Emerging)

This multi-modal voice assistant lets you interact with an AI using your voice and receive responses that incorporate context from your screen, webcam, or clipboard. You speak your request; the AI processes it by listening, looking at screenshots, analyzing webcam input, or searching the web, then speaks a comprehensive answer back to you. It is ideal for anyone who needs to get information or perform tasks quickly using spoken commands and rich visual context, such as a researcher, student, or busy professional.

Use this if you need a hands-free way to interact with an AI that can 'see' what's on your screen or through your webcam, understand your voice, and use web search to provide highly contextual answers.

Not ideal if you prefer purely text-based interactions or are looking for a simple chatbot without any visual or audio input capabilities.

Tags: personal-assistant, information-retrieval, hands-free-computing, contextual-search, productivity-tool

No package · No dependents
Maintenance: 6 / 25
Adoption: 5 / 25
Maturity: 16 / 25
Community: 13 / 25


Stars: 9
Forks: 2
Language: Python
License: MIT
Last pushed: Dec 15, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/tristan-mcinnis/Multimodal-voice-assistant"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
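For programmatic access, the same endpoint can be called from Python. A minimal sketch, assuming only the URL pattern visible in the curl example above (`/api/v1/quality/<category>/<owner>/<repo>`); the JSON fields in the response are not documented here, so the fetch itself is left commented out:

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    # Mirror the path layout from the curl example above.
    return f"{BASE}/{category}/{owner}/{repo}"

url = quality_url("voice-ai", "tristan-mcinnis", "Multimodal-voice-assistant")

# Uncomment to fetch live data (subject to the 100 requests/day limit):
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
```

Anonymous requests share the 100/day quota, so cache responses locally if you poll more than a few repositories.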