tristan-mcinnis/Multimodal-voice-assistant
This project is a multimodal AI voice assistant that combines an LLM backend (LM Studio, the OpenAI API, or Claude Code) with WhisperModel-based speech recognition, clipboard extraction, and image processing to respond to user prompts.
This multimodal voice assistant lets you interact with an AI using your voice and get responses that incorporate context from your screen, webcam, or clipboard. You speak your request; the AI processes it by listening, examining screenshots, analyzing webcam input, or searching the web, then speaks a comprehensive answer back to you. It suits anyone who needs to quickly get information or perform tasks through spoken commands and rich visual context, such as a researcher, student, or busy professional.
Use this if you need a hands-free way to interact with an AI that can 'see' what's on your screen or through your webcam, understand your voice, and use web search to provide highly contextual answers.
Not ideal if you prefer purely text-based interactions or are looking for a simple chatbot without any visual or audio input capabilities.
Stars
9
Forks
2
Language
Python
License
MIT
Category
voice-ai
Last pushed
Dec 15, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/voice-ai/tristan-mcinnis/Multimodal-voice-assistant"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
asiff00/On-Device-Speech-to-Speech-Conversational-AI
This is an on-CPU real-time conversational system for two-way speech communication with AI...
VideotronicMaker/LM-Studio-Voice-Conversation
Python app for LM Studio-enhanced voice conversations with local LLMs. Uses Whisper for...
syntithenai/hermod
voice services stack from audio hardware through hotword, ASR, NLU, AI routing and TTS bound by...
bold-ronin/lira
A Voice-First AI Companion
voice-engine/make-a-smart-speaker
A collection of resources to make a smart speaker