ntkhoa95/multimodal-for-vision
Vision Framework: A modular multi-agent system for computer vision tasks, featuring natural language queries, intelligent task routing, and specialized agents for classification, detection, and more. Built with PyTorch and modern deep learning models.
This framework helps you automatically analyze images and videos by simply asking questions in natural language. You can input an image or video and ask "What's in this image?" or "Detect objects in this scene" to get detailed classifications, identified objects with bounding boxes, or descriptive captions. It's designed for anyone needing quick visual insights without manual tagging, such as content moderators, quality control inspectors, or security analysts.
No commits in the last 6 months.
Use this if you need to rapidly classify, detect objects in, or generate descriptions for large collections of images or video footage using plain English prompts.
Not ideal if you require highly specialized vision tasks beyond classification, detection, or captioning, or if you need to train custom models from scratch for unique visual data.
Stars: 7
Forks: 1
Language: Python
License: MIT
Category:
Last pushed: Nov 07, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/ntkhoa95/multimodal-for-vision"
Open to everyone: 100 requests/day with no key. A free key raises the limit to 1,000/day.
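The endpoint above returns JSON. A minimal Python sketch for fetching and reading a record follows; note that the field names (`stars`, `last_pushed`) are assumptions for illustration, not a documented response schema:

```python
import json
import urllib.request

# Endpoint from the curl example above; no API key needed
# for up to 100 requests/day.
API_URL = ("https://pt-edge.onrender.com/api/v1/quality/"
           "ml-frameworks/ntkhoa95/multimodal-for-vision")

def fetch_quality(url: str = API_URL) -> dict:
    """Fetch a repository's quality record as a parsed JSON dict."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def summarize(record: dict) -> str:
    """Build a one-line summary from a record.

    'stars' and 'last_pushed' are hypothetical field names used
    here for illustration only.
    """
    stars = record.get("stars", "?")
    pushed = record.get("last_pushed", "?")
    return f"{stars} stars, last pushed {pushed}"
```

With a key, the usual pattern would be to pass it in a request header via `urllib.request.Request`; check the API's own documentation for the exact header name.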
Higher-rated alternatives
open-mmlab/mmpretrain
OpenMMLab Pre-training Toolbox and Benchmark
facebookresearch/mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis
Papers, code and datasets about deep learning and multi-modal learning for video analysis
KaiyangZhou/pytorch-vsumm-reinforce
Unsupervised video summarization with deep reinforcement learning (AAAI'18)
adambielski/siamese-triplet
Siamese and triplet networks with online pair/triplet mining in PyTorch