ruohaoguo/ovavss
Official Implementation of "Open-Vocabulary Audio-Visual Semantic Segmentation" [ACM MM 2024 Oral].
This project helps video editors, content creators, and media analysts identify and categorize every sound-producing object in a video, including object categories the model was never trained on. You provide a video, and it outputs a segmented video in which each sounding object (such as a barking dog, a car engine, or a person speaking) is highlighted and labeled. This is useful for anyone who needs to isolate or understand the audio and visual components of complex video scenes.
No commits in the last 6 months.
Use this if you need to identify and segment all sounding objects in videos, including object categories the model was not explicitly trained on.
Not ideal if your primary goal is simple object detection or if you only need to segment a pre-defined, small set of known objects.
Stars: 35
Forks: 2
Language: Python
License: —
Category:
Last pushed: Nov 02, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/computer-vision/ruohaoguo/ovavss"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
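For scripted use, the same request as the curl command above can be sketched in Python using only the standard library. The helper names `build_quality_url` and `fetch_quality` are hypothetical. The `category/owner/repo` path layout is inferred from the single example URL, and the JSON response format is an assumption, since the page does not document it.

```python
import json
import urllib.request

# Base URL copied from the curl example on this page.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL (hypothetical helper).

    Assumption: the path is <category>/<owner>/<repo>, inferred from the
    one example URL shown on this page.
    """
    return f"{API_BASE}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Send the same GET request as the curl command (hypothetical helper).

    Assumption: the response body is JSON; the page does not state the format.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return json.loads(body)


if __name__ == "__main__":
    # Prints the URL only; call fetch_quality(...) to hit the network.
    print(build_quality_url("computer-vision", "ruohaoguo", "ovavss"))
```

Calling `fetch_quality("computer-vision", "ruohaoguo", "ovavss")` issues one request, which counts against the 100 requests/day keyless limit, so a script that polls many repositories may want to cache the results.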
Higher-rated alternatives
BR-IDL/PaddleViT
:robot: PaddleViT: State-of-the-art Visual Transformer and MLP Models for PaddlePaddle 2.0+
pathak22/unsupervised-video
[CVPR 2017] Unsupervised deep learning using unlabelled videos on the web
IBM/CrossViT
Official implementation of CrossViT. https://arxiv.org/abs/2103.14899
NVlabs/GCVit
[ICML 2023] Official PyTorch implementation of Global Context Vision Transformers
ViTAE-Transformer/ViTDet
Unofficial implementation for [ECCV'22] "Exploring Plain Vision Transformer Backbones for Object...