ruohaoguo/ovavss

Official Implementation of "Open-Vocabulary Audio-Visual Semantic Segmentation" [ACM MM 2024 Oral].

Experimental

This project helps video editors, content creators, and media analysts identify and categorize every sound-producing object in a video, including object categories the model was never trained on. You provide a video, and it outputs a segmented video in which each sounding object (such as a barking dog, a car engine, or a person speaking) is highlighted and labeled. This is useful for anyone who needs to isolate or understand the auditory and visual components of complex video scenes.

No commits in the last 6 months.

Use this if you need to identify and segment all sounding objects in videos, including categories the model was not explicitly trained on.

Not ideal if your primary goal is simple object detection or if you only need to segment a pre-defined, small set of known objects.

video-analysis content-moderation media-production scene-understanding sound-source-localization

No License, Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 7 / 25

Maturity 8 / 25

Community 6 / 25

Language: Python

License: none

Higher-rated alternatives

BR-IDL/PaddleViT

:robot: PaddleViT: State-of-the-art Visual Transformer and MLP Models for PaddlePaddle 2.0+

pathak22/unsupervised-video

[CVPR 2017] Unsupervised deep learning using unlabelled videos on the web

IBM/CrossViT

Official implementation of CrossViT. https://arxiv.org/abs/2103.14899

NVlabs/GCVit

[ICML 2023] Official PyTorch implementation of Global Context Vision Transformers

ViTAE-Transformer/ViTDet

Unofficial implementation for [ECCV'22] "Exploring Plain Vision Transformer Backbones for Object...
