zinengtang/TVLT
PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022 Oral)
This project helps researchers and developers create AI models that understand video content by analyzing both visual and audio signals simultaneously, without needing transcribed text. It takes raw video and audio inputs and produces a unified representation of the content, which can then be used for tasks like identifying emotions or sentiment in videos. This tool is for AI researchers and machine learning engineers who are building advanced multimodal understanding systems.
126 stars. No commits in the last 6 months.
Use this if you are building AI models that need to understand videos and their accompanying sounds, especially in situations where text transcripts or speech recognition aren't available or suitable.
Not ideal if your primary data source is text-based or if you only need to analyze visual information without considering the audio component.
Stars: 126
Forks: 12
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Feb 24, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/zinengtang/TVLT"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
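The endpoint above can also be queried from code. A minimal sketch in Python, using only the standard library and the URL shown in the curl command; the JSON field names returned by the API are not documented on this card, so the sketch just decodes and prints the raw response body:

```python
# Sketch: fetch repo quality data from the pt-edge API endpoint shown above.
# The response schema is not documented here, so we only decode the raw JSON.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository endpoint URL."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """GET the endpoint and decode the JSON body (100 requests/day keyless)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(fetch_quality("zinengtang", "TVLT"))
```

How an API key is supplied (header vs. query parameter) is not specified on this card, so the sketch omits it and stays within the keyless rate limit.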
Higher-rated alternatives
dorarad/gansformer
Generative Adversarial Transformers
j-min/VL-T5
PyTorch code for "Unifying Vision-and-Language Tasks via Text Generation" (ICML 2021)
invictus717/MetaTransformer
Meta-Transformer for Unified Multimodal Learning
rkansal47/MPGAN
The message passing GAN https://arxiv.org/abs/2106.11535 and generative adversarial particle...
Yachay-AI/byt5-geotagging
Confidence- and ByT5-based geotagging model that predicts coordinates from text alone.