zinengtang/TVLT
PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022 Oral)
This project helps researchers and developers create AI models that understand video content by analyzing both visual and audio signals simultaneously, without needing transcribed text. It takes raw video and audio inputs and produces a unified representation of the content, which can then be used for tasks like identifying emotions or sentiment in videos. This tool is for AI researchers and machine learning engineers who are building advanced multimodal understanding systems.
126 stars. No commits in the last 6 months.
Use this if you are building AI models that need to understand videos and their accompanying sounds, especially in situations where text transcripts or speech recognition aren't available or suitable.
Not ideal if your primary data source is text-based or if you only need to analyze visual information without considering the audio component.
Stars: 126
Forks: 12
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Feb 24, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/zinengtang/TVLT"
Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000 requests/day.
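The endpoint above can also be queried from code. A minimal sketch in Python, using only the standard library and the URL shown in the curl command; the JSON field names returned by the API are not documented on this card, so the sketch just decodes and prints the raw response body:

```python
# Sketch: fetch repo quality data from the pt-edge API endpoint shown above.
# The response schema is not documented here, so we only decode the raw JSON.
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/transformers"

def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository endpoint URL."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """GET the endpoint and decode the JSON body (100 requests/day keyless)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(fetch_quality("zinengtang", "TVLT"))
```

How an API key is supplied (header vs. query parameter) is not specified on this card, so the sketch omits it and stays within the keyless rate limit.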
Higher-rated alternatives
dorarad/gansformer
Generative Adversarial Transformers
j-min/VL-T5
PyTorch code for "Unifying Vision-and-Language Tasks via Text Generation" (ICML 2021)
invictus717/MetaTransformer
Meta-Transformer for Unified Multimodal Learning
rkansal47/MPGAN
The message passing GAN https://arxiv.org/abs/2106.11535 and generative adversarial particle...
Yachay-AI/byt5-geotagging
Confidence- and ByT5-based geotagging model that predicts coordinates from text alone.