zinengtang/TVLT

PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022 Oral)

Quality score: 39 / 100 (Emerging)

This project helps researchers and developers create AI models that understand video content by analyzing both visual and audio signals simultaneously, without needing transcribed text. It takes raw video and audio inputs and produces a unified representation of the content, which can then be used for tasks like identifying emotions or sentiment in videos. This tool is for AI researchers and machine learning engineers who are building advanced multimodal understanding systems.

126 stars. No commits in the last 6 months.

Use this if you are building AI models that need to understand videos and their accompanying sounds, especially in situations where text transcripts or speech recognition aren't available or suitable.

Not ideal if your primary data source is text-based or if you only need to analyze visual information without considering the audio component.

Tags: multimodal-AI, video-analysis, audio-analysis, sentiment-analysis, emotion-detection
Flags: Stale (6 months), No Package, No Dependents
Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 13 / 25

These four subscores sum to the overall quality score of 39 / 100.


Stars: 126
Forks: 12
Language: Jupyter Notebook
License: MIT
Last pushed: Feb 24, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/zinengtang/TVLT"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
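For programmatic access, here is a minimal Python sketch that fetches the same endpoint as the curl command above. The response's JSON schema is not documented in this listing, so the example makes no assumptions about field names and simply pretty-prints whatever the API returns.

import json
import urllib.request

# Same endpoint as the curl example above.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/zinengtang/TVLT"

# Fetch the quality report; no API key is required for up to 100 requests/day.
with urllib.request.urlopen(URL, timeout=10) as resp:
    report = json.load(resp)

# Schema is undocumented here, so inspect the keys before relying on them.
print(json.dumps(report, indent=2))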