linjieli222/HERO
Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
This project helps AI researchers train and evaluate models that jointly understand video content and its accompanying dialogue or text. It takes videos with their subtitles or text queries as input and produces trained models for tasks such as retrieving specific video moments from a text query, answering questions about video content, and generating captions. It is aimed at researchers working on advanced video-and-language understanding.
236 stars. No commits in the last 6 months.
Use this if you are an AI researcher looking to fine-tune a pre-trained model for tasks involving video understanding with accompanying text or dialogue, such as video question answering or moment retrieval.
Not ideal if you are an end-user without a technical background in deep learning, or if you need a plug-and-play solution without model training or specific hardware.
Stars
236
Forks
35
Language
Python
License
MIT
Category
Last pushed
Sep 16, 2021
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/linjieli222/HERO"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
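The same endpoint can be called from Python with the standard library. This is a minimal sketch assuming the endpoint returns JSON; the response schema is not documented here, so the example simply pretty-prints whatever comes back. The `quality_url` helper is illustrative, not part of the API.

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    # Build the endpoint URL shown in the curl example above.
    return f"{BASE}/{category}/{owner}/{repo}"

def fetch_quality(category: str, owner: str, repo: str) -> dict:
    # Assumes the endpoint returns a JSON object; schema unknown.
    with urllib.request.urlopen(quality_url(category, owner, repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    data = fetch_quality("transformers", "linjieli222", "HERO")
    print(json.dumps(data, indent=2))
```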
Higher-rated alternatives
NVlabs/MambaVision
[CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone
sign-language-translator/sign-language-translator
Python library & framework to build custom translators for the hearing-impaired and translate...
kyegomez/Jamba
PyTorch Implementation of Jamba: "Jamba: A Hybrid Transformer-Mamba Language Model"
autonomousvision/transfuser
[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving;...
kyegomez/MultiModalMamba
A novel implementation of fusing ViT with Mamba into a fast, agile, and high performance...