linjieli222/HERO
Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
This project helps AI researchers train and evaluate models that jointly understand video content and its accompanying dialogue or text. It takes videos with their subtitles or text queries as input and produces trained models for tasks such as retrieving specific video moments from a text query, answering questions about video content, and generating captions. It is aimed at researchers working on advanced video-and-language understanding.
236 stars. No commits in the last 6 months.
Use this if you are an AI researcher looking to fine-tune a pre-trained model for tasks involving video understanding with accompanying text or dialogue, such as video question answering or moment retrieval.
Not ideal if you are an end-user without a technical background in deep learning, or if you need a plug-and-play solution without model training or specific hardware.
Stars
236
Forks
35
Language
Python
License
MIT
Category
Last pushed
Sep 16, 2021
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/linjieli222/HERO"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000/day.
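The same endpoint can be called from Python with the standard library. This is a minimal sketch assuming the endpoint returns JSON; the response schema is not documented here, so the example simply pretty-prints whatever comes back. The `quality_url` helper is illustrative, not part of the API.

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    # Build the endpoint URL shown in the curl example above.
    return f"{BASE}/{category}/{owner}/{repo}"

def fetch_quality(category: str, owner: str, repo: str) -> dict:
    # Assumes the endpoint returns a JSON object; schema unknown.
    with urllib.request.urlopen(quality_url(category, owner, repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    data = fetch_quality("transformers", "linjieli222", "HERO")
    print(json.dumps(data, indent=2))
```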
Higher-rated alternatives
NVlabs/MambaVision
[CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone
sign-language-translator/sign-language-translator
Python library & framework to build custom translators for the hearing-impaired and translate...
kyegomez/Jamba
PyTorch Implementation of Jamba: "Jamba: A Hybrid Transformer-Mamba Language Model"
autonomousvision/transfuser
[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving;...
kyegomez/MultiModalMamba
A novel implementation of fusing ViT with Mamba into a fast, agile, and high performance...