NVlabs/Eagle
Eagle: Frontier Vision-Language Models with Data-Centric Strategies
Eagle targets AI/ML engineers and researchers building models that need to understand long videos and high-resolution images. It accepts raw video footage, high-resolution images, multi-page documents, and long text, and reasons over them, so you can build applications that summarize long videos, answer questions about complex visual data, or analyze detailed documents.
Use this if you are a machine learning engineer or researcher developing advanced vision-language AI models that require processing very long videos or extremely detailed high-resolution images.
Not ideal if you are a non-technical user looking for an off-the-shelf application to analyze images or videos without deep technical integration.
Stars: 931
Forks: 49
Language: Python
License: Apache-2.0
Category:
Last pushed: Oct 25, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/NVlabs/Eagle"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
TinyLLaVA/TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
zjunlp/EasyInstruct
[ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs.
rese1f/MovieChat
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
DAMO-NLP-SG/Video-LLaMA
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding