microsoft/XPretrain
Multi-modality pre-training
This project provides pre-trained models that help AI developers build systems capable of jointly understanding and generating video-and-language or image-and-language content. The models are pre-trained on large datasets of diverse videos or images paired with textual descriptions, and can then be fine-tuned for specialized downstream tasks. It is aimed at AI researchers and machine learning engineers building multi-modal applications.
510 stars. No commits in the last 6 months.
Use this if you are an AI developer looking to leverage state-of-the-art multi-modal pre-trained models for tasks that combine video and text, or image and text.
Not ideal if you are looking for a ready-to-use application, or a general content-creation tool that does not require deep machine learning expertise.
Stars: 510
Forks: 36
Language: Python
License: —
Category:
Last pushed: May 08, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/microsoft/XPretrain"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
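The same endpoint can also be called programmatically. A minimal Python sketch, assuming only the URL pattern shown in the curl command above (the response schema is not documented here, so the record is handled as opaque JSON):

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/nlp"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-endpoint URL for a given GitHub repository."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record and parse it as JSON (schema unspecified)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

# Example (makes a live request, counted against the 100 requests/day limit):
# print(json.dumps(fetch_quality("microsoft", "XPretrain"), indent=2))
```

The live call is left commented out so the snippet can be read without spending a request; uncomment it to fetch the record for this repository.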
Higher-rated alternatives
TheShadow29/awesome-grounding
awesome grounding: A curated list of research papers in visual grounding
TheShadow29/zsgnet-pytorch
Official implementation of ICCV19 oral paper Zero-Shot grounding of Objects from Natural...
TheShadow29/VidSitu
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
zeyofu/BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can...
gicheonkang/sglkt-visdial
🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with Sparse Graph...