microsoft/XPretrain
Multi-modality pre-training
This project provides pre-trained models that help AI developers build systems capable of jointly understanding and generating video-and-language or image-and-language content. The models are pre-trained on large datasets of diverse videos or images paired with textual descriptions, and can then be fine-tuned for specialized downstream tasks. It is aimed at AI researchers and machine learning engineers building multi-modal applications.
510 stars. No commits in the last 6 months.
Use this if you are an AI developer looking to leverage state-of-the-art multi-modal pre-trained models for tasks that combine video and text, or image and text.
Not ideal if you are looking for a ready-to-use application, or a general content-creation tool that does not require deep machine learning expertise.
Stars: 510
Forks: 36
Language: Python
License: —
Category:
Last pushed: May 08, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/microsoft/XPretrain"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
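The same endpoint can also be called programmatically. A minimal Python sketch, assuming only the URL pattern shown in the curl command above (the response schema is not documented here, so the record is handled as opaque JSON):

```python
import json
import urllib.request

API_BASE = "https://pt-edge.onrender.com/api/v1/quality/nlp"

def quality_url(owner: str, repo: str) -> str:
    """Build the quality-endpoint URL for a given GitHub repository."""
    return f"{API_BASE}/{owner}/{repo}"

def fetch_quality(owner: str, repo: str) -> dict:
    """Fetch the quality record and parse it as JSON (schema unspecified)."""
    with urllib.request.urlopen(quality_url(owner, repo)) as resp:
        return json.load(resp)

# Example (makes a live request, counted against the 100 requests/day limit):
# print(json.dumps(fetch_quality("microsoft", "XPretrain"), indent=2))
```

The live call is left commented out so the snippet can be read without spending a request; uncomment it to fetch the record for this repository.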
Higher-rated alternatives
TheShadow29/awesome-grounding
awesome grounding: A curated list of research papers in visual grounding
TheShadow29/zsgnet-pytorch
Official implementation of ICCV19 oral paper Zero-Shot grounding of Objects from Natural...
TheShadow29/VidSitu
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
zeyofu/BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can...
gicheonkang/sglkt-visdial
🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with Sparse Graph...