facebookresearch/mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
This framework helps AI researchers quickly set up projects that combine visual inputs (images or video) with text (captions or questions). Given datasets that pair images with related text, it produces trained models that can reason over and generate insights from the combined data. It is aimed at researchers and machine learning engineers working on cutting-edge multimodal AI problems.
5,622 stars. Actively maintained with 3 commits in the last 30 days.
Use this if you are an AI researcher starting a new project that involves analyzing or generating content from both images and text, and you need a robust, scalable foundation.
Not ideal if you are a practitioner looking for a ready-to-use application or a developer working on a non-AI project.
Stars: 5,622
Forks: 944
Language: Python
License: —
Category:
Last pushed: Jan 12, 2026
Commits (30d): 3
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/facebookresearch/mmf"
Open to everyone: 100 requests/day with no key required. Get a free key for 1,000/day.
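The same endpoint can be called from Python. A minimal sketch, assuming only the URL pattern shown in the curl command above; the JSON field names in the commented parsing step (e.g. "stars") are assumptions about the response shape, not documented API output:

```python
# Sketch of fetching repository quality data from the API above.
import json
from urllib.request import urlopen

BASE = "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks"

def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository quality endpoint URL."""
    return f"{BASE}/{owner}/{repo}"

url = quality_url("facebookresearch", "mmf")
# data = json.load(urlopen(url))  # 100 requests/day without a key
# print(data.get("stars"))        # field name is an assumption
```

Swap in your own owner/repo pair to query other tracked frameworks.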
Related frameworks
open-mmlab/mmpretrain
OpenMMLab Pre-training Toolbox and Benchmark
HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis
Papers, code and datasets about deep learning and multi-modal learning for video analysis
KaiyangZhou/pytorch-vsumm-reinforce
Unsupervised video summarization with deep reinforcement learning (AAAI'18)
adambielski/siamese-triplet
Siamese and triplet networks with online pair/triplet mining in PyTorch
kuanghuei/SCAN
PyTorch source code for "Stacked Cross Attention for Image-Text Matching" (ECCV 2018)