kyegomez/MM1
PyTorch Implementation of the paper "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training"
This project provides a foundational PyTorch implementation for exploring how large language models can understand and generate content based on both text and images. It takes an image and a sequence of text as input, processes them through a multimodal architecture, and outputs a refined set of tokens for further text generation or analysis. This is primarily for researchers and AI practitioners who are building or experimenting with advanced AI models that interpret and respond to visual and textual information.
Use this if you are an AI researcher or developer focusing on multimodal AI architectures and want to experiment with the core mechanisms of integrating image and text data into a unified model.
Not ideal if you are looking for a ready-to-use application or a fully trained model for immediate deployment in a specific business context.
Stars
26
Forks
1
Language
Python
License
MIT
Category
transformers
Last pushed
Mar 09, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/kyegomez/MM1"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
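The same endpoint can also be queried from Python. A minimal sketch using only the standard library; the URL pattern is taken directly from the curl example above, and the helper names (`quality_url`, `fetch_quality`) are illustrative, not part of any official client:

```python
import json
import urllib.request

# Base URL as shown in the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the quality-API URL for a repository (pattern from the curl example)."""
    return f"{BASE}/{category}/{owner}/{repo}"

def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch and decode the JSON response (subject to the 100 requests/day limit)."""
    with urllib.request.urlopen(quality_url(category, owner, repo), timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(quality_url("transformers", "kyegomez", "MM1"))
```

`fetch_quality` assumes the endpoint returns JSON; the response schema is not documented here, so inspect the decoded dict before relying on specific fields.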
Higher-rated alternatives
kyegomez/RT-X
Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
lyuchenyang/Macaw-LLM
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle