kyegomez/MM1
PyTorch Implementation of the paper "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training"
This project provides a foundational PyTorch implementation for exploring how large language models can understand and generate content based on both text and images. It takes an image and a sequence of text as input, processes them through a multimodal architecture, and outputs a refined set of tokens for further text generation or analysis. This is primarily for researchers and AI practitioners who are building or experimenting with advanced AI models that interpret and respond to visual and textual information.
Use this if you are an AI researcher or developer focusing on multimodal AI architectures and want to experiment with the core mechanisms of integrating image and text data into a unified model.
Not ideal if you are looking for a ready-to-use application or a fully trained model for immediate deployment in a specific business context.
Stars
26
Forks
1
Language
Python
License
MIT
Category
transformers
Last pushed
Mar 09, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/kyegomez/MM1"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
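The same endpoint can also be queried from Python. A minimal sketch using only the standard library; the URL pattern is taken directly from the curl example above, and the helper names (`quality_url`, `fetch_quality`) are illustrative, not part of any official client:

```python
import json
import urllib.request

# Base URL as shown in the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the quality-API URL for a repository (pattern from the curl example)."""
    return f"{BASE}/{category}/{owner}/{repo}"

def fetch_quality(category: str, owner: str, repo: str) -> dict:
    """Fetch and decode the JSON response (subject to the 100 requests/day limit)."""
    with urllib.request.urlopen(quality_url(category, owner, repo), timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(quality_url("transformers", "kyegomez", "MM1"))
```

`fetch_quality` assumes the endpoint returns JSON; the response schema is not documented here, so inspect the decoded dict before relying on specific fields.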
Higher-rated alternatives
kyegomez/RT-X
Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
lyuchenyang/Macaw-LLM
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle