AviSoori1x/seemore

From scratch implementation of a vision language model in pure PyTorch

Quality score: 42 / 100 (Emerging)

This is a detailed, from-scratch implementation of a vision language model (VLM) in PyTorch. It takes an image and a text prompt as input and generates text conditioned on both, the same multimodal setup used by modern image-understanding models. It is designed for machine learning researchers, students, and practitioners who want to deeply understand how these multimodal models work by building one from its fundamental components.
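The core idea the description refers to can be sketched in a few lines of PyTorch: encode the image into patch embeddings, project them into the text embedding space, prepend them to the token embeddings, and decode with a transformer. This is a minimal illustrative sketch, not seemore's actual architecture; all layer sizes and module choices here are assumptions.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Minimal VLM sketch: patch-embed an image, project it into the
    text embedding space, prepend it to token embeddings, and decode.
    Hyperparameters are illustrative, not taken from seemore."""
    def __init__(self, vocab_size=256, d_model=64, patch=4):
        super().__init__()
        # Vision encoder: a single conv acts as a patch embedder
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        # Projection from vision features into the text embedding space
        self.proj = nn.Linear(d_model, d_model)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, tokens):
        # image: (B, 3, H, W); tokens: (B, T)
        v = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, P, d)
        v = self.proj(v)
        t = self.tok_embed(tokens)                              # (B, T, d)
        x = torch.cat([v, t], dim=1)   # image patches prefix the text sequence
        return self.lm_head(self.decoder(x))                    # (B, P+T, vocab)

model = TinyVLM()
img = torch.randn(1, 3, 16, 16)            # 16x16 image -> 16 patches of 4x4
toks = torch.randint(0, 256, (1, 5))       # 5 prompt tokens
logits = model(img, toks)
print(tuple(logits.shape))                 # (1, 21, 256): 16 patches + 5 tokens
```

In a real VLM the patch embedder would be a pretrained vision encoder (e.g. a ViT or CLIP image tower) and the decoder would be causal, but the image-as-prefix structure is the same.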

255 stars. No commits in the last 6 months.

Use this if you are a machine learning researcher or student who wants to learn the foundational principles of vision language models by examining a complete, transparent, and hackable implementation.

Not ideal if you are looking for an off-the-shelf, production-ready vision language model for immediate application, as this project prioritizes educational value and readability over performance.

deep-learning-research ai-model-development multimodal-ai natural-language-processing computer-vision
Stale (6 months) · No package · No dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 16 / 25


Stars: 255
Forks: 31
Language: Jupyter Notebook
License: MIT
Last pushed: May 06, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/AviSoori1x/seemore"

Open to everyone: 100 requests/day with no key; a free key raises the limit to 1,000/day.