bytedance/Sa2VA
Official Repo For Pixel-LLM Codebase
This tool helps creative professionals and analysts understand and interact with the content of images and videos. You provide an image or video, along with a natural language instruction or question, and it can identify and highlight specific objects (like 'the girl in the yellow dress') or provide a description of the scene. This is useful for anyone needing to precisely locate elements or extract detailed information from visual media.
Use this if you need to precisely segment objects within images or videos based on descriptive text, or if you want to ask questions about visual content and receive detailed, grounded answers.
Not ideal if your primary need is general image classification, simple object detection, or basic video summarization without dense, interactive understanding.
Stars: 1,558
Forks: 114
Language: Python
License: Apache-2.0
Category:
Last pushed: Feb 27, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/bytedance/Sa2VA"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
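The same request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `quality_url` and `fetch_quality` are hypothetical, and it assumes the endpoint returns a JSON body, which the listing does not document.

```python
import json
import urllib.request

# Base URL taken from the curl example; everything else here is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/transformers"


def quality_url(owner: str, repo: str) -> str:
    """Build the per-repository endpoint URL in the same shape as the curl example."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for one repository.

    Assumption: the response body is JSON. The response schema is not
    documented in the listing, so the parsed value is returned unchanged.
    """
    request = urllib.request.Request(
        quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("bytedance", "Sa2VA")` sends one request, which counts against the keyless daily limit, so caching the result locally is a reasonable choice when polling several repositories.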
Related models
jncraton/languagemodels: Explore large language models in 512MB of RAM
microsoft/unilm: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
haizelabs/verdict: Inference-time scaling for LLMs-as-a-judge.
albertan017/LLM4Decompile: Reverse Engineering: Decompiling Binary Code with Large Language Models
Cardinal-Operations/ORLM: ORLM: Training Large Language Models for Optimization Modeling