bytedance/Sa2VA

Official Repo For Pixel-LLM Codebase

/ 100

Established

This tool helps creative professionals and analysts understand and interact with the content of images and videos. Given an image or video and a natural-language instruction or question, it can segment and highlight specific objects (such as 'the girl in the yellow dress') or describe the scene. This is useful for anyone who needs to locate elements precisely or extract detailed information from visual media.

Use this if you need to precisely segment objects within images or videos based on descriptive text, or if you want to ask questions about visual content and receive detailed, grounded answers.

Not ideal if your primary need is general image classification, simple object detection, or basic video summarization without dense, interactive understanding.

video-analysis image-segmentation content-understanding visual-search media-asset-management

No Package, No Dependents

Maintenance 10 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 18 / 25

Stars 1,558

Forks 114

Language Python

License Apache-2.0

Related models

jncraton/languagemodels

Explore large language models in 512MB of RAM

microsoft/unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

haizelabs/verdict

Inference-time scaling for LLMs-as-a-judge.

albertan017/LLM4Decompile

Reverse Engineering: Decompiling Binary Code with Large Language Models

Cardinal-Operations/ORLM

ORLM: Training Large Language Models for Optimization Modeling
