unum-cloud/UForm
Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
UForm quickly understands and generates content from mixed multilingual text and images, with video support on the way. You feed it text, images, or both; it returns concise descriptions, answers to questions about images, or embeddings that power search and classification. It suits marketers, content strategists, and anyone who needs to analyze and create multimedia content efficiently.
1,221 stars. Available on PyPI.
Use this if you need to rapidly process and generate insights from diverse content formats like images and text, especially across multiple languages, or want to build smart search features into your applications.
Not ideal if your primary need is extremely deep, nuanced analysis of a single modality (e.g., complex legal text analysis or high-fidelity image editing) rather than multimodal understanding.
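The "numerical representations that help with search" mentioned above are embedding vectors: a model like UForm maps each text or image to a vector, and search ranks candidates by cosine similarity to the query vector. A minimal generic sketch (plain Python, not UForm's actual API; the toy 3-D vectors stand in for real model output):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, docs):
    """Indices of docs ordered from most to least similar to the query."""
    return sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)

# Toy "embeddings" standing in for vectors a multimodal encoder would produce.
docs = [[1.0, 0.0, 0.0],
        [0.7, 0.7, 0.0],
        [0.0, 1.0, 0.0]]
query = [1.0, 0.1, 0.0]
print(rank_by_similarity(query, docs))  # most similar first: [0, 1, 2]
```

With a real model, `docs` would hold image or text embeddings and `query` the embedding of a search phrase; the ranking logic stays the same.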
Stars: 1,221
Forks: 76
Language: Python
License: Apache-2.0
Last pushed: Oct 30, 2025
Commits (30d): 0
Dependencies: 4
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/unum-cloud/UForm"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
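The same endpoint can be called from Python. A minimal sketch using only the standard library (the URL pattern is taken from the curl command above; the JSON response fields are not documented here, so the result is returned as-is):

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/embeddings"

def repo_quality_url(owner: str, repo: str) -> str:
    """Build the API URL for a given GitHub owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"

def fetch_repo_quality(owner: str, repo: str) -> dict:
    """GET the repo's quality data as parsed JSON.

    The anonymous tier allows 100 requests/day; pass an API key per the
    service's docs if you need the higher limit.
    """
    with urllib.request.urlopen(repo_quality_url(owner, repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(repo_quality_url("unum-cloud", "UForm"))
```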
Related tools
rom1504/clip-retrieval
Easily compute clip embeddings and build a clip retrieval system with them
mazzzystar/Queryable
Run OpenAI's CLIP and Apple's MobileCLIP model on iOS to search photos.
s-emanuilov/litepali
LitePali is a minimal, efficient implementation of ColPali for image retrieval and indexing,...
slavabarkov/tidy
Offline semantic Text-to-Image and Image-to-Image search on Android powered by quantized...
cloudera/CML_AMP_Image_Analysis
Build a semantic search application with deep learning models.