unum-cloud/UForm
Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
UForm quickly understands and generates content from mixed multilingual text and images, with video support on the way. You feed it text, images, or both; it returns concise descriptions, answers to questions about images, or embeddings that power search and classification. It suits marketers, content strategists, and anyone who needs to analyze and create multimedia content efficiently.
1,221 stars. Available on PyPI.
Use this if you need to rapidly process and generate insights from diverse content formats like images and text, especially across multiple languages, or want to build smart search features into your applications.
Not ideal if your primary need is extremely deep, nuanced analysis of a single modality (e.g., complex legal text analysis or high-fidelity image editing) rather than multimodal understanding.
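The "numerical representations that help with search" mentioned above are embedding vectors: a model like UForm maps each text or image to a vector, and search ranks candidates by cosine similarity to the query vector. A minimal generic sketch (plain Python, not UForm's actual API; the toy 3-D vectors stand in for real model output):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, docs):
    """Indices of docs ordered from most to least similar to the query."""
    return sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)

# Toy "embeddings" standing in for vectors a multimodal encoder would produce.
docs = [[1.0, 0.0, 0.0],
        [0.7, 0.7, 0.0],
        [0.0, 1.0, 0.0]]
query = [1.0, 0.1, 0.0]
print(rank_by_similarity(query, docs))  # most similar first: [0, 1, 2]
```

With a real model, `docs` would hold image or text embeddings and `query` the embedding of a search phrase; the ranking logic stays the same.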
Stars: 1,221
Forks: 76
Language: Python
License: Apache-2.0
Last pushed: Oct 30, 2025
Commits (30d): 0
Dependencies: 4
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/unum-cloud/UForm"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
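The same endpoint can be called from Python. A minimal sketch using only the standard library (the URL pattern is taken from the curl command above; the JSON response fields are not documented here, so the result is returned as-is):

```python
import json
import urllib.request

BASE = "https://pt-edge.onrender.com/api/v1/quality/embeddings"

def repo_quality_url(owner: str, repo: str) -> str:
    """Build the API URL for a given GitHub owner/repo pair."""
    return f"{BASE}/{owner}/{repo}"

def fetch_repo_quality(owner: str, repo: str) -> dict:
    """GET the repo's quality data as parsed JSON.

    The anonymous tier allows 100 requests/day; pass an API key per the
    service's docs if you need the higher limit.
    """
    with urllib.request.urlopen(repo_quality_url(owner, repo)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(repo_quality_url("unum-cloud", "UForm"))
```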
Related tools
rom1504/clip-retrieval
Easily compute clip embeddings and build a clip retrieval system with them
mazzzystar/Queryable
Run OpenAI's CLIP and Apple's MobileCLIP model on iOS to search photos.
s-emanuilov/litepali
LitePali is a minimal, efficient implementation of ColPali for image retrieval and indexing,...
slavabarkov/tidy
Offline semantic Text-to-Image and Image-to-Image search on Android powered by quantized...
cloudera/CML_AMP_Image_Analysis
Build a semantic search application with deep learning models.