H-EmbodVis/MERGE
[NeurIPS 2025] More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models
This project helps computer vision practitioners and 3D artists generate realistic images and estimate depth with a single unified model. You input a text prompt or an existing image, and it outputs a new image along with a detailed depth map showing how far away each object is. This is useful for anyone creating 3D scenes, visual effects, or analyzing spatial relationships in images.
Use this if you need to generate high-quality images from text descriptions and simultaneously obtain precise depth information for 3D modeling or scene understanding.
Not ideal if your primary goal is basic image generation without any need for depth estimation, or if you require an extremely lightweight solution for real-time applications.
Stars: 215
Forks: 18
Language: Python
License: Apache-2.0
Category:
Last pushed: Oct 31, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/diffusion/H-EmbodVis/MERGE"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
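The endpoint above can also be called programmatically. Below is a minimal Python sketch that builds the request URL; note that the `category/owner/repo` path structure is inferred from the single curl example, and the assumption that the endpoint returns JSON is unverified, so the fetch is left commented out.

```python
# Minimal sketch for calling the quality API shown above.
# Assumptions: the URL follows a category/owner/repo pattern and returns JSON.
import json
from urllib.request import urlopen  # used only by the commented fetch below

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (pattern taken from the curl example)."""
    return f"{API_BASE}/{category}/{owner}/{repo}"

url = quality_url("diffusion", "H-EmbodVis", "MERGE")
# Anonymous access allows 100 requests/day; a free key raises this to 1,000/day.
# Uncomment to fetch (assumes a JSON response body):
# data = json.loads(urlopen(url).read().decode())
```

Using the helper keeps the base URL in one place if you query several repositories in a loop.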
Higher-rated alternatives
UCSC-VLAA/story-iter
[ICLR 2026] A Training-free Iterative Framework for Long Story Visualization
PaddlePaddle/PaddleMIX
Paddle Multimodal Integration and eXploration, supporting mainstream multi-modal tasks,...
keivalya/mini-vla
a minimal, beginner-friendly VLA to show how robot policies can fuse images, text, and states to...
adobe-research/custom-diffusion
Custom Diffusion: Multi-Concept Customization of Text-to-Image Diffusion (CVPR 2023)
byliutao/1Prompt1Story
🔥ICLR 2025 (Spotlight) One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation...