H-EmbodVis/MERGE
[NeurIPS 2025] More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models
This project helps computer vision practitioners and 3D artists generate realistic images and estimate depth with a single unified model. You input a text prompt or an existing image, and it outputs a new image along with a detailed depth map showing how far away each object is. This is useful for anyone creating 3D scenes, visual effects, or analyzing spatial relationships in images.
Use this if you need to generate high-quality images from text descriptions and simultaneously obtain precise depth information for 3D modeling or scene understanding.
Not ideal if your primary goal is basic image generation without any need for depth estimation, or if you require an extremely lightweight solution for real-time applications.
Stars: 215
Forks: 18
Language: Python
License: Apache-2.0
Category:
Last pushed: Oct 31, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/diffusion/H-EmbodVis/MERGE"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
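The endpoint above can also be called programmatically. Below is a minimal Python sketch that builds the request URL; note that the `category/owner/repo` path structure is inferred from the single curl example, and the assumption that the endpoint returns JSON is unverified, so the fetch is left commented out.

```python
# Minimal sketch for calling the quality API shown above.
# Assumptions: the URL follows a category/owner/repo pattern and returns JSON.
import json
from urllib.request import urlopen  # used only by the commented fetch below

API_BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (pattern taken from the curl example)."""
    return f"{API_BASE}/{category}/{owner}/{repo}"

url = quality_url("diffusion", "H-EmbodVis", "MERGE")
# Anonymous access allows 100 requests/day; a free key raises this to 1,000/day.
# Uncomment to fetch (assumes a JSON response body):
# data = json.loads(urlopen(url).read().decode())
```

Using the helper keeps the base URL in one place if you query several repositories in a loop.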
Higher-rated alternatives
UCSC-VLAA/story-iter
[ICLR 2026] A Training-free Iterative Framework for Long Story Visualization
PaddlePaddle/PaddleMIX
Paddle Multimodal Integration and eXploration, supporting mainstream multi-modal tasks,...
keivalya/mini-vla
a minimal, beginner-friendly VLA to show how robot policies can fuse images, text, and states to...
adobe-research/custom-diffusion
Custom Diffusion: Multi-Concept Customization of Text-to-Image Diffusion (CVPR 2023)
byliutao/1Prompt1Story
🔥ICLR 2025 (Spotlight) One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation...