Multimodal Visual Grounding NLP Tools

Tools for grounding natural language in visual content (images, video, 3D scenes), including visual question answering, object localization, and cross-modal retrieval. Does NOT include general image captioning, multimodal pretraining without grounding focus, or speech-only cross-modal tasks.

There are 20 multimodal visual grounding tools tracked. The highest-rated is TheShadow29/awesome-grounding at 47/100 with 1,125 stars.

Get all 20 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=multimodal-visual-grounding&limit=20"
```

Open to everyone: 100 requests/day with no key needed. A free API key raises the limit to 1,000 requests/day.
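The same query can be issued from Python. A minimal sketch using only the standard library; the endpoint and query parameters are taken from the curl example above, and nothing else about the API (such as response field names) is assumed:

```python
# Minimal sketch of calling the dataset-quality API from Python using only
# the standard library. Endpoint and parameters come from the curl example
# above; the response schema is not documented here, so none is assumed.
from urllib.parse import urlencode

BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the query URL for the dataset-quality endpoint."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"

url = build_url("nlp", "multimodal-visual-grounding", limit=20)
print(url)

# To actually fetch (no key needed within the 100 requests/day limit):
# import json
# from urllib.request import urlopen
# with urlopen(url, timeout=10) as resp:
#     projects = json.load(resp)
```

Keeping the fetch commented out avoids a live network dependency; uncomment it to pull the JSON directly.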

| # | Tool | Description | Score | Tier |
|--:|------|-------------|------:|------|
| 1 | TheShadow29/awesome-grounding | awesome grounding: A curated list of research papers in visual grounding | 47 | Emerging |
| 2 | microsoft/XPretrain | Multi-modality pre-training | 41 | Emerging |
| 3 | TheShadow29/zsgnet-pytorch | Official implementation of ICCV19 oral paper Zero-Shot grounding of Objects... | 41 | Emerging |
| 4 | TheShadow29/VidSitu | [CVPR21] Visual Semantic Role Labeling for Video Understanding... | 38 | Emerging |
| 5 | zeyofu/BLINK_Benchmark | This repo contains evaluation code for the paper "BLINK: Multimodal Large... | 37 | Emerging |
| 6 | gicheonkang/sglkt-visdial | 🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with... | 36 | Emerging |
| 7 | qaixerabbas/awesome-multimodal-learning-with-imperfect-data | Multimodal Representation Learning under Imperfect Data Conditions: A Survey | 36 | Emerging |
| 8 | princeton-nlp/XTX | [ICLR 2022 Spotlight] Multi-Stage Episodic Control for Strategic Exploration... | 34 | Emerging |
| 9 | MiuLab/DuaLUG | The implementation of the papers on dual learning of natural language... | 34 | Emerging |
| 10 | SkalskiP/awesome-foundation-and-multimodal-models | 👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper... | 33 | Emerging |
| 11 | fork123aniket/Graph-Neural-Network-based-Visual-Question-Answering | Implementation of GNNs for Visual Question Answering task in PyTorch | 32 | Emerging |
| 12 | 1989Ryan/paragon | [ICRA 2023] Differentiable parsing and visual grounding of natural language... | 30 | Emerging |
| 13 | tim-dickey/multi-modal-neural-network | Multi-modal neural network with double-loop learning that fuses vision and... | 24 | Experimental |
| 14 | aistairc/VDAct | A Video-grounded Dialogue Dataset and Metric for Event-driven Activities | 22 | Experimental |
| 15 | candacelax/grounded-vision-parser | Semantic parser trained by using videos only instead of labeled logical forms | 22 | Experimental |
| 16 | psunlpgroup/MPlanner | ACL2025-Findings paper "Enhance Multimodal Consistency and Coherence for... | 20 | Experimental |
| 17 | nmhongtram/gnn-surgical-understanding | Graph Reasoning for Visual Question Answering in Laparoscopic Scene Understanding | 16 | Experimental |
| 18 | huckiyang/Interspeech23-Tutorial-Para-Efficient-Cross-Modal-Tutorial | Interspeech Tutorial - Resource Efficient and Cross-Modal Learning Toward... | 14 | Experimental |
| 19 | iral-lab/gold | Multimodal grounded language dataset | 13 | Experimental |
| 20 | zoppellarielena/Paper-Presentation-for-Natural-Language-Processing | This presentation, conducted for the "Natural Language Processing" course,... | 11 | Experimental |