Multimodal Visual Grounding NLP Tools

Tools for grounding natural language in visual content (images, video, 3D scenes), including visual question answering, object localization, and cross-modal retrieval. Does NOT include general image captioning, multimodal pretraining without grounding focus, or speech-only cross-modal tasks.

There are 20 multimodal visual grounding tools tracked. The highest-rated is TheShadow29/awesome-grounding at 47/100 with 1,125 stars.

Get all 20 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=multimodal-visual-grounding&limit=20"
```

Open to everyone: 100 requests/day with no key needed. A free API key raises the limit to 1,000 requests/day.
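The same query can be issued from Python. A minimal sketch using only the standard library; the endpoint and query parameters are taken from the curl example above, and nothing else about the API (such as response field names) is assumed:

```python
# Minimal sketch of calling the dataset-quality API from Python using only
# the standard library. Endpoint and parameters come from the curl example
# above; the response schema is not documented here, so none is assumed.
from urllib.parse import urlencode

BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Assemble the query URL for the dataset-quality endpoint."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"

url = build_url("nlp", "multimodal-visual-grounding", limit=20)
print(url)

# To actually fetch (no key needed within the 100 requests/day limit):
# import json
# from urllib.request import urlopen
# with urlopen(url, timeout=10) as resp:
#     projects = json.load(resp)
```

Keeping the fetch commented out avoids a live network dependency; uncomment it to pull the JSON directly.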

| # | Tool | Description | Score | Tier |
|--:|------|-------------|------:|------|
| 1 | TheShadow29/awesome-grounding | awesome grounding: A curated list of research papers in visual grounding | 47 | Emerging |
| 2 | microsoft/XPretrain | Multi-modality pre-training | 41 | Emerging |
| 3 | TheShadow29/zsgnet-pytorch | Official implementation of ICCV19 oral paper Zero-Shot grounding of Objects... | 41 | Emerging |
| 4 | TheShadow29/VidSitu | [CVPR21] Visual Semantic Role Labeling for Video Understanding... | 38 | Emerging |
| 5 | zeyofu/BLINK_Benchmark | This repo contains evaluation code for the paper "BLINK: Multimodal Large... | 37 | Emerging |
| 6 | gicheonkang/sglkt-visdial | 🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with... | 36 | Emerging |
| 7 | qaixerabbas/awesome-multimodal-learning-with-imperfect-data | Multimodal Representation Learning under Imperfect Data Conditions: A Survey | 36 | Emerging |
| 8 | princeton-nlp/XTX | [ICLR 2022 Spotlight] Multi-Stage Episodic Control for Strategic Exploration... | 34 | Emerging |
| 9 | MiuLab/DuaLUG | The implementation of the papers on dual learning of natural language... | 34 | Emerging |
| 10 | SkalskiP/awesome-foundation-and-multimodal-models | 👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper... | 33 | Emerging |
| 11 | fork123aniket/Graph-Neural-Network-based-Visual-Question-Answering | Implementation of GNNs for Visual Question Answering task in PyTorch | 32 | Emerging |
| 12 | 1989Ryan/paragon | [ICRA 2023] Differentiable parsing and visual grounding of natural language... | 30 | Emerging |
| 13 | tim-dickey/multi-modal-neural-network | Multi-modal neural network with double-loop learning that fuses vision and... | 24 | Experimental |
| 14 | aistairc/VDAct | A Video-grounded Dialogue Dataset and Metric for Event-driven Activities | 22 | Experimental |
| 15 | candacelax/grounded-vision-parser | Semantic parser trained by using videos only instead of labeled logical forms | 22 | Experimental |
| 16 | psunlpgroup/MPlanner | ACL2025-Findings paper "Enhance Multimodal Consistency and Coherence for... | 20 | Experimental |
| 17 | nmhongtram/gnn-surgical-understanding | Graph Reasoning for Visual Question Answering in Laparoscopic Scene Understanding | 16 | Experimental |
| 18 | huckiyang/Interspeech23-Tutorial-Para-Efficient-Cross-Modal-Tutorial | Interspeech Tutorial - Resource Efficient and Cross-Modal Learning Toward... | 14 | Experimental |
| 19 | iral-lab/gold | Multimodal grounded language dataset | 13 | Experimental |
| 20 | zoppellarielena/Paper-Presentation-for-Natural-Language-Processing | This presentation, conducted for the "Natural Language Processing" course,... | 11 | Experimental |