LLM Domain Datasets LLM Tools

Datasets, benchmarks, and evaluation tools for domain-specific LLM applications (geoscience, entity matching, information extraction, etc.). Does NOT include general-purpose LLM datasets, training frameworks, or model architecture code.

The directory tracks 37 LLM domain dataset tools. Two score above 50 (Established tier). The highest-rated is monarch-initiative/ontogpt at 66/100 with 811 stars. Two of the top 10 are actively maintained.

Get all 37 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-domain-datasets&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
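The same request can be made from a script. The following is a minimal sketch in Python using only the standard library; it rebuilds the URL from the curl command above and parses the response as JSON. The helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the returned JSON is not documented here, so the sketch returns it unmodified. Note that the URL as given carries `limit=20`; whether the API accepts a larger value to return all 37 projects is an assumption that would need checking.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command in the listing.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="llm-tools", subcategory="llm-domain-datasets", limit=20):
    """Build the query URL; the defaults mirror the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10):
    """Fetch the URL and decode the body as JSON (response schema is not assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Prints the URL only; call fetch_projects(build_url()) to hit the network.
    print(build_url())
```

Keeping URL construction separate from the network call makes it easy to stay within the 100 requests/day allowance while experimenting with parameters.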

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | monarch-initiative/ontogpt | LLM-based ontological extraction tools, including SPIRES | 66 | Established |
| 2 | weAIDB/awesome-data-llm | Official Repository of "LLM × DATA" Survey Paper | 52 | Established |
| 3 | AXYZdong/AMchat | AM (Advanced Mathematics) Chat is a large language model that integrates... | 47 | Emerging |
| 4 | skywalker023/sodaverse | 🥤🧑🏻‍🚀Code and dataset for our EMNLP 2023 paper - "SODA: Million-scale... | 47 | Emerging |
| 5 | Y-Research-SBU/TimeSeriesScientist | Official Repository for TimeSeriesScientist | 45 | Emerging |
| 6 | open-chinese/poetry-collection | Chinese "Complete Poetry Collection": the most comprehensive and systematic Chinese poetry dataset to date, with unified data modeling. | 44 | Emerging |
| 7 | Jeryi-Sun/LLM-and-Law | This repository is dedicated to summarizing papers related to large language... | 43 | Emerging |
| 8 | SysNetS/SPEC5G | This repository contains the code and data of the paper titled "SPEC5G: A... | 40 | Emerging |
| 9 | davendw49/k2 | Code and datasets for paper "K2: A Foundation Language Model for Geoscience... | 39 | Emerging |
| 10 | sciknoworg/llms4subjects | The official GermEval 2025 Task - LLMs4Subjects - Shared Task Dataset Repository | 38 | Emerging |
| 11 | microsoft/clinical-self-verification | Self-verification for LLMs. | 38 | Emerging |
| 12 | falensiazmi/IndoSafety | A dataset for LLM safety evaluation in Indonesian and major local languages... | 36 | Emerging |
| 13 | night-chen/ToolQA | ToolQA, a new dataset to evaluate the capabilities of LLMs in answering... | 36 | Emerging |
| 14 | SpursGoZmy/Tabular-LLM | This project aims to collect open-source datasets for table intelligence tasks (such as table question answering and table-to-text generation), convert the raw data into instruction-tuning format, and fine-tune LLMs to strengthen their understanding of tabular data... | 32 | Emerging |
| 15 | jd-coderepos/sota | The official training/validation/test dataset repository for the SOTA? task... | 32 | Emerging |
| 16 | CharlesPikachu/ToolBridge | ToolBridge: An Open-Source Dataset to Equip LLMs with External Tool Capabilities | 32 | Emerging |
| 17 | abcsys/libem-sample-data | Libem sample datasets. | 31 | Emerging |
| 18 | bioepic-data/bervo | BERVO, the Biological and Environmental Research Variable Ontology | 31 | Emerging |
| 19 | dsfsi/edu-assessment-llm-prompt | Educational Assessment using LLMs | 29 | Experimental |
| 20 | zjunlp/Data2Behavior | From Data to Behavior: Predicting Unintended Model Behaviors Before Training | 26 | Experimental |
| 21 | hitz-zentroa/lm-contamination | The LM Contamination Index is a manually created database of contamination... | 24 | Experimental |
| 22 | GS-Uni-Heidelberg/Paper-WhoPlaysWhichRole | 🤖 Phrase-level protagonist detection and role classification in moral discourse. | 23 | Experimental |
| 23 | MIKUAFANS/SciTopic | [IEEE BigData 2025] SciTopic: Enhancing Topic Discovery in Scientific... | 23 | Experimental |
| 24 | willxxy/ECG-Byte | [MLHC 2025] ECG-Byte: A Tokenizer for End-to-End Generative... | 23 | Experimental |
| 25 | NLP-Research-Insights/SciTables | Enhance Table-to-Text by LLMs using scientific tables | 22 | Experimental |
| 26 | Iamsdt/awesome-bengali-ai | A curated collection of resources for Bengali AI, LLMs, Generative AI, and... | 22 | Experimental |
| 27 | RenzeLou/AAAR-1.0 | The source code for running LLMs on the AAAR-1.0 benchmark. | 22 | Experimental |
| 28 | JustinMuecke/GLaMoR | This repository provides a framework for transforming OWL ontologies into a... | 19 | Experimental |
| 29 | mhmoslemi2338/Heterogeneity_EM_Survey | Official implementation of the paper "Heterogeneity in Entity Matching: A... | 19 | Experimental |
| 30 | eugeniusms/textgrad-TextualVerifier | TextualVerifier: Verify Step by Step in TextGrad Automated "Differentiation"... | 19 | Experimental |
| 31 | mehedihasanbijoy/BanglaLLMs | A collection of fine-tuned LLMs for Bangla language processing. | 17 | Experimental |
| 32 | s-m-hashemi/llms4ol-2024-challenge | Data and implementations of the paper "SKH-NLP at LLMs4OL 2024 Task B:... | 17 | Experimental |
| 33 | WangJingyao07/LLM-Papers-with-Code | 🎉🎨 Papers, Code, Datasets for LLM and MLLM | 14 | Experimental |
| 34 | shiqinghuayi19/LLMforEvent | This is the public repository of AAAI 2024 paper "Is a Large Language Model... | 13 | Experimental |
| 35 | sohaamir/ttm | Replicating Cohn et al., (2015) from the 'Talking to Machines' project | 12 | Experimental |
| 36 | davidbroska/MixedSubjects | Replication code and data for the journal article titled "The Mixed Subjects... | 11 | Experimental |
| 37 | Mehreen1103/LLMs4OL-2025 | Official participation in the 2nd LLMs4OL Challenge @ ISWC 2025, Nara,... | 11 | Experimental |
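
The scores listed above can be tallied by tier with a few lines of Python. This is a minimal sketch, not the site's scoring logic: the `SCORES` list is copied from the listing in rank order, and `tier_for` is a hypothetical helper. Only the "above 50 is Established" rule is stated on the page; the cutoff of 30 between Emerging and Experimental is an assumption inferred from the listing, where the lowest Emerging score is 31 and the highest Experimental score is 29.

```python
from collections import Counter

# Scores copied from the listing, in rank order (37 entries).
SCORES = [
    66, 52, 47, 47, 45, 44, 43, 40, 39, 38,
    38, 36, 36, 32, 32, 32, 31, 31, 29, 26,
    24, 23, 23, 23, 22, 22, 22, 19, 19, 19,
    17, 17, 14, 13, 12, 11, 11,
]


def tier_for(score):
    """Map a score to a tier name.

    'Above 50' for Established comes from the page; the 30 cutoff is an
    assumption inferred from the listed scores, not a documented rule.
    """
    if score > 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# Count how many listed projects fall into each tier.
tier_counts = Counter(tier_for(score) for score in SCORES)

if __name__ == "__main__":
    for tier, count in tier_counts.items():
        print(f"{tier}: {count}")
```

Under these assumptions the tally reproduces the tier labels shown for each project, including the two Established entries mentioned in the summary.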