LLM Domain Datasets LLM Tools
Datasets, benchmarks, and evaluation tools for domain-specific LLM applications (geoscience, entity matching, information extraction, etc.). Does NOT include general-purpose LLM datasets, training frameworks, or model architecture code.
37 LLM domain dataset tools are tracked. 2 score above 50 (the Established tier). The highest-rated is monarch-initiative/ontogpt at 66/100 with 811 stars. 2 of the top 10 are actively maintained.
Get the projects as JSON (37 are tracked; the request below passes limit=20):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-domain-datasets&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
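The same request can also be made from a script. The following is a minimal sketch using only the Python standard library; the helper names `build_url` and `fetch_projects` are hypothetical, and the shape of the JSON response is not documented here, so the code decodes the body without assuming any particular fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="llm-tools", subcategory="llm-domain-datasets", limit=20):
    """Build the request URL with the same query parameters as the curl call."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and decode the body as JSON.

    The response structure is an unknown here, so the decoded object
    is returned as-is for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a network request, so it is left commented out):
# data = fetch_projects(build_url())
# print(json.dumps(data, indent=2)[:2000])
```

Each call counts against the daily request limit mentioned above, so it is worth saving the decoded result locally rather than refetching it.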
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | monarch-initiative/ontogpt | LLM-based ontological extraction tools, including SPIRES | 66/100 | Established |
| 2 | weAIDB/awesome-data-llm | Official Repository of "LLM × DATA" Survey Paper | | Established |
| 3 | AXYZdong/AMchat | AM (Advanced Mathematics) Chat is a large language model that integrates... | | Emerging |
| 4 | skywalker023/sodaverse | 🥤🧑🏻🚀Code and dataset for our EMNLP 2023 paper - "SODA: Million-scale... | | Emerging |
| 5 | Y-Research-SBU/TimeSeriesScientist | Official Repository for TimeSeriesScientist | | Emerging |
| 6 | open-chinese/poetry-collection | Chinese "Poetry Collection" (诗歌总集): the most comprehensive and systematic Chinese poetry dataset to date, with unified data modeling. | | Emerging |
| 7 | Jeryi-Sun/LLM-and-Law | This repository is dedicated to summarizing papers related to large language... | | Emerging |
| 8 | SysNetS/SPEC5G | This repository contains the code and data of the paper titled "SPEC5G: A... | | Emerging |
| 9 | davendw49/k2 | Code and datasets for paper "K2: A Foundation Language Model for Geoscience... | | Emerging |
| 10 | sciknoworg/llms4subjects | The official GermEval 2025 Task - LLMs4Subjects - Shared Task Dataset Repository | | Emerging |
| 11 | microsoft/clinical-self-verification | Self-verification for LLMs. | | Emerging |
| 12 | falensiazmi/IndoSafety | A dataset for LLM safety evaluation in Indonesian and major local languages... | | Emerging |
| 13 | night-chen/ToolQA | ToolQA, a new dataset to evaluate the capabilities of LLMs in answering... | | Emerging |
| 14 | SpursGoZmy/Tabular-LLM | This project aims to collect open-source datasets for table intelligence tasks (such as table question answering and table-to-text generation), convert the raw data into instruction-tuning format, and fine-tune LLMs to improve their understanding of tabular data... | | Emerging |
| 15 | jd-coderepos/sota | The official training/validation/test dataset repository for the SOTA? task... | | Emerging |
| 16 | CharlesPikachu/ToolBridge | ToolBridge: An Open-Source Dataset to Equip LLMs with External Tool Capabilities | | Emerging |
| 17 | abcsys/libem-sample-data | Libem sample datasets. | | Emerging |
| 18 | bioepic-data/bervo | BERVO, the Biological and Environmental Research Variable Ontology | | Emerging |
| 19 | dsfsi/edu-assessment-llm-prompt | Educational Assessment using LLMs | | Experimental |
| 20 | zjunlp/Data2Behavior | From Data to Behavior: Predicting Unintended Model Behaviors Before Training | | Experimental |
| 21 | hitz-zentroa/lm-contamination | The LM Contamination Index is a manually created database of contamination... | | Experimental |
| 22 | GS-Uni-Heidelberg/Paper-WhoPlaysWhichRole | 🤖 Phrase-level protagonist detection and role classification in moral discourse. | | Experimental |
| 23 | MIKUAFANS/SciTopic | [IEEE BigData 2025] SciTopic: Enhancing Topic Discovery in Scientific... | | Experimental |
| 24 | willxxy/ECG-Byte | [MLHC 2025] ECG-Byte: A Tokenizer for End-to-End Generative... | | Experimental |
| 25 | NLP-Research-Insights/SciTables | Enhance Table-to-Text by LLMs using scientific tables | | Experimental |
| 26 | Iamsdt/awesome-bengali-ai | A curated collection of resources for Bengali AI, LLMs, Generative AI, and... | | Experimental |
| 27 | RenzeLou/AAAR-1.0 | The source code for running LLMs on the AAAR-1.0 benchmark. | | Experimental |
| 28 | JustinMuecke/GLaMoR | This repository provides a framework for transforming OWL ontologies into a... | | Experimental |
| 29 | mhmoslemi2338/Heterogeneity_EM_Survey | Official implementation of the paper "Heterogeneity in Entity Matching: A... | | Experimental |
| 30 | eugeniusms/textgrad-TextualVerifier | TextualVerifier: Verify Step by Step in TextGrad Automated "Differentiation"... | | Experimental |
| 31 | mehedihasanbijoy/BanglaLLMs | A collection of fine-tuned LLMs for Bangla language processing. | | Experimental |
| 32 | s-m-hashemi/llms4ol-2024-challenge | Data and implementations of the paper "SKH-NLP at LLMs4OL 2024 Task B:... | | Experimental |
| 33 | WangJingyao07/LLM-Papers-with-Code | 🎉🎨 Papers, Code, Datasets for LLM and MLLM | | Experimental |
| 34 | shiqinghuayi19/LLMforEvent | This is the public repository of AAAI 2024 paper "Is a Large Language Model... | | Experimental |
| 35 | sohaamir/ttm | Replicating Cohn et al., (2015) from the 'Talking to Machines' project | | Experimental |
| 36 | davidbroska/MixedSubjects | Replication code and data for the journal article titled "The Mixed Subjects... | | Experimental |
| 37 | Mehreen1103/LLMs4OL-2025 | Official participation in the 2nd LLMs4OL Challenge @ ISWC 2025, Nara,... | | Experimental |