J0nasW/science-datalake
Unified data lake of 293M scientific papers from 8 scholarly sources + 13 ontologies (960 GB Parquet, queryable via DuckDB)
This project provides a comprehensive database of scientific papers, including full text, citations, and specialized metadata such as retraction notices and funding links. It combines eight major scholarly sources with thirteen scientific ontologies, which makes it easier to analyze scientific trends or conduct literature reviews. Researchers, academic data scientists, and developers building AI applications for science can use it to query across 293 million publications.
Use this if you need a unified and queryable collection of scientific literature, complete with rich metadata and ontologies, for large-scale analysis or AI model training.
Not ideal if you only need to search for a few papers or if you prefer using web-based search engines for literature discovery.
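Since the listing describes the data as Parquet files queryable via DuckDB, a minimal sketch of such a query might look like the following. The file path is hypothetical (the repository's actual directory layout and column schema are not described here), and the helper names build_count_query and count_rows are illustrative only. The sketch relies on DuckDB's read_parquet table function, which accepts glob patterns.

```python
def build_count_query(parquet_glob: str) -> str:
    """Build a DuckDB SQL statement that counts rows across Parquet files.

    parquet_glob is a hypothetical path pattern; adjust it to wherever
    the data lake files actually live on disk.
    """
    # Double any single quotes so the path is a valid SQL string literal.
    escaped = parquet_glob.replace("'", "''")
    return f"SELECT COUNT(*) AS n_rows FROM read_parquet('{escaped}')"


def count_rows(parquet_glob: str) -> int:
    """Run the count query in an in-memory DuckDB session."""
    import duckdb  # third-party package: pip install duckdb

    con = duckdb.connect()  # no argument means an in-memory database
    try:
        return con.execute(build_count_query(parquet_glob)).fetchone()[0]
    finally:
        con.close()


# Example call (assumed path, not taken from the repository):
# count_rows("science-datalake/papers/*.parquet")
```

DuckDB scans Parquet files in place, so a query like this does not require loading the full 960 GB into memory first.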
Stars: 8
Forks: —
Language: Jupyter Notebook
License: —
Category:
Last pushed: Mar 12, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/J0nasW/science-datalake"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000 requests/day.
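The same request can be made from Python with only the standard library. This is a sketch under assumptions: the URL pattern (category, then owner/repository) is inferred from the single curl example above, the response format is not documented here so the body is returned as plain text, and the function names are illustrative.

```python
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL for a category and an "owner/name" repository.

    The path layout is an assumption inferred from one example URL.
    """
    # Keep the "/" between owner and repository name unescaped.
    return f"{BASE_URL}/{quote(category, safe='')}/{quote(repo, safe='/')}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint and return the raw response body as text."""
    url = build_quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (performs a network request and counts against the daily limit):
# print(fetch_quality("ml-frameworks", "J0nasW/science-datalake"))
```

With the unauthenticated limit of 100 requests/day, it is worth caching responses locally rather than calling the endpoint repeatedly for the same repository.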
Higher-rated alternatives
huggingface/datasets
🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient...
allenporter/home-assistant-datasets
This package is a collection of datasets for evaluating AI Models in the context of Home Assistant.
little1d/SpectrumLab
A pioneering unified platform designed to systematize and accelerate deep learning research in...