src-d/datasets
source{d} datasets ("big code") for source code analysis and machine learning on source code
This repository provides ready-to-use datasets for analyzing source code and applying machine learning techniques directly to codebases. The datasets are built from raw source code, commit history, and development metadata, and are published as structured collections of code, identifiers, commit messages, and labeled examples of code duplicates. It is designed for researchers, data scientists, and engineers working on software engineering analytics or developing AI for code.
343 stars. No commits in the last 6 months.
Use this if you are a researcher or data scientist needing large, pre-processed datasets of source code and related metadata for machine learning on code or software engineering research.
Not ideal if you are a developer looking for an SDK or library to embed code analysis into an application, or if you only need to analyze a single, specific repository.
Stars: 343
Forks: 85
Language: Jupyter Notebook
License: not listed
Category:
Last pushed: Nov 27, 2019
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/src-d/datasets"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
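The same request can be made from a script. The following is a minimal Python sketch, not an official client. It makes two assumptions that the listing does not confirm: that the URL path segments stand for category, owner, and repository (read off the example URL above), and that the response body is JSON. The helper names `build_quality_url` and `fetch_quality` are hypothetical. How an API key is passed is not documented here, so the sketch only covers the keyless tier.

```python
import json
import urllib.request
from urllib.parse import quote

# Base URL copied from the curl command above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL.

    Assumption: the three path segments are category/owner/repo,
    inferred from the single example URL in the listing.
    """
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return API_BASE + "/" + "/".join(parts)


def fetch_quality(url: str, timeout: float = 10.0):
    """Send a GET request and parse the body.

    Assumption: the API returns JSON; this raises ValueError otherwise.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality(build_quality_url("ml-frameworks", "src-d", "datasets"))` issues the same request as the curl command. With the keyless limit of 100 requests/day, it is worth caching the result rather than refetching on every run.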
Higher-rated alternatives
open-edge-platform/datumaro
Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage...
explosion/ml-datasets
🌊 Machine learning dataset loaders for testing and example scripts
webdataset/webdataset
A high-performance Python-based I/O system for large (and small) deep learning problems, with...
tensorflow/datasets
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
mlcommons/croissant
Croissant is a high-level format for machine learning datasets that brings together four rich layers.