src-d/datasets

source{d} datasets ("big code") for source code analysis and machine learning on source code

/ 100

Emerging

This repository provides ready-to-use datasets for analyzing source code and applying machine learning to codebases. The datasets are built from raw source code, commit history, and development metadata, and are published as structured collections of code, identifiers, commit messages, and labeled examples of code duplicates. It is designed for researchers, data scientists, and engineers working on software engineering analytics or developing AI for code.

343 stars. No commits in the last 6 months.

Use this if you are a researcher or data scientist needing large, pre-processed datasets of source code and related metadata for machine learning on code or software engineering research.

Not ideal if you are a developer looking for an SDK or library to embed code analysis into an application, or if you only need to analyze a single, specific repository.

software-engineering-research code-analytics machine-learning-on-code static-code-analysis developer-tooling

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 23 / 25

Stars

343

Forks

Language

Jupyter Notebook

License

—

Higher-rated alternatives

open-edge-platform/datumaro

Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage...

explosion/ml-datasets

🌊 Machine learning dataset loaders for testing and example scripts

webdataset/webdataset

A high-performance Python-based I/O system for large (and small) deep learning problems, with...

tensorflow/datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...

mlcommons/croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
