src-d/datasets

source{d} datasets ("big code") for source code analysis and machine learning on source code

49 / 100 (Emerging)

This repository provides ready-to-use datasets for analyzing source code and applying machine learning to codebases. The datasets are built from raw source code, commit history, and development metadata, and are published as structured collections of code, identifiers, commit messages, and labeled examples of code duplicates. It is aimed at researchers, data scientists, and engineers working on software engineering analytics or AI for code.

343 stars. No commits in the last 6 months.

Use this if you are a researcher or data scientist needing large, pre-processed datasets of source code and related metadata for machine learning on code or software engineering research.

Not ideal if you are a developer looking for an SDK or library to embed code analysis into an application, or if you only need to analyze a single, specific repository.

software-engineering-research code-analytics machine-learning-on-code static-code-analysis developer-tooling
Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 23 / 25
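
The four category scores above add up to the overall 49 / 100. The page does not state the formula, so treating the total as a plain sum of four categories capped at 25 each is an assumption that happens to match these numbers. A minimal check:

```python
# Category scores as listed on this page, each out of 25.
subscores = {
    "Maintenance": 0,
    "Adoption": 10,
    "Maturity": 16,
    "Community": 23,
}

MAX_PER_CATEGORY = 25  # taken from the "/ 25" shown next to each category

# Assumption: the overall score is the plain sum of the category scores.
total = sum(subscores.values())
maximum = MAX_PER_CATEGORY * len(subscores)

print(f"{total} / {maximum}")  # prints "49 / 100"
```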

Stars: 343
Forks: 85
Language: Jupyter Notebook
License: (not listed)
Last pushed: Nov 27, 2019
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/src-d/datasets"

Open to everyone: 100 requests/day with no key. A free key raises the limit to 1,000/day.
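
The same request can be made from a script. The sketch below uses only the Python standard library and the URL from the curl command above. Two things are assumptions, not documented behavior: that other repositories follow the same `category/owner/repo` path pattern, and that the response body is JSON (the code falls back to raw text if it is not). The helper names `build_url` and `fetch_quality` are hypothetical, and how an API key is passed is not shown on this page, so keyed requests are left out.

```python
import json
import urllib.request

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL.

    Assumption: the path is always <category>/<owner>/<repo>, generalized
    from the single example URL shown on this page.
    """
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality data for one repository without an API key.

    Returns parsed JSON if the body is valid JSON (an assumption about the
    response format), otherwise the raw response text.
    """
    url = build_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body


if __name__ == "__main__":
    # Same request as the curl command above.
    print(fetch_quality("ml-frameworks", "src-d", "datasets"))
```

With the keyless limit of 100 requests/day, it is worth caching the result locally instead of calling the endpoint repeatedly.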