seart-group/DL4SE

Building Training Datasets for Deep Learning Models in Software Engineering and Empirical Software Engineering Research

/ 100

Emerging

The SEART Data Hub helps software engineering researchers and practitioners create extensive datasets from GitHub source code. It takes raw code from repositories and processes it to identify specific elements like test code or boilerplate, outputting structured datasets suitable for empirical studies or training deep learning models for software development tasks. This tool is designed for academics and industry researchers focused on improving software engineering through data-driven approaches.

No commits in the last 6 months.

Use this if you need to build large-scale, specialized datasets from public GitHub repositories for software engineering research or to train AI models for coding tasks.

Not ideal if you are looking for a general-purpose code analysis tool or if your data sources are not GitHub repositories.

empirical-software-engineering software-research dataset-generation code-analysis deep-learning-for-code

Stale 6m · No Package · No Dependents

Maintenance 0 / 25

Adoption 7 / 25

Maturity 16 / 25

Community 13 / 25

Language: Java

License: MIT

Higher-rated alternatives

open-edge-platform/datumaro

Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage...

explosion/ml-datasets

🌊 Machine learning dataset loaders for testing and example scripts

webdataset/webdataset

A high-performance Python-based I/O system for large (and small) deep learning problems, with...

tensorflow/datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...

mlcommons/croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
