kalininalab/DataSAIL

DataSAIL is a tool to split datasets while reducing information leakage.

Score: 52 / 100 (Established)

When preparing data for machine learning, DataSAIL helps you divide your dataset into training, validation, and test sets in a way that prevents information from one set 'leaking' into another. This ensures your model's performance estimates are realistic and not overly optimistic. It takes your raw data, like biological sequences or general experimental observations, and outputs carefully partitioned subsets for robust model evaluation. Scientists, data analysts, and machine learning practitioners who need to build reliable predictive models will find this useful.

Available on PyPI.

Use this if you need to split datasets for machine learning, especially with biological or chemical data, and want to ensure that your model's performance isn't artificially inflated by similarities between your training and test data.

Not ideal if your primary goal is simple random sampling for data splitting, or if you are not concerned about inter-sample similarities causing information leakage.
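The difference between a random split and a leakage-aware one can be illustrated with a minimal sketch. It assumes each sample already carries a similarity-cluster label, and it assigns whole clusters to either the training or the test set. This is not DataSAIL's algorithm or API. The function `group_aware_split` and the toy cluster labels are hypothetical and only show the idea of keeping similar samples on the same side of the split.

```python
import random
from collections import defaultdict


def group_aware_split(sample_to_group, test_fraction=0.2, seed=0):
    """Assign whole groups to train or test so that no group spans both."""
    groups = defaultdict(list)
    for sample, group in sample_to_group.items():
        groups[group].append(sample)

    # Shuffle group IDs reproducibly; sorting first makes the order deterministic.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    target = test_fraction * len(sample_to_group)
    train, test = [], []
    for gid in group_ids:
        # Fill the test set group by group until it reaches the target size.
        if len(test) < target:
            test.extend(groups[gid])
        else:
            train.extend(groups[gid])
    return train, test


# Hypothetical toy data: sequence IDs mapped to a precomputed similarity cluster.
clusters = {
    "seq1": "A", "seq2": "A", "seq3": "B", "seq4": "B",
    "seq5": "C", "seq6": "D", "seq7": "D", "seq8": "E",
}
train, test = group_aware_split(clusters, test_fraction=0.25)
train_groups = {clusters[s] for s in train}
test_groups = {clusters[s] for s in test}
print(sorted(train), sorted(test))
```

Because groups are moved as a unit, the test set may end up slightly larger than the requested fraction. A plain random split would hit the fraction exactly but could place near-duplicate samples on both sides.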

Tags: bioinformatics, drug-discovery, materials-science, cheminformatics, predictive-modeling
Maintenance 10 / 25
Adoption 8 / 25
Maturity 25 / 25
Community 9 / 25


Stars: 48
Forks: 4
Language: Python
License: MIT
Last pushed: Mar 12, 2026
Commits (30d): 0
Dependencies: 16

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/data-engineering/kalininalab/DataSAIL"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
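The same request can be made from Python with the standard library. The sketch below assumes the URL path follows a category/owner/repository layout, which is inferred from the single example URL above and is not confirmed. The helper names `quality_url` and `fetch_quality` are hypothetical. The response format is not documented here, so the body is returned as plain text rather than parsed.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL (path layout assumed from the curl example)."""
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return f"{BASE_URL}/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10):
    """Return the raw response body as text; no response schema is assumed."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


url = quality_url("data-engineering", "kalininalab", "DataSAIL")
print(url)
# Uncomment to perform the request (it counts against the daily limit):
# print(fetch_quality("data-engineering", "kalininalab", "DataSAIL"))
```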