kalininalab/DataSAIL
DataSAIL is a tool to split datasets while reducing information leakage.
When preparing data for machine learning, DataSAIL helps you divide your dataset into training, validation, and test sets in a way that limits how much information from one set 'leaks' into another, for example through near-duplicate or closely related samples landing on both sides of a split. This keeps your model's performance estimates realistic rather than overly optimistic. It takes your raw data, such as biological sequences or general experimental observations, and outputs partitioned subsets for robust model evaluation. Scientists, data analysts, and machine learning practitioners who need to build reliable predictive models will find this useful.
Available on PyPI.
Use this if you need to split datasets for machine learning, especially with biological or chemical data, and want to ensure that your model's performance isn't artificially inflated by similarities between your training and test data.
Not ideal if your primary goal is simple random sampling for data splitting, or if you are not concerned about inter-sample similarities causing information leakage.
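To make the leakage concern concrete, here is a minimal sketch of a similarity-aware split using only the Python standard library. It is not DataSAIL's algorithm or API: the helper names (`cluster_by_similarity`, `split_clusters`), the `difflib` similarity measure, the 0.8 threshold, and the toy sequences are all assumptions chosen for illustration. The idea it shows is simply that similar samples are grouped first, and whole groups are then assigned to one side of the split.

```python
from difflib import SequenceMatcher
from itertools import combinations


def cluster_by_similarity(samples, threshold=0.8):
    """Single-linkage clustering: samples whose similarity ratio is at
    least `threshold` end up in the same cluster (hypothetical helper)."""
    parent = list(range(len(samples)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(samples)), 2):
        if SequenceMatcher(None, samples[i], samples[j]).ratio() >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(samples)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_clusters(clusters, test_fraction=0.25):
    """Assign whole clusters to train or test, largest first, adding a
    cluster to test only while the test set stays within its target size."""
    total = sum(len(c) for c in clusters)
    target = test_fraction * total
    train, test = [], []
    for cluster in sorted(clusters, key=len, reverse=True):
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)


# Toy data (made up for this sketch): two near-duplicate pairs plus
# four unrelated sequences.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHFSRA",  # one residue away from the first
    "GSHMLEDPVDAFKQWLENGGT",
    "GSHMLEDPVDAFKQWLENGGS",  # one residue away from the third
    "PPLVRRGDSTTWQ",
    "AAAAAAAAAAAAAAAAAAAA",
    "CCWCCHCCYCC",
    "QNQNQNQNQNQNQNQ",
]

clusters = cluster_by_similarity(sequences, threshold=0.8)
train_idx, test_idx = split_clusters(clusters, test_fraction=0.25)
```

A plain random split of these eight sequences could easily place one member of a near-duplicate pair in training and the other in test, which is the kind of overlap that inflates performance estimates. In the sketch, each pair stays together on one side.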
Stars: 48
Forks: 4
Language: Python
License: MIT
Category:
Last pushed: Mar 12, 2026
Commits (30d): 0
Dependencies: 16
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/data-engineering/kalininalab/DataSAIL"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
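The same request can be made from Python. The sketch below mirrors the curl example using only the standard library. The function names are hypothetical, and the assumption that the path follows a `<category>/<owner>/<repo>` pattern is inferred from the single URL shown above, not from any documentation. The response format is not described here, so the body is returned as raw text without parsing.

```python
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL in the same shape as the curl example.
    The category/owner/repo layout is an assumption based on that one URL."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """Send a keyless GET request and return the response body as text.
    No parsing is attempted because the schema is not documented here."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


url = quality_url("data-engineering", "kalininalab", "DataSAIL")
```

Calling `fetch_quality("data-engineering", "kalininalab", "DataSAIL")` performs the actual request and counts against the daily limit.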