msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
When working with large datasets for machine learning, you often encounter corrupted files or data that doesn't meet your criteria. This tool helps data scientists and ML engineers efficiently handle these 'bad' samples dynamically without crashing their pipelines. It takes your existing PyTorch datasets and automatically skips or filters out problematic entries, letting you continue training with valid data.
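The skipping behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not nonechucks's implementation or API: `SkippingDataset` and `FlakyDataset` are hypothetical names, and the sketch assumes a map-style dataset exposing `__len__` and `__getitem__`.

```python
class SkippingDataset:
    """Hypothetical wrapper that iterates a dataset and skips bad samples."""

    def __init__(self, dataset):
        self.dataset = dataset
        # Indices already known to be bad, so later passes skip them cheaply.
        self.bad_indices = set()

    def __iter__(self):
        for index in range(len(self.dataset)):
            if index in self.bad_indices:
                continue
            try:
                sample = self.dataset[index]
            except Exception:
                # For example a corrupted file or a failed decode.
                self.bad_indices.add(index)
                continue
            if sample is None:
                # Treat None as "this sample should be dropped".
                self.bad_indices.add(index)
                continue
            yield sample


class FlakyDataset:
    """Toy dataset: index 2 raises an error and index 4 returns None."""

    def __len__(self):
        return 6

    def __getitem__(self, index):
        if index == 2:
            raise IOError("corrupted sample")
        if index == 4:
            return None
        return index * 10


safe = SkippingDataset(FlakyDataset())
valid_samples = list(safe)
```

Iteration continues past the two bad entries instead of raising, and the wrapper records which indices were skipped.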
377 stars. No commits in the last 6 months. Available on PyPI.
Use this if you're a data scientist or ML engineer using PyTorch and you frequently encounter corrupted files or unreadable images, or you need to filter data by content (such as language) during data loading without pre-filtering the entire dataset.
Not ideal if your datasets are perfectly clean and don't require dynamic error handling or content-based filtering during loading.
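Content-based filtering during loading can be pictured as a transform that returns `None` for samples it rejects. The sketch below is an assumption-laden illustration rather than the library's own code: `ascii_only` is a crude, hypothetical stand-in for a real language detector, and `apply_as_filter` is a hypothetical helper.

```python
def ascii_only(text):
    """Hypothetical transform: keep ASCII-only text, reject everything else."""
    # A crude proxy for "is this English?"; a real detector would go here.
    return text if text.isascii() else None


def apply_as_filter(samples, transform):
    """Yield transformed samples, dropping any the transform maps to None."""
    for sample in samples:
        result = transform(sample)
        if result is not None:
            yield result


texts = ["hello world", "こんにちは", "good morning", "доброе утро"]
kept = list(apply_as_filter(texts, ascii_only))
```

The same function both transforms and filters, so no separate pre-filtering pass over the dataset is needed.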
Stars: 377
Forks: 27
Language: Python
License: MIT
Category:
Last pushed: Sep 22, 2022
Commits (30d): 0
Dependencies: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/msamogh/nonechucks"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
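The same request can be made from Python with the standard library. This is a minimal sketch: the response format is not documented here, so the body is returned as raw text, and `build_request` and `fetch_quality_data` are hypothetical helper names.

```python
import urllib.request

API_URL = "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/msamogh/nonechucks"


def build_request(url=API_URL):
    """Build a plain GET request; no API key is attached."""
    return urllib.request.Request(url, method="GET")


def fetch_quality_data(url=API_URL, timeout=10):
    """Return the response body as text (its format is an unknown here)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (uses one of the 100 keyless requests per day):
# print(fetch_quality_data())
```

Because keyless access is capped at 100 requests per day, it is worth caching the result rather than calling this in a loop.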
Higher-rated alternatives
skrub-data/skrub
Machine learning with dataframes
biolab/orange3
🍊 📊 💡 Orange: Interactive data analysis
root-project/root
The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
cleanlab/cleanlab
Cleanlab's open-source library is the standard data-centric AI package for data quality and...
drivendataorg/deon
A command line tool to easily add an ethics checklist to your data science projects.